Hey there!
After my post about sentiment analysis using the Viralheat API I found another service. Datumbox is offering a special sentiment analysis for Twitter. But this API doesn't just offer sentiment analysis, it offers a much more detailed analysis: "The currently supported API functions are: Sentiment Analysis, Twitter Sentiment Analysis, Subjectivity Analysis, Topic Classification, Spam Detection, Adult Content Detection, Readability Assessment, Language Detection, Commercial Detection, Educational Detection, Gender Detection, Keyword Extraction, Text Extraction and Document Similarity."
But note:
Datumbox only offers a dedicated sentiment classifier for tweets. All the other classifiers, like gender or topic, are built for longer texts, not for short tweets with their limited number of characters. So their results for tweets can be inaccurate.
Still, these are very interesting features, so I wanted to test them with R.
Before we start, you should take a look at the authentication tutorial and go through the steps.
The API Key
In the first step you need an API key. So go to the Datumbox website http://www.datumbox.com/ and register yourself. After you have logged in you can see your free API key here: http://www.datumbox.com/apikeys/view/
Ok, let's go on with R.
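If you like, you can store the key in an R variable right away so you don't have to paste it into every call later on. This is just a small convenience sketch; api_key is a name I made up, and you have to replace the placeholder with your own key:

# store your personal Datumbox API key in a variable for later use
api_key <- "YOUR_API_KEY"   # placeholder: replace with the key from the Datumbox apikeys page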
Functions
The getSentiment() function
First import the needed packages for our analysis:
# load packages
library(twitteR)
library(RCurl)
library(RJSONIO)
library(stringr)
getSentiment <- function(text, key){
  text <- URLencode(text)

  # save all the spaces, then get rid of the weird characters that break the API,
  # then convert the spaces back to their URL-encoded form
  text <- str_replace_all(text, "%20", " ")
  text <- str_replace_all(text, "%\\d\\d", "")
  text <- str_replace_all(text, " ", "%20")

  if (str_length(text) > 360){
    text <- substr(text, 0, 359)
  }

  # Twitter sentiment analysis
  data <- getURL(paste("http://api.datumbox.com/1.0/TwitterSentimentAnalysis.json?api_key=", key, "&text=", text, sep=""))
  js <- fromJSON(data, asText=TRUE)
  sentiment = js$output$result

  # subjectivity analysis
  data <- getURL(paste("http://api.datumbox.com/1.0/SubjectivityAnalysis.json?api_key=", key, "&text=", text, sep=""))
  js <- fromJSON(data, asText=TRUE)
  subject = js$output$result

  # topic classification
  data <- getURL(paste("http://api.datumbox.com/1.0/TopicClassification.json?api_key=", key, "&text=", text, sep=""))
  js <- fromJSON(data, asText=TRUE)
  topic = js$output$result

  # gender detection
  data <- getURL(paste("http://api.datumbox.com/1.0/GenderDetection.json?api_key=", key, "&text=", text, sep=""))
  js <- fromJSON(data, asText=TRUE)
  gender = js$output$result

  return(list(sentiment=sentiment, subject=subject, topic=topic, gender=gender))
}
The getSentiment() function handles the queries we send to the API. It collects all the results we are interested in, namely sentiment, subject, topic and gender, and returns them as a list. Every request has the same structure: the API always expects the API key and the text to be analyzed. It then returns a JSON object of the structure
{
    "output": {
        "status": 1,
        "result": "positive"
    }
}
So what we want to have is the “result”. We extract it with js$output$result where js is the saved JSON response.
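To see how this extraction works without calling the API, you can feed fromJSON() a hard-coded example response (a minimal sketch; the JSON string below is made up to match the structure shown above):

# parse an example response and pull out the result field
library(RJSONIO)
example_json <- '{"output": {"status": 1, "result": "positive"}}'
js <- fromJSON(example_json, asText=TRUE)
js$output$result   # "positive"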
The clean.text() function
We need this function to work around problems that occur when tweets contain certain special characters, and to remove artifacts like "@" mentions and "RT".
clean.text <- function(some_txt)
{
  # remove retweet markers and via mentions
  some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
  # remove @usernames
  some_txt = gsub("@\\w+", "", some_txt)
  # remove punctuation and digits
  some_txt = gsub("[[:punct:]]", "", some_txt)
  some_txt = gsub("[[:digit:]]", "", some_txt)
  # remove links
  some_txt = gsub("http\\w+", "", some_txt)
  # remove multiple spaces/tabs and leading/trailing whitespace
  some_txt = gsub("[ \t]{2,}", "", some_txt)
  some_txt = gsub("^\\s+|\\s+$", "", some_txt)

  # define "tolower error handling" function
  try.tolower = function(x)
  {
    y = NA
    try_error = tryCatch(tolower(x), error=function(e) e)
    if (!inherits(try_error, "error"))
      y = tolower(x)
    return(y)
  }

  some_txt = sapply(some_txt, try.tolower)
  some_txt = some_txt[some_txt != ""]
  names(some_txt) = NULL
  return(some_txt)
}
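To get a feeling for what the cleaning does, you can run it on a single made-up tweet (the tweet text and URL below are invented for illustration):

# quick check of clean.text() on one example tweet
clean.text("RT @someuser: I love my new #iPhone!!! http://t.co/abc123")
# should return something like "i love my new iphone"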
Let's start
Ok now we have our functions, all packages and the API key.
In the first step we need the tweets. We get them with the searchTwitter() function as usual.
# harvest tweets
tweets = searchTwitter("iPhone", n=200, lang="en")
In my example I used the keyword "iPhone". Of course you can use whatever you want.
In the next steps we have to extract the text from the tweets and clean it with the clean.text() function. We just call these functions with:
# get text
tweet_txt = sapply(tweets, function(x) x$getText())

# clean text
tweet_clean = clean.text(tweet_txt)
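If you want to check what the cleaning step did, you can look at a few tweets before and after (just an optional sanity check; note that clean.text() drops tweets that end up empty, so the two vectors may differ in length):

# look at the first raw and cleaned tweets
head(tweet_txt, 3)
head(tweet_clean, 3)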
Then we count our tweets and, based on this number, build a data frame that we will fill with the results of our analysis.
# how many tweets
tweet_num = length(tweet_clean)

# data frame (text, sentiment, subject, topic, gender)
tweet_df = data.frame(text=tweet_clean, sentiment=rep("", tweet_num),
                      subject=1:tweet_num, topic=1:tweet_num, gender=1:tweet_num,
                      stringsAsFactors=FALSE)
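A quick look at the structure shows the still empty columns waiting to be filled (an optional check, not required for the analysis):

# inspect the prepared data frame before running the analysis
str(tweet_df)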
Do the analysis
We come to our final step: the analysis. We call getSentiment() with the text of every tweet, wait for the answer and save the results in our data frame. So this can take some time. Just replace API_KEY with your Datumbox API key.
# apply getSentiment() to every tweet and store the results in the data frame
for (i in 1:tweet_num)
{
  tmp = getSentiment(tweet_clean[i], "API_KEY")

  tweet_df$sentiment[i] = tmp$sentiment
  tweet_df$subject[i] = tmp$subject
  tweet_df$topic[i] = tmp$topic
  tweet_df$gender[i] = tmp$gender
}
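Note that getSentiment() fires four API requests per tweet, so depending on your Datumbox plan you may run into rate limits. A small pause between tweets can help; this is just a sketch, and the one-second delay is an arbitrary value I picked, not an official Datumbox limit:

# same loop with a short pause after every tweet to stay below possible rate limits
for (i in 1:tweet_num)
{
  tmp = getSentiment(tweet_clean[i], "API_KEY")

  tweet_df$sentiment[i] = tmp$sentiment
  tweet_df$subject[i] = tmp$subject
  tweet_df$topic[i] = tmp$topic
  tweet_df$gender[i] = tmp$gender

  Sys.sleep(1)   # wait one second before the next tweet
}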
That's it! We saved all the results in our data frame and can take a look at our analysis.
text | sentiment | subject | topic | gender |
shit your phone man wtf all ur memories and its a freaking iphone is it in the schl or with ur teacher | negative | subjective | Arts | male |
fuck iphone i want the s then o | negative | subjective | Home & Domestic Life | female |
stay home saturday night vscocam iphone picarts bored saturday stay postive reoverlay | negative | objective | Sports | female |
why i love the mornings sunrise pic iphone now lets get crossfit wod goingcompass fitness | positive | subjective | Home & Domestic Life | female |
iphone or stick with my bbhelp | positive | subjective | Home & Domestic Life | female |
You can just display your data frame in R with:
tweet_df |
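If you just want a quick overview of how the sentiments are distributed, you can tabulate the sentiment column (a small extra step, not part of the workflow above):

# count how many tweets fall into each sentiment class
table(tweet_df$sentiment)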
Or you can save it to a CSV file with:
write.table(tweet_df, file="Analysis.csv", sep=",", row.names=FALSE)
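Later you can load the saved file back into R, for example with read.table() (just a usage note; the file name is the one used above):

# read the saved analysis back into R
tweet_df2 <- read.table("Analysis.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)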