First, follow the steps described in my tutorial about Sentiment Analysis with Twitter, but stop before the section "The Analyzing". At this point we have our tweets:
tweets = searchTwitter("sustainable development", n=200, cainfo="cacert.pem")
We now have to extract the text from our tweets in order to analyze it. We do this with:
tweets.text = laply(tweets, function(t) t$getText())
Sometimes this text contains invalid characters that will make our analysis crash, so we have to remove them. We can use a function from the Viralheat site to do so:
clean.text <- function(some_txt) {
  some_txt = gsub("&", "", some_txt)
  some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)  # remove retweet headers
  some_txt = gsub("@\\w+", "", some_txt)                        # remove @mentions
  some_txt = gsub("[[:punct:]]", "", some_txt)                  # remove punctuation
  some_txt = gsub("[[:digit:]]", "", some_txt)                  # remove numbers
  some_txt = gsub("http\\w+", "", some_txt)                     # remove links
  some_txt = gsub("[ \t]{2,}", "", some_txt)                    # remove runs of whitespace
  some_txt = gsub("^\\s+|\\s+$", "", some_txt)                  # trim leading/trailing whitespace
  # define "tolower error handling" function
  try.tolower = function(x) {
    y = NA
    try_error = tryCatch(tolower(x), error=function(e) e)
    if (!inherits(try_error, "error"))
      y = tolower(x)
    return(y)
  }
  some_txt = sapply(some_txt, try.tolower)
  some_txt = some_txt[some_txt != ""]
  names(some_txt) = NULL
  return(some_txt)
}
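To see what the cleaning does, you can run it on a made-up tweet (the text here is just an example, not real data):

```r
# a hypothetical tweet with a retweet header, punctuation, and a hashtag
sample_tweet = "RT @someuser: Check this out! #green"
clean.text(sample_tweet)   # → "check this out green"
```

Note that the retweet header and the @mention are stripped entirely, while the hashtag loses only its # sign.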
Now we apply this function to our tweet texts:
clean_text = clean.text(tweets.text)
Next we build a corpus from the cleaned text:
tweet_corpus = Corpus(VectorSource(clean_text))
Then we create a term-document matrix. We remove punctuation and numbers and add our search terms to the stopword list, since they appear in almost every tweet and would otherwise dominate the cloud:
tdm = TermDocumentMatrix(tweet_corpus,
        control = list(removePunctuation = TRUE,
                       stopwords = c("sustainable", "development", stopwords("english")),
                       removeNumbers = TRUE,
                       tolower = TRUE))
But before we can plot anything, we have to install and load the wordcloud package:
install.packages(c("wordcloud", "tm"), repos="http://cran.r-project.org")
library(wordcloud)
library(tm)   # also needed for Corpus() and TermDocumentMatrix() above
require(plyr)
m = as.matrix(tdm)                                        # convert the term-document matrix into a plain matrix
word_freqs = sort(rowSums(m), decreasing=TRUE)            # word frequencies in decreasing order
dm = data.frame(word=names(word_freqs), freq=word_freqs)  # build our data frame
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))  # and visualize it
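If the cloud looks odd, it helps to check the underlying frequencies. Since dm is sorted in decreasing order, the most common words are simply the top of the data frame:

```r
head(dm, 10)   # the ten most frequent words and their counts
```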
OK, here we have our word cloud. If you want to save it to your computer, you can do so with:
png("Cloud.png", width=12, height=8, units="in", res=300)
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
dev.off()