First, follow the steps described in my tutorial about Sentiment Analysis with Twitter, but stop before the section "The Analyzing". At this point we have our tweets:
tweets = searchTwitter("sustainable development", n=200, cainfo="cacert.pem")
We now have to extract the text from our tweets in order to analyze it. We do this with:
tweets.text = laply(tweets, function(t) t$getText())
Sometimes this text contains invalid characters that will make our analysis crash, so we have to remove them. We can use a function from the Viralheat site to do so:
clean.text <- function(some_txt) {
  some_txt = gsub("&", "", some_txt)
  some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)  # remove retweet headers
  some_txt = gsub("@\\w+", "", some_txt)                        # remove @mentions
  some_txt = gsub("[[:punct:]]", "", some_txt)                  # remove punctuation
  some_txt = gsub("[[:digit:]]", "", some_txt)                  # remove numbers
  some_txt = gsub("http\\w+", "", some_txt)                     # remove links
  some_txt = gsub("[ \t]{2,}", "", some_txt)                    # remove runs of whitespace
  some_txt = gsub("^\\s+|\\s+$", "", some_txt)                  # trim leading/trailing whitespace
  # define "tolower error handling" function
  try.tolower = function(x) {
    y = NA
    try_error = tryCatch(tolower(x), error=function(e) e)
    if (!inherits(try_error, "error"))
      y = tolower(x)
    return(y)
  }
  some_txt = sapply(some_txt, try.tolower)
  some_txt = some_txt[some_txt != ""]
  names(some_txt) = NULL
  return(some_txt)
}
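To see what the cleaning does, you can run it on a made-up tweet (the text here is just an example, not real data):

```r
# a hypothetical tweet with a retweet header, punctuation, and a hashtag
sample_tweet = "RT @someuser: Check this out! #green"
clean.text(sample_tweet)   # → "check this out green"
```

Note that the retweet header and the @mention are stripped entirely, while the hashtag loses only its # sign.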
Now we apply this function to our tweet texts:
clean_text = clean.text(tweets.text)
Next we build a corpus from the cleaned text:
tweet_corpus = Corpus(VectorSource(clean_text))
Then we create a term-document matrix. We remove punctuation and numbers and add our search terms to the stopword list, since they appear in almost every tweet and would otherwise dominate the cloud:
tdm = TermDocumentMatrix(tweet_corpus,
        control = list(removePunctuation = TRUE,
                       stopwords = c("sustainable", "development", stopwords("english")),
                       removeNumbers = TRUE,
                       tolower = TRUE))
But before we can plot anything, we have to install and load the wordcloud package:
install.packages(c("wordcloud", "tm"), repos="http://cran.r-project.org")
library(wordcloud)
library(tm)   # also needed for Corpus() and TermDocumentMatrix() above
require(plyr)
m = as.matrix(tdm)                                        # convert the term-document matrix into a plain matrix
word_freqs = sort(rowSums(m), decreasing=TRUE)            # word frequencies in decreasing order
dm = data.frame(word=names(word_freqs), freq=word_freqs)  # build our data frame
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))  # and visualize it
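If the cloud looks odd, it helps to check the underlying frequencies. Since dm is sorted in decreasing order, the most common words are simply the top of the data frame:

```r
head(dm, 10)   # the ten most frequent words and their counts
```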
OK, here we have our word cloud. If you want to save it to your computer, you can do so with:
png("Cloud.png", width=12, height=8, units="in", res=300)
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
dev.off()