Cluster Twitter Data with R and k-means

Hello everybody!

Today I want to show you how to get deeper insights into your Twitter followers with the help of R by clustering Twitter data. Because I just completed the course “Machine Learning” by Prof. Andrew Ng on Coursera, I will use the k-means algorithm and cluster my Twitter followers by how many followers they have and how many people they follow.


Before we can get the Twitter data we have to authorize, as I described here. Before we go on, we make sure that rCharts is installed by loading it with library(rCharts).

If it is not installed, you have to install it from GitHub with devtools.
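A minimal sketch of that check, assuming rCharts is not on CRAN and is installed from the GitHub repository ramnathv/rCharts (the one named in the comments below). The `owner/repo` form of `install_github()` is used here because current devtools versions expect it.

```r
# Install rCharts from GitHub only if it is missing, then load it.
# Assumption: the package lives in the GitHub repository "ramnathv/rCharts".
if (!requireNamespace("rCharts", quietly = TRUE)) {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github("ramnathv/rCharts")
}
library(rCharts)
```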

Get the Twitter Data

We get our data in three steps. First we get the user object, then we get the followers and friends of this user, and finally we merge them into one dataframe containing all these users and their information.
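The three steps could look roughly like the sketch below. It assumes the twitteR package and an already authorized session. `fetch_neighbors()` is a hypothetical helper name, and removing duplicates by `screenName` is my own choice for the merge, not necessarily what the original code did.

```r
# Sketch only: requires the twitteR package and a completed OAuth setup.
# fetch_neighbors() is a hypothetical helper name, not part of twitteR.
fetch_neighbors <- function(screen_name) {
  user      <- twitteR::getUser(screen_name)  # 1. the user object
  followers <- user$getFollowers()            # 2a. accounts following the user
  friends   <- user$getFriends()              # 2b. accounts the user follows

  # 3. convert both lists to data frames and stack them
  neighbors <- rbind(twitteR::twListToDF(followers),
                     twitteR::twListToDF(friends))

  # An account can be both follower and friend; keep it only once
  neighbors[!duplicated(neighbors$screenName), ]
}
```

With an authorized session, a call such as `userNeighbors.df <- fetch_neighbors("JulianHi")` would then return one row per account, including the `followersCount` and `friendsCount` columns used below.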

If you plotted this data now, it wouldn't give you much insight, as most data points would be crowded into the lower-left corner. It would look like this:


[Figure: plot of the data without log transformation]

So what we do is a log transformation and use the log of all the values for our analysis.

To do so we need to replace all 0 values in our dataframe with 1, because log(0) would result in -Inf values inside our dataframe.
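Here is a small sketch of the replacement on a made-up dataframe standing in for the real one. It only touches the two count columns, which is my own restriction: several commenters below report that comparing the whole dataframe against "0" fails on the `created` date column.

```r
# Made-up stand-in for the dataframe of followers and friends
userNeighbors.df <- data.frame(
  screenName     = c("a", "b", "c", "d"),
  followersCount = c(0, 12, 350, 0),
  friendsCount   = c(5, 0, 120, 0),
  created        = as.POSIXct(c("2012-01-01", "2013-05-10",
                                "2011-07-04", "2014-02-20"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Replace zeros with 1, but only in the two count columns,
# so date and text columns are never compared or modified
count_cols <- c("followersCount", "friendsCount")
for (col in count_cols) {
  zero <- userNeighbors.df[[col]] == 0
  userNeighbors.df[[col]][zero] <- 1
}

userNeighbors.df[, count_cols]
```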

Now we can apply the log transformation. To do this, we add the columns logFollowersCount and logFriendsCount to our dataframe, which contain the log() values of the original columns followersCount and friendsCount we received from Twitter.
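A minimal sketch of this step, again on made-up counts in which the zeros have already been replaced by 1:

```r
# Made-up counts; zeros are assumed to have been replaced by 1 already
userNeighbors.df <- data.frame(
  followersCount = c(1, 12, 350, 1),
  friendsCount   = c(5, 1, 120, 1)
)

# Natural log of both count columns, stored next to the originals
userNeighbors.df$logFollowersCount <- log(userNeighbors.df$followersCount)
userNeighbors.df$logFriendsCount   <- log(userNeighbors.df$friendsCount)

userNeighbors.df
```

Base R also offers `log1p(x)`, which computes log(1 + x) and so needs no zero replacement, but it shifts every value rather than only the zeros. The sketch sticks to the replacement approach described in this post.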

Cluster the Twitter data

Now that we have our data we can start clustering it. Before we can use the k-means algorithm we have to decide how many clusters we want to end up with. For this we can use the so-called elbow method. It runs the k-means algorithm with different numbers of clusters and shows how the within-cluster error changes. Based on this we can decide how many clusters we should choose.

First we extract the relevant columns out of our dataframe and create a new one for our algorithm.

Then we can create the elbow chart and you will see why it is actually called elbow chart:
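The two steps, extracting the columns and drawing the elbow chart, can be sketched as follows. The data is randomly generated to stand in for the real followers, and the names `kObject.log`, `max_k` and `wss` are only illustrative.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Made-up stand-in: three loose groups in log space
n <- 60
userNeighbors.df <- data.frame(
  logFollowersCount = c(rnorm(n, 2, 0.5), rnorm(n, 5, 0.5), rnorm(n, 8, 0.5)),
  logFriendsCount   = c(rnorm(n, 3, 0.5), rnorm(n, 6, 0.5), rnorm(n, 4, 0.5))
)

# Keep only the two columns the algorithm should see
kObject.log <- userNeighbors.df[, c("logFriendsCount", "logFollowersCount")]

# Total within-cluster sum of squares for k = 1..10
max_k <- 10
wss <- sapply(1:max_k, function(k) {
  kmeans(kObject.log, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is where adding more clusters stops paying off
plot(1:max_k, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Within-cluster sum of squares")
```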


The best number of clusters is at the “elbow” of the graph. In this case the best number of clusters would be something around 4. So we will try to find 4 clusters in our data.
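Running k-means with 4 centers could look like the sketch below. The data is made up, and `iter.max` and `nstart` are values I picked for illustration, not tuned settings.

```r
set.seed(42)  # reproducible random starts

# Made-up stand-in for the log-transformed columns: four loose groups
n <- 50
kObject.log <- data.frame(
  logFriendsCount   = c(rnorm(n, 2, 0.4), rnorm(n, 6, 0.4),
                        rnorm(n, 2, 0.4), rnorm(n, 6, 0.4)),
  logFollowersCount = c(rnorm(n, 2, 0.4), rnorm(n, 2, 0.4),
                        rnorm(n, 7, 0.4), rnorm(n, 7, 0.4))
)

# Run k-means with 4 centers; nstart repeats the random start
# several times and keeps the best result
fit <- kmeans(kObject.log, centers = 4, iter.max = 50, nstart = 100)

# Store the assignment as a factor so a plot can group and colour by it
kObject.log$cluster <- factor(fit$cluster)

table(kObject.log$cluster)  # size of each cluster
fit$centers                 # cluster centers in log space
```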

Plot the data

Ok now that we have our data we can plot it with the rCharts library.

This has the advantage that we can easily create an interactive graph which gives us additional information, like the actual number of followers and friends, which we can't read directly from the axes after the log transformation.
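A sketch of such a plot, assuming rCharts is installed and using a tiny made-up dataframe. The column names follow the ones used above; the axis labels and the tooltip text are my own choices.

```r
# Made-up stand-in for the clustered dataframe
plot_df <- data.frame(
  screenName     = c("a", "b", "c", "d"),
  followersCount = c(10, 2000, 35, 90000),
  friendsCount   = c(150, 300, 20, 1200),
  cluster        = factor(c(1, 2, 1, 3))
)
plot_df$logFollowersCount <- log(plot_df$followersCount)
plot_df$logFriendsCount   <- log(plot_df$friendsCount)

# JavaScript tooltip that shows the untransformed counts (kept on one line)
tooltip_js <- paste0(
  "#! function(key, x, y, e) { ",
  "return e.point.screenName + ' Followers: ' + e.point.followersCount + ",
  "' Friends: ' + e.point.friendsCount } !#"
)

# Only build the chart when rCharts is available
if (requireNamespace("rCharts", quietly = TRUE)) {
  library(rCharts)
  p <- nPlot(logFollowersCount ~ logFriendsCount, group = "cluster",
             data = plot_df, type = "scatterChart")
  p$xAxis(axisLabel = "Friends (log)")
  p$yAxis(axisLabel = "Followers (log)")
  p$chart(tooltipContent = tooltip_js)
  p  # printing the chart object displays it
}
```

As far as I know, rCharts charts can also be written to an HTML file with their `save()` method, for example `p$save("clusters.html", cdn = TRUE)`; the file name here is only an example.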

You can find an interactive example of such a plot here

[Figure: rCharts plot of the clustered Twitter data]

You can find the code on my GitHub, and if you have any questions feel free to follow me on Twitter or write a comment 🙂



This post was inspired by

Julian Hillebrand

During my time at university, while learning the basics of economics, I started exploring the possibilities and changes caused by digital disruption and the process of digital transformation. I focused on the importance of data and data analytics in combination with marketing and management.
My personal focus of interest lies heavily on technology, digital marketing and data analytics. I made early acquaintance with programming and digital technology and never stop being interested in following the newest innovations.

I am an open, communicative and curious person. I enjoy writing, blogging and speaking about technology.

  • Nice post. Quick question, was there adequate support for students of the course that chose to use R (vs. Octave). I saw this class was enrolling, but decided against enrolling due to my concerns around adequate R support for labs/hw and such.

    • Hey,
      in the lectures there was no support for R. Andrew Ng focused on Octave/Matlab.
      There were some threads in the forum about doing the assignments with R. But this course is more about the theoretical approaches of the algorithms behind all that.
      But I think this course should focus on R programming:


      • Smita

        Hi Julian, nice post indeed! I wanted to know if I can get the number of followers and friends from Facebook. If yes, should I use the same code you gave for Twitter?

        Thanks in advance for your answer!

  • Hi Julian,

    When I reached the command
    user <- getUser("my_user name")

    I got the following error. Any ideas?

    Error in twInterfaceObj$doAPICall(paste("users", "show", sep = "/"), params = params, :
    Error: SSL certificate problem, verify that the CA cert is OK. Details:
    error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

    • Hey George,
      I think you are using windows?
      please try it with:
      user <- getUser("JulianHi",cainfo="cacert.pem")
      and tell me if it worked.


  • Interesting. But I’m not sure why you say 2 centers and then put 4 in the code. Could you tell me if it is a typo?

    • Hey Jessica,
      yeah sorry it was a typo. It should also be 4.
      I just changed it.


  • Jacob

    Hey, great piece! How did you build the interactive version?

    • Hey Jacob,
      I saved it with

      Was it that what you meant?


      • Jacob

        Thanks for that, had no idea it would be so simple!

  • Very Nice! I’ve tried to make something for my twitter account and I have 2 question for you:

    1. The plot that I obtained does not show 4 clusters but has just one color, labelled “Undefined”, in the upper right.
    2. How can you export the plot to an HTML file that is visible in Google Drive? (I would like to embed the diagram in my WordPress blog.)

    Thanks a lot!

    • Hey Matteo,
      thank you!
      1) The “Undefined” error is mostly caused by an incorrectly formatted dataframe. Did you check your code? Did you execute all the mentioned steps?

      2) You can save your graph with:
      Then you just have to modify the links to the sources in your html file if you want to upload it. Take a look at the html code of my graph.

      If you have further questions feel free to ask.


  • NJ

    Hi there. I am getting a rate limit issue. I have tweaked the code as follows: userFollowers <- user$getFollowers(n=10000) since the ID I have chosen has 200k+ followers.

    Any idea how I can add a wait clause to ensure that the rate limit is not reached?

    Thanks and good work with the blog!

    • Hey
      this depends on what kind of rate limit makes problems.
      Could you please post the code you are using? And where
      exactly does the rate limit make problems?


  • Tina

    Hello Julian,

    when I reach
    the following error occurs:
    Error in as.POSIXlt.character(x, tz, …) :
    character string is not in a standard unambiguous format

    Any ideas why or how to solve it?

    Thank you!

    • Hey
      Actually no. Some other users have posted the same problem.
      I will take a look at it and see how to solve it.


    • Tina

      I have the same error… Did you find a solution?

      • Edrit Franquiz

        I had the same error and what I did was to remove the date column, the one called “created”, because from what I see it has issues with transforming dates. When you remove that column, you can then change the 0 into 1 and do the whole procedure.

        • Julian Hillebrand

          Thanks for the hint Edrit. I will take a look at it

  • sanchit shaleen

    Hey guys,

    i am facing one error.
    When I type in the command: userNeighbors.df[userNeighbors.df=="0"]<-1 , I get the following error:

    Error in as.POSIXlt.character(x, tz, …) :
    character string is not in a standard unambiguous format

    Please help me to resolve.


    • Hey
      I will take a look at it. I think in one of the used packages something changed.


  • Lukasz

    hey, I have a problem. after line:
    I have error:
    Error in as.POSIXlt.character(x, tz, …) :
    character string is not in a standard unambiguous format. Do you know what I should do?

    • Hey
      there is a comma before the bracket that shouldn't be there.


  • Alex

    Hi, the first step (library(rCharts)) doesn’t go through. When I tried: install.packages('rCharts'), I got a warning msg: Warning in install.packages :
    package ‘rCharts’ is not available (for R version 3.2.1)

    Then, I checked, and found that the package is not installed. Any idea how to solve it?

    • You should install rCharts with devtools from github with:

      devtools::install_github('ramnathv/rCharts')

      Did this solve your problem?


  • Sam

    Really useful article Julian! Going to be trying it on my Twitter account. Is it possible to get an explanation of the graph, for example what your findings were? It looks like there is some pattern in the graph 🙂


    HI Julian,

    when I reach


    the following error occurs:

    Error in as.POSIXlt.character(x, tz, …) :

    character string is not in a standard unambiguous format

    Any ideas why or how to solve it?

  • Jimmy

    I’m curious — why did you change 0 to 1, rather than adding 1 to all data points? The former changes the relationship between the points, not a desirable change in data transformation.