Build your own Twitter Archive and Analyzing Infrastructure with MongoDB, Java and R [Part 1] [Update]

UPDATE: The Java program is now also available in a version that uses the Streaming API. You can find it on my GitHub account.

Hey everybody,
you surely know the problems that come up when you want to work with the Twitter API. Twitter has put a lot of restrictions in place that take much of the fun out of the data mining process.
Another problem is that Twitter provides no way to analyze your data at a later time. You can't just start a Twitter search that gives you all the tweets ever written about your topic, and you can't get all the tweets related to a particular event if there are a lot of them. So I always dreamed of my own archive filled with Twitter data. And then I saw MongoDB. MongoDB, whose name comes from “humongous”, is an open-source document database and the leading NoSQL database.
This document-oriented structure makes it very easy to use, especially for our purpose, because everything revolves around the JSON format. That is of course the format we get directly from Twitter, so we don't need to process our tweets; we can just save them into our database.

Structure

So let's take a closer look at our structure.

[Image: structure]

As you can see, we need several steps. First we need to get the Twitter data and store it in the database, and then we need to find a way to get this data into R and start analyzing.

In this first tutorial I will show you how to set up the first part. We set up MongoDB locally on your computer and write a Java crawler that gets the data directly from the Twitter API and stores it.

MongoDB

Installing MongoDB is as easy as using it. You just have to go to the MongoDB website and select the right precompiled files for your operating system.
http://www.mongodb.org/downloads
After downloading, unpack the folder. And that's it.
Then go to the folder and take a look into the “bin” subfolder. Here you can see the different executables. For our purposes we need the mongod and the mongo files.
mongod is the Mongo daemon. It is basically the server, and we need to start it every time we want to work with the database.
But we can make it even easier:
Just download the IntelliJ IDEA Java IDE http://www.jetbrains.com/idea/download/index.html
This cool and lightweight IDE has a nice third-party MongoDB plugin available, which will help you a lot when working with the database. Of course there are plugins available for Eclipse or NetBeans as well, but I haven't tried them yet. Maybe you did?

After you have installed the IDE, download the MongoDB plugin and install it as well: http://plugins.jetbrains.com/plugin/7141
Then you can find the Mongo explorer on the right side of your workspace.

Go to the settings of this plugin.
There we have to add the path to the Mongo executable. Then add a server connection by clicking on the + at the end of your server list. Just leave all the settings as they are and click OK.

[Image: path]

[Image: connection]

Now we have established the connection to our server. If you can't connect, try restarting the IDE or start mongod manually.
That was basically the MongoDB part. Let's take a look at our Java crawler.

Java

I will go through the code step by step. You can find the complete code on my GitHub.
But before we start, we have to download some additional jar libraries that help us work with MongoDB on the one hand and the Twitter API on the other.
MongoDB Java driver: http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.3/mongo-java-driver-2.11.3.jar

Twitter4j package: http://twitter4j.org/archive/twitter4j-3.0.4.zip

Now add Twitter4j-core, Twitter4j-stream and the MongoDB driver to your project.

[Image: libs]

Our Java program starts with a small menu that lets us enter the keyword we want to look for. The program then runs as a loop, searching Twitter every few seconds for new tweets and saving them to our database. So it only saves tweets while it's running, but this is perfect, for example, for monitoring a certain event.
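
To make the loop idea concrete, here is a minimal sketch that uses only the Java standard library. It is not the code from the repository: the search function and the store are hypothetical stand-ins for the Twitter call and the database insert, and the sketch runs a fixed number of rounds instead of looping until you stop the program.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

final class PollingLoopSketch {

    /**
     * Runs the search a fixed number of times, hands every returned tweet to
     * the store, and waits between rounds. Returns the number of tweets handed over.
     */
    static int poll(Function<String, List<String>> search, String keyword,
                    Consumer<String> store, int rounds, long waitMillis) {
        int handedOver = 0;
        for (int round = 0; round < rounds; round++) {
            for (String tweet : search.apply(keyword)) {
                store.accept(tweet);
                handedOver++;
            }
            if (round < rounds - 1) {
                try {
                    Thread.sleep(waitMillis); // pause before the next search
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break; // stop polling when the thread is interrupted
                }
            }
        }
        return handedOver;
    }

    public static void main(String[] args) {
        // Hypothetical search that always returns the same two tweets.
        Function<String, List<String>> fakeSearch =
                keyword -> Arrays.asList("first tweet about " + keyword,
                                         "second tweet about " + keyword);
        int count = poll(fakeSearch, "mongodb", System.out::println, 3, 100);
        System.out.println(count + " tweets handed to the store");
    }
}
```

Because the fake search returns the same tweets in every round, the store receives them repeatedly; the real program needs a way to skip tweets it has already saved.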

 

 

 

So after the program has received a keyword, it connects to the database with connectdb(keyword);

 

 

The initMongoDB function connects to our local MongoDB server and creates a database instance called “db”. And here something cool happens: you can type in whatever name you want the database to have. If this database doesn't exist, MongoDB automatically creates it just for you and you can work with it like nothing happened.
The same happens when we call db.getCollection(keyword).
It automatically creates a collection if it doesn't exist. So no more error messages 😉
MongoDB is structured like this:

DATABASE --> Collections --> Documents
You could compare a collection to a table in an SQL database and the documents to the rows in this table; in this case our tweets.
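
The hierarchy and the create-on-demand behavior can be mimicked with nested maps. This is only an in-memory sketch under my own assumptions, not the MongoDB driver: the class and method names are made up for illustration, and it creates entries as soon as you ask for them, whereas a real MongoDB server waits until the first write.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HierarchySketch {

    // database name -> collection name -> list of documents
    private final Map<String, Map<String, List<Map<String, Object>>>> databases =
            new HashMap<>();

    /** Returns the collection, creating the database and the collection if missing. */
    List<Map<String, Object>> getCollection(String database, String collection) {
        return databases
                .computeIfAbsent(database, name -> new HashMap<>())
                .computeIfAbsent(collection, name -> new ArrayList<>());
    }

    /** Number of collections in a database; 0 if the database does not exist yet. */
    int collectionCount(String database) {
        Map<String, List<Map<String, Object>>> db = databases.get(database);
        return db == null ? 0 : db.size();
    }

    public static void main(String[] args) {
        HierarchySketch server = new HierarchySketch();

        Map<String, Object> tweet = new HashMap<>();
        tweet.put("tweet_ID", 42L);
        tweet.put("tweet_text", "hello");

        // Neither "twitterdb" nor the "mongodb" collection exists before this call.
        server.getCollection("twitterdb", "mongodb").add(tweet);
        System.out.println(server.getCollection("twitterdb", "mongodb").size());
    }
}
```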

But there are also two very important lines of code:

 

 

Here we create a unique index in our database. So it only saves a tweet if its tweet_ID isn't already in our database. Otherwise we would have duplicate entries.
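
The effect of such an index can be illustrated without a database. In the following sketch, which is an in-memory stand-in and not the MongoDB driver, the map key plays the role of the unique index on tweet_ID. The class name and the boolean return value are my own choices; as far as I know, a real MongoDB server reports a duplicate-key error instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class UniqueIndexSketch {

    // The map key plays the role of the unique index on tweet_ID.
    private final Map<Long, Map<String, Object>> byTweetId = new LinkedHashMap<>();

    /** Stores the document unless its tweet_ID is already present. Returns true if stored. */
    boolean insert(Map<String, Object> document) {
        Object id = document.get("tweet_ID");
        if (!(id instanceof Long)) {
            throw new IllegalArgumentException("document needs a Long tweet_ID");
        }
        // putIfAbsent keeps the existing entry and returns it when the key is taken.
        return byTweetId.putIfAbsent((Long) id, document) == null;
    }

    int size() {
        return byTweetId.size();
    }

    public static void main(String[] args) {
        UniqueIndexSketch collection = new UniqueIndexSketch();

        Map<String, Object> tweet = new LinkedHashMap<>();
        tweet.put("tweet_ID", 1001L);
        tweet.put("tweet_text", "hello");

        System.out.println(collection.insert(tweet)); // first insert is stored
        System.out.println(collection.insert(tweet)); // same tweet_ID is rejected
        System.out.println(collection.size());
    }
}
```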

But let´s get some tweets!

If you take a look at our main function again, it now creates a ConfigurationBuilder and sets the login details for our Twitter API access. This ConfigurationBuilder is needed for the TwitterFactory provided by the Twitter4j package, which we use in the next function: getTweetByQuery

 

 

After creating our connection to Twitter we get some tweets. Then we save the content of these tweets to our database. But we select what we want to save, as we don't need all the information delivered by Twitter.
We loop through the tweets in our tweets list. With the help of Twitter4j we put the selected fields into a BasicDBObject and finally insert this object into our database if it doesn't exist yet.
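
Picking only some fields could look roughly like this. The sketch uses a plain map instead of a BasicDBObject and a hypothetical, simplified Tweet class instead of the Twitter4j tweet object. Apart from tweet_ID, the field names are assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TweetDocumentSketch {

    /** Hypothetical, simplified tweet; a real Twitter4j tweet carries many more fields. */
    static final class Tweet {
        final long id;
        final String userName;
        final String text;
        final int retweetCount;
        final String source; // present on the tweet, deliberately not stored below

        Tweet(long id, String userName, String text, int retweetCount, String source) {
            this.id = id;
            this.userName = userName;
            this.text = text;
            this.retweetCount = retweetCount;
            this.source = source;
        }
    }

    /** Copies only the fields we want to keep into a document-like map. */
    static Map<String, Object> toDocument(Tweet tweet) {
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("tweet_ID", tweet.id);
        document.put("user_name", tweet.userName);
        document.put("tweet_text", tweet.text);
        document.put("retweet_count", tweet.retweetCount);
        return document;
    }

    public static void main(String[] args) {
        Tweet tweet = new Tweet(7L, "julianhi", "MongoDB is fun", 2, "web");
        System.out.println(toDocument(tweet));
    }
}
```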

Then the program sleeps a few seconds and starts the whole loop again.

Settings and Usage

If you want to monitor an event, there are two settings you can adjust to your needs: the time the loop waits and the number of tweets the searchTwitter function returns on each call.
So if you monitor an event that creates a huge amount of tweets, you can increase the number of tweets returned and decrease the time the loop waits. But be careful: if the program connects too often, Twitter will deny access.
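
A quick calculation shows how the two settings interact. The limit of 180 requests per 15 minutes used in the example is an assumption for illustration only, so please check Twitter's current documentation for the real numbers. The class and method names are made up as well.

```java
final class PollingBudgetSketch {

    /** Smallest wait between requests (rounded up) that stays within the request limit. */
    static long minWaitMillis(int maxRequests, long windowMillis) {
        if (maxRequests <= 0) {
            throw new IllegalArgumentException("maxRequests must be positive");
        }
        return (windowMillis + maxRequests - 1) / maxRequests; // ceiling division
    }

    /** Upper bound on the tweets one window can return for the chosen settings. */
    static long maxTweetsPerWindow(int tweetsPerRequest, long waitMillis, long windowMillis) {
        if (waitMillis <= 0) {
            throw new IllegalArgumentException("waitMillis must be positive");
        }
        return (windowMillis / waitMillis) * tweetsPerRequest;
    }

    public static void main(String[] args) {
        long window = 15L * 60L * 1000L; // assumed window: 15 minutes
        int limit = 180;                 // assumed limit: 180 requests per window

        long wait = minWaitMillis(limit, window);
        System.out.println("Wait at least " + wait + " ms between searches");
        System.out.println("At most " + maxTweetsPerWindow(100, wait, window)
                + " tweets per window when asking for 100 tweets per search");
    }
}
```

If an event produces more tweets per window than this upper bound, the loop cannot keep up, no matter how you tune the two settings.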

So start monitoring some events or just random keywords by running the program and typing in your keyword. The program will automatically create a collection for you in which the tweets are stored.

This was just the first step in building our own Twitter archive and analysis infrastructure. In the next part I will talk about how to connect R to your Twitter database and start analyzing your saved tweets.
I hope you enjoyed this first part, and please feel free to ask questions.

Part 2

Julian Hillebrand

During my time at university, while learning the basics of economics, I started heavily exploring the possibilities and changes caused by digital disruption and the process of digital transformation. I focused on the importance of data and data analytics in combination with marketing and management.
My personal focus of interest lies heavily on technology, digital marketing and data analytics. I made early acquaintance with programming and digital technology and never stop being interested in following the newest innovations.

I am an open, communicative and curious person. I enjoy writing, blogging and speaking about technology.

  • asadtaj

    Hi,
    I was trying to implement the code here with the help of one of my Java friends, @imvkcbit1. It seems that the main function is missing in the blog. I have created one, but the function getTweetsRecords() is not available in the blog either.

    Regards,
    Asad

  • Hi julianhi,

    I am working on a similar tool, I am very much interested in the analytics part ..
    any updates on the second part in this series ?

    Regards

    • Hi Vamsi,
      currently I'm playing with the REST API of MongoDB. But this is not the coolest way. I will try some libraries to connect R directly with the database.
      Do you have some information on your project? I'd be very interested in that.

      Regards

  • Thanks a lot for sharing this tutorial. I have found it to be very useful.

  • So u implemented the “streaming” by looping the REST? If you can use the streaming API, it’ll be much cooler! I’m currently looking for some tutorial for using stream API.

    • Hey Yang Zhang
      yes, I did it with a loop, as that is the way people understand better. But I also thought about doing it with the Streaming API, and it is very easy to integrate. I will update my post soon with the Streaming API integration.

      Regards

  • Abhishek

    Hi julianhi….

    I am able to do all the process mentioned

    But i am unable to connect with twitter

    As I am not able to retrieve Access token and AccessToken Secret??

    cb.setOAuthAccessToken(“XXX”);?????
    cb.setOAuthAccessTokenSecret(“XXX”);????

    Can please help me

    Thanks in advance

    Regards
    Abhishek

    • Hey Abhishek
      So did you register the app at Twitter? And replace the XXX with your own credentials you received from Twitter?

      Regards

  • Abhishek

    Hi julianhi…

    Got it ….now every thing are working fine.

    Thanks a lot for sharing this tutorial. I have found it to be very useful.
    Now I will be waiting for second part in this series.

    Thanks

    Regards
    Abhishek

  • minh

    Hello Julian,
    Thanks for the beautiful teaching article on how to build a Twitter application with MongoDB, Java, and R. I followed the article and it worked well, but I have a question about the function getTweetByQuery, for which I would like to seek your expertise. As I understand it, this function gets called every minute. In this function, we have (loadRecords) {
    getTweetsRecords();
    }
    , and getTweetsRecords just prints out the records from the Mongo table.
    As I understand it, getTweetByQuery gets called every minute, does its job by inserting new tweets into Mongo, and then calls getTweetsRecords to print out the just-inserted tweets. If there is no new tweet, getTweetsRecords just throws an exception.
    My question is: how could we modify the code so that the program keeps running after it gets a runtime exception from getTweetsRecords when there is no new tweet?
    Thanks,
    Tom

  • Anne

    Hey! I have tried to follow these instructions. However, after this step I am stuck:

    “Ok after you installed the IDE download the mongoDB plugin and install it as well http://plugins.jetbrains.com/plugin/7141
    Then you can find the mongo explorer on the right side of your working space.”

    What working space are you referring to? I can’t seem to find the window you have opened. I am sorry I am an absolute newbie, I might not recognize obvious interfaces. Thank you in advance to anyone reaching out!

  • Chris

    Hi,

    thank you for sharing your knowledge! It helps me very much doing my own analysis on twitter! 🙂

    Kind regards
    Chris