What is H2O?

H2O is a machine learning framework aimed at data scientists and business analysts "who need scalable and fast machine learning". H2O is completely open source, and what makes it stand out is that it works right out of the box. There is hardly an easier way to get started with scalable machine learning. It has support for R, Python, Scala and Java, and it also offers a REST API and its own web UI. So you can use it for research as well as in production environments.

H2O can run on top of Apache Hadoop and Apache Spark, which gives it enormous power through in-memory parallel processing.


H2O is the #1 Java-based open-source machine learning project on GitHub and is used by many well-known companies such as PayPal.

Get the data

The data we will use to train our machine learning model is from a Kaggle competition from the year 2013. The challenge was organized by Data Science London and the UK Windows Azure Users Group who partnered with Microsoft and Peerindex.

The data was described as: “The dataset, provided by Peerindex, comprises a standard, pair-wise preference learning task. Each datapoint describes two individuals, A and B. For each person, 11 pre-computed, non-negative numeric features based on twitter activity (such as volume of interactions, number of followers, etc) are provided.

The binary label represents a human judgement about which one of the two individuals is more influential. A label ‘1’ means A is more influential than B. 0 means B is more influential than A. The goal of the challenge is to train a machine learning model which, for pairs of individuals, predicts the human judgement on who is more influential with high accuracy. Labels for the dataset have been collected by PeerIndex using an application similar to the one described in this post.”

Theory to build the ensemble model

With the h2oEnsemble package we can create ensembles with all available machine learning models in H2O. As the package authors explain: “This type of ensemble learning is called “super learning”, “stacked regression” or “stacking.” The Super Learner algorithm learns the optimal combination of the base learner fits.”

This approach is based on the article "Super Learner", published in the journal Statistical Applications in Genetics and Molecular Biology by Mark J. van der Laan, Eric C. Polley and Alan E. Hubbard from the University of California, Berkeley in 2007. They also created the SuperLearner R package based on that work.

Install the dependencies

First of all we need the h2o package. It is available on CRAN, as are the packages SuperLearner and cvAUC, which we also need to build our model.

The other, and maybe most important, package is h2oEnsemble. We can install it via devtools from the h2o GitHub repository.
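The installation might look like this. Note that the GitHub subpath for h2oEnsemble is an assumption based on the h2o-3 repository layout at the time of writing, so check the repository if the package has moved:

```r
# h2o, SuperLearner and cvAUC are all on CRAN
install.packages(c("h2o", "SuperLearner", "cvAUC"))

# h2oEnsemble lives in the h2o GitHub repository; install it via devtools
# (the repository subpath is an assumption -- see the h2o-3 repo for the current one)
install.packages("devtools")
library(devtools)
install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")
```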

Then we just have to load all the packages:
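Loading the four packages we just installed:

```r
library(h2o)          # core H2O bindings for R
library(h2oEnsemble)  # the ensemble / super learner wrappers
library(SuperLearner) # provides the meta learner functions
library(cvAUC)        # cross-validated AUC, useful for evaluation
```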


Preparing the data

First we have to load the dataset and split it into a train and a test set. The test dataset provided by the Kaggle challenge does not include output labels, so we cannot use it to test our machine learning model.

We split it by creating a train index that chooses 4000 line numbers from the data frame. We then subset it into train and test:
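A sketch of the load-and-split step. The file name train.csv comes from the Kaggle download, and the seed is only there so the split is reproducible:

```r
# load the Kaggle training data (train.csv from the competition page)
data <- read.csv("train.csv")

# draw 4000 random row numbers for the training set
set.seed(1)
train_idx <- sample(nrow(data), 4000)

# subset the data frame into train and test
train <- data[train_idx, ]
test  <- data[-train_idx, ]
```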

Preparing the ensemble model

We begin by starting an H2O instance directly from our R console:
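A minimal sketch of that step; nthreads = -1 simply tells H2O to use all available cores:

```r
library(h2o)

# start (or connect to) a local H2O cluster on this machine
h2o.init(nthreads = -1)
```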



After starting the H2O instance, we have to transform our dataset a little further. We have to convert it into H2O frames, define the output column, and set the family variable to "binomial", as we want to do a binomial classification.

That is also the reason why we have to turn the output column into a factor variable.
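A sketch of these transformations, assuming the train and test data frames from the split above. The label column is called "Choice" in the Kaggle data; treat that name as an assumption and adjust it if your download differs:

```r
# move the data into the H2O cluster
train_h2o <- as.h2o(train)
test_h2o  <- as.h2o(test)

# "Choice" is the binary label column (assumption -- check your CSV header);
# everything else is used as a feature
y <- "Choice"
x <- setdiff(names(train_h2o), y)

# we want a binomial classification ...
family <- "binomial"

# ... which is why the output column has to be a factor
train_h2o[, y] <- as.factor(train_h2o[, y])
test_h2o[, y]  <- as.factor(test_h2o[, y])
```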

Then we can define the machine learning methods we want to include in our ensemble and set the meta learner from the SuperLearner package. In this case we will use a generalized linear model.
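A sketch using the wrapper functions that h2oEnsemble ships for each H2O algorithm, with SL.glm as the generalized linear model meta learner from SuperLearner:

```r
# base learners: one wrapper function per H2O algorithm in the ensemble
learner <- c("h2o.glm.wrapper",
             "h2o.randomForest.wrapper",
             "h2o.gbm.wrapper",
             "h2o.deeplearning.wrapper")

# meta learner: a generalized linear model from the SuperLearner package
metalearner <- "SL.glm"
```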

Build the model to predict Social Network Influence

After defining all the parameters we can build our model.
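The call might look like the sketch below; the 5-fold cross-validation in cvControl is an assumption, not a requirement:

```r
# train the super learner: each base learner is cross-validated,
# then the meta learner is fit on the cross-validated predictions
fit <- h2o.ensemble(x = x, y = y,
                    training_frame = train_h2o,
                    family = family,
                    learner = learner,
                    metalearner = metalearner,
                    cvControl = list(V = 5))
```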



Evaluate the performance

Evaluating the performance of an H2O model is pretty easy. But as the ensemble method is not fully implemented in the H2O package yet, we have to create a small workaround to get the accuracy of our model.
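One way to sketch that workaround, assuming the usual layout where the third column of pred$pred holds the predicted probability of class 1:

```r
# predict on the held-out test set
pred <- predict(fit, test_h2o)

# pull the ensemble's class-1 probability and the true labels back into R
predictions <- as.data.frame(pred$pred)[, 3]
labels <- as.data.frame(test_h2o[, y])[, 1]

# workaround: threshold the probabilities at 0.5 and compare
# against the true labels to get the accuracy
accuracy <- mean(ifelse(predictions > 0.5, 1, 0) == labels)
accuracy
```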

In my case the accuracy was nearly 88%, which is a pretty good result.

You can also look at the accuracy of the single models you used in your ensemble:
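A sketch, assuming pred$basepred holds one column of predictions per base learner (and reusing the labels vector from the evaluation step):

```r
# one column of predictions per base learner
base_preds <- as.data.frame(pred$basepred)

# accuracy of each single model, using the same 0.5 threshold
sapply(base_preds, function(p) mean(ifelse(p > 0.5, 1, 0) == labels))
```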


Julian Hillebrand
During my time at university, while learning the basics of economics, I started heavily exploring the possibilities and changes caused by digital disruption and the process of digital transformation, focusing on the importance of data and data analytics in combination with marketing and management. My personal interests lie heavily in technology, digital marketing and data analytics. I made an early acquaintance with programming and digital technology and never stopped following the newest innovations. I am an open, communicative and curious person. I enjoy writing, blogging and speaking about technology.

  • Icaro Bombonato

    This is awesome!

    Very easy and simple show of how to use the H2O!

    I already tried it and got very good results!


  • adam

    Dear Sir,

    Is there a weblink so I can download the train.csv dataset?
    Thank you for your help.

  • adam

    How can I save the model so I can reuse it on other data?
    I tried saving it using
    save(fit, file = "modelh201.rda")

    But when I load the model again using

    and then use fit to predict I get an error.

    • Hey adam
      there is currently no real way to save the model as the package is still in beta.
      But I am sure that there will be a way in the future and I will then update the tutorial.


      • Aakash Gupta

        Julian, thanks for sharing this update. I am using the h2o package, but I didn’t know that they were developing an ensemble option in it. (they do not mention it on their documentation nor is it discussed on the google group)

        h2o.saveModel/ h2o.loadModel might work. I have not tried it for a model created using h2o.ensemble method but it works well for the models prepared using the standard h2o functions. Also h2o.download_pojo could give you the java class file for your model. Which you can use for predictions.

  • Ed_f

    Thanks for taking the time to post this. I’m running into an error when the h2o ensemble function is called:

    “Cross-validating and training base learner 1: h2o.glm.wrapper”

    unexpected argument “fold_column”, is this legacy code? Try ?h2o.shim

    stepping into the function it fails on line 86 of the h2oensemble() function,

    "mm <- .make_Z(x = x, y = y, training_frame = training_frame,
    family = family, learner = learner, parallel = parallel,
    seed = seed, V = V, L = L, idxs = idxs, metalearner_type = metalearner_type)"

    if I step into this, it fails on:

    fitlist <- sapply(X = 1:L, FUN = .fitWrapper, y = y, xcols = x,
    training_frame = training_frame, validation_frame = NULL,
    family = family, learner = learner, seed = seed, fold_column = "fold_id",
    simplify = FALSE)

    if I switch the order of the "learners" and put random forest before glm in the "learner" variable it runs to 100% and then exits on the same error when it reaches the glm.

    I'm running R 3.2.2 on Vista home premium (32 bit) with the packages installed as shown in the blog post. Any suggestions (apart from "buy a new computer") welcome!


  • Patrick Phiri

    I am intrigued with this project and tried it out, but am getting an error:
    Error: is.character(key) && length(key) == 1L && !is.na(key) is not TRUE

    when I am setting the training_frame. I have followed the steps and am not sure why the error occurs.