The Data Mining Process: Data Preparation

Our last post about the data mining process discussed the requirements of understanding the business problem that we are trying to solve as well as understanding the data that needs to be analyzed. This post addresses the next step in the data mining process – preparing data.

Practitioners in the field often remark that data preparation accounts for the majority of time spent in a data mining project, usually accounting for approximately 80% of the entire project. There are many reasons why data preparation is such a cumbersome process. Among them:

  • Data often has to be merged from spreadsheets, SQL databases, survey responses, social media sources, etc., and can consist of numbers, text, and even emojis 🙂
  • Data can be coded inconsistently or just plain incorrectly. For example, some date fields might be expressed as 01/01/2015 while others are in the form of January 1, 2015. Or, we might encounter a purchase date of 08/01/2018 (the person inputting this date could have meant to press “5” on the keyboard and accidentally hit “8”).
  • Data values are suspect in terms of accuracy (I recently engaged in a modeling project for a bank in which the customer database listed a number of customers over the age of 100 that was much higher than what seemed possible based on national averages).
  • Data is missing (this particular topic will be covered in a separate post due to how prominent, complex, and important it is).
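The inconsistent date formats mentioned above are a good example of a cleaning task that is tedious by hand but easy to script. Here is a minimal sketch in Python; the helper name `parse_date` and the list of formats are my own illustrative assumptions, not part of any particular tool mentioned in this post:

```python
from datetime import datetime

# Assumed formats, matching the examples in the post:
# "01/01/2015" and "January 1, 2015"
KNOWN_FORMATS = ["%m/%d/%Y", "%B %d, %Y"]

def parse_date(raw):
    """Try each known format in turn; return None if nothing matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    # Returning None flags the value for manual review rather than guessing.
    return None
```

With this approach, `parse_date("01/01/2015")` and `parse_date("January 1, 2015")` yield the same value, and anything unrecognized comes back as `None` so a human can decide what to do with it.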

Looking at the examples above, it is clear that human judgment and decision-making are critical components of the data preparation process. Let’s take the example of the customers in the bank database that were over the age of 100. Was it possible that this bank had an unusually high number of centenarians? Sure. But, was it worth a little extra digging into the data? I thought so, so I dug a little deeper.

As it turned out, there were a few customers over the age of 100. But, whoever entered the data in the “date of birth” field inputted “01/01/1900” whenever he/she didn’t have the actual date of birth for the customer. Perhaps the database was programmed so that the data entry person couldn’t save records with missing data. Or, maybe the data entry person knew enough about data mining to understand that missing data is a huge problem in a modeling effort and was trying to be helpful. Whatever the case and no matter how good the intentions, these entries of “01/01/1900” were initially skewing my model because age was a significant variable in the analyses of customer profitability and churn. The algorithm did not catch the error and could not have. Only a human would realize that seeing 5% of a customer database with customers aged 115 was problematic.

This example illustrates the fact that people must spend time looking carefully at data, making decisions (which aren’t always clear-cut) about what to do with anomalies or ambiguities, and then cleansing the data appropriately. While there are certain data mining algorithms that are inherently better at handling missing data and outliers, there is simply no substitute for spending the necessary time required to adequately analyze and prepare your data.
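Once a human has identified a sentinel value like “01/01/1900”, converting it (and any other implausible ages) to missing is straightforward to automate. A minimal sketch, assuming an analysis date of 2015 and a plausibility cutoff of 110 years; the function name, the cutoff, and the analysis date are illustrative choices, not details from the bank project:

```python
from datetime import date

# Placeholder DOB observed in the data; only a human review surfaced this.
SENTINEL_DOBS = {date(1900, 1, 1)}
AS_OF = date(2015, 1, 1)  # assumed analysis date

def age_or_none(dob, as_of=AS_OF, max_plausible=110):
    """Return age in whole years, or None for sentinel/implausible values."""
    if dob in SENTINEL_DOBS:
        return None
    # Subtract one year if the birthday hasn't occurred yet this year.
    age = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
    return age if 0 <= age <= max_plausible else None
```

The key point is the order of operations: the sentinel had to be discovered by a person first; the script merely applies that human decision consistently across the whole database.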


James Vineburgh

I never imagined that I would ever want to be a data scientist (or be remotely qualified to be considered one). Scared of statistics as a student, a non-coder, and convinced that I peaked in math as a sixth grader, it wasn't until I had real-world opportunities to tackle concrete business challenges that I became interested in data mining. Now, I enjoy solving business problems and developing new products in the fields of advertising technology and market research in my work with Campus Explorer. Over the past few years, I have learned how to use Rapidminer, some R and Shiny, and several other tools that don't require programming expertise. I read software manuals and online tutorials (such as ThinkToStart!), watch videos, work with mentors, etc. to learn what I need to know to address the problem at hand. I am living proof that anyone who is curious, willing to fail over and over while learning, and interested in how to harness and make sense of data from all kinds of sources (including social media) can become a data scientist.