Our last post about the data mining process discussed the need to understand both the business problem we are trying to solve and the data that needs to be analyzed. This post addresses the next step in the process: preparing the data.
Practitioners in the field often remark that data preparation consumes the majority of the time spent on a data mining project, typically around 80% of the entire effort. There are many reasons why data preparation is such a cumbersome process. Among them:
- Data often has to be merged from multiple sources (spreadsheets, SQL databases, survey responses, social media, etc.) and can consist of numbers, text, and even emojis 🙂
- Data can be coded inconsistently or just plain incorrectly. For example, some date fields might be expressed as 01/01/2015 while others are in the form of January 1, 2015. Or, we might encounter a purchase date of 08/01/2018 when the person entering it meant to press “5” but accidentally hit “8” (a sketch of parsing such mixed date formats appears after this list).
- Data values may be of suspect accuracy. (I recently worked on a modeling project for a bank whose customer database listed far more customers over the age of 100 than national averages would suggest is possible.)
- Data is missing (a topic prominent, complex, and important enough to warrant its own separate post).
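To make the date problem concrete, here is a minimal Python sketch of one way to normalize dates coded in mixed formats. The raw values and the list of candidate formats are assumptions for illustration, not any particular system's layout:

```python
# A minimal sketch of normalizing inconsistently coded dates.
# The raw values and candidate formats below are hypothetical.
from datetime import datetime

raw_dates = ["01/01/2015", "January 1, 2015", "08/01/2018"]

# Formats we expect to encounter; extend this list as new variants appear.
CANDIDATE_FORMATS = ["%m/%d/%Y", "%B %d, %Y"]

def parse_date(raw):
    """Try each known format; return None so a human can review what fails."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    return None

for raw in raw_dates:
    print(raw, "->", parse_date(raw))
```

Note that the mistyped 08/01/2018 parses without complaint, since it is a perfectly valid date. Format checks catch structural inconsistency; only human judgment, or a domain-specific sanity check, can catch a plausible-looking wrong value.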
Looking at the examples above, it is clear that human judgment and decision-making are critical components of the data preparation process. Take the bank customers who were over the age of 100. Was it possible that this bank had an unusually high number of centenarians? Sure. But was it worth a little extra digging into the data? I thought so, so I dug deeper.

As it turned out, there were a few genuine customers over the age of 100. But whoever entered the data in the “date of birth” field had typed “01/01/1900” whenever the customer’s actual date of birth wasn’t available. Perhaps the database was programmed so that records with missing data couldn’t be saved. Or maybe the data entry person knew enough about data mining to understand that missing data is a huge problem in a modeling effort and was trying to be helpful. Whatever the case, and no matter how good the intentions, these “01/01/1900” entries were initially skewing my model because age was a significant variable in the analyses of customer profitability and churn. The algorithm did not catch the error and could not have; only a human would recognize that a customer database in which 5% of customers are 115 years old is problematic.

This example illustrates that people must spend time looking carefully at data, making decisions (which aren’t always clear-cut) about what to do with anomalies or ambiguities, and then cleansing the data appropriately. While certain data mining algorithms are inherently better at handling missing data and outliers, there is simply no substitute for spending the time required to adequately analyze and prepare your data.
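Here is a minimal sketch of the kind of cleansing step this led to, assuming a hypothetical in-memory customer table with a date_of_birth field. The placeholder value comes from the story above, but the age cutoff is a judgment call that no algorithm can make for you:

```python
# A minimal sketch of flagging placeholder birthdates in a hypothetical
# customer table. Records with the sentinel value or an implausible age
# have date_of_birth set to None (i.e., explicitly missing).
from datetime import date

customers = [
    {"id": 1, "date_of_birth": date(1962, 7, 14)},
    {"id": 2, "date_of_birth": date(1900, 1, 1)},   # the placeholder entry
    {"id": 3, "date_of_birth": date(1988, 3, 2)},
]

PLACEHOLDER = date(1900, 1, 1)
MAX_PLAUSIBLE_AGE = 110  # a human judgment call, not an algorithmic one

def age_on(dob, as_of):
    """Whole years between dob and as_of."""
    return as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))

today = date.today()
for c in customers:
    dob = c["date_of_birth"]
    if dob == PLACEHOLDER or age_on(dob, today) > MAX_PLAUSIBLE_AGE:
        c["date_of_birth"] = None  # now visibly missing, to be handled deliberately

print(customers)
```

Converting the sentinel to an explicit missing value doesn’t solve the problem, but it surfaces it: the missingness can then be handled deliberately (the subject of the upcoming post) instead of letting a fake 115-year-old quietly skew the model.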