The Data Mining Process: Modeling

Happy new year, everyone! This series on the data mining process has so far covered understanding business problems and their associated data, as well as data preparation. This post focuses on modeling.

Developing models calls for using specific algorithms to explore, recognize, and ultimately output any patterns or themes in your data. Modeling has two main goals: to classify or to predict. Some algorithms specialize in one or the other, while others can do both. Choosing which algorithms to employ depends on the goals of the business, the nature of the data (structured versus unstructured), and the quantity and quality of the data.

Many popular algorithms are commonly used to develop models for specific types of business problems:

  • Predicting binary outcomes (yes/no) often utilizes logistic regression, which estimates the probability that each person or thing will or will not do something. I use this algorithm frequently when building models for colleges and universities that predict prospective students’ likelihoods of enrolling at the institution.
  • FP-growth is often used to develop association rules. A grocery store might use this algorithm to determine where to place specific items that are frequently bought together, such as milk and cookies. This type of analysis is also referred to as market basket analysis.
  • K-means clustering is typically a good algorithm to try when looking for connections among attributes that make it easier to create groupings of people or things. For example, health professionals might develop models based on observations of patients’ weight, cholesterol, and habits related to eating, smoking, and exercise to create buckets of those patients who are considered as high, moderate, or low risk for heart disease.
  • Linear regression is popular when companies attempt to predict consumption of their products. For example, a home electricity provider might look at the number of people in a household, past consumption, outdoor temperature patterns, and other variables to predict how much electricity a homeowner is likely to use.
  • Naive Bayes and support vector machine algorithms are often used for fraud detection and text analytics.
  • Decision trees can be used for many of the tasks mentioned above and are one of the most popular and flexible algorithms for predicting and/or classifying. Decision trees are particularly beneficial for reporting purposes because their visual nature makes it easy for all members of organizations to understand relationships among variables and what variables are most important in the analyses of various types of business problems.
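As a rough illustration of the logistic regression bullet above, here is a minimal sketch in plain Python that fits a model by gradient descent and returns a likelihood of enrolling. Everything specific in it is an assumption made for illustration: the two features (campus visits and distance from campus), the eight toy records, and the learning rate and number of passes are invented, not taken from any real enrollment model. In practice a tool such as RapidMiner or R would handle the fitting.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and a bias by batch gradient descent on log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of log loss with respect to the linear score
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        # Step against the average gradient.
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict_proba(x, weights, bias):
    """Return the modeled likelihood of a 'yes' outcome for one record."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical toy data: [campus visits, distance from campus in hundreds of miles]
rows = [[0, 5.0], [0, 3.0], [1, 4.0], [1, 1.0],
        [2, 2.0], [2, 0.5], [3, 1.0], [3, 0.2]]
labels = [0, 0, 0, 1, 1, 1, 1, 1]  # 1 = enrolled, 0 = did not enroll

weights, bias = fit_logistic(rows, labels)

# Score two hypothetical prospective students.
likely = predict_proba([3, 0.5], weights, bias)    # several visits, lives nearby
unlikely = predict_proba([0, 4.5], weights, bias)  # no visits, lives far away
print(f"nearby frequent visitor: {likely:.0%}")
print(f"distant non-visitor:     {unlikely:.0%}")
```

The output of `predict_proba` is the percentage likelihood described in the bullet: a number between 0 and 1 for each person, which can then be ranked or compared against a cutoff chosen by the business.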

It is important to mention that, while specific algorithms are often used to develop specific types of models, more than one algorithm is usually a plausible candidate in any given instance. Determining which algorithm is best for a specific modeling task requires evaluation. Evaluating models will be covered in the next post.

James Vineburgh

I never imagined that I would ever want to be a data scientist (or be remotely qualified to be considered one). I was scared of statistics as a student, a non-coder, and convinced that I peaked in math as a sixth grader. It wasn't until I had real-world opportunities to tackle concrete business challenges that I became interested in data mining. Now, I enjoy solving business problems and developing new products in the fields of advertising technology and market research in my work with Campus Explorer. Over the past few years, I have learned how to use RapidMiner, some R and Shiny, and several other tools that don't require programming expertise. I read software manuals and online tutorials (such as ThinkToStart!), watch videos, work with mentors, etc. to learn what I need to know to address the problem at hand. I am living proof that anyone who is curious, willing to fail over and over while learning, and interested in how to harness and make sense of data from all kinds of sources (including social media) can become a data scientist.