DATA MINING

...is nothing else than torturing the data until it confesses…and if you torture it enough, you can get it to confess to anything (Fred Menger)

Data Mining Tutorial - Part 1
Introduction to Data Mining

    If you are a beginner in data mining and want to become at least familiar with the main concepts and terminology, maybe the first step would be to acquire a clear bird's-eye view of the whole domain - definition, inception, classification, influences, trends - but without diving into overly deep and scholastic details. And, of course, you do that by searching the Internet and skimming whatever books you have at hand. You might say that one day would be more than enough to get an overall understanding of this domain. But you don't have a clue about the trouble you're getting into. These are some of the questions that I'm sure you'll start asking yourself: How is data mining related to predictive analytics? How is it related to knowledge discovery? What about data mining vs machine learning and artificial intelligence? What is the difference between predictive/descriptive and supervised/unsupervised? Can we even compare these 2 classifications in the first place? And this is just the beginning...
This article tries to shed some light on these questions and present all these concepts in a very simple manner.


Data Mining Definition

    Let's quickly start with a definition, of course. Actually, let's choose two from the hundreds of available definitions. The first one, classic and well-known, says that data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data (W. Frawley). The second one, sharp and rebellious, says that data mining is nothing else than torturing the data until it confesses… and if you torture it enough, you can get it to confess to anything (Fred Menger).

The annual Bill of Mortality for London and its environs, 1665

Data Mining Inception

    Unfortunately, we don't have a clear date on which to celebrate data mining's birthday every year. It emerged in the early 1980s, when the amount of data generated and stored in databases became overwhelming and there was a strong need for tools and methods to extract useful, task-oriented knowledge. However, some people say that the actual birth of data mining happened in London in 1662, when John Graunt wrote Natural and Political Observations Made upon the Bills of Mortality. It was an impressive work for those times. He did a thorough analysis of mortality in those years and tried to build a model to predict the next bubonic plague in the city. Whether that was a data mining project or not is still debatable even now. The data used was definitely not big and the model was never actually built. Some conclusions were drawn though, like correlations between plague years and new kings, and population dynamics in London. I would rather say this was a statistical project with large implications for demography. As a matter of fact, some statisticians consider him the founder of the science of demography.

Influences

    Here's where the pain starts. If you don't already know it by now, I can tell you that data mining has influences from a lot of other domains like statistics, machine learning, pattern recognition, artificial intelligence, data visualization and so on... That's the reason some people say that data mining doesn't really have any relevance by itself and cannot be considered a separate discipline. They say it is just a fancy name invented to squeeze some more bucks from the business communities by applying the same old techniques used in the above-mentioned domains. But data mining sounds so cool, doesn't it? So when you calculate the covariance of 2 variables, don't call it statistics because you actually mine the data there! Is this really true? It would actually make sense at first glance, but let's take a closer look.

        Data Mining vs Statistics

    Now that we know what data mining is, let's see what the science of statistics is and how it is related to data mining. A very good definition of statistics says that "Statistics is the science and practice of developing knowledge through the use of empirical data expressed in quantitative form. It is based on statistical theory which is a branch of applied mathematics. Within statistical theory, randomness and uncertainty are modeled by probability theory." Please note that everything is expressed in quantitative form. Statistics is primarily oriented towards the extraction of quantitative and statistical characteristics of data. Let's stick to the example above. When we determine the covariance of 2 variables, we see whether these variables vary together and, once the covariance is normalized into a correlation coefficient, we measure the strength of the relationship. But we'll never be able to characterize this dependency at a conceptual level, or produce a causal explanation and a qualitative description of the relationship. You cannot "see" the reasons for this dependency because they are related to factors that are not explicitly provided in the data. Furthermore, the data mining process is interactive, iterative and exploratory. Not to mention data pre-processing, which is essential for any data mining project. Data reduction and compression, data cleaning and transformation are very important for data mining but they are definitely not what statistics is about.
Verdict: We can definitely say they are different.
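
To make the covariance example a bit more tangible, here is a minimal Python sketch. The advertising and sales figures are invented for illustration, and the helper names (mean, covariance, correlation) are my own, not taken from any library. It only shows what the statistical side delivers: two numbers that say the variables move together, and nothing about why they do.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: the covariance rescaled to the range [-1, 1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: advertising spend and sales over five months.
ad_spend = [10, 20, 30, 40, 50]
sales = [12, 24, 33, 46, 55]

cov = covariance(ad_spend, sales)    # sign tells us the direction
corr = correlation(ad_spend, sales)  # magnitude tells us the strength

print(f"covariance  = {cov:.2f}")
print(f"correlation = {corr:.4f}")
```

A positive covariance and a correlation close to 1 tell us the two series rise together, but the numbers say nothing about the cause of that dependency.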

        Data Mining vs Machine Learning

    Machine learning is a sub-field of artificial intelligence. It was conceived in the early 1960s with the clear objective of designing and developing algorithms and techniques that implement various types of learning, that is, mechanisms capable of inducing knowledge from examples of data. Machine learning has a wide spectrum of applications including natural language processing, search engines, medical diagnosis, bio-informatics, speech and handwriting recognition, object recognition in computer vision, game playing and robot locomotion. The general framework for machine learning is as follows: the learning system aims at determining a description of a given concept from a set of concept examples provided by the teacher and from the background knowledge. Concept examples can be positive (iron, when teaching the concept of metals) or negative (marble). Background knowledge contains the information about the language used to describe the examples and concepts - possible values of variables (domain), hierarchies, predicates, rules, etc. The learning algorithm then builds on the type of examples and on the size and relevance of the background knowledge (there are some other factors involved but we won't discuss them here). The main types of learning systems are supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, transduction and learning to learn. We'll see later on in this article how the first 2 systems influenced data mining to a great extent. So the machine learning process emphasizes the development of the algorithms and usually assumes the data already resides in main memory. On the other hand, the first condition for a data mining project to succeed is to have data, large amounts of data. Think about a chess computer game. It doesn't require a huge database to play against you. It only needs the examples and the knowledge. Teach the system what a counter-gambit is and it will know how to respond.
Verdict: They are different.
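
The learning framework described above (a teacher, positive and negative concept examples, background knowledge) can be sketched in a few lines of Python. This is only an illustration in the spirit of the classic Find-S procedure, which generalizes over the positive examples; the attributes, their values and the materials are all made up for the "metals" concept.

```python
# Background knowledge (hypothetical): the attributes used to describe a material.
ATTRIBUTES = ("conducts_electricity", "shiny", "malleable", "color")
ANY = "?"  # wildcard meaning "any value is acceptable"

def describe(conducts, shiny, malleable, color):
    return dict(zip(ATTRIBUTES, (conducts, shiny, malleable, color)))

# Concept examples provided by the teacher: (description, is_positive).
examples = [
    (describe("yes", "yes", "yes", "grey"), True),     # iron
    (describe("yes", "yes", "yes", "reddish"), True),  # copper
    (describe("no", "yes", "no", "white"), False),     # marble
    (describe("yes", "yes", "yes", "yellow"), True),   # gold
]

def learn_concept(examples):
    """Return the most specific description that covers every positive example."""
    hypothesis = None
    for description, is_positive in examples:
        if not is_positive:
            continue  # this simple learner only generalizes from positives
        if hypothesis is None:
            hypothesis = dict(description)
        else:
            for attr in ATTRIBUTES:
                if hypothesis[attr] != description[attr]:
                    hypothesis[attr] = ANY
    return hypothesis

def matches(hypothesis, description):
    return all(hypothesis[a] in (ANY, description[a]) for a in ATTRIBUTES)

metal = learn_concept(examples)
print(metal)

# The negative examples are used here only as a sanity check on the result.
rejected = all(not matches(metal, d) for d, positive in examples if not positive)
print("all negative examples rejected:", rejected)
```

Notice that four examples are enough for the learner to do its job; no large database is involved, which is exactly the contrast with data mining drawn above.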

        Data Mining vs KDD

    What is KDD? It is not a syndrome (as I first thought when I heard about it) and it is not the name of a DJ either. And don't you dare associate it with the annual conference on Knowledge Discovery and Data Mining organized by SIGKDD. It actually means Knowledge Discovery in Databases and the concept emerged in 1989 to refer to the broad process of finding knowledge in data. It refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Knowledge discovery differs from machine learning in that the task is more general and is concerned with issues specific to databases. Oops! Does it sound familiar to you? I bet it does. Then what the bleep is data mining? Actually, these 2 terms were used interchangeably for several years. No distinction was made, until a kind of consensus was reached within the community. We still have 2 terms but with slightly different meanings. The term KDD is now viewed as the overall process of discovering useful knowledge from data, while data mining is viewed as the application of some particular algorithms for extracting patterns from data, without the additional steps of the KDD process such as data cleaning, data reduction and concept hierarchy generation, which can even extend to the infrastructure of the project. To me, it sounds a bit fishy. I mean, how on earth would a newcomer know this difference between KDD and data mining? The names don't tell you anything.
Verdict: Well, I would say they are quite the same.

       Data Mining vs Predictive Analytics

    A definition of predictive analytics says that it is an area of statistical analysis that deals with extracting information from data and using it to predict future trends and behavior patterns. The core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting them to predict future outcomes. So what's the news? Didn't I see that in data mining as well? And, as in all respected sciences, a central element of predictive analytics was also invented - that is the predictor, a variable that can be measured for an individual or other entity to predict future behavior. So here's what happened. People got bored with the term data mining, it was so worn-out and on everyone's lips, so they invented another term - predictive analytics or predictive modeling. Or maybe, as the president of The Modeling Agency, Eric King, once said, a shift was required to rapidly move towards far more descriptive and accurate nomenclature such as predictive modeling and predictive analytics, rather than the old data mining.
Verdict: Definitely the same. I would be very happy (and surprised) if anyone who reads this article can bring arguments against this opinion.

       Predictive/Descriptive Data Mining vs Supervised/Unsupervised Learning

    Let's see now what the catch is with these 4 terms that are used so much by the data mining (or whatever you want to call it) community. And, as usual, let's start with definitions.
Supervised learning is a machine learning technique for creating a function from training data. The training data consist of pairs of input objects, and desired outputs. The output of the function can be a continuous value (called regression), or can predict a class label of the input object (called classification). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations in a "reasonable" way. 
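
Here is a small Python sketch of both flavours of supervised learning mentioned in this definition. The training pairs are invented, and the two methods (a least-squares line for regression, a 1-nearest-neighbour rule for classification) are just simple stand-ins chosen for illustration; many other supervised learners exist.

```python
# Regression: learn a function y = a + b * x from (input, desired output) pairs.
training_pairs = [(1, 3), (2, 5), (3, 7), (4, 9)]  # hypothetical data

def fit_line(pairs):
    """Ordinary least-squares fit of a straight line; returns (a, b)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line(training_pairs)
predicted = a + b * 5  # generalize to an input that was never seen in training
print("regression prediction for x = 5:", predicted)

# Classification: predict a class label with the 1-nearest-neighbour rule.
labelled_points = [
    ((1.0, 1.0), "small"), ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"), ((9.0, 7.5), "large"),
]

def classify(point, labelled):
    """Return the label of the training example closest to the given point."""
    def squared_distance(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    _, label = min(labelled, key=lambda item: squared_distance(item[0], point))
    return label

print("label for (2.0, 1.5):", classify((2.0, 1.5), labelled_points))
print("label for (7.0, 9.0):", classify((7.0, 9.0), labelled_points))
```

In both cases the learner sees inputs together with the desired outputs, and is then asked about inputs it has not seen before.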
Unsupervised learning is a method of machine learning where a model is fit to observations. It is distinguished from supervised learning by the fact that there is no a priori output. In unsupervised learning, a data set of input objects is gathered. Unsupervised learning then typically treats input objects as a set of random variables. A joint density model is then built for the data set.
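
A minimal Python sketch of this idea follows. It fits a very simple joint density model to unlabelled observations: one Gaussian per variable, multiplied together. Treating the variables as independent is a simplifying assumption of mine, and the height/weight observations are invented. The point is only that no desired output appears anywhere in the data.

```python
from math import exp, pi, sqrt

# Unlabelled observations (hypothetical): height in cm, weight in kg.
observations = [(170, 65), (165, 60), (180, 80), (175, 72), (160, 55)]

def fit_gaussian(values):
    """Maximum-likelihood mean and variance of a one-dimensional sample."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var

def normal_pdf(x, m, var):
    return exp(-((x - m) ** 2) / (2 * var)) / sqrt(2 * pi * var)

# One (mean, variance) pair per variable; zip(*...) turns rows into columns.
params = [fit_gaussian(column) for column in zip(*observations)]

def joint_density(point):
    """Joint density under the independence assumption: product of the marginals."""
    density = 1.0
    for x, (m, var) in zip(point, params):
        density *= normal_pdf(x, m, var)
    return density

typical = joint_density((170, 66))
unusual = joint_density((200, 120))
print("density at a typical point:", typical)
print("density at an unusual point:", unusual)
```

The fitted model describes the data set as a whole; it can tell us that one observation looks ordinary and another looks odd, without ever having been given a target to predict.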
Predictive data mining can be used to forecast explicit values, based on patterns determined from known results. What did the definition say about supervised learning? That it predicts the value of a function for any valid input object after having seen a number of training examples. Isn't this the same?
Descriptive data mining describes a data set in a concise way and presents interesting characteristics of the data without having any predefined target. What did the definition say about unsupervised learning? That it has no a priori output? Again, isn't this the same?
As a matter of fact, these 2 categories of terms are largely used interchangeably. But they can be compared only to some extent. Predictive and descriptive data mining are concepts whereas supervised and unsupervised learning are methods. It's true, supervised learning is used in predictive data mining and unsupervised learning is used in descriptive data mining, but these are not the only 2 methods used in data mining (they are the major ones though). Some people say that methods like sequential prediction or interpolation are also used in predictive data mining without being supervised learning techniques. And methods like correlations, associations and dependencies are used in descriptive data mining without being unsupervised learning techniques. However, the supervised/unsupervised learning techniques are the most used ones in data mining. It's funny, but sad at the same time, when you find and read articles about predictive learning and descriptive learning or supervised/unsupervised data mining. It is an incredible mixture of terms that perhaps stems from the same continuous and not always understandable desire to invent new terms for this otherwise straightforward discipline.

       Data Mining vs Business Intelligence

    Business Intelligence represents a category of applications, technologies and concepts for gathering, storing, analyzing, and providing access to data to help business users make better decisions. So business intelligence is a much broader category than data mining. Being able to create a report with monthly sales activity by salesperson and product is called BI. Dashboards and scorecards are also BI tools. Data extraction, transformation, OLAP, slicing, dicing and filtering are all related to BI processes. And data mining too is a BI tool. The most advanced one, I would say, because it involves not only preparing and displaying the data in various ways but also extracting implicit information that otherwise wouldn’t have been displayed.
Verdict: Different.
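
To see what the plain BI side looks like, here is a minimal Python sketch of the monthly sales report mentioned above. The sales records, the field names and the function name monthly_sales_report are all made up for illustration; a real BI tool would of course read them from a database or a cube.

```python
from collections import defaultdict

# Hypothetical sales records.
sales = [
    {"date": "2024-01-15", "salesperson": "Ann", "product": "Laptop", "amount": 1200.0},
    {"date": "2024-01-20", "salesperson": "Ann", "product": "Laptop", "amount": 800.0},
    {"date": "2024-01-22", "salesperson": "Bob", "product": "Phone", "amount": 500.0},
    {"date": "2024-02-03", "salesperson": "Ann", "product": "Phone", "amount": 450.0},
    {"date": "2024-02-10", "salesperson": "Bob", "product": "Phone", "amount": 550.0},
]

def monthly_sales_report(records):
    """Total sales amount per (month, salesperson, product)."""
    totals = defaultdict(float)
    for record in records:
        month = record["date"][:7]  # "YYYY-MM" prefix of the ISO date
        key = (month, record["salesperson"], record["product"])
        totals[key] += record["amount"]
    return dict(totals)

report = monthly_sales_report(sales)
for (month, person, product), total in sorted(report.items()):
    print(f"{month}  {person:<5} {product:<7} {total:>9.2f}")
```

Everything in this report is already explicit in the data; it has only been grouped and summed. Data mining starts where we look for something that is not written in any of these rows.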

Classification

   This is again a sensitive subject in data mining and, honestly, I'm kind of afraid to tackle it in great detail now. There are thousands of classifications of data mining with classes, sub-classes and sub-sub-classes. And sometimes a particular class may have dozens of names. So I'd rather be cautious and come up with a very succinct classification. However, despite its brevity, this classification covers all the major data mining techniques that we'll talk about in some future articles.

Association Rules
Classification
Prediction
Clustering

Allow me not to say anything about them for now, since this is just an introduction to data mining and I want to keep it as simple as possible. It's always beneficial to put things in a very simple and clear way. If you are able to explain an IT concept in such a way that your grandpa is able to understand it, then you're really good.

So if you are a beginner in data mining and want to become at least familiar with the main concepts and terminology, I'd like to hope that your first step would be to acquire a clear bird's-eye view of the whole domain by reading this article. And why don't you ask your grandpa as well for an unbiased opinion?

Bibliography


Computational Intelligence for Decision Support, Zhengxin Chen
Data Warehousing and Knowledge Discovery: 6th International Conference, Mukesh Mohania, Yahiko Kambayashi, Wolfram Wöss
Machine Learning and Data Mining: Methods and Applications, Ryszard S. Michalski
Demystifying Data Mining, Carol Brennan Baldan
Subgroup Discovery Techniques and Applications, Nada Lavrac
Introduction to Data Mining, José Hernández-Orallo
How to Buy Data Mining: A Framework for Avoiding Costly Project Pitfalls in Predictive Analytics, Eric A. King
Wikipedia
http://www.ac.wwu.edu/~stephan/Graunt/graunt.html

The Author


Radu Lovin is a Business Intelligence professional with expertise in financial systems. You can contact him at radu@dataminingarticles.com.

Data Mining Tutorial
Part 1 - Introduction to Data Mining

Part 2 - Mining Frequent Patterns (Maximal and Closed Frequent Itemsets)

Part 3 - Associations