If you are a beginner in data mining
and want to become at least familiar with the main concepts and
terminologies, maybe the first step would be to acquire a clear
bird's eye view about the whole domain - definition, inception,
classification, influences, trends - but without diving too deep into
scholastic details. And, of course, you do that by searching the
Internet and skimming whatever books you have at hand. You might say
that one day would be more than enough to get an overall understanding
about this domain. But you don't have a clue about the trouble you're
getting into. These are some of the questions that I'm sure you'll
start asking yourself: How is data mining related to predictive
analysis? How is it related to knowledge discovery? What about data
mining vs machine learning and artificial intelligence? What is the
difference between predictive/descriptive and supervised/unsupervised?
Can we even compare these 2 classifications? And this is just
the beginning...
This article tries to shed some light on these questions and present all
these concepts in a very simple manner.
Data Mining Definition
Let's quickly start with a definition, of course. Actually, let's
choose two from the couple of hundred definitions available. The first
one, classic and well-known, says that data mining is the nontrivial
extraction of implicit, previously unknown, and potentially useful
information from data (W. Frawley). The second one, sharp and
rebellious, says that data mining is nothing else than torturing the
data until it confesses... and if you torture it enough, you can get it
to confess to anything (Fred Menger).
Figure: The annual Bill of Mortality for London and its environs, 1665
Data Mining Inception
Unfortunately, we don't have a clear date on which to celebrate data
mining's birthday every year. It emerged in the early 80s, when the
amount of data generated and stored in databases became overwhelming
and there was a strong need for tools and methods to extract useful,
task-oriented knowledge. However, some people say that the actual birth
of data mining happened in London in 1662, when John Graunt wrote
Natural and Political Observations Made upon the Bills of Mortality. It
was an impressive work for those times. He did a thorough analysis of
mortality in those years and tried to build a model to predict the next
bubonic plague in the city. Whether that was a data mining project or
not is still debatable. The data used was definitely not big and the
model was never actually built. Some conclusions were drawn though,
like correlations between plague years and new kings, and population
dynamics in London. I would rather say this was a statistical project
with large implications for demography. As a matter of fact, some
statisticians consider him the founder of the science of demography.
Influences
Here's where the pain starts. If you
don't already know it by now, I can tell you that data mining has
influences from a lot of other domains like statistics, machine
learning, pattern recognition, artificial intelligence, data
visualization and so on. That's the reason some people say that data
mining doesn't really have any relevance by itself and cannot be
considered a separate discipline. They say it is just another fancy
name invented to squeeze some more bucks from the business communities
by applying the same old techniques used in the above-mentioned
domains. But data mining sounds so cool, doesn't it? So when you
calculate the covariance of 2 variables, don't call it statistics
because you actually mine the data there! Is this really true? It would
make sense at first glance, but let's take a closer look.
Data Mining vs Statistics
Now, since we know what data mining
is, let's see what the science of statistics is and how it is related
to data mining. A very good definition of statistics says that
"Statistics is the science and practice of developing knowledge through
the use of empirical data expressed in quantitative form. It is based
on statistical theory which is a branch of applied mathematics. Within
statistical theory, randomness and uncertainty are modeled by
probability theory." Please note that everything is expressed in
quantitative form. The science of statistics is primarily oriented
towards the extraction of quantitative and statistical data
characteristics. Let's stick to the example above. When we determine
the covariance of 2 variables, we see whether these variables vary
together and, once the covariance is normalized into a correlation
coefficient, measure the strength of the relationship. But we'll never
be able to characterize this dependency at a conceptual level, and
produce a causal explanation and a qualitative description of this
relationship. You cannot "see" the reasons for this dependency because
they are related to factors that are not explicitly provided in the
data. Furthermore, the data mining process is interactive, iterative
and exploratory. Not to mention data pre-processing, which is essential
for any data mining project. Data reduction and compression, data
cleaning and transformation are very important for data mining but they
definitely don't belong to statistics.
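To make the covariance example concrete, here is a minimal Python
sketch. The data (advertising spend and units sold) and the function
names are made up for illustration; the point is only that the result
is a pair of numbers, with no explanation of why the variables move
together.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the range [-1, 1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data: advertising spend and units sold over five months.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [2.0, 4.0, 5.0, 4.0, 5.0]

cov = covariance(ad_spend, units_sold)
corr = correlation(ad_spend, units_sold)
print(f"covariance = {cov:.3f}, correlation = {corr:.3f}")
```

The positive covariance says the two variables tend to rise together,
and the correlation says how strongly. Neither number tells you whether
the advertising caused the sales.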
Verdict:
We can definitely say they are different.
Data Mining vs Machine Learning
Machine learning represents a sub-field of artificial intelligence. It
was conceived in the early 60s with the clear objective of designing
and developing algorithms and techniques that implement various types
of learning, that is, mechanisms capable of inducing knowledge from
examples of data. Machine learning has a wide spectrum of applications
including natural language processing, search engines, medical
diagnosis, bio-informatics, speech and handwriting recognition, object
recognition in computer vision, game playing and robot locomotion. The
general framework for machine learning is as follows: the learning
system aims at determining a description of a given concept from a set
of concept examples provided by the teacher and from the background
knowledge. Concept examples can be positive (iron, when teaching the
concept of metals) or negative (marble). Background knowledge contains
the information about the language used to describe the examples and
concepts - possible values of variables (domain), hierarchies,
predicates, rules, etc. The learning algorithm then builds on the type
of examples and on the size and relevance of the background knowledge
(there are some other factors involved but we won't discuss them
here). The main types of learning systems are supervised learning,
unsupervised learning, semi-supervised learning, reinforcement
learning, transduction and learning to learn. We'll see later on in
this article how the first 2 systems influenced data mining to a great
extent. So the machine learning process emphasizes the development of
the algorithms and usually assumes the data already resides in main
memory. On the other hand, the first condition for a data mining
project to succeed is to have data, large amounts of data. Think about
a chess computer game. It doesn't require a huge database to play
against you. It only needs the examples and the knowledge. Teach the
system what a counter-gambit is and it will know how to respond.
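The framework above can be sketched in a few lines of Python. This is
only a toy illustration, not the method of any particular system. The
attribute names (which stand in for the background knowledge) and the
extra examples copper and gold are assumptions added here; only iron
and marble come from the text. The learner keeps the most specific
description that all positive examples share, then checks it against
the negative example.

```python
def learn_concept(positive_examples):
    """Return the attribute values shared by every positive example."""
    description = dict(positive_examples[0])
    for example in positive_examples[1:]:
        for attribute in list(description):
            # Drop any attribute on which the positives disagree.
            if example.get(attribute) != description[attribute]:
                del description[attribute]
    return description


def matches(description, example):
    """True if the example satisfies every condition in the description."""
    return all(example.get(a) == v for a, v in description.items())


# Hypothetical descriptions of the examples supplied by the "teacher".
iron = {"conducts_electricity": True, "malleable": True, "colour": "grey"}
copper = {"conducts_electricity": True, "malleable": True, "colour": "reddish"}
marble = {"conducts_electricity": False, "malleable": False, "colour": "white"}
gold = {"conducts_electricity": True, "malleable": True, "colour": "yellow"}

metal = learn_concept([iron, copper])   # positive examples
print(metal)
print(matches(metal, marble))           # the negative example is rejected
print(matches(metal, gold))             # an unseen example is recognised
```

Notice that a handful of examples is enough here. Nothing in this
sketch needs a large database, which is the contrast with data mining
drawn above.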
Verdict:
They are different.
Data Mining vs KDD
What is KDD? It is not a syndrome (as I first thought when I heard
about it) and it is not the name of a DJ either. And don't you dare
associate it with the annual conference on Knowledge Discovery and Data
Mining organized by SIGKDD. It actually means Knowledge Discovery in
Databases, and the concept emerged in 1989 to refer to the broad
process of finding knowledge in data. It refers to the nontrivial
extraction of implicit, previously unknown, and potentially useful
information from data. Knowledge discovery differs from machine
learning in that the task is more general and is concerned with issues
specific to databases. Oops! Does it sound familiar to you? I bet it
does. Then what the bleep is data mining? Actually, these 2 terms were
used interchangeably for several years. No distinction was made until a
kind of consensus was reached within the community. We still have 2
terms, but with slightly different meanings. The term KDD is now viewed
as the overall process of discovering useful knowledge from data, while
data mining is viewed as the application of particular algorithms for
extracting patterns from data, without the additional steps of the KDD
process such as data cleaning, data reduction and concept hierarchy
generation, which can extend even to the infrastructure of the
project. To me, it sounds a bit fishy. I mean, how on earth would a
newcomer know this difference between KDD and data mining? The name
doesn't tell you anything.
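If it helps, the consensus view can be pictured as a small pipeline in
which data mining is just one function among several. The sketch below
is an assumption-laden toy: the shopping-basket data, the function
names and the choice of "frequent pairs" as the mining algorithm are
all invented for illustration.

```python
from collections import Counter
from itertools import combinations


def clean(raw_transactions):
    """KDD step: data cleaning. Normalise item names, drop empty records."""
    cleaned = []
    for transaction in raw_transactions:
        items = {item.strip().lower() for item in transaction
                 if item and item.strip()}
        if items:
            cleaned.append(items)
    return cleaned


def reduce_data(transactions, min_items=2):
    """KDD step: data reduction. Keep only records that can hold a pair."""
    return [t for t in transactions if len(t) >= min_items]


def mine_frequent_pairs(transactions, min_support):
    """The data mining step proper: item pairs that occur together often."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


def kdd(raw_transactions, min_support=2):
    """The overall KDD process: cleaning, reduction, then mining."""
    return mine_frequent_pairs(reduce_data(clean(raw_transactions)),
                               min_support)


# Hypothetical raw data, deliberately messy.
raw = [
    ["Bread", "butter ", "milk"],
    ["bread", "Butter"],
    ["", None],
    ["milk"],
    ["bread", "milk"],
]
patterns = kdd(raw)
print(patterns)
```

Under this reading, `kdd` is the whole process and
`mine_frequent_pairs` is the part that would be called data mining.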
Verdict:
Well, I would say they are quite the same.
Data Mining vs Predictive Analytics
A definition of predictive analytics
says that it is an area of statistical analysis that deals with
extracting information from data and using it to predict future trends
and behavior patterns. The core of predictive analytics relies on
capturing relationships between explanatory variables and the predicted
variables from past occurrences, and exploiting them to predict future
outcomes. So what's new? Didn't I see that in data mining as well? And,
as in all respected sciences, a central element of predictive
analytics was also invented - that is the predictor, a variable that
can be measured for an individual or other entity to predict future
behavior. So here's what happened. People got bored with the term data
mining, it was so worn-out and on everyone's lips, so they invented
another term - predictive analytics or predictive modeling. Or maybe,
as the president of The Modeling Agency, Eric King, once said, a shift
was required to rapidly move towards far more descriptive and accurate
nomenclature such as predictive modeling and predictive analytics,
rather than the old data mining. There are also voices that say that
data mining has 2 major branches, predictive and descriptive (see the
next section), so predictive analytics would be part of data mining. To
me, descriptive data mining is another overblown term. It's nothing
else than pure statistics. But that's another story and controversy
altogether.
Verdict:
The same.
Predictive/Descriptive Data Mining vs Supervised/Unsupervised Learning
Let's see now what the catch is with these 4 terms that are used so
much by the data mining (or whatever you want to call it) community.
And, as usual, let's start with definitions.
Supervised learning is a machine learning technique for creating a
function from training data. The training data consist of pairs of
input objects, and desired outputs. The output of the function can be a
continuous value (called regression), or can predict a class label of
the input object (called classification). The task of the supervised
learner is to predict the value of the function for any valid input
object after having seen a number of training examples (i.e. pairs of
input and target output). To achieve this, the learner has to
generalize from the presented data to unseen situations in a
"reasonable" way.
Unsupervised learning is a method of machine learning where a model is
fit to observations. It is distinguished from supervised learning by
the fact that there is no a priori output. In unsupervised learning, a
data set of input objects is gathered. Unsupervised learning then
typically treats input objects as a set of random variables. A joint
density model is then built for the data set.
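The two definitions above can be put side by side in a short Python
sketch. Everything here is an illustrative assumption: the tiny data
sets, the function names, and the choice of a least-squares line and a
nearest-neighbour rule for the supervised case and a one-variable
Gaussian for the unsupervised case (about the simplest density model
there is). Real projects would use richer models.

```python
from math import exp, pi, sqrt


def fit_line(pairs):
    """Supervised, regression: least-squares line from (input, output) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def nearest_neighbour(pairs):
    """Supervised, classification: label of the closest training input."""
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_gaussian(observations):
    """Unsupervised: fit a density model; there is no target output at all."""
    n = len(observations)
    mean = sum(observations) / n
    variance = sum((x - mean) ** 2 for x in observations) / n
    return mean, variance


def density(x, mean, variance):
    """Value of the fitted Gaussian density at x."""
    return exp(-((x - mean) ** 2) / (2 * variance)) / sqrt(2 * pi * variance)


# Hypothetical training pairs (input, desired output).
predict_value = fit_line([(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
predict_label = nearest_neighbour(
    [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]
)
print(predict_value(10.0), predict_label(7.0))

# Hypothetical observations with no labels attached.
mean, variance = fit_gaussian([168.0, 172.0, 170.0, 174.0, 166.0])
print(mean, variance, density(170.0, mean, variance))
```

The supervised functions are handed the answers during training and
are judged on unseen inputs. The unsupervised one only summarises how
the observations are distributed.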
Predictive data mining can be used to forecast explicit values, based
on patterns determined from known results. What did the definition say
about supervised learning? That it predicts the value of a function for
any valid input object after having seen a number of training examples.
Isn't this the same?
Descriptive data mining describes a data set in a concise way and
presents interesting characteristics of the data without having any
predefined target. What did the definition say about unsupervised
learning? That it has no a priori output? Again, isn't this the same?
As a matter of fact, these 2 categories of terms are largely used
interchangeably. But they can be compared only to some extent.
Predictive and descriptive data mining are concepts whereas supervised
and unsupervised learning are methods. It's true, supervised learning
is used in predictive data mining and unsupervised learning is used in
descriptive data mining, but these are not the only 2 methods used in
data mining (they are the major ones though). Some people say that
methods like sequential prediction or interpolation are also used in
predictive data mining without being supervised learning techniques.
And methods like correlation, associations and dependencies are used in
descriptive data mining without being unsupervised learning techniques.
However, the supervised/unsupervised learning techniques are the most
used ones in data mining. It's funny, but sad at the same time, when
you find and read articles about predictive learning and descriptive
learning or supervised/unsupervised data mining. It is an incredible
mixture of terms that perhaps stems from the same continuous and not
always understandable desire to invent new terms for this otherwise
straightforward discipline.
Verdict:
Different.
Data Mining vs Business Intelligence
Business Intelligence represents a
category of applications, technologies and concepts for gathering,
storing, analyzing, and providing access to data to help business users
make better decisions. So business intelligence is a much broader
category than data mining. Being able to create a report with
monthly sales activity by salesperson and product is called BI.
Dashboards and scorecards are also BI tools. Data extraction,
transformation, OLAP, slicing, dicing, filtering are all related to BI
processes. And data mining too is a BI tool. The most advanced one I
would say because it involves not only preparing and displaying the
data in various ways but also extracting implicit information that
otherwise wouldn’t have been displayed.
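For contrast, here is what the plain BI report mentioned above looks
like as a minimal Python sketch. The sales rows, names and dates are
invented for illustration. The report only sums and displays what is
already explicit in the data; it does not extract anything hidden.

```python
from collections import defaultdict

# Hypothetical sales records: (month, salesperson, product, amount).
sales = [
    ("2024-01", "Alice", "laptop", 1200.0),
    ("2024-01", "Alice", "laptop", 800.0),
    ("2024-01", "Bob", "phone", 500.0),
    ("2024-02", "Alice", "phone", 300.0),
]


def monthly_sales_report(rows):
    """Total sales amount per (month, salesperson, product)."""
    totals = defaultdict(float)
    for month, salesperson, product, amount in rows:
        totals[(month, salesperson, product)] += amount
    return dict(sorted(totals.items()))


report = monthly_sales_report(sales)
for (month, salesperson, product), total in report.items():
    print(month, salesperson, product, total)
```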
Verdict:
Different.
So if you are a
beginner in data mining and want to
become at least familiar with the main concepts and terminologies, I'd
like to hope that your first step would be to acquire a clear bird's eye
view about the whole domain by reading this article.
References
Computational Intelligence for Decision Support, Zhengxin Chen
Data Warehousing and Knowledge Discovery: 6th International Conference,
Mukesh Mohania, Yahiko Kambayashi, Wolfram Wöss
Machine Learning and Data Mining: Methods and Applications, Ryszard S.
Michalski
Demystifying Data Mining, Carol Brennan Baldan
Subgroup Discovery Techniques and Applications, Nada Lavrac
Introduction to Data Mining, José Hernández-Orallo
How to Buy Data Mining: A Framework for Avoiding Costly Project
Pitfalls in Predictive Analytics, Eric A. King
Wikipedia
http://www.ac.wwu.edu/~stephan/Graunt/graunt.html