Data Mining Tutorial - Part 1
Introduction to Data Mining
If you are a beginner in data mining and want to become at least familiar with the
main concepts and terminology, maybe the first step would be to acquire a clear bird's-eye view of the whole domain - definition, inception, classification,
influences, trends - but without diving into overly deep and scholastic details. And, of course, you do that by searching the Internet and skimming whatever books
you have at hand. You might say that one day would be more than enough to get an overall understanding of this domain. But you don't have a clue about
the trouble you're getting into. These are some of the questions that I'm sure you'll start asking yourself: How is data mining related to predictive analytics?
How is it related to knowledge discovery? What about data mining vs machine learning and artificial intelligence? What is the difference between predictive/descriptive
and supervised/unsupervised? Can these 2 classifications even be compared in the first place? And this is just the beginning... This article aims to shed some
light on these questions and present all these concepts in a very simple manner.
Data Mining Definition
Let's quickly start with a definition, of course.
Actually, let's choose two from the hundreds of available definitions. The first one, classic and well-known, says that data mining is the nontrivial extraction
of implicit, previously unknown, and potentially useful information from data (W. Frawley). The second one, sharp and rebellious, says that data mining is nothing else
than torturing the data until it confesses… and if you torture it enough, you can get it to confess to anything (Fred Menger).
The annual Bill of Mortality for London and its environs, 1665
Data Mining Inception
Unfortunately, we don't have a clear date on which to celebrate data mining's birthday every year. It emerged in the early 1980s, when the amount of data
generated and stored in databases became overwhelming and there was a strong need for tools and methods to extract useful, task-oriented knowledge.
However, some people say that the actual birth of data mining happened in London in 1662, when John Graunt wrote Natural and Political Observations Made upon the
Bills of Mortality. It was an impressive work for those times. He did a thorough analysis of mortality in those years and tried to build a model to predict
the next bubonic plague in the city. Whether that was a data mining project or not is still debatable even now. The data used was definitely
not big and the model was never actually built. Some conclusions were drawn though, such as correlations between plague years and new kings, and population
dynamics in London. I would rather say this was a statistical project with large implications for demography. As a matter of fact, some statisticians
consider him the founder of the science of demography.
Influences
Here's where the pain starts. If you don't already know it by now, I can tell you that data mining has influences from a lot of other domains
like statistics, machine learning, pattern recognition, artificial intelligence, data visualization and so on... That's the reason some people say that data mining
doesn't really have any relevance by itself and cannot be considered a separate discipline. They say it is just a fancy name invented to squeeze some
more bucks from the business communities by applying the same old techniques used in the above-mentioned domains. But data mining sounds so cool,
doesn't it? So when you calculate the covariance of 2 variables, don't call it statistics because you are actually mining the data there! Is this really true?
It might make sense at first glance, but let's take a closer look.
Data Mining vs Statistics
Now that we know what data mining is, let's see what the science of statistics is and how it is related to data mining. A very good definition
of statistics says that "Statistics is the science and practice of developing knowledge through the use of empirical data expressed in quantitative form.
It is based on statistical theory which is a branch of applied mathematics. Within statistical theory, randomness and uncertainty are modeled by probability theory."
Please note that everything is expressed in quantitative form. Statistics is primarily oriented towards the extraction of quantitative and statistical
characteristics of data. Let's stick to the example above. When we determine the covariance of 2 variables, we see whether these variables vary together and,
once the covariance is normalized into a correlation coefficient, measure the strength of the relationship. But we'll never be able to characterize this dependency
at a conceptual level, or produce a causal explanation and
a qualitative description of this relationship. You cannot "see" the reasons for this dependency because they are related to factors that are not explicitly provided
in the data. Furthermore, the data mining process is interactive, iterative and exploratory. Not to mention data pre-processing, which is essential for
any data mining project. Data reduction and compression, data cleaning and transformation are very important for data mining, but they are not
central to statistics. Verdict: We can definitely say they are different.
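To make the covariance example a bit more tangible, here is a minimal Python sketch. The two helper functions and the numbers are made up for illustration; they only show what a purely quantitative answer looks like. The result tells you that the two variables move together and how strongly, but nothing about why.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance: do xs and ys tend to move together?"""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the range [-1, 1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Invented figures, for illustration only.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 8.0, 9.8]

print("covariance: ", covariance(ad_spend, sales))
print("correlation:", correlation(ad_spend, sales))
```

A positive covariance says the variables rise together; the correlation puts that on a fixed scale so its strength can be judged. Neither number explains the cause of the relationship.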
Data Mining vs Machine Learning
Machine learning is a sub-field of artificial intelligence. It was conceived in the early 1960s with the clear objective to design and
develop algorithms and techniques that implement various types of learning, that is, mechanisms capable of inducing knowledge from examples of data. Machine learning
has a wide spectrum of applications including natural language processing, search engines, medical diagnosis, bio-informatics, speech and handwriting recognition,
object recognition in computer vision, game playing and robot locomotion. The general framework for machine learning is as follows: The learning system aims
at determining a description of a given concept from a set of concept examples provided by the teacher and from the background knowledge. Concept examples
can be positive (iron, when teaching the concept of metals) or negative (marble). Background knowledge contains the information about the language used to
describe the examples and concepts - possible values of variables (domain), hierarchies, predicates, rules, etc. The learning algorithm then builds on the type
of examples and on the size and relevance of the background knowledge (there are some other factors involved, but we won't discuss them here). The main types
of learning systems are supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, transduction and learning to learn. We'll see
later on in this article how the first 2 systems influenced data mining to a great extent. So the machine learning process emphasizes the development of the algorithms
and usually assumes the data already resides in main memory. On the other hand, the first condition for a data mining project to succeed is to have data,
large amounts of data. Think about a computer chess game. It doesn't require a huge database to play against you. It only needs the examples and the knowledge.
Teach the system what a counter-gambit is and it will know how to respond. Verdict: They are different.
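The teacher/examples/background-knowledge framework can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the attribute names and values are invented and heavily simplified, and the learner is the textbook Find-S procedure, chosen here because it is short, not because the framework above prescribes it.

```python
ANY = "?"  # wildcard: any value of this attribute is acceptable

# Background knowledge (hypothetical): the attributes used to describe examples.
ATTRIBUTES = ("conducts_electricity", "colour", "malleable")


def find_s(examples):
    """Return the most specific description that covers every positive example."""
    hypothesis = None
    for attributes, is_positive in examples:
        if not is_positive:
            continue  # Find-S only generalizes from positive examples
        if hypothesis is None:
            hypothesis = list(attributes)
        else:
            # Keep values that agree, replace disagreeing ones with a wildcard.
            hypothesis = [h if h == a else ANY for h, a in zip(hypothesis, attributes)]
    return tuple(hypothesis) if hypothesis is not None else None


def matches(hypothesis, attributes):
    """Check whether an object fits the learned concept description."""
    return all(h == ANY or h == a for h, a in zip(hypothesis, attributes))


# Concept examples provided by the "teacher" (simplified, made-up descriptions).
examples = [
    (("yes", "grey", "yes"), True),     # iron: positive example of "metal"
    (("yes", "reddish", "yes"), True),  # copper: positive example
    (("no", "white", "no"), False),     # marble: negative example
]

metal = find_s(examples)
print(dict(zip(ATTRIBUTES, metal)))
print(matches(metal, ("yes", "yellow", "yes")))  # an unseen, gold-like object
```

Notice that three examples held in memory are all the learner needs, which is exactly the contrast with data mining drawn above.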
Data Mining vs KDD
What is KDD? It is not a syndrome (as I first thought when I heard about it) and it is not the name of a DJ either. And don't you dare
associate it with the annual conference on Knowledge Discovery and Data Mining organized by SIGKDD. It actually means Knowledge Discovery in Databases
and the concept emerged in 1989 to refer to the broad process of finding knowledge in data. It refers to the nontrivial extraction of implicit,
previously unknown, and potentially useful information from data. Knowledge discovery differs from machine learning in that the task is more general and is
concerned with issues specific to databases. Oops! Does it sound familiar to you? I bet it does. Then what the bleep is data mining? Actually, these 2 terms were
used interchangeably for several years. No distinction was made, until a kind of consensus was reached within the community. We still have 2 terms, but
with slightly different meanings. The term KDD is now viewed as the overall process of discovering useful knowledge from data, while data mining is viewed
as the application of particular algorithms for extracting patterns from data, without the additional steps of the KDD process such as data cleaning, data reduction
and concept hierarchy generation, which can extend even to the infrastructure of the project. To me, it sounds a bit fishy. I mean, how on earth would a newcomer
know this difference between KDD and data mining? The name doesn't tell you anything. Verdict: Well, I would say they are quite the same.
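If the distinction still feels fishy, a toy sketch may help. Everything below is hypothetical (the function names, the shopping baskets, the support threshold); it is not a standard API, just one way to picture the consensus view: the whole pipeline is KDD, while only the pattern-extraction function in the middle is the data mining step.

```python
from collections import Counter
from itertools import combinations

# Invented raw data with the usual dirt: missing values and an empty record.
raw_baskets = [
    ["bread", "milk", None],
    ["bread", "milk", "eggs"],
    ["milk", "eggs", ""],
    ["bread", "milk"],
    [],
]


def clean(baskets):
    """KDD step: drop missing values and empty records."""
    cleaned = [[item for item in basket if item] for basket in baskets]
    return [basket for basket in cleaned if basket]


def reduce_data(baskets, keep):
    """KDD step: keep only the items relevant to the question at hand."""
    reduced = [[item for item in basket if item in keep] for basket in baskets]
    return [basket for basket in reduced if basket]


def mine_frequent_pairs(baskets, min_support):
    """The data mining step proper: find item pairs that co-occur often."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def kdd(baskets):
    """The overall KDD process: prepare the data, then mine it."""
    prepared = reduce_data(clean(baskets), keep={"bread", "milk", "eggs"})
    return mine_frequent_pairs(prepared, min_support=2)


print(kdd(raw_baskets))
```

Calling mine_frequent_pairs directly on the raw baskets would stumble over the missing values, which is a fair illustration of why the surrounding KDD steps exist at all.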
Data Mining vs Predictive Analytics
A definition of predictive analytics says that it is an area of statistical analysis that deals with extracting information from data and
using it to predict future trends and behavior patterns. The core of predictive analytics relies on capturing relationships between explanatory variables and
the predicted variables from past occurrences, and exploiting them to predict future outcomes. So what's the news? Didn't I see that in data mining as well?
And as in all respectable sciences, a central element of predictive analytics was also invented - that is the predictor, a variable that can be measured for
an individual or other entity to predict future behavior. So here's what happened. People got bored with the term data mining, it was so worn-out and on
everyone's lips, so they invented another term - predictive analytics or predictive modeling. Or maybe, as the president of The Modeling Agency, Eric King,
once said, a shift was required to rapidly move towards far more descriptive and accurate nomenclature such as predictive modeling and predictive analytics,
rather than the old data mining. Verdict: Definitely the same. I would be very happy (and surprised) if anyone who reads this article can bring arguments
against this opinion.
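For the record, here is what "capturing a relationship from past occurrences and exploiting it" can look like in its simplest form. This is a sketch under my own assumptions: a single predictor, a straight-line relationship fitted by ordinary least squares, and invented numbers. Real projects use many predictors and richer models.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept to past observations by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted relationship to a new value of the predictor."""
    slope, intercept = model
    return slope * x + intercept


# Past occurrences (made up): the predictor is the number of site visits per
# month, the predicted variable is the number of purchases in that month.
visits = [2, 4, 6, 8, 10]
purchases = [1, 2, 2, 4, 5]

model = fit_line(visits, purchases)
print(predict(model, 12))  # forecast for a customer with 12 visits
```

Whether you file this under predictive analytics, predictive modeling or data mining, the code is the same.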
Predictive/Descriptive Data Mining vs Supervised/Unsupervised Learning
Let's see now what the catch is with these 4 terms that are used so much by the data mining (or whatever you want to call it) community. And,
as usual, let's start with definitions. Supervised learning is a machine learning technique for creating a function from training data. The training data
consist of pairs of input objects, and desired outputs. The output of the function can be a continuous value (called regression), or can predict a class label
of the input object (called classification). The task of the supervised learner is to predict the value of the function for any valid input object after having
seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen
situations in a "reasonable" way. Unsupervised learning is a method of machine learning where a model is fit to observations. It is distinguished from
supervised learning by the fact that there is no a priori output. In unsupervised learning, a data set of input objects is gathered. Unsupervised learning then
typically treats input objects as a set of random variables. A joint density model is then built for the data set. Predictive data mining can be used to forecast
explicit values, based on patterns determined from known results. What did the definition say about supervised learning? That it predicts the value of a function
for any valid input object after having seen a number of training examples. Isn't this the same? Descriptive data mining describes a data set in a concise way
and presents interesting characteristics of the data without having any predefined target. What did the definition say about unsupervised learning? That it has
no a priori output? Again, isn't this the same? As a matter of fact, these 2 categories of terms are largely used interchangeably. But they can be compared
only to some extent. Predictive and descriptive data mining are concepts whereas supervised and unsupervised learning are methods. It's true, the supervised learning
method is used in predictive data mining and unsupervised learning is used in descriptive data mining, but these are not the only 2 methods used in data mining
(they are the major ones though). Some people say that methods like sequential prediction or interpolation are also used in predictive data mining without
being supervised learning techniques. And methods like correlation, associations and dependencies are used in descriptive data mining without being
unsupervised learning techniques. However, the supervised/unsupervised learning techniques are the most used ones in data mining. It's funny, but sad at the
same time, when you find and read articles about predictive learning and descriptive learning or supervised/unsupervised data mining. It is an incredible
mixture of terms that perhaps stems from the same continuous and not always understandable desire to invent new terms for this otherwise straightforward discipline.
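A tiny side-by-side sketch may make the supervised/unsupervised contrast easier to see. The two techniques are my own picks for illustration (a 1-nearest-neighbour classifier and a two-cluster k-means on one-dimensional data), and the numbers are invented. The first function needs pairs of input and desired output; the second gets inputs only and has no predefined target.

```python
def nearest_neighbour(training_pairs, x):
    """Supervised: predict the label of the closest labelled training example."""
    _, label = min(training_pairs, key=lambda pair: abs(pair[0] - x))
    return label


def two_means(values, iterations=10):
    """Unsupervised: split unlabelled 1-D values into two clusters (k-means, k=2)."""
    centres = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return clusters


# Supervised / predictive: monthly spend paired with a known customer segment.
labelled = [(5, "low"), (8, "low"), (50, "high"), (60, "high")]
print(nearest_neighbour(labelled, 55))

# Unsupervised / descriptive: the same kind of data, but nobody supplied labels.
unlabelled = [5, 8, 50, 60, 7, 52]
print(two_means(unlabelled))
```

The first call forecasts an explicit value from known results; the second just describes the structure it finds in the data.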
Data Mining vs Business Intelligence
Business Intelligence represents a category of applications, technologies and concepts for gathering, storing, analyzing, and providing access to data to help business users make better decisions. So business intelligence is a much broader category than data mining. Being able to create a report with monthly sales activity by salesperson and product is called BI. Dashboards and scorecards are also BI tools. Data extraction, transformation, OLAP, slicing, dicing and filtering are all related to BI processes. And data mining too is a BI tool. It is the most advanced one, I would say, because it involves not only preparing and displaying the data in various ways but also extracting implicit information that otherwise wouldn’t have been displayed.
Verdict: Different.
Classification
This is again a sensitive subject in data mining and, honestly, I'm kind of afraid to tackle it in great detail now. There
are thousands of classifications of data mining with classes, sub-classes and sub-sub-classes. And sometimes a particular class may have dozens of names.
So I'd rather be cautious and come up with a very succinct classification. However, despite its brevity, this classification has all the major data mining
techniques that we'll talk about in some future articles.
Association Rules
Classification
Prediction
Clustering
Allow me not to say anything about them for now since this is just an introduction to data mining and I want to keep it as simple as possible. It's always
beneficial to put things in a very simple and clear way. If you are able to explain an IT concept in such a way that your grandpa will be able to understand it,
then you're really good.
So if you are a beginner in data mining and want to become at least familiar with the main concepts and terminology, I'd like to hope that your first step
would be to acquire a clear bird's-eye view of the whole domain by reading this article. And why don't you ask your grandpa as well for an unbiased opinion?
Bibliography
Computational Intelligence for Decision Support, Zhengxin Chen
Data Warehousing and Knowledge Discovery: 6th International Conference, Mukesh Mohania, Yahiko Kambayashi, Wolfram Wöss
Machine Learning and Data Mining: Methods and Applications, Ryszard S. Michalski
Demystifying Data Mining, Carol Brennan Baldan
Subgroup Discovery Techniques and Applications, Nada Lavrac
Introduction to Data Mining, José Hernández-Orallo
How to Buy Data Mining: A Framework for Avoiding Costly Project Pitfalls in Predictive Analytics, Eric A. King
Wikipedia
http://www.ac.wwu.edu/~stephan/Graunt/graunt.html
The Author
Radu Lovin is a Business Intelligence professional with expertise in financial systems. You can contact him at radu@dataminingarticles.com.