Data Mining Tutorial - Part 2
Mining Frequent Patterns
Mining frequent patterns is probably one of the most important concepts in data mining. A lot of other data mining tasks and theories stem from
this concept. It should be the beginning of any data mining technical training because, on one hand, it gives a very well shaped idea about what data mining is and,
on the other, it is not extremely technical. In this article we'll talk only about frequent patterns and specifically, about frequent itemsets.
Patterns and Itemsets Let's start from Eve and Adam. What is a pattern? Pattern is an Irish term meaning a saint's feast day,
but of course this will not help us much in data mining. The other definition that we might consider says that a pattern is a form, template, or model
(or, more abstractly, a set of rules) which can be used to make or to generate things or parts of a thing. In data mining we may say that a pattern is a particular
data behavior, arrangement or form that might be of a business interest, even though we're not sure about that yet. But it is a starting point.
What is an itemset?
Well, this word doesn't exist in the dictionary. It was invented specially for data mining and it means, obviously,a set of items, a group of elements that represents
together a single entity. It is actually a type of pattern and the only one that we discuss about here (we'll discuss in other articles about patterns like sequential,
structured, etc ).
Now, as we know the minimal required information, let's formulate a data mining problem. A very simple one for now. And we definitely
won't try to solve it now. Our purpose for now is just to become accustomed to the main important concepts in data mining. Let's suppose we're conducting a
data mining project for an insurance company that sells life insurance policies. We are not interested in what types of policies they sell (there are around 15
main types of life insurance policies on the market, from simple term policies to complicated variable universal life insurance policies and thousands of
insurance products based on these main types). We are only analyzing now the supplemental benefits that are sold along with the policies, or, as they are sometime
called, the riders. The supplemental benefits that this company offers are the following:
S1 Waiver of premium for disability benefit S2 Disability income benefit
S3 Dismemberment benefit S4 Accidental death benefit S5
Waiver of premium for payor benefit S6 Terminal illness benefit S7 Dread disease benefit
S8 Long term care S9 Spouse and Children's Insurance Rider S10
Children's Insurance Rider S11 Second Insured Rider S12 Guaranteed insurability benefit S13
Paid-up additions option benefit
10 customers with the riders that they bought
We are interested in finding combinations of riders that are often sold together. We'll only consider the first 4 riders for now,
to make the things easier. You can see here a sample of 10 customers with the riders that they bought.
Maybe it's time to do a small divagation. I always said that all the data mining techniques are applicable only for large amounts of data.
The larger the dataset is, the better the analysis and results. It doesn't make any sense whatsoever to use a data mining technique for a dataset of 20 rows.
It's like using the 1500 tonnes Kockums Crane to lift a penny. It's non-sense. However, it's good to use a smaller set to understand the algorithms and to effectively
see data mining at work.
Coming back to our research exercise, we want to see what riders are sold together, so that we can come up with a combined rider or
a new attractive product altogether. In other words, we want to find the frequent itemsets. How are we going to do that? The most straightforward and common-sense
method is brute forcing the database. Take all the possible itemsets and check their frequency. So we start with the complete itemset of 4 riders. Has any of the
customers bought all of them? Yes, there was one. Then take all the 3 riders subsets (4 of them) and check the number of times each of these subests occurred.
Note down the counts. Go to the 2 riders subsets and get again the counts. And so on. For this example it sounds very easy. What if we have all 13 riders?
A combinatorial calculation tells us that we'll end up having 2`13 - 1 possible combinations, I don't even dare to calculate this number. After going through
all these combinations and getting their corresponding counts, we pick up the itemsets that occurred most of the times and we declare them frequent! Fabulous!
But totally not efficient. Any other solutions? Hang on 10 more lines.
Frequent Itemsets
As you may noticed, I didn't define the term of frequent itemsets. I didn't omitted, I did it on purpose because I didn't have enough information.
But I'm going to do that now. A frequent itemset is an itemset that occurrs frequently. Don't smile, I know it sounds silly. But here it comes the big question
"How frequent is enough frequent?" How do we know what "frequent" means? 10 occurrences? 20? 100? Well, actually this is a parameters that we, the miners, have to set.
If only 1 customer bought S1, S2, S3, S4 this fact isn't worth any consideration. We call it not frequent. The parameter that we have to decide upon is called
support of an itemset. So we need to concentrate on the problem of finding all itemsets which have the support that we previously set up. We call such itemsets as
frequent itemsets. Let's say we are only interested in the items that have a 40% minimum support - in our example that means these itemsets (or combinations of riders)
should have been purchased by at least 4 customers out of 10. Only in this case an itemset becomes frequent. So the correct definition says that a frequent itemset
is one that occurs in at least a user-specific percentage of the database. Now, when we know what a frequent itemset is, let's list down 2 major properties that
will help us later on in defining algorithms to find the frequent itemsets:
- Every subset of a frequent itemset is also frequent. Also known as Apriori Property or Downward Closure Property, this rule essentially says that we don't need to
find the count of an itemset, if all its subsets are not frequent. Stay tuned for this.
- If we divide the entire database in several partitions, then an itemset can be frequent only if it is frequent in at least one partition. Bear in mind that the
support of an itemset is actually a percentage and if this minimum percentage requirement is not met for at least one individual partitions, it will not be met
for the whole database. This property enables us to apply divide and conquer type of algorithms. Again, stay tuned for this too.
Maximal Frequent Itemsets
Well, setting up a support percentage for an itermset, solved only a part of the problem. Now at least we know what we want. We know how frequent
an itemset should be to become worth considering it. But the toughest part is still unsolved. In order to find a frequent itemset we have to go through all the
sub-itemsets which themselves are frequent due to the Downwart Closure Property mentioned above. So we unavoidably generate an exponential number of subpatterns that we
might not really need. Let's go back to our example and say that we want to find all the frequent itemsets that have a support of 30%. We take advantage of our very
small dataset and observe that (S1, S2, S3) is present in 3 contracts: C7, C9, C10 - that means the itemset is present in 30% of contracts and hence it is frequent.
And the only one superset is (S1, S2, S3, S4) which is not frequent. But what about all the subsets? There are 6 subsets that are obviously present in at least 30%
of the contracts so they are frequent too, and don't need them (well, we don't need them now, we'll see that all the subsets are very important to determine the
associations). But we have to go through all of them to get to the maximum number of items that forms a frequent itemset. This procedure is very time consuming because
the search space is huge. The presence of a frequent itemset of length k implies the presence of 2k -2 additional frequent itemsets. So, wouldn't we be better off
if we could consider only the frequent itemset that has the maximum number of items bypassing all the sub-itemsets? This is how the Maximal Frequent Itemset was invented.
The definition says that an itemset is maximal frequent if none of its immediate supersets is frequent. In our example (S1, S2, S3) is maximal frequent because the only
one superset is not frequent.
Closed Frequent Itemsets
The only one downside of a maximal frequent itemset is that, even though we know that all the sub-itemsets are frequent, we don't know the actual
support of those sub-itemsets. And we'll see how important this is when we'll try to find the association rules within the itemsets. Keep that in mind for now and
let's think of how to get all the frequent itemsets that have the same support with all their subsets. This is how the Closed Frequent Itemsets came into picture:
an itemset is closed if none of its immediate supersets has the same support as the itemset. Finding these closed frequent itemsets can be of a great
Maximal and closed frequent itemsets
help to purge a lot of itemsets that are not needed and to find, as I said above, the right associations rules. That's why a lot of researchers banged their heads
against the walls to find a solution for that. But it happened only in 1999 when 4 French researchers came up with a very acclaimed article (Discovering
Frequent Closed Itemsets for Association Rules by Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal ) where they proposed to mine only closed frequent
itemsets using an algorithm called A-CLOSE. In the following years a lot of other algorithms have been invented (CHARM, CLOSET, etc) that improved the performances
of the initial algorithm. We'll talk about all these algorithms some other time. For now, let see some examples of Closed and Maximal Frequent Itemsets. As you
see in the graph, all individual items S1, S2, S3, S4 are frequent itemsets because their support in greater than 2. But only 3 of them are closed because S1 has the
superset (S1, S3) having the same support. So it contradicts the definition. The itemsets (S1, S2, S3) and (S2, S3, S4) are frequent because they are present in at
least 2 of the contracts, and they are maximal as well because there frequent superset - (S1, S2, S3, S4) is the only one superset and it's not frequent (so,
obviously, it can't have the same support).
To conclude, we have to keep in mind the following important concepts: A frequent itemset is one that occurs in at least a user-specific percentage of the
database. That percentage is called support. An itemset is closed if none of its immediate supersets has the same support as the itemset. An itemset is
maximal frequent if none of its immediate supersets is frequent.
The next important data mining concept is finding associations. We'll use what we already learned about frequent itemsets to define the associations and we'll
find out why the closed frequent itemsets are so important.
Bibliography
Fast Algorithms for Mining Association Rules by Rakesh Agrawal, Ramakrishnan Srikant
CHARM: An Efficient Algorithm for Closed Itemset Mining by Mohammed J. Zaki and Ching-Jui Hsiao
Discovering Frequent Closed Itemsets for Association Rules Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal
Introduction to Data Mining by Tan, Steinbach, Kumar
The Author
Radu Lovin is a Business Intelligence professional with expertise in financial systems. You can contact him at radu@dataminingarticles.com.
|