DATA MINING

...is nothing else than torturing the data until it confesses…and if you torture it enough, you can get it to confess to anything (Fred Menger)

Easy Data Mining - Part 2
Mining Frequent Patterns

    Mining frequent patterns is probably one of the most important concepts in data mining. A lot of other data mining tasks and theories stem from this concept. It should be the beginning of any data mining technical training because, on one hand, it gives a very well shaped idea about what data mining is and, on the other, it is not extremely technical. In this article we'll talk only about frequent patterns and specifically, about frequent itemsets.


Patterns and Itemsets

    Let's start from Eve and Adam. What is a pattern? Pattern is an Irish term meaning a saint's feast day, but of course this will not help us much in data mining. The other definition that we might consider says that a pattern is a form, template, or model (or, more abstractly, a set of rules) which can be used to make or to generate things or parts of a thing. In data mining we may say that a pattern is a particular data behavior, arrangement or form that might be of a business interest, even though we're not sure about that yet. But it is a starting point.

What is an itemset? Well, this word doesn't exist in the dictionary. It was invented specially for data mining and it means, obviously,a set of items, a group of elements that represents together a single entity. It is actually a type of pattern and the only one that we discuss about here (we'll discuss in other articles about patterns like sequential, structured, etc ).

Now, as we know the minimal required information, let's formulate a data mining problem. A very simple one for now. And we definitely won't try to solve it now. Our purpose for now is just to become accustomed to the main important concepts in data mining. Let's suppose we're conducting a data mining project for an insurance company that sells life insurance policies. We are not interested in what types of policies they sell (there are around 15 main types of life insurance policies on the market, from simple term policies to complicated variable universal life insurance policies and thousands of insurance products based on these main types). We are only analyzing now the supplemental benefits that are sold along with the policies, or, as they are sometime called, the riders. The supplemental benefits that this company offers are the following:

S1      Waiver of premium for disability benefit
S2      Disability income benefit
S3      Dismemberment benefit
S4      Accidental death benefit
S5      Waiver of premium for payor benefit
S6      Terminal illness benefit
S7      Dread disease benefit
S8      Long term care
S9      Spouse and Children's Insurance Rider
S10    Children's Insurance Rider
S11    Second Insured Rider
S12    Guaranteed insurability benefit
S13    Paid-up additions option benefit

Data Mining

10 customers with the riders that they bought

    We are interested in finding combinations of riders that are often sold together. We'll only consider the first 4 riders for now, to make the things easier. You can see here a sample of 10 customers with the riders that they bought.

    Maybe it's time to do a small divagation. I always said that all the data mining techniques are applicable only for large amounts of data. The larger the dataset is, the better the analysis and results. It doesn't make any sense whatsoever to use a data mining technique for a dataset of 20 rows. It's like using the 1500 tonnes Kockums Crane to lift a penny. It's non-sense. However, it's good to use a smaller set to understand the algorithms and to effectively see data mining at work.

Coming back to our research exercise, we want to see what riders are sold together, so that we can come up with a combined rider or a new attractive product altogether. In other words, we want to find the frequent itemsets. How are we going to do that? The most straightforward and common-sense method is brute forcing the database. Take all the possible itemsets and check their frequency. So we start with the complete itemset of 4 riders. Has any of the customers bought all of them? Yes, there was one. Then take all the 3 riders subsets (4 of them) and check the number of times each of these subests occurred. Note down the counts. Go to the 2 riders subsets and get again the counts. And so on. For this example it sounds very easy. What if we have all 13 riders? A combinatorial calculation tells us that we'll end up having 2`13 - 1 possible combinations, I don't even dare to calculate this number. After going through all these combinations and getting their corresponding counts, we pick up the itemsets that occurred most of the times and we declare them frequent! Fabulous! But totally not efficient. Any other solutions? Hang on 10 more lines.

Frequent Itemsets

   As you may noticed, I didn't define the term of frequent itemsets. I didn't omitted, I did it on purpose because I didn't have enough information. But I'm going to do that now. A frequent itemset is an itemset that occurrs frequently. Don't smile, I know it sounds silly. But here it comes the big question "How frequent is enough frequent?" How do we know what "frequent" means? 10 occurrences? 20? 100? Well, actually this is a parameters that we, the miners, have to set. If only 1 customer bought S1, S2, S3, S4 this fact isn't worth any consideration. We call it not frequent.
The parameter that we have to decide upon is called support of an itemset. So we need to concentrate on the problem of finding all itemsets which have the support that we previously set up. We call such itemsets as frequent itemsets. Let's say we are only interested in the items that have a 40% minimum support - in our example that means these itemsets (or combinations of riders) should have been purchased by at least 4 customers out of 10. Only in this case an itemset becomes frequent. So the correct definition says that a frequent itemset is one that occurs in at least a user-specific percentage of the database.
Now, when we know what a frequent itemset is, let's list down 2 major properties that will help us later on in defining algorithms to find the frequent itemsets:

  • Every subset of a frequent itemset is also frequent. Also known as Apriori Property or Downward Closure Property, this rule essentially says that we don't need to find the count of an itemset, if all its subsets are not frequent. This is made possible because of the anti-monotone property of support measure - the support for an itemset never exceeds the support for its subsets. Stay tuned for this.
  • If we divide the entire database in several partitions, then an itemset can be frequent only if it is frequent in at least one partition. Bear in mind that the support of an itemset is actually a percentage and if this minimum percentage requirement is not met for at least one individual partitions, it will not be met for the whole database. This property enables us to apply divide and conquer type of algorithms. Again, stay tuned for this too.

Maximal Frequent Itemsets

   Well, setting up a support percentage for an itermset, solved only a part of the problem. Now at least we know what we want. We know how frequent an itemset should be to become worth considering it. But the toughest part is still unsolved. In order to find a frequent itemset we have to go through all the sub-itemsets which themselves are frequent due to the Downwart Closure Property mentioned above. So we unavoidably generate an exponential number of subpatterns that we might not really need. Let's go back to our example and say that we want to find all the frequent itemsets that have a support of 30%. We take advantage of our very small dataset and observe that (S1, S2, S3) is present in 3 contracts: C7, C9, C10 - that means the itemset is present in 30% of contracts and hence it is frequent. And the only one superset is (S1, S2, S3, S4) which is not frequent. But what about all the subsets? There are 6 subsets that are obviously present in at least 30% of the contracts so they are frequent too, and don't need them (well, we don't need them now, we'll see that all the subsets are very important to determine the associations). But we have to go through all of them to get to the maximum number of items that form a frequent itemset. This procedure is very time consuming because the search space is huge. The presence of a frequent itemset of length k implies the presence of 2`k -2 additional frequent itemsets. So, wouldn't we be better off if we could consider only the frequent itemset that has the maximum number of items bypassing all the sub-itemsets? This is how the Maximal Frequent Itemset was invented. The definition says that an itemset is maximal frequent if none of its immediate supersets is frequent. In our example (S1, S2, S3) is maximal frequent because the only one superset is not frequent.

Closed Frequent Itemsets

   The only one downside of a maximal frequent itemset is that, even though we know that all the sub-itemsets are frequent, we don't know the actual support of those sub-itemsets. And we'll see how important this is when we'll try to find the association rules within the itemsets. Keep that in mind for now and let's think of how to get all the frequent itemsets that have the same support with all their subsets. This is how the Closed Frequent Itemsets came into picture: an itemset is closed if none of its immediate supersets has the same support as the itemset. Finding these closed frequent itemsets can be of a great

Data Mining

Maximal and closed frequent itemsets

help to purge a lot of itemsets that are not needed and to find, as I said above, the right associations rules. That's why a lot of researchers banged their heads against the walls to find a solution for that. But it happened only in 1999 when 4 French researchers came up with a very acclaimed article (Discovering Frequent Closed Itemsets for Association Rules by Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal ) where they proposed to mine only closed frequent itemsets using an algorithm called A-CLOSE. In the following years a lot of other algorithms have been invented (CHARM, CLOSET, etc) that improved the performances of the initial algorithm. We'll talk about all these algorithms some other time.
For now, let see some examples of Closed and Maximal Frequent Itemsets. As you see in the graph, all individual items S1, S2, S3, S4 are frequent itemsets because their support in greater than 2. But only 3 of them are closed because S1 has the superset (S1, S3) having the same support. So it contradicts the definition. The itemsets (S1, S2, S3) and (S2, S3, S4) are frequent because they are present in at least 2 of the contracts, and they are maximal as well because their frequent superset - (S1, S2, S3, S4) is the only one superset and it's not frequent (so, obviously, it can't have the same support).

To conclude, we have to keep in mind the following important concepts:
A frequent itemset is one that occurs in at least a user-specific percentage of the database. That percentage is called support.
An itemset is closed if none of its immediate supersets has the same support as the itemset.
An itemset is maximal frequent if none of its immediate supersets is frequent.

The next important data mining concept is finding associations. We'll use what we already learned about frequent itemsets to define the associations and we'll find out why the closed frequent itemsets are so important.

Reference

Fast Algorithms for Mining Association Rules by Rakesh Agrawal, Ramakrishnan Srikant
CHARM: An Efficient Algorithm for Closed Itemset Mining by Mohammed J. Zaki and Ching-Jui Hsiao
Discovering Frequent Closed Itemsets for Association Rules Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal
Introduction to Data Mining by Tan, Steinbach, Kumar

Data Mining Articles
If Only I Knew - by Tim Graettinger


Easy Data Mining
       by Radu Lovin
Part 1 - Introduction to Data Mining

Part 2 - Mining Frequent Patterns (Maximal and Closed Frequent Itemsets)

Part 3 - Association Rules

Part 4 - Algorithms for Mining Frequent Itemsets

Data Mining Books

Home   Contributors   Resources

© 2007-2009, dataminingarticles.com