DATA MINING

...is nothing else than torturing the data until it confesses…and if you torture it enough, you can get it to confess to anything (Fred Menger)

Easy Data Mining - Part 4
Algorithms for Mining Frequent, Closed and Maximal Itemsets

    There are douzines of algorithms used to mine frequent itemsets. Some of them, very well known, started a whole new era in data mining. They made the concept of mining frequent itemsets and association rules possible. Others are variations that bring improvements mainly in terms of processing time. We'll go through some of the most important algorithms first briefly in this article, and then in more detail in the subsequent articles. The algorithms vary mainly in how the candidate itemsets are generated and how the supports for the candidate itemsets are counted.


Algorithms for Mining Frequent Itemsets

   AIS
Authors: Rakesh Agrawal, Tomasz Imielinski, Arun Swami
When: 1993
Paper: Mining association rules between sets of items in large databases
Where: Preceedings of the ACM SIGMOD Conference on Management of Data, Washington, D.C.
There was a real buzz in early 90s about how to emulate the biological immune system in the real world scenarios. The capacity of the immune system to proliferate cells that produce antibodies whenever it detects a high degree of matching with an antigen is, without doubts, fascinating. A series of algortihms were invented and new systems called artificial immune systems were designed. AIS algorithm uses candidate generation to detect the frequent itemsets. The candidates are generated on the fly and are compared with previously found frequent itemsets. The disadvantage of the algorithm is that it generates and counts too many candidate itemsets that turn out to be small. AIS was the first algorithm that introduced the problem of generating association rules.

   SETM
Authors: Maurice Houtsma, Arun Swami
When: 1995
Paper: Set-oriented mining for association rules in relational databases
Where: The 11th International Conference on Data Engineering
This algorithm was actually created by Houtsma and Swami in October 1993 and included in a research report while they were working in the IBM Almaden Research Center but for some reason it was officially released only in 1995. The algorithm also generates candidates on the fly based on the transaction read from the database, just like AIS algorithm. But SETM was more created for SQL computing and uses relational operations. To use standard SQL join operation for candidate generation, SETM separates candidate generation from counting. It first generates the candidates using equi-joins and then it sorts them all and removes the ones that don't meet the minimum support. Like AIS, SETM generates many candidate itemsets that in the end turn out not be frequent.

   APRIORI
Authors: Rakesh Agrawal and Ramakrishnan Srikant
When: 1994
Paper: Fast algorithms for mining association rules
Where: Proceedings of 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago de Chile, Chile
It is by far the most important data mining algorithms for mining frequent itemsets and associations. It opened new doors and created new modalities to mine the data. Since its inception, many scholars have improved and optimized the Apriori algorithm and have presented new Apriori-like algorithms. The authors became living legends in the data mining communities. They both received masters and PhDs from University of Wisconsin, Madison and both worked for IBM. The IBM's Intelligent Miner was created mainly by them. Once coleagues, they now work for competing companies - Agrawal for Microsoft and Srikant for Google. Apriori uses a breadth-first search strategy to count the support of itemsets and uses a candidate generation function which exploits the downward closure property of support.

   DHP - Direct Hashing and Pruning
Authors: J. S. Park, M. Chen, P.S. Yu.
Paper: An effective hash based algorithm for mining association rules.
Where: ACM SIGMOD International Conference on Management of Data
When: May 1995
As its name suggests, DHP uses a hash technique that makes it very efficient for the generation of candidate itemsets, in particular for the large two-itemsets, thus greatly improving the performance bottleneck of the whole process. In addition, DHP employs effective pruning techniques to progressively reduce the transaction database size. It is however a variation of the Apriori algorithm like so many other.

   Partition
Authors: Ashok Savasere, Edward Omiecinski, Shamkant Navathe
Paper: An efficient algorithm for mining association rules in large databases
Where: The 21st VLDB Conference
When: 1995
The algorithm is fundamentally different from all the previous algorithms in that it reads the database at most two times to generate all significant association rules. In the first scan of the database, it generates a set of all potentially large itemsets by scanning the database once and dividing it in a number of non-overlaping partitions. This set is a superset of all frequent itemsets so it may contain itemsets that are not frequent. During the second scan, counters for each of these itemsets are set up and their actual support is measured.

   ECLAT - Echivalence Class Clustering and Bottom-up Lattice Traversal
Authors: Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li
When: 1997
Paper: New Algorithms for Fast Discovery of Assoctiation Rules
Where: The 3rd International Conference on Knowledge Discovery and Data Mining
It is the first algorithm that uses a vertical data (inverted) layout. ECLAT is very efficient for large itemsets but less efficient for small ones. The frequent itemsets are determined using simple tid-list intersections in a depth-first graph. Few other algorithm that use the same techique are presented in the same paper (MAXECLAT, CLIQUE, MAXCLIQUE, TOP-DOWN) but ECLAT remained the best known.

   FP-GROWTH
Authors: Jiawei Han, Jian Pei, Yiwen Yin
When: 2000
Paper: Mining Frequent Patterns without Candidate Generation
Where: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA
The Apriori algorithm has its own disadvantages that made other scholars think of new aproaches to mine frequent patterns. The 2 main downsides are the possible need of generating a huge number of candidates if the number of frequent 1-itemsets is high or if the size of the frequent pattern is big the database has to be scanned repeatedly to match the candidates and determine the support What if we find a way to mine the frequent patterns without candidate generation? This would be a big improvement over Apriori. This is what the frequent-pattern growth (FP-growth) algorithm does. It adopts a divide-and-conquer strategy and a frequent-pattern tree.

   TREE-PROJECTION
Authors: Ramesh C. Agarwal, Charu C. Aggarwal, V.V.V. Prasad
When: 2000
Paper: A tree projection algorithm for generation of frequent itemsets
Where: Journal of Parallel and Distributed Computing
This inovation brought by this algorithm is the use of a lexicograph tree which requires substantially less memory than a hash tree. The support of the frequent itemsets is counted by projecting the transactions onto the nodes of this tree. This improves the performance of counting the number of transactions that have frequent itemsets. The lexicograph tree is traversed in a top-down fashion.

   PASCAL
Authors: Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, L. Lakhal
When: 2000
Paper: Mining Frequent Patterns with Counting Inference
Where: ACM SIGKDD Explorations Newsletter
The algorithm is named after the French mathematician Blaise Pascal who invented an early computing device and it is basically an optimization of the Apriori algorithm. The authors introduce the notion of key patterns and show that other frequent patterns can be infered from the key patterns without access to the database. The algorithm finds both frequent and closed sets and it is twice as fast as Close and 10 times as fast as Apriori but is only practical when the pattern length is short.

   H-MINE
Authors: Jian Pei, Jiawei Han Lu, Shojiro Nishio, Shiwei Tang, Dongqing Yang
When: 2001
Paper: H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases
Where: IEEE International Conference on Data Mining
The algorithm introduces the concept of hyperlinked data structure (H-struct) and uses it to dynamically adjust links in the mining process. The authors actually proposed 2 algorithms. The first one, H-mine(Mem), is a memory based, efficient pattern-growth algorithm. This algorithm can be scaled up to very large databases by database partitioning and this is what actual H-mine algorithm deals with.

   RELIM (Recursive Elimination)
Author: Christian Borgelt
When: 2005
Paper: Keeping Things Simple: Finding Frequent Item Sets by Recursive Elimination
Where: Workshop Open Source Data Mining Software (OSDM'05, Chicago, IL)
As the author mentioned in the paper's abstract, the algorithm is strongly inspired by FP-growth (it's much simpler though) and very similar to H-mine. It doesn't use prefix trees and any other complicated structures. The work is done in one simple recursive function, which can be written with relatively few lines of code.

Algorithms for Mining Maximal Itemsets

   MAX-MINER
Authors: R.J. Bayardo
When: 1998
Paper: Efficiently Mining Long Patterns from Databases
Where: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, Washington, United States The algorithm extracts only the maximal frequent itemsets (you know by now that an itemset is maximal frequent if it has no superset that is frequent). By extracting the maximal frequent itemsets and because any frequent itemset is a subset of a maximal frequent itemset, Max-Miner generates all the frequent itemsets. The algorithm combines a levelwise bottom-up traversal with a top-down traversal in order ot quickly find the maximal frequent patterns. Then, all frequent patterns are derived from these ones and one last database scan is carried on to count their support.

   DepthProject
Authors: Ramesh Agrawal, Charu Aggarwal, and V.V.V. Prasad
When: August 2000
Paper: Depth first generation of long patterns
Where: The 7th International Conference on Knowledge Discovery and Data Mining
The algorithm finds frequent itemsets by using depth first search on a lexicographic tree of itemsets. The authors claimed and proved that this algorithm is faster than MaxMiner.

   MAFIA
Authors: Doug Burdick, Manuel Calimlim, Johannes Gehrke
When: 2001
Paper: MAFIA: A maximal frequent itemset algorithm for transactional databases
Where: Proceedings of the 17th International Conference on Data Engineering, Heidelberg, Germany
MAFIA is an algorithm for mining maximal frequent itemsets from a transactional database (it has however the option to mine the closed sets as well). It is especially efficient when the itemsets in the database are very long. The search strategy integrates a depth-first traversal of the itemset lattice.

   GenMax
Authors: Karam Gouda and Mohammed J. Zaki
When: November 2001
Paper: Efficiently mining maximal frequent itemsets
Where: The 1st IEEE International Conference on Data Mining
GenMax is a backtrack search based algorithm for mining maximal frequent itemsets. It uses progressive focusing to perform maximality checking, and diffset propagation to perform fast frequency computation.

Algorithms for Mining Closed Frequent Itemsets

   CLOSE
Authors: Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal
When: 1999
Paper: Efficient mining of association rules using closed itemset lattices
Where: Information System, Vol 24
Close is another Apriori-like algorithm that directly mines frequent closed itemsets. The first of this algorithm is to use bottom-up search to identify the smallest frequent itemset that determines a closed itemset. After finding the frequent k-itemsets, Close compares the support of each set with its subsets at the previous level. If the support of an itemset matches the support of any of its subsets, the itemset is pruned. The second step in Close is to compute the closure of all the itemsets found in the first step. The authors also created a variation of this algorithm, called A-CLOSE, that also generates a reduced set of association rules without having to determine all frequent itemsets, thus lowering the algorithm computation costs.

   CLOSET
Authors: Jian Pei, Jiawei Han, Runying Mao
When: 2000
Paper: CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets
Where: Proceedings of ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery
It is an algorithm for minng closed itemsets that uses an FP-tree strucutre for mining closed itemsets without candidate generation. It also uses a recursive divide-and-conquer strategy and a database projection approach to mine long patterns.

   CHARM
Authors: Mohammed Javeed Zaki, Ching-Jiu Hsiao
When: 2002
Paper: CHARM: An Efficient Algorithm for Closed Itemset Mining
Where: SDM - SIAM International Conference on Data Mining, Chicago
It is another algorithm for mining all frequent closed itemsets. There are some innovative ideas developed by this algorithm that are of a certain value. CHARM simultaneously explores both the itemset space and transaction space, rather than only the itermset search space. It also uses a highly efficient hybrid search method that skips many levels of the IT-tree ( itemset-tidset tree) to quickly identify the frequent closed itemsets, instead of having to enumerate many possible subsets.

Reference

Mining association rules between sets of items in large databases - Rakesh Agrawal, Tomasz Imielinski, Arun Swami
Set-oriented mining for association rules in relational databases - Maurice Houtsma, Arun Swami
Fast algorithms for mining association rules - Rakesh Agrawal, Ramakrishnan Srikant
An effective hash based algorithm for mining association rules - J. S. Park, M. Chen, P.S. Yu.
An efficient algorithm for mining association rules in large databases - Ashok Savasere, Edward Omiecinski, Shamkant Navathe
New Algorithms for Fast Discovery of Assoctiation Rules - Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li
Mining Frequent Patterns without Candidate Generation - Jiawei Han, Jian Pei, Yiwen Yin
H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases - Jian Pei, Jiawei Han Lu, Shojiro Nishio, Shiwei Tang, Dongqing Yang
Keeping Things Simple: Finding Frequent Item Sets by Recursive Elimination - Christian Borgelt
A tree projection algorithm for generation of frequent itemsets - Ramesh C. Agarwal, Charu C. Aggarwal, V.V.V. Prasad
Mining Frequent Patterns with Counting Inference - Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, L. Lakhal
MAFIA: A maximal frequent itemset algorithm for transactional databases - Doug Burdick, Manuel Calimlim, Johannes Gehrke
Efficiently Mining Long Patterns from Databases - R.J. Bayardo
Depth first generation of long patterns - Ramesh Agrawal, Charu Aggarwal, and V.V.V. Prasad
Efficiently mining maximal frequent itemsets - Karam Gouda, Mohammed J. Zaki
Efficient mining of association rules using closed itemset lattices - Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal
CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets - Jian Pei, Jiawei Han, Runying Mao
CHARM: An Efficient Algorithm for Closed Itemset Mining - Mohammed Javeed Zaki, Ching-Jiu Hsiao

Data Mining Articles
If Only I Knew - by Tim Graettinger


Easy Data Mining
       by Radu Lovin
Part 1 - Introduction to Data Mining

Part 2 - Mining Frequent Patterns (Maximal and Closed Frequent Itemsets)

Part 3 - Association Rules

Part 4 - Algorithms for Mining Frequent Itemsets

Data Mining Books

Home   Contributors   Resources

© 2007-2009, dataminingarticles.com