januari 2008 - Patrick's Playground

Huge datasets, often containing important operational knowledge, defy standard data analysis methods. Traditional data analysis methods do not easily scale from analyzing megabytes of data to analyzing terra- or peta-bytes of data, nor from analyzing low dimensional data to analyzing very high dimensional data. Furthermore, results may become difficult or almost impossible to interpret by the end-user because of their size and complexity. These are several of the problems that novel data mining methods try to solve. Frequent Itemset Mining focuses on deriving association rules which can then be used to classify new incoming data. The classic example is the shopping cart example with the ‘myth’: people who buy diapers also buy beer. If there is a large confidence of the rule D(iapers) => B(eer), it will actually be there in the ouput of the algorithm. For a concern this is perhaps nice to know, so that they can adjust their shop to it (like putting the diapers close to the beer or something) and their sales (if you buy an extra pack of diapers, you get a 50% discount on beer).
There is a large variety of Frequent Itemset Mining algorithms available on the internet. Because none of them was of direct use, I’ve made an implementation itself. It has its known downsides (see below), but it hopefully provides a start for people who want to do more with FIM.
With this Frequent Itemset Mining implementation I’ve implemented 2 algorithms which are capable of doing this: The Apriori algorithm and the FP-Tree algorithm. Note that these packages are not created by me.

Input of the program / code:
A “.data” comma separated file.

Known downsides of the program / code:
> The user is asked for two classes when performing the scanning algorithm.
There are a lot of cases where there are more than two classes in a data file.
For making it possible to handle these files, the code just has to be adjusted
slightly: the algorithm must look itself for the number of different classes
and perform the partitionscans on all these different classes.
> File rewriting is necessary at this point for the algorithm to work. The
speed could be much improved if this isn’t necessary any more. For that the
code for these algorithms needs to be rewritten to handle lists directly.
> The hash function which is present in the program is pretty basic and
probably not sufficient for large datasets. This could be solved by
implementing a stronger hash function (or using JAVA’s hash function).
In overall, I’ve tried to keep all functions as generic as possible so that
further extension is actually possible, so in that point of view, the code that
I’ve written is pretty scalable and by adjusting some functions (hashfunction
and scanning parameters) slightly, it can be even more scalable. All CSV files
can be read and the program reacts apropriate on the incoming data. If the data
is correct and it can be handled, the algorithm can start working with this
input and it produces a result file.
I was also surprised to see how many warnings my Eclipse environment generated
for the used source codes. Most of them can be solved directly, but some need
more time.

Download
The JAR file, source and test input file can be found in the download section.

Find more on Frequent Itemset Mining: http://www.google.nl/search?source=ig&hl=nl&rlz=&q=Frequent+Itemset+Mining&btnG=Google+zoeken
More on Association Rule learning:
http://www.google.nl/search?hl=nl&q=Association+Rule+Learning&btnG=Zoeken

Library references:
CSVReader: OpenCSV (http://opencsv.sourceforge.net)
Hash function: http://www.cs.usfca.edu/galles/cs245/hash.java.html
Apriori algorithm: http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_prog1.html
FPGrowth algorithm: http://www.csc.liv.ac.uk/~frans/KDD/Software/FPGrowth/fpGrowth.html

Patrick's Playground

Random thoughts, problems and solutions

Maandelijks archief: januari 2008

Frequent Itemset Mining implementation in JAVA