Huge datasets, often containing important operational knowledge, defy standard data analysis methods. Traditional data analysis methods do not easily scale from analyzing megabytes of data to analyzing terra- or peta-bytes of data, nor from analyzing low dimensional data to analyzing very high dimensional data. Furthermore, results may become difficult or almost impossible to interpret by the end-user because of their size and complexity. These are several of the problems that novel data mining methods try to solve. Frequent Itemset Mining focuses on deriving association rules which can then be used to classify new incoming data. The classic example is the shopping cart example with the ‘myth’: people who buy diapers also buy beer. If there is a large confidence of the rule D(iapers) => B(eer), it will actually be there in the ouput of the algorithm. For a concern this is perhaps nice to know, so that they can adjust their shop to it (like putting the diapers close to the beer or something) and their sales (if you buy an extra pack of diapers, you get a 50% discount on beer).
There is a large variety of Frequent Itemset Mining algorithms available on the internet. Because none of them was of direct use, I’ve made an implementation itself. It has its known downsides (see below), but it hopefully provides a start for people who want to do more with FIM.
With this Frequent Itemset Mining implementation I’ve implemented 2 algorithms which are capable of doing this: The Apriori algorithm and the FP-Tree algorithm. Note that these packages are not created by me.
Input of the program / code:
A “.data” comma separated file.
Known downsides of the program / code:
> The user is asked for two classes when performing the scanning algorithm.
There are a lot of cases where there are more than two classes in a data file.
For making it possible to handle these files, the code just has to be adjusted
slightly: the algorithm must look itself for the number of different classes
and perform the partitionscans on all these different classes.
> File rewriting is necessary at this point for the algorithm to work. The
speed could be much improved if this isn't necessary any more. For that the
code for these algorithms needs to be rewritten to handle lists directly.
> The hash function which is present in the program is pretty basic and
probably not sufficient for large datasets. This could be solved by
implementing a stronger hash function (or using JAVA's hash function).
In overall, I've tried to keep all functions as generic as possible so that
further extension is actually possible, so in that point of view, the code that
I've written is pretty scalable and by adjusting some functions (hashfunction
and scanning parameters) slightly, it can be even more scalable. All CSV files
can be read and the program reacts apropriate on the incoming data. If the data
is correct and it can be handled, the algorithm can start working with this
input and it produces a result file.
I was also surprised to see how many warnings my Eclipse environment generated
for the used source codes. Most of them can be solved directly, but some need
more time.
Download
The JAR file, source and test input file can be found in the download section.
Find more on Frequent Itemset Mining: http://www.google.nl/search?source=ig&hl=nl&rlz=&q=Frequent+Itemset+Mining&btnG=Google+zoeken
More on Association Rule learning:
http://www.google.nl/search?hl=nl&q=Association+Rule+Learning&btnG=Zoeken
Library references:
CSVReader: OpenCSV (http://opencsv.sourceforge.net)
Hash function: http://www.cs.usfca.edu/galles/cs245/hash.java.html
Apriori algorithm: http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_prog1.html
FPGrowth algorithm: http://www.csc.liv.ac.uk/~frans/KDD/Software/FPGrowth/fpGrowth.html
Thanks for working on FIM. I love this both practical and mind expanding topic.
[...] … The frequent pattern analysis described in Frequent Pattern Mining for Discovery is also …Frequent Itemset Mining implementation in JAVA | Patrick's …Huge datasets, often containing important operational knowledge, defy standard data analysis [...]
Great job! I wish to commend you for your very good work on this post. I’m hoping you continue coming up with good posts such as this one.
Hey,
Where did you get the huge data set from?
I have an assignment in university where I have to implement FP Tree algorithm, but I am not able to find a data set.
Please Help.
Well, I actually got it with my assignment!
But basically every kind of large dataset will do. I used genetic data. These data often come in large files.
Perhaps you can find some interesting datasets here.
I have done several JAVA implementations of Frequent Itemset Mining algorithms such as H-Mine, Apriori, ECLAT, RELIM, FP-GROWTH.
If you want to download them, go to my website:
http://www.philippe-fournier-viger.com/spmf/
Really nice Philippe. Thank you very much for your addition! If I find some time I’ll surely have a look into it. Other readers may find this useful too!
Cheers,
Patrick
hey… i need java code for ECLAT algorithm… if some one knows the link or something, plz let me know as soon as possible..
hello everyone my name is Issha..
I want some help regarding implementation of ECLAT in java.. if someone knows the link or has some info, plz let me know as soon as possible..
Thanking you..