Frequent Itemset Mining implementation in JAVA

Huge datasets, often containing important operational knowledge, defy standard data analysis methods. Traditional data analysis methods do not easily scale from analyzing megabytes of data to analyzing terra- or peta-bytes of data, nor from analyzing low dimensional data to analyzing very high dimensional data. Furthermore, results may become difficult or almost impossible to interpret by the end-user because of their size and complexity. These are several of the problems that novel data mining methods try to solve. Frequent Itemset Mining focuses on deriving association rules which can then be used to classify new incoming data. The classic example is the shopping cart example with the ‘myth’: people who buy diapers also buy beer. If there is a large confidence of the rule D(iapers) => B(eer), it will actually be there in the ouput of the algorithm. For a concern this is perhaps nice to know, so that they can adjust their shop to it (like putting the diapers close to the beer or something) and their sales (if you buy an extra pack of diapers, you get a 50% discount on beer).
There is a large variety of Frequent Itemset Mining algorithms available on the internet. Because none of them was of direct use, I’ve made an implementation itself. It has its known downsides (see below), but it hopefully provides a start for people who want to do more with FIM.
With this Frequent Itemset Mining implementation I’ve implemented 2 algorithms which are capable of doing this: The Apriori algorithm and the FP-Tree algorithm. Note that these packages are not created by me.

Input of the program / code:
A “.data” comma separated file.

Known downsides of the program / code:
> The user is asked for two classes when performing the scanning algorithm.
There are a lot of cases where there are more than two classes in a data file.
For making it possible to handle these files, the code just has to be adjusted
slightly: the algorithm must look itself for the number of different classes
and perform the partitionscans on all these different classes.
> File rewriting is necessary at this point for the algorithm to work. The
speed could be much improved if this isn’t necessary any more. For that the
code for these algorithms needs to be rewritten to handle lists directly.
> The hash function which is present in the program is pretty basic and
probably not sufficient for large datasets. This could be solved by
implementing a stronger hash function (or using JAVA’s hash function).
In overall, I’ve tried to keep all functions as generic as possible so that
further extension is actually possible, so in that point of view, the code that
I’ve written is pretty scalable and by adjusting some functions (hashfunction
and scanning parameters) slightly, it can be even more scalable. All CSV files
can be read and the program reacts apropriate on the incoming data. If the data
is correct and it can be handled, the algorithm can start working with this
input and it produces a result file.
I was also surprised to see how many warnings my Eclipse environment generated
for the used source codes. Most of them can be solved directly, but some need
more time.

Download
The JAR file, source and test input file can be found in the download section.

Find more on Frequent Itemset Mining: http://www.google.nl/search?source=ig&hl=nl&rlz=&q=Frequent+Itemset+Mining&btnG=Google+zoeken
More on Association Rule learning:
http://www.google.nl/search?hl=nl&q=Association+Rule+Learning&btnG=Zoeken

Library references:
CSVReader: OpenCSV (http://opencsv.sourceforge.net)
Hash function: http://www.cs.usfca.edu/galles/cs245/hash.java.html
Apriori algorithm: http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_prog1.html
FPGrowth algorithm: http://www.csc.liv.ac.uk/~frans/KDD/Software/FPGrowth/fpGrowth.html

Getagged , , ,

28 reacties op “Frequent Itemset Mining implementation in JAVA

  1. Alex Safronov schreef:

    Thanks for working on FIM. I love this both practical and mind expanding topic.

  2. […] … The frequent pattern analysis described in Frequent Pattern Mining for Discovery is also …Frequent Itemset Mining implementation in JAVA | Patrick's …Huge datasets, often containing important operational knowledge, defy standard data analysis […]

  3. gps reviews schreef:

    Great job! I wish to commend you for your very good work on this post. I’m hoping you continue coming up with good posts such as this one.

  4. where did you get the huge data set from? schreef:

    Hey,
    Where did you get the huge data set from?
    I have an assignment in university where I have to implement FP Tree algorithm, but I am not able to find a data set.
    Please Help.

    • Patrick schreef:

      Well, I actually got it with my assignment!
      But basically every kind of large dataset will do. I used genetic data. These data often come in large files.

      Perhaps you can find some interesting datasets here.

  5. Philippe schreef:

    I have done several JAVA implementations of Frequent Itemset Mining algorithms such as H-Mine, Apriori, ECLAT, RELIM, FP-GROWTH.

    If you want to download them, go to my website:

    http://www.philippe-fournier-viger.com/spmf/

    • Patrick schreef:

      Really nice Philippe. Thank you very much for your addition! If I find some time I’ll surely have a look into it. Other readers may find this useful too!

      Cheers,

      Patrick

  6. issha schreef:

    hey… i need java code for ECLAT algorithm… if some one knows the link or something, plz let me know as soon as possible..

  7. issha schreef:

    hello everyone my name is Issha..
    I want some help regarding implementation of ECLAT in java.. if someone knows the link or has some info, plz let me know as soon as possible..

    Thanking you..

  8. koushik schreef:

    Thanks for the code, Good job. I have doubt in the crx.data file. What do those alphabets stands for b,a,u,g,v,t,f…… and the +(plus sign) and -(minus sign signifies)?

  9. suren schreef:

    hello all currently i am doing my research work about mining of dynamic frequent itemset uing ant colony…please help me out..from where i have to start…

  10. Vuong schreef:

    do you know Hurm algorithm?
    I’m doing a master thesis in the field of information technology, my thesis is name: “Utility-based association rule mining: A marketing solution for cross-selling”. But I don’t understand Hurm algorithm.I know you very good.
    Article Title:High-Utility Rule Mining for Cross-Selling
    Thanks.
    Vuong,

  11. suren schreef:

    how to implement ant colony for mining frequent itemset

  12. karthic schreef:

    Hai i am Karthic I need fp tree construction program in java.. The output must be in GUI format.. whether the program may be in java swing or any thing… If you have any source code can you send it to my mail… It must be helpful for my project…

  13. karthic schreef:

    No Patrick you cant tell like that… Plz help me I need Frequent Pattern(FP) Growth program alone… Can you ask it to your friends Patrick… I need it Patrick… help me….

  14. karthic schreef:

    The output must be User Interface… Please Patrick help me…

  15. Pavan schreef:

    Hello sir, thanks for the codes. Looks like you’re a champion in data mining. Needed your help for one last time. Can you provide the code for hash based technique for improving apriori. ?

    Thanks.

  16. Nitya schreef:

    Sir,
    This seems a bit relative to my project.Am working on generation of frequent itemsets for large database using hadoop mapreduce.Can i get code for that? Please help me.

  17. malarvizhi schreef:

    hello sir,
    I am working on FP-Growth Algorithm and i have seen ur code of FP-gowth algorithm.But it is not working for my large dataset.can u pls help me sir.

  18. srujan schreef:

    sir,
    we are in need of source code for algorithms IWI miner and MIWI miner in any language. If you have please provide us which will helpful for us doing our mini project.

  19. sid schreef:

    Hello friend
    I am Research student i found java code of ECLAT on phillip’s website but how to use it with large data i dont know and which data can be use i dont know please help me.
    PLease reply urgent.

  20. iza schreef:

    sir, i need java code algorithm fp-growth please send me in email
    or if some one knows the link ?

  21. akshu schreef:

    Hey i’m a code for doing a project in data minning.i need the java cod for integrating apriori and FP tree.If u have plz mail me.

Geef een reactie

Het e-mailadres wordt niet gepubliceerd. Vereiste velden zijn gemarkeerd met *