Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/felipekunzler/frequent-itemset-mining-spark

Sequential and distributed implementations of Apriori and FP-Growth algorithms using Scala and Spark.
https://github.com/felipekunzler/frequent-itemset-mining-spark

apriori dfps fp-growth rapriori scala spark yafim

Last synced: 21 days ago
JSON representation

Sequential and distributed implementations of Apriori and FP-Growth algorithms using Scala and Spark.

Host: GitHub
URL: https://github.com/felipekunzler/frequent-itemset-mining-spark
Owner: felipekunzler
Created: 2019-02-25T20:40:48.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2021-06-04T23:58:43.000Z (about 3 years ago)
Last Synced: 2024-02-17T23:32:19.221Z (4 months ago)
Topics: apriori, dfps, fp-growth, rapriori, scala, spark, yafim
Language: Scala
Homepage:
Size: 5.69 MB
Stars: 15
Watchers: 4
Forks: 8
Open Issues: 0
Metadata Files:
- Readme: README.md

Lists

awesome-stars - felipekunzler/frequent-itemset-mining-spark - Sequential and distributed implementations of Apriori and FP-Growth algorithms using Scala and Spark. (Scala)

README

        # Frequent Itemset Mining in Scala and Spark

Implementation of Apriori and FP-Growth in Scala, as well as their counterpart distributed implementations YAFIM, RApriori and DFPS using Spark.

This work has been developed as part of my [CS undergraduate thesis](./Final%20Paper%20-%20pt_BR.pdf) where three previously proposed distributed algorithms were implemented in Spark and compared on [Amazon EMR](https://aws.amazon.com/emr/).

### Algorithms

* Sequential implementation of Apriori and FP-Growth in Scala.

  * [Apriori.scala](https://github.com/felipekunzler/frequent-itemset-mining-spark/blob/master/src/main/scala/sequential/Apriori.scala)

  * [FPGrowth.scala](https://github.com/felipekunzler/frequent-itemset-mining-spark/tree/master/src/main/scala/sequential/fpgrowth/FPGrowth.scala) and [FPTree.scala](https://github.com/felipekunzler/frequent-itemset-mining-spark/tree/master/src/main/scala/sequential/fpgrowth/FPTree.scala)

  * Apriori using a Hash Tree for filtering: [AprioriHashTree.scala](https://github.com/felipekunzler/frequent-itemset-mining-spark/blob/master/src/main/scala/sequential/hashtree/AprioriHashTree.scala)

  * Hash Tree implementation: [HashTree.scala](https://github.com/felipekunzler/frequent-itemset-mining-spark/blob/master/src/main/scala/sequential/hashtree/HashTree.scala)

* Distributed implementations of Apriori and FP-Growth using Spark.

  * [YAFIM.scala](https://github.com/felipekunzler/frequent-itemset-mining-spark/blob/master/src/main/scala/spark/YAFIM.scala)

  * YAFIM using a Hash Tree for filtering: [YAFIMHashTree.scala](https://github.com/felipekunzler/frequent-itemset-mining-spark/blob/master/src/main/scala/spark/YAFIMHashTree.scala)

  * [RApriori.scala](https://github.com/felipekunzler/frequent-itemset-mining-spark/blob/master/src/main/scala/spark/RApriori.scala)

  * [DFPS.scala](https://github.com/felipekunzler/frequent-itemset-mining-spark/blob/master/src/main/scala/spark/DFPS.scala)

* Distributed algorithms originally proposed by the following papers:

  * [YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark](https://ieeexplore.ieee.org/abstract/document/6969575)

  * [R-Apriori: An Efficient Apriori based Algorithm on Spark](https://dl.acm.org/citation.cfm?id=2809893)

  * [DFPS: Distributed FP-growth algorithm based on Spark](https://ieeexplore.ieee.org/document/8054308)

### Experiments

The three distributed FIM algorithms were compared in terms of execution time and normalized speedup.

All experiments were run on Amazon EMR using m5.xlarge (dual core) machines.

* Execution time in seconds. Datasets replicated 9 times and increasing cluster size.

 

 

* Normalized speedup factor, if the number of cores is increased to 10, the ideal speedup factor is 10 (meaning that the execution time has decreased by 10 times). Datasets replicated 9 times and increasing cluster size (X axis measured in cores).

 

 

### Building and running

* Install and set Java 8 to your path, e.g. `sdk install java 8.0.252.hs-adpt` and `sdk use java 8.0.252.hs-adpt`

* Instsall sbt, e.g. `sdk install sbt`

* Compile the project using `sbt compile`

* Adjust the absolute path for the provided datasets in `src/main/resources/defaultfim.properties`

* Run the experiments with `sbt run` and select the main class `experiments.Runner`

Sample output (execution times in seconds):

```

+-----------------+------------+-------+-------+-------+-------+------+

|           Class |    Dataset | Run 1 | Run 2 | Run 3 |  Mean |   SD |

+-----------------+------------+-------+-------+-------+-------+------+

| AprioriHashTree |   mushroom | 14.90 | 13.96 | 14.65 | 14.50 | 0.40 |

|        FPGrowth |   mushroom |  0.63 |  0.55 |  0.64 |  0.61 | 0.04 |

|   YAFIMHashTree |   mushroom |  7.26 |  6.64 |  7.92 |  7.27 | 0.52 |

|        RApriori |   mushroom |  6.71 |  6.79 |  7.45 |  6.98 | 0.33 |

|            DFPS |   mushroom |  2.49 |  2.31 |  2.27 |  2.36 | 0.10 |

|                 |            |       |       |       |       |      |

| AprioriHashTree | T10I4D100K |  3.68 |  2.39 |  2.42 |  2.83 | 0.60 |

|        FPGrowth | T10I4D100K |  3.26 |  3.27 |  3.06 |  3.20 | 0.10 |

|   YAFIMHashTree | T10I4D100K |  1.96 |  1.90 |  1.89 |  1.92 | 0.03 |

|        RApriori | T10I4D100K |  1.89 |  1.80 |  1.75 |  1.81 | 0.06 |

|            DFPS | T10I4D100K |  4.13 |  4.07 |  4.14 |  4.11 | 0.03 |

+-----------------+------------+-------+-------+-------+-------+------+

+------------+-----------------+----------+---------------+----------+------+

|            | AprioriHashTree | FPGrowth | YAFIMHashTree | RApriori | DFPS |

+------------+-----------------+----------+---------------+----------+------+

|   mushroom |           14.50 |     0.61 |          7.27 |     6.98 | 2.36 |

| T10I4D100K |            2.83 |     3.20 |          1.92 |     1.81 | 4.11 |

+------------+-----------------+----------+---------------+----------+------+

```

Additionally, algorithms can be run individually by instantiating its respective class. Example:

```scala

object Main {

  def main(args: Array[String]): Unit = {

    

    val dfps: SparkFIM = new DFPS()

    

    val fileName = "/projects/spark-scala/frequent-itemset-mining-spark/src/main/resources/datasets/mushroom.txt"

    val frequentItemsets = dfps.execute(fileName, separator = " ", minSupport = 0.65)

    

    Util.printItemsets(frequentItemsets)

    

  }

}

```

Would output:

```

Elapsed time: 2.06 seconds. Class: DFPS.

Found 39 itemsets

[1] {34}, {36}, {39}, {85}, {86}, {90}

[2] {34, 36}, {34, 39}, {34, 85}, {34, 86}, {34, 90}, {36, 85}, {36, 86}, {36, 90}, {39, 85}, {39, 86}, {85, 86}, {85, 90}, {86, 90}

[3] {34, 36, 85}, {34, 36, 86}, {34, 36, 90}, {34, 39, 85}, {34, 39, 86}, {34, 85, 86}, {34, 85, 90}, {34, 86, 90}, {36, 85, 86}, {36, 85, 90}, {36, 86, 90}, {39, 85, 86}, {85, 86, 90}

[4] {34, 36, 85, 86}, {34, 36, 85, 90}, {34, 36, 86, 90}, {34, 39, 85, 86}, {34, 85, 86, 90}, {36, 85, 86, 90}

[5] {34, 36, 85, 86, 90}

```