https://github.com/derlin/bda-lsa-project

BDA project - latent semantic analysis of Wikipedia with spark (SVD and LDA)

Host: GitHub
URL: https://github.com/derlin/bda-lsa-project
Owner: derlin
Created: 2017-05-14T10:08:55.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2018-08-18T07:53:18.000Z (almost 8 years ago)
Last Synced: 2025-01-20T10:31:09.252Z (over 1 year ago)
Language: Scala
Homepage:
Size: 2.52 MB
Stars: 2
Watchers: 6
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md


[![scaladoc](https://img.shields.io/badge/doc-scaladoc-blue.svg)](https://derlin.github.io/bda-lsa-project/api/index.html)

[![gh-pages](https://img.shields.io/badge/doc-gh--pages-blue.svg)](https://derlin.github.io/bda-lsa-project)

[![wiki](https://img.shields.io/badge/doc-wiki-blue.svg)](https://github.com/derlin/bda-lsa-project/wiki)

![love](https://img.shields.io/badge/made%20with%20%E2%99%A5-in%20Switzerland-ff69b4.svg)

# Table of Contents

- [About this repository](#about-this-repository)

  * [Structure](#structure)

- [Building and Running](#building-and-running)

  * [Building the jar](#building-the-jar)

  * [Project structure](#project-structure)

  * [Example of pipeline](#example-of-pipeline)

- [Models available](#models-available)

  * [SVD](#svd)

  * [LDA](#lda)

      - [ml.LDA](#mllda)

      - [mllib.LDA](#mlliblda)

  

 Table of contents generated with markdown-toc

 

# About this repository

_Context_: BDA (Big Data Analytics), Master MSE, June 2017.

_Authors_: Lucy Linder, Kewin Dousse, Davide Mazzolini, Christophe Blanquet.

This project is based on Chapter 6 of the book [_Advanced Analytics with Spark_](https://github.com/sryza/aas). It contains code and information on how to apply LSA techniques to the English Wikipedia articles corpus. 

## Structure

- the [source code](src/main/scala) is made of multiple classes thoroughly commented and documented with scaladoc

- the folder [spark-shell-scripts](spark-shell-scripts) contains scripts intended to be loaded into a spark-shell session. Once again, most of them are documented

- READMEs are scattered at each level in order to help you understand the repository structure

- the [wiki](https://github.com/derlin/bda-lsa-project/wiki) contains the results, notes, tips and tricks, how-to etc. This is a good place to start if you just want to see what we did.

- the [gh-pages website](https://derlin.github.io/bda-lsa-project) contains other resources such as `all-steps-from-book` in HTML as well as the __scaladoc__

- a __presentation__ is also available as a [PDF (other/bda-lsa-project-slides.pdf)](other/bda-lsa-project-slides.pdf) or on [Google Slides](https://docs.google.com/presentation/d/1QmcnOZX43cVA3acqbjQT5rEmPPhYmRVuP0GbZjslEhM/edit?usp=sharing)

# Building and Running

## Building the jar

This project uses sbt. To create the jar, use:

    

    export JAVA_OPTS="-Xms256m -Xmx4g"

    sbt assembly

The jar is now available under `target/scala-2.11/bda-project-lsa-assembly-1.0.jar`.

## Project structure

The project is made of multiple spark programs. Each program stores its output on disk, the actual location depending on the properties set in `config.properties`.
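
As an illustration of how output locations can be derived from `config.properties`, here is a minimal sketch. It assumes the file is read with `java.util.Properties`, which the project may or may not do; `ConfigSketch`, `load`, `outputDir` and the fallback base path are hypothetical names and values chosen for this example.

```scala
import java.io.StringReader
import java.util.Properties

// Hypothetical helper, not part of the project.
object ConfigSketch {

  /** Parses the text of a .properties file. */
  def load(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  /** Resolves an output directory such as "svd" under `path.base`. */
  def outputDir(props: Properties, name: String): String = {
    // Assumption: fall back to /tmp/spark-wiki/ when path.base is not set.
    val base = props.getProperty("path.base", "/tmp/spark-wiki/")
    // Drop a trailing slash so the result never contains "//".
    base.stripSuffix("/") + "/" + name
  }
}
```

For example, `ConfigSketch.outputDir(ConfigSketch.load("path.base=/tmp/spark-wiki/"), "svd")` yields `/tmp/spark-wiki/svd`. Note that `java.util.Properties` only recognises `#` as a comment marker at the start of a line: anything after the `=` sign, including a trailing `# ...`, becomes part of the value.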

 

 A usual pipeline is:

 

 1. convert wikidump XML into plain text (`bda.lsa.preprocessing.XmlToParquetWriter`)

 2. create the vocabulary and the TF-IDF matrix (`bda.lsa.preprocessing.DocTermMatrixWriter`)

 3. create one of the models (`bda.lsa.svd.RunSVD`, `bda.lsa.lda.mllib.RunLDA`, `bda.lsa.lda.ml.RunLDA`)

 

 At this point, the model is persisted somewhere and you can load it inside a spark-shell to interact with it. The classes `bda.lsa.svd.SVDQueryEngine`, `bda.lsa.lda.mllib.LDAQueryEngine` and `bda.lsa.lda.ml.LDAQueryEngine` implement useful queries to analyse the models. 
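
To make step 2 of the pipeline more concrete, the sketch below computes a vocabulary and TF-IDF weights on a tiny in-memory corpus in plain Scala. It is not the code of `DocTermMatrixWriter`: the names `TfIdfSketch`, `vocabulary` and `tfIdf` are hypothetical, the vocabulary is assumed to be the `numTerms` terms with the highest document frequency, and the IDF uses the smoothed formula `ln((m + 1) / (df + 1))` documented for Spark MLlib.

```scala
// Hypothetical illustration, not part of the project.
object TfIdfSketch {

  /** Keeps the `numTerms` terms found in the most documents (ties broken alphabetically). */
  def vocabulary(docs: Seq[Seq[String]], numTerms: Int): Seq[(String, Int)] = {
    // Count each term once per document to get its document frequency.
    val docFreqs = docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => (t, occ.size) }
    docFreqs.toSeq.sortBy { case (t, df) => (-df, t) }.take(numTerms)
  }

  /** One sparse row per document: term -> tf * idf, with idf = ln((m + 1) / (df + 1)). */
  def tfIdf(docs: Seq[Seq[String]], numTerms: Int): Seq[Map[String, Double]] = {
    val m = docs.size
    val idf = vocabulary(docs, numTerms)
      .map { case (t, df) => t -> math.log((m + 1.0) / (df + 1.0)) }
      .toMap
    docs.map { doc =>
      // Terms outside the vocabulary are dropped; tf is the raw count in the document.
      doc.filter(idf.contains).groupBy(identity).map { case (t, occ) => t -> occ.size * idf(t) }
    }
  }
}
```

A smaller `numTerms` gives a smaller, denser matrix at the cost of ignoring rare terms.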

## Example of pipeline

1. ensure you have a _wikidump_ somewhere to process.

2. create the jar: 

        export JAVA_OPTS="-Xms256m -Xmx4g"

        sbt assembly

        

3. create a `config.properties` and add the following:

        # your XML dump
        path.wikidump=wikidump-1500.xml

        # a base path to store the results
        path.base=/tmp/spark-wiki/

        

4. convert XML to text:  

 

        spark-submit --class bda.lsa.preprocessing.XmlToParquetWriter \

            target/scala-2.11/bda-project-lsa-assembly-1.0.jar

            

    This saves the DataFrame `[(title: String, content: String)]` in `{path.base}/wikidump-parquet`.

    __Important__: the job will fail if this directory already exists!

5. create the TF-IDF matrix:

        spark-submit --class bda.lsa.preprocessing.DocTermMatrixWriter \

                target/scala-2.11/bda-project-lsa-assembly-1.0.jar  \

                 [ ]

                

   The only required argument is `numTerms`: this is the size of the vocabulary.

    

6. Run one of the models. For example, for SVD:

    

        spark-submit --class bda.lsa.svd.RunSVD \

                       target/scala-2.11/bda-project-lsa-assembly-1.0.jar  \

                       

     

   The parameter `k` is the number of topics to infer; it defaults to 100.

   

7. open a spark-shell and load the model:

        spark-shell --jars target/scala-2.11/bda-project-lsa-assembly-1.0.jar 

        val data = bda.lsa.getData(spark)

        val model = bda.lsa.svd.RunSVD.loadModel(spark)

        

8. Optionally, create the query engine in the shell:

        val q = new bda.lsa.svd.SVDQueryEngine(model, data)
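
Since step 4 fails when its output directory already exists, re-running the pipeline requires removing the previous output first. When `path.base` points to the local filesystem, a small helper along these lines could be used from a spark-shell or a script. This is only a sketch: `OutputDirSketch` and `deleteRecursively` are hypothetical names, and the helper does not handle HDFS or other remote paths.

```scala
import java.io.File

// Hypothetical helper, not part of the project. Local filesystem only.
object OutputDirSketch {

  /** Deletes `f` and everything below it; returns true if `f` itself was removed. */
  def deleteRecursively(f: File): Boolean = {
    // listFiles returns null for plain files and for paths that do not exist.
    val children = Option(f.listFiles).getOrElse(Array.empty[File])
    children.foreach(deleteRecursively)
    f.delete()
  }
}
```

For instance, `OutputDirSketch.deleteRecursively(new java.io.File("/tmp/spark-wiki/wikidump-parquet"))` clears the output of a previous run of step 4, assuming the base path from the example configuration.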

   

   

   

# Models available

## SVD

The class `bda.lsa.svd.RunSVD` makes it easy to compute an SVD model.

The results are saved in `{path.base}/svd`. More information about SVD can be found in the [svd package README](src/main/scala/bda/lsa/svd/Readme.md).

After creating the model (see steps above), you can use the `bda.lsa.svd.SVDQueryEngine` to discover the results. From a spark-shell:

```

spark-shell --jars bda-project-lsa-assembly-1.0.jar

> val data = bda.lsa.getData(spark)

> val model = bda.lsa.svd.RunSVD.loadModel(spark)

> val q = new bda.lsa.svd.SVDQueryEngine(model, data)

```

See the [wiki](https://github.com/derlin/bda-lsa-project/wiki) for our results and conclusion.
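
To give an idea of the kind of query that can be run against an SVD model, the sketch below extracts the top terms of each concept from the `V` matrix, in the spirit of the book's chapter. It is not the implementation of `SVDQueryEngine`: `SvdSketch` and `topTermsPerConcept` are hypothetical names, and the matrix is passed as a flat column-major array (the layout returned by `Matrix.toArray` in Spark MLlib) so that the example runs without Spark.

```scala
// Hypothetical illustration, not part of the project.
object SvdSketch {

  /**
   * @param vColumnMajor V flattened column by column; each column is one concept
   * @param numRows      number of rows of V, i.e. the vocabulary size
   * @param termIds      term for each row index of V
   * @param numConcepts  how many concepts (columns) to inspect
   * @param numTerms     how many terms to keep per concept
   */
  def topTermsPerConcept(
      vColumnMajor: Array[Double],
      numRows: Int,
      termIds: Array[String],
      numConcepts: Int,
      numTerms: Int): Seq[Seq[(String, Double)]] =
    (0 until numConcepts).map { c =>
      // Column c holds the weight of every term in concept c.
      val column = vColumnMajor.slice(c * numRows, (c + 1) * numRows)
      column.zipWithIndex
        .sortBy { case (weight, _) => -weight } // highest weight first
        .take(numTerms)
        .map { case (weight, id) => (termIds(id), weight) }
        .toSeq
    }
}
```

With a real model, `vColumnMajor` and `numRows` would come from the `V` factor of the decomposition, and `termIds` from the vocabulary built in the preprocessing step.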

 

## LDA

 

 LDA models are available in two flavors: with _spark mllib_ and _spark ml_.

 

We focused on the mllib implementation, mostly because `org.apache.spark.mllib.clustering.DistributedLDAModel` offers more utility methods than its ml counterpart. Our ml implementation creates the model, but does not offer a useful query engine.

  

### ml.LDA

 

To run the model:

    spark-submit --class bda.lsa.lda.ml.RunLDA \

          target/scala-2.11/bda-project-lsa-assembly-1.0.jar  \

             

          

The model is then saved to `{path.base}/ml-lda`.

After creating the model (see steps above), you can use the `bda.lsa.lda.ml.LDAQueryEngine` to discover the results. From a spark-shell:

 

 ```

 spark-shell --jars bda-project-lsa-assembly-1.0.jar

 > val data = bda.lsa.getData(spark)

 > val model = bda.lsa.lda.ml.RunLDA.loadModel(spark)

 > val q = new bda.lsa.lda.ml.LDAQueryEngine(model, data)

 ```

Note that this query engine is not optimised: some queries take a while to complete.

 

### mllib.LDA

 

 

To run the model:

    spark-submit --class bda.lsa.lda.mllib.RunLDA \

          target/scala-2.11/bda-project-lsa-assembly-1.0.jar  \

               

          

The model is then saved to `{path.base}/mllib-lda`.

After creating the model (see steps above), you can use the `bda.lsa.lda.mllib.LDAQueryEngine` to discover the results. From a spark-shell:

 

 ```

 spark-shell --jars bda-project-lsa-assembly-1.0.jar

 > val data = bda.lsa.getData(spark)

 > val model = bda.lsa.lda.mllib.RunLDA.loadModel(spark)

 > val q = new bda.lsa.lda.mllib.LDAQueryEngine(model, data)

 ```

 

See the [wiki](https://github.com/derlin/bda-lsa-project/wiki) for our results and conclusion.
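
One of the utility methods available on mllib LDA models is `describeTopics(maxTermsPerTopic)`, which returns, for each topic, an array of term indices and an array of matching weights. The sketch below shows how such an output can be turned into readable topics. It is not the code of `LDAQueryEngine`: `LdaSketch` and `labelTopics` are hypothetical names, and the vocabulary array is assumed to be indexed the same way as the columns of the document-term matrix.

```scala
// Hypothetical illustration, not part of the project.
object LdaSketch {

  /**
   * @param topics     one (termIndices, weights) pair per topic,
   *                   the shape returned by `describeTopics` on an mllib LDA model
   * @param vocabulary term for each term index
   */
  def labelTopics(
      topics: Array[(Array[Int], Array[Double])],
      vocabulary: Array[String]): Seq[Seq[(String, Double)]] =
    topics.toSeq.map { case (termIndices, weights) =>
      // Pair every term index with its weight, then look the term up.
      termIndices.zip(weights).map { case (idx, w) => (vocabulary(idx), w) }.toSeq
    }
}
```

In a spark-shell, the first argument would be something like `model.describeTopics(10)`, assuming `model` is the mllib model loaded above.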
