https://github.com/nwtgck/wikipedia-word2vec-playground-spark
A playground of word2vec from Wikipedia Dump with Spark
scala spark word2vec
- Host: GitHub
- URL: https://github.com/nwtgck/wikipedia-word2vec-playground-spark
- Owner: nwtgck
- Created: 2018-01-25T15:57:38.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2018-05-20T06:29:53.000Z (over 6 years ago)
- Last Synced: 2024-12-13T06:45:11.947Z (2 months ago)
- Topics: scala, spark, word2vec
- Language: Scala
- Size: 4.05 MB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
## wikipedia-word2vec-playground
[![Build Status](https://travis-ci.org/nwtgck/wikipedia-word2vec-playground-spark.svg?branch=master)](https://travis-ci.org/nwtgck/wikipedia-word2vec-playground-spark)

A playground of word2vec from [Wikipedia Dump](https://dumps.wikimedia.org/) with [Spark](https://spark.apache.org/)
## Synonym
![word2vec_synonym](demo_images/word2vec_synonym.gif)
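
The demo does not show how synonyms are ranked. A common approach with word2vec vectors is to rank words by cosine similarity to the query word, and the sketch below illustrates that idea in plain Scala. It is not taken from this repository's code: the object name `SynonymSketch`, the `synonyms` helper, and the tiny 3-dimensional vectors are all hypothetical, chosen only to keep the example self-contained.

```scala
object SynonymSketch {
  // Hypothetical toy embeddings (assumption): a trained model would have
  // many more words and far more dimensions.
  val vectors: Map[String, Array[Double]] = Map(
    "cat"   -> Array(0.9, 0.1, 0.0),
    "dog"   -> Array(0.8, 0.2, 0.1),
    "car"   -> Array(0.0, 0.9, 0.4),
    "truck" -> Array(0.1, 0.8, 0.5)
  )

  // Cosine similarity between two vectors of equal length.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // Rank every other word by similarity to the query word, best first.
  def synonyms(word: String, num: Int): Seq[(String, Double)] = {
    val query = vectors(word)
    vectors.toSeq
      .collect { case (w, v) if w != word => (w, cosine(query, v)) }
      .sortBy { case (_, sim) => -sim }
      .take(num)
  }

  def main(args: Array[String]): Unit =
    synonyms("cat", 2).foreach { case (w, sim) => println(f"$w%-6s $sim%.3f") }
}
```

With these toy vectors, `dog` ranks closest to `cat` because the two vectors point in nearly the same direction.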
## Analogy
![word2vec_analogy](demo_images/word2vec_analogy.gif)
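
Analogies with word2vec are commonly answered by vector arithmetic: for "a is to b as c is to ?", compute `b - a + c` and look for the nearest remaining word. The sketch below shows that idea in plain Scala. It is an illustration under assumptions, not this repository's implementation: `AnalogySketch`, `analogy`, and the hand-made 3-dimensional vectors are hypothetical.

```scala
object AnalogySketch {
  // Hypothetical toy embeddings (assumption), built so that the second
  // component loosely encodes gender and the third encodes royalty.
  val vectors: Map[String, Array[Double]] = Map(
    "man"   -> Array(1.0, 0.0, 0.1),
    "woman" -> Array(1.0, 1.0, 0.1),
    "king"  -> Array(1.0, 0.0, 0.9),
    "queen" -> Array(1.0, 1.0, 0.9),
    "apple" -> Array(0.0, 0.2, 0.0)
  )

  // Cosine similarity between two vectors of equal length.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // "a is to b as c is to ?": rank words by similarity to (b - a + c),
  // leaving out the three input words themselves.
  def analogy(a: String, b: String, c: String, num: Int): Seq[(String, Double)] = {
    val target = vectors(b).indices.map { i =>
      vectors(b)(i) - vectors(a)(i) + vectors(c)(i)
    }.toArray
    val inputs = Set(a, b, c)
    vectors.toSeq
      .collect { case (w, v) if !inputs.contains(w) => (w, cosine(target, v)) }
      .sortBy { case (_, sim) => -sim }
      .take(num)
  }

  def main(args: Array[String]): Unit =
    analogy("man", "king", "woman", 1).foreach { case (w, sim) => println(s"$w $sim") }
}
```

Excluding the input words matters in practice, because the target vector is often closest to one of the words used to build it.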
## Run Analogy in Cluster Mode on localhost
The following example runs Analogy in cluster mode on localhost:
```bash
# Go to this repo
cd
# Download & Extract spark commands (source: https://spark.apache.org/downloads.html)
curl https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz | tar zxf -
# Run a master
./spark-2.0.0-bin-hadoop2.7/sbin/start-master.sh -h localhost -p 7077
# Run a slave
./spark-2.0.0-bin-hadoop2.7/sbin/start-slave.sh spark://localhost:7077
# Generate jar
sbt assembly
# Run Analogy
./spark-2.0.0-bin-hadoop2.7/bin/spark-submit \
  --class "io.github.nwtgck.wikipedia_word2vec_playground.Main" \
  --master spark://localhost:7077 \
  target/scala-2.11/wikipedia-word2vec-playground-assembly-0.1.jar \
  --mode=analogy \
  --wikipedia-dump=$HOME/bigfiles/wikipedia_dumps/enwiktionary-20180101-pages-articles.xml \
  --page-limit=1000
```

## Option Help
```txt
Usage: Wikipedia Word2Vec Playground [options]

  --mode                  play mode (e.g. 'synonym', 'analogy', 'train-only')
  --wikipedia-dump        path of Wikipedia dump XML
  --page-limit            limit of page to use
  --word2vec-iterations   the number of iterations of word2vec
  --out-dir               a path of output directory
```

## Wikipedia Dump XML
Location of Wikipedia Dump is .

## References
*