https://github.com/src-d/swivel-spark-prep
Distributed equivalent of prep.py and fastprep from Swivel using Apache Spark.
- Host: GitHub
- URL: https://github.com/src-d/swivel-spark-prep
- Owner: src-d
- License: apache-2.0
- Created: 2017-05-05T08:11:37.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-07-06T16:20:23.000Z (over 8 years ago)
- Last Synced: 2025-05-05T05:05:06.865Z (6 months ago)
- Topics: apache-spark, swivel
- Language: Scala
- Homepage:
- Size: 108 KB
- Stars: 5
- Watchers: 9
- Forks: 3
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Swivel Spark prep [Build Status](https://travis-ci.org/src-d/swivel-spark-prep)
> Distributed data preparation for the Swivel model
Distributed equivalent of `prep.py` and `fastprep` from [Swivel](https://github.com/tensorflow/models/blob/master/swivel/) using Apache Spark.
# Development
```
./gradlew idea # if using IntelliJ
./gradlew build
./gradlew test
```
# Run
On a single machine, with Apache Spark in local mode
```
./gradlew shadowJar
./sparkprep --help
```
On an Apache Spark standalone cluster
```
./gradlew build
# https://github.com/tensorflow/ecosystem/tree/master/hadoop#build-and-install
cp /target/tensorflow-hadoop-1.0-SNAPSHOT-shaded-protobuf.jar .
# or use the unofficial build from the .m2 or .gradle cache, after ./gradlew shadowJar
cp /tensorflow-hadoop-1.0-01232017-SNAPSHOT-shaded-protobuf.jar .
MASTER="" ./sparkprep-cluster --help
```
# Algorithm
Pre-processing consists of 3 jobs:
1. reading or creating a vocabulary
2. co-occurrence matrix
   - vectorize the input: token->int using the vocabulary
   - build the full dense co-occurrence matrix for a given window size
   - shard the co-occurrence matrix into N pieces (over each dimension)
     * encode each shard in a single Protobuf
     * save N^2 files
3. co-occurrence matrix: count marginal sums
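
The three jobs above can be illustrated on a single machine with a minimal Python sketch. This is not the project's Spark implementation: all function names are hypothetical, the counts are unweighted, the matrix is held as a sparse dictionary instead of a dense array, index `i` is assigned to shard `i % N` (an assumption, the project may partition differently), and the Protobuf encoding step is left out.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Job 1: map each token to an integer id, most frequent first."""
    freq = Counter(tok for sent in sentences for tok in sent)
    kept = sorted((t for t, c in freq.items() if c >= min_count),
                  key=lambda t: (-freq[t], t))
    return {tok: i for i, tok in enumerate(kept)}


def cooccurrences(sentences, vocab, window=2):
    """Job 2: vectorize tokens, then count symmetric pairs inside the window."""
    counts = Counter()
    for sent in sentences:
        ids = [vocab[t] for t in sent if t in vocab]  # token -> int
        for pos, left in enumerate(ids):
            for right in ids[pos + 1:pos + 1 + window]:
                counts[(left, right)] += 1
                counts[(right, left)] += 1
    return counts


def shard(counts, n_shards):
    """Job 2, sharding: split the matrix into n_shards x n_shards blocks.

    Assumption: row i and column j land in block (i % N, j % N).
    """
    shards = {(r, c): {} for r in range(n_shards) for c in range(n_shards)}
    for (row, col), value in counts.items():
        shards[(row % n_shards, col % n_shards)][(row, col)] = value
    return shards


def marginals(counts, vocab_size):
    """Job 3: marginal (row) sums of the co-occurrence matrix."""
    sums = [0] * vocab_size
    for (row, _col), value in counts.items():
        sums[row] += value
    return sums


# Toy corpus, already tokenized.
sentences = [["a", "b", "c"], ["a", "b"]]
vocab = build_vocab(sentences)
counts = cooccurrences(sentences, vocab, window=1)
shards = shard(counts, n_shards=2)
sums = marginals(counts, len(vocab))
print(len(shards))  # 4, i.e. N^2 shards for N = 2
print(sums)         # [2, 3, 1]
```

In the distributed version each of the N^2 shards would be serialized and written as its own file; here they stay in memory so the structure is easy to inspect.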