https://github.com/src-d/swivel-spark-prep

Distributed equivalent of prep.py and fastprep from Swivel using Apache Spark.

apache-spark swivel

Host: GitHub
URL: https://github.com/src-d/swivel-spark-prep
Owner: src-d
License: apache-2.0
Created: 2017-05-05T08:11:37.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2017-07-06T16:20:23.000Z (over 8 years ago)
Last Synced: 2025-05-05T05:05:06.865Z (8 months ago)
Topics: apache-spark, swivel
Language: Scala
Homepage:
Size: 108 KB
Stars: 5
Watchers: 9
Forks: 3
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

# Swivel Spark prep [![Build Status](https://travis-ci.org/src-d/swivel-spark-prep.svg?branch=master)](https://travis-ci.org/src-d/swivel-spark-prep)
> Distributed data preparation for the Swivel model

Distributed equivalent of `prep.py` and `fastprep` from [Swivel](https://github.com/tensorflow/models/blob/master/swivel/) using Apache Spark.

# Development
```
./gradlew idea # if using IntelliJ IDEA
./gradlew build
./gradlew test
```

# Run

On a single machine, with Apache Spark in local mode:
```
./gradlew shadowJar
./sparkprep --help
```

On an Apache Spark standalone cluster:
```
./gradlew build

# https://github.com/tensorflow/ecosystem/tree/master/hadoop#build-and-install
cp /target/tensorflow-hadoop-1.0-SNAPSHOT-shaded-protobuf.jar .
# or use an unofficial build from the .m2 or .gradle cache, after ./gradlew shadowJar
cp /tensorflow-hadoop-1.0-01232017-SNAPSHOT-shaded-protobuf.jar .

MASTER="" ./sparkprep-cluster --help
```

# Algorithm

Pre-processing consists of 3 jobs:
1. reading or creating a vocabulary
2. co-occurrence matrix
   - vectorize the input: token->int using the vocabulary
   - build the full dense co-occurrence matrix for a given window size
   - shard the co-occurrence matrix into N pieces (along each dimension)
     * encode each shard in a single Protobuf
     * save N^2 files
3. co-occurrence matrix: count marginal sums
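
The three jobs above can be illustrated with a minimal single-machine sketch in plain Python. This is not the project's Spark code: the function names, the unweighted counting, and the modulo-based shard assignment are all assumptions made for illustration, and the Protobuf encoding of each shard is left out.

```python
from collections import Counter, defaultdict


def build_vocab(sentences, min_count=1):
    """Job 1: create a vocabulary, most frequent tokens first."""
    freq = Counter(tok for sent in sentences for tok in sent)
    # Sort by descending frequency, then alphabetically for determinism.
    tokens = sorted((t for t, c in freq.items() if c >= min_count),
                    key=lambda t: (-freq[t], t))
    return {tok: i for i, tok in enumerate(tokens)}


def cooccurrence(sentences, vocab, window=2):
    """Job 2: vectorize tokens and count symmetric co-occurrences.

    Assumption: every pair inside the window counts as 1.0 (no distance weighting).
    """
    counts = defaultdict(float)
    for sent in sentences:
        ids = [vocab[t] for t in sent if t in vocab]  # token -> int
        for pos, center in enumerate(ids):
            for ctx in ids[pos + 1: pos + 1 + window]:
                counts[(center, ctx)] += 1.0
                counts[(ctx, center)] += 1.0
    return counts


def shard(counts, n):
    """Job 2, sharding: split into at most n * n shards (n pieces per dimension).

    Assumption: a cell (row, col) goes to shard (row % n, col % n).
    """
    shards = defaultdict(dict)
    for (row, col), value in counts.items():
        shards[(row % n, col % n)][(row, col)] = value
    return shards


def marginal_sums(counts, vocab_size):
    """Job 3: row sums of the co-occurrence matrix."""
    sums = [0.0] * vocab_size
    for (row, _col), value in counts.items():
        sums[row] += value
    return sums


sentences = [["a", "b", "a"], ["b", "c"]]
vocab = build_vocab(sentences)
counts = cooccurrence(sentences, vocab, window=1)
shards = shard(counts, n=2)
sums = marginal_sums(counts, len(vocab))
```

Because the counts are symmetric, the row sums computed here equal the column sums. In a distributed setting each of these functions would roughly correspond to a separate job over partitioned data, with each shard written to its own file.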
