https://github.com/dav009/word2vec-readtext-example
Code for Word2Vec on spark
- Host: GitHub
- URL: https://github.com/dav009/word2vec-readtext-example
- Owner: dav009
- Created: 2015-05-29T15:13:16.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2015-05-19T01:52:44.000Z (over 10 years ago)
- Last Synced: 2025-02-06T03:32:13.673Z (11 months ago)
- Language: Java
- Homepage:
- Size: 133 KB
- Stars: 0
- Watchers: 3
- Forks: 2
- Open Issues: 0
# Word2Vec Text Input Example
This repo provides code that reads a text file, converts its contents to word vectors using Word2Vec, and saves the word vectors to a file.
### Installation
To run this example, first configure your computer as described at this [link](http://nd4j.org/getstarted.html). If you already have Java and Maven installed, run the following:
$ git clone https://github.com/deeplearning4j/deeplearning4j.git
$ cd deeplearning4j && mvn clean install
$ git clone https://github.com/deeplearning4j/nd4j.git
$ cd nd4j && mvn clean install
$ git clone https://github.com/SkymindIO/word2vec-readtext-example.git
$ cd word2vec-readtext-example && mvn clean install
### How it Works
At the command line, run the jar file and provide the following arguments:
$ java -cp -input -output
OR
$ java -cp -input -output -serialize -minWords -vectorLength
A specific command example:
$ java -cp target/Word2VecExample-1.0-SNAPSHOT.jar insideview.Word2VecTextReader src/main/resources/raw-sentences.txt output.txt
OR
$ java -cp target/Word2VecExample-1.0-SNAPSHOT.jar insideview.Word2VecTextReader src/main/resources/raw-sentences.txt output.txt -serialize -minWords 2 -vectorLength 200
You can pass the following arguments to adjust the results:
- **input** (*required*) = path and name of the text file to vectorize
- **output** (*required*) = path and name of the file in which to store the vectors
- **serialize** = [*boolean, default = false*] set to true to write a compressed (serialized) file; otherwise the vectors are written to a text file
- **minWords** = [*int, default = 1*] minimum number of tokens, where a token is defined by the tokenizer. In this example, each token is one word
- **vectorLength** = [*int, default = 300*] length of the feature vector for each token (in this example, each word)
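As a rough illustration of how the arguments and defaults listed above could be handled, here is a minimal Java sketch. It is not the repo's actual code: the class `VectorizerOptions`, its fields, and the choice to treat `input` and `output` as the first two positional arguments (as in the specific command example) are assumptions made for this example.

```java
/**
 * Hypothetical holder for the command-line options described in this README.
 * This is an illustrative sketch, not the class used by the repo.
 */
final class VectorizerOptions {
    final String input;          // required: text file to vectorize
    final String output;         // required: where to store the vectors
    boolean serialize = false;   // README default
    int minWords = 1;            // README default
    int vectorLength = 300;      // README default

    VectorizerOptions(String input, String output) {
        this.input = input;
        this.output = output;
    }

    /** Parses: input output [-serialize] [-minWords N] [-vectorLength N]. */
    static VectorizerOptions parse(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException("input and output paths are required");
        }
        VectorizerOptions options = new VectorizerOptions(args[0], args[1]);
        for (int i = 2; i < args.length; i++) {
            switch (args[i]) {
                case "-serialize":
                    options.serialize = true;
                    break;
                case "-minWords":
                    options.minWords = Integer.parseInt(valueAt(args, ++i, "-minWords"));
                    break;
                case "-vectorLength":
                    options.vectorLength = Integer.parseInt(valueAt(args, ++i, "-vectorLength"));
                    break;
                default:
                    throw new IllegalArgumentException("unknown option: " + args[i]);
            }
        }
        return options;
    }

    // Returns the value following a flag, or fails if the flag is the last argument.
    private static String valueAt(String[] args, int index, String flag) {
        if (index >= args.length) {
            throw new IllegalArgumentException(flag + " needs a value");
        }
        return args[index];
    }
}
```

With this sketch, calling `VectorizerOptions.parse` with only the two paths keeps every default, while adding `-serialize -minWords 2 -vectorLength 200` overrides all three optional settings.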
The vectors for each token are saved to the file at the path and name you provide. Note that the DefaultTokenizer is applied for word tokenization, which is a standard bag-of-words approach.
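If you want to inspect the text output from your own code, a sketch like the following may help. It assumes the non-serialized output has one token per line, followed by its whitespace-separated vector components (the common Word2Vec text layout); check your output file before relying on this. The class `VectorFileReader` is a hypothetical name used only for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader for a word-vector text file; the line format is an assumption. */
final class VectorFileReader {

    /** Parses lines of the assumed form "word v1 v2 ... vn"; blank lines are skipped. */
    static Map<String, double[]> parse(List<String> lines) {
        Map<String, double[]> vectors = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("\\s+");
            double[] values = new double[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                values[i - 1] = Double.parseDouble(parts[i]);
            }
            vectors.put(parts[0], values);
        }
        return vectors;
    }

    /** Reads the file at the given path as UTF-8 and parses it. */
    static Map<String, double[]> load(String path) throws IOException {
        return parse(Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8));
    }
}
```

For example, `VectorFileReader.load("output.txt")` would return a map from each word to its vector, in file order, provided the file matches the assumed layout.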
### Run UI Server
To see how the vectors behave in a k-nearest-neighbors visualization, perform these steps:
- Open deeplearning4j in IntelliJ
- Navigate to the deeplearning4j-ui module
- Select UiServer.java under src/main/java/org.deeplearning4j.ui
- Right click and choose Run
- Open a browser and enter the following in the address bar: http://localhost:8080/word2vec
- Follow the directions on the screen to load your word vector output file.
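To give an idea of what a k-nearest-neighbors lookup over word vectors involves, here is a minimal Java sketch that ranks words by cosine similarity. It is not taken from the deeplearning4j UI and may differ from how the UI computes neighbors; the class `NearestWords` and its methods are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: brute-force nearest neighbors by cosine similarity. */
final class NearestWords {

    /** Cosine similarity of two equal-length vectors; returns 0 if either has zero norm. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns up to k words most similar to the query word, excluding the query itself. */
    static List<String> nearest(Map<String, double[]> vectors, String query, int k) {
        double[] queryVector = vectors.get(query);
        if (queryVector == null) {
            throw new IllegalArgumentException("unknown word: " + query);
        }
        List<String> candidates = new ArrayList<>(vectors.keySet());
        candidates.remove(query);
        // Sort by similarity to the query, most similar first.
        candidates.sort(Comparator
                .comparingDouble((String w) -> cosine(queryVector, vectors.get(w)))
                .reversed());
        return candidates.subList(0, Math.min(k, candidates.size()));
    }
}
```

This brute-force version compares the query against every word, which is fine for small vocabularies; larger ones usually call for an index structure.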