https://github.com/daandouwe/svd-doc2vec
Turn documents into vectors by decomposing a PPMI co-occurrence matrix.
doc2vec ppmi svd wikitext
- Host: GitHub
- URL: https://github.com/daandouwe/svd-doc2vec
- Owner: daandouwe
- Created: 2018-09-12T16:48:05.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-09-13T14:51:47.000Z (over 7 years ago)
- Last Synced: 2025-03-20T04:01:42.559Z (about 1 year ago)
- Topics: doc2vec, ppmi, svd, wikitext
- Language: HTML
- Size: 133 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Doc2vec with PPMI-SVD
Factorize a document-word co-occurrence matrix, reweighted with positive pointwise mutual information (PPMI), using singular value decomposition (SVD). The leading singular directions give a low-dimensional vector for each document.
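The idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code in `main.py`: the function names `ppmi` and `doc_vectors` are hypothetical, the toy count matrix is made up, and scaling the left singular vectors by the singular values is one common choice that this project may or may not use.

```python
import numpy as np


def ppmi(counts):
    """Positive PMI of a (documents x words) count matrix."""
    counts = np.asarray(counts, dtype=float)
    p_dw = counts / counts.sum()            # joint probability p(d, w)
    p_d = p_dw.sum(axis=1, keepdims=True)   # marginal p(d), shape (D, 1)
    p_w = p_dw.sum(axis=0, keepdims=True)   # marginal p(w), shape (1, W)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_dw / (p_d * p_w))
    pmi[~np.isfinite(pmi)] = 0.0            # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)             # keep only positive associations


def doc_vectors(counts, dim):
    """Truncated SVD of the PPMI matrix; one row per document.

    Assumption: vectors are U * S restricted to the top `dim` components.
    """
    U, S, _ = np.linalg.svd(ppmi(counts), full_matrices=False)
    return U[:, :dim] * S[:dim]


# Toy example: documents 0 and 1 share vocabulary, document 2 does not.
counts = np.array([[2, 1, 0, 0],
                   [1, 2, 0, 0],
                   [0, 0, 3, 1]])
vecs = doc_vectors(counts, dim=2)
print(vecs.shape)
```

In this sketch `dim` plays the role that `--dim` presumably plays on the command line. For a realistic vocabulary, a sparse truncated SVD (for example `scipy.sparse.linalg.svds`) would likely be a better fit than the dense decomposition used here.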
## Setup
We use the [WikiText dataset](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset).
To extract documents from WikiText and save them as a JSON file, run:
```bash
mkdir data
./parse-wikitext.py wikitext-2-raw/wiki.train.raw data/wikitext-2-raw.docs.json
```
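For orientation, here is a rough sketch of what such an extraction step could look like. It is not the contents of `parse-wikitext.py`. It assumes that the raw WikiText files mark each article with a top-level heading line of the form ` = Title = ` (sub-sections use ` = = Section = = `), and the output schema (a list of objects with `title` and `text` keys) and the name `split_documents` are hypothetical.

```python
import json


def split_documents(lines):
    """Group raw WikiText lines into documents, one per top-level heading."""
    docs, title, body = [], None, []
    for raw in lines:
        line = raw.strip()
        # Top-level headings look like "= Title ="; "= = Section = =" is a sub-heading.
        is_title = (line.startswith("= ") and line.endswith(" =")
                    and not line.startswith("= ="))
        if is_title:
            if title is not None:
                docs.append({"title": title, "text": " ".join(body)})
            title, body = line[2:-2].strip(), []
        elif line and title is not None:
            # Sub-headings are kept as ordinary text in this sketch.
            body.append(line)
    if title is not None:
        docs.append({"title": title, "text": " ".join(body)})
    return docs


sample = [
    " = First Article = \n",
    " \n",
    " Some opening text . \n",
    " = = Section = = \n",
    " More text . \n",
    " = Second Article = \n",
    " Other text . \n",
]
docs = split_documents(sample)
print(json.dumps(docs, indent=2))
```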
## Usage
From the project directory, run
```bash
mkdir vec
./main.py --data data/wikitext-2-raw.docs.json --outpath vec/wikitext-2-raw.vec.txt \
--lower --num-words 1000 --dim 10
```
for a quick demo. Plots are saved in the folder `plots`.
To rank the documents based on the vectors, use:
```bash
./rank.py vec/wikitext-2-raw.vec.txt > wikitext-2-raw.ranking.txt
```
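The README does not say which ranking criterion `rank.py` applies. One plausible reading is ranking documents by similarity to a query document, which could look roughly like the sketch below. Everything here is an assumption: the use of cosine similarity, the function name `rank_by_cosine`, and the in-memory toy vectors (the on-disk format of the `.vec.txt` file is not documented).

```python
import numpy as np


def rank_by_cosine(vectors, query_index):
    """Return (index, similarity) pairs, most similar to the query first."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)   # guard against zero vectors
    sims = unit @ unit[query_index]             # cosine similarity to the query
    order = np.argsort(-sims, kind="stable")    # descending similarity
    return [(int(i), float(sims[i])) for i in order if i != query_index]


# Toy document vectors (made up for illustration).
vectors = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0],
                    [-1.0, 0.0]])
ranking = rank_by_cosine(vectors, query_index=0)
for index, similarity in ranking:
    print(index, round(similarity, 3))
```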
## Requirements
```
numpy
scipy
tqdm
matplotlib
scikit-learn
bokeh
```