https://github.com/daandouwe/svd-doc2vec

Turn documents into vectors by decomposing a PPMI co-occurrence matrix.

# Doc2vec with PPMI-SVD
Factor a document-word co-occurrence matrix, reweighted with positive pointwise mutual information (PPMI), using singular value decomposition (SVD).
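
The idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `main.py`: the function names `ppmi` and `doc_vectors` and the toy count matrix are assumptions made for the example, and scaling the left singular vectors by the singular values is one common choice among several.

```python
import numpy as np

def ppmi(counts):
    """Reweight a (documents x words) count matrix with PPMI (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # per-document totals
    col = counts.sum(axis=0, keepdims=True)   # per-word totals
    expected = row @ col / total              # counts expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0              # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)               # keep only positive associations

def doc_vectors(counts, dim):
    """Return one dim-dimensional vector per document via truncated SVD."""
    u, s, _ = np.linalg.svd(ppmi(counts), full_matrices=False)
    return u[:, :dim] * s[:dim]

# Toy data: documents 0 and 1 share words, document 2 uses different ones.
counts = np.array([
    [3, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 4, 1],
])
vectors = doc_vectors(counts, dim=2)
print(vectors.shape)
```

For a large, sparse matrix a sparse solver such as `scipy.sparse.linalg.svds` would likely be a better fit than the dense `np.linalg.svd` used here.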

## Setup
We use the [WikiText dataset](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset).

To extract documents from WikiText and save them as a JSON file, run:
```bash
mkdir data
./parse-wikitext.py wikitext-2-raw/wiki.train.raw data/wikitext-2-raw.docs.json
```
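
For orientation, the splitting step might look roughly like the sketch below. It assumes the raw WikiText convention that an article starts at a line of the form ` = Title = ` and that section headings use ` = = Section = = `. The function name `split_documents` and the `title`/`text` JSON layout are hypothetical and may differ from what `parse-wikitext.py` writes.

```python
import json

def split_documents(lines):
    """Group raw WikiText lines into one record per article (illustrative only)."""
    docs, title, body = [], None, []
    for line in lines:
        text = line.strip()
        is_heading = text.startswith("= ") and text.endswith(" =")
        is_section = text.startswith("= =")
        if is_heading and not is_section:
            # A new top-level title closes the previous article.
            if title is not None:
                docs.append({"title": title, "text": " ".join(body)})
            title, body = text[2:-2].strip(), []
        elif text and not is_section and title is not None:
            body.append(text)                 # section headings are skipped
    if title is not None:
        docs.append({"title": title, "text": " ".join(body)})
    return docs

sample = [
    " = Alpha = \n",
    " \n",
    " First paragraph . \n",
    " = = History = = \n",
    " More text . \n",
    " = Beta = \n",
    " Second article . \n",
]
docs = split_documents(sample)
print(json.dumps(docs, indent=2))
```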

## Usage
From the project root, run
```bash
mkdir vec
./main.py --data data/wikitext-2-raw.docs.json --outpath vec/wikitext-2-raw.vec.txt \
--lower --num-words 1000 --dim 10
```
for a quick demo. Plots are saved in the folder `plots`.
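
As a rough guide to what the options control, the sketch below builds a document-word count matrix under one plausible reading of the flags: `--lower` lowercases the text and `--num-words` keeps only the most frequent words (`--dim` would then set the size of the output vectors). That interpretation, the whitespace tokenization, and the helper name `count_matrix` are assumptions, not taken from `main.py`.

```python
from collections import Counter
import numpy as np

def count_matrix(docs, num_words, lower=True):
    """Count how often each of the num_words most frequent words occurs per document."""
    tokenized = [(d.lower() if lower else d).split() for d in docs]
    freq = Counter(tok for doc in tokenized for tok in doc)
    vocab = [w for w, _ in freq.most_common(num_words)]
    index = {w: j for j, w in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)), dtype=np.int64)
    for i, doc in enumerate(tokenized):
        for tok in doc:
            j = index.get(tok)
            if j is not None:                 # words outside the vocabulary are dropped
                counts[i, j] += 1
    return counts, vocab

texts = ["The cat sat", "the dog sat down", "A cat"]
counts, vocab = count_matrix(texts, num_words=3)
print(vocab)
print(counts)
```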

To rank the documents based on the vectors, use:
```bash
./rank.py vec/wikitext-2-raw.vec.txt > wikitext-2-raw.ranking.txt
```
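
The README does not say which criterion `rank.py` uses. One common way to rank documents from their vectors is by cosine similarity to a query document, sketched below. The function `rank_by_cosine` and the toy vectors are hypothetical and only illustrate that approach.

```python
import numpy as np

def rank_by_cosine(vectors, query_index):
    """Return (index, similarity) pairs for all other documents, most similar first."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)   # guard against zero vectors
    scores = unit @ unit[query_index]
    order = np.argsort(-scores, kind="stable")
    return [(int(i), float(scores[i])) for i in order if i != query_index]

toy_vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
]
ranking = rank_by_cosine(toy_vectors, query_index=0)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```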

## Requirements
```
numpy
scipy
tqdm
matplotlib
scikit-learn
bokeh
```