https://github.com/daandouwe/svd-doc2vec

Turn documents into vectors by decomposing a PPMI co-occurrence matrix.

Host: GitHub
URL: https://github.com/daandouwe/svd-doc2vec
Owner: daandouwe
Created: 2018-09-12T16:48:05.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2018-09-13T14:51:47.000Z (almost 8 years ago)
Last Synced: 2025-03-20T04:01:42.559Z (over 1 year ago)
Topics: doc2vec, ppmi, svd, wikitext
Language: HTML
Size: 133 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Doc2vec with PPMI-SVD
Factor a document-word co-occurrence matrix, scaled with positive pointwise mutual information (PPMI), using singular value decomposition (SVD).
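
The idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code in `main.py`: the function names `ppmi` and `doc_vectors` and the toy count matrix are hypothetical, and scaling the left singular vectors by the singular values is one common choice among several.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI of a (documents x words) count matrix."""
    counts = np.asarray(counts, dtype=float)
    p_dw = counts / counts.sum()               # joint p(d, w)
    p_d = p_dw.sum(axis=1, keepdims=True)      # marginal p(d)
    p_w = p_dw.sum(axis=0, keepdims=True)      # marginal p(w)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_dw / (p_d * p_w))
    pmi[~np.isfinite(pmi)] = 0.0               # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)                # keep only positive PMI

def doc_vectors(counts, dim):
    """Truncated SVD of the PPMI matrix; one row per document."""
    U, S, _ = np.linalg.svd(ppmi(counts), full_matrices=False)
    # Assumption: scale the left singular vectors by the singular values.
    return U[:, :dim] * S[:dim]

# Hypothetical toy counts: 3 documents, 4 vocabulary words.
counts = np.array([
    [4, 2, 0, 0],
    [3, 1, 0, 1],
    [0, 0, 5, 2],
])
vecs = doc_vectors(counts, dim=2)
print(vecs.shape)
```

For a real vocabulary the count matrix is sparse, so a sparse truncated solver such as `scipy.sparse.linalg.svds` would likely be a better fit than the dense `np.linalg.svd` used here.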

## Setup
We use the [WikiText dataset](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset).

To extract documents from WikiText and save them as a JSON file, run:
```bash
mkdir data
./parse-wikitext.py wikitext-2-raw/wiki.train.raw data/wikitext-2-raw.docs.json
```
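
As a rough illustration of what such an extraction step involves, the sketch below splits raw WikiText lines into documents at top-level headings. It is not the logic of `parse-wikitext.py`: the helper names, the heading heuristic (article titles look like ` = Title = `, section headings like ` = = Section = = `), and the output layout with `title` and `text` keys are all assumptions.

```python
import json

def is_title(line):
    """Assumed heuristic: '= Title =' starts an article, '= = X = =' does not."""
    s = line.strip()
    return s.startswith("= ") and s.endswith(" =") and not s.startswith("= =")

def split_documents(lines):
    """Group lines into documents, starting a new one at each article title."""
    docs, current = [], None
    for line in lines:
        if is_title(line):
            current = {"title": line.strip().strip("=").strip(), "text": []}
            docs.append(current)
        elif current is not None and line.strip():
            # Section headings are kept as ordinary text in this sketch.
            current["text"].append(line.strip())
    return [{"title": d["title"], "text": " ".join(d["text"])} for d in docs]

# Hypothetical sample in the shape of the raw WikiText files.
sample = [
    " = First Article = \n",
    " \n",
    " Some text . \n",
    " = = Section = = \n",
    " More text . \n",
    " = Second Article = \n",
    " Other text . \n",
]
docs = split_documents(sample)
print(json.dumps(docs, indent=2))
```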

## Usage
In a terminal at the project root, run
```bash
mkdir vec
./main.py --data data/wikitext-2-raw.docs.json --outpath vec/wikitext-2-raw.vec.txt \
--lower --num-words 1000 --dim 10
```
for a quick demo. Plots are saved in the folder `plots`.

To rank the documents based on the vectors, use:
```bash
./rank.py vec/wikitext-2-raw.vec.txt > wikitext-2-raw.ranking.txt
```
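
The ranking criterion used by `rank.py` is not described here. One plausible reading is a nearest-neighbour ranking by cosine similarity, which the sketch below illustrates; the function name `rank_by_cosine` and the toy vectors are hypothetical, and the sketch works on an in-memory array rather than the `.vec.txt` file.

```python
import numpy as np

def rank_by_cosine(vectors, query_index):
    """Rank all other documents by cosine similarity to one query document."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)   # guard against zero vectors
    sims = unit @ unit[query_index]
    order = np.argsort(-sims)                   # most similar first
    return [(int(i), float(sims[i])) for i in order if i != query_index]

# Hypothetical 2-dimensional document vectors.
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
ranking = rank_by_cosine(vectors, 0)
for index, similarity in ranking:
    print(index, round(similarity, 3))
```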

## Requirements
```
numpy
scipy
tqdm
matplotlib
scikit-learn
bokeh
```
