https://github.com/czlee/ee378a-project

Fundamental limits in language modeling

Host: GitHub
URL: https://github.com/czlee/ee378a-project
Owner: czlee
Created: 2017-06-14T11:28:19.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2017-06-15T01:01:22.000Z (about 8 years ago)
Last Synced: 2025-03-12T12:15:43.660Z (4 months ago)
Language: Python
Size: 12.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Fundamental limits in language modeling

*Chuan-Zheng Lee*

*EE 378A, Stanford University*

## Requisites
You need to have all of the following installed:
* **Julia**, with packages GZip, ArgParse and DataStructures
* **Python 2.7 and 3** (most scripts use Python 2, but one uses Python 3, sorry)
* **Matlab**

Then, obtain these estimators. Please pay attention to the instructions about symbolic links: scripts assume those links exist, and imports will fail if they are not set up correctly.

* The **Jiao–Venkat–Han–Weissman (JVHW) estimator**, in Python, available by [cloning this repository](https://github.com/EEthinker/JVHW_Entropy_Estimators/).
    * Symbolic links pointing at `est_entro.py` and `poly_coeff_entro.mat` in the JVHW repository should be made in the **python** directory, with the same names.
* The **profile maximum likelihood (PML) estimator**, in Python **and** Matlab, available by [cloning this repository](https://github.com/dmitrip/PML/).
    * A symbolic link pointing at `pml.py` in the PML repository should be made in the **python** directory, with the same name.
    * For the Matlab version, the relevant scripts should be placed somewhere on the Matlab search path.
* The **Valiant and Valiant (VV) estimator**, in Matlab, available at http://theory.stanford.edu/~valiant/code.html.
    * Download the (standard, not-for-large-scale) code under "Estimating the Unseen". Its unzipped contents should be placed somewhere on the Matlab search path.

Of course, you can also copy the files rather than making symbolic links. The important thing is that the scripts `est_entro.py` and `pml.py`, and the data file `poly_coeff_entro.mat`, appear to be in the **python** directory.
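If you want to confirm the setup before running anything, a check along the following lines may help. It is only a sketch and is not part of this repository: the function name `missing_estimator_files` is made up for illustration, and the only thing taken from the instructions above is the three file names.

```python
import os

# File names the scripts expect to find in the python directory (from the list above).
REQUIRED_FILES = ("est_entro.py", "poly_coeff_entro.mat", "pml.py")


def missing_estimator_files(python_dir):
    """Return the required files that are not visible in python_dir.

    os.path.exists follows symbolic links, so a dangling link counts as missing.
    (Hypothetical helper, not part of the repository.)
    """
    return [name for name in REQUIRED_FILES
            if not os.path.exists(os.path.join(python_dir, name))]
```

Calling `missing_estimator_files("python")` from the repository root should return an empty list once the links (or copies) are in place.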

## Obtaining the datasets

To run any of the scripts, you'll need to have the relevant datasets handy.

* **Penn Treebank.** If you're running this on Stanford's AFS, the scripts know where to find this data. If you're not, then you'll need to specify the directory for all scripts below that examine the PTB.
* **Web 1T 5-gram.**
    * For unigrams through trigrams, if you're running this on Stanford's AFS, the scripts know where to find this data. If you're not, then you'll need to specify the directory for some of the scripts below.
    * For quadrigrams and quintigrams, you'll need to specify the directory, where relevant. (The default option assumes they're in `/home/czlee/gms/`; they presumably won't be on your computer.)
* **One Billion Words.** Follow the instructions to download and preprocess this data at https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark. The scripts can operate on either the raw or processed (shuffled and tokenized) data, but I'd (strongly) recommend running on the processed data, because the processing also removes duplicates.

## Running the scripts

In every snippet, supply the length of the _n_-gram in question, and replace `grams` with `unigrams`, `bigrams`, `trigrams`, _etc._ You need to run the script separately for each _n_.
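Since each script has to be run once per _n_, it can be convenient to generate the file names rather than type them. The sketch below is an illustration only: the names come from the file names mentioned in this README (`unigrams` through `septigrams`), while `gram_name` and `output_path` are hypothetical helpers that do not exist in the repository.

```python
# n-gram names, taken from the file names mentioned in this README.
GRAM_NAMES = {
    1: "unigrams",
    2: "bigrams",
    3: "trigrams",
    4: "quadrigrams",
    5: "quintigrams",
    6: "sexigrams",
    7: "septigrams",
}


def gram_name(n):
    """Return the name used in file names for n-grams of length n."""
    if n not in GRAM_NAMES:
        raise ValueError("no name known for n = {}".format(n))
    return GRAM_NAMES[n]


def output_path(directory, n, suffix=".txt"):
    """Build an output path such as ptb/jvhw/unigrams.txt (hypothetical helper)."""
    return "{}/{}{}".format(directory, gram_name(n), suffix)
```

For example, `output_path("1bw/entropy", 6, "-pretokenized.txt")` gives the file name that the One Billion Words instructions expect for _n_ = 6.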

### Penn Treebank, naïve approach
``` bash
cd julia
julia ptb_frequencies.jl # this generates files that the naïve estimator then uses
julia naive.jl ptb # this does not write to any file itself
# redirect the output of the last command using `>` if you want to save results
```
If you're not running this on Stanford's AFS, you'll need to specify the location of the data files for `ptb_frequencies.jl` using the `--source-dir` option.
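For orientation, the "naïve" estimate of entropy is usually understood to be the plug-in (maximum likelihood) estimate, which treats the empirical frequencies as if they were the true probabilities. The sketch below assumes that is what is meant here. It is written in Python 3 rather than Julia, it is not the code in `naive.jl`, and both the function name `plugin_entropy` and the choice of bits as the unit are assumptions made for illustration.

```python
import math


def plugin_entropy(counts):
    """Plug-in entropy estimate, in bits, from a list of n-gram counts.

    Each count is the number of times one distinct n-gram was observed.
    (Illustrative only; not the repository's implementation.)
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one observation")
    # Empirical probability of each n-gram is count / total; zero counts contribute nothing.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
```

With four distinct n-grams seen once each, `plugin_entropy([1, 1, 1, 1])` is 2 bits. The plug-in estimate tends to underestimate entropy when many n-grams are rare or unseen, which is part of the motivation for the other estimators used below.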

### Penn Treebank, JVHW estimator
Make sure the JVHW symlinks are set up (see above).
``` bash
cd python
mkdir -p ptb/jvhw
python penntreebank.py -J ptb/jvhw/grams.txt
```
You can use any output file name you want, but the `results_to_csv.py` script [below](#ptb-pml-and-ptb-vv) assumes that the files will be called `unigrams.txt`, …, `septigrams.txt`. You can use any output directory you want; `results_to_csv.py` takes the directory name as its second argument. Saving the results to a known file is only important if you want to generate the evolution plots in the Additional Figures.

### Penn Treebank, PML and VV estimators
These involve two steps: First, preprocessing the data using a Python script; then, using a Matlab script to generate the estimates. (At time of writing, the estimators were only available in Matlab.)

*Step 1 (preprocess the data):*
``` bash
cd python
mkdir -p ptb/indices
python penntreebank.py -Q ptb/indices/grams.csv
```

*Step 2 (generate the estimates):*

Start Matlab in the `matlab` directory and run `ptb_pml.m` for PML or `ptb_vv.m` for VV. When a script completes, the final estimates are in `estimates`, the progression of the estimate with increasing sample length is in the cell array `progression`, and the results are saved to `pml.mat` or `vv.mat` respectively.

The file name `ptb/indices/grams.csv` must be exactly as is: The Matlab scripts assume the file will be in that location.

### Web 1T 5-gram, naïve approach
``` bash
cd julia
julia naive.jl 1t5 # this does not write to any file itself
# redirect the output using `>` if you want to save results
```

For unigrams through trigrams, if you're not running this on Stanford's AFS, you'll need to specify the location of the data files using the `--directory` option. For quadrigrams and quintigrams, you'll always need to specify the location of the data files using the `--directory` option.

### Web 1T 5-gram, JVHW estimator
``` bash
cd python
python web1t5gram_fingerprint.py grams-fingerprint.tsv
python web1t5gram_entropy.py grams-fingerprint.tsv
```

You can use any file name you want, so long as the file name you pass to the first script (the output file) is the same as the name you pass to the second script (the input file).

For unigrams through trigrams, if you're not running this on Stanford's AFS, you'll need to specify the location of the data files for (only) `web1t5gram_fingerprint.py` using the `--directory` option. For quadrigrams and quintigrams, you'll always need to specify the location of the data files using the `--directory` option.
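The two-step structure works because entropy estimates of this kind depend on the data only through its fingerprint: the number of distinct _n_-grams that appear exactly _j_ times, for each _j_. The sketch below illustrates the idea in Python 3. It is not the repository's code, the function names are hypothetical, the layout of the TSV file is not shown here, and the second function computes the plug-in estimate rather than the JVHW estimate, purely to show that a fingerprint carries enough information.

```python
import math
from collections import Counter


def fingerprint(counts):
    """Map each frequency j to the number of distinct n-grams seen exactly j times."""
    return dict(Counter(c for c in counts if c > 0))


def plugin_entropy_from_fingerprint(fp):
    """Plug-in entropy estimate, in bits, computed from a fingerprint alone.

    fp maps a frequency j to the number of distinct n-grams with that frequency.
    (Illustrative only; the repository uses the JVHW estimator at this step.)
    """
    total = sum(j * f for j, f in fp.items())
    if total <= 0:
        raise ValueError("need at least one observation")
    # Every n-gram with frequency j has empirical probability j / total.
    return -sum(f * (j / total) * math.log2(j / total) for j, f in fp.items())
```

For counts `[1, 1, 2, 3, 3, 3]`, `fingerprint` returns `{1: 2, 2: 1, 3: 3}`: two _n_-grams seen once, one seen twice and three seen three times.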

### One Billion Words, JVHW and PML estimators
The `-t` option makes the script examine the pretokenized data rather than the raw data. Running on the raw data is far too slow, and the raw data contains lots of duplicates.
``` bash
cd python
mkdir -p 1bw/entropy
python onebillionwords.py 5 -tJP 1bw/entropy/grams-pretokenized.txt -d
```

The `-d` option should be followed by the directory where you cloned the [1BW Git repository](https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark).

You can use any output file name you want, but the `results_to_csv.py` script [below](#ptb-pml-and-ptb-vv) assumes that the files will be called `unigrams-pretokenized.txt`, …, `sexigrams-pretokenized.txt`. You can use any output directory you want; `results_to_csv.py` takes the directory name as its second argument. Saving the results to a known file is only important if you want to generate the evolution plots in the Additional Figures.

## Generating evolution plots
*Note: The entropy rate and entropy rate estimate plots in the report aren't produced in Matlab; they're plotted directly in LaTeX using data manually transferred from the results of running the above. These instructions are for the plots in the* Additional Figures *of the report.*

### 1BW JVHW, 1BW PML and PTB JVHW

First, generate the CSV files that Matlab will use to generate these plots (this is the script that uses Python 3):
``` bash
cd python
python3 result_to_csv.py ptb ptb/jvhw jvhw
python3 result_to_csv.py 1bw 1bw/entropy jvhw
python3 result_to_csv.py 1bw 1bw/entropy pml
```

Then, change the first two lines of `entropy_plots.m` to the options you want, and run the script in Matlab.

### PTB PML and PTB VV

The files `ptb_pml.m` and `ptb_vv.m` must have been run before generating plots. (They save the results to `pml.mat` and `vv.mat` respectively.)

Change the first line of `ptb_plots.m` to the option you want, and run the script in Matlab.
