https://github.com/zphang/usc_dae
Repository for Unsupervised Sentence Compression using Denoising Auto-Encoders
- Host: GitHub
- URL: https://github.com/zphang/usc_dae
- Owner: zphang
- Created: 2018-08-24T00:38:42.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2024-07-25T10:12:42.000Z (10 months ago)
- Last Synced: 2025-01-17T11:10:38.514Z (4 months ago)
- Language: Python
- Size: 151 KB
- Stars: 46
- Watchers: 4
- Forks: 15
- Open Issues: 5
Metadata Files:
- Readme: README.md
README
# Experiments in Unsupervised Summarization
This is our [PyTorch](https://github.com/pytorch/pytorch) implementation of the summarization methods described in *Unsupervised Sentence Compression using Denoising Auto-Encoders* (CoNLL 2018). It features denoising additive auto-encoders with optional NLI hidden-state initialization (based on [InferSent](https://github.com/facebookresearch/InferSent)).

Table of Contents
=================

* [Requirements](#requirements)
* [Quickstart](#quickstart)
## Requirements

```bash
pip install -r requirements.txt
```

## Quickstart
### Step 1: Get the data and create the vocabulary
The Gigaword data can be downloaded from https://github.com/harvardnlp/sent-summary. Extract it with ```tar -xzf summary.tar.gz```. The vocabulary can then be created by running ```python src/datasets/preprocess.py train_data_file output_voc_file```.
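If it helps to see what this step produces, the sketch below shows what a typical vocabulary-building pass looks like. It is illustrative only: the special tokens, `min_count` threshold, and output format are assumptions for the sketch, not the actual behaviour of `src/datasets/preprocess.py`.

```python
from collections import Counter

# Illustrative only (not the actual src/datasets/preprocess.py): a vocabulary build
# typically counts token frequencies in the training file and assigns integer ids.
# The special tokens and min_count threshold here are assumptions for the sketch.
def build_vocab(train_data_file, min_count=1, specials=("<pad>", "<unk>", "<s>", "</s>")):
    counts = Counter()
    with open(train_data_file, encoding="utf-8") as f:
        for line in f:
            counts.update(line.strip().split())
    # Special tokens first, then words by descending frequency
    words = list(specials) + [w for w, c in counts.most_common() if c >= min_count]
    return {w: i for i, w in enumerate(words)}
```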
### Step 2: Create an environment configuration file
This file tells the code where to find the datasets and embeddings on your machine, whether to use the GPU, and so on. An example configuration is provided at ```env_configs/env_config.json```. The NLI variables only need to be set if you use InferSent embeddings.
Then set the environment variable `NLU_ENV_CONFIG_PATH` to point to that file (e.g. `export NLU_ENV_CONFIG_PATH="env_configs/env_config.json"`).
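For reference, the configuration is a plain JSON file, so loading it via the environment variable looks roughly like the sketch below. How the repository's code actually reads it may differ; only the variable name and the example path come from this README.

```python
import json
import os

# Minimal sketch of how such an environment config is typically consumed:
# NLU_ENV_CONFIG_PATH (the variable named in this README) points at a JSON file,
# e.g. env_configs/env_config.json; the available keys are whatever that file defines.
config_path = os.environ["NLU_ENV_CONFIG_PATH"]
with open(config_path) as f:
    env_config = json.load(f)
print(sorted(env_config.keys()))
```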
### Step 3: Train the model
Simply run:
```bash
python sample_scripts/dae_json.py runs/default/default.json
```

### Step 4: Run inference
```bash
python sample_scripts/simple_inference.py model_path test_data_path [output_data_path]
```

### Step 5: Evaluate ROUGE scores
To evaluate ROUGE, we use [files2rouge](https://github.com/pltrdy/files2rouge), which itself uses [pythonrouge](https://github.com/tagucci/pythonrouge). Installation instructions:
```bash
pip install git+https://github.com/tagucci/pythonrouge.git
git clone https://github.com/pltrdy/files2rouge.git
cd files2rouge
python setup_rouge.py
python setup.py install
```

To run evaluation, simply run:
```bash
files2rouge summaries.txt references.txt
```

## FAQ
* **Random seed**: We did not use a random seed or random restarts for the results in the paper
* **Teacher forcing**: We used teacher forcing in all of our experiments
* **Beam search**: We decoded using greedy decoding only, never using beam search
* **Added noise**: Noise is added on a per-sentence basis, not based on the max length in a batch. This is critical for performance (see the sketch below)
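To make the last point concrete, here is a hedged sketch of length-proportional additive noising. The `noise_ratio` value, the choice of donor sentence, and the shuffle step are illustrative assumptions rather than the repository's exact implementation; the point is only that the amount of added noise depends on each sentence's own length, never on the longest sentence in the batch.

```python
import random

# Sketch only (assumptions noted above, not the repository's exact code):
# the number of extra words added to a sentence is proportional to that
# sentence's own length, never to the max length in the batch.
def add_noise(sentence_tokens, donor_tokens, noise_ratio=0.5, rng=random):
    n_extra = int(round(noise_ratio * len(sentence_tokens)))
    extra = rng.sample(donor_tokens, min(n_extra, len(donor_tokens)))
    noised = sentence_tokens + extra
    rng.shuffle(noised)  # assumed shuffle, so the model must reorder as well as remove
    return noised

def noise_batch(batch, noise_ratio=0.5, rng=random):
    # Each sentence borrows its extra words from another sentence in the batch.
    return [
        add_noise(sent, batch[(i + 1) % len(batch)], noise_ratio, rng)
        for i, sent in enumerate(batch)
    ]
```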