https://github.com/enhuiz/humicroedit
- Host: GitHub
- URL: https://github.com/enhuiz/humicroedit
- Owner: enhuiz
- Created: 2019-11-05T10:07:06.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-11-29T11:47:28.000Z (over 5 years ago)
- Last Synced: 2025-01-07T18:28:30.307Z (5 months ago)
- Language: Python
- Size: 13.9 MB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Humicroedit
## Directory
The directory listing below is out of date; it will be updated later.
```plain
.
├── data                        <- dataset, read-only
│   ├── task-1
│   │   ├── dev.csv
│   │   └── train.csv
│   └── task-2
│       ├── dev.csv
│       └── train.csv
├── humicroedit                 <- main code
│   ├── __init__.py
│   ├── datasets
│   │   ├── __init__.py
│   │   ├── humicroedit.py      <- PyTorch loader for the dataset; texts are preprocessed here
│   │   └── vocab.py            <- vocab, which maps each word to an index; the model is trained on indices instead of raw words
│   ├── networks
│   │   ├── __init__.py         <- all models are assembled here
│   │   ├── encoders
│   │   │   ├── __init__.py
│   │   │   └── lstm.py         <- a residual LSTM encoder
│   │   ├── layers.py
│   │   └── losses
│   │       ├── __init__.py
│   │       └── mean_squared_error.py  <- MSE loss for regression
│   └── utils.py
├── official                    <- official baseline repo
├── README.md
└── scripts                     <- helper scripts
    ├── data
    │   └── download.sh
    ├── test.py                 <- predict; results are written to results/
    └── train.py                <- all training goes through this script
```

## Setup
### Clone the project
```
git clone --recursive https://github.com/enhuiz/humicroedit
```

### Install dependencies & download dataset
```
./scripts/setup/humicroedit.sh
./scripts/setup/comet.sh
```

Some datasets are already provided in the repo, so there is no need to download them again.
### Download pretrained models for COMET
Please manually download the pretrained model from [here](https://drive.google.com/open?id=1FccEsYPUHnjzmX-Y5vjCBeyRt1pLo8FB) and untar it into `comet/pretrained_model`.
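For reference, the extraction step might look like the following (the archive filename `comet_pretrained.tar.gz` is a placeholder, not the repo's; substitute whatever name the Google Drive download gives you):

```shell
# Placeholder archive name -- replace with the actual downloaded file.
ARCHIVE=comet_pretrained.tar.gz
mkdir -p comet/pretrained_model
# -xf auto-detects the compression format of the archive.
tar -xf "$ARCHIVE" -C comet/pretrained_model
```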
## Preprocess
Apply basic preprocess on the sentence:
```
./data/preprocess.py
```

Fetch the COMET object given the subject over all relations (this step is optional):
```
./data/comet.py
```

## Train & test
The model name takes the form `model-encoder-loss`, e.g. `bert-transformer-sce`. Currently, the possible choices for each part are listed below:
```
model:   baseline, bert
encoder: lstm, transformer
loss:    sce, mse
```
Here `sce` stands for soft cross entropy loss and `mse` for mean squared error loss.
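The repo does not spell out how its soft cross entropy is implemented; a common formulation, sketched below as an assumption (the function name and signature are mine, not the repo's), treats each example's normalized annotator-grade histogram as a soft target distribution instead of a single hard label:

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, target_dist):
    """Cross entropy against a soft (non-one-hot) target distribution.

    logits:      (batch, num_grades) raw model scores
    target_dist: (batch, num_grades) rows summing to 1, e.g. the
                 normalized histogram of annotator grades
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Per-example -sum(p * log q), averaged over the batch.
    return -(target_dist * log_probs).sum(dim=-1).mean()
```

With a one-hot `target_dist` this reduces to the standard cross entropy, so it can be sanity-checked against `F.cross_entropy`.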
- Train:
```
./scripts/train.py --name bert-transformer-sce
```

- Test:
```
./scripts/test.py --name bert-transformer-sce
```

## History
- 2019-11-26
1. Add soft cross entropy loss.
2. Add lemmatization using [spacy](https://spacy.io/).
3. Use COMET model trained on ATOMIC to relate our data to the corresponding object in the knowledge graph, see `data/humicroedit/task-1/*.kg.csv`.
4. Add the BERT pretraining (masking only) part.
- 2019-11-27
1. Use joint BERT training instead of BERT pretraining (beat the baseline for the first time, by 0.001 :)).

## Planning
- 2019-11-27
1. Try to incorporate the knowledge graph into the training.
2. Run experiments to see whether there is improvement.
3. Maybe start writing.