https://github.com/pvougiou/Neural-Wikipedian

Host: GitHub
URL: https://github.com/pvougiou/Neural-Wikipedian
Owner: pvougiou
License: apache-2.0
Created: 2017-11-29T19:17:32.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2018-08-26T21:55:00.000Z (almost 6 years ago)
Last Synced: 2024-01-22T18:36:59.814Z (5 months ago)
Language: C++
Size: 500 KB
Stars: 10
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

Lists

awesome-nlg - Neural-Wikipedian - The repository contains the code along with the required corpora that were used in order to build a system that "learns" how to generate English biographies for Semantic Web triples. (Datasets)

README

# Neural-Wikipedian
This repository contains the code along with the datasets of the work that has been submitted as a research paper to the Journal of Web Semantics. The work focuses on how an adaptation of the encoder-decoder framework can be used to generate textual summaries for Semantic Web triples.

For a detailed description of the work presented in this repository, please refer to the preprint version of the submitted paper at: .

## Datasets
In order to train our proposed models, we built two datasets of aligned knowledge base triples with text.

* D1: DBpedia triples aligned with Wikipedia biographies
* D2: Wikidata triples aligned with Wikipedia biographies

In a Unix shell environment, execute `sh download_datasets.sh` to download and uncompress both datasets into their corresponding folders (i.e. `D1` and `D2`). Each dataset folder consists of three sub-folders:

* `data` contains each aligned dataset in binary-encoded `pickle` files. Each file stores a hash table in the form of a Python dictionary of lists.
* `utils` contains each dataset's supporting files, such as hash tables of the frequency with which surface forms in the Wikipedia summaries have been mapped to entity URIs. All the files are binary-encoded in `pickle` files.
* `processed` contains the processed version of each aligned dataset after removal of potential outliers (e.g. instances of the datasets with extremely long Wikipedia summaries or very few triples). The files that are contained in the `processed` folders are the ones that are used for the training and testing of both our neural-network-based systems and the baselines.
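
As a minimal sketch of how one of these `pickle` files could be inspected, the snippet below writes a tiny stand-in dictionary of lists to a temporary file and loads it back. The key names (`triples`, `summaries`), the sample values and the file name `toy.p` are hypothetical placeholders, not the actual layout of the files in `D1/data/` or `D2/data/`; the real structure is described in `Inspect-Dataset.ipynb`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for one aligned-dataset file: a dictionary of lists.
# The real key names are documented in Inspect-Dataset.ipynb.
toy_dataset = {
    "triples": [["dbr:Some_Person dbo:birthPlace dbr:England"]],
    "summaries": ["Some Person was born in England ."],
}


def load_pickle(path):
    """Load a binary-encoded pickle file and return the stored object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe(table):
    """Return {key: number of items} for a dictionary of lists."""
    return {key: len(values) for key, values in table.items()}


# Round-trip through a temporary file so the sketch runs without the datasets.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.p")
    with open(toy_path, "wb") as handle:
        pickle.dump(toy_dataset, handle)
    loaded = load_pickle(toy_path)

summary = describe(loaded)
print(summary)
```

To try this on a downloaded file, point `load_pickle` at a path inside `D1/data/` or `D2/data/`. Files that were pickled under Python 2 may additionally need `encoding='latin1'` passed to `pickle.load` when read from Python 3. As with any `pickle` file, only load data from a source you trust.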

[`Inspect-Dataset.ipynb`](Inspect-Dataset.ipynb) is an IPython notebook that allows easier inspection of the above aligned datasets. The notebook also provides detailed information regarding the structure of the intermediate parts in `D1/data/` and `D2/data/` and the functionality of the supporting files in `D1/utils/` and `D2/utils/`.

The tables below present the distribution of the 10 most common predicates and entities in our two datasets, D1 and D2, respectively.

| Predicates In Triples | % | Entities In Triples | % | Entities In Summaries | % |
|---|---|---|---|---|---|
| dbo:birthDate | 12.43 | dbr:United_States | 0.49 | dbr:United_States | 2.82 |
| dbo:birthPlace | 10.67 | dbr:England | 0.19 | dbr:Actor | 2.14 |
| dbo:careerStation | 5.47 | dbr:United_Kingdom | 0.14 | dbr:Association_football | 1.02 |
| dbo:deathDate | 5.11 | dbr:France | 0.14 | dbr:Politician | 0.97 |
| dbo:occupation | 5.06 | dbr:Canada | 0.12 | dbr:Singing | 0.90 |
| dbo:team | 4.18 | dbr:India | 0.11 | dbr:United_Kingdom | 0.59 |
| dbo:deathPlace | 3.51 | dbr:Actor | 0.10 | dbr:England | 0.58 |
| dbo:genre | 3.22 | dbr:Italy | 0.10 | dbr:Writer | 0.53 |
| dbo:associatedBand | 2.85 | dbr:London | 0.10 | dbr:Canada | 0.50 |
| dbp:associatedMusicalArtist | 2.85 | dbr:Japan | 0.09 | dbr:France | 0.49 |

| Predicates In Triples | % | Entities In Triples | % | Entities In Summaries | % |
|---|---|---|---|---|---|
| wikidata:P569 (date of birth) | 14.15 | wikidata:Q5 (human) | 3.96 | wikidata:Q30 (United States of America) | 3.20 |
| wikidata:P106 (occupation) | 11.63 | wikidata:Q6581097 (male) | 3.27 | wikidata:Q33999 (actor) | 1.56 |
| wikidata:P31 (instance of) | 8.29 | wikidata:Q30 (United States of America) | 1.13 | wikidata:Q82955 (politician) | 1.02 |
| wikidata:P21 (sex or gender) | 7.92 | wikidata:Q6581072 (female) | 0.70 | wikidata:Q21 (England) | 0.87 |
| wikidata:P570 (date of death) | 7.58 | wikidata:Q145 (United Kingdom) | 0.44 | wikidata:Q145 (United Kingdom) | 0.85 |
| wikidata:P27 (country of citizenship) | 6.75 | wikidata:Q82955 (politician) | 0.42 | wikidata:Q27939 (singing) | 0.79 |
| wikidata:P735 (given name) | 6.53 | wikidata:Q1860 (English) | 0.39 | wikidata:Q36180 (writer) | 0.71 |
| wikidata:P19 (place of birth) | 5.20 | wikidata:Q33999 (actor) | 0.36 | wikidata:Q2736 (association football) | 0.68 |
| wikidata:P54 (member of sports team) | 2.64 | wikidata:Q36180 (writer) | 0.24 | wikidata:Q183 (Germany) | 0.61 |
| wikidata:P69 (educated at) | 2.58 | wikidata:Q177220 (singer) | 0.20 | wikidata:Q16 (Canada) | 0.58 |
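
The README does not state how these percentages were computed. As a rough sketch, assuming each value is an item's share of all occurrences in its column, a distribution of this kind could be derived as follows. The function name `top_k_distribution` and the toy triples are hypothetical and are not taken from D1 or D2.

```python
from collections import Counter


def top_k_distribution(items, k=10):
    """Return the k most common items with their share (%) of all occurrences."""
    counts = Counter(items)
    total = sum(counts.values())
    return [
        (item, round(100.0 * count / total, 2))
        for item, count in counts.most_common(k)
    ]


# Toy (subject, predicate, object) triples used only for illustration.
toy_triples = [
    ("ex:A", "dbo:birthDate", "1900-01-01"),
    ("ex:A", "dbo:birthPlace", "dbr:England"),
    ("ex:B", "dbo:birthDate", "1950-05-05"),
    ("ex:B", "dbo:occupation", "dbr:Actor"),
]

# Share of each predicate among all predicate occurrences.
predicate_shares = top_k_distribution(
    predicate for _, predicate, _ in toy_triples
)
print(predicate_shares)
```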

## Our Systems
The `Systems` directory contains all the code needed both to train our models and to generate summaries for the sets of triples in the validation and test sets of our datasets. It contains our two models in two separate sub-folders (i.e. `Triples2GRU` and `Triples2LSTM`). The neural network models are implemented using the [Torch](http://torch.ch/) package. We conducted our experiments on a single Titan X (Pascal) GPU. Please make sure that Torch, the [torch-hdf5](https://github.com/deepmind/torch-hdf5) package and the NVIDIA CUDA drivers are installed on your machine before executing any of the `.lua` files in these directories.

* You can train your own Triples2LSTM or Triples2GRU models by executing `th train.lua` inside each system's directory. You need access to a GPU with at least 11 GB of memory in order to train the models with the same hyperparameters that we used in the paper. However, by lowering the `params.batch_size` and `params.rnn_size` variables, you can train on NVIDIA GPUs with less dedicated memory. By altering the `dataset_path` and `checkpoint_path` variables in each `train.lua` file, you can select the dataset (i.e. D1 or D2) on which you will be training your model, and whether you will be using the surface form tuples or the URIs setup. The checkpoint files of the trained models will be saved in the corresponding `checkpoints` directory.

* You can use a checkpoint of a trained model to start generating summaries given input sets of triples from the validation and test sets of the aligned datasets by executing `th beam-sample.lua`. Please make sure that the pre-trained model (i.e. on D1 or D2, with URIs or surface form tuples) matches the dataset that will be loaded in the `beam_sampling_params.dataset_path` variable. You can download all our trained models and generate summaries from them by running the shell scripts located at:
    * `Systems/Triples2LSTM/download_trained_models.sh`
    * `Systems/Triples2GRU/download_trained_models.sh`

The generated summaries will be saved as HDF5 files in the directory of the pre-trained model. Our trained models use CUDA Tensors. Consequently, the NVIDIA CUDA drivers along with the `cutorch` and `cunn` Lua packages should be installed on your machine. The latter can be installed by running:
```sh
luarocks install cutorch
luarocks install cunn
```

* Execute the Python script `beam-sample.py` in order to create a `.csv` file with the sampled summaries. The Python packages `h5py`, `pandas` and `numpy` should be installed on your machine. The script replaces the `` tokens along with the property-type placeholders, and presents the generated summaries along with the input sets of triples and the actual Wikipedia summaries in the resultant `.csv` file. By default, the `.csv` file will be saved in the location of the pre-trained model.

For all possible alterations to the parameters of the above files, please consult the comment sections in each file.

## KenLM
The `KenLM` directory contains all the required code in order to train an n-gram Kneser-Ney language model. The code is based on the [KenLM Language Model Toolkit](https://kheafield.com/code/kenlm/). The binary files that reside in the `./kenlm/build/` directory have been compiled using [Boost](http://www.boost.org/) on a machine running Ubuntu 16.04 (x86_64 Linux 4.4.0-98-generic). In case you wish to experiment with this baseline on a different OS, you need to download and compile the original package according to the instructions at [https://kheafield.com/code/kenlm/](https://kheafield.com/code/kenlm/).

The following Python packages should also be installed on your machine: (i) `numpy`, (ii) `pandas`, and (iii) `kenlm`. The latter can be installed by running: `pip install https://github.com/kpu/kenlm/archive/master.zip` (i.e. [https://github.com/kpu/kenlm](https://github.com/kpu/kenlm)).

* In a Unix shell environment, run: `sh train.sh` in order to train a 5-gram Kneser-Ney language model. The trained model will be saved in the `./KenLM/` directory with the `.klm` extension (e.g. `D1.surf_form_tuples.model.klm` or `D2.surf_form_tuples.model.klm`).
* Execute the Python script `sample.py` in order to sample the most probable summary templates. The summaries are sampled using beam search. The most probable templates will be saved in a `pickle` file (e.g. `D1.surf_form_tuples.templates.p` or `D2.surf_form_tuples.templates.p`) in the `./KenLM/templates/` directory.
* Run the Python script `process-templates.py` in order to post-process the templates according to each input set of triples from the test or validation set of the selected dataset. The script replaces the `` tokens along with any potential property-type placeholders according to the triples of the input set. The generated `.csv` file with all the generated summaries along with their input sets of triples is saved in the `./KenLM/templates/` directory.
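
To illustrate the beam-search sampling mentioned above, here is a minimal, self-contained sketch that runs beam search over a toy bigram model. It is not the code in `sample.py` and does not use KenLM: the `BIGRAM_LOGPROB` table, the helper `successors`, the function `beam_search` and its `beam_size` and `max_len` parameters are all hypothetical stand-ins for a trained language model and the script's actual settings.

```python
import math

# Toy bigram log-probabilities; a hypothetical stand-in for a trained
# n-gram language model. "<s>" and "</s>" mark sentence start and end.
BIGRAM_LOGPROB = {
    ("<s>", "he"): math.log(0.6),
    ("<s>", "she"): math.log(0.4),
    ("he", "was"): math.log(0.9),
    ("he", "</s>"): math.log(0.1),
    ("she", "was"): math.log(0.8),
    ("she", "</s>"): math.log(0.2),
    ("was", "born"): math.log(0.7),
    ("was", "</s>"): math.log(0.3),
    ("born", "</s>"): math.log(1.0),
}


def successors(token):
    """Return (next_token, log_prob) pairs that can follow `token`."""
    return [(nxt, lp) for (prev, nxt), lp in BIGRAM_LOGPROB.items() if prev == token]


def beam_search(beam_size=2, max_len=6):
    """Return finished hypotheses as (log_prob, tokens), best first."""
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for nxt, lp in successors(tokens[-1]):
                hypothesis = (score + lp, tokens + [nxt])
                # Hypotheses that reach "</s>" are set aside, not extended.
                if nxt == "</s>":
                    finished.append(hypothesis)
                else:
                    candidates.append(hypothesis)
        if not candidates:
            break
        # Keep only the `beam_size` highest-scoring open hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return sorted(finished, key=lambda c: c[0], reverse=True)


best_score, best_tokens = beam_search()[0]
print(" ".join(best_tokens), round(math.exp(best_score), 3))
```

With a real model, the log-probability lookup would presumably be replaced by scores from the trained `.klm` file. For simplicity, this sketch ranks finished hypotheses by raw log-probability without any length normalisation.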

In the default scenario, the model trains on D1 and samples summaries for the sets of triples that have been allocated to the test set. In case you wish to run the files (i.e. `train.sh`, `sample.py` and `process-templates.py`) in a different setup, you can alter them following the guidelines in each file's comment sections.

## License
This project is licensed under the terms of the Apache 2.0 License.