https://github.com/generall/entitycategoryprediction

Model for predicting categories of entities from their mentions
allennlp classification mentions nlp

Host: GitHub
URL: https://github.com/generall/entitycategoryprediction
Owner: generall
Created: 2019-04-11T20:53:23.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2021-06-23T17:01:05.000Z (over 4 years ago)
Last Synced: 2025-04-06T13:47:38.808Z (6 months ago)
Topics: allennlp, classification, mentions, nlp
Language: Jupyter Notebook
Homepage: https://mention.vasnetsov.com/#/
Size: 1.94 MB
Stars: 29
Watchers: 2
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: readme.md

# Category prediction model

This repo contains an AllenNLP model that predicts the categories of named entities from their mentions.

# Data

## Fake data

You can generate some fake data using this [notebook](notebooks/gen_face_data.ipynb).

## Real data (Work in progress)

The real data is a filtered version of the [OneShotWikilinks](https://www.kaggle.com/generall/oneshotwikilinks) dataset, restricted to manually selected categories.

### Data preparation steps

* Create the category graph [build_category_graph.ipynb](./notebooks/build_category_graph.ipynb):

    * Produces: `category_graph.pkl`

* Obtain the list of Person articles from Ontology [obtain_people_articles.ipynb](/notebooks/obtain_people_articles.ipynb):

    * Requires: `dbpedia_2016-10.owl`

    * Produces: `people_categories.json`

* Build the mapping from articles to people categories [generate_full_people_categories.ipynb](./notebooks/generate_full_people_categories.ipynb):

    * Requires: `people_categories.json`, `category_graph.pkl`, `projects/categories_prediction/manual_categories.gsheet`

* Filter mentions for people [filter_mentions.ipynb](./notebooks/filter_mentions.ipynb). 

    * Requires: `people_all_categories.json`

    * Produces: `people_mentions.tsv`
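
Each step above depends on files produced by an earlier one, so the notebooks have to run in the listed order. A minimal sketch of a pre-flight check follows. The helper `missing_inputs` is hypothetical (it is not part of this repo), and it assumes the files sit in the current directory, which the steps above do not specify.

```bash
# Hypothetical helper: print every listed input that does not exist.
# Returns 1 if at least one input is missing, 0 otherwise.
missing_inputs() {
    status=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path"
            status=1
        fi
    done
    return "$status"
}

# Two of the inputs of generate_full_people_categories.ipynb, as listed above.
# Assumption: they are looked up relative to the current directory.
if ! missing_inputs people_categories.json category_graph.pkl; then
    echo "Run the earlier notebooks first."
fi
```

The same check can be repeated before `filter_mentions.ipynb` with `people_all_categories.json` as the argument.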

Split the training data into 10 chunks without breaking lines:

```bash
split -n l/10 --verbose ../data/fake_data_train.tsv ../data/fake_data_train.tsv_
```
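
To see what `split -n l/10` does, here is a small sketch on a throwaway file. It assumes GNU coreutils `split`. The file name `sample.tsv` and its contents are made up for illustration and are not part of the dataset.

```bash
# Illustrative only: sample.tsv and its 25 rows are made-up data.
workdir="$(mktemp -d)"
for i in $(seq 1 25); do
    printf '%s\tlabel\n' "$i"
done > "$workdir/sample.tsv"

# l/10 asks for ten output files and never cuts a line in half.
# Output files get the given prefix plus a two-letter suffix (aa, ab, ...).
split -n l/10 "$workdir/sample.tsv" "$workdir/sample.tsv_"

# Count the chunks and confirm no rows were lost.
chunk_count="$(ls "$workdir"/sample.tsv_* | wc -l)"
line_total="$(cat "$workdir"/sample.tsv_* | wc -l)"
echo "chunks=$chunk_count lines=$line_total"
```

Chunks are sized by bytes, so they may hold different numbers of rows, but every row ends up whole in exactly one chunk.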

# Install

```bash
pip install -r requirements.txt
```

# Run

## Train

```bash
# Remove the old vocabulary, rebuild it, then train
rm -rf ./data/vocabulary
allennlp make-vocab -s ./data/ allen_conf_vocab.json --include-package category_prediction
allennlp train -f -s data/stats allen_conf.json --include-package category_prediction
```

To train on the first GPU, override `cuda_device` from the command line:

```bash
allennlp train -f -s data/stats allen_conf.json --include-package category_prediction -o '{"trainer": {"cuda_device": 0}}'
```

### Continue training with different params

```bash
rm -rf data/stats2/  # Clear new serialization dir
allennlp fine-tune -s data/stats2/ -c allen_conf.json -m ./data/stats/model.tar.gz --include-package category_prediction -o '{"trainer": {"cuda_device": 0}, "iterator": {"base_iterator": {"batch_size": 64}}}'
```

## Validate

```bash
allennlp evaluate ./data/stats/model.tar.gz ./data/fake_data_test.tsv --include-package category_prediction
```

## Server

### Debug

```bash
MODEL=./data/trained_models/6th_augmented/model.tar.gz python run_server.py
```

### Prod

```bash
gunicorn -c gunicorn_config.py wsgi:application
```

### Docker

Build the image:

```bash
cd docker
docker build --tag mention .
```

Run the container, mounting the host pyenv into it:

```bash
# --rm is left out here: Docker rejects it when combined with --restart.
docker run --restart unless-stopped -v "$HOME:$HOME" -p 8000:8000 \
        -v "$HOME/.pyenv:/root/.pyenv" \
        -e ENV_PATH="$HOME/virtualenv/path" \
        -e APP_PATH="$HOME/project/root/path" mention
```

# GCE related notes

To fix the GPU reporting 100% utilization, enable persistence mode:

```bash
sudo nvidia-smi -pm 1
```
