# Neural Network POS Tagger

> Part-of-Speech Tagging with Neural Networks for a conversational agent

This toolkit was implemented during my master's thesis, which aimed to improve the natural language understanding of an artificial conversational agent. This agent used a Hidden Markov Model to calculate Part-of-Speech tags for input words. In order to achieve better results, two different classification architectures were implemented and evaluated: a Feed-forward Neural Network (FNN) and a Recurrent Neural Network (RNN).

This repository contains the toolkit to train and evaluate language models for POS tagging and to tag input sentences according to a trained model. The Python scripts can be used directly or through an API. The toolkit is licensed under GPL-3.0.

## Setup

Check if Python version >= 3.5 is installed:

$ python --version
Python 3.6.3

Install dependencies (consider using a virtual environment):

pip install -r requirements.txt

If the installation was successful, change into the tagger directory; everything should now be ready to run:

cd fnn-tagger/

## Script Usage

The individual Python scripts of this toolkit can be called directly, as documented in the following sections.

### Configuration

Static settings are located in the `settings.py` script. It contains the following configuration options:

| option | description |
| ------ | ----------- |
| `ARCHITECTURE` | Neural network architecture that will be used. Possible values: `FNN`, `RNN` |
| `VOCAB_SIZE` | Dimension of the vocabulary |
| `N_PAST_WORDS` | Number of preceding words to take into account for the POS tag training of the current word (FNN only) |
| `N_TIMESTEPS` | Number of past words (time steps) to take into account (RNN only) |
| `EMBEDDING_SIZE` | Dimension of the word embeddings (FNN only) |
| `H_SIZE` | Dimension of the hidden layer |
| `TEST_RATIO` | Ratio of test data extracted from the training data |
| `BATCH_SIZE` | Size of the training batches |
| `N_EPOCHS` | Number of training epochs |
| `CHECKPOINT_EVERY` | Evaluate and save the model state after this number of training steps |
| `REPLACEMENT_FILE` | Preprocess training data by normalizing terms with the help of replacements stored in this file |
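
A `settings.py` using these options might look like the following sketch. The option names come from the table above; the values are only illustrative (mostly taken from the command-line examples further below) and may differ from the shipped defaults:

```python
# Sketch of a possible settings.py - option names from the table above,
# values are illustrative and may differ from the repository's defaults.
ARCHITECTURE = 'FNN'              # or 'RNN'
VOCAB_SIZE = 50000                # hypothetical vocabulary size
N_PAST_WORDS = 1                  # FNN only
N_TIMESTEPS = 8                   # RNN only
EMBEDDING_SIZE = 250
H_SIZE = 350
TEST_RATIO = 0.1                  # hypothetical train/test split
BATCH_SIZE = 32                   # hypothetical batch size
N_EPOCHS = 5
CHECKPOINT_EVERY = 100            # matches the checkpoint interval shown in the training output below
REPLACEMENT_FILE = 'data/replacements.txt'  # hypothetical path
```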

Training, evaluation and tagging can be executed using the `tagger.py` script, which represents the core script of this toolkit. Its general usage is:

python tagger.py [-h] [--train TRAIN] [--tag TAG] [--evaluate EVALUATE]
                 [--reset] [-p PASTWORDS] [-e EMBEDDINGSIZE] [-s HIDDENSIZE]
                 [-n NEPOCHS] [-t TIMESTEPS] [-f] [-q] [-i]

optional arguments:
  -h, --help            show this help message and exit
  --train TRAIN         Invokes training of a language model on given corpus
  --tag TAG             Tags a given sentence with the pretrained language
                        model
  --evaluate EVALUATE   Evaluates pretrained language model with a given
                        evaluation file
  --reset               Removes all stored training and log data
  -p PASTWORDS, --pastwords PASTWORDS
                        Number of preceding words to take into account
  -e EMBEDDINGSIZE, --embeddingsize EMBEDDINGSIZE
                        Dimension of the word embeddings
  -s HIDDENSIZE, --hiddensize HIDDENSIZE
                        Dimension of the hidden layer
  -n NEPOCHS, --nepochs NEPOCHS
                        Number of training epochs
  -t TIMESTEPS, --timesteps TIMESTEPS
                        Number of past trained words
  -f, --force           Force operation without confirmation
  -q, --quiet           No output messages
  -i, --inline          Only one line output

The following sections explain the usage of the specific flags and their appropriate combination for each action.

### Training

To train the tagger, call the `tagger.py` script with the `--train` flag followed by the path to the training corpus. According to your static configuration, the batch training will start. Once a sufficient accuracy is reached, you can interrupt the training with CTRL+C or wait until the training process finishes.

$ python tagger.py --train data/test.corpus
Training starts...
Loading training data from "data/test.corpus"...
Generating vocabulary...
Generating tensors...
Initializing model...
Step 100: loss 0.9, accuracy 91% - saved model checkpoint to 'saved/model-100'
Step 200: loss 0.2, accuracy 98% - saved model checkpoint to 'saved/model-200'
Step 300: loss 0.0, accuracy 100% - saved model checkpoint to 'saved/model-300'

You can also call the script with inline configuration. To train a model using the FNN architecture, use the flags `-p`, `-e`, `-s` and `-n`. Exactly these 4 flags must be given; otherwise the static configuration from `settings.py` will be used. An example call would be:

python tagger.py --train data/test.corpus -p 1 -e 250 -s 350 -n 5

To train a model using the RNN architecture, use the flags `-t`, `-e`, `-s` and `-n`. Exactly these 4 flags must be given; otherwise the static configuration from `settings.py` will be used. An example call would be:

python tagger.py --train data/test.corpus -t 8 -e 100 -s 100 -n 5
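
The flag handling described above can be pictured with the following simplified sketch. It is not the actual `tagger.py` code, only an illustration of the documented behaviour that inline values take effect only when the complete flag set for an architecture is given:

```python
# Simplified illustration of the documented behaviour (not the actual tagger.py code):
# inline hyperparameters only apply when the complete set for an architecture is
# passed; otherwise the static settings.py values are used.

FNN_DEFAULTS = {'n_past_words': 2, 'embedding_size': 300, 'h_size': 300, 'n_epochs': 10}  # hypothetical settings.py values

def resolve_fnn_config(pastwords=None, embeddingsize=None, hiddensize=None, nepochs=None):
    inline = {'n_past_words': pastwords, 'embedding_size': embeddingsize,
              'h_size': hiddensize, 'n_epochs': nepochs}
    if all(value is not None for value in inline.values()):
        return inline                  # all four flags given: use the inline values
    return dict(FNN_DEFAULTS)          # otherwise fall back to settings.py

print(resolve_fnn_config(pastwords=1, embeddingsize=250, hiddensize=350, nepochs=5))  # inline values win
print(resolve_fnn_config(pastwords=1))                                                # incomplete set: defaults win
```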

### Tagging

To tag a sentence with a pretrained model, call the `tagger.py` script with the `--tag` parameter followed by the sentence to be tagged. A tag is then attached to every word.

$ python tagger.py --tag "Show all modules of Bachelor Informatics"
The tagged sentence is:
Show all modules of Bachelor Informatics
R_LIST R_LIST M_MTSModule X C_Program:degree C_Program:name

Make sure that `settings.py` is configured with the same values that were used to train the model; otherwise the tagger cannot load the pretrained model correctly.

If you don't want to be bothered by the `settings.py`, you can also call the script with inline configuration. To tag a sentence using the FNN architecture, use the flags `-p`, `-e` and `-s`. Exactly these 3 flags must be given; otherwise the static configuration from `settings.py` will be used. An example call would be:

python tagger.py --tag "Show all modules of Bachelor Informatics" -p 1 -e 250 -s 350

To tag a sentence using the RNN architecture, use the flags `-t`, `-e` and `-s`. Exactly these 3 flags must be given; otherwise the static configuration from `settings.py` will be used. An example call would be:

python tagger.py --tag "Show all modules of Bachelor Informatics" -t 8 -e 100 -s 100

### Evaluation

To evaluate a pretrained model on an external test set, call the `tagger.py` script with the `--evaluate` parameter followed by the path to the file containing the evaluation data. The evaluation file must contain one sentence per line, consisting of space-separated word/tag tuples.
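
A single line of such an evaluation file could, for example, look like this (the tags are taken from the tagging example above; the exact tag set depends on your training corpus):

```
Show/R_LIST all/R_LIST modules/M_MTSModule of/X Bachelor/C_Program:degree Informatics/C_Program:name
```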

$ python tagger.py --evaluate data/evaluation.txt
Loading saved vocabulary...
Generating tensors...
Evaluation starts...

# RESULTS:

20 / 29 0.690 sentences correct
207 / 208 0.995 words recognized
197 / 208 0.947 tags correct
0.966 kappa score

# ERRORS:

count expected computed
------------------------------------------------------------
4 bachelor/C_Program:degree bachelor/C_Program:name
4 master/C_Program:degree master/C_Program:name
1 institute/X_Chair:name institute/C_Program:name
1 quality/C_Chair:name quality/C_Program:name
1 and/C_Chair:name /X
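
The kappa score reported here is presumably Cohen's kappa, which corrects the raw tag accuracy for agreement expected by chance. As an illustration only (the toolkit's internal computation may differ), it can be reproduced with scikit-learn:

```python
# Illustration only: Cohen's kappa for expected vs. computed tag sequences.
# The short tag lists below are stand-ins for the full evaluation data;
# tagger.py may compute the score differently internally.
from sklearn.metrics import cohen_kappa_score

expected = ['R_LIST', 'R_LIST', 'M_MTSModule', 'X', 'C_Program:degree', 'C_Program:name']
computed = ['R_LIST', 'R_LIST', 'M_MTSModule', 'X', 'C_Program:name', 'C_Program:name']

print(cohen_kappa_score(expected, computed))
```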

Make sure that `settings.py` is configured with the same values that were used to train the model; otherwise the evaluation cannot load the pretrained model.

If you don't want to be bothered by the `settings.py`, you can also call the script with inline configuration. To evaluate a model trained with the FNN architecture, use the flags `-p`, `-e` and `-s`. Exactly these 3 flags must be given; otherwise the static configuration from `settings.py` will be used. An example call would be:

python tagger.py --evaluate data/evaluation.txt -p 1 -e 250 -s 350

To evaluate a model trained with the RNN architecture, use the flags `-t`, `-e` and `-s`. Exactly these 3 flags must be given; otherwise the static configuration from `settings.py` will be used. An example call would be:

python tagger.py --evaluate data/evaluation.txt -t 8 -e 100 -s 100

If you don't need the list of errors in the evaluation result, you can print the result in a single line by adding the `-i` flag, e.g.:

$ python tagger.py --evaluate data/evaluation.txt -p 1 -e 250 -s 350 -i
data/evaluation.txt: 197/208 (94.7%) tags correct, 0.966 kappa score

### Reset

To reset the tagger and delete all previously created files call the `tagger.py` script with the `--reset` flag and confirm with 'Yes' (or Enter):

$ python tagger.py --reset
Really delete all training data and log files? [Yes/no]
Reset was executed. All files successfully deleted.

If you don't want to be bothered by a confirmation prompt, you can use the `-f` flag to force a direct deletion:

python tagger.py --reset -f

If you also don't want any output messages, you can use the `-q` flag for a quiet deletion:

python tagger.py --reset -q

## API Usage

This toolkit is designed to be used in other applications. The API is documented in the following sections.

### Configuration

Import the tagging script according to your directory structure, e.g.:

import tagger as nn

To instantiate the tagger, just call the `Tagger()` class. Without any parameters, the static configuration from the `settings.py` will be used:

t = nn.Tagger()

If you prefer inline configuration, pass the corresponding parameters according to the neural network architecture (the FNN needs `n_past_words`, the RNN needs `n_timesteps`). See these two examples, one for each architecture:

t = nn.Tagger('FNN', n_past_words=1, embedding_size=250, h_size=350, n_epochs=5)
t = nn.Tagger('RNN', n_timesteps=8, embedding_size=100, h_size=100, n_epochs=5)

### Training

To train the initialized tagger, just call the `train()` method with the path to the corpus file, i.e.:

t.train('data/test.corpus')

The trained model will be stored in the `saved/` directory.

### Tagging

A sentence can be tagged with a pretrained model by calling the `tag()` method. Additional parameters let you print the tagging output in tabular form to the console (`pretty_print`) or completely mute console messages concerning model loading (`silent`).

tagged_sentence = t.tag('Show all modules of Bachelor Informatics', format_list=False, pretty_print=True, silent=False)

If you want to process lists instead of strings, you can pass `format_list=True`. The input sentence then has to be a list of words, and the output will be a list of word/tag tuples:

>>> sentence = ['Show', 'all', 'modules', 'of', 'Bachelor', 'Informatics']
>>> tagged_sentence = t.tag(sentence, format_list=True, pretty_print=False, silent=True)
>>> print(list(tagged_sentence))
[('Show', 'R_LIST'), ('all', 'R_LIST'), ('modules', 'M_MTSModule'), ('of', 'X'), ('Bachelor', 'C_Program:degree'), ('Informatics', 'C_Program:name')]

### Evaluation

To evaluate a pretrained model, call the `evaluate()` method. An additional parameter (`print_inline`) lets you print the main evaluation results in one single line.

t.evaluate('data/evaluation.txt', print_inline=False)
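
Putting the API calls together, a complete round trip could look like the following sketch (paths and hyperparameters are just the example values used above):

```python
import tagger as nn

# Instantiate with inline FNN configuration (example values from above)
t = nn.Tagger('FNN', n_past_words=1, embedding_size=250, h_size=350, n_epochs=5)

# Train a model, tag a sentence and evaluate against an external test set
t.train('data/test.corpus')
tagged_sentence = t.tag('Show all modules of Bachelor Informatics', pretty_print=True, silent=False)
t.evaluate('data/evaluation.txt', print_inline=True)
```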


---

If you like and use this POS tagger and want to give some love back, feel free to...



Buy me a coffee