https://github.com/adamouization/pos-tagging-and-unknown-words

:pencil: Part-Of-Speech Tagger with unknown word handling using the Viterbi algorithm.
natural-language-processing nlp part-of-speech part-of-speech-tagger part-of-speech-tagging pos tagging viterbi viterbi-algorithm

# POS Tagger with Unknown Words Handling [![HitCount](http://hits.dwyl.com/Adamouization/POS-Tagging-and-Unknown-Words.svg)](http://hits.dwyl.com/Adamouization/POS-Tagging-and-Unknown-Words) [![GitHub license](https://img.shields.io/github/license/Adamouization/POS-Tagging-and-Unknown-Words)](https://github.com/Adamouization/POS-Tagging-and-Unknown-Words/blob/master/LICENSE)

This repository contains a Part-Of-Speech (POS) tagger that uses the Viterbi algorithm to predict POS tags for sentences in the Brown corpus. POS tagging is a common Natural Language Processing (NLP) task. The tagger includes the following features:
* HMM word emission frequency smoothing;
* Unknown word handling;
* Extra rules for unknown words based on their morphological idiosyncrasies;
* HMM training data saving for quicker program execution.
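
As a rough illustration of how these pieces fit together, the sketch below trains a bigram HMM from tagged sentences and decodes with the Viterbi algorithm in log space. It is not the repository's code: the function names (`train_hmm`, `log_prob`, `viterbi`), the `<s>`/`</s>` boundary markers, and the use of add-one smoothing are all assumptions made for this example. Here unknown words are handled only through the probability mass that smoothing reserves for them; the morphological rules mentioned above are not reproduced.

```python
import math
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from lists of (word, tag) pairs."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical start-of-sentence marker
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
        transitions[prev]["</s>"] += 1  # hypothetical end-of-sentence marker
    return transitions, emissions, vocab


def log_prob(counter, key, num_outcomes, k=1.0):
    """Add-k smoothed log probability (assumed smoothing scheme)."""
    total = sum(counter.values())
    return math.log((counter[key] + k) / (total + k * num_outcomes))


def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = list(emissions)
    n_tags = len(tags) + 1    # +1 for the end-of-sentence marker
    n_words = len(vocab) + 1  # +1 reserves mass for unseen words

    # best[i][tag] = (log prob of the best path ending in tag at i, backpointer)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions["<s>"], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            score, prev = max(
                (best[i - 1][p][0] + log_prob(transitions[p], tag, n_tags), p)
                for p in tags
            )
            best[i][tag] = (score + emit, prev)

    # Pick the best final tag, including the transition to the end marker.
    _, last = max(
        (best[-1][t][0] + log_prob(transitions[t], "</s>", n_tags), t)
        for t in tags
    )

    # Follow backpointers from the last position to the first.
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


# Tiny toy corpus, only to show the calling convention.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_hmm(toy_corpus)
predicted = viterbi(["the", "bird", "sleeps"], *model)  # "bird" is unseen
```

Working in log space avoids numerical underflow on long sentences, and smoothing keeps every emission probability non-zero so that an unseen word cannot eliminate all candidate paths.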

The figure below shows how the tagger's accuracy evolves as the different methods are applied. The full report is available [here](https://github.com/Adamouization/POS-Tagging-and-Unknown-Words/blob/master/report/report.pdf).

![Evolution of the tagger's accuracy using different methods](https://raw.githubusercontent.com/Adamouization/POS-Tagging-and-Unknown-Words/master/report/accuracy_evolution.png?token=AEI7XLFLK2XMOVTE2HPSKRC7R5M6A)

## Usage

Before running the program, create a new virtual environment, then install the required Python libraries (such as NLTK) with the following command:

```
pip install -r requirements.txt
```

To run the POS tagger in Python, move to the `src` directory and run the following command:

```
python main.py [-corpus ] [-r] [-d]
```

where:

* `-corpus`: the name of the corpus to use, either `brown` or `floresta`. This argument is optional and defaults to `brown`.

* `-r`: a flag that forces the program to recompute the HMM's tag transition and word emission probabilities rather than loading previously computed versions into memory.

* `-d`: a flag that enables debugging mode, which prints additional statements on the command line.
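
For readers curious how such options might be wired up, here is a minimal sketch using `argparse` and a simple on-disk cache. It is not taken from `main.py`: the helper names `parse_args` and `load_or_train`, the cache path, and the choice of `pickle` for storing the trained probabilities are assumptions made for illustration only.

```python
import argparse
import os
import pickle


def parse_args(argv=None):
    """Parse the three documented options (illustrative, not the real main.py)."""
    parser = argparse.ArgumentParser(description="POS tagger CLI sketch")
    parser.add_argument("-corpus", default="brown", choices=["brown", "floresta"],
                        help="corpus to use (defaults to brown)")
    parser.add_argument("-r", action="store_true",
                        help="recompute HMM probabilities instead of loading them")
    parser.add_argument("-d", action="store_true",
                        help="print additional debugging statements")
    return parser.parse_args(argv)


def load_or_train(cache_path, train_fn, recompute=False):
    """Load a cached model if present; otherwise train it and save it.

    `cache_path` and `train_fn` are hypothetical: the former is where the
    pickled model lives, the latter is any zero-argument training callable.
    """
    if not recompute and os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    model = train_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(model, f)
    return model
```

In this sketch, a script would call `args = parse_args()` and then pass `recompute=args.r` to `load_or_train`, so that the `-r` flag bypasses the cached file and overwrites it with freshly computed probabilities.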

## License
* See the [LICENSE](https://github.com/Adamouization/POS-Tagging-and-Unknown-Words/blob/master/LICENSE) file.

## Contact
* Email: [email protected]
* Website: www.adam.jaamour.com
* LinkedIn: [linkedin.com/in/adamjaamour](https://www.linkedin.com/in/adamjaamour/)
* Twitter: [@Adamouization](https://twitter.com/Adamouization)