https://github.com/adamouization/pos-tagging-and-unknown-words
:pencil: Part-Of-Speech Tagger with unknown word handling using the Viterbi algorithm.
- Host: GitHub
- URL: https://github.com/adamouization/pos-tagging-and-unknown-words
- Owner: Adamouization
- License: gpl-3.0
- Created: 2020-01-30T09:10:50.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T04:27:17.000Z (almost 2 years ago)
- Last Synced: 2023-08-14T18:25:48.899Z (over 1 year ago)
- Topics: natural-language-processing, nlp, part-of-speech, part-of-speech-tagger, part-of-speech-tagging, pos, tagging, viterbi, viterbi-algorithm
- Language: Python
- Homepage:
- Size: 1.75 MB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# POS Tagger with Unknown Words Handling [![HitCount](http://hits.dwyl.com/Adamouization/POS-Tagging-and-Unknown-Words.svg)](http://hits.dwyl.com/Adamouization/POS-Tagging-and-Unknown-Words) [![GitHub license](https://img.shields.io/github/license/Adamouization/POS-Tagging-and-Unknown-Words)](https://github.com/Adamouization/POS-Tagging-and-Unknown-Words/blob/master/LICENSE)
This repository contains code for a Part-Of-Speech (POS) tagger that uses the Viterbi algorithm to predict POS tags for sentences in the Brown corpus. POS tagging is a common Natural Language Processing (NLP) task. The tagger includes the following features:
* HMM word emission frequency smoothing;
* Unknown word handling;
* Extra unknown words rules based on their morphological idiosyncrasies;
* HMM training data saving for quicker program execution.

The evolution of the tagger's accuracy using different methods can be seen below. The report can be visited [here](https://github.com/Adamouization/POS-Tagging-and-Unknown-Words/blob/master/report/report.pdf).
![Evolution of the tagger's accuracy](https://raw.githubusercontent.com/Adamouization/POS-Tagging-and-Unknown-Words/master/report/accuracy_evolution.png?token=AEI7XLFLK2XMOVTE2HPSKRC7R5M6A)
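
To make the ideas above concrete, here is a minimal, self-contained sketch of Viterbi decoding over an HMM. It is not the repository's implementation: it assumes add-one (Laplace) smoothing and a single `<UNK>` pseudo-word for unknown words, and it omits the morphological rules mentioned above. All names (`train_hmm`, `log_prob`, `viterbi`) and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

START, END, UNK = "<s>", "</s>", "<UNK>"  # hypothetical marker names


def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from lists of (word, tag) pairs."""
    transitions = defaultdict(Counter)  # transitions[prev_tag][tag]
    emissions = defaultdict(Counter)    # emissions[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
        transitions[prev][END] += 1
    return transitions, emissions, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log-probability (an assumed smoothing scheme)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = sorted(emissions)
    n_tags = len(tags) + 1    # +1 for the END pseudo-tag
    n_words = len(vocab) + 1  # +1 for the UNK pseudo-word
    # Unknown words are mapped to UNK so they still receive a smoothed score.
    obs = [w if w in vocab else UNK for w in words]

    # best[i][tag] = (log-probability of the best path ending in tag, backpointer)
    best = [{t: (log_prob(transitions[START], t, n_tags)
                 + log_prob(emissions[t], obs[0], n_words), None) for t in tags}]
    for i in range(1, len(obs)):
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p][0]
                       + log_prob(transitions[p], t, n_tags))
            score = (best[i - 1][prev][0]
                     + log_prob(transitions[prev], t, n_tags)
                     + log_prob(emissions[t], obs[i], n_words))
            column[t] = (score, prev)
        best.append(column)

    # Pick the best final tag (including the transition to END), then backtrack.
    last = max(tags, key=lambda t: best[-1][t][0]
               + log_prob(transitions[t], END, n_tags))
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


# Toy corpus for illustration only; "zebra" is an unknown word.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_hmm(toy_corpus)
print(viterbi(["the", "zebra", "sleeps"], *model))  # ['DET', 'NOUN', 'VERB']
```

Working in log space avoids numerical underflow on long sentences. The unknown word is tagged purely from the surrounding tag transitions here, which is the weakness that suffix- or capitalisation-based rules are meant to address.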
## Usage
Before running the program, create a new virtual environment, then install the required Python libraries (such as NLTK) by running the following command:
```
pip install -r requirements.txt
```

To run the POS tagger in Python, move to the `src` directory and run the following command:
```
python main.py [-corpus ] [-r] [-d]
```

where:
* `-corpus`: the name of the corpus to use, which can be either `brown` or `floresta`. This optional argument defaults to `brown` if nothing is specified.
* `-r`: a flag that forces the program to recompute the HMM's tag transition and word emission probabilities rather than loading previously computed versions into memory.
* `-d`: a flag that enables debugging mode, printing additional statements on the command line.
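
The options described above could be wired up roughly as follows. This is a hedged sketch, not the code in `main.py`: it assumes `argparse` for parsing and `pickle` for storing the trained model, and the names `parse_args`, `load_or_train`, `recompute` and `debug` are hypothetical.

```python
import argparse
import pickle
from pathlib import Path


def parse_args(argv=None):
    """Parse the three options described in the usage section."""
    parser = argparse.ArgumentParser(description="POS tagger (illustrative CLI sketch)")
    parser.add_argument("-corpus", choices=["brown", "floresta"], default="brown",
                        help="corpus to use (defaults to brown)")
    parser.add_argument("-r", dest="recompute", action="store_true",
                        help="recompute HMM probabilities instead of loading saved ones")
    parser.add_argument("-d", dest="debug", action="store_true",
                        help="print additional debugging statements")
    return parser.parse_args(argv)


def load_or_train(cache_path, train_fn, recompute=False):
    """Load a saved model if one exists, unless recomputation is forced."""
    cache_path = Path(cache_path)
    if cache_path.exists() and not recompute:
        with cache_path.open("rb") as f:
            return pickle.load(f)
    model = train_fn()  # the expensive step that caching is meant to skip
    with cache_path.open("wb") as f:
        pickle.dump(model, f)
    return model


# Equivalent to: python main.py -corpus floresta -r
args = parse_args(["-corpus", "floresta", "-r"])
print(args.corpus, args.recompute, args.debug)  # floresta True False
```

With this shape, a first run trains and saves the model, later runs load it from disk, and passing `-r` discards the saved version and retrains.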
## License
* See the [LICENSE](https://github.com/Adamouization/POS-Tagging-and-Unknown-Words/blob/master/LICENSE) file.

## Contact
* Email: [email protected]
* Website: www.adam.jaamour.com
* LinkedIn: [linkedin.com/in/adamjaamour](https://www.linkedin.com/in/adamjaamour/)
* Twitter: [@Adamouization](https://twitter.com/Adamouization)