https://github.com/adamouization/pos-tagging-and-unknown-words
  
  
    :pencil: Part-Of-Speech Tagger with unknown word handling using the Viterbi algorithm. 
- Host: GitHub
- URL: https://github.com/adamouization/pos-tagging-and-unknown-words
- Owner: Adamouization
- License: gpl-3.0
- Created: 2020-01-30T09:10:50.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T04:27:17.000Z (almost 3 years ago)
- Last Synced: 2025-01-05T01:43:06.815Z (10 months ago)
- Topics: natural-language-processing, nlp, part-of-speech, part-of-speech-tagger, part-of-speech-tagging, pos, tagging, viterbi, viterbi-algorithm
- Language: Python
- Homepage:
- Size: 1.75 MB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 3
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
 
# POS Tagger with Unknown Words Handling
This repository contains a Part-Of-Speech (POS) tagger that uses the Viterbi algorithm to predict POS tags for sentences in the Brown corpus. POS tagging is a common Natural Language Processing (NLP) task. The tagger includes the following features:
  * HMM word emission frequency smoothing;
  * Unknown word handling;
  * Extra rules for unknown words based on their morphological idiosyncrasies;
  * HMM training data saving for quicker program execution.
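
To make the first feature concrete, here is a minimal sketch of Viterbi decoding over an HMM whose transition and emission counts are smoothed. It is not the repository's code: the function names (`train_hmm`, `log_prob`, `viterbi`), the `<s>`/`</s>` boundary markers, the toy corpus and the choice of add-one (Laplace) smoothing are all assumptions made for illustration, and the actual tagger may smooth differently.

```python
import math
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from lists of (word, tag) pairs."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
        transitions[prev]["</s>"] += 1
    return transitions, emissions, vocab


def log_prob(counter, key, num_outcomes):
    """Add-one smoothed log-probability of `key` given the counts in `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + num_outcomes))


def viterbi(words, transitions, emissions, vocab):
    """Return the most likely tag sequence for `words` (assumed smoothing: add-one)."""
    if not words:
        return []
    tags = list(emissions)
    n_tags = len(tags) + 1    # every tag plus the end marker "</s>"
    n_words = len(vocab) + 1  # known words plus one bucket for unseen words

    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions["<s>"], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], tag, n_tags) + emit, p)
                for p in tags
            )

    # Pick the best final tag (including the transition to "</s>"), then backtrack.
    last = max(
        tags,
        key=lambda t: best[-1][t][0] + log_prob(transitions[t], "</s>", n_tags),
    )
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


# Toy corpus, purely illustrative.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_hmm(toy_corpus)
print(viterbi(["the", "cat", "barks"], *model))
```

Working in log space avoids numerical underflow on long sentences, and the extra "unseen word" bucket in the emission denominator means a word absent from training still receives a small non-zero probability under every tag.
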
[Figure: evolution of the tagger's accuracy using different methods]

The report is available [here](https://github.com/Adamouization/POS-Tagging-and-Unknown-Words/blob/master/report/report.pdf).
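
The morphological unknown-word rules listed among the features could, for instance, map out-of-vocabulary words to pseudo-words according to their shape, so that the HMM can learn emission probabilities for whole classes of unseen words. The sketch below is an assumption about how such rules might look, not the rules used in this repository: the rule list `UNKNOWN_WORD_RULES`, the `UNK-*` labels and the `normalise` function are hypothetical names.

```python
import re

# Hypothetical rules, checked in order: (pattern, pseudo-word).
UNKNOWN_WORD_RULES = [
    (re.compile(r"^\d[\d.,]*$"), "UNK-NUM"),  # numbers such as 42 or 1,234
    (re.compile(r"^[A-Z]"), "UNK-CAP"),       # capitalised words, often proper nouns
    (re.compile(r"ing$"), "UNK-ING"),         # often gerunds or present participles
    (re.compile(r"ed$"), "UNK-ED"),           # often past-tense verbs
    (re.compile(r"ly$"), "UNK-LY"),           # often adverbs
    (re.compile(r"s$"), "UNK-S"),             # often plural nouns
]


def normalise(word, vocab):
    """Return `word` if it is known, otherwise the pseudo-word of the first matching rule."""
    if word in vocab:
        return word
    for pattern, pseudo_word in UNKNOWN_WORD_RULES:
        if pattern.search(word):
            return pseudo_word
    return "UNK"  # fallback when no rule matches


known = {"the", "dog", "barks"}
print([normalise(w, known) for w in ["the", "running", "London", "xyz"]])
```

For the pseudo-words to carry useful statistics, rare training words would typically be passed through the same function before counting emissions, so that each `UNK-*` class is observed with the tags it tends to take.
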

## Usage
Before running the program, create a new virtual environment and install the required Python libraries (such as NLTK) with the following command:
```
pip install -r requirements.txt
```
To run the POS tagger in Python, move to the `src` directory and run the following command:
```
python main.py [-corpus <brown|floresta>] [-r] [-d]
```
where:
* `-corpus`: the name of the corpus to use, which can be either `brown` or `floresta`. This optional argument defaults to `brown` if nothing is specified.
* `-r`: a flag that forces the program to recompute the HMM's tag transition and word emission probabilities rather than loading previously computed versions into memory.
* `-d`: a flag that enables debugging mode, printing additional statements on the command line.
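
The way these three options interact with the saved HMM training data can be sketched roughly as follows. This is an illustrative assumption rather than the contents of `main.py`: the `parse_args` and `load_or_train` helpers, the `cache` directory, the `<corpus>_hmm.pkl` file name and the use of `pickle` are all hypothetical.

```python
import argparse
import pickle
from pathlib import Path


def parse_args(argv=None):
    """Parse the three command-line options described above."""
    parser = argparse.ArgumentParser(description="HMM POS tagger (illustrative sketch)")
    parser.add_argument("-corpus", choices=["brown", "floresta"], default="brown",
                        help="corpus to use (defaults to brown)")
    parser.add_argument("-r", action="store_true",
                        help="recompute the HMM probabilities instead of loading them")
    parser.add_argument("-d", action="store_true",
                        help="print additional debugging statements")
    return parser.parse_args(argv)


def load_or_train(corpus, recompute, train_fn, cache_dir="cache"):
    """Load a saved model for `corpus`, or train and save one if needed.

    `train_fn` is a hypothetical callable that takes the corpus name and
    returns any picklable model object.
    """
    cache_file = Path(cache_dir) / f"{corpus}_hmm.pkl"
    if cache_file.exists() and not recompute:
        with cache_file.open("rb") as f:
            return pickle.load(f)
    model = train_fn(corpus)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(model, f)
    return model
```

With this layout, a call such as `load_or_train(args.corpus, args.r, train_fn)` only pays the training cost on the first run or when `-r` is passed; later runs read the saved probabilities from disk.
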

## License
* See the [LICENSE](https://github.com/Adamouization/POS-Tagging-and-Unknown-Words/blob/master/LICENSE) file.

## Contact
* Email: adam@jaamour.com
* Website: www.adam.jaamour.com
* LinkedIn: [linkedin.com/in/adamjaamour](https://www.linkedin.com/in/adamjaamour/)
* Twitter: [@Adamouization](https://twitter.com/Adamouization)