https://github.com/iglee/hmms-and-pcfg
POS tagging by using ngram based hidden markov models.
- Host: GitHub
- URL: https://github.com/iglee/hmms-and-pcfg
- Owner: iglee
- Created: 2020-02-07T03:09:38.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-02-17T18:57:55.000Z (over 5 years ago)
- Last Synced: 2025-05-14T18:54:18.773Z (5 months ago)
- Topics: bigrams, hmm, hmm-viterbi-algorithm, ngrams, nlp, pos, trigrams
- Language: Python
- Homepage:
- Size: 35.2 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
## HMMs
### Deliverables
1. Bigram HMMs
2. Trigram HMMs
3. Evaluations

### Assumptions Made
1. Tokens with a count of less than 1 are converted to an unknown-word token. The POS tags for those tokens were kept the same.
2. Smoothing: add-k smoothing and linear interpolation were tried.
3. Emojis, URLs, hashtags, mentions, numbers, dates, etc. were converted to a special token.
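
The preprocessing assumptions above can be sketched roughly as follows. This is a minimal illustration, not the code in `TextProcess.py`: the placeholder token names, the regular expressions, and the helper functions `normalize_token` and `replace_rare` are all hypothetical, and the rare-token threshold is passed in as a parameter rather than taken from the repository.

```python
import re
from collections import Counter

# Hypothetical placeholder names; the repository's actual special tokens may differ.
URL, MENTION, HASHTAG, NUMBER, UNK = "<URL>", "<MENTION>", "<HASHTAG>", "<NUM>", "<UNK>"

def normalize_token(token):
    """Map open-ended token classes to a single placeholder token."""
    if re.fullmatch(r"https?://\S+|www\.\S+", token):
        return URL
    if re.fullmatch(r"@\w+", token):
        return MENTION
    if re.fullmatch(r"#\w+", token):
        return HASHTAG
    if re.fullmatch(r"[+-]?\d+([.,:/]\d+)*", token):
        return NUMBER
    return token

def replace_rare(sentences, min_count):
    """Replace tokens seen fewer than min_count times with UNK; tags are left untouched."""
    counts = Counter(tok for sent in sentences for tok, _ in sent)
    return [[(tok if counts[tok] >= min_count else UNK, tag) for tok, tag in sent]
            for sent in sentences]

# Toy (token, tag) data, made up for illustration
tagged = [[("@bob", "@"), ("check", "V"), ("http://t.co/x", "U")],
          [("check", "V"), ("2", "$"), ("cats", "N")]]
normalized = [[(normalize_token(tok), tag) for tok, tag in sent] for sent in tagged]
processed = replace_rare(normalized, min_count=2)
```

Only the token side of each pair is rewritten, which matches the assumption that the POS tag of a replaced token is kept.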
### How to run this
First, we need to load the corpus and tokenize it. This can be done using `TextProcess.py`:
```
from TextProcess import ProcessedCorpus

corpus = ProcessedCorpus("CSE517_HW_HMM_Data/twt.bonus.json",
                         "CSE517_HW_HMM_Data/twt.dev.json",
                         "CSE517_HW_HMM_Data/twt.test.json", 1)
```

Then the processed corpus is loaded into a language model using `LanguageModel.py`:
```
from LanguageModel import LanguageModel

lm = LanguageModel(corpus)
```

The `LanguageModel` class contains unigram, bigram, and trigram probability and count dictionaries for the HMMs to use. The available smoothing options are `add-k` and linear interpolation. The smoothing parameters can be updated, and the probabilities recomputed accordingly, as follows:
```
lm.update(0.3,(0.00001,0.99999),(0.001,0.001,0.998))
```
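
For reference, the two smoothing schemes named above can be sketched as below. This is a generic illustration under stated assumptions, not the code in `LanguageModel.py`: the function names `add_k_prob` and `interpolated_prob` are hypothetical, and because the meaning and order of the arguments to `lm.update` are not documented here, the sketch does not try to mirror that signature.

```python
from collections import Counter

def add_k_prob(bigram_counts, context_counts, prev, cur, k, vocab_size):
    """Add-k estimate of P(cur | prev): (c(prev, cur) + k) / (c(prev) + k * V)."""
    return (bigram_counts[(prev, cur)] + k) / (context_counts[prev] + k * vocab_size)

def interpolated_prob(p_unigram, p_bigram, p_trigram, lambdas):
    """Linear interpolation of three estimates; the weights should sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram

# Toy tag sequence, made up for illustration
tags = ["N", "V", "N", "N"]
bigram_counts = Counter(zip(tags, tags[1:]))
# Count each tag only where it acts as a context (i.e. has a successor)
context_counts = Counter(prev for prev, _ in zip(tags, tags[1:]))
vocab = sorted(set(tags))

p_v_given_n = add_k_prob(bigram_counts, context_counts, "N", "V", k=0.3, vocab_size=len(vocab))
p_mixed = interpolated_prob(0.2, 0.5, 0.0, (0.001, 0.001, 0.998))
```

With add-k smoothing, the estimates for a fixed context still sum to 1 over the vocabulary. With interpolation, a zero trigram estimate no longer forces the combined probability to zero.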
The bigram and trigram HMMs are implemented in `BigramHMM.py` and `TrigramHMM.py`, and they can be run as follows:
```
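# Assumes the class is importable from BigramHMM.py, like the imports above
from BigramHMM import BigramHMM

bhmm = BigramHMM(corpus, lm)
pis, preds = bhmm.test_viterbi(bhmm.dev, bhmm.dev_emission_probabilities)
accuracy, y_true, y_pred, confusion_matrix_array, normalized = bhmm.analyze_results(bhmm.dev, preds)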
```
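
For readers unfamiliar with the decoding step, a bigram Viterbi pass can be sketched as follows. This is a generic textbook-style version in log space, assuming dictionary-based probability tables and a non-empty input. It is not the implementation behind `test_viterbi`, and the function name `viterbi` and its parameters are hypothetical.

```python
import math

def viterbi(observations, tags, log_start, log_trans, log_emit):
    """Return the most probable tag sequence for a non-empty list of observations.

    log_start[t], log_trans[(prev, t)] and log_emit[(t, word)] hold log-probabilities;
    missing entries are treated as log(0) = -inf.
    """
    neg_inf = float("-inf")
    # pi[i][t]: best log-probability of any tag sequence ending in tag t at position i
    pi = [{t: log_start.get(t, neg_inf) + log_emit.get((t, observations[0]), neg_inf)
           for t in tags}]
    backpointers = []
    for word in observations[1:]:
        scores, pointers = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: pi[-1][p] + log_trans.get((p, t), neg_inf))
            scores[t] = (pi[-1][best_prev] + log_trans.get((best_prev, t), neg_inf)
                         + log_emit.get((t, word), neg_inf))
            pointers[t] = best_prev
        pi.append(scores)
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: pi[-1][t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Toy model, made up for illustration
tags = ["N", "V"]
log_start = {"N": math.log(0.8), "V": math.log(0.2)}
log_trans = {("N", "N"): math.log(0.3), ("N", "V"): math.log(0.7),
             ("V", "N"): math.log(0.6), ("V", "V"): math.log(0.4)}
log_emit = {("N", "dogs"): math.log(0.9), ("V", "dogs"): math.log(0.1),
            ("N", "bark"): math.log(0.2), ("V", "bark"): math.log(0.8)}
best_path = viterbi(["dogs", "bark"], tags, log_start, log_trans, log_emit)
```

Working in log space replaces products with sums, which avoids numerical underflow on longer sentences.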
```
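# Assumes the class is importable from TrigramHMM.py, like the imports above
from TrigramHMM import TrigramHMM

thmm = TrigramHMM(corpus, lm)
pis, preds = thmm.test_trigram_viterbi(thmm.dev, thmm.dev_emission_probabilities)
accuracy, y_true, y_pred, confusion_matrix_array, normalized = thmm.analyze_results(thmm.dev, preds)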
```

The confusion matrix can be plotted from the outputs above, `normalized` and `confusion_matrix_array`:
```
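from sklearn.metrics import confusion_matrix
import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Set the font scale before drawing so that it applies to this heatmap
sn.set(font_scale=1.9)

# Label the rows and columns of the normalized confusion matrix with the tag set
df_cm = pd.DataFrame(normalized, index=list(bhmm.tags),
                     columns=list(bhmm.tags))

plt.figure(figsize=(40, 40))
ax = sn.heatmap(df_cm, annot=True, cbar=False)

# Widen the y-limits by half a cell so the top and bottom rows are not cut off
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)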
```
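
As a rough guide to what the evaluation outputs contain, accuracy and a row-normalized confusion matrix can be computed from gold and predicted tags as sketched below. This is an assumption-laden illustration rather than the body of `analyze_results`: the function name `confusion_and_accuracy` is hypothetical, and whether the repository normalizes by row (per gold tag) is an assumption.

```python
import numpy as np

def confusion_and_accuracy(y_true, y_pred, tags):
    """Return (accuracy, count matrix, row-normalized matrix) for flat tag lists."""
    index = {tag: i for i, tag in enumerate(tags)}
    cm = np.zeros((len(tags), len(tags)), dtype=int)
    for gold, pred in zip(y_true, y_pred):
        cm[index[gold], index[pred]] += 1  # rows: gold tags, columns: predicted tags
    row_sums = cm.sum(axis=1, keepdims=True)
    # Leave rows for tags that never occur in y_true at zero instead of dividing by zero
    normalized = np.divide(cm, row_sums,
                           out=np.zeros(cm.shape, dtype=float),
                           where=row_sums != 0)
    accuracy = np.trace(cm) / cm.sum()
    return accuracy, cm, normalized

# Toy labels, made up for illustration
tags = ["N", "V", "A"]
y_true = ["N", "N", "V", "V"]
y_pred = ["N", "V", "V", "V"]
accuracy, cm, normalized = confusion_and_accuracy(y_true, y_pred, tags)
```

Each row of the normalized matrix shows how the occurrences of one gold tag are distributed over the predicted tags, which makes systematic confusions easier to spot than raw counts.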