https://github.com/cltk/enm_models_cltk
Models for Middle English provided by CLTK
- Host: GitHub
- URL: https://github.com/cltk/enm_models_cltk
- Owner: cltk
- Created: 2022-09-26T20:06:49.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2023-10-28T23:38:23.000Z (over 1 year ago)
- Last Synced: 2025-02-17T08:41:42.453Z (3 months ago)
- Topics: cltk, enm, middle-english, nlp
- Size: 4.93 MB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# CLTK models for Middle English
## How the word2vec model was trained
We first need to install some mathematics and machine-learning packages.
```bash
pip install scipy scikit-learn gensim
```
We then load a pickled corpus of Middle English texts.
```python
import pickle
with open('./middle_english_txt-.p', 'rb') as p:
    corpus = pickle.load(p)
```
The normalization process removes punctuation, special characters, and numbers, and lowercases each word.
```python
CHARACTERS_TO_REMOVE = "..."  # punctuation, special characters and digits to strip

def remove_punc(text):
    # Strip every unwanted character, then lowercase the word.
    for ele in text:
        if ele in CHARACTERS_TO_REMOVE:
            text = text.replace(ele, "")
    return text.lower()

new_data = []
for texts in corpus:
    part_text = []
    for word in texts:
        part_text.append(remove_punc(word))
    new_data.append(part_text)
```
The data is a list of lists of strings, i.e. a list of sentences, where each sentence is a list of words.
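For illustration, a toy corpus with this shape would look like the sketch below; the two sentences are simply the opening lines of the *Canterbury Tales* prologue and are not taken from the pickled corpus.

```python
# Illustrative example only: a list of sentences, where each sentence
# is a list of lowercased, punctuation-free word tokens.
toy_corpus = [
    ["whan", "that", "aprill", "with", "his", "shoures", "soote"],
    ["the", "droghte", "of", "march", "hath", "perced", "to", "the", "roote"],
]
```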
Then we use the `Word2Vec` class from **gensim**:
```python
# time to try gensim to create word2vecs
# see NLP in action, 6.2.4
from gensim.models.word2vec import Word2Vec

num_features = 50     # dimensionality of the embedding vectors
min_word_count = 30   # ignore words that occur fewer than 30 times
num_workers = 2       # number of training threads
window_size = 20      # context window size
subsampling = 1e-3    # downsampling threshold for very frequent words

model = Word2Vec(
    new_data,
    workers=num_workers,
    vector_size=num_features,
    min_count=min_word_count,
    window=window_size,
    sample=subsampling)

# Normalize the vectors in place (init_sims is deprecated in newer gensim releases).
model.init_sims(replace=True)

me_w2v_model = "me_word_embeddings_model.bin"
model.save(me_w2v_model)
```
The model is now saved in the file `me_word_embeddings_model.bin`.
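To sanity-check the result, the saved model can be loaded back with gensim's standard API; the query word `knyght` below is only an illustrative example and may not meet the `min_count` threshold in the actual corpus.

```python
from gensim.models.word2vec import Word2Vec

# Load the trained embeddings from disk.
model = Word2Vec.load("me_word_embeddings_model.bin")

# Nearest neighbours of an example word (assuming it is in the vocabulary).
print(model.wv.most_similar("knyght", topn=5))
```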