https://github.com/frederickroman/etymology-predictor

This is an etymology prediction RNN model that uses data scraped from Wiktionary.

etymology etymology-data flask-api full-stack nlp-machine-learning pytorch rnn-pytorch svelte-ts web-scraping

Host: GitHub
URL: https://github.com/frederickroman/etymology-predictor
Owner: FrederickRoman
License: mit
Created: 2022-06-02T22:38:43.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2022-07-04T03:54:25.000Z (almost 4 years ago)
Last Synced: 2025-03-11T04:32:29.708Z (over 1 year ago)
Topics: etymology, etymology-data, flask-api, full-stack, nlp-machine-learning, pytorch, rnn-pytorch, svelte-ts, web-scraping
Language: Jupyter Notebook
Homepage:
Size: 1.66 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Etymology prediction

## Live website

See [Etymology prediction live website](https://etymology-classifier.herokuapp.com/).

## Character-sequence based English word etymology prediction

This project aims to predict the etymology of an English word while showcasing the whole process end-to-end.

- Data collection:
  + The training data was scraped from Wiktionary.
- Machine Learning:
  + It uses a character-level many-to-one RNN.
- Server:
  + It is a REST API that predicts a word's etymology through the endpoint GET /etymology/{word}.
- Client:
  + It is a performant, lightweight, reactive UI that connects to the etymology prediction server.

## Tech stack used in this project (all in this repo)

- Data collection:
  + wiktionaryparser (Python)
- Machine Learning:
  + PyTorch
- Server-side:
  + Flask
- Client-side:
  + Svelte (ts)

## Data collection
The training data was scraped from the etymology section of Wiktionary using wiktionaryparser.

Since this etymology section is presented in plain text, the actual etymology labels for training must be extracted from it. For simplicity's sake, I only consider two possible etymologies: Germanic and Latin. This is, of course, a big oversimplification of the etymology of English words, but I thought that it could yield useful results nonetheless. I scraped the etymology of the words contained in the CMU dictionary.
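To illustrate the label-extraction step, here is a minimal sketch that turns a plain-text etymology string into one of the two labels by counting language-family markers. The marker lists, the tie-breaking rule, and the function name `label_etymology` are assumptions made for this example; the rules actually used in this repository may differ.

```python
# Hypothetical marker lists; the repository's real extraction rules may differ.
GERMANIC_MARKERS = ("Germanic", "Old English", "Old Norse")
LATIN_MARKERS = ("Latin", "Old French", "Anglo-Norman")


def label_etymology(text):
    """Return "germanic", "latin", or None for a plain-text etymology.

    Counts how often each family's markers occur and picks the larger
    count. Ties (including no markers at all) return None so that the
    word can be dropped from the training set.
    """
    germanic = sum(text.count(marker) for marker in GERMANIC_MARKERS)
    latin = sum(text.count(marker) for marker in LATIN_MARKERS)
    if germanic == latin:
        return None
    return "germanic" if germanic > latin else "latin"


house = "From Middle English hous, from Old English hus, from Proto-Germanic."
justice = "From Old French justice, from Latin iustitia."
labels = [label_etymology(house), label_etymology(justice)]
```

Returning `None` for ambiguous text keeps uncertain words out of the training data rather than forcing them into one of the two classes.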

The raw data collected is under [/collected_etymology_dict.json](https://github.com/FrederickRoman/etymology-predictor/blob/main/machine_learning/preprocessing/CMU_source_dict.json)

If you want to rerun the data collection process (which may yield different results, since Wiktionary may have changed), run:

```
pip install -r requirements.txt
python machine_learning/preprocessing/web_scrape.py
```
## Machine Learning
For etymology prediction, I used a many-to-one RNN based on the PyTorch example found on the official website. All of the training code can be found under [/train.ipynb](https://github.com/FrederickRoman/etymology-predictor/blob/main/machine_learning/train.ipynb)
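As a rough illustration of what "character-level many-to-one" means, the sketch below one-hot encodes a word character by character, feeds the sequence through a recurrent layer, and classifies from the final hidden state only. It is not the code from the notebook: the use of `nn.RNN`, the lowercase ASCII alphabet, the hidden size, and the names `CharRNN` and `word_to_tensor` are all assumptions for this example.

```python
import string

import torch
import torch.nn as nn

ALPHABET = string.ascii_lowercase  # assumed character set (a-z only)
LABELS = ["germanic", "latin"]     # the two classes used in this project


def word_to_tensor(word):
    """One-hot encode a word as a (seq_len, 1, n_chars) tensor.

    Raises ValueError for characters outside ALPHABET.
    """
    word = word.lower()
    tensor = torch.zeros(len(word), 1, len(ALPHABET))
    for position, char in enumerate(word):
        tensor[position, 0, ALPHABET.index(char)] = 1.0
    return tensor


class CharRNN(nn.Module):
    """Many-to-one RNN: a sequence of characters in, one label score vector out."""

    def __init__(self, n_chars, hidden_size, n_labels):
        super().__init__()
        self.rnn = nn.RNN(n_chars, hidden_size)
        self.out = nn.Linear(hidden_size, n_labels)

    def forward(self, x):
        # hidden has shape (num_layers=1, batch, hidden_size)
        _, hidden = self.rnn(x)
        # Only the final hidden state is used, which makes it many-to-one.
        return self.out(hidden[0])


model = CharRNN(len(ALPHABET), hidden_size=64, n_labels=len(LABELS))
logits = model(word_to_tensor("house"))  # shape (1, 2), untrained scores
```

The model here is untrained, so the scores are meaningless until it is fitted, for example with `nn.CrossEntropyLoss` on the labeled words.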

Loss over iterations

Confusion matrix

## Server
The prediction of the etymology of a word is offered through a REST API.

To run the API at http://localhost:5000/etymology/{word} (commands for the Windows command prompt, cmd):

```
pip install -r requirements.txt
cd server
set FLASK_APP=server
flask run
```
To see the API's Swagger documentation, go to http://localhost:5000/doc
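Once the server is running, the endpoint can also be called from a script. The sketch below uses only the Python standard library; the helper names `etymology_url` and `fetch_etymology` are made up for this example, and the response is returned as raw text because the payload format is not described here.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # local Flask server started as shown above


def etymology_url(word, base_url=BASE_URL):
    """Build the GET /etymology/{word} URL, percent-encoding the word."""
    return f"{base_url}/etymology/{quote(word, safe='')}"


def fetch_etymology(word, base_url=BASE_URL, timeout=5.0):
    """Call the endpoint and return the raw response body as text.

    The body is not parsed, since its format is an unknown here.
    """
    with urlopen(etymology_url(word, base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server up, `print(fetch_etymology("house"))` prints whatever the API returns for that word.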

## Client
The etymology of a word can also be predicted through an interactive UI. To run it, start the server, then go to http://localhost:5000

### Project's client setup

```
cd client
npm install
```

#### Compiles and hot-reloads

```
npm run dev
```

#### Builds for production

```
npm run build
```
## Acknowledgements
The etymology prediction model was adapted from [NLP From Scratch: Classifying Names with a Character-Level RNN](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html).
