{"id":20703496,"url":"https://github.com/frederickroman/etymology-predictor","last_synced_at":"2026-04-20T12:33:12.208Z","repository":{"id":112471940,"uuid":"499310851","full_name":"FrederickRoman/etymology-predictor","owner":"FrederickRoman","description":"This is an etymology prediction RNN model that uses data scraped from Wiktionary.  ","archived":false,"fork":false,"pushed_at":"2022-07-04T03:54:25.000Z","size":1745,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-11T04:32:29.708Z","etag":null,"topics":["etymology","etymology-data","flask-api","full-stack","nlp-machine-learning","pytorch","rnn-pytorch","svelte-ts","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FrederickRoman.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-02T22:38:43.000Z","updated_at":"2022-10-20T04:53:32.000Z","dependencies_parsed_at":"2023-05-15T06:45:33.799Z","dependency_job_id":null,"html_url":"https://github.com/FrederickRoman/etymology-predictor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/FrederickRoman/etymology-predictor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FrederickRoman%2Fetymology-predictor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FrederickRoman%2Fetymology-predictor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/FrederickRoman%2Fetymology-predictor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FrederickRoman%2Fetymology-predictor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FrederickRoman","download_url":"https://codeload.github.com/FrederickRoman/etymology-predictor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FrederickRoman%2Fetymology-predictor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32047203,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T11:35:06.609Z","status":"ssl_error","status_checked_at":"2026-04-20T11:34:48.899Z","response_time":94,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["etymology","etymology-data","flask-api","full-stack","nlp-machine-learning","pytorch","rnn-pytorch","svelte-ts","web-scraping"],"created_at":"2024-11-17T01:08:15.744Z","updated_at":"2026-04-20T12:33:12.189Z","avatar_url":"https://github.com/FrederickRoman.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Etymology prediction\n\n## Live website\n\nSee [Etymology prediction live website](https://etymology-classifier.herokuapp.com/).\n\n## Character-sequence based English word etymology prediction\n\nThis 
project aims to predict the etymology of an English word while showcasing the whole process end-to-end.\n\n- Data collection:\n  + The training data was scraped from Wiktionary.\n- Machine Learning:\n  + It uses a character-level many-to-one RNN.\n- Server:\n  + It is a REST API that predicts a word's etymology through the endpoint GET /etymology/{word}\n- Client:\n  + It is a performant, lightweight, reactive UI that connects to the etymology prediction server.\n\n## Tech stack used in this project (all of it is in this repo)\n\n- Data collection:\n    + wiktionaryparser (Python)\n- Machine Learning:\n    + PyTorch\n- Server-side:\n    + Flask\n- Client-side:\n    + Svelte (ts)\n\n## Data collection\nThe training data was scraped from the etymology section of Wiktionary using wiktionaryparser.\n\nSince this etymology section is presented in plain text, the actual etymology labels for training must be extracted. For simplicity's sake, I only consider two possible etymologies: Germanic and Latin. This is, of course, a big oversimplification of the etymology of English words, but I thought that it could yield useful results nonetheless. I scraped the etymology of the words contained in the CMU dictionary.\n\nThe raw data collected is under [/collected_etymology_dict.json](https://github.com/FrederickRoman/etymology-predictor/blob/main/machine_learning/preprocessing/CMU_source_dict.json)\n\nIf you want to rerun the data collection process (which may yield different results since Wiktionary may have changed), run:\n\n```\npip install -r requirements.txt\npython machine_learning/preprocessing/web_scrape.py\n\n```\n## Machine Learning\nFor etymology prediction, I used a many-to-one RNN based on the PyTorch example found on the official website. 
All of the training can be found under [/train.ipynb](https://github.com/FrederickRoman/etymology-predictor/blob/main/machine_learning/train.ipynb)\n\nLoss over iterations\n\n\u003cdiv style=\"display:flex; justify-content:center; align-items:center;\"\u003e\n    \u003cimg src=\"https://github.com/FrederickRoman/etymology-predictor/blob/main/docs/ml/loss_over_iterations.png\" height=\"300\" alt=\"Loss over iterations\"/\u003e\n\u003c/div\u003e\n\nConfusion matrix\n\n\u003cdiv style=\"display:flex; justify-content:center; align-items:center;\"\u003e\n    \u003cimg src=\"https://github.com/FrederickRoman/etymology-predictor/blob/main/docs/ml/confusion_matrix.png\" height=\"300\" alt=\"Confusion matrix\"/\u003e\n\u003c/div\u003e\n\n## Server\nThe prediction of the etymology of a word is offered through a REST API.\n\nTo run the API from the Windows Command Prompt (cmd) at http://localhost:5000/etymology/{word}:\n\n```\npip install -r requirements.txt\ncd server\nset FLASK_APP=server\nflask run\n```\nTo see the API Swagger documentation, go to http://localhost:5000/doc\n\n\u003cdiv style=\"display:flex; justify-content:center; align-items:center;\"\u003e\n    \u003cimg src=\"https://github.com/FrederickRoman/etymology-predictor/blob/main/docs/server/api_swagger.png\" height=\"900\" alt=\"API Swagger documentation\"/\u003e\n\u003c/div\u003e\n\n## Client\nThe prediction of the etymology of a word can also be done through an interactive UI. 
To run it, start the server, then go to http://localhost:5000\n\n\u003cdiv style=\"display:flex; justify-content:center; align-items:center;\"\u003e\n    \u003cimg src=\"https://github.com/FrederickRoman/etymology-predictor/blob/main/docs/client/client_UI.png\" height=\"600\" alt=\"Client UI\"/\u003e\n\u003c/div\u003e\n\n### Project's client setup\n\n```\ncd client\nnpm install\n```\n\n#### Compiles and hot-reloads\n\n```\nnpm run dev\n```\n\n#### Builds for production\n\n```\nnpm run build\n```\n## Acknowledgements\n#### The etymology prediction model was adapted from NLP From Scratch: Classifying Names with a Character-Level RNN\nhttps://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffrederickroman%2Fetymology-predictor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffrederickroman%2Fetymology-predictor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffrederickroman%2Fetymology-predictor/lists"}