{"id":42669037,"url":"https://github.com/mdm-code/manx","last_synced_at":"2026-01-29T10:22:35.081Z","repository":{"id":177872475,"uuid":"470648513","full_name":"mdm-code/manx","owner":"mdm-code","description":"Fine-tune LLM for early Middle English lemmatization with data from LAEME.","archived":false,"fork":false,"pushed_at":"2024-01-25T20:58:14.000Z","size":161,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-01-25T21:43:46.206Z","etag":null,"topics":["deep-learning","lemmatization","lemmatizer","low-resource-languages","low-resource-machine-learning","low-resource-nlp","middle-english","neural-network","nlp","nlp-machine-learning","parsing"],"latest_commit_sha":null,"homepage":"https://github.com/mdm-code/manx","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mdm-code.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-03-16T15:45:33.000Z","updated_at":"2023-09-26T12:00:01.000Z","dependencies_parsed_at":"2024-01-25T21:51:07.674Z","dependency_job_id":null,"html_url":"https://github.com/mdm-code/manx","commit_stats":null,"previous_names":["mdm-code/manx"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/mdm-code/manx","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdm-code%2Fmanx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdm-code%2Fmanx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdm-code%2Fmanx/releases","manifests_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/mdm-code%2Fmanx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mdm-code","download_url":"https://codeload.github.com/mdm-code/manx/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdm-code%2Fmanx/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28875450,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-29T09:47:23.353Z","status":"ssl_error","status_checked_at":"2026-01-29T09:47:19.357Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","lemmatization","lemmatizer","low-resource-languages","low-resource-machine-learning","low-resource-nlp","middle-english","neural-network","nlp","nlp-machine-learning","parsing"],"created_at":"2026-01-29T10:22:34.300Z","updated_at":"2026-01-29T10:22:35.063Z","avatar_url":"https://github.com/mdm-code.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\n  \u003cdiv\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/mdm-code/mdm-code.github.io/main/manx_logo.png\" alt=\"logo\"/\u003e\n  \u003c/div\u003e\n\u003c/h1\u003e\n\n\u003ch4 align=\"center\"\u003eFine-tune LLM for early Middle English lemmatization\u003c/h4\u003e\n\n\u003cdiv align=\"center\"\u003e\n\u003cp\u003e\n    \u003ca 
href=\"https://github.com/mdm-code/manx/actions?query=workflow%3ACI\"\u003e\n        \u003cimg alt=\"Build status\" src=\"https://github.com/mdm-code/manx/workflows/CI/badge.svg\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://app.codecov.io/gh/mdm-code/manx\"\u003e\n        \u003cimg alt=\"Code coverage\" src=\"https://codecov.io/gh/mdm-code/manx/branch/main/graphs/badge.svg?branch=main\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://opensource.org/licenses/gpl-3\" rel=\"nofollow\"\u003e\n        \u003cimg alt=\"GPL-3 license\" src=\"https://img.shields.io/github/license/mdm-code/manx\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\u003c/div\u003e\n\nThe `manx` toolkit for early Middle English lemmatization is based on data from\nthe [LAEME](http://www.lel.ed.ac.uk/ihd/laeme2/laeme2.html) corpus.\n\n`manx` lets you fine-tune a ByT5 large language model for the downstream task\nof lemmatization of historical, early Middle English texts. The example\nfine-tuned `google/byt5-small` model published on\n[Huggingface](https://huggingface.co/mdm-code/me-lemmatize-byt5-small) achieves\na lemma accuracy of 92.5% on the validation part of the data split from the\nLAEME corpus. Manx was developed for research and educational purposes only. It\nshows how corpus data from historical languages can be used to fine-tune large\nlanguage models to support researchers in their daily work.\n\nThe project does not interfere with the copyright statement for LAEME given\n[here](http://www.lel.ed.ac.uk/ihd/laeme2/front_page/laeme_copyright.html). The\nLAEME data is not distributed, and it does not form any part of this project.\nThe toolkit uses the LAEME data only to allow users to fine-tune and use a\nlanguage model. The data is not persisted in any form in the project's online\nrepositories. 
The copyright statement for LAEME still applies to the data\npulled from the LAEME website and persisted in order to fine-tune the model.\n\nThe project is distributed under the GPL-3 license, meaning all derivatives of\nwhatever kind are to be distributed under the same GPL-3 license with all its\nparts and source code disclosed in full. Whenever the project is used, make sure\nto explicitly reference this repository and the original LAEME corpus. The\nlicense for the toolkit does not apply to the LAEME data, but it does apply to\nany software it operates on and the form of the data output of the Manx parser.\n\n\n## Installation\n\nIn order to use `manx` on your machine, you have to install it first using\nPython. You can install it from this repository with the following command:\n\n```sh\npython3 -m pip install manx@git+https://github.com/mdm-code/manx.git\n```\n\nI am not a big fan of cluttering the Python package index with all sorts of\ncode that folks come up with, and I decided to stick with a simple repository.\n\nAs for the version of Python, use Python `\u003e=3.10` as declared in the\n`pyproject.toml` file.\n\nOnce installed, you should be able to invoke `manx -h` from your terminal.\n\n\n## Usage\n\nYou can use `manx` to fiddle with the data from LAEME, fine-tune a T5 model\nyourself and serve it behind an API. You can key in `manx -h` to see all the\navailable options. There are three commands that `manx` supports:\n\n- `download`: It lets you download corpus files and store them on disk.\n- `parse`: It allows you to parse the corpus for model fine-tuning.\n- `api`: It lets you serve the fine-tuned model behind a REST API.\n\nThe `download` command is straightforward: you give it the `-r` root, and files\nare pulled from the website and stored on the drive. The command `parse` lets\nyou parse the corpus from the files you pulled with `download` or parse them\ndirectly from the web using the `--from-web` flag, meaning files will be stored\nin memory only. 
You can specify the length of parsed ngrams extracted from the\ncorpus or the size of document chunks later used to shuffle the corpus parts.\nThe two options are useful when `--format` is set to `t5`. The default command\nto get data from LAEME for model fine-tuning would look like this:\n\n```sh\nmanx parse \\\n\t--verbose \\\n\t--from-web \\\n\t--format t5 \\\n\t--ngram-size 11 \\\n\t--chunk-size 200 \\\n\t--t5prefix \"Lemmatize:\" \\\n\t--output t5-laeme-data.csv\n```\n\nYou can `head t5-laeme-data.csv` to get an idea of what the resulting CSV file\nlooks like.\n\nAs for the `api` command, it lets you specify the host and the port to serve the\nAPI. Other environment variables that can be specified in the `.env` file\nor exported in the local environment are given below, so feel free to tweak them\nto your liking.\n\n```sh\nMANX_API_HOST=localhost\nMANX_API_PORT=8000\nMANX_API_LOG_LEVEL=INFO\nMANX_API_TEXT_PLACEHOLDER=YOUR PLACEHOLDER TEXT\nMANX_MODEL_TYPE=byt5\nMANX_MODEL_DIR=mdm-code/me-lemmatize-byt5-small\nMANX_USE_GPU=False\n```\n\nYou can serve the API locally with default parameters like so: `manx api`. The\ndefault model hosted on Huggingface and used under the hood will be pulled the\nmoment the `/v1/lemmatize` API endpoint is called for the first time. You can\nchange the path through environment variables to point to your own models\nstored locally or hosted on Huggingface.\n\nWith `fastapi`, you get a Swagger browser GUI for free. Once the server is\nrunning, it can be accessed by default at `http://localhost:8000/docs`.\n\n\n## Running a container\n\nYou can serve the Manx API from inside of a container with an engine of your\nchoice. I'm using Podman but Docker works just fine. 
In order to do that, you\nhave to build the image with this command invoked from the project root\ndirectory:\n\n```sh\npodman build -t manx:latest .\n```\n\nThen you want to run it and `-d` detach it so that it runs in the background.\n\n```sh\npodman run -p 8000:8000 -d manx:latest\n```\n\n\n## Model training\n\nIn order to train the model, have a look at the Jupyter notebook at Google\nColab [byT5-simpleT5-eME-lemmatization-train.ipynb](https://colab.research.google.com/drive/1qpd4F8BoHMGzZnSqrGxZe-1YyX9IhVHc?usp=sharing).\nIt lets you fine-tune the base model checkpoint right off the bat, but you have\nto keep in mind that you'll need to have some compute units available for a better\nGPU option. The free T4 does not have enough memory to accommodate the model.\n\nSince the notebook uses `SimpleT5`, the name of the fine-tuned model is generated\ngiven the number of epochs, the loss value of the training set and the test\nset. Make sure you load it with the right name from the Colab local storage to\nevaluate its precision in terms of how many lemmas are predicted correctly.\n\n\n## Development\n\nYou want to have the package pulled the usual way with `git` and then installed\nfor development purposes with `python3 -m pip install -e .`. To run tests,\nlinters and type checkers, use `make test`. Have a look at the `Makefile` and\n`.github/workflows` to see what is already available and what is expected.\n\n\n## License\n\nCopyright (c) 2023 Michał Adamczyk.\n\nThis project is licensed under the [GPL-3 license](https://opensource.org/licenses/gpl-3-0).\nSee [LICENSE](LICENSE) for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdm-code%2Fmanx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmdm-code%2Fmanx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdm-code%2Fmanx/lists"}