{"id":16724206,"url":"https://github.com/elazarg/nakdimon","last_synced_at":"2025-03-03T06:09:41.577Z","repository":{"id":39851950,"uuid":"222968119","full_name":"elazarg/nakdimon","owner":"elazarg","description":"Hebrew Diacritizer","archived":false,"fork":false,"pushed_at":"2024-09-03T23:18:35.000Z","size":220460,"stargazers_count":36,"open_issues_count":2,"forks_count":7,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-02-24T05:07:00.010Z","etag":null,"topics":["diacritization","hebrew","hebrew-niqqud","machine-learning"],"latest_commit_sha":null,"homepage":"https://nakdimon.org","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elazarg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-11-20T15:26:13.000Z","updated_at":"2025-02-23T14:36:08.000Z","dependencies_parsed_at":"2024-10-27T12:10:29.153Z","dependency_job_id":null,"html_url":"https://github.com/elazarg/nakdimon","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elazarg%2Fnakdimon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elazarg%2Fnakdimon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elazarg%2Fnakdimon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elazarg%2Fnakdimon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elazarg","download_url":"https://codeload.github.com/elazarg/n
akdimon/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241616688,"owners_count":19991542,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diacritization","hebrew","hebrew-niqqud","machine-learning"],"created_at":"2024-10-12T22:44:18.419Z","updated_at":"2025-03-03T06:09:41.558Z","avatar_url":"https://github.com/elazarg.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Nakdimon: a simple Hebrew diacritizer\n\nRepository for the paper [Restoring Hebrew Diacritics Without a Dictionary](https://arxiv.org/abs/2105.05209) by Elazar Gershuni and Yuval Pinter.\n\nDemo: https://nakdimon.org/\n\nLocally:\n```\n$ pip install nakdimon\n$ diacritize input_file.txt -o=output_file.txt\n```\n\n## Building and running docker container\nBuild the docker container:\n```\n$ docker build -t nakdimon .\n```\n\nRun the docker container:\n```\n$ docker run --rm --gpus all --user 1000:1000 -it nakdimon /bin/bash\n```\n\nThe `--gpus all` flag is required to run the container with GPU support.\n\n## Training and evaluating\nTo train, test and evaluate the system, run the following commands:\n```\n\u003e python nakdimon train --model=models/Nakdimon.h5\n\u003e python nakdimon run_test --test_set=tests/new --model=models/Nakdimon.h5\n\u003e python nakdimon results --test_set=tests/new --systems Snopi Morfix Dicta MajAllWithDicta Nakdimon\n```\nThe first step trains the model and creates a file named `Nakdimon.h5` in the `models` directory.\nBy default, the model is the one described in the paper: 
`nakdimon/Nakdimon.h5`.\nIf the model already exists, you may skip this step. \n\nThe second step asks the Nakdimon server to predict the diacritics for the test set. You may skip this step.\nA folder for the results is created in the chosen test folder, with the same name as the model; in this case, `tests/new/NakdimonNew`.\nBy default, the test set is the one used in the paper (`tests/new`); you can use `tests/dicta` instead.\nIf the test results already exist, you may skip this step. If you are not sure, you can use the `--skip_existing` flag.\n\nThe third step calculates and prints the results (DEC, CHA, WOR and VOC metrics, as well as OOV_WOR and OOV_VOC).\nBy default, the systems are the folders in the chosen test folder.\nFor the Dicta test set (`tests/dicta`) you should use `MajAllNoDicta` instead of `MajAllWithDicta`; otherwise the vocabulary for the Majority would include the test set itself.\n\n## Diacritizing a single file\n```\n\u003e python nakdimon predict input_file.txt output_file.txt\n```\n\n## Using other systems\nYou can use the `run_test` command to run the test set on other systems, such as Dicta:\n```\n\u003e python nakdimon run_test --test_set=tests/new --system=Dicta\n```\nThis will create a folder named `Dicta` for the results in the `tests/new` folder.\nNote that `Morfix` cannot be used in this manner, as its license prohibits automatic use.\n\n## Running ablation tests\nYou can use the `--ablation` flag to train different models for the ablation tests and other experiments:\n```\n\u003e python nakdimon train --model=models/SingleLayer.h5 --ablation=SingleLayer\n```\nSee the file `ablation.py` for the list of available ablation parameters.\n\n## Important folders\n* `hebrew_diacritized` is the training set.\n* `tests` contains three test sets: `new`, `dicta` and `validation`.\n  Each test set has an `expected` folder that describes the ground truth.\n  The results of `python nakdimon run_test` are stored in a sibling folder, named after 
the model.\n* `models` contains the trained model.\n* `nakdimon` holds the source code.\n\n## Citation\n```\n@inproceedings{gershuni2022restoring,\n  title={Restoring Hebrew Diacritics Without a Dictionary},\n  author={Gershuni, Elazar and Pinter, Yuval},\n  booktitle={Findings of the Association for Computational Linguistics: NAACL 2022},\n  pages={1010--1018},\n  year={2022}\n}\n```\n\u003e Gershuni, Elazar, and Yuval Pinter. \"Restoring Hebrew Diacritics Without a Dictionary.\" Findings of the Association for Computational Linguistics: NAACL 2022. 2022.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felazarg%2Fnakdimon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felazarg%2Fnakdimon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felazarg%2Fnakdimon/lists"}