{"id":21151794,"url":"https://github.com/qubitpi/wiktionary-data","last_synced_at":"2025-07-17T02:40:18.922Z","repository":{"id":263726986,"uuid":"891270257","full_name":"QubitPi/wiktionary-data","owner":"QubitPi","description":"Wiktionary data in simple parsable formats hosted on 🤗 Datasets","archived":false,"fork":false,"pushed_at":"2024-12-14T21:30:50.000Z","size":365,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-21T08:22:55.100Z","etag":null,"topics":["ancient-greek","data","german","huggingface","huggingface-datasets","language","latin","natural-language-processing","nlp","old-persian","python","wiktionary","wiktionary-data"],"latest_commit_sha":null,"homepage":"https://huggingface.co/datasets/paion-data/wiktionary-data","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/QubitPi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-20T03:07:07.000Z","updated_at":"2024-12-14T21:29:34.000Z","dependencies_parsed_at":"2024-12-14T13:26:30.903Z","dependency_job_id":"e8da38fd-2d2f-4bb6-82be-9b3225686ce0","html_url":"https://github.com/QubitPi/wiktionary-data","commit_stats":null,"previous_names":["qubitpi/wiktionary-data"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QubitPi%2Fwiktionary-data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QubitPi%2Fwiktionary-data/tags","releases_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/QubitPi%2Fwiktionary-data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QubitPi%2Fwiktionary-data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/QubitPi","download_url":"https://codeload.github.com/QubitPi/wiktionary-data/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243597779,"owners_count":20316842,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ancient-greek","data","german","huggingface","huggingface-datasets","language","latin","natural-language-processing","nlp","old-persian","python","wiktionary","wiktionary-data"],"created_at":"2024-11-20T10:18:48.563Z","updated_at":"2025-03-14T14:42:59.427Z","avatar_url":"https://github.com/QubitPi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\nlicense: apache-2.0\npretty_name: English Wiktionary Data in JSONL\nlanguage:\n  - en\n  - de\n  - la\n  - grc\n  - ko\n  - peo\n  - akk\n  - elx\n  - sa\nconfigs:\n  - config_name: Wiktionary\n    data_files:\n    - split: German\n      path: german-wiktextract-data.jsonl\n    - split: Latin\n      path: latin-wiktextract-data.jsonl\n    - split: AncientGreek\n      path: ancient-greek-wiktextract-data.jsonl\n    - split: Korean\n      path: korean-wiktextract-data.jsonl\n    - split: OldPersian\n      path: old-persian-wiktextract-data.jsonl\n    - split: Akkadian\n      path: akkadian-wiktextract-data.jsonl\n    - split: Elamite\n      path: 
elamite-wiktextract-data.jsonl\n    - split: Sanskrit\n      path: sanskrit-wiktextract-data.jsonl\n  - config_name: Knowledge Graph\n    data_files:\n    - split: AllLanguage\n      path: word-definition-graph-data.jsonl\ntags:\n  - Natural Language Processing\n  - NLP\n  - Wiktionary\n  - Vocabulary\n  - German\n  - Latin\n  - Ancient Greek\n  - Korean\n  - Old Persian\n  - Akkadian\n  - Elamite\n  - Sanskrit\n  - Knowledge Graph\nsize_categories:\n  - 100M\u003cn\u003c1B\n---\n\nWiktionary Data on Hugging Face Datasets\n========================================\n\n[![Hugging Face dataset badge]][Hugging Face dataset URL]\n\n![Python Version Badge]\n[![GitHub workflow status badge][GitHub workflow status badge]][GitHub workflow status URL]\n[![Hugging Face sync status badge]][Hugging Face sync status URL]\n[![Apache License Badge]][Apache License, Version 2.0]\n\n[wiktionary-data]() is an extraction of a subset of the [English Wiktionary](https://en.wiktionary.org) that currently\nsupports the following languages:\n\n- __Deutsch__ - German\n- __Latinum__ - Latin\n- __Ἑλληνική__ - Ancient Greek\n- __한국어__ - Korean\n- __𐎠𐎼𐎹__ - [Old Persian](https://en.wikipedia.org/wiki/Old_Persian_cuneiform)\n- __𒀝𒅗𒁺𒌑(𒌝)__ - [Akkadian](https://en.wikipedia.org/wiki/Akkadian_language)\n- [Elamite](https://en.wikipedia.org/wiki/Elamite_language)\n- __संस्कृतम्__ - Sanskrit, or Classical Sanskrit\n\n[wiktionary-data]() was originally a sub-module of [wilhelm-graphdb](https://github.com/QubitPi/wilhelm-graphdb). As\nthe dataset grew, I noticed that its potential\nstretches beyond the scope of the containing project. 
Therefore I decided to promote it to a dedicated module, and here\ncomes this repo.\n\nThe Wiktionary language data is available on 🤗 [Hugging Face Datasets][Hugging Face dataset URL].\n\n```python\nfrom datasets import load_dataset\ndataset = load_dataset(\"QubitPi/wiktionary-data\")\n```\n\nThere are __two__ data subsets:\n\n1. __Languages__ subset that contains the extracted data for the supported languages:\n\n   ```python\n   dataset = load_dataset(\"QubitPi/wiktionary-data\", \"Wiktionary\")\n   ```\n   \n   The subset contains the following splits:\n\n   - `German`\n   - `Latin`\n   - `AncientGreek`\n   - `Korean`\n   - `OldPersian`\n   - `Akkadian`\n   - `Elamite`\n   - `Sanskrit`\n\n2. __Graph__ subset that is useful for constructing knowledge graphs:\n\n   ```python\n   dataset = load_dataset(\"QubitPi/wiktionary-data\", \"Knowledge Graph\")\n   ```\n   \n   The subset contains the following splits:\n\n   - `AllLanguage`: all the languages listed above in one large graph\n\n   The _Graph_ data ontology is the following:\n\n   \u003cdiv align=\"center\"\u003e\n       \u003cimg src=\"ontology.png\" size=\"50%\" alt=\"Error loading ontology.png\"/\u003e\n   \u003c/div\u003e\n\n\u003e [!TIP]\n\u003e\n\u003e Two words are structurally similar if and only if they share the same\n\u003e [stem](https://en.wikipedia.org/wiki/Word_stem)\n\nDevelopment\n-----------\n\n### Data Source\n\nAlthough [the original Wiktionary dump](https://dumps.wikimedia.org/) is available, parsing it from scratch involves a\nrather complicated process. For example,\n[acquiring the inflection data of most Indo-European languages on Wiktionary has already triggered some research-level efforts](https://stackoverflow.com/a/62977327).\nWe may do that in the future. 
At present, however, we simply rely on the excellent work of\n[tatuylonen](https://github.com/tatuylonen/wiktextract), which has already processed the dump and published it\n[in JSONL format](https://kaikki.org/dictionary/rawdata.html). wiktionary-data sources its data from the\n__raw Wiktextract data (JSONL, one object per line)__ option there.\n\n### Environment Setup\n\nGet the source code:\n\n```console\ngit clone git@github.com:QubitPi/wiktionary-data.git\ncd wiktionary-data\n```\n\nIt is strongly recommended to work in an isolated environment. Install virtualenv and create an isolated Python\nenvironment with\n\n```console\npython3 -m pip install --user -U virtualenv\npython3 -m virtualenv .venv\n```\n\nTo activate this environment:\n\n```console\nsource .venv/bin/activate\n```\n\nor, on Windows\n\n```console\n.venv\\Scripts\\activate\n```\n\n\u003e [!TIP]\n\u003e \n\u003e To deactivate this environment, use\n\u003e \n\u003e ```console\n\u003e deactivate\n\u003e ```\n\n### Installing Dependencies\n\n```console\npip3 install -r requirements.txt\n```\n\nLicense\n-------\n\nThe use and distribution terms for [wiktionary-data]() are covered by the [Apache License, Version 2.0].\n\n[Apache License Badge]: https://img.shields.io/badge/Apache%202.0-F25910.svg?style=for-the-badge\u0026logo=Apache\u0026logoColor=white\n[Apache License, Version 2.0]: https://www.apache.org/licenses/LICENSE-2.0\n\n[GitHub workflow status badge]: https://img.shields.io/github/actions/workflow/status/QubitPi/wiktionary-data/ci-cd.yaml?branch=master\u0026style=for-the-badge\u0026logo=github\u0026logoColor=white\u0026label=CI/CD\n[GitHub workflow status URL]: https://github.com/QubitPi/wiktionary-data/actions/workflows/ci-cd.yaml\n\n[Hugging Face dataset badge]: https://img.shields.io/badge/Hugging%20Face%20Dataset-wiktionary--data-FF9D00?style=for-the-badge\u0026logo=huggingface\u0026logoColor=white\u0026labelColor=6B7280\n[Hugging Face dataset URL]: 
https://huggingface.co/datasets/QubitPi/wiktionary-data\n\n[Hugging Face sync status badge]: https://img.shields.io/github/actions/workflow/status/QubitPi/wiktionary-data/ci-cd.yaml?branch=master\u0026style=for-the-badge\u0026logo=github\u0026logoColor=white\u0026label=Hugging%20Face%20Sync%20Up\n[Hugging Face sync status URL]: https://github.com/QubitPi/wiktionary-data/actions/workflows/ci-cd.yaml\n\n[Python Version Badge]: https://img.shields.io/badge/Python-3.10-FFD845?labelColor=498ABC\u0026style=for-the-badge\u0026logo=python\u0026logoColor=white\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqubitpi%2Fwiktionary-data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqubitpi%2Fwiktionary-data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqubitpi%2Fwiktionary-data/lists"}