Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/qubitpi/wiktionary-data
🤗 English Wiktionary hosted on Hugging Face Datasets
ancient-greek data german huggingface huggingface-datasets language latin natural-language-processing nlp old-persian python wiktionary wiktionary-data
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/qubitpi/wiktionary-data
- Owner: QubitPi
- License: apache-2.0
- Created: 2024-11-20T03:07:07.000Z (about 2 months ago)
- Default Branch: master
- Last Pushed: 2024-11-20T05:19:40.000Z (about 2 months ago)
- Last Synced: 2024-11-20T05:27:15.244Z (about 2 months ago)
- Topics: ancient-greek, data, german, huggingface, huggingface-datasets, language, latin, natural-language-processing, nlp, old-persian, python, wiktionary, wiktionary-data
- Language: Python
- Homepage: https://huggingface.co/datasets/QubitPi/wiktionary-data
- Size: 31.3 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
---
license: apache-2.0
pretty_name: English Wiktionary Data in JSONL
language:
  - en
  - de
  - la
  - grc
  - ko
  - peo
  - akk
  - sa
configs:
  - config_name: Wiktionary
    data_files:
      - split: German
        path: german-wiktextract-data.jsonl
      - split: Latin
        path: latin-wiktextract-data.jsonl
      - split: AncientGreek
        path: ancient-greek-wiktextract-data.jsonl
      - split: Korean
        path: korean-wiktextract-data.jsonl
      - split: OldPersian
        path: old-persian-wiktextract-data.jsonl
      - split: Akkadian
        path: akkadian-wiktextract-data.jsonl
      - split: Sanskrit
        path: sanskrit-wiktextract-data.jsonl
  - config_name: Knowledge Graph
    data_files:
      - split: AllLanguage
        path: word-definition-graph-data.jsonl
tags:
  - Natural Language Processing
  - NLP
  - Wiktionary
  - Vocabulary
  - German
  - Latin
  - Ancient Greek
  - Korean
  - Old Persian
  - Akkadian
  - Sanskrit
  - Knowledge Graph
size_categories:
  - 100M
---

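The front matter above maps each split to a single JSONL file. As a minimal sketch of how one of those files might be read locally, the snippet below streams a file one JSON object per line. The helper names `iter_entries` and `first_values` are hypothetical, and the `word` key used in the usage note is an assumption about the Wiktextract record layout that this README does not confirm.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, List


def iter_entries(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSONL file.

    Hypothetical helper for illustration; not part of wiktionary-data.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def first_values(path: str, key: str, limit: int = 5) -> List[Any]:
    """Collect the value of `key` from at most `limit` leading entries.

    Entries without `key` contribute None, so no particular schema is required.
    """
    values: List[Any] = []
    for entry in iter_entries(path):
        if len(values) >= limit:
            break
        values.append(entry.get(key))
    return values
```

With the German split file present in the working directory, `first_values("german-wiktextract-data.jsonl", "word")` would list the assumed `word` field of the first five entries. Streaming line by line keeps memory use flat, which matters for files of this size.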
> [!TIP]
>
> Two words are structurally similar if and only if they share the same
> [stem](https://en.wikipedia.org/wiki/Word_stem).

Development
-----------

### Data Source

Although [the original Wiktionary dump](https://dumps.wikimedia.org/) is available, parsing it from scratch involves
a rather complicated process. For example,
[acquiring the inflection data of most Indo-European languages on Wiktionary has already triggered some research-level efforts](https://stackoverflow.com/a/62977327).
We may do this ourselves in the future. At present, however, we simply rely on the excellent work of
[tatuylonen](https://github.com/tatuylonen/wiktextract), who has already processed the dump and published the result
[in JSONL format](https://kaikki.org/dictionary/rawdata.html). wiktionary-data sources its data from the
__raw Wiktextract data (JSONL, one object per line)__ option there.

### Environment Setup

Get the source code:
```console
git clone git@github.com:QubitPi/wiktionary-data.git
cd wiktionary-data
```

It is strongly recommended to work in an isolated environment. Install virtualenv and create an isolated Python
environment by running:

```console
python3 -m pip install --user -U virtualenv
python3 -m virtualenv .venv
```

To activate this environment:

```console
source .venv/bin/activate
```

or, on Windows:

```console
.venv\Scripts\activate
```

> [!TIP]
>
> To deactivate this environment, use
>
> ```console
> deactivate
> ```

### Installing Dependencies

```console
pip3 install -r requirements.txt
```

License
-------

The use and distribution terms for [wiktionary-data]() are covered by the [Apache License, Version 2.0].

[Apache License Badge]: https://img.shields.io/badge/Apache%202.0-F25910.svg?style=for-the-badge&logo=Apache&logoColor=white
[Apache License, Version 2.0]: https://www.apache.org/licenses/LICENSE-2.0

[Docker login command]: https://docker.qubitpi.org//reference/cli/docker/login/#options

[GitHub workflow status badge]: https://img.shields.io/github/actions/workflow/status/QubitPi/wiktionary-data/ci-cd.yaml?branch=master&style=for-the-badge&logo=github&logoColor=white&label=CI/CD
[GitHub workflow status URL]: https://github.com/QubitPi/wiktionary-data/actions/workflows/ci-cd.yaml

[Hugging Face dataset badge]: https://img.shields.io/badge/Hugging%20Face%20Dataset-wiktionary--data-FF9D00?style=for-the-badge&logo=huggingface&logoColor=white&labelColor=6B7280
[Hugging Face dataset URL]: https://huggingface.co/datasets/QubitPi/wiktionary-data
[Hugging Face sync status badge]: https://img.shields.io/github/actions/workflow/status/QubitPi/wiktionary-data/ci-cd.yaml?branch=master&style=for-the-badge&logo=github&logoColor=white&label=Hugging%20Face%20Sync%20Up
[Hugging Face sync status URL]: https://github.com/QubitPi/wiktionary-data/actions/workflows/ci-cd.yaml

[Python Version Badge]: https://img.shields.io/badge/Python-3.10-FFD845?labelColor=498ABC&style=for-the-badge&logo=python&logoColor=white