https://github.com/qubitpi/wiktionary-data

🤗 English Wiktionary hosted on Hugging Face Datasets

ancient-greek data german huggingface huggingface-datasets language latin natural-language-processing nlp old-persian python wiktionary wiktionary-data

---
license: apache-2.0
pretty_name: English Wiktionary Data in JSONL
language:
  - en
  - de
  - la
  - grc
  - ko
  - peo
  - akk
  - sa
configs:
  - config_name: Wiktionary
    data_files:
      - split: German
        path: german-wiktextract-data.jsonl
      - split: Latin
        path: latin-wiktextract-data.jsonl
      - split: AncientGreek
        path: ancient-greek-wiktextract-data.jsonl
      - split: Korean
        path: korean-wiktextract-data.jsonl
      - split: OldPersian
        path: old-persian-wiktextract-data.jsonl
      - split: Akkadian
        path: akkadian-wiktextract-data.jsonl
      - split: Sanskrit
        path: sanskrit-wiktextract-data.jsonl
  - config_name: Knowledge Graph
    data_files:
      - split: AllLanguage
        path: word-definition-graph-data.jsonl
tags:
  - Natural Language Processing
  - NLP
  - Wiktionary
  - Vocabulary
  - German
  - Latin
  - Ancient Greek
  - Korean
  - Old Persian
  - Akkadian
  - Sanskrit
  - Knowledge Graph
size_categories:
  - 100M
---
[Figure: ontology.png]

> [!TIP]
>
> Two words are structurally similar if and only if they share the same
> [stem](https://en.wikipedia.org/wiki/Word_stem)
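
The configs and splits declared in the dataset card metadata can be loaded with the Hugging Face `datasets` library. The following is a minimal sketch, not part of this repository: the helper `load_split` and the `SPLITS` table are hypothetical names introduced for illustration, and the repository ID `QubitPi/wiktionary-data` is assumed from the Hugging Face dataset URL referenced in this README.

```python
# Split names per config, copied from the dataset card metadata.
SPLITS = {
    "Wiktionary": [
        "German", "Latin", "AncientGreek", "Korean",
        "OldPersian", "Akkadian", "Sanskrit",
    ],
    "Knowledge Graph": ["AllLanguage"],
}


def load_split(config: str, split: str, repo_id: str = "QubitPi/wiktionary-data"):
    """Load one split of one config (hypothetical helper, assumes network access)."""
    if split not in SPLITS.get(config, []):
        raise ValueError(f"unknown config/split: {config!r}/{split!r}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, config, split=split)
```

Calling `load_split("Wiktionary", "German")` would then download and return the German split, assuming the `datasets` package is installed.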

Development
-----------

### Data Source

Although [the original Wiktionary dump](https://dumps.wikimedia.org/) is available, parsing it from scratch is a
rather complicated process. For example,
[acquiring the inflection data of most Indo-European languages on Wiktionary has already triggered some research-level efforts](https://stackoverflow.com/a/62977327).
We may do that in the future. At present, however, we simply build on the work of
[tatuylonen](https://github.com/tatuylonen/wiktextract), who has already processed the dump and published it
[in JSONL format](https://kaikki.org/dictionary/rawdata.html). wiktionary-data sources its data from the
__raw Wiktextract data (JSONL, one object per line)__ option there.
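
Because the raw data holds one JSON object per line, it can be read in a streaming fashion without loading a whole file into memory. The sketch below illustrates the idea on two hand-written sample records. The helper names `iter_entries` and `glosses` are hypothetical, and the field names `word`, `pos`, `senses`, and `glosses` are assumed to match the Wiktextract output; check them against the actual data before relying on them.

```python
import json
from typing import Iterable, Iterator


def iter_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def glosses(entry: dict) -> list[str]:
    """Collect all glosses of an entry, tolerating missing fields."""
    return [g for sense in entry.get("senses", []) for g in sense.get("glosses", [])]


# Hand-written sample records for illustration only, not real dump content.
SAMPLE = [
    '{"word": "Haus", "pos": "noun", "senses": [{"glosses": ["house"]}]}',
    "",
    '{"word": "gehen", "pos": "verb", "senses": [{"glosses": ["to go"]}, {"glosses": ["to walk"]}]}',
]

vocabulary = {entry["word"]: glosses(entry) for entry in iter_entries(SAMPLE)}
print(vocabulary)
```

For a real file, an open file handle can be passed in place of `SAMPLE`, since iterating over a text file also yields one line at a time.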

### Environment Setup

Get the source code:

```console
git clone git@github.com:QubitPi/wiktionary-data.git
cd wiktionary-data
```

It is strongly recommended to work in an isolated environment. Install virtualenv and create an isolated Python
environment with:

```console
python3 -m pip install --user -U virtualenv
python3 -m virtualenv .venv
```

To activate this environment:

```console
source .venv/bin/activate
```

or, on Windows:

```console
.venv\Scripts\activate
```

> [!TIP]
>
> To deactivate this environment, use
>
> ```console
> deactivate
> ```
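
To confirm that the isolated environment is actually active before installing anything, the interpreter itself can be queried. This is a small sketch using only the standard library; `in_virtualenv` is a hypothetical helper name, not something this repository provides.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # venv and recent virtualenv releases keep the base interpreter's location
    # in sys.base_prefix; older virtualenv releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


print("virtualenv active:", in_virtualenv())
```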

### Installing Dependencies

```console
pip3 install -r requirements.txt
```

License
-------

The use and distribution terms for [wiktionary-data](https://github.com/QubitPi/wiktionary-data) are covered by the [Apache License, Version 2.0].

[Apache License Badge]: https://img.shields.io/badge/Apache%202.0-F25910.svg?style=for-the-badge&logo=Apache&logoColor=white
[Apache License, Version 2.0]: https://www.apache.org/licenses/LICENSE-2.0

[Docker login command]: https://docker.qubitpi.org//reference/cli/docker/login/#options

[GitHub workflow status badge]: https://img.shields.io/github/actions/workflow/status/QubitPi/wiktionary-data/ci-cd.yaml?branch=master&style=for-the-badge&logo=github&logoColor=white&label=CI/CD
[GitHub workflow status URL]: https://github.com/QubitPi/wiktionary-data/actions/workflows/ci-cd.yaml

[Hugging Face dataset badge]: https://img.shields.io/badge/Hugging%20Face%20Dataset-wiktionary--data-FF9D00?style=for-the-badge&logo=huggingface&logoColor=white&labelColor=6B7280
[Hugging Face dataset URL]: https://huggingface.co/datasets/QubitPi/wiktionary-data

[Hugging Face sync status badge]: https://img.shields.io/github/actions/workflow/status/QubitPi/wiktionary-data/ci-cd.yaml?branch=master&style=for-the-badge&logo=github&logoColor=white&label=Hugging%20Face%20Sync%20Up
[Hugging Face sync status URL]: https://github.com/QubitPi/wiktionary-data/actions/workflows/ci-cd.yaml

[Python Version Badge]: https://img.shields.io/badge/Python-3.10-FFD845?labelColor=498ABC&style=for-the-badge&logo=python&logoColor=white