https://github.com/explosion/wikid
Generate a SQLite database from Wikipedia & Wikidata dumps.
- Host: GitHub
- URL: https://github.com/explosion/wikid
- Owner: explosion
- License: mit
- Created: 2022-10-21T10:12:39.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2024-03-27T10:56:42.000Z (almost 2 years ago)
- Last Synced: 2025-04-05T14:11:10.705Z (9 months ago)
- Topics: wikidata, wikipedia
- Language: Python
- Homepage:
- Size: 133 KB
- Stars: 33
- Watchers: 7
- Forks: 6
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# 🪐 spaCy Project: wikid
[tests](https://github.com/explosion/wikid/actions/workflows/tests.yml)
[spaCy](https://spacy.io)
_No REST for the `wikid`_ :jack_o_lantern: - generate a SQLite database
and a spaCy `KnowledgeBase` from Wikipedia & Wikidata dumps. `wikid` was
designed with the use case of named entity linking (NEL) with spaCy in mind.
Note that this repository is still experimental, so the public API
may change at any time.
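
Since the README does not document the layout of the generated SQLite database, a quick way to see what a run produced is to list its tables. The following is a minimal sketch using only Python's standard `sqlite3` module; the function name `list_tables` and any path passed to it are assumptions for illustration, not part of `wikid`'s API.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path."""
    # Open read-only so that a mistyped path raises an error
    # instead of silently creating a new, empty database file.
    uri = f"{Path(db_path).resolve().as_uri()}?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Pass the path of the database file that the `parse` command wrote in your checkout; the README does not state where that file lives, so check `project.yml` for the actual location.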
## 📋 project.yml
The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[spaCy projects documentation](https://spacy.io/usage/projects).
### ⏯ Commands
The following commands are defined by the project. They can be executed using
[`spacy project run [name]`](https://spacy.io/api/cli#project-run). Commands are
only re-run if their inputs have changed.
| Command          | Description                                                                                       |
| ---------------- | ------------------------------------------------------------------------------------------------- |
| `parse`          | Parse Wiki dumps. This can take a long time if you're not using the filtered dumps!               |
| `download_model` | Download the spaCy language model.                                                                |
| `create_kb`      | Create the spaCy `KnowledgeBase` from the SQLite database with Wiki content.                      |
| `delete_db`      | Delete the SQLite database that the `parse` step generated from the Wikidata and Wikipedia dumps. |
| `clean`          | Delete all generated artifacts except for the SQLite database.                                    |
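
If you want to drive these commands from a script rather than the shell, one option is to call the spaCy CLI through `subprocess`. This is only a sketch: the helper names `build_run_command` and `run_project_command` are hypothetical, and it assumes spaCy is installed in the same Python environment. The `--dry` flag is spaCy's own option for printing the steps without executing them.

```python
import subprocess
import sys


def build_run_command(name, project_dir=".", dry=False):
    """Build the argument list for `spacy project run`."""
    # Use the current interpreter so the call hits the same environment.
    cmd = [sys.executable, "-m", "spacy", "project", "run", name, str(project_dir)]
    if dry:
        cmd.append("--dry")  # dry run: show the steps, do not execute them
    return cmd


def run_project_command(name, project_dir=".", dry=False):
    """Run a project command or workflow; raise if it exits with an error."""
    subprocess.run(build_run_command(name, project_dir, dry), check=True)
```

For example, `run_project_command("parse", dry=True)` would preview the `parse` step from the project directory. The same call accepts a workflow name such as `all`.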
### ⏭ Workflows
The following workflows are defined by the project. They can be executed using
[`spacy project run [name]`](https://spacy.io/api/cli#project-run) and will run
the specified commands in order. Commands are only re-run if their inputs have
changed.
| Workflow | Steps                                    |
| -------- | ---------------------------------------- |
| `all`    | `parse` → `download_model` → `create_kb` |
### 🗂 Assets
The following assets are defined by the project. They can be fetched by running
[`spacy project assets`](https://spacy.io/api/cli#project-assets) in the project
directory.
| File | Source | Description |
| ----------------------------------------------- | ------ | --------------------------------------------------------------- |
| `assets/wikidata_entity_dump.json.bz2` | URL | Wikidata entity dump. Download can take a long time! |
| `assets/wikipedia_dump.xml.bz2` | URL | Wikipedia dump. Download can take a long time! |
| `assets/wikidata_entity_dump_filtered.json.bz2` | URL | Filtered Wikidata entity dump for demo purposes (English only). |
| `assets/wikipedia_dump_filtered.xml.bz2` | URL | Filtered Wikipedia dump for demo purposes (English only). |