https://github.com/wikidata/soweego
Link Wikidata items to large catalogs
data-matching entity-linking entity-resolution identifiers knowledge-graph record-linkage wikidata wikimedia
- Host: GitHub
- URL: https://github.com/wikidata/soweego
- Owner: Wikidata
- License: gpl-3.0
- Created: 2018-07-11T18:00:28.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2024-03-18T19:59:39.000Z (9 months ago)
- Last Synced: 2024-11-05T04:34:12.583Z (about 2 months ago)
- Topics: data-matching, entity-linking, entity-resolution, identifiers, knowledge-graph, record-linkage, wikidata, wikimedia
- Language: Python
- Homepage: https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
- Size: 7.87 MB
- Stars: 96
- Watchers: 7
- Forks: 9
- Open Issues: 61
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# soweego: link Wikidata to large catalogs
[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/Wikidata/soweego/master.svg)](https://results.pre-commit.ci/latest/github/Wikidata/soweego/master)
[![Documentation Status](https://readthedocs.org/projects/soweego/badge/?version=latest)](https://soweego.readthedocs.io/en/latest/?badge=latest)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![License](https://img.shields.io/github/license/Wikidata/soweego.svg)](https://www.gnu.org/licenses/gpl-3.0.html)

*soweego* is a pipeline that connects [Wikidata](https://wikidata.org/) to large-scale third-party catalogs.
*soweego* is the only system that makes *statisticians, epidemiologists, historians,* and *computer scientists* agree.
Why? Because it performs *record linkage, data matching,* and *entity resolution* at the same time.
Too easy, they all seem to be [synonyms](https://en.wikipedia.org/wiki/Record_linkage#Naming_conventions)!

Oh, *soweego* also embeds [Machine Learning](https://en.wikipedia.org/wiki/Machine_learning) and advocates for [Linked Data](https://en.wikipedia.org/wiki/Linked_data).
![Is soweego similar to the Go game?](https://upload.wikimedia.org/wikipedia/commons/9/96/Crosscut.jpg)
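
For readers new to the term, record linkage boils down to deciding which records in two datasets describe the same entity. The sketch below is a toy illustration only and not *soweego*'s actual linker: it assumes each dataset is a plain dictionary from identifier to name, and the function names (`normalize`, `link_by_name`) and all identifiers in the sample data are hypothetical.

```python
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    return " ".join(without_accents.lower().split())


def link_by_name(wikidata: dict[str, str], catalog: dict[str, str]) -> dict[str, str]:
    """Map each Wikidata ID to the single catalog ID with the same normalized name."""
    # Index catalog identifiers by normalized name.
    index: dict[str, list[str]] = {}
    for catalog_id, name in catalog.items():
        index.setdefault(normalize(name), []).append(catalog_id)

    links: dict[str, str] = {}
    for item_id, name in wikidata.items():
        candidates = index.get(normalize(name), [])
        if len(candidates) == 1:  # ambiguous or missing names are left unlinked
            links[item_id] = candidates[0]
    return links


# Made-up identifiers, for illustration only.
wikidata_items = {"Q_A": "Björk", "Q_B": "John Smith"}
catalog_entries = {"c1": "bjork", "c2": "John  Smith", "c3": "john smith"}
print(link_by_name(wikidata_items, catalog_entries))  # {'Q_A': 'c1'}
```

Exact matching on a normalized name is about the simplest rule one could write; it leaves `Q_B` unlinked because two catalog entries share that name, which is the kind of ambiguity that motivates supervised linkers.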
# Official Project Pages
*soweego* is made possible thanks to the [Wikimedia Foundation](https://wikimediafoundation.org/):
- https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego
- https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2

# Documentation
https://soweego.readthedocs.io/

# Highlights
- Run the whole [pipeline](#run-the-pipeline), or
- use the [command line](#use-the-command-line);
- [import](https://soweego.readthedocs.io/en/latest/importer.html) large catalogs into a SQL database;
- [gather](https://soweego.readthedocs.io/en/latest/wikidata.html) live Wikidata datasets;
- [connect](https://soweego.readthedocs.io/en/latest/linker.html) them to target catalogs via *rule-based* and *supervised* linkers;
- [upload](https://soweego.readthedocs.io/en/latest/ingester.html) links to Wikidata and [Mix'n'match](https://tools.wmflabs.org/mix-n-match/);
- [synchronize](https://soweego.readthedocs.io/en/latest/validator.html#module-soweego.validator.checks) Wikidata to imported catalogs;
- [enrich](https://soweego.readthedocs.io/en/latest/validator.html#module-soweego.validator.enrichment) Wikidata items with relevant statements.

# Get Ready
Install [Docker](https://docs.docker.com/install/) and [Compose](https://docs.docker.com/compose/install/), then enter *soweego*:
```
$ git clone -b v1.1 https://github.com/Wikidata/soweego.git
$ cd soweego
$ ./docker/run.sh
Building soweego
...
root@70c9b4894a30:/app/soweego#
```
Now it's too late to get out!
# Run the Pipeline
Piece of cake:
```
:/app/soweego# python -m soweego run CATALOG
```
Pick `CATALOG` from `discogs`, `imdb`, or `musicbrainz`.
These steps are executed by default:
1. import the target catalog into a local database;
2. link Wikidata to the target with a supervised linker;
3. synchronize Wikidata to the target.

Results are in `/app/shared/results`.
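
To run the pipeline for several catalogs in a row, the command above can be assembled from Python. The following is a minimal sketch under stated assumptions: it relies only on the `python -m soweego run CATALOG` invocation, the three catalog names, and the results path given above, while the helper names (`build_run_command`, `list_results`) are hypothetical and not part of *soweego*. It prints the commands instead of executing them, since they only work inside the container.

```python
import shlex
from pathlib import Path

# Catalog names listed in this README.
SUPPORTED_CATALOGS = ("discogs", "imdb", "musicbrainz")
# Results directory named in this README; it exists only inside the container.
RESULTS_DIR = Path("/app/shared/results")


def build_run_command(catalog: str) -> list[str]:
    """Return the argument list for `python -m soweego run CATALOG`."""
    if catalog not in SUPPORTED_CATALOGS:
        raise ValueError(f"unsupported catalog: {catalog!r}")
    return ["python", "-m", "soweego", "run", catalog]


def list_results(results_dir: Path = RESULTS_DIR) -> list[Path]:
    """Return all files under the results directory, sorted by path."""
    if not results_dir.is_dir():
        return []
    return sorted(p for p in results_dir.rglob("*") if p.is_file())


for catalog in SUPPORTED_CATALOGS:
    # Only print each command here; to launch it inside the container,
    # pass the list to subprocess.run(..., check=True).
    print(shlex.join(build_run_command(catalog)))
```

Once a run has finished, `list_results()` gives a quick overview of what was written to the results directory. It returns an empty list when the directory is missing, for example when called outside the container.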
# Use the Command Line
You can launch every single *soweego* action with CLI commands:
```
:/app/soweego# python -m soweego
Usage: soweego [OPTIONS] COMMAND [ARGS]...

  Link Wikidata to large catalogs.

Options:
  -l, --log-level ...
                                  Module name followed by one of [DEBUG, INFO,
                                  WARNING, ERROR, CRITICAL]. Multiple pairs
                                  allowed.
  --help                          Show this message and exit.

Commands:
  importer  Import target catalog dumps into a SQL database.
  ingester  Take soweego output into Wikidata items.
  linker    Link Wikidata items to target catalog identifiers.
  run       Launch the whole pipeline.
  sync      Sync Wikidata to target catalogs.
```
Just two things to remember:
1. you can always get `--help`;
2. each command may have sub-commands.

# Contribute
The best way is to [import a new catalog](https://soweego.readthedocs.io/en/latest/new_catalog.html).
Please also have a look at the [guidelines](CONTRIBUTING.md).

# License
The source code is under the terms of the [GNU General Public License, version 3](https://www.gnu.org/licenses/gpl.html).