https://github.com/wmde/wikidata-constraints-violation-checker
a tool to analyze constraint violations on Wikidata
- Host: GitHub
- URL: https://github.com/wmde/wikidata-constraints-violation-checker
- Owner: wmde
- License: bsd-3-clause
- Created: 2020-12-08T10:30:59.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2023-09-05T13:05:28.000Z (over 1 year ago)
- Last Synced: 2025-03-27T22:23:07.628Z (about 1 month ago)
- Topics: wikidata
- Language: Python
- Size: 39.1 KB
- Stars: 13
- Watchers: 19
- Forks: 6
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# Wikidata Constraints Violation Checker
The Wikidata Constraints Violation Checker analyzes the number of constraint violations on a list of Wikidata Items. This helps you identify which Items most need improvement and assess the data quality of a specific area of Wikidata.
## Installation
This script requires at least Python 3.6. In your terminal, run:

```bash
git clone https://github.com/wmde/wikidata-constraints-violation-checker.git
cd wikidata-constraints-violation-checker
pip3 install -r requirements.txt
```

## Usage
```bash
# To run the script with an input file
python3 checkDataQuality.py -i

# To run the script using randomly generated Item IDs
python3 checkDataQuality.py -r

# You can also specify an output filename
python3 checkDataQuality.py -i -o

# Or a batch size
python3 checkDataQuality.py -r -b
```

| Arg | Name | Description |
| :-: | ----------------------- | -------------------------------------------------------------------------------------- |
| -i | Input file | The path to the file containing the input data |
| -r | Randomly generate Items | The number of Items to randomly generate |
| -o | Output file | The path to the file for output |
| -b | Batch Size | The list of Items is broken down into batches for processing. Default value is 10 |

## Input Data
The script can read CSV files or generate random Item IDs.
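A CSV in the format shown in the next section can be produced with a few lines of Python. This is only a sketch: the item list and labels are illustrative, and only the QID in the first column matters to the checker.

```python
import csv

# Illustrative (QID, label) pairs; the checker only reads the first
# column, the label is just for human readability.
items = [("Q60", "New York"), ("Q64", "Berlin"), ("Q70", "Bern")]

with open("input.csv", "w", newline="") as f:
    csv.writer(f).writerows(items)
```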
### CSV File
Example input file; the first column will be used to query for constraint violations:
```csv
Q60,New York
Q64,Berlin
Q70,Bern
Q84,London
Q90,Paris
```

## Output Data
The following fields are provided in the output data for Items that are successfully checked.
| Field | Description |
| :-------------------------: | ------------------------------------------------------------------------------------------------------------------------------ |
| QID | The unique Item identifier |
| statements | Total amount of statements on the Item |
| violations_mandatory_level | # of violations at a [mandatory level](https://www.wikidata.org/wiki/Wikidata:2020_report_on_Property_constraints#mandatory) |
| violations_normal_level | # of violations at a [normal level](https://www.wikidata.org/wiki/Wikidata:2020_report_on_Property_constraints#normal) |
| violations_suggestion_level | # of violations at a [suggestion level](https://www.wikidata.org/wiki/Wikidata:2020_report_on_Property_constraints#suggestion) |
| violated_statements | # of statements with violations |
| total_sitelinks | # of sitelinks on the Item |
| wikipedia_sitelinks | # of sitelinks to Wikipedia |
| ores_score | [ORES Item quality score](https://www.wikidata.org/wiki/Wikidata:Item_quality), from 1 to 5 (lowest to highest) |

## Note
Please be aware that some large Items are skipped during the analysis because the constraint check API times out for them.
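For context, per-level counts like `violations_mandatory_level` come from the Wikidata API's `wbcheckconstraints` module, which reports a check result per statement. The sketch below shows how such a response could be tallied into the three levels; the sample payload is hand-written and heavily trimmed, and the mapping of result statuses to constraint levels is an assumption, not necessarily what this script does internally.

```python
from collections import Counter

# Hand-written, trimmed sample shaped like a wbcheckconstraints response;
# the real data comes from action=wbcheckconstraints on www.wikidata.org.
sample = {
    "wbcheckconstraints": {
        "Q60": {
            "claims": {
                "P31": [
                    {"id": "Q60$abc", "results": [
                        {"status": "violation"},
                        {"status": "compliance"},
                    ]},
                ],
                "P17": [
                    {"id": "Q60$def", "results": [
                        {"status": "warning"},
                        {"status": "suggestion"},
                    ]},
                ],
            }
        }
    }
}

# Assumed mapping of result status to constraint level.
LEVELS = {"violation": "mandatory", "warning": "normal", "suggestion": "suggestion"}

def count_violations(entity):
    """Tally violations per level and count statements with any violation."""
    counts = Counter()
    violated = 0
    for statements in entity["claims"].values():
        for statement in statements:
            hits = [LEVELS[r["status"]]
                    for r in statement.get("results", [])
                    if r["status"] in LEVELS]
            counts.update(hits)
            if hits:
                violated += 1
    return counts, violated

counts, violated_statements = count_violations(sample["wbcheckconstraints"]["Q60"])
```

For the sample above this yields one violation at each level and two violated statements, mirroring the `violations_*_level` and `violated_statements` output fields.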