https://github.com/nichtich/jq-wikidata
jq module to process Wikidata JSON format
- Host: GitHub
- URL: https://github.com/nichtich/jq-wikidata
- Owner: nichtich
- License: mit
- Created: 2019-04-27T17:51:50.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2019-06-04T19:55:51.000Z (about 7 years ago)
- Last Synced: 2025-02-28T06:21:22.095Z (over 1 year ago)
- Topics: jq, json, wikibase, wikidata
- Language: JSONiq
- Size: 71.3 KB
- Stars: 11
- Watchers: 5
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# jq-wikidata
> jq module to process Wikidata JSON format
This git repository contains a module for the [jq data transformation language](https://stedolan.github.io/jq/) to process entity data from [Wikidata](https://www.wikidata.org) or other [Wikibase](http://wikiba.se/) instances serialized in its JSON format.
Several methods exist [to get entity data from Wikidata](https://www.wikidata.org/wiki/Wikidata:Data_access).
This module is designed to process entities [in their JSON serialization](https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON)
especially for large numbers of entities. Please also consider using a dedicated client such as
[wikidata-cli] instead.
[wikidata-cli]: https://www.npmjs.com/package/wikidata-cli
## Table of Contents
* [Install](#install)
* [Usage](#usage)
  * [Process JSON dumps](#process-json-dumps)
  * [Per-item access](#per-item-access)
  * [Reduce entity data](#reduce-entity-data)
* [API](#api)
  * [Reduce entity](#reduce-entity)
  * [Reduce item](#reduce-item)
  * [Reduce property](#reduce-property)
  * [Reduce labels](#reduce-labels)
  * [Reduce descriptions](#reduce-descriptions)
  * [Reduce aliases](#reduce-aliases)
  * [Reduce sitelinks](#reduce-sitelinks)
  * [Reduce claims](#reduce-claims)
  * [Reduce claim](#reduce-claim)
  * [Reduce references](#reduce-references)
  * [Reduce lexeme](#reduce-lexeme)
  * [Reduce forms](#reduce-forms)
  * [Reduce senses](#reduce-senses)
  * [Reduce info](#reduce-info)
  * [Stream an array of entities](#stream-an-array-of-entities)
* [Contributing](#contributing)
* [License](#license)
## Install
Installation requires [jq](https://stedolan.github.io/jq/) version 1.5 or newer.
Put `wikidata.jq` in a place where jq can [find it as a module](https://stedolan.github.io/jq/manual/#Modules).
One way to do so is to check out this repository to directory `~/.jq/wikidata/`:
~~~sh
mkdir -p ~/.jq && git clone https://github.com/nichtich/jq-wikidata.git ~/.jq/wikidata
~~~
## Usage
The shortest way to use functions of this jq module is to `include` the module directly. Try processing a single Wikidata entity (see below for details about [per-item access](#per-item-access)):
~~~sh
wget http://www.wikidata.org/wiki/Special:EntityData/Q42.json
jq 'include "wikidata"; .entities[].labels|reduceLabels' Q42.json
~~~
It is recommended to put Wikidata entities in a newline-delimited JSON file:
~~~sh
jq -c '.entities[]' Q42.json > entities.ndjson
jq -c 'include "wikidata"; .labels|reduceLabels' entities.ndjson
~~~
More complex scripts are better put into a `.jq` file:
~~~jq
include "wikidata";
.labels|reduceLabels
~~~
The file can then be processed this way:
~~~sh
jq -f script.jq entities.ndjson
~~~
### Process JSON dumps
Wikidata JSON dumps are made available at <https://dumps.wikimedia.org/wikidatawiki/entities/>.
The current dumps exceed 35GB even in their most compressed form. Each dump file contains one large JSON
array, so it is better converted into a stream of JSON objects for further processing.
With a fast and stable internet connection, it is possible to process the dump on the fly like this:
~~~sh
curl -s https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 \
| bzcat | jq -nc --stream 'include "wikidata"; ndjson' | jq .id
~~~
### Per-item access
JSON data for single entities can be obtained via the
[Entity Data URL](https://www.wikidata.org/wiki/Special:EntityData). Examples:
* <https://www.wikidata.org/wiki/Special:EntityData/Q42.json>
The module function `entity_data_url` creates these URLs from Wikidata
identifier strings. The resulting data is wrapped in a JSON object; unwrap it with
`.entities|.[]`:
~~~bash
curl $(echo Q42 | jq -rR 'include "wikidata"; entity_data_url') | jq '.entities|.[]'
~~~
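As a rough illustration of what such a URL builder could look like, the sketch below uses only jq builtins. The function name `entity_url` and the exact URL pattern are assumptions modeled on the `Special:EntityData/Q42.json` address used earlier; the module's own `entity_data_url` may behave differently.

~~~sh
# Hypothetical stand-in for the module's entity_data_url (not its actual code):
# read one identifier per line as a raw string and interpolate it into the URL.
entity_url() {
  jq -rR '"https://www.wikidata.org/wiki/Special:EntityData/\(.).json"'
}

echo Q42 | entity_url
# prints https://www.wikidata.org/wiki/Special:EntityData/Q42.json
~~~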
As mentioned above, it is better to use [wikidata-cli] for accessing small sets of items:
~~~bash
wd d Q42
~~~
To get sets of items that match given criteria, use either SPARQL or the MediaWiki API
modules [wbsearchentities] and/or [wbgetentities].
[wbsearchentities]: https://www.wikidata.org/w/api.php?action=help&modules=wbsearchentities
[wbgetentities]: https://www.wikidata.org/w/api.php?action=help&modules=wbgetentities
### Reduce entity data
Use function [reduceEntity](#reduce-entity) or more specific functions
([reduceInfo](#reduce-info), [reduceItem](#reduce-item),
[reduceProperty](#reduce-property), [reduceLexeme](#reduce-lexeme)) to
reduce the JSON data structure without loss of essential information.
Further select only some specific fields if needed:
~~~sh
jq '{id,labels}' entities.ndjson
~~~
## API
### Reduce entity
Applies [reduceInfo](#reduce-info) and one of [reduceItem](#reduce-item),
[reduceProperty](#reduce-property), [reduceLexeme](#reduce-lexeme).
~~~jq
reduceEntity
~~~
### Reduce item
Simplifies labels, descriptions, aliases, claims, and sitelinks of an item.
~~~jq
reduceItem
~~~
### Reduce property
Simplifies labels, descriptions, aliases, and claims of a property.
~~~jq
reduceProperty
~~~
### Reduce labels
~~~jq
.labels|reduceLabels
~~~
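To give an idea of the kind of simplification involved, the sketch below flattens labels with plain jq. It assumes the reduced form maps each language code directly to its label string, which this README does not spell out; the sample input follows the Wikibase JSON layout, and `flatten_labels` is a hypothetical helper, not part of the module.

~~~sh
# Sample labels object in Wikibase JSON layout (values chosen for illustration)
labels='{"en":{"language":"en","value":"Douglas Adams"},"fr":{"language":"fr","value":"Douglas Adams"}}'

# Hypothetical approximation of label reduction: keep only .value per language
flatten_labels() {
  jq -c 'map_values(.value)'
}

printf '%s\n' "$labels" | flatten_labels
# prints {"en":"Douglas Adams","fr":"Douglas Adams"}
~~~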
### Reduce descriptions
~~~jq
.descriptions|reduceDescriptions
~~~
### Reduce aliases
~~~jq
.aliases|reduceAliases
~~~
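Aliases differ from labels in that each language holds a list of values. A comparable plain-jq sketch, again only an assumption about what the reduced shape might look like and using the hypothetical helper name `flatten_aliases`, could be:

~~~sh
# Sample aliases object in Wikibase JSON layout (values chosen for illustration)
aliases='{"en":[{"language":"en","value":"Douglas Noel Adams"},{"language":"en","value":"DNA"}]}'

# Hypothetical approximation: per language, keep the list of alias strings
flatten_aliases() {
  jq -c 'map_values(map(.value))'
}

printf '%s\n' "$aliases" | flatten_aliases
# prints {"en":["Douglas Noel Adams","DNA"]}
~~~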
### Reduce sitelinks
~~~jq
.sitelinks|reduceSitelinks
~~~
### Reduce lexeme
Simplifies lemmas, forms, and senses of a lexeme entity.
~~~jq
reduceLexeme
~~~
### Reduce senses
~~~jq
.senses|reduceSenses
~~~
### Reduce claims
Removes unnecessary fields `.id`, `.hash`, `.type`, `.property` and simplifies
values for each claim.
~~~jq
.claims|reduceClaims
~~~
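The field removal described above can be pictured with jq's `del`. This is a sketch only: where exactly the module strips `id`, `hash`, `type`, and `property` is an assumption here, the value simplification is not reproduced, and both the sample claim and the helper name `strip_claim_fields` are made up for illustration.

~~~sh
# Made-up claims object with a single P26 statement in Wikibase JSON layout
claims='{"P26":[{"id":"Q42-abc","type":"statement","rank":"normal","mainsnak":{"snaktype":"value","property":"P26","hash":"deadbeef","datatype":"wikibase-item","datavalue":{"type":"wikibase-entityid","value":{"id":"Q123"}}}}]}'

# Hypothetical approximation: drop bookkeeping fields from every claim.
# The property id is still available as the key of the enclosing object.
strip_claim_fields() {
  jq -c 'map_values(map(del(.id, .type, .mainsnak.hash, .mainsnak.property)))'
}

printf '%s\n' "$claims" | strip_claim_fields
~~~

The remaining structure still carries the rank and the data value, so nothing needed to interpret the statement is lost in this sketch.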
### Reduce claim
Reduces a single claim value.
~~~jq
.claims.P26[]|reduceClaim
~~~
### Reduce references
...
### Reduce forms
Only lexemes have forms.
~~~jq
.forms|reduceForms
~~~
### Reduce info
~~~jq
reduceInfo
~~~
Removes the additional information fields `pageid`, `ns`, `title`, `lastrevid`, and `modified`.
To remove selected fields only, see the jq function [`del`](https://stedolan.github.io/jq/manual/#del(path_expression)).
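For comparison, the same five fields can be dropped with `del` alone. The sketch below is not the module's code, only an equivalent-looking filter based on the field list above; the sample entity and the helper name `drop_info` are made up for illustration.

~~~sh
# Made-up entity stub carrying the five page-level fields named above
entity='{"pageid":138,"ns":0,"title":"Q42","lastrevid":1,"modified":"2019-06-01T00:00:00Z","id":"Q42","type":"item"}'

# Hypothetical equivalent of the described behaviour using only jq's del
drop_info() {
  jq -c 'del(.pageid, .ns, .title, .lastrevid, .modified)'
}

printf '%s\n' "$entity" | drop_info
# prints {"id":"Q42","type":"item"}
~~~

Listing fewer paths in `del` keeps the others, for example `del(.pageid, .ns)` to retain `title` and `modified`.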
### Stream an array of entities
Module function `ndjson` can be used to process a stream with an array of
entities into a list of entities:
~~~sh
bzcat latest-all.json.bz2 | jq -nc --stream 'include "wikidata"; ndjson'
~~~
Alternative, possibly more performant methods to process an array of entities [are described here](https://lucaswerkmeister.de/posts/2017/09/03/wikidata+dgsh/):
~~~sh
bzcat latest-all.json.bz2 | head -n-1 | tail -n+2 | sed 's/,$//'
~~~
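To try the array-to-stream conversion on a small input without the module, jq's streaming builtins can be combined as shown below. This is the generic idiom from the jq documentation, offered as a sketch; it is not claimed to be how `ndjson` is implemented, and `array_to_ndjson` is a hypothetical helper name.

~~~sh
# Hypothetical helper: turn one top-level JSON array into one object per line.
# --stream emits [path, value] events; truncate_stream drops the outer array
# index from each path, and fromstream reassembles the individual elements.
array_to_ndjson() {
  jq -nc --stream 'fromstream(1|truncate_stream(inputs))'
}

printf '%s\n' '[{"id":"Q1"},{"id":"Q2"}]' | array_to_ndjson
# prints {"id":"Q1"} and {"id":"Q2"} on separate lines
~~~

Because the input is parsed incrementally, the whole array never has to be held in memory, which is what makes this approach usable for large dumps.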
## Contributing
The source code is hosted at <https://github.com/nichtich/jq-wikidata>.
Bug reports and feature requests [are welcome](https://github.com/nichtich/jq-wikidata/issues/new)!
## License
Made available under the MIT License by Jakob Voß.