Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nichtich/wikidata-taxonomy
command-line tool to extract taxonomies from Wikidata
https://github.com/nichtich/wikidata-taxonomy
cli coli-conc jskos library wikidata
Last synced: 3 months ago
JSON representation
command-line tool to extract taxonomies from Wikidata
- Host: GitHub
- URL: https://github.com/nichtich/wikidata-taxonomy
- Owner: nichtich
- License: mit
- Created: 2016-07-29T18:48:09.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2019-06-19T18:44:56.000Z (over 5 years ago)
- Last Synced: 2024-10-15T13:23:10.990Z (3 months ago)
- Topics: cli, coli-conc, jskos, library, wikidata
- Language: JavaScript
- Homepage: https://www.npmjs.org/package/wikidata-taxonomy
- Size: 292 KB
- Stars: 125
- Watchers: 9
- Forks: 11
- Open Issues: 24
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- License: LICENSE
Awesome Lists containing this project
- awesome-starred - nichtich/wikidata-taxonomy - command-line tool to extract taxonomies from Wikidata (cli)
README
# Wikidata-Taxonomy
Command-line tool and library to extract taxonomies from [Wikidata](https://wikidata.org).
![](https://github.com/nichtich/wikidata-taxonomy/raw/master/img/wdtaxonomy-example.png)
## Installation
wikidata-taxonomy requires at least [NodeJs](https://nodejs.org) version 6.
Install globally to make command `wdtaxonomy` accessible from your shell `$PATH`:
```sh
$ npm install -g wikidata-taxonomy
```Installation and usage [as module](#usage-as-module) and [in web applications](#usage-in-web-applications) is described below.
## Usage
This module provides the command `wdtaxonomy`. By default, a usage help is printed:
```
$ wdtaxonomyUsage: wdtaxonomy [options]
extract taxonomies from Wikidata
Options:
-V, --version output the version number
-b, --brief omit counting instances and sites
-c, --children get direct subclasses only
-C, --color enforce color output
-d, --descr include item descriptions
-e, --sparql-endpoint customize the SPARQL endpoint
-f, --format output format
-i, --instances include instances
-I, --no-instancecount omit counting instances
-j, --json use JSON output format
-l, --lang specify the language to use
-L, --no-labels omit all labels
-m, --mappings mapping properties (e.g. P1709)
-n, --no-colors disable color output
-o, --output write result to a file
-P, --property hierarchy property (e.g. P279)
-R, --prune prune hierarchies (e.g. mappings)
-p, --post use HTTP POST to disable caching
-r, --reverse get superclasses instead
-s, --sparql print SPARQL query and exit
-S, --no-sitecount omit counting sites
-t, --total count total number of instances
-u, --user user to the SPARQL endpoint
-U, --uris show full URIs in output formats
-v, --verbose make the output more verbose
-w, --password password to the SPARQL endpoint
-h, --help output usage information
```The first arguments needs to be a Wikidata identifier to be used as root of the taxonomy. For instance extract a taxonomy of planets ([Q634]):
```sh
$ wdtaxonomy Q634
```To look up by label, use [wikidata-cli] (e.g `wd id planet` or `wd f planet`).
The extracted taxonomy by default is based on statements using the property "subclass of" ([P279]) or "subproperty of" ([P1647]). Taxonomy extraction and output can be controlled by several [options](#options). Option `--sparql` (or `-s`) prints the underlying SPARQL queries instead of executing them.
### Examples
**Direct subclasses of planet ([Q634]) with description and mappings:**
$ wdtaxonomy Q634 -c -d -m =
The hierarchy properties [P279] ("subclass of") and [P31] ("instance of") to build taxonomies from can be changed with option `property` (`-P`).
**Members of ([P463]) the European Union ([Q458]):**
$ wdtaxonomy Q458 -P P463
**Members of ([P463]) the European Union ([Q458]) and number of its citizens in Wikidata ([P27]):**
$ wdtaxonomy Q458 -P 463/27
**Wikiversity ([Q370]) editions mapped to their homepage URL ([P856]):**
$ wdtaxonomy Q370 -i -m P856
**Biological taxonomy of mammals ([Q7377]):**
$ wdtaxonomy Q7377 -P P171 --brief
**Property constraints ([Q21502402]) with number of properties that have each constraint:**
$ wdtaxonomy Q21502402 -P 279,2302
As Wikidata is no strict ontology, subproperties are not factored in. For instance this query does not include members of the European Union although [P463] is a subproperty of [P361].
**Parts of ([P361]) the European Union ([Q458]):**
$ wdtaxonomy Q458 -P P361
A taxonomy of subproperties can be queried like taxonomies of items. The hierarchy property is set to [P1647] ("subproperty of") by default:
$ wdtaxonomy P361
$ wdtaxonomy P361 -P P1647 # equivalent**Subproperties of "part of" ([P361]) and which of them have an inverse property ([P1696]):**
$ wdtaxonomy P361 -P P1647/P1696
Inverse properties are neither factored in so queries like these do not necessarily return the same results:
**What hand ([Q33767]) is part of ([P361]):**
$ wdtaxonomy Q33767 -P 361 -r
**What parts the hand ([Q33767]) has ([P527]):**
$ wdtaxonomy Q33767 -P 527
## Options
### Query options
#### brief (`-b`)
Don't count instance and sites. Same as `-S/--no-sitecount` and `-I/--no-instancecount`.
#### children (`-c`)
Get direct subclasses only
#### descr (`-d`)
Include item descriptions
#### sparql-endpoint (`-e`)
SPARQL endpoint to query (default: )
#### instances (`-i`)
Include instances
#### no-instancecount (`-I`)
Don't count number of instances
#### lang (`-l`)
Language to get labels in (default: `en`)
#### no-labels (`-L`)
Omit all labels. This allows for querying larger taxonomies (several thousands of classes), especially if combined with option `--brief`.
#### mappings (`-m`)
Lookup mappings based on given comma-separated properties such as [P1709] (equivalent class). The following keywords can be used as shortcuts:
* `equal` or `=`: equivalent property ([P1628]), equivalent class ([P1709]), and exact match ([P2888])
* `broader`: external superproperty ([P2235])
* `narrower`: narrower external class ([P3950]), external subproperty ([P2236])
* `class`: properties for mapping classes
* `property`: properties for mapping properties
* `all` all properties for ontology mapping (instances of [Q30249126])#### reverse (`-r`)
Get superclasses instead of subclasses up to the root
#### no-sitecount (`-I`)
Don't count number of sites
#### total (`-t`)
Count total (transitive) number of instances, including instances of subclasses
#### post (`-p`)
Use HTTP POST to disable caching
#### sparql (`-s`)
Don't actually perform a query but print SPARQL query and exit
#### user (`-u`)
User to the SPARQL endpoint
#### password (`-w`)
Password to the SPARQL endpoint
### Output options
#### color (`-C`)
enable color output if it's disabled (e.g. when output is piped or written to a file)
#### format (`-f`)
Output format
#### json (`-j`)
Use JSON output format. Same as `--format json` but shorter.
#### no-colors (`-n`)
disable color output
#### output (`-o`)
write result to a file given by name
#### prune (`-R`)
prune hierarchy to all entries with any of a given criteria plus their broader
concepts and all top concepts:* `mappings`: has mappings
* `sites`: has sites
* `instances`: has instances
* `occurences` : has sites or instancesMultiple criteria can be combined as alternatives with comma.
#### uris (`-U`)
Show full URIs in output formats, e.g. instead of `Q1`
#### verbose (`-v`)
Show verbose error messages
## Output formats
### Text format
By default, the taxonomy is printed in "`text`" format with colored Unicode characters:
```sh
$ wdtaxonomy Q17362350
```
```
planet of the Solar System (Q17362350) •2 ↑
├──outer planet (Q30014) •25 ×4 ↑↑
└──inner planets (Q3504248) •8 ×4 ↑↑```
The output contains item labels, Wikidata identifiers, the number of Wikimedia
sites connected to each item (indicated by bullet character "`•`"), the number
of instances (property [P31]), indicated by a multiplication sign "`×`"), and
an upwards arrow ("`↑`") as indicator for additional superclasses.Option "`--instances`" (or "`-i`") explicitly includes instances:
```sh
$ wdtaxonomy -i Q17362350
```
```
planet of the Solar System (Q17362350) •2 ↑
├──outer planet (Q30014) •25 ↑↑
| -Saturn (Q193)
| -Jupiter (Q319)
| -Uranus (Q324)
| -Neptune (Q332)
└──inner planets (Q3504248) •8 ↑↑
-Earth (Q2)
-Mars (Q111)
-Mercury (Q308)
-Venus (Q313)
```Classes that occur at multiple places in the taxonomy (multihierarchy) are marked like in the following example:
```sh
$ wdtaxonomy Q634
```
```
planet (Q634) •202 ×7 ↑
├──extrasolar planet (Q44559) •88 ×833 ↑
| ├──circumbinary planet (Q205901) •15 ×10
| ├──super-Earth (Q327757) •32 ×46 ↑
...
├──terrestrial planet (Q128207) •70 ×7
| ╞══super-Earth (Q327757) •32 ×46 ↑ …
...
```### JSON format
Option `--format json` serializes the taxonomy as JSON object. The format follows specification of [JSKOS Concept Schemes](https://gbv.github.io/jskos/jskos.html#concept-schemes):
~~~json
{
"type": [ "http://www.w3.org/2004/02/skos/core#ConceptScheme" ],
"modified": "2017-11-06T10:25:54.966Z",
"license": [
{
"uri": "http://creativecommons.org/publicdomain/zero/1.0/",
"notation": [ "CC0" ]
}
],
"languages": [ "en" ],
"topConcepts": [
{ "uri": "http://www.wikidata.org/entity/Q17362350" }
],
"concepts": [ ]
}
~~~Field `concepts` contains an array of all extracted Wikidata entities (usually classes and instances) as [JSKOS Concepts](https://gbv.github.io/jskos/jskos.html#concepts):
~~~json
{
"uri": "http://www.wikidata.org/entity/Q17362350",
"notation": [ "Q17362350" ],
"prefLabel": {
"en": "planet of the Solar System"
},
"scopeNote": {
"en": [ "inner and outer planets of our solar system" ]
},
"broader": [
{ "uri": "http://www.wikidata.org/entity/Q634" }
],
"narrower": [
{ "uri": "http://www.wikidata.org/entity/Q30014" },
{ "uri": "http://www.wikidata.org/entity/Q3504248" }
]
}
~~~Instances (option `--instances`) are linked via field `subjectOf` the same way as field `broader` and `narrower`.
The number of instances and sites, if counted is given as array of [JSKOS Concept Occurrences](https://gbv.github.io/jskos/jskos.html#concept-occurrences) in field `occurrences`, each identified by subfield `relation`:
~~~json
{
"uri": "http://www.wikidata.org/entity/Q30014",
"notation": [ "Q30014" ],
"prefLabel": {
"en": "outer planet of the Solar system"
},
"occurrences": [
{
"relation": "http://www.wikidata.org/entity/P31",
"count": 4
},
{
"relation": "http://schema.org/about",
"count": 25
}
]
}
~~~Mappings (option `--mappings`) are stored in field `mappings` as array of [JSKOS Concept Mappings](https://gbv.github.io/jskos/jskos.html#concept-mappings):
~~~json
[
{
"from": {
"memberSet": [
{ "uri": "http://www.wikidata.org/entity/Q634" }
]
},
"to": {
"memberSet": [
{ "uri": "http://dbpedia.org/ontology/Planet" }
]
},
"type": [
"http://www.w3.org/2004/02/skos/core#exactMatch",
"http://www.w3.org/2002/07/owl#equivalentClass",
"http://www.wikidata.org/entity/P1709"
]
}
]
~~~The mapping type is given in field `type` with the Wikidata property URI as last array element and the SKOS mapping relation URI as first.
## NDJSON format
Option `--format ndjson` serializes JSON field `concepts` with one record per line. The order if records is same as in txt, json, and csv format but each concept is only included once.
### CSV and TSV format
CSV and TSV format are optimized for comparing differences in
time. Each output row consists of five fields:* **level** in the hierarchy indicated by zero or more "`-`" (default) or "`=`"
characters (multihierarchy).* **id** of the item. Items on the same level are sorted by their id.
* **label** of the item. Language can be selected with option `--language`.
The label in csv format is quoted.* **sites**: number of connected sites (Wikipedia and related project editions).
Larger numbers may indicate more established concepts.* **parents** outside of the hierarchy, indicated by zero or more "`^`" characters.
For instance the CSV output for [Q634] would be like this:
```sh
$ wdtaxonomy -f csv Q634
```
```csv
level,id,label,sites,instances,parents
,Q634,"planet",196,7,^
-,Q44559,"extrasolar planet",81,833,^
--,Q205901,"circumbinary planet",14,10,
--,Q327757,"super-Earth",32,46,
...
-,Q128207,"terrestrial planet",67,7,
==,Q327757,"super-Earth",32,46,
...```
In this example there are 196 Wikipedia editions or other sites with an article
about planets and seven Wikidata items are direct instance of a planet. At the
end of the line "`^`" indicates that "planet" has one superclass. In the next
rows "extrasolar planet" ([Q44559]) is a subclass of planet with another
superclass indicated by "`^`". Both "circumbinary planet" and "super-Earth" are
subclasses of "extrasolar planet". The latter also occurs as subclass of
"terrestrial planet" where it is marked by "`==`" instead of "`--`".## Usage as module
Add `wikidata-taxonomy` as dependency to you `package.json`:
```sh
$ npm install wikidata-taxonomy --save
```The library provides:
* `queryTaxonomy(id, options)` returns a promise with a taxonomy extracted from Wikidata
as [JSKOS Concept Scheme](https://gbv.github.io/jskos/jskos.html#concept-schemes). See
[JSON format](#json-format) of the command line client for documentation.~~~javascript
const { queryTaxonomy } = require('wikidata-taxonomy')var options = { lang: 'fr', brief: true }
queryTaxonomy('Q634', lang)
.then(taxonomy => {
taxonomy.concepts.forEach(concept => {
var qid = concept.notation[0]
var label = (concept.prefLabel || {}).fr || '???'
console.log('%s %s', qid, label)
})
})
.catch(error => console.error("E",error))
~~~Options roughly equivalent command line [query options](#query-options):
* boolean flags `brief`, `children`, `description`, `labels`, `total`,
`instances`, `instancecount`, `sitecount`, `reverse`, `post`
* SPARQL endpoint configuration with `endpoint`, `user`, `password`
* language tag `language` or `lang`
* array `property` (set to `['P279', 'P31']` by default)
* array or string `mappings`* `serializeTaxonomy` contains serializers to be called with a taxonomy, an output stream,
and optional configuration:~~~javascript
const { serializeTaxonomy } = require('wikidata-taxonomy')// serialize taxonomy to stream
serializeTaxonomy.csv(taxonomy, process.stdout)
serializeTaxonomy.txt(taxonomy, process.stdout, {colors: true}) // FIXME
serializeTaxonomy.json(taxonomy, process.stdout)
serializeTaxonomy.ndjson(taxonomy, process.stdout)
~~~## Usage in web applications
Experimental support of this library in web application is given with file `wikidata-taxonomy.js` in directoy `dist`. The `gh-pages` branch contains a sample application, also available at .
Requires wikidata-sdk and a HTTP client library. The latter can be attached to `window.requestPromise` (before wikidata-taxonomy is loaded). Axios is detected by default.
~~~html
...
~~~
## Release notes
Release notes are listed in file [CHANGES.md](https://github.com/nichtich/wikidata-taxonomy/blob/master/CHANGES.md#changelog) in the source code repository.
## See Also
[![build status](https://api.travis-ci.org/nichtich/wikidata-taxonomy.svg?branch=master)](http://travis-ci.org/nichtich/wikidata-taxonomy)
[![npm version](http://img.shields.io/npm/v/wikidata-taxonomy.svg?style=flat)](https://www.npmjs.org/package/wikidata-taxonomy)
[![Documentation Status](https://readthedocs.org/projects/wdtaxonomy/badge/?version=latest)](http://wdtaxonomy.readthedocs.io/en/latest/?badge=latest)**This document**
* [at GitHub](https://github.com/nichtich/wikidata-taxonomy/blob/master/README.md)
* [at npmjs](https://www.npmjs.com/package/wikidata-taxonomy)
* [at Read the Docs](https://wdtaxonomy.readthedocs.io/en/latest/)**Related tools**
* [wikidata-sdk](https://npmjs.com/package/wikidata-sdk) is used by this module
* [wikidata-cli] provide more generic command line tools for Wikidata
* [taxonomy browser](http://sergestratan.bitbucket.org/) is a web application
based on Wikidata dumps
* [Wikidata Graph Builder](https://angryloki.github.io/wikidata-graph-builder/) another web application
* [wdmapper](https://wdmapper.readthedocs.io/) Wikidata authority file mapping tool
* [Wikidata ontology explorer](https://lucaswerkmeister.github.io/wikidata-ontology-explorer/)[wikidata-cli]: https://npmjs.com/package/wikidata-cli
**Publications**
* Jakob Voß (2016): *Classification of Knowledge Organization Systems with Wikidata*. CEUR Workshop Proceedings 1676. (=[Q27261879])
[P1628]: http://www.wikidata.org/entity/P1628
[P1647]: http://www.wikidata.org/entity/P1647
[P1696]: http://www.wikidata.org/entity/P1696
[P1709]: http://www.wikidata.org/entity/P1709
[P171]: http://www.wikidata.org/entity/P171
[P2235]: http://www.wikidata.org/entity/P2235
[P2236]: http://www.wikidata.org/entity/P2236
[P27]: http://www.wikidata.org/entity/P27
[P279]: http://www.wikidata.org/entity/P279
[P2888]: http://www.wikidata.org/entity/P2888
[P31]: http://www.wikidata.org/entity/P31
[P361]: http://www.wikidata.org/entity/P361
[P3950]: http://www.wikidata.org/entity/P3950
[P463]: http://www.wikidata.org/entity/P463
[P527]: http://www.wikidata.org/entity/P527
[P856]: http://www.wikidata.org/entity/P856
[Q111]: http://www.wikidata.org/entity/Q111
[Q128207]: http://www.wikidata.org/entity/Q128207
[Q17362350]: http://www.wikidata.org/entity/Q17362350
[Q193]: http://www.wikidata.org/entity/Q193
[Q2]: http://www.wikidata.org/entity/Q2
[Q205901]: http://www.wikidata.org/entity/Q205901
[Q21502402]: http://www.wikidata.org/entity/Q21502402
[Q27261879]: http://www.wikidata.org/entity/Q27261879
[Q30014]: http://www.wikidata.org/entity/Q30014
[Q30249126]: http://www.wikidata.org/entity/Q30249126
[Q308]: http://www.wikidata.org/entity/Q308
[Q313]: http://www.wikidata.org/entity/Q313
[Q319]: http://www.wikidata.org/entity/Q319
[Q324]: http://www.wikidata.org/entity/Q324
[Q327757]: http://www.wikidata.org/entity/Q327757
[Q332]: http://www.wikidata.org/entity/Q332
[Q33767]: http://www.wikidata.org/entity/Q33767
[Q3504248]: http://www.wikidata.org/entity/Q3504248
[Q370]: http://www.wikidata.org/entity/Q370
[Q44559]: http://www.wikidata.org/entity/Q44559
[Q458]: http://www.wikidata.org/entity/Q458
[Q634]: http://www.wikidata.org/entity/Q634
[Q7377]: http://www.wikidata.org/entity/Q7377