https://github.com/flavienbwk/wikidata-jena-docker

Wikidata import in Apache Jena tripletstore (TDB) to be queried with SparQL.
https://github.com/flavienbwk/wikidata-jena-docker

jena ontology sparql tdb triplestore wikidata

Last synced: 7 months ago
JSON representation

Wikidata import in Apache Jena tripletstore (TDB) to be queried with SparQL.

Host: GitHub
URL: https://github.com/flavienbwk/wikidata-jena-docker
Owner: flavienbwk
Created: 2022-01-13T23:40:20.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2022-02-06T16:30:37.000Z (over 3 years ago)
Last Synced: 2025-01-28T16:17:19.322Z (9 months ago)
Topics: jena, ontology, sparql, tdb, triplestore, wikidata
Language: Dockerfile
Homepage:
Size: 6.84 KB
Stars: 0
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# WikiData Jena Docker

WikiData import in Apache Jena tripletstore (TDB) to be queried with SparQL.

## Terminology

- Apache Jena : Java framework for building Semantic Web and Linked Data applications
- Apache Jena Fuseki : SparQL server
- Apache Jena TDB : A RDF storage and query DBMS

## Download WikiData

From [WikiData dumps](https://dumps.wikimedia.org/wikidatawiki/entities/), download the [`latest-all.ttl.gz`](https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz) file (~104 Go)

## Start Fuseki

Run the following commands :

```bash
mkdir fuseki-data
sudo chown 9008 fuseki-data # Internal fuseki user
sudo chown 9008 -R fuseki-configuration # Internal fuseki user

docker-compose up -d
```

## Resources considerations

:warning: Importing Wikidata takes a **lot of time** to index the 16.805 billion triples on a 24-threads bi-Xeon E5 CPU and 189Gb of RAM (128Gb of RAM is sufficient, consumption is below 100Gb). We recommend using a [cloud provider](https://www.scaleway.com/en/elastic-metal/).

There are two parts :

- Data storage (took: 2d16h17m)
- Data indexing (SPO took: 32h50m, POS: 77h19m, OPS: 32h01m)

> Number of triples at the time of the writing is (a bit more than) `16 805 375 870`

## Import data

_This requires 701Go of disk space at the time of the writing._

**Place** the downloaded `latest-all.ttl.gz` file into the `wikidata/` directory

**Unzip** the downloaded file :

```bash
cd wikidata
gzip -d latest-all.ttl.gz
```

**Import** the data :

```bash
docker-compose exec fuseki bash

# Inside container
cd /jena-tools/apache-jena-4.3.2/bin
/jena-tools/apache-jena-4.3.2/bin/tdb2.xloader --loc /fuseki-base/databases/wikidata /wikidata/latest-all.ttl # this takes a LOT of time
```

## SparQL querying

Now, you can go to the dashboard page of your server at port `:3030` and test the following query :

```sparql
PREFIX wd:
PREFIX wdt:
PREFIX rdfs:

SELECT ?charLabel ?groupLabel
WHERE {
?group wdt:P31 wd:Q14514600; # is a group of fictional characters
wdt:P1080 wd:Q931597. # from fictional Marvel universe
?char wdt:P463 ?group. # Member of the group
?char rdfs:label ?charLabel. # Label of the character
?group rdfs:label ?groupLabel. # Label of the group
FILTER (LANG(?charLabel) = 'fr'). # Get labels on french language
FILTER (LANG(?groupLabel) = 'fr'). # Get labels on french language
}
LIMIT 1000
```

Connect with user `admin` and the password set in your docker-compose configuration

Or query manually inside the container :

```sparql
/jena-tools/apache-jena-4.3.2/bin/tdb2.tdbquery --loc /fuseki-base/databases/wikidata "

PREFIX wd:
PREFIX wdt:
PREFIX rdfs:

## Credits

- Using [fuseki-docker](https://github.com/SemanticComputing/fuseki-docker)
- [Importing WikiData into Jena](https://muncca.com/2019/02/14/wikidata-import-in-apache-jena/#top)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/flavienbwk/wikidata-jena-docker

Awesome Lists containing this project

README