https://github.com/rdflib/rdflib-hdt

A Store back-end for rdflib to allow for reading and querying HDT documents
hdt python rdf rdflib sparql store

Host: GitHub
URL: https://github.com/rdflib/rdflib-hdt
Owner: RDFLib
License: mit
Created: 2020-03-18T10:44:01.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2024-11-26T02:46:08.000Z (7 months ago)
Last Synced: 2025-03-24T06:50:43.745Z (3 months ago)
Topics: hdt, python, rdf, rdflib, sparql, store
Language: C++
Homepage: https://rdflib.dev/rdflib-hdt
Size: 6.18 MB
Stars: 27
Watchers: 9
Forks: 9
Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE

![](docs/source/_static/rdflib-hdt-250.png)

# rdflib-hdt

![Python tests](https://github.com/RDFLib/rdflib-hdt/workflows/Python%20tests/badge.svg) [![PyPI version](https://badge.fury.io/py/rdflib-hdt.svg)](https://badge.fury.io/py/rdflib-hdt)

A Store back-end for [rdflib](https://github.com/RDFLib) to allow for reading and querying HDT documents.

[Online Documentation](https://rdflib.dev/rdflib-hdt/)

# Requirements

* Python *version 3.6.4 or higher*

* [pip](https://pip.pypa.io/en/stable/)

* **gcc/clang** with **C++11 support**

* **Python Development headers**

> You should have the `Python.h` header available on your system.   

> For example, for Python 3.6, install the `python3.6-dev` package on Debian/Ubuntu systems.
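
As a quick way to check this requirement before installing, the following sketch uses only the standard library to look for `Python.h` in the include directory reported by `sysconfig`. The helper names are hypothetical and not part of rdflib-hdt. Some distributions may place headers elsewhere, so treat a negative result as a hint rather than proof.

```python
import os
import sysconfig


def python_header_path():
    """Return the path where Python.h is expected for the running interpreter."""
    # sysconfig reports the include directory used when building C extensions.
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def has_python_headers():
    """Return True if Python.h exists at the expected location."""
    return os.path.isfile(python_header_path())


if __name__ == "__main__":
    if has_python_headers():
        print(f"Found {python_header_path()}")
    else:
        print(f"Python.h not found at {python_header_path()}; "
              "install the Python development headers first.")
```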

# Installation

Installation using [pipenv](https://github.com/pypa/pipenv) or a [virtualenv](https://virtualenv.pypa.io/en/stable/) is **strongly advised!**

## PyPI installation (recommended)

```bash
# you can install using pip
pip install rdflib-hdt

# or you can use pipenv
pipenv install rdflib-hdt
```

## Manual installation

**Requirement:** [pipenv](https://github.com/pypa/pipenv) 

```
git clone https://github.com/Callidon/pyHDT
cd pyHDT/
./install.sh
```

# Getting started

You can use the `rdflib-hdt` library in two modes: as an rdflib Graph or as a raw HDT document.

## Graph usage (recommended)

```python
from rdflib import Graph
from rdflib.namespace import FOAF
from rdflib_hdt import HDTStore

# Load an HDT file. Missing indexes are generated automatically.
# You can provide the index file by putting it in the same directory as the HDT file.
store = HDTStore("test.hdt")

# Display some metadata about the HDT document itself
print(f"Number of RDF triples: {len(store)}")
print(f"Number of subjects: {store.nb_subjects}")
print(f"Number of predicates: {store.nb_predicates}")
print(f"Number of objects: {store.nb_objects}")
print(f"Number of shared subject-object: {store.nb_shared}")

# Create an RDFlib Graph with the HDT document as a backend
graph = Graph(store=store)

# Fetch all triples that match { ?s foaf:name ?o }
# Use None to indicate variables
for s, p, o in graph.triples((None, FOAF.name, None)):
    print(s, p, o)
```

Using the RDFlib API, you can also [execute SPARQL queries](https://rdflib.readthedocs.io/en/stable/intro_to_sparql.html) over an HDT document.
If you do so, we recommend that you first call the `optimize_sparql` function, which optimizes the RDFlib SPARQL query engine for HDT documents.

```python
from rdflib import Graph
from rdflib_hdt import HDTStore, optimize_sparql

# Calling this function optimizes the RDFlib SPARQL engine for HDT documents
optimize_sparql()

graph = Graph(store=HDTStore("test.hdt"))

# You can execute SPARQL queries using the regular RDFlib API
qres = graph.query("""
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name ?friend WHERE {
    ?a foaf:knows ?b.
    ?a foaf:name ?name.
    ?b foaf:name ?friend.
  }""")

for row in qres:
    print(f"{row.name} knows {row.friend}")
```

## HDT Document usage

```python
from rdflib.namespace import FOAF
from rdflib_hdt import HDTDocument

# Load an HDT file. Missing indexes are generated automatically.
# You can provide the index file by putting it in the same directory as the HDT file.
document = HDTDocument("test.hdt")

# Display some metadata about the HDT document itself
print(f"Number of RDF triples: {document.total_triples}")
print(f"Number of subjects: {document.nb_subjects}")
print(f"Number of predicates: {document.nb_predicates}")
print(f"Number of objects: {document.nb_objects}")
print(f"Number of shared subject-object: {document.nb_shared}")

# Fetch all triples that match { ?s foaf:name ?o }
# Use None to indicate variables
triples, cardinality = document.search((None, FOAF.name, None))
print(f"Cardinality of (?s foaf:name ?o): {cardinality}")
for s, p, o in triples:
    print(s, p, o)

# The search also supports limit and offset
triples, cardinality = document.search((None, FOAF.name, None), limit=10, offset=100)
# etc ...
```

An HDT document also supports evaluating joins over a set of triple patterns.

```python
from rdflib import Variable
from rdflib.namespace import FOAF
from rdflib_hdt import HDTDocument

document = HDTDocument("test.hdt")

# Find the names of two entities that know each other
tp_a = (Variable("a"), FOAF.knows, Variable("b"))
tp_b = (Variable("a"), FOAF.name, Variable("name"))
tp_c = (Variable("b"), FOAF.name, Variable("friend"))
query = {tp_a, tp_b, tp_c}

iterator = document.search_join(query)
print(f"Estimated join cardinality: {len(iterator)}")

# Join results are produced as ResultRow, like in the RDFlib SPARQL API
for row in iterator:
    print(f"{row.name} knows {row.friend}")
```

# Handling non UTF-8 strings in Python

If the HDT document was encoded with a non UTF-8 encoding, the previous code will not work correctly and will raise a `UnicodeDecodeError`.
For more details on how C++ strings are converted to Python `str`, see the [pybind11 documentation](https://pybind11.readthedocs.io/en/stable/advanced/cast/strings.html).

To handle this case, the HDT document API is duplicated with variants that return raw bytes:

- `search_triples_bytes(...)` returns an iterator of triples as `(py::bytes, py::bytes, py::bytes)`

- `search_join_bytes(...)` returns an iterator of sets of solution mappings as `py::set(py::bytes, py::bytes)`

- `convert_tripleid_bytes(...)` returns a triple as `(py::bytes, py::bytes, py::bytes)`

- `convert_id_bytes(...)` returns a `py::bytes`

**Parameters and documentation are the same as for the standard versions.**

```python
from rdflib_hdt import HDTDocument

document = HDTDocument("test.hdt")

it = document.search_triples_bytes("", "", "")

for s, p, o in it:
    print(s, p, o)  # prints b'...', b'...', b'...'
    # now decode it, or handle any error
    try:
        s, p, o = s.decode("UTF-8"), p.decode("UTF-8"), o.decode("UTF-8")
    except UnicodeDecodeError:
        # try another codec, ignore the error, etc.
        pass
```
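
The "try another codec" step can be factored into a small helper. The sketch below is one possible approach using only built-in `bytes.decode`; the function name `decode_term` and the default codec order are assumptions for illustration, not part of rdflib-hdt. Note that `latin-1` accepts any byte sequence, so placing it last makes it a catch-all.

```python
def decode_term(raw, encodings=("utf-8", "latin-1")):
    """Decode a bytes term by trying each codec in order.

    Returns (text, encoding_used). If every codec fails, falls back to
    UTF-8 with replacement characters and reports None as the encoding.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace"), None


if __name__ == "__main__":
    for raw in (b"caf\xc3\xa9", b"caf\xe9"):
        text, used = decode_term(raw)
        print(f"{raw!r} decoded with {used}: {text}")
```

Inside the loop above, each of `s`, `p` and `o` could then be passed through `decode_term` instead of calling `.decode("UTF-8")` directly.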
