https://github.com/rdflib/rdflib-hdt
A Store back-end for rdflib to allow for reading and querying HDT documents
- Host: GitHub
- URL: https://github.com/rdflib/rdflib-hdt
- Owner: RDFLib
- License: mit
- Created: 2020-03-18T10:44:01.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2024-11-26T02:46:08.000Z (2 months ago)
- Last Synced: 2025-01-16T21:10:35.709Z (16 days ago)
- Topics: hdt, python, rdf, rdflib, sparql, store
- Language: C++
- Homepage: https://rdflib.dev/rdflib-hdt
- Size: 6.18 MB
- Stars: 26
- Watchers: 10
- Forks: 9
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
![](docs/source/_static/rdflib-hdt-250.png)
# rdflib-hdt
![Python tests](https://github.com/RDFLib/rdflib-hdt/workflows/Python%20tests/badge.svg) [![PyPI version](https://badge.fury.io/py/rdflib-hdt.svg)](https://badge.fury.io/py/rdflib-hdt)
A Store back-end for [rdflib](https://github.com/RDFLib) to allow for reading and querying HDT documents.
[Online Documentation](https://rdflib.dev/rdflib-hdt/)
# Requirements
* Python *version 3.6.4 or higher*
* [pip](https://pip.pypa.io/en/stable/)
* **gcc/clang** with **c++11 support**
* **Python Development headers**
> You should have the `Python.h` header available on your system.
> For example, for Python 3.6, install the `python3.6-dev` package on Debian/Ubuntu systems.

# Installation
Installation using [pipenv](https://github.com/pypa/pipenv) or a [virtualenv](https://virtualenv.pypa.io/en/stable/) is **strongly advised!**
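
Before installing, it can help to confirm that the `Python.h` header listed in the requirements is present for the interpreter you plan to use. The snippet below is a minimal sketch using only the standard library; the helper names `python_header_path` and `has_python_headers` are illustrative and not part of `rdflib-hdt`.

```python
import os
import sysconfig


def python_header_path() -> str:
    """Return the path where this interpreter expects to find Python.h."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def has_python_headers() -> bool:
    """Report whether Python.h exists at the expected location."""
    return os.path.isfile(python_header_path())


if __name__ == "__main__":
    print(f"Expected header location: {python_header_path()}")
    print(f"Python.h found: {has_python_headers()}")
```

If the header is reported as missing, installing the development package for your Python version (as described in the requirements) is the likely fix.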
## PyPI installation (recommended)
```bash
# you can install using pip
pip install rdflib-hdt

# or you can use pipenv
pipenv install rdflib-hdt
```

## Manual installation
**Requirement:** [pipenv](https://github.com/pypa/pipenv)
```
git clone https://github.com/Callidon/pyHDT
cd pyHDT/
./install.sh
```

# Getting started
You can use the `rdflib-hdt` library in two modes: as an rdflib Graph or as a raw HDT document.
## Graph usage (recommended)
```python
from rdflib import Graph
from rdflib_hdt import HDTStore
from rdflib.namespace import FOAF

# Load an HDT file. Missing indexes are generated automatically.
# You can provide the index file by putting it in the same directory as the HDT file.
store = HDTStore("test.hdt")

# Display some metadata about the HDT document itself
print(f"Number of RDF triples: {len(store)}")
print(f"Number of subjects: {store.nb_subjects}")
print(f"Number of predicates: {store.nb_predicates}")
print(f"Number of objects: {store.nb_objects}")
print(f"Number of shared subject-object: {store.nb_shared}")

# Create an RDFlib Graph with the HDT document as a backend
graph = Graph(store=store)

# Fetch all triples that match { ?s foaf:name ?o }
# Use None to indicate variables
for s, p, o in graph.triples((None, FOAF.name, None)):
    print(s, p, o)
```

Using the RDFlib API, you can also [execute SPARQL queries](https://rdflib.readthedocs.io/en/stable/intro_to_sparql.html) over an HDT document.
If you do so, we recommend that you first call the `optimize_sparql` function, which optimizes
the RDFlib SPARQL query engine in the context of HDT documents.

```python
from rdflib import Graph
from rdflib_hdt import HDTStore, optimize_sparql

# Calling this function optimizes the RDFlib SPARQL engine for HDT documents
optimize_sparql()

graph = Graph(store=HDTStore("test.hdt"))

# You can execute SPARQL queries using the regular RDFlib API
qres = graph.query("""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?friend WHERE {
  ?a foaf:knows ?b.
  ?a foaf:name ?name.
  ?b foaf:name ?friend.
}""")

for row in qres:
    print(f"{row.name} knows {row.friend}")
```

## HDT Document usage
```python
from rdflib_hdt import HDTDocument
from rdflib.namespace import FOAF

# Load an HDT file. Missing indexes are generated automatically.
# You can provide the index file by putting it in the same directory as the HDT file.
document = HDTDocument("test.hdt")

# Display some metadata about the HDT document itself
print(f"Number of RDF triples: {document.total_triples}")
print(f"Number of subjects: {document.nb_subjects}")
print(f"Number of predicates: {document.nb_predicates}")
print(f"Number of objects: {document.nb_objects}")
print(f"Number of shared subject-object: {document.nb_shared}")

# Fetch all triples that match { ?s foaf:name ?o }
# Use None to indicate variables
triples, cardinality = document.search((None, FOAF.name, None))

print(f"Cardinality of (?s foaf:name ?o): {cardinality}")
for s, p, o in triples:
    print(s, p, o)

# The search also supports limit and offset
triples, cardinality = document.search((None, FOAF.name, None), limit=10, offset=100)
# etc ...
```

An HDT document also supports evaluating joins over a set of triple patterns.
```python
from rdflib_hdt import HDTDocument
from rdflib import Variable
from rdflib.namespace import FOAF

document = HDTDocument("test.hdt")

# find the names of two entities that know each other
tp_a = (Variable("a"), FOAF.knows, Variable("b"))
tp_b = (Variable("a"), FOAF.name, Variable("name"))
tp_c = (Variable("b"), FOAF.name, Variable("friend"))
query = set([tp_a, tp_b, tp_c])

iterator = document.search_join(query)
print(f"Estimated join cardinality: {len(iterator)}")

# Join results are produced as ResultRow, like in the RDFlib SPARQL API
for row in iterator:
    print(f"{row.name} knows {row.friend}")
```

# Handling non UTF-8 strings in python
If the HDT document has been encoded with a non UTF-8 encoding, the previous code won't work correctly and will raise a `UnicodeDecodeError`.
More details on how strings are converted from C++ to Python are available [here](https://pybind11.readthedocs.io/en/stable/advanced/cast/strings.html).

To handle this, we doubled the API of the HDT document by adding:
- `search_triples_bytes(...)` returns an iterator of triples as `(py::bytes, py::bytes, py::bytes)`
- `search_join_bytes(...)` returns an iterator of sets of solution mappings as `py::set(py::bytes, py::bytes)`
- `convert_tripleid_bytes(...)` returns a triple as `(py::bytes, py::bytes, py::bytes)`
- `convert_id_bytes(...)` returns a `py::bytes`

**Parameters and documentation are the same as the standard version**
```python
from rdflib_hdt import HDTDocument

document = HDTDocument("test.hdt")
it = document.search_triple_bytes("", "", "")

for s, p, o in it:
    print(s, p, o)  # print b'...', b'...', b'...'
    # now decode it, or handle any error
    try:
        s, p, o = s.decode('UTF-8'), p.decode('UTF-8'), o.decode('UTF-8')
    except UnicodeDecodeError as err:
        # try other codecs, ignore the error, etc.
        pass
```
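
The "try other codecs" step can be factored into a small helper. The following is a minimal sketch, not part of the `rdflib-hdt` API: `decode_term` and `decode_triple` are hypothetical names, and the default codec list is an assumption you should adapt to your data. It works on any `bytes` values, such as those produced by the `*_bytes` methods.

```python
from typing import Iterable, Tuple


def decode_term(raw: bytes, encodings: Iterable[str] = ("utf-8", "latin-1")) -> str:
    """Decode raw bytes with the first codec that succeeds.

    Note that latin-1 accepts every byte sequence, so when it is in the
    list it acts as a catch-all and later codecs are never tried.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No codec matched: keep the readable parts and mark invalid bytes
    return raw.decode("utf-8", errors="replace")


def decode_triple(triple: Tuple[bytes, bytes, bytes]) -> Tuple[str, str, str]:
    """Decode the subject, predicate and object of a bytes triple."""
    s, p, o = triple
    return decode_term(s), decode_term(p), decode_term(o)


if __name__ == "__main__":
    # Stand-in bytes for illustration; a real triple would come from the HDT document
    sample = (b"http://example.org/s", b"http://example.org/p", b"caf\xe9")
    print(decode_triple(sample))
```

Trying UTF-8 first keeps well-formed terms unchanged, while the fallback codecs only come into play for terms that would otherwise raise a `UnicodeDecodeError`.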