https://github.com/occrp-attic/loom
Weaving SQL databases into graph data.
- Host: GitHub
- URL: https://github.com/occrp-attic/loom
- Owner: occrp-attic
- License: agpl-3.0
- Created: 2015-09-11T09:22:55.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2015-12-09T15:25:12.000Z (almost 9 years ago)
- Last Synced: 2024-08-02T14:12:17.989Z (4 months ago)
- Language: Python
- Homepage:
- Size: 173 KB
- Stars: 9
- Watchers: 16
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Loom
``loom`` is a command-line tool that maps data from its source table structure
to a common data model, defined through [JSON Schema](http://json-schema.org/).
Once data has been modeled into such objects, it is stored as a set of
statements in a SQL database and eventually indexed in ElasticSearch.

``loom`` is the backend component for [spindle](https://github.com/occrp/spindle),
a web-based graph data browser/editor tool that uses the same data model.

## Design
The design goal of ``loom`` is to accept data from many different sources and
in many different formats, and to integrate it into a single, coherent data
model that can be used for analysis.

### Data integration
Imagine, for example, trying to investigate a senior politician. You might find
data about that person on WikiData, by scraping a parliament's web site,
extracting data from expense reports and by looking up company ownership
records in a corporate registry.

The purpose of ``loom`` is to read data from all of these sources, and to
translate each source record into a (partially-filled) JSON object representing
a person. This object may have nested items, such as information about party
posts, company directorships etc.

``loom`` will also provide means for de-duplicating entities, so that all the
different records about the politician coming from various sources will be
merged into a single, coherent and complete profile. This is made possible by
splitting the information into atomic statements (often called triples or quads,
[learn more](http://www.w3.org/TR/rdf11-concepts/#section-triples)).

### Indexing
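The statement model described above can be illustrated with a small Python sketch. It is not ``loom``'s actual implementation: the field names (``id``, ``controls``), the four-part statement layout (subject, predicate, object, source) and the helper names ``to_statements`` and ``invert`` are all assumptions made for this example.

```python
from collections import defaultdict


def to_statements(obj, source):
    """Flatten a nested dict into (subject, predicate, object, source) tuples.

    Assumes every object carries a hypothetical "id" key naming the entity.
    """
    subject = obj["id"]
    statements = []
    for key, value in obj.items():
        if key == "id":
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Nested entity: link to it, then emit its own statements.
                statements.append((subject, key, item["id"], source))
                statements.extend(to_statements(item, source))
            else:
                statements.append((subject, key, item, source))
    return statements


def invert(statements, predicate):
    """Group subjects by the object they point to via `predicate`."""
    inverted = defaultdict(list)
    for subj, pred, obj, _source in statements:
        if pred == predicate:
            inverted[obj].append(subj)
    return dict(inverted)


# A person record with one nested company (illustrative data only).
person = {
    "id": "urn:person:1",
    "name": "Jane Doe",
    "controls": [{"id": "urn:company:7", "name": "Acme Ltd"}],
}

statements = to_statements(person, "foo_companies")
# Three statements: the person's name, the "controls" link,
# and the company's name.

controlled_by = invert(statements, "controls")
# Maps each company id to the ids of the people who control it.
```

Because each fact is stored on its own, records about the same entity from different sources can be combined by collecting their statements, and the same statements can be regrouped around either end of a relationship.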
Using statements, the information can also be re-constructed in different ways
than it was originally submitted. For example, after importing a list of people
with nested information about what companies they control, you could very
easily invert that into a list of companies with nested information about the
people who control them.

This aspect is used when indexing the data: ``loom`` will go through all entity
types and try to generate a nested representation of each entity it finds.
This means that if you import a record about a person that controls a company,
you will end up with two indexed objects: the person (with a reference to the
company) and the company (with a reference to the person).### Design constraints
The statement tables generated by ``loom`` are very similar to an RDF triple
store. It may be tempting to think about using them to perform recursive graph
queries (*Show me all the politicians from Germany that have companies which
have subsidiaries in the Cayman Islands*).

This, however, would quickly lead to an explosion in the join cross-product and
kill any SQL database. It is not a design goal of ``loom`` to support this. We
should, however, provide an export facility to generate RDF data which can be
imported into a triple store and queried there.

## Configuration
All of ``loom`` is controlled via two types of configuration files: a core
configuration (which sets the destination database, available data types, and
the search index), and a set of source specifications (``spec.yaml``, which
define the source database for data mapping, how to query the source data,
source details and, most importantly, a mapping between source data
structure and the JSON schema-defined model).

An example core configuration (passed in via the ``--config`` argument) would
look like this:

```yaml
# A PostgreSQL database should be provided. It can be empty (statement
# tables will be created automatically). Since loom uses maintenance functions
# (COPY) to load data, you may need to provide superuser access to the target
# database.
database: postgresql://localhost/statements

# ElasticSearch indexing destination. The index does not need to exist prior to
# running loom.
elastic_host: localhost:9200
elastic_index: graph

# This is the schema registry, which will be used to determine short-hand
# aliases for specific types. All schemas listed here will be indexed to
# ElasticSearch.
schemas:
  company: https://schema.occrp.org/generic/company.json#
```

After setting these base parameters, a data source can be specified like this:
```yaml
# The ODS database is expected to contain the source tables referenced below
# (fo_companies_director, fo_companies_company).
database: postgresql://localhost/source_database

# Source metadata, limited to three basic fields for the moment.
source:
  slug: foo_companies
  title: "Foo Country Company Registry"
  url: http://registry.gov.foo/companies

# Source data tables, and how to join them. Tables can also be assigned an
# alias to allow for recursive joins (i.e. for parent-child relationships).
tables:
  - fo_companies_company
  - fo_companies_director
joins:
  - fo_companies_company.id: fo_companies_director.company_id

# Outputs define the set of queries to be run, along with a mapping that
# is used to translate the resulting records into JSON schema objects:
outputs:
  # Every output has a name, which is used only for internal purposes:
  demo:
    schema:
      $ref: https://schema.occrp.org/generic/company.json#
    mapping:
      # Mapping for a single field in the destination schema:
      name:
        column: fo_companies_company.name
        # Some transformations can be applied:
        transforms:
          - clean
          - latinize
      company_id:
        columns: fo_companies_company.id
        format: 'urn:foo:%s'
      directors:
        # Nested object, which uses the same mapping method:
        mapping:
          name:
            column: fo_companies_director.name
```

The content of the ``mapping`` section of this specification file is based on
a separate Python library called [jsonmapping](https://github.com/pudo/jsonmapping).
Refer to that package's documentation for information about how to perform more
complex mappings and transformations on the source data.

## Installation
Before installing ``loom``, make sure you have both dependencies - PostgreSQL
and ElasticSearch - installed.

If you wish to use the ``loom`` command line tool, you can install the
application into a Python [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)
like this:

```bash
$ pip install git+https://github.com/occrp/loom
```

To do development on the tool, you should instead check out a local copy:
```bash
$ git clone git@github.com:occrp/loom.git
$ cd loom
$ make install
$ make test
```

Instead of executing the ``Makefile`` commands (which create a virtual
environment and install the necessary dependencies), you can also run these
steps manually.

## Usage
A prerequisite for using ``loom`` is that source data needs to be stored in a
PostgreSQL database. The specific layout of the source data tables is not
important, since the JSON object mapping will be applied.

After installing ``loom``, a command-line tool is available. You can use the
``--help`` argument to learn more about its functioning.

A basic sequence of commands might look like this:
```bash
# translate the source data records into the loom data store as statements:
$ loom -c config.yaml map spec.yaml
# index the statements into ElasticSearch:
$ loom -c config.yaml index --source foo_companies
# delete statements from the data store:
$ loom -c config.yaml flush --source foo_companies
```

## Similar work and references
``loom`` is heavily inspired by Linked Data and RDF. If you're interested in
similar tools in that ecosystem, check out:

* [Grafter](http://grafter.org/)
* [Silk Framework](http://silk-framework.com/)

## License
``loom`` is free software; it is distributed under the terms of the Affero
General Public License, version 3.