- Host: GitHub
- URL: https://github.com/aws/amazon-neptune-csv-to-rdf-converter
- Owner: aws
- License: apache-2.0
- Created: 2019-12-09T21:09:42.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-04-11T18:45:55.000Z (about 2 years ago)
- Last Synced: 2025-01-28T19:48:22.977Z (3 months ago)
- Topics: amazon-neptune, gremlin, property-graph, rdf, sparql
- Language: Java
- Homepage:
- Size: 1.64 MB
- Stars: 32
- Watchers: 24
- Forks: 10
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# Amazon Neptune CSV to RDF Converter
A tool for [Amazon Neptune](https://aws.amazon.com/neptune/) that converts property graphs stored as comma separated values into RDF graphs.
## Usage
Amazon Neptune CSV to RDF Converter is a Java library for converting a property graph stored in
CSV files to RDF. It expects an input directory containing the CSV files, an output directory, and an
optional configuration file. The output directory will be created if it does not exist.
See [Gremlin Load Data Format](https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html)
for the input format and [RDF 1.1 N-Quads](https://www.w3.org/TR/n-quads/) for the output format. The input files need to be UTF-8 encoded; the same encoding is used for the output files.
The library is available as an executable Jar file and can be run from the command line with `java -jar amazon-neptune-csv2rdf.jar -i <input directory> -o <output directory>`. Use `java -jar amazon-neptune-csv2rdf.jar -h` to see all options:

```
Usage: java -jar amazon-neptune-csv2rdf.jar [-hV] [-c=<configuration file>]
       -i=<input directory> -o=<output directory>
  -c, --config=<configuration file>
                  Property file containing the configuration.
  -h, --help      Show this help message and exit.
  -i, --input=<input directory>
                  Directory containing the CSV files (UTF-8 encoded).
  -o, --output=<output directory>
                  Directory for writing the RDF files (UTF-8 encoded); will be
                  created if it does not exist.
  -V, --version   Print version information and exit.
```

The conversion is based on two steps. First, a **general mapping** from property graph vertices and edges to RDF statements is applied to the input files. The optional second step **transforms RDF resource IRIs** according to user-defined rules for replacing artificial ids with more natural ones. However, this transformation needs to load all triples into main memory, so the JVM memory must be set accordingly with `-Xmx`, e.g., `java -Xmx2g`.

Let's start with a small example to see how both steps work.
**General mapping**
Let vertices and edges be given as

```
~id,~label,name,code,country
1,city,Seattle,S,USA
2,city,Vancouver,V,CA
```

and

```
~id,~label,~from,~to,distance,type
a,route,1,2,166,highway
```

Using some simplified namespaces (see Configuration below for the details), the mapping results in:

```
<vertex:1> <rdf:type> <type:City> <dng:/> .
<vertex:1> <vproperty:name> "Seattle" <dng:/> .
<vertex:1> <vproperty:code> "S" <dng:/> .
<vertex:1> <vproperty:country> "USA" <dng:/> .
<vertex:2> <rdf:type> <type:City> <dng:/> .
<vertex:2> <vproperty:name> "Vancouver" <dng:/> .
<vertex:2> <vproperty:code> "V" <dng:/> .
<vertex:2> <vproperty:country> "CA" <dng:/> .
<vertex:1> <edge:route> <vertex:2> <econtext:a> .
<econtext:a> <eproperty:distance> "166" <dng:/> .
<econtext:a> <eproperty:type> "highway" <dng:/> .
```

The result shows that **edge identifiers are stored as context** of the corresponding RDF statement, and the edge properties are statements about that context. The edge identifiers can be queried in SPARQL using the [GRAPH keyword](https://www.w3.org/TR/sparql11-query/#accessByLabel).
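For illustration, here is a minimal sketch of querying such an edge context. It is not part of this project: it assumes [Apache Jena](https://jena.apache.org/) as the RDF library, an output file named `output.nq`, and the simplified namespaces of the example above (a real run would contain the full namespace URIs from the configuration):

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.riot.RDFDataMgr;

public class EdgeContextQuery {

    public static void main(String[] args) {
        // Load the N-Quads written by the converter; "output.nq" is a placeholder name.
        Dataset dataset = RDFDataMgr.loadDataset("output.nq");

        // ?edge binds to the edge identifier, i.e. the context (named graph) of the edge statement;
        // the edge property "distance" is a statement about that context.
        String query = "SELECT ?edge ?from ?to ?distance WHERE { "
                + "GRAPH ?edge { ?from <edge:route> ?to } "
                + "GRAPH ?g { ?edge <eproperty:distance> ?distance } }";

        try (QueryExecution execution = QueryExecutionFactory.create(query, dataset)) {
            ResultSet results = execution.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next());
            }
        }
    }
}
```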
Vertex labels are mapped to **RDF types**. The first letter of the label will be capitalized for this step: the label `city` becomes the RDF type `<type:City>`.

Additionally, the mapping can add **RDFS labels** to the vertices. For example, the configuration

```
mapper.mapping.pgVertexType2PropertyForRdfsLabel.city=name
```

creates two additional RDF statements:

```
<vertex:1> <rdfs:label> "Seattle" <dng:/> .
<vertex:2> <rdfs:label> "Vancouver" <dng:/> .
```

The mapping can also map **property values to resources**. In the example, the value for *country* becomes a URI with

```
mapper.mapping.pgProperty2RdfResourcePattern.country=country:{{VALUE}}
```

and the two statements with the literal values "USA" and "CA" are replaced by:

```
<vertex:1> <vproperty:country> <country:USA> <dng:/> .
<vertex:2> <vproperty:country> <country:CA> <dng:/> .
```

**URI transformations**
A URI transformation rule replaces parts of a resource URI with the value of a property. In the previous
example, the code could be used to create the resource URIs. This can be achieved using:

```
transformer.uriPostTransformations.1.srcPattern=vertex:([0-9]+)
transformer.uriPostTransformations.1.typeUri=type:City
transformer.uriPostTransformations.1.propertyUri=vproperty:code
transformer.uriPostTransformations.1.dstPattern=city:{{VALUE}}
```

The resulting statements are now:

```
<city:S> <rdf:type> <type:City> <dng:/> .
<city:S> <rdfs:label> "Seattle" <dng:/> .
<city:S> <vproperty:name> "Seattle" <dng:/> .
<city:S> <vproperty:code> "S" <dng:/> .
<city:S> <vproperty:country> <country:USA> <dng:/> .
<city:V> <rdf:type> <type:City> <dng:/> .
<city:V> <rdfs:label> "Vancouver" <dng:/> .
<city:V> <vproperty:name> "Vancouver" <dng:/> .
<city:V> <vproperty:code> "V" <dng:/> .
<city:V> <vproperty:country> <country:CA> <dng:/> .
<city:S> <edge:route> <city:V> <econtext:a> .
<econtext:a> <eproperty:distance> "166" <dng:/> .
<econtext:a> <eproperty:type> "highway" <dng:/> .
```

## Configuration
The configuration of the converter is a property file. It contains a default type, a default named graph,
and namespaces for building vertex URIs, edge URIs, type URIs, vertex property URIs, and edge property URIs.
The rules for adding RDFS labels, creating resources from property values, and the URI transformations are
optional. It's also possible to set the file extension of the input files.

If no configuration file is given, the following default values are used:

```
inputFileExtension=csv
mapper.alwaysAddPropertyStatements=true
mapper.mapping.typeNamespace=http://aws.amazon.com/neptune/csv2rdf/class/
mapper.mapping.vertexNamespace=http://aws.amazon.com/neptune/csv2rdf/resource/
mapper.mapping.edgeNamespace=http://aws.amazon.com/neptune/csv2rdf/objectProperty/
mapper.mapping.edgeContextNamespace=http://aws.amazon.com/neptune/csv2rdf/resource/
mapper.mapping.vertexPropertyNamespace=http://aws.amazon.com/neptune/csv2rdf/datatypeProperty/
mapper.mapping.edgePropertyNamespace=http://aws.amazon.com/neptune/csv2rdf/datatypeProperty/
mapper.mapping.defaultNamedGraph=http://aws.amazon.com/neptune/vocab/v01/DefaultNamedGraph
mapper.mapping.defaultType=http://www.w3.org/2002/07/owl#Thing
```

The setting `mapper.alwaysAddPropertyStatements` only has an effect if a rule for adding RDFS labels is used. In that case, it decides whether the property that is used for the RDFS label is additionally added as an RDF literal statement with that property. For the small example above, if the setting was chosen to be `false`, two statements would not be generated:

```
<vertex:1> <vproperty:name> "Seattle" <dng:/> .
<vertex:2> <vproperty:name> "Vancouver" <dng:/> .
```

The setting `mapper.mapping.edgeContextNamespace` takes effect only when explicitly set. Otherwise, it uses the value set for `mapper.mapping.vertexNamespace`.

**Vertex type to RDFS label mapping**
Vertex types are defined by vertex labels. The option `mapper.mapping.pgVertexType2PropertyForRdfsLabel.<vertex type>=<vertex property>` is used to specify a mapping from a vertex type to a vertex property, whose value is then used to create RDFS labels for any vertex belonging to this vertex type. Multiple such mappings are allowed.
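For example, the rule used in the small example above takes the `name` property for the RDFS labels of vertices with the label `city`:

```
mapper.mapping.pgVertexType2PropertyForRdfsLabel.city=name
```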
**Property to RDF resource mapping**

The option `mapper.mapping.pgProperty2RdfResourcePattern.<property>=<namespace>{{VALUE}}` is used to create RDF resources instead of literal values for vertices where the specified property is found. The variable `{{VALUE}}` will be replaced with the value of the property and prefixed with the given namespace. Multiple such mappings are allowed.
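For example, the rule from the small example above turns the literal value `USA` of the *country* property into the resource `<country:USA>`:

```
mapper.mapping.pgProperty2RdfResourcePattern.country=country:{{VALUE}}
```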
**URI Post Transformations**

URI Post Transformations are used to transform RDF resource IRIs into more readable ones.
A URI Post Transformation consists of four elements:

```
uriPostTransformation.<n>.srcPattern=<source URI pattern>
uriPostTransformation.<n>.typeUri=<type URI>
uriPostTransformation.<n>.propertyUri=<property URI>
uriPostTransformation.<n>.dstPattern=<destination URI pattern>
```

A positive integer `<n>` is required to group the elements. The grouping numbers of several transformation configurations do not need to be consecutive. The transformation rules will be executed in ascending order according to the grouping numbers. All four configuration items are required:

* `srcPattern` is a URI with a single regular expression group, e.g.
`http://example.org/resource/([0-9]+)`, defining
the URI patterns of RDF resources to which the post transformation applies.
* `typeUri` filters out all matched source URIs that do not belong to
the specified RDF type.
* `propertyUri` is the RDF predicate pointing to the replacement
value.
* `dstPattern` is the new URI, which must contain a
`{{VALUE}}` substring which is then substituted with the value of
`propertyUri`.

*Example:*

```
uriPostTransformation.1.srcPattern=http://example.org/resource/([0-9]+)
uriPostTransformation.1.typeUri=http://example.org/class/Country
uriPostTransformation.1.propertyUri=http://example.org/datatypeProperty/code
uriPostTransformation.1.dstPattern=http://example.org/resource/{{VALUE}}
```

This configuration transforms the URI `http://example.org/resource/123` into `http://example.org/resource/FR`, given that there are the statements:

```
http://example.org/resource/123 a http://example.org/class/Country .
http://example.org/resource/123 http://example.org/datatypeProperty/code "FR" .
```

Note that we assume the property `propertyUri` is unique for each resource; otherwise a runtime exception will be thrown. Also note that the post transformation is applied using a two-pass algorithm over the generated data, and the translation mapping is kept fully in memory. This means the feature is suitable only in cases where the number of mappings is small or the amount of main memory is large.

**Complete Configuration**
The complete configuration for the small example above is:

```
mapper.alwaysAddPropertyStatements=false
mapper.mapping.typeNamespace=type:
mapper.mapping.vertexNamespace=vertex:
mapper.mapping.edgeNamespace=edge:
mapper.mapping.edgeContextNamespace=econtext:
mapper.mapping.vertexPropertyNamespace=vproperty:
mapper.mapping.edgePropertyNamespace=eproperty:
mapper.mapping.defaultNamedGraph=dng:/
mapper.mapping.defaultType=dt:/
mapper.mapping.defaultPredicate=dp:/

mapper.mapping.pgVertexType2PropertyForRdfsLabel.city=name

mapper.mapping.pgProperty2RdfResourcePattern.country=country:{{VALUE}}

transformer.uriPostTransformations.1.srcPattern=vertex:([0-9]+)
transformer.uriPostTransformations.1.typeUri=type:City
transformer.uriPostTransformations.1.propertyUri=vproperty:code
transformer.uriPostTransformations.1.dstPattern=city:{{VALUE}}
```

## Examples
The small example above is contained in `src/test/example` and can be tested with:

```
java -jar amazon-neptune-csv2rdf.jar -i src/test/example/ -o . -c src/test/example/city.properties
```

Additionally, the directory `src/test/air-routes` contains a Zip archive of the
[Air Routes data set](https://github.com/krlawrence/graph/tree/master/sample-data) and a sample configuration.
After unzipping the archive into `air-routes`, it can be converted with:

```
java -jar amazon-neptune-csv2rdf.jar -i air-routes/ -o . -c src/test/air-routes/air-routes.properties
```
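Because the URI post transformation keeps all triples in main memory, larger data sets such as Air Routes may need a bigger JVM heap. The required size depends on the data; as a starting point:

```
java -Xmx2g -jar amazon-neptune-csv2rdf.jar -i air-routes/ -o . -c src/test/air-routes/air-routes.properties
```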
## Known Limitations
The general mapping from property graph vertices and edges is done individually for each CSV line in order to avoid
loading the whole CSV file into memory. However, that means that properties being defined on different lines are not joined
and cardinality constraints cannot be checked. For example, the RDF mapping (using the simplified namespaces from the small
example above) of the following property graph
* should reject the statement `<econtext:1> <eproperty:since> "tomorrow"` because edge properties have *single*
cardinality,
* should contain only one `<vertex:2> <edge:knows> <vertex:3>` statement (however, RDF joins multiple equal
statements into one), and
* should not generate the statement `<vertex:3> <rdf:type> <dt:/>` because
vertex 3 has a label.

**Property Graph:**

```
~id,~label,name
2,person,Alice
3,person,Bob
3,,Robert
```

```
~id,~label,~from,~to,since,personally
1,knows,2,3,yesterday,
1,knows,2,3,tomorrow,
1,knows,2,3,,true
```

**RDF mapping:**

```
<vertex:2> <rdf:type> <type:Person> <dng:/> .
<vertex:2> <vproperty:name> "Alice" <dng:/> .
<vertex:3> <rdf:type> <type:Person> <dng:/> .
<vertex:3> <vproperty:name> "Bob" <dng:/> .
<vertex:3> <rdf:type> <dt:/> <dng:/> .
<vertex:3> <vproperty:name> "Robert" <dng:/> .
<vertex:2> <edge:knows> <vertex:3> <econtext:1> .
<econtext:1> <eproperty:since> "yesterday" <dng:/> .
<vertex:2> <edge:knows> <vertex:3> <econtext:1> .
<econtext:1> <eproperty:since> "tomorrow" <dng:/> .
<vertex:2> <edge:knows> <vertex:3> <econtext:1> .
<econtext:1> <eproperty:personally> "true" <dng:/> .
```

## Building from source
Amazon Neptune CSV to RDF Converter is a Java Maven project and requires JDK 8 and Maven 3 to build from source. Change
into the source folder containing the file `pom.xml` and run `mvn clean install`. The directory `target/` contains the
executable Jar library `amazon-neptune-csv2rdf.jar` after a successful build. The executable Jar is not attached to the
build artifacts.

Activate the profile *integration* for running the integration tests during the build by using
`mvn -Pintegration clean install`. Integration tests are distinguished from other tests by adding
the annotation `@Tag("IntegrationTest")`.
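A tagged test class might look like the following sketch (the class and method names are made up for illustration; only the `@Tag` annotation reflects the project's convention):

```java
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

// Hypothetical example: tests carrying this tag are treated as integration tests by the build.
@Tag("IntegrationTest")
class ConverterIntegrationTest {

    @Test
    void convertsExampleDataEndToEnd() {
        // ... exercise the converter against the sample CSV files ...
    }
}
```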
## Adding the library to your build

The group id of Amazon Neptune CSV to RDF Converter \[[javadoc](https://www.javadoc.io/doc/software.amazon.neptune/amazon-neptune-csv2rdf)\]
is `software.amazon.neptune`, its artifact id is `amazon-neptune-csv2rdf`. In case you want to use
the library as part of another project, use the following to add a dependency in Maven:

```
<dependency>
    <groupId>software.amazon.neptune</groupId>
    <artifactId>amazon-neptune-csv2rdf</artifactId>
    <version>1.0.0</version>
</dependency>
```
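If you use Gradle instead of Maven, the same coordinates apply, e.g. `implementation 'software.amazon.neptune:amazon-neptune-csv2rdf:1.0.0'`.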
## License
Amazon Neptune CSV to RDF Converter is available under [Apache License, Version 2.0](https://aws.amazon.com/apache2.0).
----
Copyright Amazon.com Inc. or its affiliates. All Rights Reserved.