https://github.com/graphaware/neo4j-importer

Java importer skeleton for complicated, business-logic-heavy high-performance Neo4j imports directly from SQL databases, CSV files, etc.
Host: GitHub
URL: https://github.com/graphaware/neo4j-importer
Owner: graphaware
Created: 2015-05-28T22:17:33.000Z (about 10 years ago)
Default Branch: master
Last Pushed: 2017-04-30T15:31:04.000Z (about 8 years ago)
Last Synced: 2024-11-02T16:35:03.614Z (8 months ago)
Topics: java, neo4j, neo4j-graphaware-framework
Language: Java
Size: 200 KB
Stars: 26
Watchers: 33
Forks: 8
Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

awesome-neo4j - GraphAware Neo4j Importer - Java importer skeleton for complicated, business-logic-heavy high-performance Neo4j imports directly from SQL databases, CSV files, etc. (REST API / Other)
README

GraphAware Neo4j Importer - RETIRED
===================================

Importer Has Been Retired
-------------------------

As of April 1st 2017, this module is retiring in favour of [GraphAware Databridge](https://neo4j.com/blog/graphaware-databridge-graph-data-import/). This means it will no longer be maintained and released together with new versions of the GraphAware Framework and Neo4j. The last compatible Neo4j version is 3.1.0.

This repository will remain public.

Introduction
============

[![Build Status](https://travis-ci.org/graphaware/neo4j-importer.png)](https://travis-ci.org/graphaware/neo4j-importer) | Products | Latest Release: 3.1.0.44.3

GraphAware Importer is a high-performance importer for importing data from any data source to Neo4j. It is intended
for initial one-off imports of large amounts of data (millions to billions of nodes and relationships) that need
to be cleansed, normalised, or transformed during the import. Depending on many factors (connection speed, database speed,
query complexity, data quality, ...), you'll be able to import millions of nodes and relationships in minutes.

## Another Importer?

There are a number of ways of getting data into Neo4j.

* If you have small amounts of CSV data, use [Neo4j's LOAD CSV](http://neo4j.com/docs/stable/query-load-csv.html)

* If you have large amounts of clean CSV data where you can separate nodes and relationships into different files, use [Neo4j's Import Tool](http://neo4j.com/docs/stable/import-tool.html)

* If you have large amounts of ready-to-be-imported (i.e. not too dirty) data in any tabular form and don't want to code, use GraphAware's Neo4j DataBridge (coming soon)

* For all other scenarios, especially if you have large volumes of data from any source (CSV, MySQL, Oracle, HBase, you name it!) that need to be cleansed, normalised or transformed in some way, use this importer. **You will need to code** in Java.

## Tutorial

This tutorial will guide you through writing an efficient one-off importer of data into Neo4j in a short amount of time.

**You need to be able to write some Java.** What you will get at the end of the process is a standalone Java application
that you can invoke from the command line. It will import data from a data source of your choice and create a brand new,
fully usable Neo4j database on disk. It uses [Neo4j's BatchInserter](http://neo4j.com/docs/stable/batchinsert.html)
under the hood.

This tool **will not** be able to import into an existing database, or a running Neo4j instance (yet).

### Step 0: Get Data

You need some data of course. For this tutorial, we're going to be importing from 2 CSV files:

people-file.csv:

```
"id","name","location","age"
"1","Michal Bachman","1",30
"2","Adam George","2",29
"","PersonWithNoId","2",99
"4","  ","2",100
```

locations-file.csv:

```
"id","name"
"1","London"
"2","Watnall"
"3","Prague"
```

In practice, these could be tables from (or queries against) a relational database, column families from Cassandra, you name it.

The graph we're looking to get by importing the files above is:

```
CREATE
(m:Person {id:1, name:'Michal Bachman', age:30}),
(a:Person {id:2, name:'Adam George', age:29}),
(l:Location {id:1, name:'London'}),
(w:Location {id:2, name:'Watnall'}),
(p:Location {id:3, name:'Prague'}),
(m)-[:LIVES_IN]->(l),
(a)-[:LIVES_IN]->(w)
```

Note that the last two lines in people-file.csv are bad data (one has no ID, the other has a blank name), so we don't want to import them.
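The filtering rule implied by the sample data can be illustrated on its own, before any importer code is written. The following is a minimal sketch in plain Java, assuming a row is rejected when its ID or name is blank after trimming. The class `PeopleRowFilter` and its methods are hypothetical and are not part of GraphAware Importer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of GraphAware Importer. */
final class PeopleRowFilter {

    /** Assumption: a row is usable only if both id (column 0) and name (column 1) are non-blank. */
    static boolean isValid(String[] row) {
        return row.length >= 2
                && !row[0].trim().isEmpty()
                && !row[1].trim().isEmpty();
    }

    /** Returns the rows that pass {@link #isValid}, in their original order. */
    static List<String[]> keepValid(List<String[]> rows) {
        List<String[]> valid = new ArrayList<>();
        for (String[] row : rows) {
            if (isValid(row)) {
                valid.add(row);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        // The four data rows of people-file.csv, already split into columns.
        List<String[]> rows = Arrays.asList(
                new String[]{"1", "Michal Bachman", "1", "30"},
                new String[]{"2", "Adam George", "2", "29"},
                new String[]{"", "PersonWithNoId", "2", "99"},
                new String[]{"4", "  ", "2", "100"});

        for (String[] row : keepValid(rows)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Only the first two rows survive the filter, which matches the two `Person` nodes in the target graph.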

### Step 1: New Project

Create a brand new Java project and add this project as a dependency. Assuming you're using Maven, declare the following
dependency in your pom.xml:

```xml
<dependency>
    <groupId>com.graphaware.neo4j</groupId>
    <artifactId>programmatic-importer</artifactId>
    <version>3.1.0.44.2</version>
</dependency>
```

You will also need to make sure that the .jar file produced at the end of the process is a "fat jar", i.e. that it contains

all the needed dependencies. For this to happen, you need something like this in your pom.xml:

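```xml
    <plugins>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.4.1</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>attached</goal>
                    </goals>
                </execution>
            </executions>
            <configuration>
                <finalName>my-importer</finalName>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <appendAssemblyId>false</appendAssemblyId>
            </configuration>
        </plugin>
    </plugins>
```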

### Step 2: Data Reader

Implement a `DataReader` that is able to read from your data source. Most readers will be `TabularDataReader`s. If you're

importing from a CSV file, you can skip this step and use the provided `CsvDataReader`. If you're importing from a relational database,

you can save some time by extending `DbDataReader` or `QueueDbDataReader` (recommended).

For example, if you're reading from Oracle, your data reader will look something like this:

```java

/**

 * {@link com.graphaware.importer.data.access.DbDataReader} for Oracle.

 */

public class OracleDataReader extends QueueDbDataReader {

    private final String db;

    private final int prefetchSize;

    private final int fetchSize;

    public OracleDataReader(String dbHost, String dbPort, String user, String password, String db, int prefetchSize, int fetchSize) {

        super(dbHost, dbPort, user, password);

        this.db = db;

        this.prefetchSize = prefetchSize;

        this.fetchSize = fetchSize;

    }

    @Override

    protected String getDriverClassName() {

        return "oracle.jdbc.OracleDriver";

    }

    @Override

    protected String getUrl(String host, String port) {

        return "jdbc:oracle:thin:@//" + host + ":" + port + "/" + db;

    }

    @Override

    protected void additionalConfig(JdbcTemplate template) {

        template.setFetchSize(fetchSize);

    }

    @Override

    protected void additionalConfig(DataSource dataSource) {

        ((BasicDataSource) dataSource).addConnectionProperty("defaultRowPrefetch", String.valueOf(prefetchSize));

        ((BasicDataSource) dataSource).setInitialSize(10);

    }

}

```

Note that you will have to add the driver (Oracle JDBC driver in this case) into your Maven dependencies.

If you're writing an importer for a non-relational database, for example HBase, you will need to do a bit more work. An

example HBase data reader would look like this:

```java
import com.graphaware.importer.data.access.DataReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.*;

public class HbaseDataReader implements DataReader<Map<String, Set<String>>> {

    private final Configuration configuration;
    private final String columnFamily;
    private Connection connection;
    private ResultScanner scanner;
    private Iterator<Result> results = null;
    private Result result = null;
    private int row = 0;

    public HbaseDataReader(Configuration configuration, String columnFamily) {
        this.configuration = configuration;
        this.columnFamily = columnFamily;
    }

    @Override
    public void initialize() {
    }

    @Override
    public Map<String, Set<String>> readObject(String columnFamily) {
        //collect the column qualifiers of the current row for the given column family
        Set<String> cells = new HashSet<>();
        for (byte[] cell : result.getFamilyMap(Bytes.toBytes(columnFamily)).keySet()) {
            cells.add(Bytes.toString(cell));
        }

        String key = Bytes.toString(result.getRow());
        return Collections.<String, Set<String>>singletonMap(key, cells);
    }

    @Override
    public void read(String connectionString, String hint) {
        if (results != null) {
            throw new IllegalStateException("Previous reader hasn't been closed");
        }

        try {
            //the connection string is interpreted as the HBase table name
            connection = ConnectionFactory.createConnection(configuration);
            Table table = connection.getTable(TableName.valueOf(connectionString));
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes(columnFamily));
            scanner = table.getScanner(scan);
            results = scanner.iterator();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close() {
        scanner.close();
        try {
            connection.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        results = null;
        result = null;
    }

    @Override
    public int getRow() {
        return row;
    }

    @Override
    public boolean readRecord() {
        if (!results.hasNext()) {
            return false;
        }
        result = results.next();
        row++;
        return true;
    }

    @Override
    public String getRawRecord() {
        return result.toString();
    }
}
```

### Step 3: Domain

You now need to define some Java classes that represent the things you are going to be importing. The data from the reader

will be translated to these classes. Validation, normalization, and transformation can be applied to these classes, before

they are translated into Neo4j nodes and relationships.

If you don't need to apply much logic to the data, you can choose to go with `Map<String, Object>` instead of concrete objects.
The `String` in the map is some property key and the `Object` is that property's value.

Let's assume location data is clean, so we'll go with the `Map<String, Object>` approach. For importing people, we choose to create a class like this:

```java

public class Person extends Neo4jPropertyContainer {

    @Neo4jProperty

    private final Long id;

    @Neo4jProperty

    private final String name;

    @Neo4jProperty

    private final Integer age;

    private final Long location;

    public Person(Long id, String name, Integer age, Long location) {

        this.id = id;

        this.name = name;

        this.age = age;

        this.location = location;

    }

    public Long getId() {

        return id;

    }

    public String getName() {

        return name;

    }

    public Integer getAge() {

        return age;

    }

    public Long getLocation() {

        return location;

    }

}

```

In this case, we're expecting each row from the data source to contain four pieces of information (id, name, age, location).

The ones that we want to become a node's properties in Neo4j, we annotate with `@Neo4jProperty`. The `location` property will not

be stored in Neo4j, it will be used to link the person to a location, so it is not annotated. Choose the names of the properties

according to how they will be called in Neo4j - it doesn't matter at this point what they are called in your source database.
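To make the role of the annotation more concrete, here is a minimal sketch of how annotated fields could be collected into a property map using reflection. It is not the library's implementation: the annotation `@GraphProperty` and the classes `PropertyExtractor` and `SamplePerson` are hypothetical stand-ins for `@Neo4jProperty`, `Neo4jPropertyContainer`, and `Person`, and skipping `null` values is an assumption made for this sketch.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; the real library may work differently. */
final class PropertyExtractor {

    /** Collects the non-null values of all fields annotated with @GraphProperty. */
    static Map<String, Object> properties(Object bean) {
        Map<String, Object> result = new TreeMap<>(); // sorted keys, for stable output
        for (Field field : bean.getClass().getDeclaredFields()) {
            if (!field.isAnnotationPresent(GraphProperty.class)) {
                continue; // e.g. "location", which is only used for linking
            }
            field.setAccessible(true);
            try {
                Object value = field.get(bean);
                if (value != null) {
                    result.put(field.getName(), value);
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(properties(new SamplePerson(1L, "Michal Bachman", 30, 1L)));
    }
}

/** Hypothetical stand-in for @Neo4jProperty. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface GraphProperty {
}

/** Hypothetical stand-in for the Person class above. */
final class SamplePerson {

    @GraphProperty
    private final Long id;
    @GraphProperty
    private final String name;
    @GraphProperty
    private final Integer age;
    private final Long location; // not annotated, so not a node property

    SamplePerson(Long id, String name, Integer age, Long location) {
        this.id = id;
        this.name = name;
        this.age = age;
        this.location = location;
    }
}
```

The resulting map contains `id`, `name`, and `age` but not `location`, which mirrors the distinction between stored properties and linking information described above.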

### Step 4: Importers

Now you define the actual import logic. For each domain class from the previous step, there should be one `Importer`.

Importers should extend `BaseImporter`. If using `TabularDataReader`, you can extend `TabularImporter` instead.

For locations and people, we will write the two importers. Don't get scared, we will explain all aspects of

writing such importers step-by-step.

```java
public class LocationImporter extends TabularImporter<Map<String, Object>> {

    @InjectCache(name = "locations", creator = true)
    private Cache<Long, Long> locationCache;

    @Override
    public Data inputData() {
        return DynamicData.withName("locations");
    }

    @Override
    public Map<String, Object> produceObject(TabularDataReader record) {
        Map<String, Object> result = new HashMap<>();
        result.put("id", record.readLong("id"));
        result.put("name", record.readObject("name"));
        return result;
    }

    @Override
    public void processObject(Map<String, Object> object) {
        locationCache.put((Long) object.get("id"), context.inserter().createNode(object, Label.label("Location")));
    }

    @Override
    protected void createCache(Caches caches, String name) {
        if ("locations".equals(name)) {
            caches.createCache(name, Long.class, Long.class);
        } else {
            super.createCache(caches, name);
        }
    }
}
```

Let's start with the `LocationImporter` above. We've decided earlier not to create a dedicated "domain" object for locations.

We're importing from tabular data (CSV), therefore we will extend `TabularImporter<Map<String, Object>>`.

There are two important methods that need to be implemented first. `produceObject(..)` will produce a "domain" object from

a tabular record. `processObject(..)` should validate and normalize the object and insert it into Neo4j.

Producing the object should be a trivial mapping exercise, reading values from the (database/csv) record and populating

our object with it. Populating it with dirty data is fine at this point, but `null` can be returned if we don't really

want to produce an object from the record, because it is apparently wrong.

Processing the object means a couple of things. The minimum we should do is create a Location node from the object
by writing: `context.inserter().createNode(object, Label.label("Location"))`. This will create a new node with label "Location"
and the properties in the `Map` ("id" and "name" in this case). This method call returns the Neo4j node ID of the newly created
node.

Since we will need to link people to locations later on, we should remember which Neo4j node ID was assigned to each
location. Remember that the "id" property of the location comes from our relational data. For this reason, we need to have
an (off-heap) `Cache` in place:

```java
@InjectCache(name = "locations", creator = true)
private Cache<Long, Long> locationCache;
```

This tells the importer infrastructure that a cache called "locations" is used by this importer and that the key (own ID)

is a `Long`. The value is usually a `Long`, because it is the Neo4j node ID. Moreover, `creator=true` tells the infrastructure

that this importer creates this cache. That means other importers that need this cache will need to run after this one

has finished. For each cache, there can only ever be a single creator.
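The ordering rule described above (a cache has exactly one creator, and consumers run after it) can be sketched as a simple dependency resolution. This is an illustration in plain Java under stated assumptions, not the importer's actual scheduling code: the class `ImporterOrdering` and its `Spec` type are hypothetical, and the sketch orders importers sequentially, whereas the real infrastructure may schedule them differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of ordering importers by the caches they create and need. */
final class ImporterOrdering {

    /** Minimal description of an importer: which caches it creates and which it needs. */
    static final class Spec {
        final String name;
        final Set<String> creates;
        final Set<String> needs;

        Spec(String name, Set<String> creates, Set<String> needs) {
            this.name = name;
            this.creates = creates;
            this.needs = needs;
        }
    }

    /** Returns importer names in an order where every needed cache already has a finished creator. */
    static List<String> order(List<Spec> specs) {
        // Enforce the "single creator per cache" rule.
        Map<String, String> creatorOf = new HashMap<>();
        for (Spec spec : specs) {
            for (String cache : spec.creates) {
                if (creatorOf.put(cache, spec.name) != null) {
                    throw new IllegalStateException("Cache " + cache + " has more than one creator");
                }
            }
        }

        List<String> ordered = new ArrayList<>();
        Set<String> available = new HashSet<>();
        List<Spec> pending = new ArrayList<>(specs);
        while (!pending.isEmpty()) {
            boolean progressed = false;
            for (Iterator<Spec> it = pending.iterator(); it.hasNext(); ) {
                Spec spec = it.next();
                Set<String> missing = new HashSet<>(spec.needs);
                missing.removeAll(spec.creates); // an importer's own caches never block it
                missing.removeAll(available);
                if (missing.isEmpty()) {
                    ordered.add(spec.name);
                    available.addAll(spec.creates);
                    it.remove();
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("Some importers need caches that nobody creates");
            }
        }
        return ordered;
    }
}
```

For this tutorial, an importer that creates "people" and needs "locations" would be placed after the importer that creates "locations", regardless of the order in which the two are listed.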

When an importer is a cache creator, it needs to actually create the cache by implementing the `createCache(..)` method.

It should check that it is asked to create the right one. If not, it should delegate to the superclass, e.g.:

```java

@Override

protected void createCache(Caches caches, String name) {

    if ("locations".equals(name)) {

        caches.createCache(name, Long.class, Long.class);

    }

    else {

        super.createCache(caches, name);

    }

}

```

With the caches explained, we will refine our node creating method to populate the cache with each new location:

```java
@Override
public void processObject(Map<String, Object> object) {
    locationCache.put((Long) object.get("id"), context.inserter().createNode(object, Label.label("Location")));
}
```

Finally, each importer needs to implement the `inputData()` method to indicate what sort of input data it works with.
This is later used to actually find the data. So "locations" here could represent a CSV file called "locations-file.csv", or
a SQL query "SELECT * FROM locations", etc.

With this in mind, let's have a look at the slightly more complicated implementation of `PersonImporter`:

```java
public class PersonImporter extends TabularImporter<Person> {

    @InjectCache(name = "people", creator = true)
    private Cache<Long, Long> personCache;

    @InjectCache(name = "locations")
    private Cache<Long, Long> locationCache;

    @Override
    public Data inputData() {
        return DynamicData.withName("people");
    }

    @Override
    public Person produceObject(TabularDataReader record) {
        //for demo purposes, let's say we can't construct a person without ID
        if (record.readLong("id") == null) {
            return null;
        }

        return new Person(record.readLong("id"), record.readObject("name"), record.readInt("age"), record.readLong("location"));
    }

    @Override
    public void processObject(Person person) {
        //for demo purposes, let's say people with empty names are invalid.
        if (StringUtils.isEmpty(person.getName())) {
            throw new RuntimeException("Person has empty name");
        }

        personCache.put(person.getId(), context.inserter().createNode(person.getProperties(), Label.label("Person")));

        context.inserter().createRelationship(personCache.get(person.getId()), locationCache.get(person.getLocation()), withName("LIVES_IN"), Collections.<String, Object>emptyMap());
    }

    @Override
    protected void createCache(Caches caches, String name) {
        if ("people".equals(name)) {
            caches.createCache(name, Long.class, Long.class);
        } else {
            super.createCache(caches, name);
        }
    }

    @Override
    public void createIndices() {
        createIndex(Label.label("Person"), "name");
    }
}
```

This importer is producing a person cache and using a location cache to create relationships between people and locations.

It also overrides the `createIndices()` method to create an index on people's names.
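The interplay between the two caches boils down to translating source IDs into Neo4j node IDs before a relationship is created. The following is a minimal sketch of that idea in plain Java. The class `IdMappingSketch` is hypothetical, a `HashMap` stands in for the off-heap cache, and a counter stands in for the node IDs the batch inserter would return. Unlike the importer above, the sketch also guards against a location ID that was never imported, which is an addition made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of source-ID to node-ID mapping; not part of GraphAware Importer. */
final class IdMappingSketch {

    private long nextNodeId = 0; // stand-in for IDs handed out by the batch inserter
    private final Map<Long, Long> locationCache = new HashMap<>();
    private final Map<Long, Long> personCache = new HashMap<>();
    private final List<long[]> relationships = new ArrayList<>(); // {personNodeId, locationNodeId}

    private long createNode() {
        return nextNodeId++;
    }

    /** Creates a location node and remembers which node ID belongs to the source ID. */
    void importLocation(long sourceId) {
        locationCache.put(sourceId, createNode());
    }

    /** Creates a person node and links it to its location; returns false if the location is unknown. */
    boolean importPerson(long sourceId, long locationSourceId) {
        long personNode = createNode();
        personCache.put(sourceId, personNode);

        Long locationNode = locationCache.get(locationSourceId);
        if (locationNode == null) {
            return false; // no such location was imported, so no relationship is created
        }
        relationships.add(new long[]{personNode, locationNode});
        return true;
    }

    List<long[]> relationships() {
        return relationships;
    }

    public static void main(String[] args) {
        IdMappingSketch sketch = new IdMappingSketch();
        for (long id = 1; id <= 3; id++) {
            sketch.importLocation(id);
        }
        sketch.importPerson(1, 1);
        sketch.importPerson(2, 2);
        for (long[] rel : sketch.relationships()) {
            System.out.println("(" + rel[0] + ")-[:LIVES_IN]->(" + rel[1] + ")");
        }
    }
}
```

Because locations must be in the cache before people are processed, this sketch also shows why the person importer has to run after the location importer.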

### Step 5: Wiring it all together

Finally, we need to create the actual main importer class that will be called when data is to be imported. In our simple

case, it will look as follows:

```java
public class MyBatchImporter extends FileBatchImporter {

    public static void main(String[] args) {
        new MyBatchImporter().run(args);
    }

    @Override
    protected Set<Importer> createImporters() {
        //list all importers, order does not matter
        return new HashSet<>(Arrays.<Importer>asList(
                new PersonImporter(),
                new LocationImporter()
        ));
    }

    @Override
    protected Map<Data, String> input() {
        //map logical input names to physical ones (file names, queries,...)
        Map<Data, String> map = new HashMap<>();
        map.put(DynamicData.withName("people"), "people-file");
        map.put(DynamicData.withName("locations"), "locations-file");
        return map;
    }
}
```
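The mapping from logical input names to physical ones can be pictured as a simple lookup followed by path resolution. The sketch below is an assumption about how a file-based importer might locate its input, not the library's code: the class `InputResolutionSketch` is hypothetical, and appending `.csv` to the physical name inside the input directory is inferred from the file names used in this tutorial.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of resolving a logical input name to a CSV file. */
final class InputResolutionSketch {

    /**
     * Looks up the physical name for a logical input and resolves it against the input directory.
     * Assumption: the physical name is a file name without the ".csv" extension.
     */
    static Path resolve(String inputDir, Map<String, String> input, String logicalName) {
        String physicalName = input.get(logicalName);
        if (physicalName == null) {
            throw new IllegalArgumentException("Unknown input: " + logicalName);
        }
        return Paths.get(inputDir).resolve(physicalName + ".csv");
    }

    public static void main(String[] args) {
        Map<String, String> input = new HashMap<>();
        input.put("people", "people-file");
        input.put("locations", "locations-file");

        System.out.println(resolve("data", input, "people"));
        System.out.println(resolve("data", input, "locations"));
    }
}
```

Keeping logical names in the importers and physical names in one place means that switching from CSV files to, say, SQL queries would only require changing the mapping.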

### Step 6: Tests

We should now test our importer. This isn't hard. We will be using GraphUnit to do that, so you should have that in your

dependencies:

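```xml
<dependency>
    <groupId>com.graphaware.neo4j</groupId>
    <artifactId>tests</artifactId>
    <version>3.1.0.44</version>
    <scope>test</scope>
</dependency>
```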

The test would run the importer on our CSV data and verify the contents of the produced database:

```java

@Test

public void testImport() throws IOException, InterruptedException {

    TemporaryFolder temporaryFolder = new TemporaryFolder();

    temporaryFolder.create();

    String tmpFolder = temporaryFolder.getRoot().getAbsolutePath();

    String cp = new ClassPathResource("people-file.csv").getFile().getAbsolutePath();

    String path = cp.substring(0, cp.length() - "people-file.csv".length());

    try {

        MyBatchImporter.main(new String[]{"-g", tmpFolder + "/graph.db", "-i", path, "-o", tmpFolder, "-r", "neo4j.properties"});

    } catch (Throwable t) {

        fail();

    }

    GraphDatabaseService database = new GraphDatabaseFactory().newEmbeddedDatabase(tmpFolder + "/graph.db");

    GraphUnit.assertSameGraph(database, "CREATE " +

                    "(p1:Person {id: 1, name: 'Michal Bachman', age:30})," +

                    "(p2:Person {id: 2, name: 'Adam George', age:29})," +

                    "(l1:Location {id: 1, name: 'London'})," +

                    "(l2:Location {id: 2, name: 'Watnall'})," +

                    "(l3:Location {id: 3, name: 'Prague'})," +

                    "(p1)-[:LIVES_IN]->(l1)," +

                    "(p2)-[:LIVES_IN]->(l2)"

    );

    database.shutdown();

    temporaryFolder.delete();

}

```

### Step 7: Use

`java -cp ./path/to/importer/importer.jar com.graphaware.importer.MyBatchImporter`

usage:

```

 -g,--graph         use given directory to output the graph

 -i,--input         use given directory to find input files

 -o,--output        use given directory to output auxiliary files, such as statistics

 -r,--properties    use given file as neo4j properties

 -c,--cachefile     use given file as temporary on-disk cache

```

### Step 8: Further Customization

#### Custom Config

The import process can be further customised. First of all, if additional configuration needs to be passed into the process,

it is possible to implement a custom `CommandLineParser`. Typically, this is needed to somehow customise the data reading

components. Depending on where you're importing from and what configuration you need, you may choose to extend

`BaseCommandLineParser`, `FileCommandLineParser`, or `DbCommandLineParser`.

Closely tied to `CommandLineParser` is the `ImportConfig` that it produces. Again, for custom import configuration, you

can implement `ImportConfig` by extending `BaseImportConfig`, `FileImportConfig`, or `DbImportConfig`.

`ImportConfig` then produces a `DataReader`.

Let's illustrate using an example. If we were importing from Oracle and wanted the user to specify the fetchSize for the

JdbcTemplate and prefetchSize for the Oracle connection, we would need to implement the following classes:

```java

public class OracleCommandLineParser extends DbCommandLineParser {

    @Override

    protected DbImportConfig doProduceConfig(CommandLine line, String graphDir, String outputDir, String props, String host, String port, String user, String password) throws ParseException {

        int prefetchSize = Integer.valueOf(getOptionalValue(line, "pfs", "10000"));

        int fetchSize = Integer.valueOf(getOptionalValue(line, "fs", "10000"));

        return new OracleImportConfig(

                graphDir,

                outputDir,

                props,

                host,

                port,

                user,

                password,

                prefetchSize,

                fetchSize);

    }

    @Override

    protected void addOptions(Options options) {

        super.addOptions(options);

        options.addOption(new Option("pfs", "prefetchSize", true, "Oracle row prefetch size (default 10000)"));

        options.addOption(new Option("fs", "fetchSize", true, "JDBC driver row fetch size (default 10000)"));

    }

}

```

```java

public class OracleImportConfig extends DbImportConfig {

    private final int prefetchSize;

    private final int fetchSize;

    public OracleImportConfig(String graphDir, String outputDir, String props, String dbHost, String dbPort, String user, String password, int prefetchSize, int fetchSize) {

        super(graphDir, outputDir, props, dbHost, dbPort, user, password);

        this.prefetchSize = prefetchSize;

        this.fetchSize = fetchSize;

    }

    @Override

    public DataReader createReader() {

        return new OracleDataReader(getDbHost(), getDbPort(), getUser(), getPassword(), prefetchSize, fetchSize);

    }

}

```

Once we have these two classes, we can wire them into the top-level importer by overriding a single method:

```java

@Override

protected CommandLineParser commandLineParser() {

    return new OracleCommandLineParser();

}

```

#### Custom Context

Throughout the import process, an `ImportContext` is available to the `Importer`s by accessing the protected `context`

field. This context provides access to the actual `BatchInserter` used for creating nodes and relationships, to `caches()`,

etc. In case more context is needed, for example an external validator (e.g. some JSR-303 validator implementation), you

can implement a custom `ImportContext` by extending the default `SimpleImportContext`.

```java

public class MyImportContext extends SimpleImportContext {

    private ObjectNormalizer normalizer;

    private ObjectValidator validator;

    public MyImportContext(ImportConfig config, Caches caches, DataLocator inputLocator, DataLocator outputLocator) {

        super(config, caches, inputLocator, outputLocator);

    }

    public ObjectNormalizer normalizer() {

        return normalizer;

    }

    public ObjectValidator validator() {

        return validator;

    }

    @Override

    protected void postBootstrap() {

        super.postBootstrap();

        normalizer = createNormalizer();

        validator = createValidator();

    }

    protected ObjectNormalizer createNormalizer() {

        return new AnnotationObjectNormalizer();

    }

    protected ObjectValidator createValidator() {

        return new StandardObjectValidator();

    }

}

```

Again, this custom context is wired into the import process in the top-level `BatchImporter`:

```java

@Override

protected ImportContext createContext(T config) {

    return new MyImportContext(config, createCaches(), createInputDataLocator(config), createOutputDataLocator(config));

}

```

For further customisations, please have a look at the [Javadoc](http://graphaware.com/site/importer/latest/apidocs) or the code in this repo.