
https://github.com/mneedham/neo4j-spark-chicago



= Importing Chicago Crime Dataset into Neo4j

The following are instructions for importing the open Chicago crime data set into Neo4j.

* Download the link:https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2[Chicago crime data set]
* Set an environment variable containing the path to the data set, e.g.:

```
export CSV_FILE=Crimes_-_2001_to_present.csv
```

* Download a link:https://spark.apache.org/downloads.html[pre-built Spark distribution] into this folder.
The `create_files.sh` script assumes the link:http://apache.mirrors.spacedump.net/spark/spark-1.3.0/spark-1.3.0-bin-hadoop1.tgz[1.3.0 release pre-built for Hadoop 1.X].
* Unpack Spark

```
tar -xvf spark-1.3.0-bin-hadoop1.tgz
```

* Generate the CSV files that Neo4j import will use:

```
./create_files.sh
```

You should see the following at the beginning of the output:

```
$ ./create_files.sh
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using /Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv
...
```

The following files will be generated:

```
$ ls -alh /tmp/*.csv
-rwxrwxrwx 1 markneedham wheel 3.0K 14 Apr 06:48 /tmp/beats.csv
-rwxrwxrwx 1 markneedham wheel 217M 14 Apr 06:48 /tmp/crimes.csv
-rwxrwxrwx 1 markneedham wheel 84M 14 Apr 06:49 /tmp/crimesBeats.csv
-rwxrwxrwx 1 markneedham wheel 120M 14 Apr 06:49 /tmp/crimesPrimaryTypes.csv
-rwxrwxrwx 1 markneedham wheel 912B 14 Apr 06:48 /tmp/primaryTypes.csv
```
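
Before moving on, it can help to confirm that all five files were actually written. The following is a minimal sketch and not part of this repository: the function name `check_generated_files` is hypothetical, and the file names are taken from the listing above.

```shell
# Hypothetical helper: verify that the five generated CSV files exist
# and are non-empty in the given directory (defaults to /tmp).
check_generated_files() {
  dir="${1:-/tmp}"
  missing=0
  for name in beats crimes crimesBeats crimesPrimaryTypes primaryTypes; do
    if [ ! -s "$dir/$name.csv" ]; then
      echo "Missing or empty: $dir/$name.csv" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_generated_files /tmp && echo "All CSV files present"
```

The check uses `-s` rather than `-e` so that a file truncated to zero bytes is also reported.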

Now let's clean up the CSV files to get rid of empty columns:

```
./clean.sh
```

* link:http://neo4j.com/download/[Download Neo4j]

* Unpack Neo4j

```
tar -xvf neo4j-enterprise-2.2.3-unix.tar.gz
```

* Create a Neo4j graph from the CSV files:

```
./import.sh [/path/to/csv/files] [/path/to/neo4j]
```
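
Since `import.sh` takes two paths, a quick check of both before running it may save a failed import. This is only a sketch: `preflight_import` is a hypothetical name, and it assumes that an unpacked Neo4j distribution contains an executable `bin/neo4j`.

```shell
# Hypothetical pre-flight check for the two import.sh arguments.
# Assumption: an unpacked Neo4j tarball contains an executable bin/neo4j.
preflight_import() {
  csv_dir="$1"
  neo4j_dir="$2"
  if [ ! -d "$csv_dir" ]; then
    echo "CSV directory not found: $csv_dir" >&2
    return 1
  fi
  if [ ! -x "$neo4j_dir/bin/neo4j" ]; then
    echo "No Neo4j installation found at: $neo4j_dir" >&2
    return 1
  fi
  return 0
}

# Usage (paths are examples only):
#   preflight_import /tmp ./neo4j-enterprise-2.2.3 && ./import.sh /tmp ./neo4j-enterprise-2.2.3
```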

== FAQs

=== Using a different version of Spark

If you want to use a different version of Spark, you'll need to update the appropriate references in `create_files.sh` and `build.sbt`.
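
One way to locate those references is to search both files for the current version string. The helper below is a sketch; `find_spark_refs` is a hypothetical name, and it assumes the version appears literally (for example as `1.3.0`) in the files.

```shell
# Hypothetical helper: list every line mentioning a given Spark version
# in the files passed as the remaining arguments.
find_spark_refs() {
  version="$1"
  shift
  # -n prints line numbers, -F treats the version as a literal string
  grep -n -F -- "$version" "$@"
}

# Usage: find_spark_refs 1.3.0 create_files.sh build.sbt
```

Using `-F` keeps the dots in the version number from being interpreted as regular expression wildcards.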

=== I can't generate the CSV files

If you forget to set the `CSV_FILE` environment variable, or set it to a non-existent file, you'll see the following error message:

```
$ ./create_files.sh
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.lang.RuntimeException: Cannot find CSV file [null]
...
```

Set the `CSV_FILE` environment variable and you're good to go:

```
export CSV_FILE=Crimes_-_2001_to_present.csv
```
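
To catch this before Spark starts, you could test the variable first. This is a minimal sketch under the assumption that `CSV_FILE` should name a regular, readable file; the function name `check_csv_file` is hypothetical and not part of this repository.

```shell
# Hypothetical guard: succeed only if CSV_FILE is set and names a readable file.
check_csv_file() {
  if [ -z "${CSV_FILE:-}" ]; then
    echo "CSV_FILE is not set" >&2
    return 1
  fi
  if [ ! -f "$CSV_FILE" ] || [ ! -r "$CSV_FILE" ]; then
    echo "CSV_FILE points to a missing or unreadable file: $CSV_FILE" >&2
    return 1
  fi
  return 0
}

# Usage: check_csv_file && ./create_files.sh
```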

=== I made some changes to the project and want to build it

If you haven't made any changes to the source code of the project, there's no need to build it: a JAR is checked in and referenced in `create_files.sh`.
You can, however, rebuild the JAR with the following command:

```
sbt clean package
```

=== I can't build this project

sbt doesn't seem to honour `JAVA_HOME`, so if you're seeing the following error:

```
2015-04-14 13:57:50,116 INFO [main] spark.SparkContext (Logging.scala:logInfo(59)) - Job finished: saveAsTextFile at GenerateCSVFiles.scala:51, took 8.292283862 s
Exception in thread "main" java.lang.UnsupportedClassVersionError: MyFileUtil : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
```

You should set the following environment variable:

```
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
```

And then retry:

```
PATH=$JAVA_HOME/bin:$PATH sbt clean package
```
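
The `52.0` in the error message is a class file version. For Java 5 and later, the major version minus 44 gives the Java release, so 52 means the class was compiled for Java 8 and 51 would mean Java 7. The sketch below does that arithmetic; `class_major_to_java` is a hypothetical helper name.

```shell
# Hypothetical helper: translate a class file major version into the Java
# release it targets (valid for Java 5 and later, i.e. major >= 49).
class_major_to_java() {
  major="${1%%.*}"   # drop the minor part, e.g. 52.0 -> 52
  case "$major" in
    ''|*[!0-9]*)
      echo "not a version number: $1" >&2
      return 1
      ;;
  esac
  if [ "$major" -lt 49 ]; then
    echo "unsupported major version: $major" >&2
    return 1
  fi
  echo "Java $((major - 44))"
}

# Usage: class_major_to_java 52.0
```

Comparing the result with the output of `java -version` shows whether the JVM running the code is older than the one that compiled it.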