https://github.com/neo4j-examples/oscon-graph
DIY Neo4j Graph from the OSCON 2014 event data feed
- Host: GitHub
- URL: https://github.com/neo4j-examples/oscon-graph
- Owner: neo4j-examples
- Created: 2014-07-15T21:55:49.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2014-07-16T21:44:24.000Z (almost 11 years ago)
- Last Synced: 2025-05-08T01:43:31.154Z (about 2 months ago)
- Language: Shell
- Size: 387 KB
- Stars: 2
- Watchers: 9
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: readme.adoc
README
= The OSCON Graph-in-a-Gist
At http://neo4j.org[Neo4j], we love going to OSCON:
image::http://cdn.oreillystatic.com/en/assets/1/event/115/oscon2014_logo.png[]
This year, there is a http://www.oscon.com/oscon2014/public/content/schedulefeed[DIY Schedule page] that we are very intrigued by, so let's build a graph of the conference!
== Quick Install
Run `scripts/doit.sh` to:

* download jq for your platform (`scripts/install_jq.sh`)
* load the OSCON JSON feed (`scripts/load_feed.sh`)
* create the three CSV files for venues, speakers and events using jq (`scripts/create_csv.sh`)
* insert the data into Neo4j using your locally running Neo4j instance on +http://localhost:7474+ (`scripts/insert_data.sh`)

The following is also detailed in the http://www.neo4j.org/graphgist?github-neo4j-examples%2Foscon-graph%2F%2Foscon_graphgist.adoc[GraphGist].
== The Model
We follow the structure of the data closely and use just three labels for the nodes: `Venue`, `Speaker` and `Event`.
image::http://yuml.me/diagram/scruffy/class/[Speaker%7C+serial+;+twitter+]-SPEAKS_AT-0..*%3E[Event%7C+serial+],[Event]-AT_VENUE%3E[Venue%7C+serial+].png[]
== The Data
First, we download the excellent http://stedolan.github.io/jq/[jq] utility. It's a single binary; in my case I downloaded it to my current directory as `./jq`. The http://www.oscon.com/oscon2014[OSCON] organizers are awesome enough to publish the full schedule as a JSON feed on the http://www.oscon.com/oscon2014/public/content/schedulefeed[OSCON website] - thank you! We can parse this directly from the feed via `curl` and `jq` and pipe the resulting lines into three different CSV files for Neo4j import.
It turns out that the conference has over 350 speakers, 53 venues and 470+ talks!
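One way to check such counts yourself is to load the feed and count the entries in each top-level list. The following is only a minimal Python sketch run on a tiny inline stand-in for the feed. It assumes that `speakers` and `venues` arrays sit next to `events` under `Schedule`; the excerpt in this README only shows `conferences` and `events`, so the other two keys are assumptions.

```python
import json

# Tiny inline stand-in for the feed; the real one is far larger.
# The "speakers" and "venues" keys are assumed, not confirmed by the excerpt.
SAMPLE_FEED = """
{"Schedule": {
  "conferences": [{"serial": 115}],
  "events": [{"serial": 33451, "venue_serial": 1458, "speakers": [149868]}],
  "speakers": [{"serial": 149868}],
  "venues": [{"serial": 1458}]
}}
"""


def count_entries(feed_text):
    """Return the number of items in each list found under "Schedule"."""
    schedule = json.loads(feed_text)["Schedule"]
    return {
        key: len(value)
        for key, value in schedule.items()
        if isinstance(value, list)
    }


counts = count_entries(SAMPLE_FEED)
print(counts)
```

To run it against the real data, pass the downloaded feed text to `count_entries` instead of `SAMPLE_FEED`.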
=== The JSON feed
The data feed can be downloaded and looks something like this:
[source,json]
----
{ "Schedule": {
"conferences": [{"serial": 115 }],
"events": [
{
"serial": 33451,
"name": "Migrating to the Web Using Dart and Polymer - A Guide for Legacy OOP Developers",
"event_type": "40-minute conference session",
"time_start": "2014-07-23 17:00:00",
"time_stop": "2014-07-23 17:40:00",
"venue_serial": 1458,
"description": "The web development platform is massive. With tons of libraries, frameworks and concepts out there, it might be daunting for the \"legacy\" developer to jump into it.\r\n\r\nIn this presentation we will introduce Google Dart & Polymer. Two hot technologies that work in harmony to create powerful web applications using concepts familiar to OOP developers.",
"website_url": "http://oscon.com/oscon2014/public/schedule/detail/33451",
"speakers": [149868],
"categories": [
"Emerging Languages"
]
},
...
----
=== The model
We are very pragmatic in this little post and define just three node labels, `Venue`, `Speaker` and `Event`, following the OSCON data structure:
image::http://yuml.me/diagram/scruffy/class/[Speaker%7C+serial+;+twitter+]-SPEAKS_AT-0..*%3E[Event%7C+serial+],[Event]-AT_VENUE%3E[Venue%7C+serial+].png[]
=== JSON -> CSV
Now, using the awesome `jq` utility, we can easily filter out the relevant bits for our import, most notably the serial numbers of `event`, `venue` and `speaker` which then are cross-referenced in the various parts. We also add a header line to each CSV file for convenience.
[source,bash]
----
include::data/create_csv.sh[]
----
This results in `speakers.csv`, `venues.csv` and `events.csv` files in your data directory.
As you can see, in the last case we have to handle empty values and empty arrays in `jq`, and serialize the integer array into a parseable string. This results in the gnarly `.speakers | if (. | type) == "null" then "" else (. | tostring | ltrimstr("[") | rtrimstr("]")) end` construct. Still, it's three one-liners and, at least for me, impressively compact and readable, resulting in files like `events.csv`:
[source,csv]
----
"serial","name","time_start","time_end","venue_serial","event_type","categories","speakers"
"33451","Migrating to the Web Using Dart and Polymer - A Guide for Legacy OOP Developers","2014-07-23 17:00:00","2014-07-23 17:40:00","1458","40-minute conference session","Emerging Languages","149868"
"33457","Refactoring 101","2014-07-23 11:30:00","2014-07-23 12:10:00","1458","40-minute conference session","PHP","169862"
"33463","Open Source and Mobile Development: Where Does it go From Here?","2014-07-23 10:40:00","2014-07-23 11:20:00","1449","40-minute conference session","Mobile Platforms","169870,2216,96208,150073"
"33464","Open Source Protocols and Architectures to Fix the Internet of Things","2014-07-23 16:10:00","2014-07-23 16:50:00","1451","40-minute conference session","Open Hardware","2216"
"33476","Scaling PHP in the Real World!","2014-07-23 14:30:00","2014-07-23 15:10:00","1458","40-minute conference session","PHP","54107"
"33481","API Ecosystem with Scala, Scalatra, and Swagger at Netflix","2014-07-23 17:00:00","2014-07-23 17:40:00","1456","40-minute conference session","Emerging Languages","113667"
"33485","XSS and SQL Injections: The Tip of the Web Security Iceberg ","2014-07-23 16:10:00","2014-07-23 16:50:00","1458","40-minute conference session","PHP","169932"
"33503","Scalable Analytics with R, Hadoop and RHadoop","2014-07-23 14:30:00","2014-07-23 15:10:00","1475","40-minute conference session","Databases & Datastores","126882"
"33520","HA 101 with OpenStack","2014-07-24 10:00:00","2014-07-24 10:40:00","1466","40-minute conference session","Cloud","131499"
----
Now we have nicely formatted CSV files with headers that we can import into Neo4j.
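For readers who prefer to see the speaker-list handling spelled out, here is a hedged Python sketch of the same idea. It is not the repository's script, just an illustration. `flatten_speakers`, `event_row` and `events_to_csv` are hypothetical helper names, the second sample event is made up, and joining categories with `|` is an assumption based on the `split(line.categories,"|")` call in the import query.

```python
import csv
import io

HEADER = ["serial", "name", "time_start", "time_end", "venue_serial",
          "event_type", "categories", "speakers"]


def flatten_speakers(speakers):
    """Mirror the jq construct: null -> "", [1, 2] -> "1,2"."""
    if speakers is None:
        return ""
    return ",".join(str(serial) for serial in speakers)


def event_row(event):
    """Build one CSV row from one event object of the feed."""
    return [
        event["serial"],
        event["name"],
        event["time_start"],
        event["time_stop"],  # the feed says time_stop, the CSV header time_end
        event["venue_serial"],
        event["event_type"],
        "|".join(event.get("categories") or []),  # "|" separator is assumed
        flatten_speakers(event.get("speakers")),
    ]


def events_to_csv(events):
    """Serialize events to CSV text with a header line and all fields quoted."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(HEADER)
    for event in events:
        writer.writerow(event_row(event))
    return out.getvalue()


sample_events = [
    {"serial": 33463,
     "name": "Open Source and Mobile Development: Where Does it go From Here?",
     "event_type": "40-minute conference session",
     "time_start": "2014-07-23 10:40:00",
     "time_stop": "2014-07-23 11:20:00",
     "venue_serial": 1449,
     "categories": ["Mobile Platforms"],
     "speakers": [169870, 2216, 96208, 150073]},
    # Made-up event to exercise the "no speakers" branch.
    {"serial": 0,
     "name": "Made-up session without speakers",
     "event_type": "40-minute conference session",
     "time_start": "2014-07-23 10:40:00",
     "time_stop": "2014-07-23 11:20:00",
     "venue_serial": 1449,
     "categories": None,
     "speakers": None},
]

csv_text = events_to_csv(sample_events)
print(csv_text)
```

The first sample event produces the same line as the `33463` row shown above, while the made-up event ends in two empty quoted fields.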
== The Import
From here, there is built-in support in the standard http://docs.neo4j.org/chunked/stable/cypher-query-lang.html[Neo4j Cypher language] for importing files. Loading venues and speakers is just a matter of iterating over the lines and using http://docs.neo4j.org/chunked/stable/cypher-query-lang.html[LOAD CSV]:
[source,cypher]
----
LOAD CSV WITH headers FROM "https://raw.githubusercontent.com/neo4j-examples/oscon-graph/master/data/speakers.csv" as line
CREATE (speaker:Speaker{serial:line.serial, name:line.name, photo:line.photo, twitter:line.twitter})
----
[source,cypher]
----
LOAD CSV WITH headers FROM "https://raw.githubusercontent.com/neo4j-examples/oscon-graph/master/data/venues.csv" as line
CREATE (venue:Venue{serial:line.serial, name:line.name})
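----
For the events, it is a bit more involved since one event can have several speakers. We match the event's `Venue`, create the `Event` with its `AT_VENUE` relationship, link it to its categories, and then create one `SPEAKS_AT` relationship per speaker serial: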
[source,cypher]
----
LOAD CSV WITH headers FROM "https://raw.githubusercontent.com/neo4j-examples/oscon-graph/master/data/events.csv" as line
MATCH (venue: Venue{serial:line.venue_serial})
CREATE (event:Event{serial:line.serial, name:line.name, time_start: line.time_start, time_end: line.time_end, type:line.event_type})-[:AT_VENUE]->(venue)
FOREACH (category_name in split(line.categories,"|") | MERGE (category:Category {name:category_name}) CREATE (event)-[:IN_CATEGORY]->(category))
WITH event, line
WHERE line.speakers <> ""
UNWIND split(line.speakers, ",") as speaker_serial
MATCH (speaker:Speaker{serial:speaker_serial})
CREATE (event)<-[:SPEAKS_AT]-(speaker)
----
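To make the row-to-relationship expansion in the last query easier to follow, here is a small Python sketch of what the `WHERE line.speakers <> ""` filter and the `UNWIND split(...)` step do to each CSV line. It only mimics the logic outside of Neo4j. `speaks_at_pairs` is a hypothetical helper, the CSV is trimmed to three columns for brevity, and the last row is made up to show the empty case.

```python
import csv
import io

# Trimmed-down stand-in for events.csv; the last row is made up.
EVENTS_CSV = (
    '"serial","name","speakers"\n'
    '"33463","Open Source and Mobile Development: Where Does it go From Here?",'
    '"169870,2216,96208,150073"\n'
    '"33464","Open Source Protocols and Architectures to Fix the Internet of Things",'
    '"2216"\n'
    '"0","Made-up session without speakers",""\n'
)


def speaks_at_pairs(csv_text):
    """Return one (speaker_serial, event_serial) pair per SPEAKS_AT relationship."""
    pairs = []
    for line in csv.DictReader(io.StringIO(csv_text)):
        if line["speakers"] == "":  # WHERE line.speakers <> ""
            continue
        for speaker_serial in line["speakers"].split(","):  # UNWIND split(...)
            pairs.append((speaker_serial, line["serial"]))
    return pairs


pairs = speaks_at_pairs(EVENTS_CSV)
for speaker_serial, event_serial in pairs:
    print(speaker_serial, "-[:SPEAKS_AT]->", event_serial)
```

Unlike the Cypher query, this sketch does not check that a matching `Speaker` node exists. In the query, the `MATCH` on the speaker silently drops serials that were not imported.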