{"id":15159696,"url":"https://github.com/meaxsom/govdocs","last_synced_at":"2026-01-20T05:31:45.094Z","repository":{"id":228879868,"uuid":"775165972","full_name":"meaxsom/govdocs","owner":"meaxsom","description":"Using python with Neo4J. Importing and associating JSON and related XML nodes","archived":false,"fork":false,"pushed_at":"2024-03-29T19:41:32.000Z","size":15,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-07T17:44:59.693Z","etag":null,"topics":["boto3","dotenv","json","neo4j","python","s3","typeid","xml"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/meaxsom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-20T22:03:01.000Z","updated_at":"2024-03-21T01:19:50.000Z","dependencies_parsed_at":"2024-03-29T20:49:39.561Z","dependency_job_id":null,"html_url":"https://github.com/meaxsom/govdocs","commit_stats":{"total_commits":7,"total_committers":1,"mean_commits":7.0,"dds":0.0,"last_synced_commit":"09a6a87a2a6c7e97e9e1a21fc1e1e4a56c97f59f"},"previous_names":["meaxsom/govdocs"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/meaxsom/govdocs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/meaxsom%2Fgovdocs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/meaxsom%2Fgovdocs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/meaxsom%2Fgovdocs/releases","manifests_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/meaxsom%2Fgovdocs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/meaxsom","download_url":"https://codeload.github.com/meaxsom/govdocs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/meaxsom%2Fgovdocs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28596411,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-20T02:08:49.799Z","status":"ssl_error","status_checked_at":"2026-01-20T02:08:44.148Z","response_time":117,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["boto3","dotenv","json","neo4j","python","s3","typeid","xml"],"created_at":"2024-09-26T21:41:32.650Z","updated_at":"2026-01-20T05:31:45.079Z","avatar_url":"https://github.com/meaxsom.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Python Data Munging into Neo4J\n\n## Background\n\nFound on [UpWork](https://www.upwork.com/jobs/~010f399c86288cbe73):\n\nNeed someone to download the [following file](https://www.ecfr.gov/current/title-12) into a Neo4J database, then download and parse the [following file]( https://www.ecfr.gov/api/versioner/v1/full/2024-02-21/title-12.xml?part=1002), and attach/match it to the hierarchy file (the first file).\n\nIf successful, we will keep doing this with other files.\n\nMust have Neo4J 
experience, must have data parsing and matching experience.\n\n\n## Goals\n- develop in 100% Python\n- Files should reside in S3 or be read from local file\n  - could be http(s) as well??\n\n- Neo4j and python development should run in a container\n  - DockerFile or devcontainer??\n    - Ended up using a single Dockerfile.\n      - Devcontainer had problems mounting/working with attached database directory from Neo4J\n  \n- should be updatable with new XML data\n\n\n### Step 0\n- set up S3 bucket with no public access. Access will be by AWS \"keys\" attached to a new IAM user/group\n\n- set up Python dev environment in a dev container with python, s3, and neo4j support\n\t- [boto3](https://github.com/boto/boto3)\n\t\n- set up neo4j in the same environment\n- [Build applications with Neo4j and Python](https://neo4j.com/docs/python-manual/current/)\n\n### Step 0.5\n\n- develop data structures for Neo4J access\n\t- JSON Structure \n    - related by `[HAS_CHILD]` between Parent/Child in JSON\n\n```\n    typeId: \"{our-generated-type-id}\",\n    identifier: \"1.130\",\n    label: \"\\u00a7 1.130 Type II securities; guidelines for obligations issued for university and housing purposes.\",\n    label_level: \"\\u00a7 1.130\",\n    label_description: \"Type II securities; guidelines for obligations issued for university and housing purposes.\",\n    reserved: false,\n    type: \"section\",\n    volumes: [\n      \"1\"\n    ],\n    received_on: \"2017-01-07T00:00:00-0500\"\n```\n  \n- XML Structure\n  - related by `[HAS_XML]` between related JSON node\n\n```\n      typeId: \"{our-generated-type-id}\",\n      elementId: \"B\",\n      xml: \"{quote-encoded-xml-as-in-the-original}\"\n```\n\n### Step 2\n- read file from s3 into local environment \n- used [dotenv](https://dev.to/jakewitcher/using-env-files-for-environment-variables-in-python-applications-55a1) to control environment variables (AWS keys injection into container)\n\n\n### Step 3\n- parse JSON file 
recursively and populate Neo4J\n  - inserted full node from JSON file less children\n    - used existing JSON format\n  - children related using `(a:Division)-[r:HAS_CHILD]-\u003e(b:Division)`\n- used [TypeId](https://github.com/akhundMurad/typeid-python) to generate unique node IDs\n\n### Step 4\n- parse XML recursively and attach it to Neo4J nodes from JSON data\n  - similar approach to JSON/Step 3\n    - used TypeID to generate unique node IDs\n    - carried \"N\" identifier into structure\n    - xml inserted directly as quote escaped XML string\n      - xml nodes related using `(a:Division)-[r:HAS_XML]-\u003e(b:XMLData)`\n\n\n## To Do\n- would be interesting to see if we can do this using AWS cloud resources\n  - Use EC2 w/mounted EBS for Neo4J DB\n  - Use lambda function tied to S3 bucket(s) for updating JSON/XML files and serverless Neptune\n  - Some combo of the above\n\n\n- adjust Cypher queries to use \"best practices\"\n  - use `MERGE` instead of `CREATE`\n  - create `CONSTRAINTS` on `typeId` for uniqueness and value\n  - standardize on property creation so each node has the same \"primary\" key(??)\n\n\n## Schema\n\nThere are 2 general purpose \"nodes\" in the graph:\n- `Division` nodes represent an individual \"record\" from the JSON file\n  - each `Division` has an additional label that is derived from the `type` property, e.g. \"Section\", \"Chapter\", etc\n  - each `Division` node contains a `typeId` property that represents a unique ID for the node since one doesn't seem to exist within the existing data structure. The `typeId` property is generated using the \"typeid-python\" module\n  - each `Division` node also contains all the properties from the original JSON record with the exception of `children`\n  - Each \"child\" of a `Division` becomes its own `Division` node, i.e. all `children` records of a `Division` are themselves `Divisions` and related to their parent `Division` by a `HAS_CHILD` relationship. 
\n\n- `XMLData` nodes represent the collection of elements within an XML `DIVn` element, e.g. `DIV5`. HTML/XML tags with the element name of `DIV` are ignored.\n  - each `XMLData` node contains a `typeId` property that represents a unique ID for the node since one doesn't seem to exist within the existing data structure. The `typeId` property is generated using the \"typeid-python\" module\n  - each `XMLData` node also contains all the properties from the original XML `DIVn` element - including text - with the exception of embedded `DIVn` elements\n  - each \"child\" `DIVn` element of a `DIVn` element becomes its own `XMLData` node.\n  - Each `XMLData` node is associated with a `Division` node by the following:\n    - discover the related parent/child `Division` nodes via the `HAS_CHILD` relationship that contain matching `identifier` properties from the parent/child `DIVn/N` element values\n      - i.e. `MATCH (a:Division)-[:HAS_CHILD]-\u003e(b:Division) return a.typeId, b.typeId`\n    - use the appropriate `typeId` to create a `[:HAS_XML]` relationship between the `Division` and `XMLData` nodes\n  \n- The population of the JSON `Division` nodes and `XMLData` nodes is recursive\n\n```\n(a:Division)-[:HAS_CHILD]-\u003e(b:Division)\n(a:Division)-[:HAS_XML]-\u003e(x:XMLData)\n```\nEach `Division` may have multiple `[:HAS_CHILD]` relationships\nEach `Division` may have only 1 `[:HAS_XML]` relationship\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmeaxsom%2Fgovdocs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmeaxsom%2Fgovdocs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmeaxsom%2Fgovdocs/lists"}