https://github.com/rggh/exa-data
ETL : FHIR sample data JSON to SQL
https://github.com/rggh/exa-data
docker-compose etl json pandas postgresql psycopg2-binary pytest python
Last synced: 6 days ago
JSON representation
ETL : FHIR sample data JSON to SQL
- Host: GitHub
- URL: https://github.com/rggh/exa-data
- Owner: RGGH
- License: mit
- Created: 2022-02-16T16:25:54.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-02-20T23:59:23.000Z (over 3 years ago)
- Last Synced: 2025-03-28T01:55:39.437Z (3 months ago)
- Topics: docker-compose, etl, json, pandas, postgresql, psycopg2-binary, pytest, python
- Language: Python
- Homepage:
- Size: 19.4 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Example ETL project using FHIR standard data
FHIR is a standard for health care data exchange, published by HL7®Purpose of code : Transfer data to structured format (SQL) from 80 sample JSON files, with circa 50,000 lines in each file.
# User guide :
## TLDR;
- Clone this repo : `git clone https://github.com/RGGH/exa-data.git`
- Install Docker to create the database and initialize 'patient_info' table with required fields
- Run python code (2 files below)
---
python3 exa_flatten_json_csv.py
python3 exa_parse_for_sql.py
---## Full Steps:
#### 1. Install Docker Composehttps://docs.docker.com/compose/install/
docker-compose --version
##### *If you get an error :
(env) rag@rag-Latitude-5490:~/env/exa-data-1/exa-data$ sudo docker-compose up
Creating network "exa-data_default" with the default driver
ERROR: could not find an available, non-overlapping IPv4 address pool among the defaults to assign to the network
Stop your openvpn :
sudo service openvpn stop
#### 2. Run Docker Comopse
sudo docker-compose up
#### 3. Install Python Packagespip3 install -r requirements.txt
#### 4. Run the 2 Python files- These have been kept separate to allow for batch conversion from JSON to CSV, and then import into SQL
---
python3 exa_flatten_json_csv.py
python3 exa_parse_for_sql.py
#### 5. Check the imported data in in PostgreSQL
python3 db_connect.py
---### About this project
- used Docker compose with yaml file to create container for a PostgreSQL 14 image
- used regex to extract nested keys from 50,000 lines of JSON
- used Vim to remove empty rows
- used Pandas to remove duplicates but preserve sort order
- used Vim to add data types in bulk, plus commas
- used Jupyter Notebook for exploratory data anlaysis and testing
- unable to find data dictionary on FHIR site for this actual project http://hl7.org/fhir/overview-dev.html http://hl7.org/fhir/resourcelist.html
- not all fields populated due to time constraint, have done first few in order
- had issues with lists inside dicts. Resolved this with 'typing' module https://docs.python.org/3/library/typing.html#module-typing###### ToDo
- Add extra tests
- Fix issues flagged in pylint with github workflow
- Consider more than 1 table for data, eg patients table, plus a join with notes each fullUrl
- Create frontend in HTML/JavaScript and consume data from an API (yet to be written, in Python with FastAPI)
- add remaining fieids in exa_parse_for_sql.py