Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tylerlittlefield/openfda-extract
🩺 Convert openFDA device data from JSON files to tables and store in a database
- Host: GitHub
- URL: https://github.com/tylerlittlefield/openfda-extract
- Owner: tylerlittlefield
- Created: 2020-09-07T20:19:01.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2021-01-14T05:17:39.000Z (almost 4 years ago)
- Last Synced: 2023-07-20T12:01:31.673Z (over 1 year ago)
- Topics: openfda, rstats
- Language: R
- Homepage:
- Size: 23.4 KB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# openfda-extract
This repository collects [adverse events data](https://open.fda.gov/apis/device/event/download/) from openFDA. A single [script](/data-raw/loop.R) attempts to:
1. Convert JSON files to tabular format using `jsonlite::fromJSON` and `tidyr::unnest`.
2. Save the data to a database.

The data is relational, as described [here](https://opendata.stackexchange.com/a/2187). Converting the data to a tabular format may not be efficient, and it causes a lot of duplication. To avoid duplication, I have tried to store nested data in separate tables (a rough sketch follows the list below):
1. `adverse_events`
2. `adverse_events.mdr_text`
3. `adverse_events.product_problems`
4. `adverse_events.source_type`
5. `adverse_events.device`
6. `adverse_events.patient`
7. `adverse_events.remedial_action`
8. `adverse_events.type_of_report`

Here the naming convention is `mainframe.` followed by the name of the nested field.
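As a rough illustration of step 1, here is a minimal sketch of how one quarterly file might be flattened. It is not the repo's `loop.R`; the file name, the `results` element, and the use of `report_number` as a key are assumptions:

```r
library(jsonlite)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Hypothetical file name; the real quarterly files come from the
# openFDA device event download page
raw <- fromJSON("device-event-2014q4-0002-of-0003.json")

# openFDA download files keep the records under `results`; nested
# arrays such as mdr_text arrive as list-columns after fromJSON()
events <- as_tibble(raw$results)

# Main table: keep only the flat (non-list) columns
adverse_events <- select(events, !where(is.list))

# One child table per nested field, e.g. what ends up in
# the `adverse_events.mdr_text` table
adverse_events_mdr_text <- events %>%
  select(report_number, mdr_text) %>%  # report_number assumed as the key
  unnest(mdr_text)
```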
## Hardware
1. I transformed the data on:
    * 2013 15" MacBook Pro, 8 GB Memory, 8 Core CPU.
2. I wrote the data to:
    * Postgres database hosted on a DigitalOcean droplet; a hypothetical write step is sketched after this list.
    * 2 GB Memory, 2 vCPUs, 60 GB Disk, Ubuntu 18.04.3 (LTS) x64.
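That write step might look roughly like the following. It assumes `adverse_events` and `adverse_events_mdr_text` are the flattened data frames from the earlier sketch, that `con` is an open connection like the one in the Examples section below, that the target tables already exist, and that the dotted names are stored as literal table names (which the example queries suggest):

```r
library(DBI)

# Append this quarter's rows to the existing tables; the dotted name is
# treated as a single, literal table name, matching what dbListTables()
# returns in the Examples section
dbWriteTable(con, "adverse_events", adverse_events, append = TRUE)
dbWriteTable(con, "adverse_events.mdr_text", adverse_events_mdr_text, append = TRUE)
```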
For context, this was the result of my first run:

```
~ openFDA database refresh completed in [17.4161201466454 hours]
```

## Examples
If you have successfully run everything, you should have 8 tables with millions of observations that you can explore.
```{r}
library(dplyr, warn.conflicts = FALSE)
library(DBI)

# credentials
dw <- config::get("datawarehouse")

# connect to db
con <- DBI::dbConnect(
  odbc::odbc(),
  Driver = dw$driver,
  Server = dw$server,
  Database = dw$database,
  UID = dw$uid,
  PWD = dw$pwd,
  Port = dw$port
)

# list all available tables
dbListTables(con)

# query the mdr text
tbl(con, "adverse_events.mdr_text") %>%
  select(text) %>%
  head(1)

# query the device information
tbl(con, "adverse_events.device") %>%
  filter(manufacturer_d_name == "ETHICON, INC.") %>%
  glimpse()

# disconnect
dbDisconnect(con)
```

Note that there is an `id` column in every table, for example:
* `2014q4-0002-0003-3683`
* `---`

I made this column so that tables can be joined (this ID, in some other form, might already exist in the data; I just haven't figured out whether that is the case).
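Since every table carries this `id`, joining the main table to one of its nested tables might look like the sketch below. It assumes an open connection `con` like the one created in the chunk above (before `dbDisconnect()` runs) and uses the `text` column queried earlier:

```r
# Sketch: join the main table to its nested mdr_text table on `id`
tbl(con, "adverse_events") %>%
  inner_join(tbl(con, "adverse_events.mdr_text"), by = "id") %>%
  select(id, text) %>%
  head(1)
```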