https://github.com/tylerlittlefield/openfda-extract

🩺 Convert openFDA device data from JSON files to tables and store in a database

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# openfda-extract

This repository collects device [adverse event data](https://open.fda.gov/apis/device/event/download/) from openFDA. A single [script](/data-raw/loop.R) attempts to:

1. Convert JSON files to tabular format using `jsonlite::fromJSON` and `tidyr::unnest`.
2. Save the data to a database.
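
As a rough illustration of step 1, here is a minimal sketch on a tiny hand-written payload. The two records and their values are made up for the example, and the actual script in `data-raw/loop.R` may handle the files differently.

```r
library(jsonlite)
library(tidyr)

# Hypothetical two-record payload; the real openFDA files are far larger.
json <- '{
  "results": [
    {"report_number": "A1", "event_type": "Injury",
     "mdr_text": [{"text": "first note"}, {"text": "second note"}]},
    {"report_number": "B2", "event_type": "Malfunction",
     "mdr_text": [{"text": "only note"}]}
  ]
}'

# fromJSON() simplifies the array of records into a data frame;
# mdr_text becomes a list column holding one data frame per report.
events <- fromJSON(json)$results

# unnest() expands the list column to one row per nested record,
# repeating report_number and event_type on every row.
flat <- unnest(events, cols = mdr_text)
flat
```

The repeated top-level values in `flat` are the duplication that the separate tables below are meant to avoid.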

The data is relational, as described [here](https://opendata.stackexchange.com/a/2187). Flattening it into a single table is inefficient and causes a lot of duplication, because the top-level fields are repeated for every nested record. To avoid this, I have tried to store each nested field in its own table:

1. `adverse_events`
2. `adverse_events.mdr_text`
3. `adverse_events.product_problems`
4. `adverse_events.source_type`
5. `adverse_events.device`
6. `adverse_events.patient`
7. `adverse_events.remedial_action`
8. `adverse_events.type_of_report`

The naming convention is `mainframe.` followed by the name of the nested field, for example `adverse_events.mdr_text`.
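
The split into a main table and one table per nested field could look roughly like the sketch below. It is only an illustration under stated assumptions: the records and `id` values are made up, and an in-memory SQLite database (via the `RSQLite` package) stands in for the Postgres database used in this repository.

```r
library(DBI)
library(tidyr)

# Hypothetical parsed records with a made-up id and one nested list column.
events <- data.frame(id = c("r1", "r2"), event_type = c("Injury", "Malfunction"))
events$mdr_text <- list(
  data.frame(text = c("first note", "second note")),
  data.frame(text = "only note")
)

# Main table: top-level fields only, one row per report.
parent <- events[, c("id", "event_type")]

# Nested table: id plus the unnested records, one row per nested item.
child <- as.data.frame(unnest(events[, c("id", "mdr_text")], cols = mdr_text))

# Assumption: in-memory SQLite stands in for the Postgres database here.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "adverse_events", parent)
dbWriteTable(con, "adverse_events.mdr_text", child)
tables <- dbListTables(con)
dbDisconnect(con)
tables
```

Each top-level value is stored once in the main table, and the nested table carries only the `id` needed to link back to it.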

## Hardware

1. I transformed the data on:
    * 2013 15" MacBook Pro, 8 GB memory, 8-core CPU.
2. I wrote the data to:
    * Postgres database hosted on a DigitalOcean droplet.
    * 2 GB memory, 2 vCPUs, 60 GB disk, Ubuntu 18.04.3 (LTS) x64.

For context, this was the result of my first run:

```
~ openFDA database refresh completed in [17.4161201466454 hours]
```

## Examples

If you have successfully run everything, you should have 8 tables with millions of observations that you can explore.

```{r}
library(dplyr, warn.conflicts = FALSE)
library(DBI)

# credentials
dw <- config::get("datawarehouse")

# connect to db
con <- DBI::dbConnect(
  odbc::odbc(),
  Driver = dw$driver,
  Server = dw$server,
  Database = dw$database,
  UID = dw$uid,
  PWD = dw$pwd,
  Port = dw$port
)

# list all available tables
dbListTables(con)

# query the mdr text
tbl(con, "adverse_events.mdr_text") %>%
  select(text) %>%
  head(1)

# query the device information
tbl(con, "adverse_events.device") %>%
  filter(manufacturer_d_name == "ETHICON, INC.") %>%
  glimpse()

# disconnect
dbDisconnect(con)
```

Note that there is an `id` column in every table, for example:

* `2014q4-0002-0003-3683`
* `---`

I created this column so that the tables can be joined. An equivalent identifier may already exist in the data in some other form, but I have not yet worked out whether it does.
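
A join on `id` might look like the following sketch. The rows are made up and built in memory so the example runs without a database; only the `id` format follows the example above. The same `inner_join()` call should also work on the lazy tables returned by `tbl(con, ...)`.

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up rows; only the id format follows the example above.
events <- tibble(
  id = c("2014q4-0002-0003-3683", "2014q4-0002-0003-3684"),
  event_type = c("Injury", "Malfunction")
)
mdr_text <- tibble(
  id = c("2014q4-0002-0003-3683", "2014q4-0002-0003-3683"),
  text = c("first note", "second note")
)

# Keep only reports that have narrative text, one row per text record.
joined <- inner_join(events, mdr_text, by = "id")
joined
```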