https://github.com/dylanpieper/redquack

REDCap and DuckDB Flock Together
https://github.com/dylanpieper/redquack

duckdb r redcap

Last synced: about 1 year ago
JSON representation

REDCap and DuckDB Flock Together

Host: GitHub
URL: https://github.com/dylanpieper/redquack
Owner: dylanpieper
License: other
Created: 2025-03-15T19:58:05.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-04-25T13:23:58.000Z (about 1 year ago)
Last Synced: 2025-04-25T22:38:41.735Z (about 1 year ago)
Topics: duckdb, r, redcap
Language: R
Homepage: https://dylanpieper.github.io/redquack/
Size: 9.53 MB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: NEWS.md
- License: LICENSE

Awesome Lists containing this project

README

          # redquack 

[![CRAN status](https://www.r-pkg.org/badges/version/redquack)](https://cran.r-pkg.org/package=redquack) [![R-CMD-check](https://github.com/dylanpieper/redquack/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/dylanpieper/redquack/actions/workflows/R-CMD-check.yaml)

Transfer [REDCap](https://www.project-redcap.org/) data to [DuckDB](https://duckdb.org/) with minimal memory overhead, designed for large datasets that exceed available RAM.

## Motivation

R objects live entirely in RAM, causing three problems if not using a specialized framework:

1.  You must load full datasets even if you only need a subset

2.  Unused objects still consume memory

3.  Large datasets can easily exceed available RAM

redquack's solution to this problem is to:

1.  Request all of the REDCap record IDs to sequence in chunks

2.  Process each chunk of the REDCap data in one R object at a time

3.  Remove each object from memory after it has been transferred to DuckDB

## Features

-   Chunked transfers for memory efficiency

-   Auto-resume from interruptions

-   Optimal data type conversion

-   Timestamped operation logs

-   Configurable API request retries

-   Real-time progress indicators

-   Completion notifications (🔊 🦆)

## Installation

From CRAN:

``` r

install.packages("redquack")

```

Development version:

``` r

# install.packages("pak")

pak::pak("dylanpieper/redquack")

```

## Basic Usage

Data from REDCap is transferred to DuckDB in configurable chunks of record IDs:

``` r

library(redquack)

con <- redcap_to_duckdb(

  redcap_uri = "https://redcap.example.org/api/",

  token = "YOUR_API_TOKEN",

  record_id_name = "record_id",

  chunk_size = 1000  

  # Increase chunk size for memory-efficient systems (faster)

  # Decrease chunk size for memory-constrained systems (slower)

)

```

By default, the function returns the DuckDB connection from the output file `redcap.duckdb`.

### Data Manipulation

Query and collect the data with `dplyr`:

``` r

library(dplyr)

demographics <- tbl(con, "data") |>

  filter(demographics_complete == 2) |>

  select(record_id, age, race, gender) |>

  collect()

```

Create a Parquet file directly from DuckDB (efficient for sharing data):

``` r

DBI::dbExecute(con, "COPY (SELECT * FROM data) TO 'redcap.parquet' (FORMAT PARQUET)")

```

Remember to close the connection when finished:

``` r

DBI::dbDisconnect(con)

```

### Workflow Mode

For scripted workflows or automated processes where you don't need to return the connection, you can use the function in workflow mode:

```r

success <- redcap_to_duckdb(

  redcap_uri = "https://redcap.example.org/api/",

  token = "YOUR_API_TOKEN",

  record_id_name = "record_id",

  return_duckdb = FALSE

)

if (success) {

  message("Data transfer completed successfully!")

} else {

  stop("Data transfer failed or is incomplete!")

}

```

When `return_duckdb = FALSE`, the function returns a logical value:

-   `TRUE` for a complete successful transfer

-   `FALSE` for a failed or partially completed transfer

Workflow mode automatically tries to resume incomplete transfers up to `max_retries` times.

## Database Structure

The DuckDB database created by `redcap_to_duckdb()` contains two tables:

1.  `data`: Contains all exported REDCap records with optimized column types

    ``` r

    DBI::dbGetQuery(con, "SELECT * FROM data LIMIT 10")

    ```

2.  `log`: Contains timestamped logs of the transfer process for troubleshooting

    ``` r

    DBI::dbGetQuery(con, "SELECT timestamp, type, message FROM log ORDER BY timestamp")

    ```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/dylanpieper/redquack

Awesome Lists containing this project

README