https://github.com/WerthPADOH/naaccr
Read cancer records in the NAACCR format
- Host: GitHub
- URL: https://github.com/WerthPADOH/naaccr
- Owner: WerthPADOH
- License: other
- Created: 2018-07-13T12:53:03.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2022-11-23T15:50:53.000Z (about 2 years ago)
- Last Synced: 2024-04-24T21:21:22.008Z (10 months ago)
- Topics: naaccr, rstats
- Language: R
- Size: 1.01 MB
- Stars: 10
- Watchers: 3
- Forks: 2
- Open Issues: 3
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - WerthPADOH/naaccr - Read cancer records in the NAACCR format (R)
README
---
title: "naaccr"
output:
  github_document:
    html_preview: false
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

## Summary

The `naaccr` R package enables researchers to easily read and begin analyzing
cancer incidence records stored in the
[North American Association of Central Cancer Registries](https://www.naaccr.org/)
(NAACCR) file format.

## Usage

`naaccr` focuses on two tasks: arranging the records and preparing the fields
for analysis.

### Records

The `naaccr_record` class defines objects which store cancer incidence records.
It inherits from `data.frame`, and for now only makes sure a dataset has a
standard set of columns. While `naaccr_record` has a singular-sounding name, it
can contain multiple records as rows.

The `read_naaccr` function creates a `naaccr_record` object from a
NAACCR-formatted file.

```{r showRecords}
record_file <- system.file(
  "extdata/synthetic-naaccr-18-abstract.txt",
  package = "naaccr"
)
record_lines <- readLines(record_file)
## Marital status and race fields
cat(substr(record_lines[1:5], 206, 216), sep = "\n")
```

```{r readNaaccr}
library(naaccr)

records <- read_naaccr(record_file, version = 18)
records[1:5, c("maritalStatusAtDx", "race1", "race2", "race3")]
```

By default, `read_naaccr` reads all fields defined in a format. For example,
the NAACCR 18 format used above has `r nrow(naaccr_format_18)` fields. Rarely
would an analysis need even 100 fields. By specifying which fields to keep, one
can improve time and memory efficiency.

```{r readKeepColumns}
dim(records)
records_slim <- read_naaccr(
  input = record_file,
  version = 18,
  keep_fields = c("ageAtDiagnosis", "countyAtDx", "primarySite")
)
dim(records_slim)
```

As with most classes, one can create a new `naaccr_record` object with the
function of the same name. The result will have the given columns.

```{r naaccrRecord}
nr <- naaccr_record(
  primarySite = "C010",
  dateOfBirth = "19450521"
)
nr[, c("primarySite", "dateOfBirth")]
```

The `as.naaccr_record` function can transform an existing data frame. It does
require any existing columns to use NAACCR's XML names.

```{r asNaaccrRecord}
prefab <- data.frame(
  ageAtDiagnosis = c(1, 120, 999),
  race1 = c("01", "02", "88")
)
converted <- as.naaccr_record(prefab)
converted[, c("ageAtDiagnosis", "race1")]
```

### Code translation

The NAACCR format uses similar schemes for a lot of fields, and the `naaccr`
package includes functions to help translate them.

`naaccr_boolean` translates "yes/no" fields. By default, it assumes `"0"` stands
for "no", and `"1"` stands for "yes."

```{r naaccrBoolean}
naaccr_boolean(c("0", "1", "2"))
```

Some fields use `"1"` for `FALSE` and `"2"` for `TRUE`. Use the `false_value`
parameter to work with these.

```{r falseValue}
naaccr_boolean(c("0", "1", "2"), false_value = "1")
```

#### Categorical fields

The `naaccr_factor` function translates values using a specific field's category
codes.

```{r naaccrFactor}
naaccr_factor(c("01", "31", "65"), "primaryPayerAtDx")
```

Some fields have multiple codes explaining why an actual value isn't known.
By default, these codes are all converted to `NA` so the missingness propagates
through calculations in R. The reasons can still be useful, so `naaccr_factor`
and `naaccr_record` both have a `keep_unknown` parameter for retaining them.

```{r keepUnknown}
naaccr_factor(c("1", "9"), field = "sex")
naaccr_factor(c("1", "9"), field = "sex", keep_unknown = TRUE)
naaccr_record(sex = c("1", "9"), race1 = c("01", "99"), keep_unknown = TRUE)
```

#### Numeric with special missing

Some fields contain primarily continuous or count data but also use special
codes. One name for this type of code is a "sentinel value." The
`split_sentineled` function splits these fields in two.

```{r naaccrSentineled}
rnp <- split_sentineled(c(10, 20, 90, 95, 99, NA), "regionalNodesPositive")
rnp
```

## Building

```{r needForBuild}
library(devtools)

deps <- packageDescription(
  "naaccr",
  fields = c("Depends", "Imports", "Suggests")
)
deps <- Filter(function(x) any(!is.na(x)), deps)
dep_names <- lapply(deps, function(x) devtools::parse_deps(x)[["name"]])
dep_names <- sort(unlist(dep_names))
dep_list <- paste0("- `", dep_names, "`", collapse = "\n")
```

To build the `naaccr` package, you'll need the following R packages:

`r dep_list`

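
Before building, it can help to check which of these packages are already installed. The following is a minimal sketch using only base R; `find_missing_packages` is a hypothetical helper, not part of `naaccr`, and the package names passed to it are placeholders for the names listed above.

```r
# Hypothetical helper (not part of naaccr): return the names in `pkgs`
# whose namespaces cannot be loaded on this machine.
find_missing_packages <- function(pkgs) {
  # requireNamespace() returns FALSE instead of raising an error when a
  # package is not installed. It loads the namespace as a side effect.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Placeholder names; substitute the packages from the list above.
missing_pkgs <- find_missing_packages(c("stats", "utils"))
missing_pkgs

# If anything is reported missing, install it before building:
# install.packages(missing_pkgs)
```

The install step is left commented out so that nothing is downloaded without an explicit decision.
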
To document, build, and test the package, run the `build.R` script with the
package's root as the working directory.

## Project files

First, know this project fills two roles:

1. Creating a package to work with NAACCR data in R.
2. Collecting the data needed to process NAACCR files in plain-text and
   machine-readable formats.

```
naaccr/
├ R/                    # R files to create the package objects
├ data-raw/             # Plain-text data files and scripts for processing them
│ ├ code-labels/        # Mappings of codes to understandable labels
│ ├ sentinel-labels/    # Mappings of sentinel values to understandable labels
│ └ record-formats/     # Tables defining each NAACCR file format
├ external/             # Downloaded files and scripts to create files in `data-raw`
├ inst/
│ └ extdata/            # Data files for examples in the documentation
└ tests/                # Tests and data using the `testthat` package
```

Files in `external` only need to be updated or run when NAACCR publishes a new
or revised format. In that case, refer to the comments in the `.R` scripts in
that directory for where to download the new files.

Think of these scripts as handy tools for generating `data-raw` files.
Some cleaning of their output may be required.

To run `create-record-format-files.R`, you'll need to create an account for the
[SEER API](https://api.seer.cancer.gov/) from the National Cancer Institute's
Surveillance, Epidemiology, and End Results (SEER) program.
Store the API key as an environment variable named `SEER_API_KEY`.
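
One way to make the key available is through R's standard environment-variable tools. The sketch below assumes the script looks the key up with `Sys.getenv("SEER_API_KEY")`, which this README does not confirm; `get_seer_key` is a hypothetical helper and `"your-key-here"` is a placeholder.

```r
# Set the variable for the current R session only. To make it persistent,
# add the line SEER_API_KEY=your-key-here to your ~/.Renviron file instead.
Sys.setenv(SEER_API_KEY = "your-key-here")

# Hypothetical helper (not part of naaccr): fetch the key and fail early
# with a clear message if it has not been set.
get_seer_key <- function() {
  key <- Sys.getenv("SEER_API_KEY")  # returns "" when the variable is unset
  if (!nzchar(key)) {
    stop("SEER_API_KEY is not set; create a SEER API account to obtain a key.")
  }
  key
}

seer_key <- get_seer_key()
```

Keeping the key in `~/.Renviron` rather than in a script avoids committing it to version control by accident.
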