Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aaronrudkin/autumn
autumn: Fast, Modern, and Tidy-Friendly Iterative Raking in R.
https://github.com/aaronrudkin/autumn
r r-package r-stats raking stats survey weighting
Last synced: about 2 months ago
JSON representation
autumn: Fast, Modern, and Tidy-Friendly Iterative Raking in R.
- Host: GitHub
- URL: https://github.com/aaronrudkin/autumn
- Owner: aaronrudkin
- License: other
- Created: 2019-11-17T07:23:51.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2024-01-31T17:50:11.000Z (12 months ago)
- Last Synced: 2024-08-13T07:14:27.312Z (5 months ago)
- Topics: r, r-package, r-stats, raking, stats, survey, weighting
- Language: R
- Size: 471 KB
- Stars: 36
- Watchers: 2
- Forks: 8
- Open Issues: 5
-
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - aaronrudkin/autumn - autumn: Fast, Modern, and Tidy-Friendly Iterative Raking in R. (R)
README
---
output: github_document
---```{r, echo = FALSE, include=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
library(autumn)
library(bench)
library(tidyverse)
```# autumn: Fast, Modern, and Tidy Raking
> *"And as to me, I know nothing else but miracles"*
> - Walt Whitman, probably talking about this package.[![Travis-CI build status](https://travis-ci.org/r-lib/pkgdown.svg?branch=master)](https://travis-ci.org/aaronrudkin/autumn)
[![Coverage Status](https://coveralls.io/repos/github/aaronrudkin/autumn/badge.svg?branch=master)](https://coveralls.io/github/aaronrudkin/autumn?branch=master)Iterative proportional fitting (raking) is a straightforward and fast way to generate weights which ensure a dataset reflects known target marginal distributions: put simply, survey professionals use raking to ensure that samples represent the population they are drawn from.
Existing `R` implementations of raking are frustrating to use, have antiquated syntax, require external dependencies or compilation, have inadequate documentation, generate difficult to understand errors, run slowly, and don't support "tidy" workflows. **autumn** is a modern package built from the ground up to fix these problems.
## Installation
**autumn** will be submitted to CRAN in January 2020. In the meantime, you can install it using the following command:
```{r, eval = FALSE}
# Install GitHub version:
devtools::install_github("aaronrudkin/autumn")
```## Usage
The workhorse function of **autumn** is `harvest()`, which takes at minimum two arguments: 1) a data.frame (or tibble) containing data; 2) target proportions. At its simplest, a call to `harvest()` works as follows:
```{r eval=FALSE}
# Standard R function call
harvest(respondent_data, ns_target)# Using `magrittr`'s pipe operator
respondent_data %>% harvest(ns_target)
```It just works! This function call will iteratively weight observations to match the target proportions and add a column `weights` to the data frame (it is also possible to rename the column or return the weights as a vector). Default parameters are helpful and sane: weights are guaranteed mean 1 and maximum 5.
### Specifying a Target
The main challenge when running `harvest()` is to correctly specify target proportions. Two formats are supported: 1) a list of named vectors; 2) a data.frame or tibble.
When supplying targets as a list of named vectors, it looks like this:
```{r, eval=FALSE}
list(
gender = c(Male = 0.4829, Female = 0.5171),
region = c(Midwest = 0.2086,
Northeast = 0.1764,
South = 0.3775,
West = 0.2374)
)
```Each list element should match the name of a single variable in the data, and each vector name should match a value the variable can take. The numeric values should be positive and sum to 1 within each variable.
When supplying data as a data.frame or tibble, the data.frame should have three columns (by default `harvest()` looks for columns named "variable", "level", and "proportion" -- although these names can be overridden):
```{r echo=FALSE}
target_tbl = as_tibble(autumn:::list_targets_to_df(ns_target[c("gender", "region")])) %>% mutate(proportion = round(proportion, 4))
```
```{r}
target_tbl
```### Advanced Usage
**autumn** supports a variety of advanced features including:
- Supplying starting weights
- Adjusting maximum weights
- Adjusting convergence and iteration criteria
- Adjusting variable selection and error calculation criteria
- Handling missing data appropriately
- Calculating design effects for produced weights
- Summarizing raking resultsInterested in doing something fancy? Check out our R vignettes for more details:
TODO VIGNETTES GO HERE## Speed `r emo::ji("rocket")`
How fast is **autumn**? Fast.
Below, we present results of three different benchmark scenarios, each using real data (the first two benchmarks use the `respondent_data` and `ns_target` datasets included with **autumn**). All of these benchmarks use identical data and default parameterizations, and were run on a low power 2016-vintage personal computer. The larger the the dataset and the more complicated the rake, the more you benefit from using **autumn**. Customizing convergence criteria to allow for earlier termination can result in further speed improvements over existing software.
*Note:*
### Small scale
This benchmark generates weights for a dataset of 6,691 observations, raking on 10 variables. Compared with the implementation in **anesrake**, **autumn** is about *67% faster* and allocates one third less memory. Compared with the implementation in **survey**, **autumn** is about *4X as fast* and allocates 20% more memory.
```{r echo=FALSE}
# Pre-compute the result above on real data and store as a pre-made expression
# so that we can
result_small_benchmark = structure(list(expression = c("autumn", "anesrake", "survey"),
min = structure(c(1.34557561800001, 2.462163475, 4.56870103200001
), class = c("bench_time", "numeric")), median = structure(c(1.73368290000001,
2.892148122, 6.756519067), class = c("bench_time", "numeric"
)), `itr/sec` = c(0.572472241007729, 0.314504693882058, 0.147574009494806
), mem_alloc = structure(c(774216136, 1189511664, 644706568
), class = c("bench_bytes", "numeric")), `gc/sec` = c(3.00547926529058,
2.42483118983067, 0.866259435734511), n_itr = c(100L, 100L,
100L)), class = c("bench_mark", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L))result_small_benchmark
```### Medium scale
Consider a raking task that is more difficult to converge: the same dataset (6,691 observations) raked on 17 variables. The extra variables involve interactions which greatly complicate convergence. **autumn** is *three times as fast* as **anesrake** and uses almost two thirds less memory (**survey** will not complete the rake):
```{r echo=FALSE}
result_big_benchmark = structure(list(expression = c("autumn", "anesrake"), min = structure(c(2.29740926100001,
8.409736236), class = c("bench_time", "numeric")), median = structure(c(3.0174227935,
9.993015223), class = c("bench_time", "numeric")), `itr/sec` = c(0.327217747698054,
0.0945252613310439), mem_alloc = structure(c(1305062264, 3338550552
), class = c("bench_bytes", "numeric")), `gc/sec` = c(6.57707672873089,
4.76974468676447), n_itr = c(100L, 100L), n_gc = c(2010, 5046
)), class = c("bench_mark", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-2L))result_big_benchmark
```### Large scale
Finally, consider an extremely resource intensive problem: raking a much larger dataset of 108,660 observations on 17 variables. In this scenario, **autumn** is 11 times faster and uses 92% less memory. (This benchmark is limited to 10 iterations):
```{r echo=FALSE}
# result_mega_benchmark = bench::mark(
# "autumn" = { autumn::harvest(result_mega, ns_target) },
# "anesrake" = { anesrake::anesrake(ns_target,
# junk_data_mega,
# junk_data_mega$ResponseID) },
# check = FALSE,
# min_iterations = 10
# ) %>% select(1:8) %>%
# mutate(expression = c("autumn", "anesrake"))result_mega_benchmark = structure(list(expression = c("autumn", "anesrake"), min = structure(c(47.098076077,
489.156242585), class = c("bench_time", "numeric")), median = structure(c(48.7915496325,
527.8651931685), class = c("bench_time", "numeric")), `itr/sec` = c(0.0200139846259755,
0.00189092458918188), mem_alloc = structure(c(22328143016, 256220489208
), class = c("bench_bytes", "numeric")), `gc/sec` = c(2.43570192898122,
2.52211521705079), n_itr = c(10L, 10L), n_gc = c(1217, 13338)), class = c("bench_mark",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -2L))result_mega_benchmark
```## Why is the package called "autumn"?
## Authorship and Funding
**autumn** is written and maintained by [Aaron Rudkin](https://github.com/aaronrudkin/). Target proportions in the included `ns_target` data were developed by [Alex Rossell-Hayes](https://github.com/rossellhayes).
If you have any comments, issues, or concerns, please [open a GitHub issue](https://github.com/aaronrudkin/autumn/issues). Contributions are welcome. Please see our [Contributor Code of Conduct](.github/CODE_OF_CONDUCT.md) for details.
**autumn** was developed in conjunction with [Democracy Fund + UCLA Nationscape](https://www.voterstudygroup.org/nationscape), one of the largest public opinion surveys ever conducted. UCLA's Nationscape team are: [Tyler Reny](http://tylerreny.github.io/), [Alex Rossell-Hayes](http://alexander.rossellhayes.com/), [Aaron Rudkin](https://github.com/aaronrudkin/), [Chris Tausanovitch](http://www.ctausanovitch.com/), and [Lynn Vavreck](https://www.lynnvavreck.com/). Funding for this project was provided by [Democracy Fund](https://www.democracyfund.org/), part of the [Omidyar Group](http://omidyargroup.com/).
![UCLA + Democracy Fund](man/figures/logo_ucla_demfund.png "UCLA + Democracy Fund")
Package hex logo adapted from art by [Freepik](https://www.flaticon.com/authors/Freepik) from [flaticon.com](https://www.flaticon.com/)