https://github.com/lydialucchesi/smallsets
Visual documentation for data preprocessing in R and Python
data-science data-visualization documentation-tool machine-learning preprocessing python r r-package visualization-tools
- Host: GitHub
- URL: https://github.com/lydialucchesi/smallsets
- Owner: lydialucchesi
- License: gpl-3.0
- Created: 2020-10-20T04:42:32.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2025-01-23T18:31:56.000Z (about 1 year ago)
- Last Synced: 2025-03-17T05:05:40.820Z (11 months ago)
- Topics: data-science, data-visualization, documentation-tool, machine-learning, preprocessing, python, r, r-package, visualization-tools
- Language: R
- Homepage: https://lydialucchesi.github.io/smallsets/
- Size: 16.1 MB
- Stars: 14
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
README
---
output: github_document
---
```{r, echo=FALSE, out.width="17%", fig.align="right", out.extra='style="float:right; padding:15px"'}
knitr::include_graphics("man/figures/hex_sticker.png")
```
# smallsets: Visual Documentation for Data Preprocessing in R and Python

**`smallsets` website: [lydialucchesi.github.io/smallsets/](https://lydialucchesi.github.io/smallsets/)**
Do you use R or Python to preprocess datasets for analyses? `smallsets` is an R package ([available on CRAN](https://CRAN.R-project.org/package=smallsets)) that transforms the preprocessing code in your R, R Markdown, Python, or Jupyter Notebook file into a Smallset Timeline. A Smallset Timeline is a static, compact visualisation composed of small data snapshots taken at different preprocessing steps. A full description of the Smallset Timeline can be found in the paper [**Smallset Timelines: A Visual Representation of Data Preprocessing Decisions**](https://doi.org/10.1145/3531146.3533175) in the proceedings of ACM FAccT '22.
The `smallsets` user guide is available [here](https://lydialucchesi.github.io/smallsets/articles/smallsets.html) and, once the package is installed, via `vignette("smallsets")`. If you have questions or would like help building a Smallset Timeline, please [email Lydia](mailto:lydia.lucchesi@anu.edu.au).
**[Download the smallsets cheatsheet (1-page PDF)](https://lydialucchesi.github.io/smallsets_cheatsheet/smallsets_cheatsheet.pdf)**
## Install from CRAN
```{r, eval=FALSE}
install.packages("smallsets")
```
## Quick start example
Run this snippet of code to build your first Smallset Timeline! It's based on the synthetic dataset `s_data`, which has 100 observations and eight variables (`C1`-`C8`), and on the preprocessing script `s_data_preprocess.R`, discussed in the following section.
```{r quick-start-example, eval=FALSE}
library(smallsets)
set.seed(145)
Smallset_Timeline(data = s_data,
                  code = system.file("s_data_preprocess.R", package = "smallsets"))
```
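Before building the timeline, it can help to look at its two inputs. The sketch below assumes `smallsets` is installed and that `s_data` becomes available once the package is attached, as the quick start call implies. `script_path` is a name introduced here for illustration only.

```r
library(smallsets)

# Inspect the synthetic dataset bundled with the package.
# The text above describes it as 100 observations of eight variables (C1-C8).
dim(s_data)
names(s_data)

# Locate the bundled preprocessing script and print its contents.
# system.file() returns "" if the file cannot be found.
script_path <- system.file("s_data_preprocess.R", package = "smallsets")
writeLines(readLines(script_path))
```

Printing the script this way shows the structured comments next to the code they annotate.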

## Structured comments
The Smallset Timeline from the quick start example is based on the R preprocessing script below, `s_data_preprocess.R`. Structured comments in the script tell `smallsets` what to do.
```{r, code=readLines(system.file("s_data_preprocess.R", package="smallsets")), eval=FALSE, class.source="view-only"}
```
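As a rough illustration of what structured comments can look like, here is a minimal sketch on a hypothetical data frame, `my_data`. It is not the bundled script. It assumes the `# smallsets snap <data> caption[...]caption` comment form, with an optional `+N` offset, as described in the user guide. Treat the exact syntax as an assumption and confirm it in `vignette("smallsets")` for your installed version.

```r
# Hypothetical data, created here only so the sketch runs on its own.
# In a real preprocessing script the data object is supplied through the
# `data` argument of Smallset_Timeline() instead.
my_data <- data.frame(
  id    = 1:6,
  keep  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  score = c(2.5, 3.1, NA, 4.0, 1.2, 3.5)
)

# Assumed structured-comment syntax: request a snapshot of my_data here
# and attach a caption describing the step that follows.
# smallsets snap my_data caption[Remove rows where keep
# is FALSE.]caption
my_data <- my_data[my_data$keep == TRUE, ]

# Assumed syntax: "+1" asks for the snapshot after the next line of code.
# smallsets snap +1 my_data caption[Replace missing values in
# score with the column mean.]caption
my_data$score[is.na(my_data$score)] <- mean(my_data$score, na.rm = TRUE)
```

Because the structured comments are ordinary R comments, the script still runs unchanged outside of `smallsets`.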
## Citing `smallsets`
If you use the `smallsets` software, please cite the Smallset Timeline paper.
Lydia R. Lucchesi, Petra M. Kuhnert, Jenny L. Davis, and Lexing Xie. 2022. Smallset Timelines: A Visual Representation of Data Preprocessing Decisions. In 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22). Association for Computing Machinery, New York, NY, USA, 1136–1153. https://doi.org/10.1145/3531146.3533175
```
@inproceedings{SmallsetTimelines,
  author = {Lucchesi, Lydia R. and Kuhnert, Petra M. and Davis, Jenny L. and Xie, Lexing},
  title = {Smallset Timelines: A Visual Representation of Data Preprocessing Decisions},
  year = {2022},
  isbn = {9781450393522},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3531146.3533175},
  doi = {10.1145/3531146.3533175},
  booktitle = {2022 ACM Conference on Fairness, Accountability, and Transparency},
  pages = {1136--1153},
  location = {Seoul, Republic of Korea},
  series = {FAccT '22}
}
```