https://github.com/hauselin/docdata

R package to generate dataset documentation semi-automatically https://hauselin.github.io/docdata/

---
output: github_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures",
out.width = "100%"
)
```

# docdata

docdata is an R package that **generates documentation for datasets semi-automatically**. It streamlines the process of recording what a dataset contains and who collected it, when, where, how, and why. It also **standardizes documentation**.

Ideally, every dataset (e.g., csv/txt file) with tabular data should have a corresponding documentation file that describes the rows and columns of that dataset and other information about the dataset. `docdata` helps you accomplish all that.

`docdata` aims to make data documentation and sharing easier. It helps you avoid being **that** person who shares data that no one else can use because nothing was documented.

[![Travis build status](https://travis-ci.org/hauselin/docdata.svg?branch=master)](https://travis-ci.org/hauselin/docdata)
[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/hauselin/docdata?branch=master&svg=true)](https://ci.appveyor.com/project/hauselin/docdata)

## Examples

Below are examples of documentation generated by `docdata`:

* [Data from experimental research](https://github.com/hauselin/depletion_bayes/tree/master/Data)
* Cognitive task data in [GitHub repository](https://github.com/hauselin/depletion_bayes/blob/master/Data/stroop_single_trial.md) and as a [raw markdown file](https://raw.githubusercontent.com/hauselin/depletion_bayes/master/Data/stroop_single_trial.md)

## Installation

To install the package, type the following commands into the R console:

``` r
# install.packages("devtools")
devtools::install_github("hauselin/docdata") # you might have to install devtools first (see above)
```

## How to use docdata?

**Step 1: use `doc_data()` to generate a documentation (markdown file)**

* Example: `doc_data("mtcars.csv")` (assuming `mtcars.csv` is a dataset in your working directory.)

**Step 2: use `disp_doc()` to print the doc in your console**

* Example: `disp_doc("mtcars.csv")` or `disp_doc("mtcars.md")`

**Step 3: use `doc_open()` to open the doc to edit it**

* Example: `doc_open("mtcars.csv")` or `doc_open("mtcars.md")`

**Step 4: use `doc_refresh()` to refresh/update your documentation**

* Example: `doc_refresh("mtcars.csv")` or `doc_refresh("mtcars.md")`

**Step 5: share your dataset and documentation file with others or your future self(!)**
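
The steps above can be strung together in one short session. The following is a minimal sketch, not an official workflow: it assumes the four functions are exported from the `docdata` namespace and behave as the examples above describe. The built-in `mtcars` data is written to the working directory first so that `mtcars.csv` exists, and the `docdata` calls are guarded so they only run in an interactive session where the package is installed.

```r
# Create the example dataset in the working directory (base R only).
write.csv(mtcars, "mtcars.csv", row.names = FALSE)

# Assumption: docdata is installed and exports these four functions.
# The guard keeps the script harmless when it is not, or when run in batch mode.
if (interactive() && requireNamespace("docdata", quietly = TRUE)) {
  docdata::doc_data("mtcars.csv")    # Step 1: generate mtcars.md
  docdata::disp_doc("mtcars.md")     # Step 2: print the doc in the console
  docdata::doc_open("mtcars.md")     # Step 3: open the doc for editing
  docdata::doc_refresh("mtcars.md")  # Step 4: tidy/update after editing
}
```

Using the `docdata::` prefix instead of `library(docdata)` is just a choice made here to keep the sketch runnable without attaching the package.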

### Step 1: `doc_data()`

`doc_data()` generates a markdown file that looks like the one shown below. If your dataset is `mtcars.csv`, the markdown file will be named `mtcars.md` and will be located in the same directory as `mtcars.csv`.

Example usage: `doc_data("mtcars.csv")` (assuming `mtcars.csv` is a dataset in your working directory.)

```
A GitHub flavored Markdown textfile documenting a dataset.

Generated using [docdata package](https://hauselin.github.io/docdata/) on 2019-12-08 18:16:46.
To cite this package, type citations("docdata") in console.

## Data source

mtcars.csv

## About this file

* What (is the data):
* Who (generated this documentation):
* Who (collected the data):
* When (was the data collected):
* Where (was the data collected):
* How (was the data collected):
* Why (was the data collected):

## Additional information

* Contact: [email protected]
* Registration: https://osf.io

## Columns

* Rows: 32
* Columns: 4

| Column | Type | Description |
| ------- | -------- | ----------- |
| mpg | numeric | |
| cyl | numeric | |
| disp | numeric | |
| hp | numeric | |

End of documentation.

```
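
To make the "Columns" table in the generated file less mysterious, here is a rough base R sketch of how such a table could be built from a data frame. It is an illustration only and not how `docdata` is implemented; `column_table()` is a hypothetical helper that does not exist in the package.

```r
# Hypothetical helper (NOT part of docdata): build a markdown table with
# one row per column, its R class, and an empty description cell.
column_table <- function(df) {
  types <- unname(vapply(df, function(x) class(x)[1], character(1)))
  rows <- sprintf("| %s | %s | |", names(df), types)
  c("| Column | Type | Description |",
    "| ------ | ---- | ----------- |",
    rows)
}

# Use four columns of the built-in mtcars data as an example.
tbl <- column_table(mtcars[, c("mpg", "cyl", "disp", "hp")])
writeLines(tbl)
```

The description cells are left empty on purpose: that is the part a person has to fill in by hand, which is why the documentation is only semi-automatic.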

### Step 2: `disp_doc()`

`disp_doc()` prints the documentation in your console. An example (truncated) output is shown below.

Example usage: `disp_doc("mtcars.csv")` or `disp_doc("mtcars.md")`

```
--- DOCUMENTATION BEGIN ---
1 A GitHub flavored Markdown textfile documenting a dataset.
2
3 Generated using docdata package on 2019-12-08 12:50:50.
4 To cite this package, type citations("docdata") in console.
5
6 ## Data source
7
8 mtcars.csv
9
10 ## About this file
...
--- DOCUMENTATION END ---
```
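
The console output above amounts to the markdown file echoed with line numbers between two marker lines. The sketch below reproduces that idea in base R; it is an assumption about the output format based on the sample above, not the package's code, and `show_with_line_numbers()` is a hypothetical name.

```r
# Hypothetical helper (NOT part of docdata): print a text file with
# line numbers between BEGIN/END markers, similar to the sample above.
show_with_line_numbers <- function(path) {
  lines <- readLines(path)
  out <- c("--- DOCUMENTATION BEGIN ---",
           paste(seq_along(lines), lines),
           "--- DOCUMENTATION END ---")
  cat(out, sep = "\n")
  invisible(out)
}

# Demonstrate on a small temporary markdown file.
tmp <- tempfile(fileext = ".md")
writeLines(c("## Data source", "", "mtcars.csv"), tmp)
out <- show_with_line_numbers(tmp)
```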

### Step 3: `doc_open()`

`doc_open()` opens the documentation in R or RStudio so you can edit it and fill in the details.

Example usage: `doc_open("mtcars.csv")` or `doc_open("mtcars.md")`

### Step 4: `doc_refresh()`

If your documentation looks messy after you've edited it (especially if the description column isn't aligned), run `doc_refresh()` to clean it up. You can also run it when the columns or rows of your dataset have changed since the documentation was last generated: it merges your previous documentation with a refreshed version that reflects the current dataset.

Example usage: `doc_refresh("mtcars.csv")` or `doc_refresh("mtcars.md")`

* Before (messy)

```
| Column | Type | Description |
| ------- | -------- | --------------------- |
| mpg | numeric | miles per gallon |
| cyl | numeric | number of cylinders |
| disp | numeric | displacement (cu.in.) |
| fakecolumn | numeric | non-existent column |
```

* After running `doc_refresh()`: spacing is cleaned up, columns that no longer exist are removed, and new columns are added

```
| Column | Type | Description |
| ------- | -------- | ---------------------- |
| mpg | numeric | miles per gallon |
| cyl | numeric | number of cylinders |
| disp | numeric | displacement (cu.in.) |
| hp | numeric | |
| drat | numeric | |
```
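
The merge shown in the before/after tables can be pictured as a lookup of old descriptions by column name. The base R sketch below illustrates that idea under stated assumptions; it is not the package's implementation, and `refresh_descriptions()` is a hypothetical helper.

```r
# Previously documented columns and their hand-written descriptions.
old_doc <- data.frame(
  Column = c("mpg", "cyl", "disp", "fakecolumn"),
  Description = c("miles per gallon", "number of cylinders",
                  "displacement (cu.in.)", "non-existent column"),
  stringsAsFactors = FALSE
)

# Columns currently present in the dataset.
current_cols <- c("mpg", "cyl", "disp", "hp", "drat")

# Hypothetical helper (NOT part of docdata): keep descriptions for columns
# that still exist, drop stale ones, and add blanks for new columns.
refresh_descriptions <- function(old_doc, current_cols) {
  idx <- match(current_cols, old_doc$Column)
  desc <- old_doc$Description[idx]
  desc[is.na(desc)] <- ""
  data.frame(Column = current_cols, Description = desc,
             stringsAsFactors = FALSE)
}

refreshed <- refresh_descriptions(old_doc, current_cols)
print(refreshed)
```

Matching on column names means existing descriptions survive a refresh, while `fakecolumn` disappears because it is no longer in the dataset.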

### Step 5: Share your dataset + documentation