https://github.com/dataobservatory-eu/dataset
Create interoperable and well described data frames in R
- Host: GitHub
- URL: https://github.com/dataobservatory-eu/dataset
- Owner: dataobservatory-eu
- License: gpl-3.0
- Created: 2022-06-23T12:06:27.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2024-10-18T22:11:12.000Z (6 months ago)
- Last Synced: 2024-12-08T04:42:48.272Z (4 months ago)
- Topics: dataset, metadata-management, r, rstats
- Language: R
- Homepage: http://dataset.dataobservatory.eu/
- Size: 1000 KB
- Stars: 12
- Watchers: 1
- Forks: 3
- Open Issues: 6
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md
- Codemeta: codemeta.json
Awesome Lists containing this project
- jimsghstars - dataobservatory-eu/dataset - Create interoperable and well described data frames in R (R)
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
rlang::check_installed("here")
```

The `dataset` package extends the R statistical environment so that the most important R object containing a dataset, i.e. a [data.frame](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame) or an inherited [tibble](https://tibble.tidyverse.org/reference/tibble.html), [tsibble](https://tsibble.tidyverts.org/) or [data.table](https://rdatatable.gitlab.io/data.table/), carries the metadata needed to reuse and validate its contents. We aim to offer a novel solution for individuals or small groups of data scientists working in business, academic or policy research functions who cannot count on the support of librarians, knowledge engineers, and extensive documentation processes.

The dataset package extends the concept of tidy data and adds further, standardized semantic information to the user's dataset to increase the (re-)use value of the data object.
- [x] More descriptive information about the dataset as a creation, its authors, contributors, reuse rights and other metadata to make it easier to find and use.
- [x] More standardized and linked metadata, such as standard variable definitions and code lists, enable the data owner to gather far more information from third parties, and enable third parties to understand and use the data correctly.
- [x] More information about the data provenance makes the quality assessment easier and reduces the need for time-consuming and unnecessary re-processing steps.
- [x] More structural information about the data makes it easier to reuse and join with new information, and less prone to logical errors.

Further development plans for peer review are being added until 5 November 2024 here: [New Requirement](https://dataset.dataobservatory.eu/articles/new-requirements.html) setting.

The current version of the `dataset` package is in an early, experimental stage. You can follow the discussion of this package on [rOpenSci](https://github.com/ropensci/software-review/issues/553).

```{r initialise}
library(dataset)
iris_ds <- dataset(
  x = iris,
  title = "Iris Dataset",
  author = person("Edgar", "Anderson", role = "aut"),
  publisher = "American Iris Society",
  source = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
  date = 1935,
  language = "en",
  description = "This famous (Fisher's or Anderson's) iris data set."
)
```

A `title` and an `author` are mandatory for a dataset; if the `date` is not specified, the current date is added.

Because a newly created dataset is not yet published, the `identifier` receives the default `:tba` value, the `version` defaults to 0.1.0, and the `publisher` field defaults to `:unas` (unassigned).

The dataset behaves as expected, with all data.frame methods applicable. If the dataset was originally a tibble or data.table object, it retains all methods of these S3 classes, because the dataset class only adds further methods that work with the attributes of the original object.
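
The idea of keeping metadata in attributes can be illustrated with base R alone. The following is a minimal sketch and not the package's internal implementation; the attribute names `title` and `rights` are chosen here only for illustration.

```r
# Illustration only: plain base R, no dataset package involved.
df <- head(iris)

# Attach descriptive metadata as attributes of the data.frame.
attr(df, "title") <- "Iris Dataset"
attr(df, "rights") <- ":unas"

# The object is still an ordinary data.frame, so the usual methods apply.
class(df)
summary(df)
nrow(df)

# The metadata can be read back from the attributes.
attr(df, "title")
```

Because the class vector of `df` is untouched, `summary()`, `nrow()` and other data.frame methods dispatch exactly as they did before the attributes were added.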
```{r summary}
summary(iris_ds)
```

A brief description of the extended metadata attributes:

```{r describe}
describe(iris_ds)
```

```{r individualattributes}
paste0("Publisher:", publisher(iris_ds))
paste0("Rights:", rights(iris_ds))
```

The descriptive metadata are added to a `utils::bibentry` object, which has many printing options (see `?bibentry`).

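
For readers unfamiliar with `utils::bibentry`, the sketch below builds an entry by hand in base R and prints it in two styles. It does not use the `dataset` package; the `"Misc"` entry type and the selection of fields are assumptions made for illustration, not a description of what `dataset_bibentry()` produces.

```r
# Illustration only: a hand-made bibentry, independent of the dataset package.
entry <- utils::bibentry(
  bibtype = "Misc",
  title = "Iris Dataset",
  author = utils::person("Edgar", "Anderson", role = "aut"),
  year = "1935",
  publisher = "American Iris Society"
)

# The same object can be rendered in several styles.
print(entry, style = "text")
print(entry, style = "Bibtex")
```
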
```{r bibentry}
mybibentry <- dataset_bibentry(iris_ds)
print(mybibentry, "text")
print(mybibentry, "Bibtex")
```

```{r prevent-overwrite}
rights(iris_ds) <- "CC0"
rights(iris_ds)
rights(iris_ds, overwrite = FALSE) <- "GNU-2"
```

Some important metadata is protected from accidental overwriting (except for the default `:unas` unassigned and `:tba` to-be-announced values). To replace a protected value deliberately, set `overwrite = TRUE`:

```{r overwrite}
rights(iris_ds, overwrite = TRUE) <- "GNU-2"
```
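
One way such overwrite protection can be written in R is a replacement function with an extra `overwrite` argument. The sketch below is a hypothetical helper named `rights_sketch<-`, not the package's own code. It assumes the value is stored in a `rights` attribute and that a refused overwrite raises a warning, and the package may do either differently.

```r
# Hypothetical helper, not part of the dataset package.
# Refuses to replace an existing value unless overwrite = TRUE;
# the placeholders ":unas" and ":tba" can always be replaced.
`rights_sketch<-` <- function(x, overwrite = FALSE, value) {
  current <- attr(x, "rights")
  is_placeholder <- is.null(current) || current %in% c(":unas", ":tba")
  if (is_placeholder || overwrite) {
    attr(x, "rights") <- value
  } else {
    warning(
      "rights is already set to '", current,
      "'; use overwrite = TRUE to replace it."
    )
  }
  x
}

df <- data.frame(a = 1:3)
rights_sketch(df) <- "CC0"                       # no value yet, so it is set
rights_sketch(df, overwrite = TRUE) <- "GNU-2"   # replaced deliberately
attr(df, "rights")
```

Calling `rights_sketch(df) <- "CC0"` again at this point would leave the attribute unchanged and emit the warning, because `overwrite` defaults to `FALSE`.
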
## Code of Conduct
Please note that the `dataset` package is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.

Furthermore, the [rOpenSci Community Contributing Guide](https://contributing.ropensci.org/), *A guide to help people find ways to contribute to rOpenSci*, also applies, because `dataset` is under software review for potential inclusion in [rOpenSci](https://github.com/ropensci/software-review/issues/553).