https://github.com/dataobservatory-eu/dataset

Create interoperable and well described data frames in R
dataset metadata-management r rstats

Host: GitHub
URL: https://github.com/dataobservatory-eu/dataset
Owner: dataobservatory-eu
License: gpl-3.0
Created: 2022-06-23T12:06:27.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2025-04-22T07:00:31.000Z (about 1 month ago)
Last Synced: 2025-04-23T03:48:38.473Z (about 1 month ago)
Topics: dataset, metadata-management, r, rstats
Language: R
Homepage: http://dataset.dataobservatory.eu/
Size: 1.4 MB
Stars: 14
Watchers: 1
Forks: 7
Open Issues: 16
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md

Awesome Lists containing this project

jimsghstars - dataobservatory-eu/dataset - Create interoperable and well described data frames in R (R)

README

---
output: github_document
---

```{r setupdefinitions, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
rlang::check_installed("here")
```

# The dataset R Package 

[![rhub](https://github.com/dataobservatory-eu/dataset/actions/workflows/rhub.yaml/badge.svg)](https://github.com/dataobservatory-eu/dataset/actions/workflows/rhub.yaml)

[![lifecycle](https://lifecycle.r-lib.org/articles/figures/lifecycle-experimental.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)

[![Project Status: WIP](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)

[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/dataset)](https://cran.r-project.org/package=dataset)

[![CRAN_time_from_release](https://www.r-pkg.org/badges/ago/dataset)](https://cran.r-project.org/package=dataset)

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/553_status.svg)](https://github.com/ropensci/software-review/issues/553)

[![DOI](https://zenodo.org/badge/DOI/10.32614/CRAN.package.dataset.svg)](https://zenodo.org/record/6950435#.YukDAXZBzIU)

[![devel-version](https://img.shields.io/badge/devel%20version-0.3.4017-blue.svg)](https://github.com/dataobservatory-eu/dataset)

[![dataobservatory](https://img.shields.io/badge/ecosystem-dataobservatory.eu-3EA135.svg)](https://dataobservatory.eu/)

[![Codecov test coverage](https://codecov.io/gh/dataobservatory-eu/dataset/graph/badge.svg)](https://app.codecov.io/gh/dataobservatory-eu/dataset)

The aim of the _dataset_ package is to make tidy datasets easier to release, exchange and reuse. It organizes and formats data frame R objects into well-referenced, well-described, interoperable datasets that are ready for release and reuse.

You can install the latest CRAN release with `install.packages("dataset")`, and the latest development version of dataset with `remotes::install_github()`:

```{r installation, eval=FALSE}
install.packages("dataset")
remotes::install_github("dataobservatory-eu/dataset", build = FALSE)
```

The current version of the `dataset` package is in an early, experimental stage. You can follow the discussion of this package in [rOpenSci #553](https://github.com/ropensci/software-review/issues/553), which covers the original scope that included the data cube data model, and in [rOpenSci #681](https://github.com/ropensci/software-review/issues/681), which covers the new version that moves the SDMX data cube data model into a future downstream package. (See also the [Motivation](https://dataset.dataobservatory.eu/articles/Motivation.html) article.)

Interoperability and future (re)usability depend on the amount and quality of the metadata that was generated, recorded, and released together with the data. The `dataset` package aims to collect such metadata and record it in the least intrusive way possible.

## Semantically richer data frames

Let's take a simple data.frame from the datasets package. The "Growth of Orange Trees" dataset contains 35 rows and 3 columns that record the growth of orange trees.

```{r initialise}
library(datasets)
head(datasets::Orange)
```

Following the tidy data principle, we create an unambiguous row identifier. Then we go three steps further:

1. We add more semantic information about the meaning of the variables, for example, to avoid joining numeric variables of the same type (numeric or integer) but with different units of measure (mm vs. cm).

```{r richerorange}
library(dataset)
data("orange_df")
orange_df
```

```{r unit}
var_unit(orange_df$circumference)
```

The `dataset_df` behaves as expected from a data.frame-like object. 

```{r summary}
summary(orange_df)
```

2. We add more descriptive metadata to make the dataset easier to find and reuse:

```{r bibentry}
print(get_bibentry(orange_df), "BibTex")
```

3. We add provenance metadata to increase the trust and usability of the dataset. This feature is highly experimental at this point and will be further developed considering usability and new use cases.

```{r provenance}
provenance(orange_df)
```

1. **Increase FAIR use of your datasets**: Offer a way to better utilise the `utils::bibentry` bibliographic entry objects to provide more comprehensive and standardised descriptive metadata using the [DCTERMS](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/) and [DataCite](https://datacite-metadata-schema.readthedocs.io/en/4.6/) standards. This will lead to a higher level of findability and accessibility, and a better use of the rOpenSci package [RefManageR](https://docs.ropensci.org/RefManageR/). For more information, see the [Bibentry for FAIR datasets](https://dataset.dataobservatory.eu/articles/bibentry.html) vignette.

2. **Interoperability outside R**: Extending the `haven_labelled` class of the `tidyverse` for consistently labelled categorical variables with linked (standard) definitions and units of measure in our [defined](https://dataset.dataobservatory.eu/articles/defined.html) class. This makes it possible to share metadata not only about the dataset as a whole, but also about its key components (rows and columns), including precise definitions and units of measure. This results in a higher level of interoperability and reusability, within and outside of the R ecosystem.

3. **Tidy data tidier, richer**: Offering a new data frame format, `dataset_df`, that extends tibbles with semantically rich metadata, ready to be shared on open data exchange platforms and in data repositories. This S3 class is aimed at developers, and we are working on several packages that provide interoperability with SDMX statistical data exchange platforms, Wikidata, or the EU Open Data portal. Read more in the [Create Datasets that are Easy to Share Exchange and Extend](https://dataset.dataobservatory.eu/articles/dataset_df.html) vignette.

4. **Provenance metadata**: Adding provenance metadata makes your dataset easier to reuse by making its history known to future users. We have no vignette on this topic, but there is an example at the bottom of this `README`.

5. **Releasing and exchanging datasets**: The [From R to RDF](https://dataset.dataobservatory.eu/articles/rdf.html) vignette shows how to leverage the capabilities of the _dataset_ package with [rdflib](https://docs.ropensci.org/rdflib/index.html), an R-user-friendly rOpenSci wrapper around the _redland_ R package (bindings to the Redland C library) for performing common tasks on RDF data, such as parsing and converting between formats including rdfxml, turtle, nquads and ntriples, creating RDF graphs, and performing SPARQL queries.

Putting it all together: the [Motivation](https://dataset.dataobservatory.eu/articles/Motivation.html) article explains in a long case study why `tidyverse` and the *tidy data principle* are no longer sufficient for a high level of interoperability and reusability.
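As a minimal sketch of the starting point for the first aim above, the chunk below builds a plain `utils::bibentry` object in base R and renders it as BibTeX. It does not use the _dataset_ package; the key, author and year are placeholders chosen for illustration, not the metadata recorded in `orange_df`.

```{r bibentrysketch}
# Placeholder metadata for illustration only; not the metadata of orange_df.
orange_entry <- utils::bibentry(
  bibtype = "Misc",
  key = "orange_example",
  title = "Growth of Orange Trees",
  author = utils::person("Jane", "Doe"),
  year = "2024"
)

# Render the entry as BibTeX, one character string per line.
utils::toBibtex(orange_entry)
```

A base `bibentry` only carries the classic BibTeX fields. The _dataset_ package aims to build on such entries with DCTERMS and DataCite metadata; see the vignette linked above for the actual interface.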

## Semantically richer data frame columns

It is important to see that we do not only increase the semantics of the dataset as a whole, but also the semantics of each variable. R users often have a problem with the reusability of their data frames because, by default, a variable is described only by a short, programmatically usable name.

When working with datasets that receive their components from different linked open data sources, it is particularly important to have a more precise semantic definition and description of each variable.

```{r defined}
gdp_1 <- defined(
  c(3897, 7365),
  label = "Gross Domestic Product",
  unit = "million dollars",
  definition = "http://data.europa.eu/83i/aa/GDP"
)

# Summarise this semantically better defined vector:
summary(gdp_1)

# See its attributes under the hood:
attributes(gdp_1)
```
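The problem that a recorded unit of measure solves can be illustrated without the package. The following base-R sketch is not how `defined()` is implemented: `with_unit()` and `combine_checked()` are hypothetical helpers, and the second vector is made-up data. It only shows how a unit stored as an attribute lets code refuse to mix values measured on different scales.

```{r unitguardsketch}
# Hypothetical helper: attach a unit of measure to a plain numeric vector.
with_unit <- function(x, unit) {
  stopifnot(is.numeric(x), is.character(unit), length(unit) == 1)
  attr(x, "unit") <- unit
  x
}

# Hypothetical helper: concatenate two vectors only if their units agree.
combine_checked <- function(x, y) {
  unit_x <- attr(x, "unit")
  unit_y <- attr(y, "unit")
  if (!identical(unit_x, unit_y)) {
    stop("Unit mismatch: '", unit_x, "' vs '", unit_y, "'")
  }
  with_unit(c(as.numeric(x), as.numeric(y)), unit_x)
}

gdp_a <- with_unit(c(3897, 7365), "million dollars")
gdp_b <- with_unit(c(2.1, 4.3), "billion dollars") # made-up values

# Same unit: the vectors are combined and the unit is kept.
attr(combine_checked(gdp_a, gdp_a), "unit")

# Different units: the combination is refused instead of silently mixing scales.
tryCatch(combine_checked(gdp_a, gdp_b), error = function(e) conditionMessage(e))
```

Without the attribute, both vectors would be plain doubles and `c(gdp_a, gdp_b)` would succeed without any warning.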

## Dataset Provenance

The constructor of the `dataset_df` objects also records the most important processes that created or modified the dataset. This experimental feature has not been fully developed in the current _dataset_ version. The aim is to provide a standard way of describing the processes that help to understand what happened with your data using the W3C [PROV-O](https://www.w3.org/TR/prov-o/) provenance ontology and the [RDF 1.1 N-Triples](https://www.w3.org/TR/n-triples/) W3C standard for describing these processes in a flat file.

```{r provenanceagain}
provenance(orange_df)
```
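To give an idea of the target format, the following base-R sketch writes three PROV-O statements as N-Triples lines. It is an illustration under assumptions, not the output of `provenance()`: `ntriple()` is a hypothetical helper, and the `example.com` identifiers are placeholders.

```{r ntriplessketch}
# Hypothetical helper: format one RDF statement as an N-Triples line.
# All three terms are IRIs here; literals would need quoting instead of <>.
ntriple <- function(subject, predicate, object) {
  sprintf("<%s> <%s> <%s> .", subject, predicate, object)
}

prov <- "http://www.w3.org/ns/prov#"
rdf_type <- "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Placeholder identifiers, not produced by the dataset package.
dataset_iri <- "https://example.com/dataset/orange"
activity_iri <- "https://example.com/activity/create-orange"

statements <- c(
  ntriple(dataset_iri, rdf_type, paste0(prov, "Entity")),
  ntriple(activity_iri, rdf_type, paste0(prov, "Activity")),
  ntriple(dataset_iri, paste0(prov, "wasGeneratedBy"), activity_iri)
)

# One statement per line, as in an N-Triples flat file.
writeLines(statements)
```

Each line is a complete subject, predicate, object statement, which is what makes the format easy to store and exchange as a flat file.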

## Code of Conduct

Please note that the `dataset` package is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.

Furthermore, the [rOpenSci Community Contributing Guide](https://contributing.ropensci.org/), *a guide to help people find ways to contribute to rOpenSci*, also applies, because `dataset` is under software review for potential inclusion in [rOpenSci](https://github.com/ropensci/software-review/issues/553).
