An open API service indexing awesome lists of open source software.

https://github.com/stefanieschneider/unstruwwel

Detect and Parse Historic Dates in R
https://github.com/stefanieschneider/unstruwwel

dates nlp parser r

Last synced: 5 months ago
JSON representation

Detect and Parse Historic Dates in R

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)

options(width = "100")

require(unstruwwel)
require(magrittr)
```

# unstruwwel

[![Lifecycle badge](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html#maturing)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4451796.svg)](https://doi.org/10.5281/zenodo.4451796)
[![CRAN badge](http://www.r-pkg.org/badges/version/unstruwwel)](https://cran.r-project.org/package=unstruwwel)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/stefanieschneider/unstruwwel?branch=master&svg=true)](https://ci.appveyor.com/project/stefanieschneider/unstruwwel)
[![Coverage status](https://codecov.io/github/stefanieschneider/unstruwwel/coverage.svg?branch=master)](https://app.codecov.io/github/stefanieschneider/unstruwwel?branch=master)

## Overview

This R package provides means to detect and parse historic dates, e.g., to ISO 8601:2-2019. It automatically converts language-specific verbal information, e.g., “circa 1st half of the 19th century,” into its standardized numerical counterparts, e.g., “1801-01-01\~/1850-12-31\~.” The package follows the recommendations of the MIDAS (Marburger Informations-, Dokumentations- und Administrations-System), see, e.g., https://doi.org/10.11588/artdok.00003770. It internally uses [lubridate](https://github.com/tidyverse/lubridate). The name of the package is inspired by Heinrich Hoffmann’s rhymed story “[Struwwelpeter](http://www.gutenberg.org/files/12116/12116-h/12116-h.htm#Shock-headed_Peter)”, which goes as follows:

> Just look at him! there he stands,
with his nasty hair and hands.
See! his nails are never cut;
they are grimed as black as soot;
and the sloven, I declare,
never once has combed his hair;
anything to me is sweeter
than to see Shock-headed Peter.

For the German-language original text, see the online digital library [Wikisource](https://de.wikisource.org/wiki/Der_Struwwelpeter/Struwwelpeter).

## Installation

You can install the released version of unstruwwel from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("unstruwwel")
```

To install the development version from [GitHub](https://github.com/stefanieschneider/unstruwwel) use:

``` r
# install.packages("devtools")
devtools::install_github("stefanieschneider/unstruwwel")
```

## Usage

The unstruwwel package contains only one function, `unstruwwel()`, that does all the *magic* language-specific standardization. `unstruwwel()` returns a named list, where each element is the result of applying the function to the corresponding element in the input vector.

### English-language examples
```{r example_en, message=FALSE, warning=FALSE}
dates <- c(
"5th century b.c.", "unknown", "late 16th century", "mid-12th century",
"mid-1880s", "June 1963", "August 11, 1958", "ca. 1920", "before 1856"
)

# returns valid ISO 8601:2-2019 dates
unlist(unstruwwel(dates, "en", scheme = "iso-format"), use.names = FALSE)

# returns a numerical interval of length 2
unstruwwel(dates, language = "en", scheme = "time-span") %>%
tibble::as_tibble() %>%
dplyr::mutate(id = dplyr::row_number()) %>%
tidyr::gather(key = id) %>%
tidyr::unnest_wider(value, names_sep = "_") %>%
dplyr::rename_all(dplyr::funs(c("text", "start", "end")))
```

### German-language examples
```{r example_de, message=FALSE, warning=FALSE}
dates <- c(
"letztes Drittel 15. und 1. Hälfte 16. Jahrhundert", "undatiert", "1460?",
"wohl nach 1923", "spätestens 1750er Jahre", "1897 (Guss vmtl. vor 1906)"
)

# returns valid ISO 8601:2-2019 dates
unlist(unstruwwel(dates, "de", scheme = "iso-format"), use.names = FALSE)

# returns a numerical interval of length 2
unstruwwel(dates, language = "de", scheme = "time-span") %>%
tibble::as_tibble() %>%
dplyr::mutate(id = dplyr::row_number()) %>%
tidyr::gather(key = id) %>%
tidyr::unnest_wider(value, names_sep = "_") %>%
dplyr::rename_all(dplyr::funs(c("text", "start", "end")))
```

## Contributing

Please report issues, feature requests, and questions to the [GitHub issue tracker](https://github.com/stefanieschneider/unstruwwel/issues). We have a [Contributor Code of Conduct](https://github.com/stefanieschneider/unstruwwel/blob/master/CODE_OF_CONDUCT.md). By participating in unstruwwel you agree to abide by its terms.