https://github.com/stefanieschneider/unstruwwel
Detect and Parse Historic Dates in R
- Host: GitHub
- URL: https://github.com/stefanieschneider/unstruwwel
- Owner: stefanieschneider
- License: gpl-3.0
- Created: 2019-06-27T16:51:24.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2024-08-28T14:04:05.000Z (8 months ago)
- Last Synced: 2024-10-30T09:24:18.147Z (6 months ago)
- Topics: dates, nlp, parser, r
- Language: R
- Homepage:
- Size: 788 KB
- Stars: 7
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

options(width = 100)
library(unstruwwel)
library(magrittr)
```

# unstruwwel
[Lifecycle: maturing](https://lifecycle.r-lib.org/articles/stages.html#maturing)
[DOI: 10.5281/zenodo.4451796](https://doi.org/10.5281/zenodo.4451796)
[CRAN](https://cran.r-project.org/package=unstruwwel)
[AppVeyor build](https://ci.appveyor.com/project/stefanieschneider/unstruwwel)
[Codecov](https://app.codecov.io/github/stefanieschneider/unstruwwel?branch=master)

## Overview
This R package detects and parses historic dates, e.g., to ISO 8601-2:2019. It automatically converts language-specific verbal information, e.g., “circa 1st half of the 19th century,” into its standardized numerical counterpart, e.g., “1801-01-01\~/1850-12-31\~.” The package follows the recommendations of MIDAS (Marburger Informations-, Dokumentations- und Administrations-System); see, e.g., https://doi.org/10.11588/artdok.00003770. Internally, it uses [lubridate](https://github.com/tidyverse/lubridate). The name of the package is inspired by Heinrich Hoffmann’s rhymed story “[Struwwelpeter](http://www.gutenberg.org/files/12116/12116-h/12116-h.htm#Shock-headed_Peter)”, which goes as follows:
> Just look at him! there he stands,
with his nasty hair and hands.
See! his nails are never cut;
they are grimed as black as soot;
and the sloven, I declare,
never once has combed his hair;
anything to me is sweeter
than to see Shock-headed Peter.

For the German-language original text, see the online digital library [Wikisource](https://de.wikisource.org/wiki/Der_Struwwelpeter/Struwwelpeter).
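To make the standardized notation from the opening paragraph concrete, here is a minimal base R sketch that takes the example string apart. It is an illustration only and not part of the package: the variable names are made up for this example, and it assumes the simple `start/end` form shown above, where a trailing `~` is the ISO 8601-2 qualifier for an approximate value.

```r
# Example string from the overview: "circa 1st half of the 19th century"
iso <- "1801-01-01~/1850-12-31~"

# Split the interval into its start and end parts at the solidus
parts <- strsplit(iso, "/", fixed = TRUE)[[1]]

# A trailing "~" marks a part as approximate
approximate <- endsWith(parts, "~")

# Drop the qualifier and convert the remainder to Date objects
bounds <- as.Date(sub("~$", "", parts))

data.frame(bound = c("start", "end"), date = bounds, approximate = approximate)
```

Both bounds carry the qualifier here because the word “circa” applies to the whole interval.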
## Installation
You can install the released version of unstruwwel from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("unstruwwel")
```

To install the development version from [GitHub](https://github.com/stefanieschneider/unstruwwel), use:
``` r
# install.packages("devtools")
devtools::install_github("stefanieschneider/unstruwwel")
```

## Usage
The unstruwwel package contains only one function, `unstruwwel()`, that does all the *magic* language-specific standardization. `unstruwwel()` returns a named list, where each element is the result of applying the function to the corresponding element in the input vector.
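As a minimal sketch of that return shape, the following inspects the result for two of the strings used in this README. It assumes the package is installed; the `language` and `scheme` arguments are the ones used elsewhere in this README, and no output is shown because it may depend on the installed version.

```r
library(unstruwwel)

dates <- c("ca. 1920", "before 1856")
result <- unstruwwel(dates, language = "en", scheme = "time-span")

# one list element per input string
length(result) == length(dates)

# names of the list elements
names(result)

# compact view of what each element contains
str(result)
```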
### English-language examples
```{r example_en, message=FALSE, warning=FALSE}
dates <- c(
  "5th century b.c.", "unknown", "late 16th century", "mid-12th century",
  "mid-1880s", "June 1963", "August 11, 1958", "ca. 1920", "before 1856"
)

# returns valid ISO 8601-2:2019 dates
unlist(unstruwwel(dates, "en", scheme = "iso-format"), use.names = FALSE)

# returns a numerical interval of length 2
unstruwwel(dates, language = "en", scheme = "time-span") %>%
  tibble::as_tibble() %>%
  dplyr::mutate(id = dplyr::row_number()) %>%
  tidyr::gather(key = id) %>%
  tidyr::unnest_wider(value, names_sep = "_") %>%
  # rename_with() replaces the deprecated rename_all(funs(...))
  dplyr::rename_with(~ c("text", "start", "end"))
```

### German-language examples
```{r example_de, message=FALSE, warning=FALSE}
dates <- c(
  "letztes Drittel 15. und 1. Hälfte 16. Jahrhundert", "undatiert", "1460?",
  "wohl nach 1923", "spätestens 1750er Jahre", "1897 (Guss vmtl. vor 1906)"
)

# returns valid ISO 8601-2:2019 dates
unlist(unstruwwel(dates, "de", scheme = "iso-format"), use.names = FALSE)

# returns a numerical interval of length 2
unstruwwel(dates, language = "de", scheme = "time-span") %>%
  tibble::as_tibble() %>%
  dplyr::mutate(id = dplyr::row_number()) %>%
  tidyr::gather(key = id) %>%
  tidyr::unnest_wider(value, names_sep = "_") %>%
  # rename_with() replaces the deprecated rename_all(funs(...))
  dplyr::rename_with(~ c("text", "start", "end"))
```

## Contributing
Please report issues, feature requests, and questions to the [GitHub issue tracker](https://github.com/stefanieschneider/unstruwwel/issues). We have a [Contributor Code of Conduct](https://github.com/stefanieschneider/unstruwwel/blob/master/CODE_OF_CONDUCT.md). By participating in unstruwwel you agree to abide by its terms.