An open API service indexing awesome lists of open source software.

https://github.com/gesistsa/minty

🌿MINimal TYpe guesser
https://github.com/gesistsa/minty

Last synced: about 1 year ago
JSON representation

🌿MINimal TYpe guesser

Awesome Lists containing this project

README

          

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# minty

[![R-CMD-check](https://github.com/gesistsa/minty/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/gesistsa/minty/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/minty)](https://CRAN.R-project.org/package=minty)

`minty` (**Min**imal **ty**pe guesser) is a package with the type inferencing and parsing tools (the so-called 1e parsing engine) extracted from `readr` (with permission, see this issue [tidyverse/readr#1517](https://github.com/tidyverse/readr/issues/1517)). Since July 2021, these tools are not used internally by `readr` for parsing text files. Now `vroom` is used by default, unless explicitly call the first edition parsing engine (see the explanation on [editions](https://github.com/tidyverse/readr?tab=readme-ov-file#editions)).

`readr`'s 1e type inferencing and parsing tools are used by various R packages, e.g. `readODS` and `surveytoolbox` for parsing in-memory objects, but those packages do not use the main functions (e.g. `readr::read_delim()`) of `readr`. As explained in the README of `readr`, those 1e code will be eventually removed from `readr`.

`minty` aims at providing a set of minimal, long-term, and compatible type inferencing and parsing tools for those packages. You might consider `minty` to be 1.5e parsing engine.

## Installation

You can install the development version of minty like so:

``` r
if (!require("remotes")){
install.packages("remotes")
}
remotes::install_github("gesistsa/minty")
```

## Example

A character-only data.frame

```{r}
text_only <- data.frame(maybe_age = c("17", "18", "019"),
maybe_male = c("true", "false", "true"),
maybe_name = c("AA", "BB", "CC"),
some_na = c("NA", "Not good", "Bad"),
dob = c("2019/07/21", "2019/08/31", "2019/10/01"))
str(text_only)
```

```{r}
## built-in function type.convert:
## except numeric, no type inferencing
str(type.convert(text_only, as.is = TRUE))
```

Inferencing the column types

```{r}
library(minty)
data <- type_convert(text_only)
data
```

```{r}
str(data)
```

### Type-based parsing tools

```{r}
parse_datetime("1979-10-14T10:11:12.12345")
```

```{r}
fr <- locale("fr")
parse_date("1 janv. 2010", "%d %b %Y", locale = fr)
```

```{r}
de <- locale("de", decimal_mark = ",")
parse_number("1.697,31", local = de)
```

```{r}
parse_number("$1,123,456.00")
```

```{r}
## This is perhaps Python
parse_logical(c("True", "False"))
```

### Type guesser

```{r}
parse_guess(c("True", "TRUE", "false", "F"))
```

```{r}
parse_guess(c("123.45", "1990", "7619.0"))
```

```{r}
res <- parse_guess(c("2019-07-21", "2019-08-31", "2019-10-01", "IDK"), na = "IDK")
res
```

```{r}
str(res)
```

## Differences: `readr` vs `minty`

Unlike `readr` and `vroom`, please note that `minty` is mainly for **non-interactive usage**. Therefore, `minty` emits fewer messages and warnings than `readr` and `vroom`.

```{r}
data <- minty::type_convert(text_only)
data
```

```{r}
data <- readr::type_convert(text_only)
data
```

`verbose` option is added if you like those messages, default to `FALSE`. To keep this package as minimal as possible, these optional messages are printed with base R (not `cli`).

```{r}
data <- minty::type_convert(text_only, verbose = TRUE)
```

At the moment, `minty` does not use [the `problems` mechanism](https://vroom.r-lib.org/reference/problems.html) by default.

```{r}
minty::parse_logical(c("true", "fake", "IDK"), na = "IDK")
```

```{r}
readr::parse_logical(c("true", "fake", "IDK"), na = "IDK")
```

Some features from `vroom` have been ported to `minty`, but not `readr`.

```{r}
## tidyverse/readr#1526
minty::type_convert(data.frame(a = c("NaN", "Inf", "-INF"))) |> str()
```

```{r}
readr::type_convert(data.frame(a = c("NaN", "Inf", "-INF"))) |> str()
```

`guess_max` is available for `parse_guess()` and `type_convert()`, default to `NA` (same as `readr`).

```{r}
minty::parse_guess(c("1", "2", "drei"))
```

```{r}
minty::parse_guess(c("1", "2", "drei"), guess_max = 2)
```

```{r}
readr::parse_guess(c("1", "2", "drei"))
```

For `parse_guess()` and `type_convert()`, `trim_ws` is considered before type guessing (the expected behavior of `vroom::vroom()` / `readr::read_delim()`).

```{r}
minty::parse_guess(c(" 1", " 2 ", " 3 "), trim_ws = TRUE)
```

```{r}
readr::parse_guess(c(" 1", " 2 ", " 3 "), trim_ws = TRUE)
```

```{r}
##tidyverse/readr#1536
minty::type_convert(data.frame(a = "1 ", b = " 2"), trim_ws = TRUE) |> str()
```

```{r}
readr::type_convert(data.frame(a = "1 ", b = " 2"), trim_ws = TRUE) |> str()
```

## Similar packages

For parsing ambiguous date(time)

* [timeless](https://github.com/schochastics/timeless)
* [anytime](https://github.com/eddelbuettel/anytime)

Guess column types of a text file

* [hdd](https://CRAN.R-project.org/package=hdd)

## Acknowledgements

Thanks to:

* The [Tidyverse Team](https://github.com/tidyverse) for allowing us to spin off the code from `readr`