Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/ColinFay/tidystringdist

String distance calculation the tidy way.
https://github.com/ColinFay/tidystringdist

Last synced: 3 months ago
JSON representation

String distance calculation the tidy way.

Awesome Lists containing this project

README

        

---
output:
md_document:
variant: markdown_github
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```

[![Coverage Status](https://img.shields.io/codecov/c/github/ColinFay/tidystringdist/master.svg)](https://codecov.io/github/ColinFay/tidystringdist?branch=master)

[![Travis-CI Build Status](https://travis-ci.org/ColinFay/tidystringdist.svg?branch=master)](https://travis-ci.org/ColinFay/tidystringdist)

# tidystringdist

Compute string distance the tidy way. Built on top of the `stringdist` package.

## Install tidystringdist

You'll get the dev version on:

```{r eval = FALSE}
devtools::install_github("ColinFay/tidystringdist")
```

Stable version is available with :

```{r eval = FALSE}
install.packages("tidystringdist")
```

## tidystringdist basic workflow

## tidycomb

First, you need to create a tibble with the combinations of words you want to compare. You can do this with the `tidy_comb` and `tidy_comb_all` functions. The first takes a base word and combines it with each elements of a list or a column of a data.frame, the 2nd combines all the possible couples from a list or a column.

If you already have a data.frame with two columns containing the strings to compare, you can skip this part.

```{r}
library(tidystringdist)

tidy_comb_all(LETTERS[1:3])
```

```{r}
tidy_comb_all(iris, Species)
```

```{r}
tidy_comb("Paris", state.name[1:3])
```

### tidy_string_dist

Once you've got this data.frame, you can use `tidy_string_dist()` to compute string distance. This function takes a data.frame, the two columns containing the strings, and one or more stringdist methods.

Note that if you've used the `tidy_comb` function to create your data.frame, you won't need to set the column names.

```{r example, warnings = FALSE, error=FALSE, message=FALSE}
library(dplyr)
data(starwars)
tidy_comb_sw <- tidy_comb_all(starwars, name)
tidy_stringdist(tidy_comb_sw)
```

Default call compute all the methods. You can use specific method with the `method` argument:

```{r}
tidy_stringdist(tidy_comb_sw, method = c("osa","jw"))
```

## Tidyverse workflow

The goal is to provide a convenient interface to work with other tools from the tidyverse.

```{r}
tidy_stringdist(tidy_comb_sw, method= "osa") %>%
filter(osa > 20) %>%
arrange(desc(osa))
```

```{r}
starwars %>%
filter(species == "Droid") %>%
tidy_comb_all(name) %>%
tidy_stringdist() %>%
summarise_if(is.numeric, mean)
```

### Contact

Questions and feedbacks [welcome](mailto:[email protected])!