An open API service indexing awesome lists of open source software.

https://github.com/brandonleekramer/tidyorgs

A tidy package that detects and standardizes organizations in unstructured text data
https://github.com/brandonleekramer/tidyorgs

academic business geography government nonprofit organizations r standardization text-classification text-mining tidyverse

Last synced: 3 months ago
JSON representation

A tidy package that detects and standardizes organizations in unstructured text data

Awesome Lists containing this project

README

          

---
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# tidyorgs: A tidy package that standardizes text data for organizational and sector analysis

**Authors:** [Brandon Kramer](https://www.brandonleekramer.com/) with contributions from members of the [University of Virginia's Biocomplexity Institute](https://biocomplexity.virginia.edu/institute/divisions/social-and-decision-analytics), the [National Center for Science and Engineering Statistics](https://www.nsf.gov/statistics/), and the [2020](https://dspg-young-scholars-program.github.io/dspg20oss/team/?dspg) and [2021](https://dspgtools.shinyapps.io/dspg21oss/) [UVA Data Science for the Public Good Open Source Software Teams](https://biocomplexity.virginia.edu/institute/divisions/social-and-decision-analytics/dspg)

**License:** [MIT](https://opensource.org/licenses/MIT)

### Installation

You can install this package using the `devtools` package:

```{r, eval=FALSE, warning=FALSE, message=FALSE}
install.packages("devtools")
devtools::install_github("brandonleekramer/tidyorgs")
```

The `tidyorgs` package provides several functions that help standardize messy text data for organizational analysis. More specifically, the package's two core sets of functions `detect_{sector}()` and `email_to_orgs()` standardize organizations from across the academic, business, government and nonprofit sectors based on unstructured text and email domains. The package is intended to support linkage across multiple datasets, bibliometric analysis, and sector classification for social, economic, and policy analysis.

### Matching organizations with the `detect_orgs()` function

The `detect_{sector}()` functions detects patterns in messy text data and then standardizes them into organizations based on a curated dictionary. For example, messy bio information scraped from GitHub can be easily codified so that statistical analysis can be done on academic users.

#### `detect_academic()`

```{r, warning=FALSE, message=FALSE}
library(tidyverse)
library(tidyorgs)
data(github_users)

classified_academic <- github_users %>%
detect_academic(login, company, organization, email) %>%
filter(academic == 1) %>%
select(login, organization, company)

classified_academic
```

#### `detect_business()`

```{r}
classified_businesses <- github_users %>%
detect_business(login, company, organization, email) %>%
filter(business == 1) %>%
select(login, organization, company)
classified_businesses
```

#### `detect_government()`

```{r}
classified_government <- github_users %>%
detect_government(login, company, organization, email) %>%
filter(government == 1) %>%
select(login, organization, company)
classified_government
```

#### `detect_nonprofit()`

```{r}
classified_nonprofit <- github_users %>%
detect_nonprofit(login, company, organization, email) %>%
filter(nonprofit == 1) %>%
select(login, organization, company, email)
classified_nonprofit
```

### Matching users to organizations by emails using `email_to_orgs()`

For those that only have email information, the `email_to_orgs()` function matches users to organizations based on our curated domain list.

```{r, warning=FALSE, message=FALSE}
user_emails_to_orgs <- github_users %>%
email_to_orgs(login, email, country_name, "academic")

github_users %>%
left_join(user_emails_to_orgs, by = "login") %>%
drop_na(country_name) %>%
select(email, country_name)
```