https://github.com/ianmoran11/locatr

An easier way to tidy pivoted tables.

[![Travis build status](https://travis-ci.org/ianmoran11/locatr.svg?branch=master)](https://travis-ci.org/ianmoran11/locatr)
[![Codecov test coverage](https://codecov.io/gh/ianmoran11/locatr/branch/master/graph/badge.svg)](https://codecov.io/gh/ianmoran11/locatr?branch=master)

Overview
--------

The `locatr` package makes tidying data from spreadsheets easier. It helps you identify and classify table cells, and then visually inspect them.

Installation
------------

The `locatr` package is not available on CRAN. It can be installed from GitHub with the following script:

``` r
# install.packages("devtools")
devtools::install_github("ianmoran11/locatr")
```

Usage
-----

The locate functions work much like `unpivotr::behead`. The key difference is that, rather than progressively removing headers, the locate functions annotate the tidyxl data frame with `.direction`, `.header_group` and `.value` columns, leaving the reshaping to a final function call.

### Minimal example : `locate`

Here's a minimal example involving a table with two row headers and two column headers.

The first step is to locate the data cells with the `locate_data` function. Calling `locate_data` and providing an expression that filters for data cells sends these cells to an attribute named `data_cells`.

``` r
locatr_example("worked-examples.xlsx") %>%
  xlsx_cells_fmt(sheets = "pivot-example") %>%
  locate_data(data_type == "numeric") %>%
  attr("data_cells")
#> # A tibble: 16 x 24
#> .value .direction .header_label address row col data_type character
#>
#> 1 1 D4 4 4 numeric
#> 2 2 E4 4 5 numeric
#> 3 3 F4 4 6 numeric
#> 4 0 G4 4 7 numeric
#> 5 3 D5 5 4 numeric
#> 6 4 E5 5 5 numeric
#> 7 5 F5 5 6 numeric
#> 8 1 G5 5 7 numeric
#> 9 5 D6 6 4 numeric
#> 10 6 E6 6 5 numeric
#> 11 9 F6 6 6 numeric
#> 12 2 G6 6 7 numeric
#> 13 7 D7 7 4 numeric
#> 14 8 E7 7 5 numeric
#> 15 12 F7 7 6 numeric
#> 16 3 G7 7 7 numeric
#> # … with 16 more variables: numeric , date , logical ,
#> # error , is_blank , local_format_id , sheet ,
#> # character_formatted , formula , is_array ,
#> # formula_ref , formula_group , comment , height ,
#> # width , style_format
```
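To make the attribute mechanism concrete, here is a minimal base R sketch of the general idea: rows matching a filter are copied into an attribute called `data_cells` and read back with `attr`. The `cells` data frame and its values are invented for illustration, and this is not locatr's implementation.

``` r
# Toy stand-in for a tidyxl-style cell table (values are made up).
cells <- data.frame(
  address   = c("B3", "C3", "B4", "C4"),
  row       = c(3L, 3L, 4L, 4L),
  col       = c(2L, 3L, 2L, 3L),
  data_type = c("character", "character", "character", "numeric"),
  stringsAsFactors = FALSE
)

# Copy the rows that satisfy the filter into an attribute on the data frame.
attr(cells, "data_cells") <- cells[cells$data_type == "numeric", ]

# The attribute can then be read back with attr(), as in the pipeline above.
data_cells <- attr(cells, "data_cells")
data_cells$address
```

Storing the selection as an attribute leaves the original rows in place, so later steps can still see every cell.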

`plot_cells` produces a plot that indicates which cells are now labelled as data.

``` r
locatr_example("worked-examples.xlsx") %>%
  xlsx_cells_fmt(sheets = "pivot-example") %>%
  locate_data(data_type == "numeric") %>%
  plot_cells()
```

![](README/README-unnamed-chunk-6-1.png)

Once the data cells are identified, we can add header information to the tidyxl data frame (including the `.direction`, `.header_group` and `.value` columns) using the `locate` function. This function requires a direction and a variable name. Again, `plot_cells` can be called to check that cells have been labelled correctly.

Once all headers have directions and names, `migrate` reshapes the tidyxl data frame into a tidy structure.

The gif below illustrates how direction information is progressively added to the data frame.

And below is the code used in the gif.

``` r
locatr::locatr_example("worked-examples.xlsx") %>%
  xlsx_cells_fmt(sheets = "pivot-example") %>%
  locate_data(data_type == "numeric") %>%
  locate(direction = "WNW", name = subject_type) %>%
  locate(direction = "W", name = subject) %>%
  locate(direction = "NNW", name = gender) %>%
  locate(direction = "N", name = name) %>%
  migrate()
#> # A tibble: 16 x 7
#> row col .value gender name subject_type subject
#>
#> 1 4 4 1 Year 1 Matilda Humanities Classics
#> 2 4 5 2 Year 1 Paul Humanities Classics
#> 3 5 4 3 Year 1 Matilda Humanities History
#> 4 5 5 4 Year 1 Paul Humanities History
#> 5 6 4 5 Year 1 Matilda Performance Music
#> 6 6 5 6 Year 1 Paul Performance Music
#> 7 7 4 7 Year 1 Matilda Performance Drama
#> 8 7 5 8 Year 1 Paul Performance Drama
#> 9 4 6 3 Year 2 Matilda Humanities Classics
#> 10 4 7 0 Year 2 Paul Humanities Classics
#> 11 5 6 5 Year 2 Matilda Humanities History
#> 12 5 7 1 Year 2 Paul Humanities History
#> 13 6 6 9 Year 2 Matilda Performance Music
#> 14 6 7 2 Year 2 Paul Performance Music
#> 15 7 6 12 Year 2 Matilda Performance Drama
#> 16 7 7 3 Year 2 Paul Performance Drama
```

### Conditional headers : `locate_if`

Sometimes not all headers in the same column or row belong to the same group. For example, in the table below, the row headers in column B represent a mix of subject type and subject name.

To deal with this, we create a variable that represents the indenting of cells, and then use `locate_if` to selectively associate cells with directions and header groups.

``` r
locatr_example("worked-examples.xlsx") %>%
  xlsx_cells_fmt(sheets = "pivot-indent") %>%
  append_fmt(fmt_alignment_indent) %>%
  locate_data(data_type == "numeric") %>%
  locate_if(fmt_alignment_indent == 0, direction = "WNW", name = subject_type) %>%
  locate_if(fmt_alignment_indent == 1, direction = "W", name = subject) %>%
  locate(direction = "N", name = student) %>%
  migrate()
#> # A tibble: 8 x 6
#> row col .value student subject_type subject
#>
#> 1 4 3 1 Matilda Humanities Classics
#> 2 4 4 2 Paul Humanities Classics
#> 3 5 3 3 Matilda Humanities History
#> 4 5 4 4 Paul Humanities History
#> 5 7 3 5 Matilda Performance Music
#> 6 7 4 6 Paul Performance Music
#> 7 8 3 7 Matilda Performance Drama
#> 8 8 4 8 Paul Performance Drama
```
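The logic of splitting one header column by indent level can be emulated on a toy data frame with plain dplyr and tidyr. This is only a sketch of the idea, not how `locate_if` is implemented; the `headers` table and its `indent` column are hypothetical stand-ins for column B and `fmt_alignment_indent`.

``` r
library(dplyr)
library(tidyr)

# Hypothetical row headers from a single column: indent 0 marks a subject
# type, indent 1 marks a subject that belongs to the type above it.
headers <- tibble(
  row    = 3:8,
  label  = c("Humanities", "Classics", "History",
             "Performance", "Music", "Drama"),
  indent = c(0, 1, 1, 0, 1, 1)
)

row_labels <- headers %>%
  mutate(
    subject_type = if_else(indent == 0, label, NA_character_),
    subject      = if_else(indent == 1, label, NA_character_)
  ) %>%
  fill(subject_type) %>%      # carry each subject type down to the rows below
  filter(!is.na(subject)) %>% # keep only the rows that hold a subject
  select(row, subject_type, subject)

row_labels
```

Each remaining row pairs a subject with the nearest subject type above it, which is the relationship the "WNW" and "W" directions express in the example above.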

### A more concise syntax : `locate_groups`

We can deal with multiple headers differentiated by formatting more concisely using `locate_groups`. The `.groupings` argument allows us to indicate which formats differentiate headers. In this case, hierarchy is indicated by indenting, which can be accessed with the `fmt_alignment_indent` function. The `.hook_if` argument receives an expression, wrapped in `hook_if()`, that indicates which header\_groups are "WNW" rather than "N". The `.hook_if_rev` argument will switch directions from "N" to "WSW". Importantly, the expression is passed into `summarise`, so it needs to reduce each column to a single boolean value. This is the reason for using `any` in the example below.

``` r
locatr_example("worked-examples.xlsx") %>%
  xlsx_cells_fmt(sheets = "pivot-indent") %>%
  append_fmt(fmt_alignment_indent) %>%
  locate_data(data_type == "numeric") %>%
  locate_groups(
    direction = "W",
    .groupings = groupings(fmt_alignment_indent),
    .hook_if = hook_if(any(fmt_alignment_indent == 0))
  ) %>%
  locate(direction = "N", name = student) %>%
  plot_cells()
```

![](README/README-unnamed-chunk-11-1.png)
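The requirement that the hook expression collapse to a single boolean can be seen with plain dplyr. The sketch below uses a hypothetical `header_cells` table grouped by indent level; it mimics the kind of `summarise` call described above rather than reproducing locatr's internals.

``` r
library(dplyr)

# Hypothetical header cells, one row per cell.
header_cells <- tibble(
  label  = c("Humanities", "Classics", "History",
             "Performance", "Music", "Drama"),
  indent = c(0, 1, 1, 0, 1, 1)
)

# any() reduces the per-cell comparison to one TRUE/FALSE per group.
hook_flags <- header_cells %>%
  group_by(indent) %>%
  summarise(hook = any(indent == 0))

hook_flags
```

Without `any`, the comparison `indent == 0` would yield one value per cell rather than one per group, so it could not serve as a single flag for the whole header group.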

To browse the different aspects of formatting on which headers can be separated, type `fmt_` and press Tab.

A more complicated example: Tidying new residential construction data from the US Census Bureau
-----------------------------------------------------------------------------------------------

Here's a more complicated table.

We can tidy this table by:

- filtering to include only the upper table (filtering out any cells below the first containing "RSE")
- locating the data, preventing the inclusion of the cells containing 2018 and 2019 in column 1
- differentiating row groups based on whether they are numeric cells
- identifying column headers, using the `header_fill` argument to deal with merged cells.

``` r
annotated_df <-
  locatr_example("newresconst.xlsx") %>%
  xlsx_cells_fmt(sheets = "Table 1 - Permits") %>%
  append_fmt(fmt_font_bold) %>%
  filter(row < min(row[str_detect(character, "RSE")], na.rm = TRUE)) %>%
  locate_data(data_type == "numeric" & col > 1) %>%
  locate_groups(
    direction = "W",
    .groupings = groupings(is.na(numeric)),
    .hook_if = hook_if(any(data_type == "numeric"))
  ) %>%
  locate_groups(direction = "N", header_fill = "style")

annotated_df %>% plot_cells()
```

![](README/README-unnamed-chunk-13-1.png)

``` r

annotated_df %>% migrate()
#> # A tibble: 156 x 7
#> row col .value N_header_label_… N_header_label_… W_header_label_…
#>
#> 1 9 2 1377 United States "Total" 2018
#> 2 9 3 851 United States "1 unit" 2018
#> 3 9 4 40 United States "2 to 4 units" 2018
#> 4 9 5 486 United States "5 units\r\n or… 2018
#> 5 9 6 135 Northeast "Total" 2018
#> 6 9 7 51 Northeast "1 unit" 2018
#> 7 9 8 203 Midwest "Total" 2018
#> 8 9 9 119 Midwest "1 unit" 2018
#> 9 9 10 652 South "Total" 2018
#> 10 9 11 456 South "1 unit" 2018
#> # … with 146 more rows, and 1 more variable: W_header_label_02
```

*Note that older versions of dplyr require replacing `filter` with `filter_fmt`.*
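
The first step in the list above, keeping only the cells above the first one containing "RSE", can be tried in isolation on a toy data frame. The `toy_cells` table below is invented for illustration; only the `filter` expression is taken from the example.

``` r
library(dplyr)
library(stringr)

# Hypothetical cells: `character` is NA for non-text cells, as in tidyxl output.
toy_cells <- tibble(
  row       = 1:6,
  character = c("Table 1", NA, "2018", "RSE (%)", "2018", "RSE note")
)

# str_detect() returns NA for NA input, so indexing `row` with it yields NA
# entries; na.rm = TRUE lets min() ignore them and find the first "RSE" row.
upper_table <- toy_cells %>%
  filter(row < min(row[str_detect(character, "RSE")], na.rm = TRUE))

upper_table
```

Here the first "RSE" cell sits in row 4, so only rows 1 to 3 are kept.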