Tidy your JSON data in R with tidyjson
https://github.com/colearendt/tidyjson


---
title: 'tidyjson'
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/tidyjson)](https://cran.r-project.org/package=tidyjson)
[![Build Status](https://travis-ci.org/colearendt/tidyjson.svg?branch=master)](https://travis-ci.org/colearendt/tidyjson)
[![Coverage Status](https://codecov.io/github/colearendt/tidyjson/coverage.svg?branch=master)](https://codecov.io/github/colearendt/tidyjson?branch=master)

[![CRAN Activity](http://cranlogs.r-pkg.org/badges/tidyjson)](https://cran.r-project.org/package=tidyjson/index.html)
[![CRAN History](http://cranlogs.r-pkg.org/badges/grand-total/tidyjson)](https://cran.r-project.org/package=tidyjson/index.html)

![tidyjson graphs](https://cloud.githubusercontent.com/assets/2284427/18217882/1b3b2db4-7114-11e6-8ba3-07938f1db9af.png)

tidyjson provides tools for turning complex [JSON](http://www.json.org/) into [tidy](https://cran.r-project.org/package=tidyr/vignettes/tidy-data.html)
data.

## Installation

Get the released version from CRAN:

```R
install.packages("tidyjson")
```

or the development version from GitHub:

```R
devtools::install_github("colearendt/tidyjson")
```

## Examples

The following example takes a character vector of
`r library(tidyjson);length(worldbank)`
documents in the `worldbank` dataset and spreads out all objects.
Every JSON object key gets its own column with types inferred, as long
as the key's value is not an array. When `recursive=TRUE` (the default),
`spread_all` does this recursively for nested objects and builds column names
using the `sep` parameter (e.g. `{"a":{"b":1}}` with `sep='.'` would
generate a single column: `a.b`).

```{r, message=FALSE}
library(dplyr)
library(tidyjson)

worldbank %>% spread_all
```
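
The naming rule for nested keys is easier to see on a tiny input. The sketch below uses a hand-written document (not part of the package data) and only checks that the flattened column appears under the name built with `sep`:

```R
library(dplyr)
library(tidyjson)

# Hypothetical one-document input with one nested object
nested <- '{"a": {"b": 1}, "c": "x"}'

# The nested key "b" is flattened under its parent "a", joined by `sep`
flat <- nested %>% spread_all(sep = ".")

"a.b" %in% names(flat)
flat$a.b
```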

Some values in `worldbank` are arrays, which `spread_all` does not handle. This example shows how
to quickly summarize the top-level structure of a JSON collection:

```{r}
worldbank %>% gather_object %>% json_types %>% count(name, type)
```

In order to capture the data in the `majorsector_percent` array, we can use `enter_object`
to enter into that object, `gather_array` to stack the array and `spread_all`
to capture the object items under the array.

```{r}
worldbank %>%
  enter_object(majorsector_percent) %>%
  gather_array %>%
  spread_all %>%
  select(-document.id, -array.index)
```

## API

### Spreading objects into columns

* `spread_all()` for spreading all object values into new columns, with nested
objects having concatenated names

* `spread_values()` for specifying a subset of object values to spread into new
columns using the `jstring()`, `jinteger()`, `jdouble()` and `jlogical()`
functions. It is possible to specify multiple parameters to extract data from
nested objects (e.g. `jstring('a','b')`).
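
As a minimal sketch of `spread_values()`, the example below extracts three named columns from two made-up documents. The documents and column names are assumptions for illustration. The second document lacks the nested `first` key, so that cell should come back as `NA` while the column itself is still present:

```R
library(dplyr)
library(tidyjson)

# Hypothetical documents; the second one has no "first" key under "name"
people <- c(
  '{"name": {"first": "Ada", "last": "Lovelace"}, "age": 36}',
  '{"name": {"last": "Turing"}, "age": 41}'
)

out <- people %>%
  spread_values(
    first = jstring("name", "first"),  # path into the nested object
    last  = jstring("name", "last"),
    age   = jinteger("age")            # top-level integer value
  )

out
```

Unlike `spread_all()`, the set of output columns here is fixed by the call, not by whatever keys happen to appear in the input.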

### Object navigation

* `enter_object()` for entering into an object by name, discarding all other
JSON (and rows without the corresponding object name) and allowing further
operations on the object value

* `gather_object()` for stacking all object name-value pairs by name, expanding
the rows of the `tbl_json` object accordingly
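
A short sketch of both verbs on two made-up documents, only the first of which has a `meta` object. The input and the variable names are illustrative assumptions:

```R
library(dplyr)
library(tidyjson)

# Hypothetical documents; only the first has a "meta" object
docs <- c(
  '{"id": 1, "meta": {"lang": "R", "stars": 10}}',
  '{"id": 2}'
)

# One row per top-level name-value pair, with the JSON type of each value
pairs <- docs %>% gather_object %>% json_types

# Drop documents without "meta", then stack the pairs inside it
meta_pairs <- docs %>% enter_object(meta) %>% gather_object

pairs
meta_pairs
```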

### Array navigation

* `gather_array()` for stacking all array values by index, expanding the
rows of the `tbl_json` object accordingly
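
The following sketch stacks arrays of different lengths, including an empty one, from made-up documents. It also uses `append_values_string()`, a tidyjson function not listed in this README, to pull each array element into a column:

```R
library(dplyr)
library(tidyjson)

# Hypothetical documents with ragged "tags" arrays (the second is empty)
docs <- c(
  '{"tags": ["a", "b", "c"]}',
  '{"tags": []}',
  '{"tags": ["d"]}'
)

tag_rows <- docs %>%
  enter_object(tags) %>%        # move into the "tags" value
  gather_array %>%              # one row per array element
  append_values_string("tag")   # store each element in a "tag" column

tag_rows
```

The document with the empty array contributes no rows, so `document.id` is what ties each element back to its source document.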

### JSON inspection

* `json_types()` for identifying JSON data types

* `json_lengths()` for computing the length of JSON data (can be larger than
`1` for objects and arrays)

* `json_complexity()` for computing the length of the unnested JSON, i.e.,
how many terminal leaves there are in a complex JSON structure

* `is_json` family of functions for testing the type of JSON data
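
The inspection functions each add one column, so they can be chained. This sketch uses made-up inputs and assumes that the length function is exported as `json_lengths()`, which appears to be its name in released versions of the package:

```R
library(dplyr)
library(tidyjson)

# Hypothetical inputs: an array, a nested object and a scalar string
values <- c(
  '[1, 2, 3]',
  '{"k1": 1, "k2": {"k3": [1, 2]}}',
  '"text"'
)

inspected <- values %>%
  json_types %>%       # adds a "type" column
  json_lengths %>%     # adds a "length" column (top-level length)
  json_complexity      # adds a "complexity" column (number of leaves)

inspected
```

For the nested object the top-level length is 2 but the complexity is 3, because the inner array contributes two leaves.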

### JSON summarization

* `json_structure()` for creating a single fixed column data.frame that
recursively structures arbitrary JSON data

* `json_schema()` for representing the schema of complex JSON, unioned across
disparate JSON documents, and collapsing arrays to their most complex type
representation
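
A minimal sketch of both summaries on a made-up document. The exact text of the schema string is not shown here because it may differ between package versions:

```R
library(dplyr)
library(tidyjson)

# Hypothetical document: one string value and one array of two strings
doc <- '{"name": "x", "tags": ["a", "b"]}'

# One row per node: the root object, its two values, and two array elements
struct <- doc %>% json_structure

# A single character string describing the schema
schema <- doc %>% json_schema

struct
schema
```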

### Creating tbl_json objects

* `as.tbl_json()` for converting a string or character vector into a `tbl_json`
object, or for converting a `data.frame` with a JSON column using the
`json.column` argument

* `tbl_json()` for combining a `data.frame` and associated `list` derived
from JSON data into a `tbl_json` object

* `read_json()` for reading JSON data from a file
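
The sketch below builds a `tbl_json` in three ways from the same two made-up documents. The data frame, its column names, and the temporary file are assumptions for illustration. `read_json()` is called with an explicit namespace because jsonlite exports a function of the same name:

```R
library(dplyr)
library(tidyjson)

json <- c('{"x": 1}', '{"x": 2}')

# 1. From a character vector, one document per element
from_chr <- as.tbl_json(json)

# 2. From a data.frame that carries JSON in one of its columns
df <- data.frame(id = c("a", "b"), json = json, stringsAsFactors = FALSE)
from_df <- as.tbl_json(df, json.column = "json")
spread <- from_df %>% spread_all

# 3. From a file with one JSON document per line
path <- tempfile(fileext = ".jsonl")
writeLines(json, path)
from_file <- tidyjson::read_json(path, format = "jsonl")

spread
```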

### Converting tbl_json objects

* `as.character.tbl_json` for converting the JSON attribute of a `tbl_json`
object back into a JSON character string

### Included JSON data

* `commits`: commit data for the dplyr repo from the GitHub API

* `issues`: issue data for the dplyr repo from the GitHub API

* `worldbank`: world bank funded projects from jsonstudio

* `companies`: startup company data from jsonstudio

## Philosophy

The goal is to turn complex JSON data, which is often represented as nested
lists, into tidy data frames that can be more easily manipulated.

* Work on a single JSON document, or on a collection of related documents

* Create pipelines with `%>%`, producing code that can be read from left to
right

* Guarantee the structure of the data produced, even if the input JSON
structure changes (with the exception of `spread_all`)

* Work with arbitrarily nested arrays or objects

* Handle 'ragged' arrays and / or objects (varying lengths by document)

* Allow for extraction of data in values or object names

* Ensure edge cases are handled correctly (especially empty data)

* Integrate seamlessly with `dplyr`, allowing `tbl_json` objects to pipe in and
out of `dplyr` verbs where reasonable

## Related Work

tidyjson depends upon:

* [magrittr](https://github.com/tidyverse/magrittr) for the `%>%` pipe operator
* [jsonlite](https://github.com/jeroen/jsonlite) for converting JSON strings into nested lists
* [purrr](https://github.com/tidyverse/purrr) for list operators
* [tidyr](https://github.com/tidyverse/tidyr) for unnesting and spreading

Further, there are other R packages that can be used to better understand
JSON data:

* [listviewer](https://github.com/timelyportfolio/listviewer) for viewing JSON data interactively