# **saqgetr**

[![Lifecycle: retired](https://img.shields.io/badge/lifecycle-retired-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#retired)
[![CRAN status](https://www.r-pkg.org/badges/version/saqgetr)](https://cran.r-project.org/package=saqgetr)
[![CRAN log](https://cranlogs.r-pkg.org/badges/last-week/saqgetr?color=brightgreen)](https://cran.r-project.org/package=saqgetr)

**saqgetr** is an R package to import air quality monitoring data in a fast and easy way. Currently, only European data are available, but the package is generic and therefore data from other areas may be included in the future. For documentation on what data sources are accessible, please see [**saqgetr**'s technical note](https://drive.google.com/open?id=1IgDODHqBHewCTKLdAAxRyR7ml8ht6Ods).

**saqgetr** has been made possible with the help of [Ricardo Energy & Environment](https://ee.ricardo.com).

## Retirement note

**saqgetr** will be retired in mid-2024. There are several reasons for the retirement, but the main points are these: I no longer have the scope to catch all issues when they arise; access to the remote servers used by **saqgetr** has become progressively more difficult due to my relocation and stricter security policies; and the near-real-time (E2a) data flow contains far more unreliable observations than in the past, which are not being fixed or updated by the member states. Therefore, the database underlying **saqgetr** requires more maintenance than I can provide. The final update of observations was conducted on `2024-02-17`.

## Installation

**saqgetr** is available on CRAN and can be installed in the normal way:

```
# Install saqgetr package
install.packages("saqgetr")
```

If desired, the development version can be installed with the help of [**devtools**](https://github.com/r-lib/devtools) or [**remotes**](https://github.com/r-lib/remotes) like this:
```
# Install development version of saqgetr
remotes::install_github("skgrange/saqgetr")
```

## Framework

**saqgetr** acts as an interface to pre-prepared data files located on a web server. For each monitoring site served, there is a single file containing all observations for each year. A collection of metadata tables is also available, which enables users to understand the locations and types of observations that are available. The data files are compressed text files (`.csv.gz`), which allows for simple and fast importing; if other interfaces are to be developed, this should be straightforward.
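Because the data files are plain compressed CSVs, they can also be read directly with standard tools. A minimal sketch using **readr**, which transparently decompresses `.csv.gz` files, including from remote URLs; the URL below is a hypothetical placeholder, not a real file path (the real paths are documented in the technical note):

```{r}
# Load package
library(readr)

# Read a single site-year file directly; the URL is a placeholder only
data_site_year <- read_csv("https://example.com/path/to/site_year.csv.gz")
```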

## Usage

### Sites

To import data with **saqgetr**, functions with the `get_saq_*` prefix are used. A monitoring site must be supplied to get observations. To find which sites are available, use `get_saq_sites`:

```
# Load packages
library(dplyr)
library(saqgetr)

# Import site information
data_sites <- get_saq_sites()

# Glimpse tibble
glimpse(data_sites)

#> Observations: 9,016
#> Variables: 16
#> $ site "ad0942a", "ad0944a", "ad0945a", "al0201a", "a…
#> $ site_name "Fixa", "Fixa oz", "Estacional oz Envalira", "…
#> $ latitude 42.50969, 42.51694, 42.53488, 41.33027, 41.345…
#> $ longitude 1.539138, 1.565250, 1.716986, 19.821772, 19.85…
#> $ elevation 1080, 1637, 2515, 162, 207, 848, 25, 1, 13, 15…
#> $ country "andorra", "andorra", "andorra", "albania", "a…
#> $ country_iso_code "AD", "AD", "AD", "AL", "AL", "AL", "AL", "AL"…
#> $ site_type "background", "background", "background", NA, …
#> $ site_area "urban", "rural", "rural", NA, NA, "suburban",…
#> $ date_start 2013-12-31 23:00:00, 2013-12-31 23:00:00, 201…
#> $ date_end 2019-04-27 14:00:00, 2019-04-27 14:00:00, 201…
#> $ network "NET-AD001A", "NET-AD001A", "NET-AD001A", NA, …
#> $ eu_code "STA-AD0942A", "STA-AD0944A", "STA-AD0945A", N…
#> $ eoi_code "AD0942A", "AD0944A", "AD0945A", NA, NA, "AL02…
#> $ observation_count 309037, 45174, 18268, 168983, 140812, 247037, …
#> $ data_source "aqer:e1a; aqer:e2a", "aqer:e1a; aqer:e2a", "a…
```
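The site table is a standard tibble, so it can be filtered and subset with **dplyr** before choosing sites to import. A small sketch using the columns shown above:

```{r}
# Keep only United Kingdom background sites and a few identifying columns
data_sites_gb <- data_sites %>%
  filter(country_iso_code == "GB", site_type == "background") %>%
  select(site, site_name, latitude, longitude, site_area)
```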

### Observations

Sites are represented by a code which is prefixed with the country's ISO code. For example, a site in York, England, United Kingdom is identified as `gb0919a` (the ISO code for the United Kingdom is non-standard and GB is for Great Britain). To get observations for this site, use `get_saq_observations`:

```{r}
# Get air quality monitoring data for a York site
data_york <- get_saq_observations(site = "gb0919a", start = 2005)

# Glimpse tibble
glimpse(data_york)

#> Observations: 370,235
#> Variables: 10
#> $ date 2008-01-01, 2008-01-02, 2008-01-03, 2008-01-04, 2008-…
#> $ date_end NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ site "gb0919a", "gb0919a", "gb0919a", "gb0919a", "gb0919a",…
#> $ variable "pm10", "pm10", "pm10", "pm10", "pm10", "pm10", "pm10"…
#> $ process 62392, 62392, 62392, 62392, 62392, 62392, 62392, 62392…
#> $ summary 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20…
#> $ validity 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, …
#> $ unit "µg/m3", "µg/m3", "µg/m3", "µg/m3", "µg/m3", "µg/m3", …
#> $ value 21.625, 22.708, 24.667, 21.833, 24.000, 29.875, 16.833…
```

`get_saq_observations` takes a vector of sites, so many sites can be imported at once. Beware that stacking sites can return a lot of data; for example, the two sites below return a tibble/data frame/table with over 10 million observations.

```{r}
# Get 10 million observations, verbose is used to give an indication of
# what is occurring
data_large_ish <- get_saq_observations(
  site = c("gb0036r", "gb0682a"),
  start = 1960,
  verbose = TRUE
)

# Glimpse tibble
glimpse(data_large_ish)

#> Observations: 9,981,977
#> Variables: 9
#> $ date 1995-09-11, 1995-09-12, 1995-09-13, 1995-09-14, 1995-…
#> $ date_end NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ site "gb0036r", "gb0036r", "gb0036r", "gb0036r", "gb0036r",…
#> $ variable "so2", "so2", "so2", "so2", "so2", "so2", "so2", "so2"…
#> $ process 57295, 57295, 57295, 57295, 57295, 57295, 57295, 57295…
#> $ summary 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20…
#> $ validity 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ unit "µg/m3", "µg/m3", "µg/m3", "µg/m3", "µg/m3", "µg/m3", …
#> $ value 0.983, 0.792, 1.362, 0.483, 14.633, 1.171, 0.821, 15.2…
```

#### Cleaning observations

Once data are imported, valid data for a certain averaging period/summary can be isolated with `saq_clean_observations`. The function can also "spread" data so that the variables/pollutants become columns:

```{r}
# Get only valid hourly data and reshape (spread)
data_york_spread <- data_york %>%
  saq_clean_observations(summary = "hour", valid_only = TRUE, spread = TRUE)

# Glimpse tibble
glimpse(data_york_spread)
```
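Once spread, each pollutant is its own column, which makes plotting straightforward. A sketch with **ggplot2**, assuming `pm10` is among the variables returned for this site:

```{r}
# Load package
library(ggplot2)

# Plot the hourly pm10 time series for the York site
ggplot(data_york_spread, aes(date, pm10)) +
  geom_line() +
  labs(x = NULL, y = "PM10 (µg/m³)")
```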

### Processes

Information on the specific time series/processes can also be retrieved.

```{r}
# Get processes
data_processes <- get_saq_processes()

# Glimpse tibble
glimpse(data_processes)

#> Observations: 171,992
#> Variables: 15
#> $ process 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ site "al0201a", "al0201a", "al0201a", "al0201a", "a…
#> $ variable "so2", "so2", "pm10", "pm10", "o3", "o3", "o3"…
#> $ variable_long "Sulphur dioxide (air)", "Sulphur dioxide (air…
#> $ period "day", "hour", "day", "hour", "day", "dymax", …
#> $ unit "ug.m-3", "ug.m-3", "ug.m-3", "ug.m-3", "ug.m-…
#> $ date_start NA, 2011-01-01 00:00:00, 2011-01-01 00:00:00,…
#> $ date_end NA, 2011-12-31 23:00:00, 2012-12-30 00:00:00,…
#> $ sample NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ sampling_point NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ sampling_process NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ observed_property 1, 1, 5, 5, 7, 7, 7, 7, 8, 8, 9, 9, 10, 10, 10…
#> $ group_code 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
#> $ data_source "airbase", "airbase", "airbase", "airbase", "a…
#> $ observation_count 0, 6806, 729, 17336, 352, 352, 16413, 8358, 69…
```
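The process table is useful for checking which averaging periods exist for a pollutant before importing observations. A sketch using the columns shown above:

```{r}
# Which hourly pm10 processes exist, and how many observations do they hold?
data_processes %>%
  filter(variable == "pm10", period == "hour") %>%
  select(process, site, date_start, date_end, observation_count)
```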

### Other metadata

Other helper tables are also available:

```{r}
# Get other helper tables
# Summary integers
data_summary_integers <- get_saq_summaries() %>%
  print(n = Inf)

#> # A tibble: 20 x 2
#> averaging_period summary
#>
#> 1 hour 1
#> 2 day 20
#> 3 week 90
#> 4 var 91
#> 5 month 92
#> 6 fortnight 93
#> 7 3month 94
#> 8 2month 95
#> 9 2day 96
#> 10 3day 97
#> 11 2week 98
#> 12 4week 99
#> 13 3hour 100
#> 14 8hour 101
#> 15 hour8 101
#> 16 year 102
#> 17 dymax 21
#> 18 quarter 103
#> 19 other 91
#> 20 n-hour 104

# Validity integers
data_validity_integers <- get_saq_validity() %>%
  print(n = Inf)

#> # A tibble: 6 x 4
#> validity valid description notes
#>
#> 1 NA FALSE data is considered to be invalid due to the… from aqer
#> 2 -1 FALSE invalid due to other circumstances or data … from aqer
#> 3 0 FALSE invalid smonitor nom…
#> 4 1 TRUE from aqer
#> 5 2 TRUE valid but below detection limit measurement… from aqer
#> 6 3 TRUE valid but below detection limit and number … from aqer
```
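The validity table can be joined onto observations so every value carries a logical `valid` flag; a sketch joining on the shared `validity` key:

```{r}
# Attach the valid flag and descriptions to the York observations
data_york_validity <- data_york %>%
  left_join(data_validity_integers, by = "validity")
```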

### Simple annual and monthly means of observations

Simple annual and monthly means of the daily and hourly processes have also been generated. These summaries are often useful for trend analysis or mapping.

```{r}
# Get annual means
data_annual <- get_saq_simple_summaries(summary = "annual_mean")

# Glimpse tibble
glimpse(data_annual)

#> Observations: 655,362
#> Variables: 8
#> $ date 2013-01-01, 2014-01-01, 2015-01-01, 2016-01-01, …
#> $ date_end 2013-12-31 23:59:59, 2014-12-31 23:59:59, 2015-1…
#> $ site "ad0942a", "ad0942a", "ad0942a", "ad0942a", "ad09…
#> $ variable "co", "co", "co", "co", "co", "co", "co", "no", "…
#> $ summary_source 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ summary 102, 102, 102, 102, 102, 102, 102, 102, 102, 102,…
#> $ count 1, 8438, 8385, 8171, 8441, 8217, 5990, 1, 8310, 8…
#> $ value 0.5000000, 0.3224579, 0.3582230, 0.3168768, 0.259…

# What was York Fishergate's (hourly) PM10 concentration in 2017?
data_annual %>%
  filter(site == "gb0682a",
         lubridate::year(date) == 2017L,
         variable == "pm10",
         summary_source == 1L) %>%
  select(date,
         site,
         variable,
         count,
         value)

#> # A tibble: 1 x 5
#> date site variable count value
#>
#> 1 2017-01-01 00:00:00 gb0682a pm10 8442 23.8
```
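The annual-mean table also lends itself to quick trend plots. A sketch for the same site and pollutant, using only the columns shown above:

```{r}
# Load package
library(ggplot2)

# Annual mean pm10 trend for the York Fishergate site
data_annual %>%
  filter(site == "gb0682a", variable == "pm10", summary_source == 1L) %>%
  ggplot(aes(date, value)) +
  geom_point() +
  geom_line() +
  labs(x = NULL, y = "Annual mean PM10 (µg/m³)")
```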