https://github.com/cutterkom/destatiscleanr

Imports and cleans data from official German statistical offices to jump-start the data analysis

destatis destatis-data genesis german opendata r rstats statistical-offices


Host: GitHub
URL: https://github.com/cutterkom/destatiscleanr
Owner: cutterkom
License: mit
Created: 2019-01-03T11:36:14.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2023-05-17T12:05:17.000Z (over 2 years ago)
Last Synced: 2025-04-05T03:01:52.065Z (7 months ago)
Topics: destatis, destatis-data, genesis, german, opendata, r, rstats, statistical-offices
Language: R
Size: 374 KB
Stars: 48
Watchers: 7
Forks: 4
Open Issues: 1
Metadata Files:
- Readme: readme.md
- License: LICENSE


# destatiscleanr

---

Update May 2020: This package is no longer needed. The Federal Statistical Office of Germany, [Destatis](http://destatis.de), listened to its users: you can now download data as a flat CSV file or use [an API](https://www-genesis.destatis.de/genesis/online?Menu=Webservice#abreadcrumb).



> Thanks for the nudge, @cutterkom, for the further suggestions from the #ddj community, and to @destatis for implementing it https://t.co/XIHG5Iml64
>
> Susanne Hagenkort-Rieger (@hagrie), May 29, 2020

 

[This package as an online tool](http://apps.katharinabrunner.de/destatiscleaner)

---

[Destatis](http://destatis.de) is the Federal Statistical Office of Germany. It publishes many datasets covering a wide range of topics, from area sizes to international economic indicators, in its database called [Genesis](https://www-genesis.destatis.de/genesis/online).

Unfortunately, the downloadable `csv` files don't comply with common standards of a tidy, ready-to-use machine-readable dataset:

* The tables have double, triple, quadruple, quintuple ... headers.

* Every file includes copyright information at the end of the file.

* Positive numeric values have a `+` sign.

* ...

These problems exist throughout the federal system of statistical offices. Therefore `destatiscleanr` also works on data from [regionalstatistik.de](http://regionalstatistik.de) and other statistical offices.

The consequence of these messy files is time-consuming data cleaning. Every time you want to use data from Destatis, you have to do the same (or at least very similar) tasks. This package helps by doing four things:

1. it imports the file by taking care of German peculiarities concerning encoding and decimal marks

2. it deletes the copyright and metadata part

3. it combines multiline headers to a regular column name

4. it converts numeric values to the numeric type (via `as.numeric`)

Ideally, you can start your analysis right after calling `destatiscleanr("destatis_file.csv")`.
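The four steps above can be sketched in base R. This is only an illustration under stated assumptions, not the package's actual implementation: the sample file, its layout (one title line, two header rows, a `_` separator before the footer), the Latin-1 encoding and the helper `to_number` are all made up for the example.

```r
# Hypothetical miniature of a messy export. Layout and values are invented
# for illustration and are not taken from a real Genesis file.
raw <- c(
  "Example table title",
  ";Index;Change",
  ";2015=100;percent",
  "2018;103,8;+1,8",
  "2019;105,3;+1,4",
  "2020;105,8;-",
  "__________",
  "(C) Statistical office, example footer"
)
path <- tempfile(fileext = ".csv")
writeLines(raw, path)

# Step 1: read the file as lines, declaring the (assumed) Latin-1 encoding.
lines <- readLines(path, encoding = "latin1")

# Step 2: drop the title line and everything from the footer separator on.
footer_start <- grep("^_{5,}", lines)[1]
body <- lines[2:(footer_start - 1)]

# Step 3: combine the two header rows into one name per column.
cells <- strsplit(body, ";", fixed = TRUE)
h1 <- ifelse(cells[[1]] == "", "na", cells[[1]])
h2 <- ifelse(cells[[2]] == "", "na", cells[[2]])
col_names <- tolower(gsub("[^A-Za-z0-9]+", "_", paste(h1, h2, sep = "_")))

rows <- do.call(rbind, cells[-(1:2)])
df <- as.data.frame(rows, stringsAsFactors = FALSE)
names(df) <- col_names

# Step 4: strip the "+" sign, turn the decimal comma into a point, convert.
to_number <- function(x) {
  x <- sub("^\\+", "", x)
  x <- sub(",", ".", x, fixed = TRUE)
  suppressWarnings(as.numeric(x))  # non-numeric markers such as "-" become NA
}
df[] <- lapply(df, to_number)

str(df)
```

In this toy file the first column has two empty header cells, so it ends up with the name `na_na`.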

## Install

The package can be installed with `devtools`:

`devtools::install_github("cutterkom/destatiscleanr")`

## Usage

Download a `csv` file from the official Destatis/Genesis database and provide its path to the `destatiscleanr` function.

`library(destatiscleanr)`

`df <- destatiscleanr("path/to/destatis_file.csv")`

## Example

A short example to illustrate the advantage of the package is the table for *Verbraucherpreise*, German for consumer prices aka inflation.

**Without destatiscleanr**

![](img/before.png)

![](img/before_str.png)

**With destatiscleanr**

![](img/after.png)

![](img/after_str.png)

The column name `na_na` arises because the column names are built from rows four and five of the original "Verbraucherpreise" table. Those rows are empty for this column, hence `na_na`.

## Caution

The goal is to jump-start the analysis of Destatis data. This comes with two caveats: the automatic creation of column names and the handling of missing values.

### Column names

Be aware that the automatic renaming of columns doesn't work perfectly, and the column names are probably not as specific as you would like. The package combines multiline headers into a unique column name that includes a name and a unit, so you can start your analysis immediately. You may still have to adjust at least some column names.
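Adjusting names afterwards takes only a few lines of base R. The following is a minimal sketch: the data frame is a toy stand-in for a cleaned result, and both the old and the new column names are hypothetical.

```r
# Toy stand-in for a cleaned result; column names and values are made up.
df <- data.frame(na_na = c(2018, 2019), index_2015_100 = c(103.8, 105.3))

# Map old names to new ones, so the code does not depend on column positions.
new_names <- c(na_na = "year", index_2015_100 = "cpi_index")

hit <- names(df) %in% names(new_names)
names(df)[hit] <- unname(new_names[names(df)[hit]])

names(df)
```

Renaming by name rather than by position keeps the code valid if a later download contains the columns in a different order.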

### Missing values

A missing value in the original file can have many different meanings: for example, `-` means no data is available and `...` means the value will be reported later. These distinctions *aren't* represented in the data cleaned by `destatiscleanr`: every missing value, no matter the reason, becomes `NA`.
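If the reason matters for your analysis, you can record it from the raw values before converting them. The sketch below is not part of the package. It assumes you have the raw cell values as character strings; the values are invented, and the lookup table `markers` covers only the two markers mentioned above.

```r
# Raw cell values as they might appear in a downloaded file (invented values).
raw_values <- c("103,8", "-", "...", "+1,4")

# Markers named in this README; the official legend may list more.
markers <- c("-" = "no data available",
             "..." = "value will be reported later")

# Look up each cell in the marker table; ordinary numbers yield NA here.
reason <- unname(markers[raw_values])

# Convert only the cells that are not markers.
value <- ifelse(is.na(reason), raw_values, NA_character_)
value <- as.numeric(sub(",", ".", sub("^\\+", "", value), fixed = TRUE))

data.frame(value = value, reason = reason)
```

Keeping the reason in a separate column leaves the numeric column usable while the distinction between the markers is still available.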

Possible reasons for missing values:

![](img/missing_values.png)

## More resources

The [package wiesbaden](https://github.com/sumtxt/wiesbaden) offers a way to get Destatis data directly from the database. ~~Unfortunately, this is a paid service for the main database of Destatis.~~ Destatis now offers its API [as a free service](https://www.destatis.de/DE/PresseService/Presse/Pressemitteilungen/2019/01/PD19_006_p001.html) (see the documentation [here](https://www-genesis.destatis.de/genesis/misc/GENESIS-Webservices_Einfuehrung.pdf)). Just like [Regionalstatistik.de](http://regionalstatistik.de), it can now be accessed as a free registered user.

## Wishlist

- more dynamic creation of `column_names` :roll_eyes:

- Clever guessing of year/date column

- ~~Shiny app to offer it to non-R users~~
