Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/cutterkom/destatiscleanr
Imports and cleans data from official German statistical offices to jump-start the data analysis
https://github.com/cutterkom/destatiscleanr
destatis destatis-data genesis german opendata r rstats statistical-offices
Last synced: 2 months ago
JSON representation
Imports and cleans data from official German statistical offices to jump-start the data analysis
- Host: GitHub
- URL: https://github.com/cutterkom/destatiscleanr
- Owner: cutterkom
- License: mit
- Created: 2019-01-03T11:36:14.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-05-17T12:05:17.000Z (over 1 year ago)
- Last Synced: 2023-10-20T19:08:49.094Z (over 1 year ago)
- Topics: destatis, destatis-data, genesis, german, opendata, r, rstats, statistical-offices
- Language: R
- Size: 374 KB
- Stars: 48
- Watchers: 7
- Forks: 4
- Open Issues: 1
-
Metadata Files:
- Readme: readme.md
- License: LICENSE
Awesome Lists containing this project
README
# destatiscleanr
---
Update May 2020: This package is no longer needed. The Federal Statistical Office of Germany, [Destatis](http://destatis.de), listened to it's users: You can now download data as a flat file csv or use [an API](https://www-genesis.destatis.de/genesis/online?Menu=Webservice#abreadcrumb).
Danke fürs Anschubsen @cutterkom und für weitere Anregungen aus der Community #ddj und fürs Umsetzen @destatis https://t.co/XIHG5Iml64
— Susanne Hagenkort-Rieger (@hagrie) May 29, 2020[This package as an online tool](http://apps.katharinabrunner.de/destatiscleaner)
---
[Destatis](http://destatis.de) is the Federal Statistical Office of Germany. Of course, it publishes a lot of datasets containing a wide range of data, from area sizes to international econonomic indicators in its database called [Genesis](https://www-genesis.destatis.de/genesis/online).
Unfortunately, the downloadable `csv` files don't comply with common standards of a tidy, ready-to-use machine-readable dataset:
* The tables have double, triple, quadruple, quintuple ... headers.
* Every file includes copyright information on the end of the file.
* positive numeric valus have a `+` sign
* ...The problems exists throughout the federal system of different statistical offices. Therefore `destatiscleanr` works on data of [regionalstatistik.de](http://regionalstatistik.de) and other statistics offices, too.
The consequence of these messy files is time-consuming data cleaning. Everytime you want to use data from Destatis you have to do the same (or at least very similar) tasks. This package helps by doing four things:
1. it imports the file by taking care of German peculiarities concerning encoding and decimal marks
2. it deletes the copyright and metadata part
3. it combines multiline headers to a regular column name
4. it converts numeric values to `as.numeric`Ideally, you can start your analysis right after calling `destatiscleanr("destatis_file.csv")`.
## Install
The package can be installed with `devtools`:
`devtools::install_github("cutterkom/destatiscleanr")`
## Usage
Download a `csv` file from the official Destatis/Genesis database and provide its path to the `destatiscleanr` function.
`library(destatiscleanr)`
`df <- destatiscleanr("path/to/destatis_file.csv")`
## Example
A short example to illustrate the advantage of the package is the table for *Verbraucherpreise*, German for consumer prices aka inflation.
**Without destatiscleanr**
![](img/before.png)
![](img/before_str.png)**With destatiscleanr**
![](img/after.png)
![](img/after_str.png)The column name `na_na` derives from the fact that the column names are built from the rows four and five in the original "Verbraucherpreise" table - and these are empty, therefore `na_na`.
## Caution
The goal is to jump start the analysis of Destatis data. This comes with two caveats: the automatic creation of column names and the handling of missing values.
### Column names
Be aware that the automatic renaming of columns doesn't work perfectly. The column names are probably not as specific as you wish. The package combines multline headers to a unique column name, including a name and unit. So you can definitly start doing your analysis without any hassle immidiately. It may be that you have to adjust at least some column names.
### Missing values
An `NA` value can have many different meanings, like `-` means no data available and `...` the value will be reported later. This distinctions *aren't* represented in the cleaned data by `destatiscleanr`: Every missing value, no matter the reason, is an `NA`.
Possible reasons for missing values:
![](img/missing_values.png)
## More ressources
The [package wiesbaden](https://github.com/sumtxt/wiesbaden) offers a way to get Destatis data directly from the database. ~~Unfortunately, this is a paid service for the main database of Destatis.~~ Destatis offers it API now [as a free service](https://www.destatis.de/DE/PresseService/Presse/Pressemitteilungen/2019/01/PD19_006_p001.html) (See documentation [here](https://www-genesis.destatis.de/genesis/misc/GENESIS-Webservices_Einfuehrung.pdf)). Just like [Regionalstatistik.de](http://regionalstatistik.de) it can be accessed now as a free registered user.
## Wishlist
- more dynamic creation of `column_names` :roll_eyes:
- Clever guessing of year/date column
- ~~Shiny app to offer it non r users~~