https://github.com/ipea/geobr_prep_data

Repo that prepares the data shared through geobr

Host: GitHub
URL: https://github.com/ipea/geobr_prep_data
Owner: ipea
Created: 2025-08-04T14:36:59.000Z (11 months ago)
Default Branch: main
Last Pushed: 2026-05-18T18:30:51.000Z (about 1 month ago)
Last Synced: 2026-05-18T20:33:42.463Z (about 1 month ago)
Language: HTML
Size: 5.02 MB
Stars: 5
Watchers: 0
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md

# Preparing the data for {geobr}

R pipeline that downloads, processes, and standardizes Brazilian geospatial
datasets for the [`geobr`](https://github.com/ipeaGIT/geobr) package.

**Output:** zstd-compressed [GeoParquet](https://geoparquet.org/) files (with
spatial metadata via geoarrow) published to GitHub Releases of `ipeaGIT/geobr`
via [`piggyback`](https://github.com/ropensci/piggyback).

## Tech stack

R 4.5 · [targets](https://docs.ropensci.org/targets/) ·
[sf](https://r-spatial.github.io/sf/) ·
[arrow](https://arrow.apache.org/docs/r/) + [geoarrow](https://github.com/geoarrow/geoarrow-r) ·
[lwgeom](https://r-spatial.github.io/lwgeom/) ·
[crew](https://wlandau.github.io/crew/) ·
[renv](https://rstudio.github.io/renv/) ·
[piggyback](https://docs.ropensci.org/piggyback/) ·
[geocodebr](https://github.com/ipeaGIT/geocodebr) ·
[sfarrow](https://github.com/wcjochem/sfarrow) ·
[testthat](https://testthat.r-lib.org/)

## Getting started

```r
# 1. Install locked dependencies
renv::restore()

# 2. Run the full pipeline
library(targets)
tar_make()

# 3. Visualize the DAG
tar_visnetwork()

# 4. Check for warnings/errors
tar_meta(fields = warnings, complete_only = TRUE)
```

**Requirements:** R >= 4.5 and an internet connection (the pipeline downloads
from IBGE, DATASUS, MMA, and FUNAI FTP servers).

## Implemented datasets

For the datasets already implemented, see the [list of available datasets](https://github.com/ipeaGIT/geobr#available-datasets) in the geobr README.

**Total: 675 Parquet files (~8.6 GB)**

## Project structure

```
geobr_prep_data/
├── _targets.R                        # Pipeline definition (DAG)
├── R/
│   ├── support_harmonize_geobr.R     # Core: harmonization, projection, topology
│   ├── support_fun.R                 # Helpers: download, unzip, read/merge
│   ├── upload.R                      # Upload to GitHub Releases via piggyback
│   └── [dataset].R                   # download_X() + clean_X() per dataset
├── tests/testthat/                   # Unit tests (testthat, 22 tests)
├── ainda_sem_targets/                # Legacy scripts (reference only)
├── data/                             # Output GeoParquets (git-ignored, ~8.6 GB)
├── renv.lock                         # Locked R dependencies
├── CLAUDE.md                         # Claude Code project instructions
└── .claude/                          # Rules, plans, backlog, known issues
    ├── rules/                        # Column conventions, harmonization guide
    ├── plans/                        # Implementation plans
    ├── BACKLOG.md                    # Dataset status tracker
    └── PROBLEMS.md                   # Known bugs and fixes
```

## Data standards

All output Parquets follow these conventions:

- **CRS:** SIRGAS 2000 (EPSG:4674)
- **Geometry:** `MULTIPOLYGON` (except `POINT` for health_facilities, schools, schools_bi, capitals)
- **Format:** [GeoParquet](https://geoparquet.org/) with spatial metadata
  (CRS, geometry type, bbox) via `geoarrow`
- **Compression:** zstd, level 7
- **Column order:** `code_X`, `name_X`, `code_state`, `abbrev_state`,
  `name_state`, `code_region`, `name_region`, `year`, `geometry`
- **Types:** `code_*` = numeric, `name_*` = character (Title Case),
  `abbrev_state` = 2-letter uppercase
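
The type and ordering conventions above can be checked mechanically. The following is a minimal base-R sketch, not the pipeline's own `validate_geobr()`: the function name `check_geobr_columns()` and the toy data frame are hypothetical, and the `geometry` column is only a placeholder rather than a real `sf` geometry.

```r
# Hypothetical checker for the column conventions listed above (base R only).
# Returns a character vector of problems; an empty vector means no issue found.
check_geobr_columns <- function(df) {
  cols <- names(df)
  problems <- character(0)

  # `code_*` columns must be numeric
  for (col in grep("^code_", cols, value = TRUE)) {
    if (!is.numeric(df[[col]])) {
      problems <- c(problems, paste0(col, " is not numeric"))
    }
  }

  # `name_*` columns must be character
  for (col in grep("^name_", cols, value = TRUE)) {
    if (!is.character(df[[col]])) {
      problems <- c(problems, paste0(col, " is not character"))
    }
  }

  # `abbrev_state` must be exactly two uppercase letters
  if ("abbrev_state" %in% cols &&
      !all(grepl("^[A-Z]{2}$", df[["abbrev_state"]]))) {
    problems <- c(problems, "abbrev_state is not 2-letter uppercase")
  }

  # `geometry` must be the last column
  if (!identical(cols[length(cols)], "geometry")) {
    problems <- c(problems, "geometry is not the last column")
  }

  problems
}

# Toy row for illustration; the geometry value is a placeholder.
example <- data.frame(
  code_muni = 3304557,
  name_muni = "Rio de Janeiro",
  code_state = 33,
  abbrev_state = "RJ",
  name_state = "Rio de Janeiro",
  code_region = 3,
  name_region = "Sudeste",
  year = 2022,
  geometry = NA,
  stringsAsFactors = FALSE
)

problems <- check_geobr_columns(example)
```

Returning a vector of messages instead of stopping at the first failure makes it easier to report every violation for a file in one pass.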

### Output layout

```
data/
└── [dataset]/
    └── [year]/
        ├── [dataset]_[year].parquet              # Full resolution
        └── [dataset]_[year]_simplified.parquet    # Simplified (100m tolerance)
```
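
As an illustration of the layout above, the two paths for a given dataset and year can be built as follows. This is a sketch only: `geobr_output_paths()` is a hypothetical helper, not a function from this repository, and `"municipality"` is just an example dataset name.

```r
# Hypothetical helper: builds the full-resolution and simplified output paths
# for one dataset/year, following the directory layout shown above.
geobr_output_paths <- function(dataset, year, root = "data") {
  dir <- file.path(root, dataset, year)
  c(
    full = file.path(dir, paste0(dataset, "_", year, ".parquet")),
    simplified = file.path(dir, paste0(dataset, "_", year, "_simplified.parquet"))
  )
}

paths <- geobr_output_paths("municipality", 2022)
```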

## Adding a new dataset

1. Create `R/[dataset].R` with `download_X(year)` and `clean_X(raw, year)`
2. Add 3 targets in `_targets.R` (years, raw, clean)
3. Add `[dataset]_clean` to the `all_files` target at the end of `_targets.R`
4. Run `tar_make()` and validate output

See [`.claude/rules/new-dataset.md`](.claude/rules/new-dataset.md) for the
full checklist.
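
Steps 1 and 2 might look roughly like the sketch below for a hypothetical dataset called `example`. Everything here is an assumption made for illustration: the target names, the years, the use of dynamic branching via `pattern = map()`, and the stub function bodies. The repository's actual `_targets.R` may be organized differently, so treat the checklist linked above as the authority.

```r
library(targets)

# Step 1 (would live in R/example.R): hypothetical stubs for the two functions.
download_example <- function(year) {
  stop("download logic for ", year, " goes here")
}
clean_example <- function(raw, year) {
  stop("cleaning logic for ", year, " goes here")
}

# Step 2: three targets (years, raw, clean). Creating the target objects does
# not run the commands; they only execute when tar_make() builds the pipeline.
example_targets <- list(
  # Years to process (illustrative values)
  tar_target(example_years, c(2010, 2022)),
  # One raw download per year
  tar_target(
    example_raw,
    download_example(example_years),
    pattern = map(example_years)
  ),
  # One cleaned output per raw/year pair
  tar_target(
    example_clean,
    clean_example(example_raw, example_years),
    pattern = map(example_raw, example_years)
  )
)
```

In a real `_targets.R`, these target objects would be part of the list the script ends with, and `example_clean` would then be referenced from `all_files` as described in step 3.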

## Running tests

```r
# From the project root:
source("tests/testthat.R")
```

The suite has 22 tests covering the core harmonization functions
(`snake_case_names`, `add_state_info`, `add_region_info`,
`normalize_sf_geometry`, `validate_geobr`).

The pipeline also includes a `validation` target that checks all output
GeoParquets for correct CRS, geometry types, column types, and schema.

## Documentation

| File | Description |
|------|-------------|
| [`CLAUDE.md`](CLAUDE.md) | Project instructions and conventions |
| [`.claude/rules/column-conventions.md`](.claude/rules/column-conventions.md) | Column naming, ordering, types |
| [`.claude/rules/harmonization.md`](.claude/rules/harmonization.md) | How to use `harmonize_geobr()` |
| [`.claude/rules/new-dataset.md`](.claude/rules/new-dataset.md) | Checklist for new datasets |
| [`.claude/BACKLOG.md`](.claude/BACKLOG.md) | Status of all 36 datasets |
| [`.claude/PROBLEMS.md`](.claude/PROBLEMS.md) | 21 bugs resolved (historical log) |
