{"id":50919510,"url":"https://github.com/ipea/geobr_prep_data","last_synced_at":"2026-06-16T18:31:23.263Z","repository":{"id":308187451,"uuid":"1031925351","full_name":"ipea/geobr_prep_data","owner":"ipea","description":"Repo that prepares the data shared through geobr","archived":false,"fork":false,"pushed_at":"2026-05-18T18:30:51.000Z","size":5267,"stargazers_count":5,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-18T20:33:42.463Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ipea.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-04T14:36:59.000Z","updated_at":"2026-05-18T18:30:55.000Z","dependencies_parsed_at":"2025-08-04T19:15:55.450Z","dependency_job_id":"9dba69ff-770d-43e8-a116-50a9f800d113","html_url":"https://github.com/ipea/geobr_prep_data","commit_stats":null,"previous_names":["ipeagit/geobr_prep_data","ipea/geobr_prep_data"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/ipea/geobr_prep_data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipea%2Fgeobr_prep_data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipea%2Fgeobr_prep_data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipea%2Fgeobr_prep_data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipea%2Fgeobr_prep_data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ipea","download_url":"https://codeload.github.com/ipea/geobr_prep_data/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipea%2Fgeobr_prep_data/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34419046,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-16T02:00:06.860Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-16T18:31:02.363Z","updated_at":"2026-06-16T18:31:23.076Z","avatar_url":"https://github.com/ipea.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Preparing the data for {geobr}\n\nR pipeline that downloads, processes and standardizes Brazilian geospatial\ndatasets for the [`geobr`](https://github.com/ipeaGIT/geobr) package.\n\n**Output:** zstd-compressed [GeoParquet](https://geoparquet.org/) files (with\nspatial metadata via geoarrow) published to GitHub Releases of `ipeaGIT/geobr`\nvia [`piggyback`](https://github.com/ropensci/piggyback).\n\n## Tech stack\n\nR 4.5 · [targets](https://docs.ropensci.org/targets/) ·\n[sf](https://r-spatial.github.io/sf/) ·\n[arrow](https://arrow.apache.org/docs/r/) + [geoarrow](https://github.com/geoarrow/geoarrow-r) ·\n[lwgeom](https://r-spatial.github.io/lwgeom/) ·\n[crew](https://wlandau.github.io/crew/) ·\n[renv](https://rstudio.github.io/renv/) ·\n[piggyback](https://docs.ropensci.org/piggyback/) ·\n[geocodebr](https://github.com/ipeaGIT/geocodebr) ·\n[sfarrow](https://github.com/wcjochem/sfarrow) ·\n[testthat](https://testthat.r-lib.org/)\n\n## Getting started\n\n```r\n# 1. Install locked dependencies\nrenv::restore()\n\n# 2. Run the full pipeline\nlibrary(targets)\ntar_make()\n\n# 3. Visualize the DAG\ntar_visnetwork()\n\n# 4. Check for warnings/errors\ntar_meta(fields = warnings, complete_only = TRUE)\n```\n\n**Requirements:** R \u003e= 4.5, internet connection (downloads from IBGE, DATASUS,\nMMA, FUNAI FTP servers).\n\n## Implemented datasets\n\nTo check what data sets have been implemented already, check [here](https://github.com/ipeaGIT/geobr#available-datasets)\n\n\n**Total: 675 Parquet files (~8.6 GB)**\n\n## Project structure\n\n```\ngeobr_prep_data/\n├── _targets.R                        # Pipeline definition (DAG)\n├── R/\n│   ├── support_harmonize_geobr.R     # Core: harmonization, projection, topology\n│   ├── support_fun.R                 # Helpers: download, unzip, read/merge\n│   ├── upload.R                      # Upload to GitHub Releases via piggyback\n│   └── [dataset].R                   # download_X() + clean_X() per dataset\n├── tests/testthat/                   # Unit tests (testthat, 22 tests)\n├── ainda_sem_targets/                # Legacy scripts (reference only)\n├── data/                             # Output GeoParquets (git-ignored, ~8.6 GB)\n├── renv.lock                         # Locked R dependencies\n├── CLAUDE.md                         # Claude Code project instructions\n└── .claude/                          # Rules, plans, backlog, known issues\n    ├── rules/                        # Column conventions, harmonization guide\n    ├── plans/                        # Implementation plans\n    ├── BACKLOG.md                    # Dataset status tracker\n    └── PROBLEMS.md                   # Known bugs and fixes\n```\n\n## Data standards\n\nAll output Parquets follow these conventions:\n\n- **CRS:** SIRGAS 2000 (EPSG:4674)\n- **Geometry:** `MULTIPOLYGON` (except `POINT` for health_facilities, schools, schools_bi, capitals)\n- **Format:** [GeoParquet](https://geoparquet.org/) with spatial metadata\n  (CRS, geometry type, bbox) via `geoarrow`\n- **Compression:** zstd, level 7\n- **Column order:** `code_X`, `name_X`, `code_state`, `abbrev_state`,\n  `name_state`, `code_region`, `name_region`, `year`, `geometry`\n- **Types:** `code_*` = numeric, `name_*` = character (Title Case),\n  `abbrev_state` = 2-letter uppercase\n\n### Output layout\n\n```\ndata/\n└── [dataset]/\n    └── [year]/\n        ├── [dataset]_[year].parquet              # Full resolution\n        └── [dataset]_[year]_simplified.parquet    # Simplified (100m tolerance)\n```\n\n## Adding a new dataset\n\n1. Create `R/[dataset].R` with `download_X(year)` and `clean_X(raw, year)`\n2. Add 3 targets in `_targets.R` (years, raw, clean)\n3. Add `[dataset]_clean` to the `all_files` target at the end of `_targets.R`\n4. Run `tar_make()` and validate output\n\nSee [`.claude/rules/new-dataset.md`](.claude/rules/new-dataset.md) for the\nfull checklist.\n\n## Running tests\n\n```r\n# From the project root:\nsource(\"tests/testthat.R\")\n```\n\n22 tests covering core harmonization functions (`snake_case_names`,\n`add_state_info`, `add_region_info`, `normalize_sf_geometry`, `validate_geobr`).\n\nThe pipeline also includes a `validation` target that checks all output\nGeoParquets for correct CRS, geometry types, column types, and schema.\n\n## Documentation\n\n| File | Description |\n|------|-------------|\n| [`CLAUDE.md`](CLAUDE.md) | Project instructions and conventions |\n| [`.claude/rules/column-conventions.md`](.claude/rules/column-conventions.md) | Column naming, ordering, types |\n| [`.claude/rules/harmonization.md`](.claude/rules/harmonization.md) | How to use `harmonize_geobr()` |\n| [`.claude/rules/new-dataset.md`](.claude/rules/new-dataset.md) | Checklist for new datasets |\n| [`.claude/BACKLOG.md`](.claude/BACKLOG.md) | Status of all 36 datasets |\n| [`.claude/PROBLEMS.md`](.claude/PROBLEMS.md) | 21 bugs resolved (historical log) |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fipea%2Fgeobr_prep_data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fipea%2Fgeobr_prep_data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fipea%2Fgeobr_prep_data/lists"}