{"id":46963468,"url":"https://github.com/datenoio/internacia-db","last_synced_at":"2026-05-29T08:02:34.613Z","repository":{"id":327867202,"uuid":"1108418551","full_name":"datenoio/internacia-db","owner":"datenoio","description":"Public registry of the intergovernmental organizations, country groups and countries. Available as JSONl, Parquet, YAML and DuckDB database datasets","archived":false,"fork":false,"pushed_at":"2026-05-28T08:18:09.000Z","size":4881,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-28T10:16:40.041Z","etag":null,"topics":["countries","data","datasets","international","international-trade","reference"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datenoio.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-02T12:26:40.000Z","updated_at":"2026-05-28T08:18:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/datenoio/internacia-db","commit_stats":null,"previous_names":["datenoio/internacia-db"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/datenoio/internacia-db","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datenoio%2Finternacia-db","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datenoio%2Finternacia-db/tags","releases_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datenoio%2Finternacia-db/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datenoio%2Finternacia-db/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datenoio","download_url":"https://codeload.github.com/datenoio/internacia-db/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datenoio%2Finternacia-db/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33642318,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-29T02:00:06.066Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["countries","data","datasets","international","international-trade","reference"],"created_at":"2026-03-11T10:02:32.393Z","updated_at":"2026-05-29T08:02:34.608Z","avatar_url":"https://github.com/datenoio.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Internacia Datasets\n\nComprehensive reference datasets of countries, intergovernmental organizations, and country groups. Source YAML files in `data/countries/` and `data/intblocks/` are validated, enriched, and exported to multiple formats in `data/datasets/`. 
The project serves as a data source for the **Dateno** search engine.\n\n## Features\n\n- **Multi-format export**: JSONL, YAML, Parquet, and DuckDB (Zstandard compression, level 22)\n- **Countries quality pipeline**: schema validation, completeness gates, entity status policy, and field-level provenance\n- **Profile enrichment**: population, area, gini, timezones, and native names from World Bank, Wikidata, and IANA tzdata\n- **Build metadata**: `countries.manifest.json` with version, commit, row count, and schema hash\n- **CI validation**: pull-request checks via `.github/workflows/validate.yml`\n- **CLI tools**: Typer-based scripts with tqdm progress bars\n\n## Installation\n\n```bash\npip install -r requirements.txt\n```\n\n## Quick start\n\n```bash\n# Inspect data sources\npython3 scripts/builder.py info\n\n# Validate country YAML (no build)\npython3 scripts/validate_countries.py\n\n# Build all datasets\npython3 scripts/builder.py build\n\n# Build specific formats only\npython3 scripts/builder.py build --formats parquet,duckdb\n```\n\n## Output files\n\nEach build writes to `data/datasets/`:\n\n| File | Description |\n|------|-------------|\n| `countries.jsonl.zst` | Countries (JSONL, zstd) |\n| `countries.yaml.zst` | Countries (YAML, zstd) |\n| `countries.parquet` | Countries (Parquet, zstd) |\n| `countries.manifest.json` | Build metadata (version, commit, row count, schema hash) |\n| `intblocks.jsonl.zst` | International blocks (JSONL, zstd) |\n| `intblocks.yaml.zst` | International blocks (YAML, zstd) |\n| `intblocks.parquet` | International blocks (Parquet, zstd) |\n| `blocktypes.jsonl.zst` | Block types (JSONL, zstd) |\n| `blocktypes.yaml.zst` | Block types (YAML, zstd) |\n| `blocktypes.parquet` | Block types (Parquet, zstd) |\n| `internacia.duckdb` | DuckDB database (`countries`, `intblocks`, `blocktypes` tables) |\n\nCurrent row counts: **252** countries, **1065** intblocks, **85** blocktypes.\n\n## Validation and quality\n\nThe builder runs 
`validate_countries.py` before export. Validation covers:\n\n- JSON Schema conformance (`data/schemas/countries.schema.json`)\n- ISO identifier formats and duplicate detection\n- Completeness thresholds (`data/schemas/countries_completeness.yaml`)\n- Entity status policy (`entity_type`, `code_status`)\n- Intblock cross-references (country `includes` resolve to country sources)\n\n```bash\n# Full validation with JSON report\npython3 scripts/validate_countries.py --report completeness-report.json\n\n# Enrich profile fields from external sources\npython3 scripts/enrich_countries.py\npython3 scripts/enrich_countries.py backfill-provenance\n\n# Apply entity status annotations\npython3 scripts/annotate_entity_status.py\n\n# Audit intblock include name aliases (warn-only)\npython3 scripts/report_country_include_names.py\n\n# Compare manifest to main branch baseline\npython3 scripts/diff_countries_baseline.py\n```\n\nCountry code policy (ISO vs user-assigned, filtering examples): [docs/country-code-policy.md](docs/country-code-policy.md)\n\n## Consumer migration\n\nBreaking and semantic changes in the latest countries schema (see [CHANGELOG.md](CHANGELOG.md)):\n\n- **Population / area / gini**: structured as `{value, year, source, source_id}` — use `.value` for the numeric field.\n- **Borders**: land neighbors as ISO **alpha-3** codes (e.g. 
`CAN`, `MEX`), not alpha-2.\n- **Entity filter**: `code_status == 'official_iso3166_1'` returns **249** current ISO-style records.\n- **Build metadata**: compare `countries.manifest.json` `schema_hash` when upgrading downstream pipelines.\n\n**Pandas example** (structured population):\n\n```python\nimport pandas as pd\n\n# The `.struct` accessor needs Arrow-backed dtypes (pandas >= 2.1)\ndf = pd.read_parquet(\"data/datasets/countries.parquet\", dtype_backend=\"pyarrow\")\npop = df[\"population\"].struct.field(\"value\")\n```\n\n**DuckDB example** (nested intblock translations):\n\n```python\nimport duckdb\n\ncon = duckdb.connect(\"data/datasets/internacia.duckdb\")\ncon.execute(\"\"\"\n    SELECT id, name, t.name AS english_name\n    FROM intblocks, UNNEST(translations) AS t\n    WHERE t.lang = 'en'\n    LIMIT 5\n\"\"\").fetchall()\n```\n\n## Countries schema\n\n252 country and territory records. Key fields:\n\n| Field | Type | Description |\n|-------|------|-------------|\n| `code` | String | ISO 3166-1 alpha-2 code (e.g. `US`) |\n| `entity_type` | String | `sovereign_state`, `dependent_territory`, `historical_entity`, etc. 
|\n| `code_status` | String | `official_iso3166_1`, `user_assigned`, `obsolete` |\n| `recognition_status` | Struct | Optional recognition/dispute metadata |\n| `name` | String | Common name |\n| `iso3code` | String | ISO 3166-1 alpha-3 code |\n| `capital_city` | Struct | `{name, lng, lat}` |\n| `region` | Struct | World Bank region `{id, value}` |\n| `adminregion` | Struct | World Bank admin region `{id, value}` |\n| `incomeLevel` | Struct | World Bank income level `{id, value}` |\n| `lendingType` | Struct | World Bank lending type `{id, value}` |\n| `numeric_code` | String | ISO 3166-1 numeric code |\n| `wikidata_id` | String | Wikidata item ID |\n| `official_name` | String | Official full name |\n| `languages` | List[Struct] | `{code, name, official}` |\n| `currencies` | List[Struct] | `{code, name, symbol}` |\n| `un_member` | Boolean | UN member |\n| `independent` | Boolean | Independent state |\n| `subregion` | String | UN subregion |\n| `continents` | List[String] | Continents |\n| `borders` | List[String] | Land borders as ISO **alpha-3** codes |\n| `landlocked` | Boolean | Landlocked |\n| `tld` | String | Top-level domain |\n| `calling_codes` | List[String] | Telephone codes |\n| `flag_emoji` | String | Flag emoji |\n| `car_side` | String | Driving side |\n| `start_of_week` | String | Start of week |\n| `demonyms` | Struct | `{female, male}` |\n| `m49_code` | String | UN M49 code |\n| `population` | Struct | `{value, year, source, source_id}` |\n| `area` | Struct | Land area sq km `{value, year, source, source_id}` |\n| `gini` | Struct | Gini index `{value, year, source, source_id}` |\n| `timezones` | List[String] | IANA timezone identifiers |\n| `timezone_status` | String | `not_applicable` when no zones apply |\n| `native_names` | Map | Lang code → `{official, common}` |\n| `other_names` | List[Struct] | Translations `{id, name}` |\n| `common_names` | List[String] | Aliases and common names |\n| `provenance` | List[Struct] | Field sourcing `{field, source, 
retrieved_at, url, license}` |\n\nNon-standard codes retained with explicit status: `AN` (obsolete), `JG` (user-assigned grouping), `KV` (user-assigned, disputed).\n\n## International blocks schema\n\n| Field | Type | Description |\n|-------|------|-------------|\n| `id` | String | Unique identifier |\n| `blocktype` | List[String] | Block types |\n| `status` | String | Current status |\n| `name` | String | Name |\n| `languages` | List[String] | Official languages |\n| `links` | List[Struct] | `{url, type}` |\n| `other_names` | List[Struct] | `{id, name}` translations |\n| `founded` | String | Foundation date |\n| `geographic_scope` | String | Scope |\n| `regions` | List[String] | Regions covered |\n| `includes` | List[Struct] | Members `{id, name, type, status, joined, role, note}` — **`id` is authoritative**; `name` is a source label |\n| `membership_count` | Integer | Member count |\n| `wikidata_id` | String | Wikidata item ID |\n| `legal_status` | String | Legal status |\n| `description` | String | Description |\n| `tags` | List[String] | Tags |\n| `topics` | List[Struct] | `{key, name}` |\n| `headquarters` | Struct | `{city, country, coordinates}` |\n| `acronyms` | List[Struct] | `{lang, value}` |\n| `partof` | List[String] | Parent organizations |\n| `dissolved` | String | Dissolution date |\n| `predecessor` | String | Predecessor |\n| `successor` | String | Successor |\n\n## Data sources\n\n**YAML sources**\n\n- `data/countries/*.yaml` — 252 country/territory records\n- `data/intblocks/**/*.yaml` — 1065 international block records\n\n**External enrichment**\n\n- [World Bank](https://data.worldbank.org/) — population, area, gini, income classifications\n- [Wikidata](https://www.wikidata.org/) — entity linking, native names, fallbacks\n- [IANA tzdata](https://data.iana.org/time-zones/) — timezone mapping (`scripts/data/zone1970.tab`)\n\n## Scripts\n\n| Script | Purpose |\n|--------|---------|\n| `scripts/builder.py` | Validate and export datasets |\n| 
`scripts/validate_countries.py` | Country schema, completeness, and cross-dataset checks |\n| `scripts/validate_links.py` | Intblock URL and Wikidata validation |\n| `scripts/enrich_countries.py` | Enrich country profiles; `backfill-provenance` subcommand |\n| `scripts/annotate_entity_status.py` | Set `entity_type` and `code_status` |\n| `scripts/report_country_include_names.py` | Intblock include name alias audit |\n| `scripts/diff_countries_baseline.py` | Manifest diff vs git baseline |\n\n## Notes\n\n- All text files use UTF-8 encoding; generated outputs overwrite existing files.\n- Decompress zstd files: `zstd -d data/datasets/countries.jsonl.zst`\n- Gap analysis research: `dev/research/countries_gaps_,manus_20260528.md`\n\n## Related projects\n\n- [internacia-api](../internacia-api) — REST API\n- [internacia-python](../internacia-python) — Python SDK\n\n## Roadmap\n\n- [x] Python SDK — [internacia-python](../internacia-python)\n- [x] REST API — [internacia-api](../internacia-api)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatenoio%2Finternacia-db","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatenoio%2Finternacia-db","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatenoio%2Finternacia-db/lists"}