{"id":18484981,"url":"https://github.com/weso/wd2duckdb","last_synced_at":"2025-12-12T12:03:35.548Z","repository":{"id":165737848,"uuid":"611179752","full_name":"weso/wd2duckdb","owner":"weso","description":"Transform a Wikidata JSON dump into a DuckDB database","archived":false,"fork":false,"pushed_at":"2024-05-24T19:05:12.000Z","size":17494,"stargazers_count":4,"open_issues_count":2,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-08-09T20:32:06.553Z","etag":null,"topics":["duckdb","json","rust","wikidata","wikidata-dump"],"latest_commit_sha":null,"homepage":"https://crates.io/crates/wikidata-rs","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/weso.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-03-08T09:35:20.000Z","updated_at":"2024-05-21T09:53:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"64dae687-bb9f-47cb-ae10-dd46e9a6fc28","html_url":"https://github.com/weso/wd2duckdb","commit_stats":null,"previous_names":["angelip2303/wd2duckdb"],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/weso%2Fwd2duckdb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/weso%2Fwd2duckdb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/weso%2Fwd2duckdb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/weso%2Fwd2duckdb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/weso","download_url":"https://codeload.github.com/weso/wd2duckdb/tar.gz/refs/head
s/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223342480,"owners_count":17129848,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["duckdb","json","rust","wikidata","wikidata-dump"],"created_at":"2024-11-06T12:43:49.592Z","updated_at":"2025-12-12T12:03:35.130Z","avatar_url":"https://github.com/weso.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# `wd2duckdb`\r\n\r\n`wd2duckdb` is a tool that transforms\r\n[Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) JSON dumps\r\ninto a fully indexed DuckDB database that is ~80% smaller than the original\r\ndump yet contains most of its information. Note that only the English version of\r\neach Wikidata item is stored. To change the language, please refer to\r\n[this line in the code](https://github.com/angelip2303/wd2duckdb/blob/777f47d4ed386e79dba0d8529fced0efb78c6325/src/main.rs#LL23C1).\r\nThe resulting database enables high-performance queries to be executed on commodity\r\nhardware without the need to install and configure specialized triplestore software. 
\r\nThis project is heavily based on [wd2sql](https://github.com/p-e-w/wd2sql).\r\n\r\n## Installation\r\n\r\nMake sure that you have installed the latest stable version of\r\n[Rust](https://www.rust-lang.org/) (as of May 5, 2023, version 1.69 or\r\nlater), then run:\r\n\r\n```\r\ncargo install wd2duckdb\r\n```\r\n\r\nThis will compile `wd2duckdb` for your native architecture, which improves performance.\r\n\r\n## Usage\r\n\r\n```\r\nwd2duckdb --json \u003cJSON_FILE\u003e --database \u003cDUCKDB_FILE\u003e\r\n```\r\n\r\nUse `-` as `\u003cJSON_FILE\u003e` to read from standard input instead of from a file.\r\nThis makes it possible to build a pipeline that processes JSON data as it is\r\nbeing decompressed, without having to decompress the full dump to disk. For\r\na `.bz2` file, you can use the following command:\r\n\r\n```\r\nbzcat latest-all.json.bz2 | wd2duckdb --json - --database \u003cDUCKDB_FILE\u003e\r\n```\r\n\r\nFor a `.gz` file, you can first decompress the dump to disk and then pass the\r\nresulting `.json` file to `wd2duckdb`:\r\n\r\n```\r\ngunzip latest-all.json.gz\r\nwd2duckdb --json latest-all.json --database \u003cDUCKDB_FILE\u003e\r\n```\r\n\r\nAlternatively, pass `-c` so that `gunzip` writes to standard output; that is, without\r\ncreating a file for the uncompressed `.json`:\r\n\r\n```\r\ngunzip -c latest-all.json.gz | wd2duckdb --json - --database \u003cDUCKDB_FILE\u003e\r\n```\r\n\r\nIf you are working with large dumps where the uncompressed `.json` file size is on\r\nthe order of terabytes, it is best to choose the last option. The `.duckdb` file,\r\nwhich is far more space-efficient, is then created directly from the compressed dump.\r\n\r\n## Database structure\r\n\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"https://github.com/angelip2303/wd2duckdb/assets/65736636/d1380df4-834e-44a6-9b44-b6943ab1afc5\" /\u003e\r\n\u003c/p\u003e\r\n\r\n## Acknowledgments\r\n\r\nWithout the efforts of the countless people who built Wikidata and its\r\ncontents, `wd2duckdb` would be useless. 
It's truly impossible to praise\r\nthis amazing open data project enough.\r\n\r\n## Related projects\r\n\r\n1. [wd2sql](https://github.com/p-e-w/wd2sql) is this project's main \r\ninspiration.\r\n\r\n## License\r\n\r\nCopyright \u0026copy; 2023 Ángel Iglesias Préstamo (\u003cangel.iglesias.prestamo@gmail.com\u003e)\r\n\r\nThis program is free software: you can redistribute it and/or modify\r\nit under the terms of the GNU General Public License as published by\r\nthe Free Software Foundation, either version 3 of the License, or\r\n(at your option) any later version.\r\n\r\nThis program is distributed in the hope that it will be useful,\r\nbut WITHOUT ANY WARRANTY; without even the implied warranty of\r\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\nGNU General Public License for more details.\r\n\r\nYou should have received a copy of the GNU General Public License\r\nalong with this program.  If not, see \u003chttps://www.gnu.org/licenses/\u003e.\r\n\r\n**By contributing to this project, you agree to release your\r\ncontributions under the same license.**\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fweso%2Fwd2duckdb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fweso%2Fwd2duckdb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fweso%2Fwd2duckdb/lists"}