{"id":18010175,"url":"https://github.com/p-e-w/wd2sql","last_synced_at":"2025-09-10T14:22:35.142Z","repository":{"id":63939156,"uuid":"571125378","full_name":"p-e-w/wd2sql","owner":"p-e-w","description":"Transform a Wikidata JSON dump into an SQLite database","archived":false,"fork":false,"pushed_at":"2023-03-20T09:04:11.000Z","size":52,"stargazers_count":9,"open_issues_count":4,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-16T17:33:28.344Z","etag":null,"topics":["json","sqlite","wikidata"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/p-e-w.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-27T08:43:55.000Z","updated_at":"2024-11-12T01:08:39.000Z","dependencies_parsed_at":"2024-10-30T02:34:03.375Z","dependency_job_id":null,"html_url":"https://github.com/p-e-w/wd2sql","commit_stats":{"total_commits":2,"total_committers":1,"mean_commits":2.0,"dds":0.0,"last_synced_commit":"9791d5af03d52c575a4314c9f45dfac37c16e168"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/p-e-w%2Fwd2sql","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/p-e-w%2Fwd2sql/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/p-e-w%2Fwd2sql/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/p-e-w%2Fwd2sql/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/p-e-w","download_url":"https://codeload.github.com/p-e-w/wd2sql/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235374607,"owners_count":18979734,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["json","sqlite","wikidata"],"created_at":"2024-10-30T02:13:14.661Z","updated_at":"2025-01-24T01:44:04.080Z","avatar_url":"https://github.com/p-e-w.png","language":"Rust","readme":"# `wd2sql`\n\n`wd2sql` is a tool that transforms a\n[Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) JSON dump\ninto a fully indexed SQLite database that is 90% smaller than the original\ndump, yet contains most of its information. The resulting database enables\nhigh-performance queries to be executed on commodity hardware without the\nneed to install and configure specialized triplestore software. Most\nprogramming languages have excellent support for SQLite, and lots of relevant\ntools exist. I believe this to be by far the easiest option for working with\na local copy of Wikidata that is currently available.\n\n`wd2sql` is *much* faster than most other dump processing tools. In fact,\nit can usually process JSON data as fast as `bzip2` can decompress it.\nIt uses native code, SIMD-accelerated JSON parsing, an optimized allocator,\nbatched transactions, prepared statements, and other SQLite optimizations\nto achieve that performance. On a 2015 consumer laptop, it processes a full\ndump of Wikidata (1.5 Terabytes) in less than 12 hours, using only around\n10 Megabytes of RAM.\n\n`wd2sql` is **not**\n\n* a general-purpose triplestore. It makes assumptions about the structure\n  of the dump that are specific to Wikidata, and will fail when run on other\n  semantic databases.\n* a complete replacement for traditional datastores such as Wikibase.\n  In particular, the SQLite database currently does not contain sitelinks,\n  aliases, qualifiers, references, non-English labels and descriptions,\n  and a few other pieces of information that are present in the dumps.\n* in any way affiliated with, or endorsed by, the Wikidata project and/or\n  the Wikimedia Foundation.\n\n\n## Installation\n\nInstall [Rust](https://www.rust-lang.org/) 1.61 or later, then run\n\n```\ncargo install wd2sql\n```\n\nThis will compile `wd2sql` for your native CPU architecture, which is crucial\nfor performance.\n\nNote that while `wd2sql` *should* work on all platforms, I have only tested\nit on Linux.\n\n\n## Usage\n\n```\nwd2sql \u003cJSON_FILE\u003e \u003cSQLITE_FILE\u003e\n```\n\nUse `-` as `\u003cJSON_FILE\u003e` to read from standard input instead of from a file.\nThis makes it possible to build a pipeline that processes JSON data as it is\nbeing decompressed, without having to decompress the full dump to disk:\n\n```\nbzcat latest-all.json.bz2 | wd2sql - output.db\n```\n\n\n## Database structure\n\n### IDs\n\nWikidata IDs consist of a type prefix (`Q`/`P`/`L`) plus an integer.\n`wd2sql` encodes both of these as a single 32-bit integer (64-bit for\nform and sense IDs):\n\n* **Entity IDs** are simply represented as the integer part of their ID.\n  For example, `Q42` becomes `42`.\n* **Property IDs** are represented as the integer part of their ID,\n  plus 1 billion. For example, `P31` becomes `1000000031`.\n* **Lexeme IDs** are represented as the integer part of their ID,\n  plus 2 billion. For example, `L234` becomes `2000000234`.\n* **Form IDs** are represented as the encoded ID of their associated lexeme\n  (see above), plus 100 billion times their integer form ID.\n  For example, `L99-F2` becomes `202000000099`.\n* **Sense IDs** are represented as the encoded ID of their associated lexeme\n  (see above), plus 100 billion times their integer sense ID, plus 10 billion.\n  For example, `L99-S1` becomes `112000000099`.\n\nThis encoding is simple and compact, and can be easily applied both\nautomatically by algorithms, and manually by humans.\n\n### Tables\n\nIn all tables, the `id` column contains the Wikidata ID of the subject entity,\nencoded as described above. The following tables are generated:\n\n* `meta`, which contains the English `label` and `description` for each entity,\n  or `NULL` if the entity doesn't have an English label or description.\n* `string`, `entity`, `coordinates`, `quantity`, and `time`, which contain\n  the values of claims associated with each entity. The table in which an\n  individual claim value is stored corresponds to the property's\n  [value type](https://www.wikidata.org/wiki/Special:ListDatatypes),\n  and the property is identified by the `property_id` column.\n* `none` and `unknown`, which contain `id`/`property_id` pairs identifying\n  claims whose value is \"no value\" and \"unknown value\", respectively.\n\n### Example: Finding red fruits\n\nFirst, we need to obtain the IDs of the relevant entities:\n\n```\nsqlite\u003e SELECT * FROM meta WHERE label = 'red';\n\nid         label  description\n---------  -----  ----------------------------------------------------------\n17126729   red    eye color\n101063203  red    2018 video game by Bart Bonte\n3142       red    color\n29713895   red    genetic element in the species Drosophila melanogaster\n29714596   red    protein-coding gene in the species Drosophila melanogaster\n```\n\nFrom these results, we can see that the entity we are interested in\n(the color red) has ID `3142`. Repeating this procedure reveals that\n\"fruit (food)\" has ID `3314483`, and the properties \"subclass of\" and\n\"color (of subject)\" have IDs `1000000279` and `1000000462`, respectively.\n\nBoth \"red\" and \"fruit\" are entities, so claims about them can be found\nin the table `entity`. We can now easily construct a query that returns\nthe desired information:\n\n```\nsqlite\u003e SELECT * FROM meta WHERE\n   ...\u003e id IN (SELECT id FROM entity WHERE property_id = 1000000462 AND entity_id = 3142)\n   ...\u003e AND id IN (SELECT id FROM entity WHERE property_id = 1000000279 AND entity_id = 3314483);\n\nid        label        description\n--------  -----------  --------------------------------------------------------------------------------------------------------\n89        apple        fruit of the apple tree\n196       cherry       fruit of the cherry tree\n503       banana       elongated, edible fruit produced by several kinds of large herbaceous flowering plants in the genus Musa\n2746643   fig          edible fruit of Ficus carica\n13202263  peach        fruit, use Q13189 for the species\n13222088  pomegranate  fruit of Punica granatum\n```\n\nAll of these queries have sub-second execution times, and the results\nare identical to those that can be obtained with the SPARQL query\n\n```\nSELECT ?item ?itemLabel\nWHERE\n{\n  ?item wdt:P462 wd:Q3142.\n  ?item wdt:P279 wd:Q3314483.\n  SERVICE wikibase:label { bd:serviceParam wikibase:language \"en\". }\n}\n```\n\nfrom the Wikidata Query Service.\n\n\n## Acknowledgments\n\n`wd2sql` depends on the crates\n[`lazy_static`](https://github.com/rust-lang-nursery/lazy-static.rs),\n[`clap`](https://github.com/clap-rs/clap),\n[`rusqlite`](https://github.com/rusqlite/rusqlite),\n[`simd-json`](https://github.com/simd-lite/simd-json),\n[`wikidata`](https://github.com/Smittyvb/wikidata),\n[`chrono`](https://github.com/chronotope/chrono),\n[`humansize`](https://github.com/LeopoldArkham/humansize),\n[`humantime`](https://github.com/tailhook/humantime),\nand [`jemallocator`](https://github.com/tikv/jemallocator).\n\nWithout the efforts of the countless people who built Wikidata and its\ncontents, `wd2sql` would be useless. It's truly impossible to praise\nthis amazing open data project enough.\n\n\n## Related projects\n\n[import-wikidata-dump-to-couchdb](https://github.com/maxlath/import-wikidata-dump-to-couchdb)\nis a tool that transfers Wikidata dumps to a CouchDB document database.\n\n[Knowledge Graph Toolkit](https://github.com/usc-isi-i2/kgtk) (KGTK)\nis a (much more comprehensive) system for working with semantic data,\nwhich includes functionality for importing Wikidata dumps.\n\n[dumpster-dive](https://github.com/spencermountain/dumpster-dive)\nis a conceptually similar tool that parses *Wikipedia* dumps and\nstores the result in a MongoDB database.\n\n\n## License\n\nCopyright \u0026copy; 2022  Philipp Emanuel Weidmann (\u003cpew@worldwidemann.com\u003e)\n\nThis program is free software: you can redistribute it and/or modify\nit under the terms of the GNU General Public License as published by\nthe Free Software Foundation, either version 3 of the License, or\n(at your option) any later version.\n\nThis program is distributed in the hope that it will be useful,\nbut WITHOUT ANY WARRANTY; without even the implied warranty of\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\nGNU General Public License for more details.\n\nYou should have received a copy of the GNU General Public License\nalong with this program.  If not, see \u003chttps://www.gnu.org/licenses/\u003e.\n\n**By contributing to this project, you agree to release your\ncontributions under the same license.**\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fp-e-w%2Fwd2sql","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fp-e-w%2Fwd2sql","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fp-e-w%2Fwd2sql/lists"}