{"id":16857667,"url":"https://github.com/brawer/wikidata-qsdump","last_synced_at":"2025-03-18T12:14:53.457Z","repository":{"id":162086185,"uuid":"636700491","full_name":"brawer/wikidata-qsdump","owner":"brawer","description":"experiment for a new dump format for Wikidata","archived":false,"fork":false,"pushed_at":"2023-05-13T02:46:35.000Z","size":164,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-24T18:12:09.072Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brawer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-05T12:51:29.000Z","updated_at":"2023-12-19T08:38:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"7e2c27e1-12d2-4091-806c-f5f8310ec86a","html_url":"https://github.com/brawer/wikidata-qsdump","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brawer%2Fwikidata-qsdump","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brawer%2Fwikidata-qsdump/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brawer%2Fwikidata-qsdump/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brawer%2Fwikidata-qsdump/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/brawer","download_url":"https://codeload.github.com/brawer/wikidata-qs
dump/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244217948,"owners_count":20417677,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T14:09:02.435Z","updated_at":"2025-03-18T12:14:53.432Z","avatar_url":"https://github.com/brawer.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Experiment: New Format for Wikidata Dumps?\n\nThis is an experiment for a simpler, smaller and faster (to decompress)\ndata format for [Wikidata dumps](https://www.wikidata.org/wiki/Wikidata:Database_download).\n\n| Format      |     Size¹ | Tool   | Decompression time² |\n| ----------- | --------: | ------ | ------------------: |\n| `.qs.zst`   |  26.6 GiB | zstd   |           6 minutes |\n| `.qs.bz2`   |  26.3 GiB | lbzip2 |           7 minutes |\n| `.qs.br`    |  25.9 GiB | brotli |          11 minutes |\n| `.qs.gz`    |  38.9 GiB | gzip   |          23 minutes |\n| `.json.zst` |  72.3 GiB | zstd   |          19 minutes |\n| `.json.br`  |  67.8 GiB | brotli |          25 minutes |\n| `.json.bz2` |  75.9 GiB | lbzip2 |          59 minutes |\n| `.json.gz`  | 115.2 GiB | gzip   |          89 minutes |\n| `.qs.bz2`   |  26.3 GiB | pbzip2 |          94 minutes |\n| `.json.bz2` |  75.9 GiB | pbzip2 |         926 minutes |\n\nThe proposed new format,\n[QuickStatements](https://www.wikidata.org/wiki/Help:QuickStatements)\nwith [Zstandard](https://en.wikipedia.org/wiki/Zstd) compression,\ntakes about a third of the current best file size. 
On a typical modern\ncloud server, decompression is about 10 times faster compared to lbzip2\non the current JSON dumps.\n\n\n## Motivation\n\nAs of May 2023, the most compact format for [Wikidata\ndumps](https://dumps.wikimedia.org/wikidatawiki/entities/20230424/) is\nJSON with bzip2 compression.  However, the current JSON syntax is very\nverbose, which makes it slow to process. Another issue is bzip2: since\nits invention 27 years ago, newer algorithms have been designed for\nfast decompression on today’s machines.\n\nAs a frequent user of Wikidata dumps, I got annoyed by the high cost of\nprocessing the current format, and I wondered how much could be gained\nfrom a better format. Specifically, a new format should be significantly\nsmaller in size; much faster to decompress; and easy to understand.\n\nWikidata editors frequently use the [QuickStatements\ntool](https://www.wikidata.org/wiki/Help:QuickStatements) for bulk\nmaintenance. The tool accepts statements in a text syntax that is easy\nto understand and quite compact. I wondered if Wikidata dumps could be\nencoded as QuickStatements, and compressed with a modern algorithm\nsuch as [Zstandard](https://en.wikipedia.org/wiki/Zstd).\n\n\n## Extensions to QuickStatements syntax\n\nNote that the current QuickStatements syntax cannot express all of\nWikidata; the major missing piece is ranking. For this experiment, I\nencoded preferred and deprecated rank with ↑ and ↓ arrows, as in\n`Q12|P9|↑\"foo\"`. All other missing parts are minor and rare, such as\ncoordinates on Venus and Mars; for this experiment, I pretended these\nwere on Earth. To fully encode all of Wikidata as QuickStatements,\nsuitable syntax would need to be defined and properly documented.\nObviously, it would then also make sense to support this new syntax\nin the live QuickStatements tool.\n\nCurrently, QuickStatements does not seem to define an escape mechanism\nfor quote characters. 
In my experiment, I used a Unicode escape sequence\nwhen a quoted string contained a quote, as in `\"foo \\u0022 bar\"`.\n\nA nice property of the current JSON format is that each item is encoded\non a separate line. It might be nice to preserve this property. This would\nneed (small, backwards-compatible) extensions to the QuickStatements syntax:\n(a) allow multiple labels, aliases\nand sitelinks, as in `Q2|Len|\"Earth\"|Aen|\"Planet Earth\"|Lfr|\"Terre\"`;\n(b) allow multiple claims (not just multiple qualifiers) on the same\nline, perhaps with a `!P` construct similar to the existing `!S`.\nThis would also make the format slightly more compact.\n\n## Other issues with Wikidata dumps\n\nIn a new version of Wikidata dumps, I think it would be good to\naddress some other things.\n\n1. Wikidata dumps should be atomic snapshots, taken at a defined point\nin time. Currently, each item is dumped at a slightly different\ntime. This fuzziness makes it difficult to build reliable systems.\nGenerating consistent snapshots should be possible since Wikidata’s\nproduction database contains the edit history; the generator could simply\nignore any changes to the live database that are more recent than\nthe snapshot time.\n\n2. It would be nice if the dump could also include redirects, and indicate\nwhich items have been deleted. For consistency, this should be snapshotted\nat the same point in time as the actual data.\n\n3. Statements should be sorted by subject entity ID. This would\nallow data consumers to build their own data structures (e.g. an LMDB\nB-tree or similar) without having to re-shuffle all of Wikidata.\n\nFor this experiment, I have not bothered with any of this since it does\nnot affect the format. (Actually, sorting as in #3 might slightly\nchange the file size, perhaps making it smaller by a small amount;\nbut the difference is unlikely to be significant). I’m just noting this\nas a wishlist for re-implementing Wikidata dumps.\n\n\n## Footnotes\n\n1. 
Size\n    * `wikidata-20230424-all.json.br`:   72848722597 bytes =  67.8 GiB\n    * `wikidata-20230424-all.json.bz2`:  81539742715 bytes =  75.9 GiB\n    * `wikidata-20230424-all.json.gz`:  123717867013 bytes = 115.2 GiB\n    * `wikidata-20230424-all.json.zst`:  77593744874 bytes =  72.3 GiB\n    * `wikidata-20230424-all.qs.br`:     27787010556 bytes =  25.9 GiB\n    * `wikidata-20230424-all.qs.bz2`:    28229997539 bytes =  26.3 GiB\n    * `wikidata-20230424-all.qs.gz`:     41820873140 bytes =  38.9 GiB\n    * `wikidata-20230424-all.qs.zst`:    28567267401 bytes =  26.6 GiB\n\n2. Decompression time measured on [Hetzner Cloud](https://www.hetzner.com/cloud), Falkenstein data center, virtual machine model CAX41, Ampere ARM64 CPU, 16 cores, 32 GB RAM, Debian GNU/Linux 11 (bullseye), Kernel 5.10.0-21-arm64, data files located on a mounted ext4 volume\n\n    * `time brotli -cd wikidata-20230424-all.json.br \u003e/dev/null`, brotli version 1.0.9 → real 25m1.450s, user 20m6.214s, sys 0m47.106s\n    * `time pbzip2 -cd wikidata-20230424-all.json.bz2 \u003e/dev/null`, parallel pbzip2 version 1.1.13 → real 926m39.401s, user 930m39.828s, sys 3m30.333s\n    * `time lbzcat -cd wikidata-20230424-all.json.bz2 \u003e/dev/null`, lbzip2 version 2.5 → real 59m30.694s, user 943m48.935s, sys 7m30.243s\n    * `time gzip -cd wikidata-20230424-all.json.gz \u003e/dev/null`, gzip version 1.10 → real 88m44.009s, user 86m25.866s, sys 1m18.897s\n    * `time zstdcat wikidata-20230424-all.json.zst \u003e/dev/null`, zstd version 1.4.8 → real 21m46.846s, user 18m53.957s, sys 1m6.578s\n    * `time brotli -cd wikidata-20230424-all.qs.br \u003e/dev/null`, brotli version 1.0.9 → real 10m31.041s, user 8m25.385s, sys 0m17.338s\n    * `time pbzip2 -cd wikidata-20230424-all.qs.bz2 \u003e/dev/null`, parallel pbzip2 version 1.1.13 → real 93m36.174s, user 94m50.751s, sys 0m40.565s\n    * `time lbzcat -cd wikidata-20230424-all.qs.bz2 \u003e/dev/null`, lbzip2 version 2.5 → real 7m3.783s, user 109m57.272s, sys 
2m19.303s\n    * `time gzip -cd wikidata-20230424-all.qs.gz \u003e/dev/null`, gzip version 1.10 → real 22m48.054s, user 22m8.762s, sys 0m21.047s\n    * `time zstdcat wikidata-20230424-all.qs.zst \u003e/dev/null`, zstd version 1.4.8 → run 1: real 5m58.011s, user 5m51.994s, sys 0m5.996s;\n    run 2: real 5m55.021s, user 5m47.642s, sys 0m7.364s;\n    run 3: real 5m53.228s, user 5m47.401s, sys 0m5.820s;\n    average: real 5m55.420s, user 5m49.012s, sys 0m6.393s\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrawer%2Fwikidata-qsdump","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrawer%2Fwikidata-qsdump","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrawer%2Fwikidata-qsdump/lists"}