{"id":13559146,"url":"https://github.com/ipfs/distributed-wikipedia-mirror","last_synced_at":"2025-04-12T21:36:30.655Z","repository":{"id":20450502,"uuid":"89945298","full_name":"ipfs/distributed-wikipedia-mirror","owner":"ipfs","description":"Putting Wikipedia Snapshots on IPFS","archived":false,"fork":false,"pushed_at":"2024-08-19T09:41:04.000Z","size":5522,"stargazers_count":645,"open_issues_count":42,"forks_count":58,"subscribers_count":46,"default_branch":"main","last_synced_at":"2025-04-04T01:08:23.954Z","etag":null,"topics":["decentralized","distributed","ipfs","p2p","peer-to-peer","wikipedia"],"latest_commit_sha":null,"homepage":"https://github.com/ipfs/distributed-wikipedia-mirror#readme","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ipfs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-05-01T17:22:00.000Z","updated_at":"2025-04-03T02:28:40.000Z","dependencies_parsed_at":"2024-09-24T13:15:05.978Z","dependency_job_id":"fdbe41e0-0868-426a-9dde-4e89493d0a01","html_url":"https://github.com/ipfs/distributed-wikipedia-mirror","commit_stats":{"total_commits":166,"total_committers":13,"mean_commits":12.76923076923077,"dds":0.6506024096385542,"last_synced_commit":"00dd16d5725864fe798921bb70d190662682ef4f"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipfs%2Fdistributed-wikipedia-mirror","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipfs%2Fdistributed-wikipedia-mirror/tags","r
eleases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipfs%2Fdistributed-wikipedia-mirror/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipfs%2Fdistributed-wikipedia-mirror/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ipfs","download_url":"https://codeload.github.com/ipfs/distributed-wikipedia-mirror/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248637589,"owners_count":21137534,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["decentralized","distributed","ipfs","p2p","peer-to-peer","wikipedia"],"created_at":"2024-08-01T12:05:22.589Z","updated_at":"2025-04-12T21:36:30.635Z","avatar_url":"https://github.com/ipfs.png","language":"TypeScript","funding_links":[],"categories":["TypeScript","ipfs"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://bafybeia6plrlomsxobezyatrbie3f7rgucidbomfeuyv6lcqhv3pdc24qi.ipfs.dweb.link/?filename=wikipedia-on-ipfs.jpg\" width=\"40%\" /\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eDistributed Wikipedia Mirror Project\u003c/h1\u003e\n\u003cp align=\"center\"\u003e\nPutting Wikipedia Snapshots on IPFS and working towards making it fully read-write.\n\u003cbr /\u003e\n\u003cbr /\u003e\n\n## Existing Mirrors\n\nThere are various ways one can access the mirrors: through a [DNSLink](https://docs.ipfs.tech/concepts/glossary/#dnslink), public [gateway](https://docs.ipfs.tech/concepts/glossary/#gateway) or directly with a 
[CID](https://docs.ipfs.tech/concepts/glossary/#cid). \n\nYou can [read all about the available methods here](https://blog.ipfs.tech/2021-05-31-distributed-wikipedia-mirror-update/#improved-access-to-wikipedia-mirrors).\n\n### DNSLinks\n\n- https://en.wikipedia-on-ipfs.org\n- https://tr.wikipedia-on-ipfs.org\n- https://my.wikipedia-on-ipfs.org\n- https://ar.wikipedia-on-ipfs.org\n- https://zh.wikipedia-on-ipfs.org\n- https://uk.wikipedia-on-ipfs.org\n- https://ru.wikipedia-on-ipfs.org\n- https://fa.wikipedia-on-ipfs.org\n\n### CIDs\n\nThe latest CIDs that the DNSLinks point at can be found in [snapshot-hashes.yml](snapshot-hashes.yml).\n\n---\n\nEach mirror has a link to the original [Kiwix](https://kiwix.org) ZIM archive in the footer. It can be downloaded and opened offline with the [Kiwix Reader](https://www.kiwix.org/en/download/).\n\n## Table of Contents\n\n- [Purpose](#purpose)\n- [How to add new Wikipedia snapshots to IPFS](#how-to-add-new-wikipedia-snapshots-to-ipfs)\n  - [Manual build](#manual-build)\n  - [Docker](#docker-build)\n- [How to help](#how-to-help)\n  - [Cohost a lazy copy](#cohost-a-lazy-copy)\n  - [Cohost a full copy](#cohost-a-full-copy)\n\n## Purpose\n\n“We believe that information—knowledge—makes the world better. That when we ask questions, get the facts, and are able to understand all perspectives on an issue, it allows us to build the foundation for a more just and tolerant society”\n-- Katherine Maher, Executive Director of the Wikimedia Foundation\n\n## Wikipedia on IPFS -- Background\n\n### What does it mean to put Wikipedia on IPFS?\n\nThe idea of putting Wikipedia on IPFS has been around for a while. Every few months or so someone revives the threads. 
You can find such discussions in [this GitHub issue about archiving Wikipedia](https://github.com/ipfs/archives/issues/20), [this issue about possible integrations with Wikipedia](https://github.com/ipfs/notes/issues/46), and [this proposal for a new project](https://github.com/ipfs/notes/issues/47#issuecomment-140587530).\n\nWe have two consecutive goals regarding Wikipedia on IPFS: Our first goal is to create periodic read-only snapshots of Wikipedia. A second goal will be to create a full-fledged read-write version of Wikipedia. This second goal would connect with the Wikimedia Foundation’s bigger, longer-running conversation about decentralizing Wikipedia, which you can read about at https://strategy.wikimedia.org/wiki/Proposal:Distributed_Wikipedia\n\n### (Goal 1) Read-Only Wikipedia on IPFS\n\nThe easy way to get Wikipedia content on IPFS is to periodically -- say every week -- take snapshots of all the content and add it to IPFS. That way the majority of Wikipedia users -- who only read Wikipedia and don’t edit -- could use all the information on Wikipedia with all the benefits of IPFS. Users couldn't edit it, but they could download and archive swaths of articles, or even the whole thing. People could serve it to each other peer-to-peer, reducing the bandwidth load on Wikipedia servers. People could even distribute it to each other in closed, censored, or resource-constrained networks -- with IPFS, peers do not need to be connected to the original source of the content; being connected to anyone who has the content is enough. Effectively, the content can jump from computer to computer in a peer-to-peer way, and avoid having to connect to the content source or even the internet backbone. 
We've been in discussions with many groups about the potential of this kind of thing, and how it could help billions of people around the world to access information better -- either free of censorship, or circumventing serious bandwidth or latency constraints.\n\nSo far, we have achieved part of this goal: we have static snapshots of all of Wikipedia on IPFS. This is already a huge result that will help people access, keep, archive, cite, and distribute lots of content. In particular, we hope that this distribution helps people in Turkey, who find themselves in a tough situation. We are still working out a process to continue updating these snapshots, and we hope to have someone at Wikimedia in the loop as they are the authoritative source of the content. **If you could help with this, please get in touch with us at `wikipedia-project \u003cAT\u003e ipfs.io`**\n\n### (Goal 2) Fully Read-Write Wikipedia on IPFS\n\nThe long-term goal is to get the full-fledged read-write Wikipedia to work on top of IPFS. This is much more difficult because for a read-write application like Wikipedia to leverage the distributed nature of IPFS, we need to change how the application writes data. A read-write Wikipedia on IPFS could be completely decentralized and extremely difficult to censor. In addition to all the benefits of the static version above, the users of a read-write Wikipedia on IPFS could write content from anywhere and publish it, even without being directly connected to any wikipedia.org servers. There would be automatic version control and version history archiving. 
We could allow people to view, edit, and publish in completely encrypted contexts, which is important to people in highly repressive regions of the world.\n\nA full read-write version (2) would require a strong collaboration with Wikipedia.org itself, and finishing work on important dynamic content challenges -- we are working on all the technology (2) needs, but it's not ready for prime-time yet. We will update when it is.\n\n# How to add new Wikipedia snapshots to IPFS\n\nThe process can be nearly fully automated; however, it consists of many stages,\nand understanding what happens during each stage is paramount if the ZIM format\nchanges and our build toolchain needs debugging and updating.\n\n- [Manual build](#manual-build) is useful in debug situations, when a specific stage needs to be executed multiple times to fix a bug.\n  - [mirrorzim.sh](#mirrorzimsh) automates some steps for QA purposes and ad-hoc experimentation\n\u003c!--\n- [Docker build](#docker-build) is fully automated blackbox which takes ZIM file and produces CID and `IPFS_PATH` with datastore.\n--\u003e\n\n**Note: This is a work in progress.** We intend to make it easy for anyone to\ncreate their own Wikipedia snapshots and add them to IPFS, making sure those\nbuilds are deterministic and auditable, but our first emphasis has been to get\nthe initial snapshots onto the network. This means some of the steps aren't as\neasy as we want them to be. 
If you run into trouble, seek help by opening a GitHub\nissue, commenting in [chat](https://docs.ipfs.tech/community/#chat), or posting a thread on\n[https://discuss.ipfs.tech](https://discuss.ipfs.tech/c/help/13).\n\n## Manual build\n\nIf you would like to create an updated Wikipedia snapshot on IPFS, you can follow these steps.\n\n\n### Step 0: Clone this repository\n\nAll commands are assumed to be run inside a clone of this repository.\n\nClone the distributed-wikipedia-mirror git repository\n\n```sh\n$ git clone https://github.com/ipfs/distributed-wikipedia-mirror.git\n```\n\nthen `cd` into that directory\n\n```sh\n$ cd distributed-wikipedia-mirror\n```\n\n### Step 1: Install dependencies\n\n`Node` and `yarn` are required. On Mac OS X you will need `sha256sum`, available in coreutils.\n\nInstall the node dependencies:\n\n```sh\n$ yarn\n```\n\nThen, download the latest [zim-tools](https://download.openzim.org/release/zim-tools/) and add `zimdump` to your `PATH`.\nThis tool is necessary for unpacking ZIM.\n\n### Step 2: Configure your IPFS Node\n\nIt is advised to use a separate IPFS node for this:\n\n```console\n$ export IPFS_PATH=/path/to/IPFS_PATH_WIKIPEDIA_MIRROR\n$ ipfs init -p server,local-discovery,flatfs,randomports --empty-repo\n```\n\n#### Tune DHT for speed\n\nWikipedia has a lot of blocks. To publish them as fast as possible,\nenable [Accelerated DHT Client](https://github.com/ipfs/go-ipfs/blob/master/docs/experimental-features.md#accelerated-dht-client):\n\n```console\n$ ipfs config --json Experimental.AcceleratedDHTClient true\n```\n\n#### Tune datastore for speed\n\nMake sure the repo uses `flatfs` with `sync` set to `false`:\n\n```console\n$ ipfs config --json 'Datastore.Spec.mounts' \"$(ipfs config 'Datastore.Spec.mounts' | jq -c '.[0].child.sync=false')\"\n```\n\n**NOTE:** While the badgerv1 datastore is faster in some configurations, we choose to avoid using it with bigger builds like English because of [memory issues due to the number of 
files](https://github.com/ipfs/distributed-wikipedia-mirror/issues/85). A potential workaround is to use [`filestore`](https://github.com/ipfs/go-ipfs/blob/master/docs/experimental-features.md#ipfs-filestore), which avoids duplicating data and reuses unpacked files as-is.\n\n#### HAMT sharding\n\nMake sure you use go-ipfs 0.12 or later, which automatically shards big directories.\n\n### Step 3: Download the latest snapshot from kiwix.org\n\nZIM files are available at https://download.kiwix.org/zim/wikipedia/\nMake sure you download `_all_maxi_` snapshots, as those include images.\n\nTo automate this, you can also use the `getzim.sh` script:\n\nFirst, download the latest wiki lists using `bash ./tools/getzim.sh cache_update`\n\nAfter that, create a download command using `bash ./tools/getzim.sh choose`. It should print an executable command, e.g.\n\n```sh\nDownload command:\n    $ ./tools/getzim.sh download wikipedia wikipedia tr all maxi latest\n```\n\nRunning the command will download the chosen ZIM file to the `./snapshots` directory.\n\n\n\n### Step 4: Unpack the ZIM snapshot\n\nUnpack the ZIM snapshot using `zimdump`:\n\n```sh\n$ zimdump dump ./snapshots/wikipedia_tr_all_maxi_2021-01.zim --dir ./tmp/wikipedia_tr_all_maxi_2021-01\n```\n\n\u003e ### ℹ️ ZIM's main page\n\u003e\n\u003e Each ZIM file has a \"main page\" attribute which defines the landing page set for the ZIM archive.\n\u003e It is often different from the \"main page\" of upstream Wikipedia.\n\u003e The Kiwix main page needs to be passed in the next step, so until there is an automated way to determine the \"main page\" of a ZIM, you need to open the ZIM in Kiwix reader and eyeball the name of the landing page.\n\n### Step 5: Convert the unpacked zim directory to a website with mirror info\n\nIMPORTANT: The snapshots must say who disseminated them. 
This effort to mirror Wikipedia snapshots is not affiliated with the Wikimedia Foundation and is not connected to the volunteers whose contributions are contained in the snapshots. The snapshots must include information explaining that they were created and disseminated by independent parties, not by Wikipedia.\n\nThe conversion to a working website and the appending of necessary information is done by the node program under `./bin/run`.\n\n```sh\n$ node ./bin/run --help\n```\n\nThe program requires the main page names of both the ZIM and the online version as inputs. For instance, the ZIM file for Turkish Wikipedia has a main page of `Kullanıcı:The_other_Kiwix_guy/Landing` but `https://tr.wikipedia.org` uses `Anasayfa` as the main page. Both must be passed to the node script.\n\nTo determine the original main page, use `./tools/find_main_page_name.sh`:\n\n```console\n$ ./tools/find_main_page_name.sh tr.wikiquote.org\nAnasayfa\n```\n\nTo determine the main page of the ZIM file, open it in a [Kiwix reader](https://www.kiwix.org/en/kiwix-reader) or use `zimdump info` (version 3.0.0 or later) and ignore the `A/` prefix:\n\n```console\n$ zimdump info wikipedia_tr_all_maxi_2021-01.zim\ncount-entries: 1088190\nuuid: 840fc82f-8f14-e11e-c185-6112dba6782e\ncluster count: 5288\nchecksum: 50113b4f4ef5ddb62596d361e0707f79\nmain page: A/Kullanıcı:The_other_Kiwix_guy/Landing\nfavicon: -/favicon\n\n$ zimdump info wikipedia_tr_all_maxi_2021-01.zim | grep -oP 'main page: A/\\K\\S+'\nKullanıcı:The_other_Kiwix_guy/Landing\n```\n\nThe conversion is done on the unpacked zim directory:\n\n```sh\nnode ./bin/run ./tmp/wikipedia_tr_all_maxi_2021-02 \\\n  --hostingdnsdomain=tr.wikipedia-on-ipfs.org \\\n  --zimfile=./snapshots/wikipedia_tr_all_maxi_2021-02.zim \\\n  --kiwixmainpage=Kullanıcı:The_other_Kiwix_guy/Landing \\\n  --mainpage=Anasayfa\n```\n\n### Step 6: Import website directory to IPFS\n\n#### Increase the limitation of opening files\n\nIn some cases, you will hit an error like `could not create 
socket: Too many open files` when you add files to the IPFS store. This happens when IPFS needs to open more files than the operating system allows. You can temporarily raise this limit with the following command.\n\n```sh\nulimit -n 65536\n```\n\n#### Add immutable copy\n\nAdd all the data to your node using `ipfs add`. Use the following command, replacing `$unpacked_wiki` with the path to the website directory that you unpacked in Step 4 and converted in Step 5 (`./tmp/wikipedia_en_all_maxi_2018-10`).\n\n```sh\n$ ipfs add -r --cid-version 1 --offline $unpacked_wiki\n```\n\nSave the last hash of the output from the above process. It is the CID of the website.\n\n### Step 7: Share the root CID\n\nShare the CID of your new snapshot so people can access it and replicate it onto their machines.\n\n### Step 8: Update *.wikipedia-on-ipfs.org\n\nMake sure at least two full reliable copies exist before updating DNSLink.\n\n## mirrorzim.sh\n\nIt is possible to automate steps 3-6 via a wrapper script named `mirrorzim.sh`.\nIt will download the latest snapshot of the specified language (if needed), unpack it, and add it to IPFS.\n\nTo see how the script behaves, try running it on one of the smallest wikis, such as `cu`:\n\n```console\n$ ./mirrorzim.sh --languagecode=cu --wikitype=wikipedia --hostingdnsdomain=cu.wikipedia-on-ipfs.org\n```\n\n## Docker build\n\nA `Dockerfile` with all the software requirements is provided.\nFor now it is only a handy container for running the process on non-Linux\nsystems or if you don't want to pollute your system with all the dependencies.\nIn the future it will be an end-to-end black box that takes a ZIM and produces a CID\nand repo.\n\nTo build the docker image:\n\n```sh\ndocker build . 
-t distributed-wikipedia-mirror-build\n```\n\nTo use it as a development environment:\n\n```sh\ndocker run -it -v $(pwd):/root/distributed-wikipedia-mirror --net=host --entrypoint bash distributed-wikipedia-mirror-build\n```\n\n# How to Help\n\nIf you don't mind using a command-line interface and have a lot of disk space,\nbandwidth, or code skills, continue reading.\n\n## Share mirror CID with people who can't trust DNS\n\nSharing a CID instead of a DNS name is useful when DNS is not reliable or\ntrustworthy.  The latest CID for a specific language mirror can be found via\nDNSLink:\n\n```console\n$ ipfs resolve -r /ipns/tr.wikipedia-on-ipfs.org\n/ipfs/bafy..\n```\n\nThe CID can then be opened via `ipfs://bafy..` in a web browser with the [IPFS Companion](https://github.com/ipfs-shipyard/ipfs-companion) extension\nresolving IPFS addresses via an [IPFS Desktop](https://docs.ipfs.tech/install/ipfs-desktop/) node.\n\nYou can also try [Brave browser](https://brave.com), which ships with [native support for IPFS](https://brave.com/ipfs-support/).\n\n## Cohost a lazy copy\n\nUsing MFS makes it easier to protect snapshots from being garbage collected\nthan low level pinning because you can assign meaningful names and it won't\nprefetch any blocks unless you explicitly ask.\n\nEvery mirrored Wikipedia article you visit will be added to your lazy\ncopy and will contribute to your partial mirror
, and you won't need to host\nthe entire thing.\n\nTo cohost a lazy copy, execute:\n\n```console\n$ export LNG=\"tr\"\n$ ipfs files mkdir -p /wikipedia-mirror/$LNG\n$ ipfs files cp $(ipfs resolve -r /ipns/$LNG.wikipedia-on-ipfs.org) /wikipedia-mirror/$LNG/${LNG}_$(date +%F_%T)\n```\n\nThen simply start browsing the `$LNG.wikipedia-on-ipfs.org` site via your node.\nEvery visited page will be cached, cohosted, and protected from garbage collection.\n\n## Cohost a full copy\n\nThe steps are the same as for a lazy copy, but you run an additional preload\nafter the lazy copy is in place:\n\n```console\n$ # export LNG=\"tr\"\n$ ipfs refs -r /ipns/$LNG.wikipedia-on-ipfs.org\n```\n\nBefore you execute this, check if you have enough disk space to fit `CumulativeSize`:\n\n```console\n$ # export LNG=\"tr\"\n$ ipfs object stat --human /ipns/$LNG.wikipedia-on-ipfs.org\nNumLinks:       5\nBlockSize:      281\nLinksSize:      251\nDataSize:       30\nCumulativeSize: 15 GB\n```\n\nWe are working on improving deduplication between snapshots, but for now YMMV.\n\n## Code\n\nIf you would like to contribute more to this effort, look at the [issues](https://github.com/ipfs/distributed-wikipedia-mirror/issues) in this GitHub repo. Especially check for [issues marked with the \"wishlist\" label](https://github.com/ipfs/distributed-wikipedia-mirror/labels/wishlist) and issues marked [\"help wanted\"](https://github.com/ipfs/distributed-wikipedia-mirror/labels/help%20wanted).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fipfs%2Fdistributed-wikipedia-mirror","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fipfs%2Fdistributed-wikipedia-mirror","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fipfs%2Fdistributed-wikipedia-mirror/lists"}