{"id":13588191,"url":"https://github.com/pirate/wikipedia-mirror","last_synced_at":"2025-05-16T02:07:55.526Z","repository":{"id":53921005,"uuid":"207148332","full_name":"pirate/wikipedia-mirror","owner":"pirate","description":"🌐 Guide and tools to run a full offline mirror of Wikipedia.org with three different approaches: Nginx caching proxy, Kiwix + ZIM dump, and MediaWiki/XOWA + XML dump","archived":false,"fork":false,"pushed_at":"2025-03-25T06:49:02.000Z","size":10715,"stargazers_count":446,"open_issues_count":0,"forks_count":33,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-04-06T10:14:24.055Z","etag":null,"topics":["archiving","datascience","docker","docker-compose","html","internet-archiving","kiwix","kiwix-offline-wikipedia","mediawiki","mwdumper","nginx","openzim","wiki","wikipedia","wikipedia-dump","wikipedia-mirror","xowa","zim"],"latest_commit_sha":null,"homepage":"https://docs.sweeting.me/s/self-host-a-wikipedia-mirror","language":"PLpgSQL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pirate.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null},"funding":{"github":"pirate","patreon":"theSquashSH","custom":"https://paypal.me/NicholasSweeting"}},"created_at":"2019-09-08T17:29:03.000Z","updated_at":"2025-03-28T12:07:39.000Z","dependencies_parsed_at":"2022-08-13T04:20:37.271Z","dependency_job_id":null,"html_url":"https://github.com/pirate/wikipedia-mirror","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pirate%2Fwikipedia-mirror","tags_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/pirate%2Fwikipedia-mirror/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pirate%2Fwikipedia-mirror/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pirate%2Fwikipedia-mirror/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pirate","download_url":"https://codeload.github.com/pirate/wikipedia-mirror/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254453653,"owners_count":22073617,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archiving","datascience","docker","docker-compose","html","internet-archiving","kiwix","kiwix-offline-wikipedia","mediawiki","mwdumper","nginx","openzim","wiki","wikipedia","wikipedia-dump","wikipedia-mirror","xowa","zim"],"created_at":"2024-08-01T15:06:33.708Z","updated_at":"2025-05-16T02:07:55.442Z","avatar_url":"https://github.com/pirate.png","language":"PLpgSQL","funding_links":["https://github.com/sponsors/pirate","https://patreon.com/theSquashSH","https://paypal.me/NicholasSweeting"],"categories":["Shell","PLpgSQL"],"sub_categories":[],"readme":"\u003cdiv align=\"center\" style=\"text-align:center\"\u003e\n\n\u003ch1\u003eHow to self-host a mirror of Wikipedia.org:\u003cbr/\u003ewith Nginx, Kiwix, or MediaWiki/XOWA + Docker\u003c/h1\u003e\n\u003ci\u003eOriginally published 2019-09-08 on \u003ca href=\"https://docs.sweeting.me/s/blog\"\u003edocs.sweeting.me\u003c/a\u003e.\u003cbr/\u003eThe pretty \u003ca 
href=\"https://docs.sweeting.me/s/self-host-a-wikipedia-mirror\"\u003eHTML version is here\u003c/a\u003e and the \u003ca href=\"https://github.com/pirate/wikipedia-mirror\"\u003esource for this guide is on GitHub\u003c/a\u003e.\u003c/i\u003e\u003cbr/\u003e\u003cbr/\u003e\nA summary of how to set up a full Wikipedia.org mirror using three different approaches.\u003cbr/\u003e\n\u003cb\u003eDEMO: https://other-wiki.zervice.io\u003c/b\u003e\n\u003chr/\u003e\n\u003cimg src=\"https://chrischapman.co/images/kiwix/home-page-internal.png\" width=\"500px\"/\u003e\n\u003c/div\u003e\n\n# Intro\n\n\u003e **Did you know that Wikipedia.org just runs a mostly-traditional LAMP stack on [~350 servers](https://meta.wikimedia.org/wiki/Wikimedia_servers)**? (as of 2019)\n\n**Unfortunately, Wikipedia attracts lots of hate from people and nation-states who object to certain articles or want to hide information from the public eye.**\n\nWikipedia's infrastructure (2 racks in the USA, 1 in Holland, and 1 in Singapore, + CDNs) [can't always stand up to large DDoS attacks](https://wikimediafoundation.org/news/2019/09/07/malicious-attack-on-wikipedia-what-we-know-and-what-were-doing/), but thankfully they provide regular database dumps and static HTML archives to the public, and have permissive licensing that allows for rehosting with modification (even for profit!).\n\nGrowing up in China [behind the Great Firewall I often experienced Wikipedia unavailability](https://www.cnet.com/news/the-great-firewall-of-china-blocks-off-wikipedia/), and in light of the [recent DDoS](https://wikimediafoundation.org/news/2019/09/07/malicious-attack-on-wikipedia-what-we-know-and-what-were-doing/) I decided to make a guide for people to help demystify the process of running a mirror. 
I'm also a big advocate for free access to information, and I'm the maintainer of a major internet archiving project called [ArchiveBox](https://archivebox.io) (a self-hosted internet archiver powered by headless Chromium).\n\n**The aim of this guide is to encourage people to use these publicly available dumps to host Wikipedia mirrors, so that malicious actors don't succeed in limiting public access to one of the *world's best sources of information*.**\n\n---\n\n## Quickstart\n\nA *full* English Wikipedia.org clone in 3 steps.\n\n**DEMO: https://other-wiki.zervice.io**\n\n```bash\n# 1. Download the Kiwix-Serve static binary from https://www.kiwix.org/en/downloads/kiwix-serve/\nwget 'https://download.kiwix.org/release/kiwix-tools/kiwix-tools_linux-x86_64-3.0.1.tar.gz'\ntar -xzf kiwix-tools_linux-x86_64-3.0.1.tar.gz \u0026\u0026 cd kiwix-tools_linux-x86_64-3.0.1\n\n# 2. Download a compressed Wikipedia dump from https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/ (79GB, images included!)\nwget --continue \"https://download.kiwix.org/zim/wikipedia_en_all_maxi.zim\"\n\n# 3. Start the kiwix server, then visit http://127.0.0.1:8888\n./kiwix-serve --verbose --port 8888 \"$PWD/wikipedia_en_all_maxi.zim\"\n```\n---\n\n\n## Getting Started\n\nWikipedia.org itself is powered by a PHP backend called [MediaWiki](https://en.wikipedia.org/wiki/MediaWiki), using MariaDB for data storage, Varnish and Memcached for request and query caching, and ElasticSearch for full-text search. Production Wikipedia.org also runs a number of extra plugins and modules on top of MediaWiki.\n\n**🖥 There are several ways to host your own mirror of Wikipedia (with varying complexity):**\n\n1. [**Run a caching proxy in front of Wikipedia.org**](#) (disk used on-demand for cache, low CPU use)\n2. [**Serve the static HTML ZIM archive with Kiwix**](#) (10~80GB for compressed archive, low CPU use)\n3. 
[**Run a full MediaWiki server**](#) (hardest to set up, ~600GB for XML \u0026 database, high CPU use)\n\n**💅 Don't expect it to look perfect on the first try**\n\nSetting up a Wikipedia mirror involves a complex dance between software, data, and devops, so beginners are encouraged to start with the static HTML archive or proxy before attempting to run a full MediaWiki server. Users should expect their mirrors to be able to serve articles with images and search, but should not expect them to look exactly like Wikipedia.org on the first try, or the second...\n\n**✅ Choosing an approach**\n\nEach method in this guide has its pros and cons. A caching proxy is the most lightweight option, but if the upstream servers go down and a request comes in for a page that hasn't already been cached, the request will fail, so it's not a fully redundant mirror. The static ZIM mirror is lightweight to download and host (and requests are easy to cache) and it has full-text search, but it has no interactivity, talk page history, or Wikipedia-style category pages (though they are coming soon). MediaWiki/XOWA are the most complex, but they can provide a full working Wikipedia mirror complete with history revisions, users, talk pages, search, and more.\n\nRunning a full MediaWiki server is by far the hardest method to set up. Expect it to take multiple days/weeks depending on available system resources, and expect it to look fairly broken since the production Wikipedia.org team runs many tweaks and plugins that take extra work to set up locally.\n\nFor more info, see the [Wikipedia.org index of all dump types available, with descriptions](https://dumps.wikimedia.org/).\n\n\n## Responsible Rehosting Warning\n\n⚠️ Be aware that running a publicly-accessible mirror of Wikipedia.org with any kind of framing / content modifications / ads is *strongly discouraged*. 
Framing mirrors / proxy mirrors are still a good option for private use, but you need to take additional steps to mirror responsibly if you're setting up a proxy for public use (e.g. robots:noindex, takedown contact info, blocking unlicensed images, etc.).\n\n\u003e \u003cspan style=\"font-size:14px\"\u003eSome mirrors load a page from the Wikimedia servers directly every time someone requests a page from them. They alter the text in some way, such as framing it with ads, then send it on to the reader. **This is called remote loading, and it is an unacceptable use of Wikimedia server resources.** Even remote loading websites with little legitimate traffic can generate significant load on our servers, due to search engine web crawlers.\n*https://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks#Remote_loading*\u003c/span\u003e\n\n\nLuckily, regardless of how you choose to rehost Wikipedia ***text***, you are not breaking any terms and conditions or violating copyright law as long as you don't remove their copyright statements (however, note the article images and videos on Wikimedia.org may not be licensed for re-use).\n\n\u003e \u003cspan style=\"font-size: 14px\"\u003eEvery contribution to the English Wikipedia has been licensed for re-use, including commercial, for-profit websites. Republication is not necessarily a breach of copyright, so long as the appropriate licenses are complied with.\n*https://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks#Things_you_need_to_know*\n\u003c/span\u003e\n\n---\n\n# [Table of Contents](https://docs.sweeting.me/s/self-host-a-wikipedia-mirror#TOC)\n\n[TOC]\n\nSee the [HTML version](https://docs.sweeting.me/s/self-host-a-wikipedia-mirror#TOC) of this guide for the best browsing experience. See [pirate/wikipedia-mirror](https://github.com/pirate/wikipedia-mirror) on Github for example config source, docker-compose files, binaries, folder structure, and more.\n\n---\n\n# Tutorial\n\n---\n\n## Prerequisites\n\n1. 
**Provision a server to act as your Wikipedia mirror**\n\n   You can use a cheap VPS provider like DigitalOcean, Vultr, Hetzner, etc. For the static ZIM archive and MediaWiki server methods you will need significant disk space, so a home server with a cheap external HD may be a better option.\n   \n   *The setup examples below are based on Ubuntu 19.04* running on a home server; however, they should work on many other OSes with minimal tweaking (e.g. FreeBSD, macOS, Arch, etc.).\n\n2. **Purchase a new domain or create a subdomain to host your mirror**\n\n   You can use Google Domains, NameCheap, GoDaddy, etc.; any registrar will work.\n\n   *In the setup examples below, replace `wiki.example.com` with the domain you chose.*\n\n3. **Point the DNS records for the domain to your mirror server**\n\n   Configure these records via your DNS provider (e.g. NameCheap, DigitalOcean, CloudFlare, etc.):\n\n   - `wiki.example.com` `A` -\u003e `your server's public ip` (the root domain)\n   - `en.wiki.example.com` `CNAME` -\u003e `wiki.example.com` (the wiki domain)\n   - `upload.wiki.example.com` `CNAME` -\u003e  `wiki.example.com` (the uploads/media domain)\n\n4. 
**Create a directory to store the project, and a dotenv file for your config options**\n\n    Not all of these values are needed for all the methods, but it's easier to just define all of them in one place and remove things later that turn out to be unneeded.\n\n    ```bash\n    mkdir -p /opt/wiki                  # change PROJECT_DIR below to match\n    nano /opt/wiki/.env\n    ```\n    Create the `.env` config file in [`dotenv`](https://docs.docker.com/compose/env-file/)/`bash` syntax with the contents below.\n    *Make sure to replace the example values like `wiki.example.com` with your own.*\n    ```bash\n    PROJECT_DIR=\"/opt/wiki\"                   # folder for all project state\n    CONFIG_DIR=\"$PROJECT_DIR/etc/nginx\"\n    CACHE_DIR=\"$PROJECT_DIR/data/cache\"\n    CERTS_DIR=\"$PROJECT_DIR/data/certs\"\n    LOGS_DIR=\"$PROJECT_DIR/data/logs\"\n\n    LANG=\"en\"                                 # Wikipedia language to mirror\n    LISTEN_PORT_HTTP=\"80\"                     # public-facing HTTP port to bind\n    LISTEN_PORT_HTTPS=\"443\"                   # public-facing HTTPS port to bind\n    LISTEN_HOST=\"wiki.example.com\"            # root domain to listen on\n    LISTEN_WIKI=\"$LANG.$LISTEN_HOST\"          # wiki domain to listen on\n    LISTEN_MEDIA=\"upload.$LISTEN_HOST\"        # uploads domain to listen on\n\n    UPSTREAM_HOST=\"wikipedia.org\"             # main upstream domain\n    UPSTREAM_WIKI=\"$LANG.$UPSTREAM_HOST\"      # upstream domain for wiki\n    UPSTREAM_MEDIA=\"upload.wikimedia.org\"     # upstream domain for uploads\n\n    # Only needed if using an nginx reverse proxy:\n    SSL_CRT=\"$CERTS_DIR/$LISTEN_HOST.crt\"\n    SSL_KEY=\"$CERTS_DIR/$LISTEN_HOST.key\"\n    SSL_DH=\"$CERTS_DIR/$LISTEN_HOST.dh\"\n\n    CACHE_SIZE=\"100G\"                         # or \"500GB\", \"1GB\", \"200MB\", etc.\n    CACHE_REQUESTS=\"GET HEAD POST\"            # or \"GET HEAD\", \"any\", etc.\n    CACHE_RESPONSES=\"200 206 302\"             # or \"200 302 
404\", \"any\", etc.\n    CACHE_DURATION=\"max\"                      # or \"1d\", \"30m\", \"12h\", etc.\n\n    ACCESS_LOG=\"'$LOGS_DIR/nginx.out' trace\"  # or \"off\", etc.\n    ERROR_LOG=\"'$LOGS_DIR/nginx.err' warn\"    # or \"off\", etc.\n    ```\n    \n    *\u003cspan style=\"color:orange\"\u003eThe setup steps below depend on this file existing and the config values being correct,\u003c/span\u003e\n    so make sure you create it and replace all example values with your own before proceeding!*\n\n---\n\n## Choosing a Wikipedia archive dump\n\n- https://download.kiwix.org/zim/wikipedia/ (for BitTorrent add `.torrent` to the end of any `.zim` url)\n- https://en.wikipedia.org/wiki/MediaWiki\n- https://www.mediawiki.org/wiki/MediaWiki\n- https://www.mediawiki.org/wiki/Download\n- https://www.wikidata.org/wiki/Wikidata:Database_download\n- https://dumps.wikimedia.org/backup-index.html\n\n### ZIM Static HTML Dump\n\nWikipedia HTML dumps are provided in a highly-compressed web-archiving format called [ZIM](https://openzim.org). They can be served using a ZIM server like Kiwix (the most common one), or [ZimReader](https://openzim.org/wiki/Zimreader), [GoZIM](https://github.com/akhenakh/gozim), \u0026 [others](https://openzim.org/wiki/Readers).\n\n- [Kiwix.org full ZIM archive list](https://wiki.kiwix.org/wiki/Content_in_all_languages) or [Kiwix.org Wikipedia-specific ZIM archive list](https://library.kiwix.org/#lang=eng\u0026q=wikipedia)\n- [Wikimedia.org ZIM archive list](https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/)\n- [List of ZIM BitTorrent links](https://gist.github.com/maxogden/70674db0b5b181b8eeb1d3f9b638ab2a)\n\nZIM archive dumps are usually published yearly, but the release schedule is not guaranteed. 
As of August 2019 the latest available dump containing all English articles is from October 2018:\n\n[`wikipedia_en_all_mini_2019-09.zim`](https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_mini_2019-09.zim) ([torrent](https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_mini_2019-09.zim.torrent)) (10GB, mini English articles, no pictures or video)\n\n[`wikipedia_en_all_nopic_2018-09.zim`](https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_nopic_2018-09.zim) ([torrent](https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_nopic_2018-09.zim.torrent)) (35GB, all English articles, no pictures or video)\n\n**[`wikipedia_en_all_maxi_2018-10.zim`](https://download.kiwix.org/zim/wikipedia_en_all_maxi.zim)** ([torrent](https://download.kiwix.org/zim/wikipedia_en_all_maxi.zim.torrent)) (79GB, all English articles w/ pictures, no video)\n\n[`wikipedia_en_simple_all_maxi_2020-01.zim`](https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/wikipedia_en_simple_all_maxi_2020-01.zim) (1.6GB, SimpleWiki English only, good for testing)\n\n**Download your chosen Wikipedia ZIM archive** (e.g. `wikipedia_en_all_maxi_2018-10.zim`)\n\n```bash\nmkdir -p /opt/wiki/data/dumps \u0026\u0026 cd /opt/wiki/data/dumps\n\n# Download via BitTorrent:\ntransmission-cli --download-dir . 
'magnet:?xt=urn:btih:O2F3E2JKCEEBCULFP2E2MRUGEVFEIHZW'\n\n# Or download via HTTPS from one of the mirrors (pick one; -c resumes a partial download):\nwget -c 'https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/wikipedia_en_all_maxi_2018-10.zim'\nwget -c 'https://ftpmirror.your.org/pub/kiwix/zim/wikipedia/wikipedia_en_all_maxi_2018-10.zim'\nwget -c 'https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2018-10.zim'\n\n# Optionally after download, verify the length (fast) or the base64-encoded MD5 checksum (slow):\nstat --printf=\"%s\" wikipedia_en_all_maxi_2018-10.zim | grep 83853668638\nopenssl dgst -md5 -binary wikipedia_en_all_maxi_2018-10.zim | openssl enc -base64 | grep 01eMQki29P9vD5F2h6zWwQ\n```\n\n### XML Database Dump\n\n- [WikiData.org Dump Types (JSON, RDF, XML)](https://www.wikidata.org/wiki/Wikidata:Database_download)\n- [List of Dumps (XML dumps)](https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia)\n- [List of Mirrors (XML dumps)](https://dumps.wikimedia.org/mirrors.html)\n\nDatabase dumps are usually published monthly. As of August 2019, the latest dump containing all English articles is from July 2019:\n\n**[`enwiki-20190720-pages-articles.xml.bz2`](https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia)** (15GB, all English articles, no pictures/videos)\n\n[`simplewiki-20170820-pages-meta-current.xml.bz2`](https://itorrents.org/torrent/B23A2BDC351E58E041D79F335A3CF872DEBAE919.torrent) (180MB, SimpleWiki only, good for testing)\n\n**Download your chosen Wikipedia XML dump** (e.g. `enwiki-20190720-pages-articles.xml.bz2`)\n\n```bash\nmkdir -p /opt/wiki/data/dumps \u0026\u0026 cd /opt/wiki/data/dumps\n\n# Download via BitTorrent:\ntransmission-cli --download-dir . 'magnet:?xl=16321006399\u0026dn=enwiki-20190720-pages-articles.xml.bz2\u0026xt=urn:btih:9f08161276bc95ec594ce89ed52fe18fc41168a3'\n\n# Download via HTTP:\n# lol no. 
no one wants to serve you a 15GB file via HTTP\n```\n\n---\n\n## Method #1: Run a caching proxy in front of Wikipedia.org\n\n\u003e \u003cspan style=\"color:#444\"\u003e**Complexity:**\u003c/span\u003e \u003cspan style=\"color:green\"\u003eLow\u003c/span\u003e  \n\u003e Minimal setup and operations requirements, no download of large dumps needed.  \n\u003e \u003cspan style=\"color:#444\"\u003e**Disk space requirements:**\u003c/span\u003e \u003cspan style=\"color:orange\"\u003eOn-Demand\u003c/span\u003e  \n\u003e Disk is only used as pages are requested (can be 1gb up to 2TB+ depending on usage).  \n\u003e \u003cspan style=\"color:#444\"\u003e**CPU requirements:**\u003c/span\u003e \u003cspan style=\"color:green\"\u003eVery Low\u003c/span\u003e  \n\u003e Lowest out of the three options, can be run on a tiny VPS or home-server.  \n\u003e \u003cspan style=\"color:#444\"\u003e**Content freshness:**\u003c/span\u003e \u003cspan style=\"color:green\"\u003eVery Fresh\u003c/span\u003e  \n\u003e Configurable to cache content indefinitely or pull fresh data for every request.  \n\n### a. Running with Nginx\n\nSet the following options in your `/opt/wiki/.env` config file:\n  `UPSTREAM_HOST=wikipedia.org`\n  `UPSTREAM_WIKI=en.wikipedia.org`\n  `UPSTREAM_MEDIA=upload.wikimedia.org`\n\nThen run all the setup steps below under [Nginx Reverse Proxy](#) to set up Nginx.\n\nThen restart nginx to apply your config with `systemctl restart nginx`.\n\nYour mirror should now be running and proxying requests to Wikipedia.org!\n\nVisit https://en.yourdomainhere.com to see it in action (e.g. https://en.wiki.example.com).\n\n### b. 
Running with Caddy\n\nAlternatively, check out a similar setup that uses Caddy instead of Nginx as the reverse proxy: https://github.com/CristianCantoro/wikiproxy\n\n---\n\n## Method #2: Serve the static HTML ZIM archive with Kiwix\n\n\u003e \u003cspan style=\"color:#444\"\u003e**Complexity:**\u003c/span\u003e \u003cspan style=\"color:orange\"\u003eModerate\u003c/span\u003e  \n\u003e Static binary makes it easy to run, but it requires downloading a large dump file.  \n\u003e \u003cspan style=\"color:#444\"\u003e**Disk space requirements:**\u003c/span\u003e \u003cspan style=\"color:green\"\u003e\u0026gt;80GB\u003c/span\u003e  \n\u003e The ZIM archive is a highly-compressed collection of static HTML articles only.  \n\u003e \u003cspan style=\"color:#444\"\u003e**CPU requirements:**\u003c/span\u003e \u003cspan style=\"color:green\"\u003eVery Low\u003c/span\u003e  \n\u003e Low, especially with a CDN in front (more than a proxy, but less than a full server).  \n\u003e \u003cspan style=\"color:#444\"\u003e**Content freshness:**\u003c/span\u003e \u003cspan style=\"color:red\"\u003eOften Stale\u003c/span\u003e  \n\u003e ZIM archives are published yearly (ish) by Wikipedia.org.  \n\nFirst download a ZIM archive dump like `wikipedia_en_all_maxi_2018-10.zim` into `/opt/wiki/data/dumps` as described above.\n\n\n### a. Running with Docker\n\nRun `kiwix-serve` with docker like so:\n\n```bash\ndocker run \\\n    -v '/opt/wiki/data/dumps:/data' \\\n    -p 8888:80 \\\n    kiwix/kiwix-serve \\\n    'wikipedia_en_all_maxi_2018-10.zim'\n```\n\nOr create `/opt/wiki/docker-compose.yml` and run `docker-compose up`:\n```yml\nversion: '3'\nservices:\n  kiwix:\n    image: kiwix/kiwix-serve\n    command: 'wikipedia_en_all_maxi_2018-10.zim'\n    ports:\n      - '8888:80'\n    volumes:\n      - \"./data/dumps:/data\"\n```\n\n### b. Running with the static binary\n\n1. 
**Download the latest `kiwix-serve` binary for your OS \u0026 CPU architecture**\n\n    Find the latest release for your architecture here and copy its URL to download it below:\n    https://download.kiwix.org/release/kiwix-tools/\n\n    ```bash\n    cd /opt/wiki\n    wget 'https://download.kiwix.org/release/kiwix-tools/kiwix-tools_linux-x86_64-3.0.1.tar.gz'\n    tar -xzf 'kiwix-tools_linux-x86_64-3.0.1.tar.gz'\n    mv 'kiwix-tools_linux-x86_64-3.0.1' 'bin'\n    ```\n\n2. **Run `kiwix-serve`, passing it a port to listen on and your ZIM archive file**\n\n    ```bash\n    /opt/wiki/bin/kiwix-serve --port 8888 /opt/wiki/data/dumps/wikipedia_en_all_maxi_2018-10.zim\n    ```\n\n    Your server should now be running!\n\n    Visit http://en.yourdomainhere.com:8888 to see it in action!\n\n### Optional Nginx Reverse Proxy\n\nSet the following options in your `/opt/wiki/.env` config file:\n```bash\nUPSTREAM_HOST=localhost:8888\nUPSTREAM_WIKI=localhost:8888\nUPSTREAM_MEDIA=upload.wikimedia.org\n```\n\nThen run all the setup steps below under [Nginx Reverse Proxy](#) to set up Nginx. To run nginx inside docker-compose next to Kiwix, see the [Run Nginx via docker-compose](#) section below.\n\nYour mirror should now be running and proxying requests to `kiwix-serve`!\n\nVisit https://en.yourdomainhere.com to see it in action (e.g. https://en.wiki.example.com).\n\n\n---\n\n## Method #3: Run a full MediaWiki server\n\n\u003e \u003cspan style=\"color:#444\"\u003e**Complexity:**\u003c/span\u003e \u003cspan style=\"color:red\"\u003eVery High\u003c/span\u003e  \n\u003e Complex multi-component setup with an intricate setup process and high resource use.  \n\u003e \u003cspan style=\"color:#444\"\u003e**Disk space requirements:**\u003c/span\u003e \u003cspan style=\"color:red\"\u003e\u0026gt;550GB (\u003e2TB needed for import phase)\u003c/span\u003e  \n\u003e  The uncompressed database is very large (multiple TB with revision history and stubs).  
\n\u003e \u003cspan style=\"color:#444\"\u003e**CPU requirements:**\u003c/span\u003e \u003cspan style=\"color:orange\"\u003eModerate (very high during import phase)\u003c/span\u003e  \n\u003e  Depends on usage, but it's the most demanding out of the 3 options.  \n\u003e \u003cspan style=\"color:#444\"\u003e**Content freshness:**\u003c/span\u003e \u003cspan style=\"color:green\"\u003eVery fresh\u003c/span\u003e  \n\u003e Updated database dumps are published monthly (ish) by Wikipedia.org.  \n\nFirst download a database dump like [`enwiki-20190720-pages-articles.xml.bz2`](magnet:?xl=16321006399\u0026dn=enwiki-20190720-pages-articles.xml.bz2\u0026xt=urn:tree:tiger:zpqgda3rbnycgtcujwpqi72aiv7tyasw7rp7sdi\u0026xt=urn:ed2k:3b291214eb785df5b21cdb62623dd319\u0026xt=urn:aich:zuy4dfbo2ppdhsdtmlev72fggdnka6ch\u0026xt=urn:btih:9f08161276bc95ec594ce89ed52fe18fc41168a3\u0026xt=urn:sha1:54cbdd5e5d1ca22b7dbd16463f81fdbcd6207bab\u0026xt=urn:md5:9be9c811e0cc5c8418c869bb33eb516c\u0026tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80\u0026as=http%3a%2f%2fdumps.wikimedia.freemirror.org%2fenwiki%2f20190720%2fenwiki-20190720-pages-articles.xml.bz2\u0026as=http%3a%2f%2fdumps.wikimedia.your.org%2fenwiki%2f20190720%2fenwiki-20190720-pages-articles.xml.bz2\u0026as=http%3a%2f%2fftp.acc.umu.se%2fmirror%2fwikimedia.org%2fdumps%2fenwiki%2f20190720%2fenwiki-20190720-pages-articles.xml.bz2\u0026as=https%3a%2f%2fdumps.wikimedia.freemirror.org%2fenwiki%2f20190720%2fenwiki-20190720-pages-articles.xml.bz2\u0026as=https%3a%2f%2fdumps.wikimedia.your.org%2fenwiki%2f20190720%2fenwiki-20190720-pages-articles.xml.bz2\u0026as=https%3a%2f%2fftp.acc.umu.se%2fmirror%2fwikimedia.org%2fdumps%2fenwiki%2f20190720%2fenwiki-20190720-pages-articles.xml.bz2\u0026as=https%3a%2f%2fdumps.wikimedia.org%2fenwiki%2f20190720%2fenwiki-20190720-pages-articles.xml.bz2) into `/opt/wiki/data/dumps` as described above.\n\nIf you need to decompress it, `pbzip2` is much faster than `bzip2`:\n```bash\npbzip2 -v -d -k -m10000 
enwiki-20190720-pages-articles.xml.bz2\n# -m10000 tells it to use 10GB of RAM, adjust accordingly\n```\n\n### a. Running with XOWA in Docker\n\nhttps://github.com/QuantumObject/docker-xowa\n\n```bash\ndocker run \\\n    -v /opt/wiki/data/xowa:/opt/xowa/ \\\n    -p 8888 \\\n    sblop/xowa_offline_wikipedia\n```\n```yaml\nversion: '3'\nservices:\n  xowa:\n    image: sblop/xowa_offline_wikipedia\n    ports:\n      - 8888:80\n    volumes:\n      - './data/xowa:/opt/xowa'\n```\n\n### b. Running with MediaWiki in Docker\n\n- https://hub.docker.com/_/mediawiki\n- https://github.com/wikimedia/mediawiki-docker\n- https://github.com/AirHelp/mediawiki-docker\n- https://en.wikipedia.org/wiki/MediaWiki\n- https://www.mediawiki.org/wiki/MediaWiki\n- https://www.mediawiki.org/wiki/Download\n- https://www.wikidata.org/wiki/Wikidata:Database_download\n- https://dumps.wikimedia.org/backup-index.html\n\n\n**Configure your `docker-compose.yml` file**\n\nDefault MediaWiki config file: https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/DefaultSettings.php\n\nCreate the following `/opt/wiki/docker-compose.yml` file then run `docker-compose up`:\n```yml\nversion: '3'\nservices:\n  database:\n    image: mariadb\n    command: --max-allowed-packet=256M\n    environment:\n      MYSQL_DATABASE: wikipedia\n      MYSQL_USER: wikipedia\n      MYSQL_PASSWORD: wikipedia\n      MYSQL_ROOT_PASSWORD: wikipedia\n      \n  mediawiki:\n    image: mediawiki\n    ports:\n      - 8080:80\n    depends_on:\n      - database\n    volumes:\n      - './data/html:/var/www/html'\n      # After initial setup, download LocalSettings.php into ./data/html\n      # and uncomment the following line, then docker-compose restart\n      # - ./LocalSettings.php:/var/www/html/LocalSettings.php\n```\n\n\n**Then import the XML dump into the MediaWiki database:**\n- https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps\n- https://hub.docker.com/r/ueland/mwdumper/\n- 
https://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg02108.html\n    \n**Do not attempt to import it directly with `importDump.php`; it will take months:**\n```bash\nphp /var/www/html/maintenance/importDump.php enwiki-20170320-pages-articles-multistream.xml\n```\n\n**Instead, convert the XML dump into compressed chunks of SQL, then import them individually:**\n\n*Warning: For large imports (e.g. English) this process can still take 5+ days depending on the system.*\n\n```bash\napt install -y openjdk-8-jre zstd pbzip2\n\n# Download patched mwdumper version and pre/post import SQL scripts\nwget \"https://github.com/pirate/wikipedia-mirror/raw/master/bin/mwdumper-1.26.jar\"\nwget \"https://github.com/pirate/wikipedia-mirror/raw/master/preimport.sql\"\nwget \"https://github.com/pirate/wikipedia-mirror/raw/master/postimport.sql\"\n\nDUMP_NAME=\"enwiki-20190720-pages-articles\"\n\n# Decompress the XML dump using all available cores and 10GB of memory\npbzip2 -v -d -k -m10000 \"$DUMP_NAME.xml.bz2\"\n\n# Convert the XML file into a SQL file using mwdumper\njava -server \\\n    -jar ./mwdumper-1.26.jar \\\n    --format=sql:1.5 \\\n    \"$DUMP_NAME.xml\" \\\n\u003e wikipedia.sql\n\n# Split the generated SQL file into chunks, then compress the chunks and the pre/post scripts\nsplit --additional-suffix=\".sql\" --lines=1000 wikipedia.sql chunk_\nfor partial in chunk_*.sql preimport.sql postimport.sql; do\n    zstd -z \"$partial\"\ndone\n\n# Fix a schema issue that may otherwise cause import bugs\ndocker-compose exec database \\\n    mysql --user=wikipedia --password=wikipedia --database=wikipedia \\\n        --execute=\"ALTER TABLE page ADD page_counter bigint unsigned NOT NULL default 0;\"\n\n# Import the compressed chunks into the database (-T disables the TTY so stdin can be piped)\nfor partial in chunk_*.sql.zst; do\n    zstd -dc preimport.sql.zst \"$partial\" postimport.sql.zst \\\n    | docker-compose exec -T database \\\n        mysql --force --user=wikipedia --password=wikipedia --database=wikipedia\ndone\n```\n\n\u003csup\u003eCredit for these steps goes to 
https://github.com/wayneworkman/wikipedia-importing-tools.\u003c/sup\u003e\n\n\n### Optional Nginx Reverse Proxy\n\nSet the following options in your `/opt/wiki/.env` config file:\n```bash\nUPSTREAM_HOST=localhost:8888\nUPSTREAM_WIKI=localhost:8888\nUPSTREAM_MEDIA=upload.wikimedia.org\n```\n\nThen run all the setup steps below under [Nginx Reverse Proxy](#) to set up Nginx. To run nginx inside docker-compose next to MediaWiki, see the [Run Nginx via docker-compose](#) section below.\n\nYour mirror should now be running and proxying requests to your wiki server!\n\nVisit https://en.yourdomainhere.com to see it in action (e.g. https://en.wiki.example.com).\n\n---\n\n## Nginx Reverse Proxy\n\nYou can optionally set up an Nginx reverse proxy in front of `kiwix-serve`, `Wikipedia.org`, or a `MediaWiki` server to add caching and HTTPS support.\n\nMake sure the options in `/opt/wiki/.env` are configured correctly for the type of setup you're trying to achieve.\n\n- To run nginx in front of `kiwix-serve` on localhost, set:\n  `UPSTREAM_HOST=localhost:8888`\n  `UPSTREAM_WIKI=localhost:8888`\n  `UPSTREAM_MEDIA=upload.wikimedia.org`\n- To run nginx in front of Wikipedia.org, set:\n  `UPSTREAM_HOST=wikipedia.org`\n  `UPSTREAM_WIKI=en.wikipedia.org`\n  `UPSTREAM_MEDIA=upload.wikimedia.org`\n- To run nginx in front of a MediaWiki server on localhost, set:\n  `UPSTREAM_HOST=localhost:8888`\n  `UPSTREAM_WIKI=localhost:8888`\n  `UPSTREAM_MEDIA=upload.wikimedia.org`\n- To run nginx in front of a docker container via docker-compose:\n  *See [Run Nginx via docker-compose](#) section below.*\n\n### Install LetsEncrypt and Nginx\n\n```bash\n# Install the dependencies: nginx and certbot\nadd-apt-repository -y -n universe\nadd-apt-repository -y -n ppa:certbot/certbot\nadd-apt-repository -y -n ppa:nginx/stable\napt update -qq\napt install -y nginx-extras certbot python3-certbot-nginx\nsystemctl enable nginx\nsystemctl start nginx\n```\n\n### Obtain an SSL certificate via 
LetsEncrypt\n```bash\n# Load your config values from step 4 into the environment, and create dirs\nsource /opt/wiki/.env\nmkdir -p \"$CONFIG_DIR\" \"$CACHE_DIR\" \"$CERTS_DIR\" \"$LOGS_DIR\"\n\n# Get an SSL certificate and generate the Diffie-Hellman parameters file\ncertbot certonly \\\n    --nginx \\\n    --agree-tos \\\n    --non-interactive \\\n    -m \"ssl@$LISTEN_HOST\" \\\n    --domain \"$LISTEN_HOST,$LISTEN_WIKI,$LISTEN_MEDIA\"\nopenssl dhparam -out \"$PROJECT_DIR/data/certs/$DOMAIN.dh\" 2048\n\n# Link the certs into your project directory\nln -s /etc/letsencrypt/live/$DOMAIN/fullchain.pem $PROJECT_DIR/data/certs/$DOMAIN.crt\nln -s /etc/letsencrypt/live/$DOMAIN/privkey.pem $PROJECT_DIR/data/certs/$DOMAIN.key\n```\n\nLetsEncrypt certs expire after 90 days, so they must be renewed before then or you'll get \"Invalid Certificate\" errors. To have certs renewed automatically, add a systemd timer or cron job that runs `certbot renew`. Here's an example tutorial on how to do that:\n    https://gregchapple.com/2018/02/16/auto-renew-lets-encrypt-certs-with-systemd-timers/\n\n### Populate the nginx.conf template with your config\n\u003c!-- {% raw %} --\u003e\n```bash\n# Load your config options into the environment\nsource /opt/wiki/.env\n\n\n# Download the nginx config template (--location follows GitHub's redirect to the raw file)\ncurl --silent --location \\\n    \"https://github.com/pirate/wikipedia-mirror/raw/master/etc/nginx/nginx.conf.template\" \\\n    \u003e \"$CONFIG_DIR/nginx.conf.template\"\n\n# Fill your config options into nginx.conf.template to create nginx.conf\nenvsubst \\\n    \"$(printf '${%s} ' $(bash -c \"compgen -A variable\"))\"\\\n    \u003c \"$CONFIG_DIR/nginx.conf.template\" \\\n    \u003e \"$CONFIG_DIR/nginx.conf\"\n```\n\u003c!-- {% endraw %} --\u003e\n\n### Run Nginx via systemd\n```bash\n# Link your nginx.conf into the system's default nginx config location\nln -s -f \"$CONFIG_DIR/nginx.conf\" \"/etc/nginx/nginx.conf\"\n\n# Restart nginx to load the new config\nsystemctl restart 
nginx\n```\n\nNow you can visit https://en.yourdomainhere.com to see it in action with HTTPS!\n\nFor troubleshooting, you can find the nginx logs here:\n  `/opt/wiki/data/logs/nginx.err`\n  `/opt/wiki/data/logs/nginx.out`\n\n### Run Nginx via docker-compose\n\nSet the config values in your `/opt/wiki/.env` file to match the hostname of the docker container you want to proxy, and change the directory paths to the paths inside the container, e.g. for `mediawiki`:\n```bash\nUPSTREAM_HOST=mediawiki:8888\nUPSTREAM_WIKI=mediawiki:8888\nUPSTREAM_MEDIA=upload.wikimedia.org\n\nCERTS_DIR=/certs\nCACHE_DIR=/cache\nLOGS_DIR=/logs\n```\n\nThen regenerate your `nginx.conf` file with `envsubst` as described in [Nginx Reverse Proxy](#Nginx-Reverse-Proxy) above.\n\nThen add the `nginx` service to your existing `/opt/wiki/docker-compose.yml` file:\n```yaml\nversion: '3'\nservices:\n    \n  ...\n\n  nginx:\n    image: nginx:latest\n    volumes:\n      - ./etc/nginx/nginx.conf:/etc/nginx/nginx.conf\n      - ./data/certs:/certs\n      - ./data/cache:/cache\n      - ./data/logs:/logs\n    ports:\n      - 80:80\n      - 443:443\n```\n\n---\n\n# Further Reading\n\n- https://github.com/openzim/mwoffliner (archiving only, no serving)\n- https://www.yunqa.de/delphi/products/wikitaxi/index (Windows only)\n- https://www.nongnu.org/wp-mirror/ (last updated in 2014, [Dockerfile](https://github.com/futpib/docker-wp-mirror/blob/master/Dockerfile))\n- https://github.com/dustin/go-wikiparse\n- https://www.learn4master.com/tools/python-and-java-libraries-to-parse-wikipedia-dump-dataset\n- https://dkpro.github.io/dkpro-jwpl/\n- https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c\n- https://meta.wikimedia.org/wiki/Data_dumps/Import_examples#Import_into_an_empty_wiki_of_a_subset_of_en_wikipedia_on_Linux_with_MySQL\n- https://github.com/shimondoodkin/wikipedia-dump-import-script/blob/master/example-result.sh\n- 
https://github.com/wayneworkman/wikipedia-importing-tools\n- https://github.com/chrisbo246/mediawiki-loader\n- https://dzone.com/articles/how-clone-wikipedia-and-index\n- https://www.xarg.org/2016/06/importing-entire-wikipedia-into-mysql/\n- https://dengruo.com/blog/running-mediawiki-your-own-copy-restore-whole-mediwiki-backup\n- https://brionv.com/log/2007/10/02/wiki-data-dumps/\n- https://www.evanjones.ca/software/wikipedia2text.html\n- https://lists.gt.net/wiki/wikitech/160482\n- https://helpful.knobs-dials.com/index.php/Harvesting_wikipedia\n- https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpirate%2Fwikipedia-mirror","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpirate%2Fwikipedia-mirror","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpirate%2Fwikipedia-mirror/lists"}