Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/openzim/mwoffliner
Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
archive mediawiki nodejs offline openzim scraper wikipedia zim
- Host: GitHub
- URL: https://github.com/openzim/mwoffliner
- Owner: openzim
- License: gpl-3.0
- Created: 2016-02-08T01:01:27.000Z (almost 9 years ago)
- Default Branch: main
- Last Pushed: 2025-01-23T00:26:47.000Z (4 days ago)
- Last Synced: 2025-01-23T05:47:32.657Z (4 days ago)
- Topics: archive, mediawiki, nodejs, offline, openzim, scraper, wikipedia, zim
- Language: TypeScript
- Homepage: https://www.npmjs.com/package/mwoffliner
- Size: 10.1 MB
- Stars: 309
- Watchers: 19
- Forks: 79
- Open Issues: 205
Metadata Files:
- Readme: README.md
- Changelog: Changelog
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# MWoffliner
MWoffliner is a tool for making a local offline HTML snapshot of any
online [MediaWiki](https://mediawiki.org) instance. It goes through
all online articles (or a selection if specified) and creates the
corresponding [ZIM](https://openzim.org) file. It has mainly been
tested against Wikimedia projects like
[Wikipedia](https://wikipedia.org) and
[Wiktionary](https://wiktionary.org), but it should also work for
any recent MediaWiki.

Read [CONTRIBUTING.md](./CONTRIBUTING.md) to learn more about
MWoffliner development.

User help is available in the form of a
[FAQ](https://github.com/openzim/mwoffliner/wiki/Frequently-Asked-Questions).

[![NPM](https://nodei.co/npm/mwoffliner.png)](https://www.npmjs.com/package/mwoffliner)
[![npm](https://img.shields.io/npm/v/mwoffliner.svg)](https://www.npmjs.com/package/mwoffliner)
[![node](https://img.shields.io/node/v/mwoffliner.svg)](https://www.npmjs.com/package/mwoffliner)
[![Docker](https://ghcr-badge.egpl.dev/openzim/mwoffliner/latest_tag?label=container)](https://ghcr.io/openzim/mwoffliner)
[![Build Status](https://github.com/openzim/mwoffliner/workflows/CI/badge.svg?query=branch%3Amain)](https://github.com/openzim/mwoffliner/actions/workflows/ci.yml?query=branch%3Amain)
[![codecov](https://codecov.io/gh/openzim/mwoffliner/branch/main/graph/badge.svg)](https://codecov.io/gh/openzim/mwoffliner)
[![CodeFactor](https://www.codefactor.io/repository/github/openzim/mwoffliner/badge)](https://www.codefactor.io/repository/github/openzim/mwoffliner)
[![License](https://img.shields.io/npm/l/mwoffliner.svg)](LICENSE)
[![Join Slack](https://img.shields.io/badge/Join%20us%20on%20Slack%20%23mwoffliner-2EB67D)](https://slack.kiwix.org)

## Features
- Scrape with or without image thumbnail
- Scrape with or without audio/video multimedia content
- S3 cache (optional)
- Image size optimiser / Webp converter
- Scrape all articles in the selected namespaces, or only those in a title list
- Specify additional/non-main namespaces to scrape

Run `mwoffliner --help` to get all the possible options.
## Prerequisites
- *NIX Operating System (GNU/Linux, macOS, ...)
- [Redis](https://redis.io/)
- [NodeJS](https://nodejs.org/en/) version 16 or greater
- [Libzim](https://github.com/openzim/libzim) (On GNU/Linux & macOS we automatically download it)
- Various build tools which are probably already installed on your
  machine (packages `libjpeg-dev`, `libglu1`, `autoconf`, `automake`, `gcc` on
  Debian/Ubuntu)

... and an online MediaWiki with its API available.
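
Two of the prerequisites above, the Node.js version and the operating system, can be checked from a short script before installing. The following is a minimal sketch, not part of MWoffliner: the names `MIN_NODE_MAJOR`, `nodeMajor` and `checkPrerequisites` are hypothetical, and treating `win32` as the only unsupported platform is an assumption. It does not check Redis, Libzim or the build tools.

```javascript
// Minimum Node.js major version, taken from the prerequisites list.
const MIN_NODE_MAJOR = 16;

// Extract the major component from a version string such as "18.19.0" or "v18.19.0".
function nodeMajor(version) {
  return Number.parseInt(version.replace(/^v/, "").split(".")[0], 10);
}

// Return a list of human-readable problems; an empty list means both checks passed.
// Defaults read the running interpreter; arguments allow checking other values.
function checkPrerequisites(version = process.versions.node, platform = process.platform) {
  const problems = [];
  // Written as a negated >= so an unparsable version (NaN) is also reported.
  if (!(nodeMajor(version) >= MIN_NODE_MAJOR)) {
    problems.push(`Node.js ${version} is older than ${MIN_NODE_MAJOR}`);
  }
  // Assumption: anything other than Windows counts as a *NIX system here.
  if (platform === "win32") {
    problems.push("a *NIX operating system is required");
  }
  return problems;
}

const problems = checkPrerequisites();
if (problems.length > 0) {
  console.error(`Prerequisite check failed: ${problems.join("; ")}`);
  process.exitCode = 1;
} else {
  console.log("Node.js version and platform look fine.");
}
```

Saved as, say, `check.js`, it can be run with `node check.js`; a non-zero exit code signals a failed check.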
## Usage
To install MWoffliner globally:
```bash
npm i -g mwoffliner
```

You might need to run this command with `sudo`, depending on
how your `npm` is configured.

`npm` permission checking can be a bit annoying for a
newcomer. Please read the documentation carefully if you hit
problems: https://docs.npmjs.com/cli/v7/using-npm/scripts#user

Then to run it:
```bash
mwoffliner --help
```

To install and run it locally:
```bash
npm i
npm run mwoffliner -- --help
```

To use MWoffliner with an S3 cache, provide an S3 URL like
this:
```bash
--optimisationCacheUrl="https://wasabisys.com/?bucketName=my-bucket&keyId=my-key-id&secretAccessKey=my-sac"
```

## API
MWoffliner also provides an API and can therefore be used as a NodeJS
library. Here is a stub example that could go in your `index.mjs` file:
```javascript
import * as mwoffliner from 'mwoffliner';

const parameters = {
    mwUrl: "https://es.wikipedia.org",
    adminEmail: "[email protected]",
    verbose: true,
    format: "nopic",
    articleList: "./articleList"
};
mwoffliner.execute(parameters); // returns a Promise
```

## Background
Complementary information about MWoffliner:
* MediaWiki software is used by thousands of wikis, the most
famous ones being the Wikimedia ones, including [Wikipedia](https://wikipedia.org).
* MediaWiki is a PHP wiki runtime engine.
* Wikitext is the name of the markup language that MediaWiki uses.
* MediaWiki includes a parser that converts Wikitext into HTML, and this
parser creates the HTML pages displayed in your browser.

### GNU/Linux - Debian based distributions
Install NodeJS:
Read https://nodejs.org/en/download/current/

Install Redis:
```bash
sudo apt-get install redis-server
```

## Troubleshooting
Older GNU/Linux distributions and/or versions of Node.js might be
shipped with a deprecated version of `npm`. Older versions of `npm`
have incompatibilities with certain versions of Node.js and might
simply fail to install the `mwoffliner` package.

We recommend using a recent version of `npm`. Recent versions
work fine with older Node.js releases such as Node.js 10. Install the packaged
version of `npm` and then use it to install a newer version like:

```bash
sudo npm install --unsafe-perm -g npm
```

Don't forget to remove the packaged version of `npm` afterward.

License
-------
[GPLv3](https://www.gnu.org/licenses/gpl-3.0) or later, see
[LICENSE](LICENSE) for more details.