https://github.com/openzim/wp1_selection_tools
Create selections with the best articles of a WM project
- Host: GitHub
- URL: https://github.com/openzim/wp1_selection_tools
- Owner: openzim
- License: gpl-3.0
- Created: 2016-01-31T10:01:52.000Z (about 10 years ago)
- Default Branch: main
- Last Pushed: 2025-01-10T14:49:52.000Z (over 1 year ago)
- Last Synced: 2025-04-06T08:01:54.528Z (about 1 year ago)
- Topics: selection, wikipedia, wp1
- Language: Perl
- Homepage: https://download.kiwix.org/wp1/
- Size: 7.2 MB
- Stars: 6
- Watchers: 3
- Forks: 3
- Open Issues: 11
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
The **WP1 Selection tools** gather and compile multiple indicators to
provide [Wikipedia](http://wikipedia.org) article subset
selections. It was created for the [Wikipedia
1.0](https://en.wikipedia.org/wiki/Wikipedia:1) project and is
complementary to the [WP1 engine](https://github.com/openzim/wp1).
The results are made available at
[https://download.openzim.org/wp1](https://download.openzim.org/wp1).
[CodeFactor](https://www.codefactor.io/repository/github/openzim/wp1_selection_tools)
[License: GPL-3.0](https://www.gnu.org/licenses/gpl-3.0)
Requirements
------------
To run it, you need:
* MANDATORY: a GNU/Linux system
* MANDATORY: access to the Internet
* MANDATORY: access to a Wikipedia database
* OPTIONAL: access to the enwp10 rating database for Wikipedia in English
Context
-------
Many Wikipedias, in different languages, have more than 500,000
articles, and even if we can provide offline versions of a reasonable
size, this is still too much for many devices. That's why we need to
build offline versions containing only a selection of the best
articles.
Principle
---------
This tool builds lists of key values (pageviews, links, ...) about
Wikipedia articles and puts them in a directory. These key values are
everything we have as input to build smart selection algorithms. To
get more details about the lists, read the README in the
language-specific directory.
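As a purely hypothetical illustration (the real file names and layout
are documented in each language directory's README, not here), such a
list could be a tab-separated file mapping article titles to a metric:

```bash
# Hypothetical peek at a pageviews list; file name, format and values
# are illustrative only, not the tool's actual output layout.
$ head -3 en/pageviews
Earth	1453270
Moon	982114
Mars	877032
```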
Tools
-----
* `build_biggest_wikipedia_list.sh` gives you the list of all
  Wikipedia languages with more than 500,000 entries.
* `build_selections.sh` takes a language code ('en' for example) as
  first argument and creates the directory with all the key values
  (see the usage sketch after this list).
* `build_all_selections.sh` builds/uploads the lists for all
  Wikipedias with more than 500,000 pages.
* `build_en_vital_articles_list.sh` generates the list of Wikipedia in
  English vital articles
  (https://en.wikipedia.org/wiki/Wikipedia:Vital_articles).
* `build_custom_selections.sh` generates selections which need custom
  (non-standard) handling.
* `build_projects_lists.pl` generates the lists for projects with
  articles sorted (in reverse order) by score. Works only for
  Wikipedia in English.
* `build_translated_list.pl` translates a list into the given language
  based on Wikipedia in English language links and local language
  scores.
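For example, a typical run might look like the following sketch (the
invocations are inferred from the descriptions above; exact options
may differ):

```bash
# Build the key-value lists for Wikipedia in English; the script
# creates the output directory itself.
./build_selections.sh en

# Or list every Wikipedia with more than 500,000 entries and then
# build/upload the selections for all of them.
./build_biggest_wikipedia_list.sh
./build_all_selections.sh
```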
Download
--------
You can download the output of these scripts directly from
download.kiwix.org/wp1/ using FTP, HTTP(S) or rsync.
You might be interested in downloading only the latest version; here
is a small command (based on rsync) to retrieve the right directory
names.
```bash
for ENTRY in $(rsync --recursive --list-only download.kiwix.org::download.kiwix.org/wp1/ | tr -s ' ' | cut -d ' ' -f5 | grep wiki | grep -v '/' | sort -r)
do
    RADICAL=$(echo "$ENTRY" | sed 's/_20[0-9][0-9]-[0-9][0-9]//g')
    if [[ $LAST != $RADICAL ]]
    then
        echo "$ENTRY"
        LAST=$RADICAL
    fi
done
```
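Once the loop above has printed the latest directory names, a plain
rsync fetches one of them. The directory name below is a hypothetical
example; use a name actually printed by the loop:

```bash
# Download one selection directory (name is hypothetical).
rsync --archive --verbose \
    download.kiwix.org::download.kiwix.org/wp1/enwiki_2025-01/ \
    enwiki_2025-01/
```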
VPS
---
To run it on a VPS via Docker:
```bash
docker run -d --name wp1_selection_tools \
    -v /srv/wp1_selection_tools/data:/data \
    -v /srv/wp1_selection_tools/.ssh/:/root/.ssh \
    -v /srv/wp1_selection_tools/replica.my.cnf:/root/replica.my.cnf \
    ghcr.io/openzim/wp1_selection_tools
```
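Once the container is running, standard Docker commands can be used to
monitor it, for instance:

```bash
# Follow the build logs.
docker logs --follow wp1_selection_tools

# Open a shell inside the container (assuming the image ships bash).
docker exec -it wp1_selection_tools bash
```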
License
-------
[GPLv3](https://www.gnu.org/licenses/gpl-3.0) or later, see
[LICENSE](LICENSE) for more details.