Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/happybravo/wikidump_search
Tool for offline Wikipedia search with WikiDumps / Archives
- Host: GitHub
- URL: https://github.com/happybravo/wikidump_search
- Owner: HappyBravo
- Created: 2024-04-21T15:16:48.000Z (7 months ago)
- Default Branch: master
- Last Pushed: 2024-07-11T20:43:50.000Z (4 months ago)
- Last Synced: 2024-07-12T21:53:30.996Z (4 months ago)
- Topics: keyword-search, offline-search, offline-tool, pyhton3, search, utility, wikidump, wikipedia, wikipedia-dump
- Language: Jupyter Notebook
- Homepage:
- Size: 69.3 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🔍 WIKIDUMP SEARCH
An offline utility for searching 'keywords' in a downloaded [Wikipedia Archive](https://dumps.wikimedia.org/enwiki/) dump instead of calling an online Wikipedia API.
---
# 🎯 BENEFITS
- useful when you need to search for many 'keywords' in Wikipedia. Online API wrappers such as [Wikipedia](https://pypi.org/project/wikipedia/) may slow down after a few dozen calls.
- helpful if your internet connection is slow, since the search runs entirely offline.
- uses very few local resources.
---
# 🛠️ REQUIREMENTS
- tested on Python 3.11
- [Wikipedia](https://pypi.org/project/wikipedia/)
- [FuzzyWuzzy](https://pypi.org/project/fuzzywuzzy/)
- [BeautifulSoup](https://pypi.org/project/beautifulsoup4/)
- [tqdm](https://pypi.org/project/tqdm/)
- [joblib](https://pypi.org/project/joblib/)
- at least 25 GB of free storage space

Alternatively, you can install the Python packages using `pip install -r "./requirements.txt"`.

You also need to download one dump from this [wiki-archive page](https://dumps.wikimedia.org/enwiki/).

---
# ⚙️ SETUP
Download the following files:
- `enwiki-{date}-pages-articles-multistream.xml.bz2` (~23 GB)
- `enwiki-{date}-pages-articles-multistream-index.txt.bz2` (~250 MB)
  - Extract this file. It will contain `enwiki-{date}-pages-articles-multistream-index.txt` (~1.2 GB)

The paths to these files are required when initializing the offline wiki class.

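To illustrate why both files are needed, here is a minimal sketch of how a multistream dump can be searched without decompressing all of it. It is not this repository's implementation: the function names `parse_index_line`, `find_offsets`, and `read_stream` are hypothetical, and the sketch relies only on the documented multistream layout, in which each index line has the form `offset:page_id:title` and `offset` is the byte position of an independent bz2 stream inside the `.xml.bz2` file.

```python
import bz2


def parse_index_line(line: str) -> tuple[int, int, str]:
    """Split one index line of the form 'offset:page_id:title'.

    Titles may contain colons themselves, so split at most twice.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def find_offsets(index_lines, keyword: str) -> dict[str, int]:
    """Map each title containing `keyword` (case-insensitive) to its stream offset."""
    needle = keyword.casefold()
    hits = {}
    for line in index_lines:
        offset, _page_id, title = parse_index_line(line)
        if needle in title.casefold():
            hits[title] = offset
    return hits


def read_stream(dump_path: str, offset: int, chunk_size: int = 256 * 1024) -> str:
    """Decompress only the bz2 stream that starts at byte `offset`.

    The result is an XML fragment holding the <page> elements of that stream.
    """
    decompressor = bz2.BZ2Decompressor()
    parts = []
    with open(dump_path, "rb") as dump:
        dump.seek(offset)
        # Stop as soon as the decompressor reports the end of this stream.
        while not decompressor.eof:
            chunk = dump.read(chunk_size)
            if not chunk:
                break
            parts.append(decompressor.decompress(chunk))
    return b"".join(parts).decode("utf-8")


# Hypothetical usage, with {date} replaced by the date of your downloaded dump:
# with open("enwiki-{date}-pages-articles-multistream-index.txt", encoding="utf-8") as idx:
#     hits = find_offsets(idx, "python")
# for title, offset in hits.items():
#     xml_fragment = read_stream("enwiki-{date}-pages-articles-multistream.xml.bz2", offset)
```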
---
# 📝 EXAMPLE
See [testing.ipynb](./testing.ipynb)