https://github.com/andreiregiani/wikipedia-crawler
Extracts plain-text from Wikipedia articles, ideal to perform linguistic analysis on a specific topic
- Host: GitHub
- URL: https://github.com/andreiregiani/wikipedia-crawler
- Owner: AndreiRegiani
- License: gpl-3.0
- Created: 2016-10-04T13:30:38.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2025-02-04T10:52:29.000Z (4 months ago)
- Last Synced: 2025-04-30T20:25:28.927Z (29 days ago)
- Language: Python
- Homepage:
- Size: 204 KB
- Stars: 39
- Watchers: 2
- Forks: 13
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# wikipedia-crawler
Extracts plain text from a series of Wikipedia articles and saves it to a local text file. The goal is to collect text samples of a specific language on a specific topic, so the output can be used for computational analysis applied to linguistics (word frequency, distribution, etc.), **or to generate wordlists for any language on Wikipedia (294 languages)**.
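As a sketch of the downstream analysis the project is aimed at, a word-frequency count over the crawler's output can be as simple as the following. The sample text and tokenization here are illustrative, not part of the crawler itself; in practice you would read the generated `output.txt`.

```python
import re
from collections import Counter

# Stand-in for the crawler's output; in practice:
# text = open("output.txt", encoding="utf-8").read()
text = "Biology is the natural science that studies life. Biology examines life."

# Naive tokenization: lowercase, then split on word characters.
words = re.findall(r"\w+", text.lower())
freq = Counter(words)

# The most frequent words in the sample.
print(freq.most_common(3))
```

`Counter.most_common` returns (word, count) pairs sorted by count, which is the word-frequency distribution mentioned above.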
## Usage:
```
python3 wikipedia-crawler.py https://en.wikipedia.org/wiki/Biology
```
Generates `output.txt`, extracting only a single article. **Parameters for crawling multiple articles:**
```
--articles=10 --interval=5 --output=biology.txt
```
Generates [`biology.txt`](https://raw.githubusercontent.com/AndreiRegiani/wikipedia-crawler/master/example_output/biology_english.txt), crawling `10` articles related to `Biology`. The request interval is set to `5` seconds (the default) to avoid overloading Wikipedia's servers.
A session log containing all visited URLs is saved as `session_biology.txt`; running again with the same output file reuses the same session file. In this example the initial article is [Biology](https://en.wikipedia.org/wiki/Biology), and the crawler continues extracting related pages: [Natural Science](https://en.wikipedia.org/wiki/Natural_science), [Evolution](https://en.wikipedia.org/wiki/Evolution), ...
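A hypothetical sketch of the session-log mechanism described above: visited URLs are appended, one per line, to a per-output session file and reloaded on the next run. The function names are illustrative assumptions, not the script's actual API.

```python
import os

def load_session(session_path):
    """Return the set of URLs recorded by a previous run, if any."""
    if not os.path.exists(session_path):
        return set()
    with open(session_path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def mark_visited(session_path, url, visited):
    """Record a newly crawled URL both in memory and in the session log."""
    visited.add(url)
    with open(session_path, "a", encoding="utf-8") as f:
        f.write(url + "\n")

session_path = "session_biology.txt"
visited = load_session(session_path)
url = "https://en.wikipedia.org/wiki/Biology"
if url not in visited:
    mark_visited(session_path, url, visited)
```

Because the session file is keyed to the output name, resuming with the same `--output` picks up the same visited set.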
## Dependencies:
* [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/)
* [Requests](http://docs.python-requests.org/)

```
pip install -r requirements.txt
```