https://github.com/andreiregiani/wikipedia-crawler

Extracts plain-text from Wikipedia articles, ideal to perform linguistic analysis on a specific topic

Host: GitHub
URL: https://github.com/andreiregiani/wikipedia-crawler
Owner: AndreiRegiani
License: gpl-3.0
Created: 2016-10-04T13:30:38.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2025-02-04T10:52:29.000Z (4 months ago)
Last Synced: 2025-04-30T20:25:28.927Z (29 days ago)
Language: Python
Homepage:
Size: 204 KB
Stars: 39
Watchers: 2
Forks: 13
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# wikipedia-crawler
Extracts plain text from a series of Wikipedia articles and saves it to a local text file.

The goal is to collect text samples in a specific language on a specific topic, so the output can be used for computational linguistic analysis (word frequency, distribution, etc.), **or to generate wordlists for any language on Wikipedia (294 languages)**.
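As an illustration of the word-frequency and wordlist use cases, here is a minimal sketch that works on any plain text such as the crawler's output. The function names `word_frequencies` and `wordlist` are hypothetical and not part of this project; the tokenization rule (`\w+`) is an assumption that suits space-separated languages only.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count case-insensitive word occurrences in a block of plain text."""
    # \w+ matches runs of Unicode letters, digits, and underscores, so it
    # handles non-English alphabets, but it does not segment languages
    # written without spaces between words.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


def wordlist(text):
    """Return the distinct words in sorted order."""
    return sorted(word_frequencies(text))


# In practice `sample` would be the contents of the crawler's output file.
sample = "Biology is the study of life. Life is diverse."
print(word_frequencies(sample).most_common(3))
print(wordlist(sample))
```

For real use, read the output file with `open("output.txt", encoding="utf-8").read()` and pass the result to either function.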

## Usage:
```
python3 wikipedia-crawler.py https://en.wikipedia.org/wiki/Biology
```
This extracts a single article and writes it to `output.txt`. **To crawl further, add these parameters:**
```
--articles=10 --interval=5 --output=biology.txt
```

Generates [`biology.txt`](https://raw.githubusercontent.com/AndreiRegiani/wikipedia-crawler/master/example_output/biology_english.txt) by crawling `10` articles related to `Biology`. The interval between requests is set to `5` seconds (the default) to avoid abusing Wikipedia's servers.
A session log containing all visited URLs is saved as `session_biology.txt`. Running again with the same output file reuses the same session file.
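The session-file behaviour can be sketched as follows. This is not the project's code: the helper names are hypothetical, and both the `session_` prefix rule and the one-URL-per-line file format are assumptions inferred from the `session_biology.txt` example.

```python
import os


def session_path(output_path):
    """Derive the session log path from the output path (assumed rule)."""
    directory, filename = os.path.split(output_path)
    return os.path.join(directory, "session_" + filename)


def load_visited(path):
    """Return the set of URLs already logged, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def record_visit(path, url):
    """Append one visited URL to the session log (assumed one per line)."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(url + "\n")
```

Because the session path depends only on the output name, a second run with `--output=biology.txt` would find the earlier log and could skip URLs it has already visited.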

In this example the initial article is [Biology](https://en.wikipedia.org/wiki/Biology). From there the crawler continues by extracting related pages: [Natural Science](https://en.wikipedia.org/wiki/Natural_science), [Evolution](https://en.wikipedia.org/wiki/Evolution), ...
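The crawl described above can be pictured with the following sketch. It is an illustration under stated assumptions, not the project's implementation: it uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without dependencies, it assumes related pages are the `/wiki/` links found inside paragraphs, and it assumes a breadth-first visiting order. `ArticleParser`, `crawl`, and the `fetch` callback are hypothetical names.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArticleParser(HTMLParser):
    """Collects paragraph text and internal /wiki/ links from one page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.links = []
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")
        elif tag == "a" and self._in_paragraph:
            href = dict(attrs).get("href") or ""
            # Assumption: skip namespaced pages such as File: or Help:.
            if href.startswith("/wiki/") and ":" not in href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data


def crawl(start_url, fetch, max_articles=1, interval=5):
    """Visit up to max_articles pages; return (visited URLs, joined text)."""
    queue, visited, texts = deque([start_url]), [], []
    while queue and len(visited) < max_articles:
        url = queue.popleft()
        if url in visited:
            continue
        if visited:
            time.sleep(interval)  # pause between consecutive requests
        parser = ArticleParser()
        parser.feed(fetch(url))  # fetch(url) must return the page HTML
        visited.append(url)
        texts.extend(p.strip() for p in parser.paragraphs if p.strip())
        queue.extend(urljoin(url, href) for href in parser.links)
    return visited, "\n".join(texts)


# Offline demonstration with two hand-written pages instead of real requests.
pages = {
    "https://en.wikipedia.org/wiki/Biology":
        '<p>Biology is a <a href="/wiki/Natural_science">natural science</a>.</p>',
    "https://en.wikipedia.org/wiki/Natural_science":
        "<p>Natural science studies nature.</p>",
}
visited, text = crawl(
    "https://en.wikipedia.org/wiki/Biology",
    pages.__getitem__,
    max_articles=2,
    interval=0,
)
print(visited)
print(text)
```

To run this against the live site, `fetch` could be `lambda url: requests.get(url).text`, keeping the interval at its 5-second default.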

## Dependencies:
* [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/)
* [Requests](http://docs.python-requests.org/)

```
pip install -r requirements.txt
```
