# pyqt-wikipedia-crawler
A Wikipedia-crawling Python desktop application powered by BeautifulSoup4.

Only requests and beautifulsoup4 are required to use it as a CUI.

If you want to use the GUI, you also have to install PyQt5.
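Since PyQt5 is only needed for the GUI path, a common pattern for this kind of optional dependency is a guarded import. This is an illustrative sketch, not code from the repository:

```python
# Hypothetical sketch: make PyQt5 optional so the CUI still works without it.
try:
    from PyQt5.QtWidgets import QApplication
    HAS_GUI = True
except ImportError:
    # PyQt5 not installed — fall back to CUI-only mode.
    HAS_GUI = False

print("GUI available:", HAS_GUI)
```

With this guard, a CUI-only install (requests + beautifulsoup4) keeps working, and the script can raise a clear error only when the user actually asks for the GUI.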

## Requirements
* requests
* beautifulsoup4
* PyQt5 >= 5.14

## How To Run
1. Clone this repo.
2. Run `pip install -r requirements.txt`.
3. In the src folder you will find script.py, which you can run right away. Sample code is at the very bottom of the script.
4. To use the GUI, run main.py:
```
python main.py
```

## Method Overview (CUI only)
```python
wikidoc_to_txt(wiki_lang, doc_name, save_dir=None)  # download a single document
wikicate_to_txt(wiki_lang, category, save_dir=None, max_len=None)  # download every document in a given category
```

Both methods are pretty self-explanatory.

You can download documents from Wikipedia and save them to a local directory as text files.

There are two types of download: a single document, or an entire category. With the latter, every document in the category is crawled and each one is saved as a separate text file.
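The core of such a document-to-text step is fetching a page's HTML and keeping only the paragraph text. The sketch below illustrates the idea using only the standard library; the project itself uses requests and BeautifulSoup4, and the helper names here are hypothetical, not the repository's API:

```python
# Minimal sketch of an HTML-to-txt step (illustrative; the real project
# fetches the page with requests and parses it with BeautifulSoup4).
import os
import tempfile
from html.parser import HTMLParser
from pathlib import Path

class ParagraphText(HTMLParser):
    """Collect the text inside <p> elements — roughly what a wiki page's body is."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.parts.append(data)

def html_to_txt(html, save_path):
    """Extract paragraph text from HTML and save it as a text file."""
    parser = ParagraphText()
    parser.feed(html)
    text = "".join(parser.parts).strip()
    Path(save_path).write_text(text, encoding="utf-8")
    return text

sample = "<h1>Title</h1><p>Hello, wiki.</p>"
out = os.path.join(tempfile.gettempdir(), "doc.txt")
print(html_to_txt(sample, out))  # → Hello, wiki.
```

For a category download, the same step would simply be repeated for every page linked from the category page, writing one file per document.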

## GUI Preview
![image](https://github.com/yjg30737/pyqt-wikipedia-crawler/assets/55078043/62481f73-8c4b-4b79-92ae-372e1c3305c5)

## Note
If you want to add a new language, you can do so in script.py.

Find the "ADD YOUR CODE HERE" comment and add your language code to the list.
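The edit would look something like the following. This is a hypothetical illustration — the real list and marker live in src/script.py, and the variable name here is invented:

```python
# Hypothetical sketch of the language-code list in script.py.
# Codes correspond to Wikipedia subdomains, e.g. en.wikipedia.org.
WIKI_LANGS = ["en", "ko", "ja"]  # example subset, not the actual list

# ADD YOUR CODE HERE — append your language's Wikipedia subdomain code:
WIKI_LANGS.append("fr")  # adds French (fr.wikipedia.org)

print(WIKI_LANGS)
```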