https://github.com/anorprogrammer/wiker
Library for Wikipedia dataset collection
beautifulsoup4 dataset requests wiki wikipedia
- Host: GitHub
- URL: https://github.com/anorprogrammer/wiker
- Owner: anorprogrammer
- License: MIT
- Created: 2023-01-09T07:37:08.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-01-17T12:52:19.000Z (about 3 years ago)
- Last Synced: 2025-12-15T10:32:40.487Z (4 months ago)
- Topics: beautifulsoup4, dataset, requests, wiki, wikipedia
- Language: Python
- Homepage: https://pypi.org/project/wiker/
- Size: 20.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Wiker
A library for collecting Wikipedia text datasets.
# Installation
```
pip install wiker
```
# Quickstart
**Warning:**
_Before running the code, create `data` and `extra` folders inside the project folder, and create empty `pre_urls.txt` and `post_urls.txt` files inside the `extra` folder._

File structure:
```
my-app/
├─ data/
├─ extra/
│ ├─ pre_urls.txt
│ ├─ post_urls.txt
├─ main.py # your file
```
```python
from wiker import Wiker
wk = Wiker(lang='uz', first_article_link="Turkiston")
wk.run(scrape_limit=50)
```
### Other methods
```python
from wiker import Wiker
wk = Wiker(lang='uz', first_article_link="Turkiston")
wk.reader()  # read pre_urls.txt and return its links as a list
wk.read_url_count()  # count the links stored in pre_urls.txt
wk.extra_file_writer()  # if pre_urls.txt is empty, write first_article_link to it
wk.scraper()  # fetch the articles behind the links in pre_urls.txt
wk.text_cleaner()  # strip HTML and other markup from the fetched articles
wk.next_urls()  # collect links for further scraping
wk.dir_scanner()  # scan the "data" folder and count its files
wk.cleaned_text_writer(text_dict=wk.text_cleaner())  # write the cleaned texts to files in the "data" folder
wk.post_url_writer(url_list=wk.scraper().keys())  # record the names of the saved articles in post_urls.txt
wk.pre_url_writer(url_list=wk.next_urls())  # write the links from next_urls() to pre_urls.txt for the next run
```
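The README does not show how `text_cleaner()` works internally; wiker's dependencies suggest it uses beautifulsoup4. The core idea, stripping markup and keeping the readable text, can be sketched with the standard library alone (this is an illustration of the technique, not wiker's actual implementation):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only non-empty text nodes, trimmed of surrounding whitespace.
        if data.strip():
            self.parts.append(data.strip())

def clean_html(html: str) -> str:
    """Return the plain text of an HTML fragment, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

print(clean_html("<p>Turkiston is a <b>city</b> in Kazakhstan.</p>"))
# → Turkiston is a city in Kazakhstan.
```

A real cleaner would also discard `<script>` and `<style>` contents and normalise whitespace more carefully, which is where beautifulsoup4 earns its keep.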