https://github.com/anorprogrammer/wiker
Library for Wikipedia dataset collection
beautifulsoup4 dataset requests wiki wikipedia
- Host: GitHub
- URL: https://github.com/anorprogrammer/wiker
- Owner: anorprogrammer
- License: MIT
- Created: 2023-01-09T07:37:08.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-01-17T12:52:19.000Z (about 3 years ago)
- Last Synced: 2025-12-15T10:32:40.487Z (4 months ago)
- Topics: beautifulsoup4, dataset, requests, wiki, wikipedia
- Language: Python
- Homepage: https://pypi.org/project/wiker/
- Size: 20.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Wiker
A library for collecting Wikipedia text datasets.
# Installation
```
pip install wiker
```
# Quickstart
**Warning:**
_Before running the code, create `data` and `extra` folders inside the project folder, and create empty `pre_urls.txt` and `post_urls.txt` files inside the `extra` folder._

File structure:
```
my-app/
├─ data/
├─ extra/
│ ├─ pre_urls.txt
│ ├─ post_urls.txt
├─ main.py # your file
```
```python
from wiker import Wiker
wk = Wiker(lang='uz', first_article_link="Turkiston")
wk.run(scrape_limit=50)
```
### Other methods
```python
from wiker import Wiker
wk = Wiker(lang='uz', first_article_link="Turkiston")
wk.reader()  # read pre_urls.txt and return its links as a list
wk.read_url_count()  # count the links stored in pre_urls.txt
wk.extra_file_writer()  # if pre_urls.txt is empty, write first_article_link to it
wk.scraper()  # fetch the articles behind the links in pre_urls.txt
wk.text_cleaner()  # strip HTML and other markup from the fetched articles
wk.next_urls()  # collect links for further scraping
wk.dir_scanner()  # scan the "data" folder and count its files
wk.cleaned_text_writer(text_dict=wk.text_cleaner())  # write the cleaned texts to files in the "data" folder
wk.post_url_writer(url_list=wk.scraper().keys())  # record the names of the saved articles in post_urls.txt
wk.pre_url_writer(url_list=wk.next_urls())  # write the links from next_urls() to pre_urls.txt for the next run
```
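The README does not show how `text_cleaner()` works internally; wiker's dependencies suggest it uses beautifulsoup4. The core idea, stripping markup and keeping the readable text, can be sketched with the standard library alone (this is an illustration of the technique, not wiker's actual implementation):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only non-empty text nodes, trimmed of surrounding whitespace.
        if data.strip():
            self.parts.append(data.strip())

def clean_html(html: str) -> str:
    """Return the plain text of an HTML fragment, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

print(clean_html("<p>Turkiston is a <b>city</b> in Kazakhstan.</p>"))
# → Turkiston is a city in Kazakhstan.
```

A real cleaner would also discard `<script>` and `<style>` contents and normalise whitespace more carefully, which is where beautifulsoup4 earns its keep.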