Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zawlinnnaing/my-wiki-crawler
A simple program for crawling Burmese Wikipedia using the MediaWiki API.
crawler myanmar-tools python wikipedia-api
- Host: GitHub
- URL: https://github.com/zawlinnnaing/my-wiki-crawler
- Owner: zawlinnnaing
- License: mit
- Created: 2020-06-06T13:38:32.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T04:20:11.000Z (about 2 years ago)
- Last Synced: 2024-11-06T08:31:22.418Z (2 months ago)
- Topics: crawler, myanmar-tools, python, wikipedia-api
- Language: Python
- Size: 18.6 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: License.md
# Myanmar Wiki Crawler
![](https://img.shields.io/badge/python-3.6-blue.svg)
![](https://img.shields.io/badge/License-MIT-green)

A simple program for crawling [Burmese Wikipedia](https://my.wikipedia.org) using the MediaWiki API. It queries pages starting from "က" and crawls **sequentially** until the specified size is reached or there are no more pages to crawl.
[TOC]
## Getting started
Install requirements and you are good to go.
```shell script
pip install -r requirements.txt
```

## Step-by-step Procedure
- The program first queries the MediaWiki API for page titles in batches (500 pages per batch, the maximum page limit allowed by MediaWiki).
- It then uses these titles to make an HTML request for each individual page and collects the text from that page's content field.
- It segments the text into sentences and uses a regex to keep only Burmese characters (Unicode U+1000 to U+1100) before writing the text to a file.
- It stores one text file per batch, using the batch index, which starts from 0.
- Using `meta.json`, the program automatically resumes from the last batch it saved before stopping.
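
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `extract.py`: it assumes titles are listed with the MediaWiki `list=allpages` module, that sentences end at the Burmese full stop "။" (U+104B), and that `meta.json` holds a hypothetical `next_batch` key. The function names and the `<batch index>.txt` file naming are also assumptions made for the example. No network request is made here.

```python
import json
import re
from pathlib import Path

# Character range taken from the description above (U+1000 to U+1100);
# whitespace is kept so words stay separated.
NON_BURMESE = re.compile(r"[^\u1000-\u1100\s]")
# Assumption: sentences are delimited by the Burmese full stop (U+104B).
SENTENCE_END = "\u104b"


def allpages_params(start="က", continue_token=None, limit=500):
    """Build query parameters for one batch of page titles (assumed module: allpages)."""
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfrom": start,
        "aplimit": limit,
    }
    if continue_token is not None:
        # Continuation value returned by the previous response, if any.
        params["apcontinue"] = continue_token
    return params


def burmese_sentences(text):
    """Split text into sentences and strip every non-Burmese character."""
    sentences = []
    for raw in text.split(SENTENCE_END):
        cleaned = NON_BURMESE.sub("", raw)
        cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
        if cleaned:
            sentences.append(cleaned + SENTENCE_END)
    return sentences


def load_batch_index(meta_path):
    """Return the batch index to resume from, or 0 when no meta file exists."""
    path = Path(meta_path)
    if not path.exists():
        return 0
    meta = json.loads(path.read_text(encoding="utf-8"))
    return meta.get("next_batch", 0)  # hypothetical key name


def save_batch(output_dir, meta_path, batch_index, sentences):
    """Write one text file for the batch, then record where to resume."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    batch_file = out / f"{batch_index}.txt"
    batch_file.write_text("\n".join(sentences) + "\n", encoding="utf-8")
    Path(meta_path).write_text(
        json.dumps({"next_batch": batch_index + 1}), encoding="utf-8"
    )
```

Writing the meta file only after the batch file is saved means an interrupted run repeats at most one batch instead of skipping it.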
## Usage
```
python extract.py -h
usage: extract.py [-h] [-l LOG_DIR] [--max_size MAX_SIZE]
                  [--output_dir OUTPUT_DIR]

Web Crawler for Burmese wiki.

optional arguments:
  -h, --help            show this help message and exit
  -l LOG_DIR, --log_dir LOG_DIR
                        Specify logs directory for errors (default: logs)
  --max_size MAX_SIZE   Specify max size (in MB) to crawl wiki. (default: 1000)
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        Output directory for storing corpus (default: results)
```
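
For reference, a parser that produces help text of this shape could look like the sketch below. It is reconstructed from the help output only and is not taken from `extract.py`; the `build_parser` name and the choice of `int` for `--max_size` are assumptions.

```python
import argparse


def build_parser():
    """Build a command-line parser matching the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="extract.py",
        description="Web Crawler for Burmese wiki.",
        # Appends "(default: ...)" to each option's help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "-l", "--log_dir", default="logs",
        help="Specify logs directory for errors",
    )
    parser.add_argument(
        "--max_size", type=int, default=1000,  # assumption: whole megabytes
        help="Specify max size (in MB) to crawl wiki.",
    )
    parser.add_argument(
        "--output_dir", "-o", default="results",
        help="Output directory for storing corpus",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.log_dir, args.max_size, args.output_dir)
```

With the options shown in the help text, a run limited to 100 MB that writes to a `corpus` directory would be invoked as `python extract.py --max_size 100 -o corpus`.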
## TODOS
- Remove max_size limit.
- Better filtering of Burmese characters.
- Optimize corpus storing.
## License
[MIT license](/License.md)