https://github.com/zawlinnnaing/my-wiki-crawler
# Myanmar Wiki Crawler

![](https://img.shields.io/badge/python-3.6-blue.svg)
![](https://img.shields.io/badge/License-MIT-green)

Simple program for crawling [Burmese Wikipedia](https://my.wikipedia.org) using the MediaWiki API, querying pages starting from "က" and **sequentially** crawling until the specified size is reached or there are no more pages to crawl.

[TOC]

## Getting started

Install the requirements and you are good to go.

```shell
pip install -r requirements.txt
```

## Step-by-step Procedure

- The program first queries the MediaWiki API for page titles in batches (500 pages per batch, the maximum allowed by MediaWiki).

- It then uses these titles to request each page's HTML and collect the text from its content field.

- It then stores the text in a file, using sentence-level segmentation and a regex that keeps only Burmese characters (Unicode U+1000 to U+1100).

- It stores one text file per batch, named by its batch index starting from 0.

- The program automatically resumes from the last batch saved before stopping, using `meta.json`.
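
The batched title query and the Burmese-only filter described above can be sketched as follows. The `allpages` parameters (`apfrom`, `aplimit`, `apcontinue`) come from the standard MediaWiki Action API; the helper names `fetch_title_batch` and `keep_burmese` are illustrative, not the project's actual functions.

```python
import json
import re
import urllib.parse
import urllib.request

API_URL = "https://my.wikipedia.org/w/api.php"  # Burmese Wikipedia endpoint
BURMESE_RE = re.compile(r"[\u1000-\u1100]+")    # character range used by the crawler


def fetch_title_batch(apfrom="\u1000", limit=500):
    """Fetch one batch of page titles via the MediaWiki `allpages` list."""
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "allpages",
        "apfrom": apfrom,   # start key; "\u1000" is the character "က"
        "aplimit": limit,   # 500 is the maximum for anonymous clients
        "format": "json",
    })
    with urllib.request.urlopen(f"{API_URL}?{params}", timeout=30) as resp:
        data = json.load(resp)
    titles = [page["title"] for page in data["query"]["allpages"]]
    # "continue.apcontinue" carries the start key for the next batch, if any
    next_from = data.get("continue", {}).get("apcontinue")
    return titles, next_from


def keep_burmese(text):
    """Keep only runs of characters in U+1000..U+1100, joined by spaces."""
    return " ".join(BURMESE_RE.findall(text))
```

Looping on `next_from` until it is `None` (or until the size budget is hit) reproduces the sequential crawl described above.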

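The resume behavior could look like the sketch below. The `meta.json` layout shown here (a `batch_index` and a `next_from` start key) is an assumption for illustration, not the project's actual schema.

```python
import json
from pathlib import Path


def load_progress(meta_path="meta.json"):
    """Return saved progress, or defaults for a fresh run (assumed schema)."""
    path = Path(meta_path)
    if not path.exists():
        return {"batch_index": 0, "next_from": "\u1000"}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def save_progress(batch_index, next_from, meta_path="meta.json"):
    """Persist progress after each batch so a later run can resume from it."""
    Path(meta_path).write_text(
        json.dumps({"batch_index": batch_index, "next_from": next_from}),
        encoding="utf-8",
    )
```
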
## Usage

```
python extract.py -h

usage: extract.py [-h] [-l LOG_DIR] [--max_size MAX_SIZE]
                  [--output_dir OUTPUT_DIR]

Web Crawler for Burmese wiki.

optional arguments:
  -h, --help            show this help message and exit
  -l LOG_DIR, --log_dir LOG_DIR
                        Specify logs directory for errors (default: logs)
  --max_size MAX_SIZE   Specify max size (in MB) to crawl wiki. (default: 1000)
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        Output directory for storing corpus (default: results)
```

## TODOS

- Remove the `max_size` limit.

- Better filtering of Burmese characters.

- Optimize corpus storage.

## License

[MIT license](/License.md)