https://github.com/zawlinnnaing/my-wiki-crawler
# Myanmar Wiki Crawler

![](https://img.shields.io/badge/python-3.6-blue.svg)
![](https://img.shields.io/badge/License-MIT-green)

Simple program for crawling [Burmese Wikipedia](https://my.wikipedia.org) using the MediaWiki API, querying pages starting from "က" and **sequentially** crawling until the specified size is reached or there are no more pages to crawl.

[TOC]

## Getting started

Install the requirements and you are good to go.

```shell
pip install -r requirements.txt
```

## Step-by-step Procedure

- The program first queries the MediaWiki API for page titles in batches (500 pages per batch, the maximum allowed by MediaWiki).

- It then uses these titles to request each page's HTML and collect the text from its content field.

- It then stores the text in a file, using sentence-level segmentation and a regex that keeps only Burmese characters (Unicode U+1000 to U+1100).

- It stores one text file per batch, named by its batch index starting from 0.

- The program automatically resumes from the last batch saved before stopping, using `meta.json`.
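
The batched title query and the Burmese-only filter described above can be sketched as follows. The `allpages` parameters (`apfrom`, `aplimit`, `apcontinue`) come from the standard MediaWiki Action API; the helper names `fetch_title_batch` and `keep_burmese` are illustrative, not the project's actual functions.

```python
import json
import re
import urllib.parse
import urllib.request

API_URL = "https://my.wikipedia.org/w/api.php"  # Burmese Wikipedia endpoint
BURMESE_RE = re.compile(r"[\u1000-\u1100]+")    # character range used by the crawler


def fetch_title_batch(apfrom="\u1000", limit=500):
    """Fetch one batch of page titles via the MediaWiki `allpages` list."""
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "allpages",
        "apfrom": apfrom,   # start key; "\u1000" is the character "က"
        "aplimit": limit,   # 500 is the maximum for anonymous clients
        "format": "json",
    })
    with urllib.request.urlopen(f"{API_URL}?{params}", timeout=30) as resp:
        data = json.load(resp)
    titles = [page["title"] for page in data["query"]["allpages"]]
    # "continue.apcontinue" carries the start key for the next batch, if any
    next_from = data.get("continue", {}).get("apcontinue")
    return titles, next_from


def keep_burmese(text):
    """Keep only runs of characters in U+1000..U+1100, joined by spaces."""
    return " ".join(BURMESE_RE.findall(text))
```

Looping on `next_from` until it is `None` (or until the size budget is hit) reproduces the sequential crawl described above.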

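The resume behavior could look like the sketch below. The `meta.json` layout shown here (a `batch_index` and a `next_from` start key) is an assumption for illustration, not the project's actual schema.

```python
import json
from pathlib import Path


def load_progress(meta_path="meta.json"):
    """Return saved progress, or defaults for a fresh run (assumed schema)."""
    path = Path(meta_path)
    if not path.exists():
        return {"batch_index": 0, "next_from": "\u1000"}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def save_progress(batch_index, next_from, meta_path="meta.json"):
    """Persist progress after each batch so a later run can resume from it."""
    Path(meta_path).write_text(
        json.dumps({"batch_index": batch_index, "next_from": next_from}),
        encoding="utf-8",
    )
```
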
## Usage

```
python extract.py -h

usage: extract.py [-h] [-l LOG_DIR] [--max_size MAX_SIZE]
                  [--output_dir OUTPUT_DIR]

Web Crawler for Burmese wiki.

optional arguments:
  -h, --help            show this help message and exit
  -l LOG_DIR, --log_dir LOG_DIR
                        Specify logs directory for errors (default: logs)
  --max_size MAX_SIZE   Specify max size (in MB) to crawl wiki. (default: 1000)
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        Output directory for storing corpus (default: results)
```

## TODOS

- Remove the `max_size` limit.

- Better filtering of Burmese characters.

- Optimize corpus storage.

## License

[MIT license](/License.md)