https://github.com/raphaelmerx/panlex_scraper
Scrape bilingual vocabulary data from PanLex
Last synced: 28 days ago
- Host: GitHub
- URL: https://github.com/raphaelmerx/panlex_scraper
- Owner: raphaelmerx
- License: mit
- Created: 2022-02-05T06:53:58.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-08-19T00:38:16.000Z (4 months ago)
- Last Synced: 2024-08-19T01:47:26.266Z (4 months ago)
- Language: Python
- Size: 8.79 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PanLex scraper
Scrape bilingual vocabulary data from [vocab.panlex.org](https://vocab.panlex.org/)

## Usage
#### 1. Clone this repo: `git clone https://github.com/raphaelmerx/panlex_scraper`
#### 2. Install requirements: `pip install -r requirements.txt`
#### 3. Run the scraping command:
```
scrapy crawl panlex -O panlex.jl -L INFO -a lang1={lang1} -a lang2={lang2}
```

Replace `{lang1}` and `{lang2}` with the 3-letter [ISO 639-2](https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes) codes of the two languages you want parallel vocabulary for.
For example, to get Igbo (ibo) and English (eng) bitext vocabulary:
```
scrapy crawl panlex -O panlex.jl -L INFO -a lang1=ibo -a lang2=eng
```

The scraped data will be written to `panlex.jl`, with one JSON line per vocabulary pair, e.g.:
```
{"ibo": "\u00e0", "eng": "this"}
{"ibo": "a\u00e0", "eng": "oh"}
{"ibo": "a\u0101", "eng": "oh"}
{"ibo": "\u00e0a", "eng": ""}
{"ibo": "Aba", "eng": "Aba"}
{"ibo": "\u00e0ba", "eng": "flatness"}
{"ibo": "\u00e0b\u00e0ch\u00e0", "eng": "cassava"}
{"ibo": "\u00e0b\u00e0da", "eng": ""}
{"ibo": "abadaba", "eng": "breadth"}
{"ibo": "Abakaliki", "eng": "Abakaliki"}
{"ibo": "\u00e0bal\u00e0", "eng": "fruit of iroko"}
{"ibo": "\u00e0bal\u00e0 \u1ecdj\u1ecb\u0300", "eng": "fruit of iroko"}
{"ibo": "abali", "eng": "night"}
{"ibo": "abal\u00ef", "eng": "night"}
...
```
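
To use the output in Python, the JSON Lines file can be read one record at a time. The snippet below is a minimal sketch and is not part of this repository: the helper name `parse_pairs` and its `skip_empty` option are hypothetical. It assumes each line is a JSON object keyed by the two language codes, as in the sample above, and it drops pairs where either side is empty (such as `{"ibo": "\u00e0a", "eng": ""}`) by default.

```python
import json


def parse_pairs(lines, lang1, lang2, skip_empty=True):
    """Turn JSON Lines records into (lang1_text, lang2_text) tuples.

    `lines` is any iterable of strings, e.g. an open file handle.
    Hypothetical helper for illustration; not provided by the scraper.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)  # also decodes \uXXXX escapes
        source = record.get(lang1, "")
        target = record.get(lang2, "")
        if skip_empty and (not source or not target):
            continue  # skip entries with a missing translation
        pairs.append((source, target))
    return pairs


# Usage, assuming panlex.jl was produced by the crawl command above:
#
#     with open("panlex.jl", encoding="utf-8") as fh:
#         pairs = parse_pairs(fh, "ibo", "eng")
```

Pass `skip_empty=False` to keep entries that have no translation on one side.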