https://github.com/raymelon/tagalog-dictionary-scraper

Builds a Tagalog dictionary by collecting Tagalog words from tagalog.pinoydictionary.com

beautiful-soup database dictionary python scraper tagalog tagalog-dictionary web-scraper web-scraping

Host: GitHub
URL: https://github.com/raymelon/tagalog-dictionary-scraper
Owner: raymelon
License: gpl-3.0
Created: 2016-11-13T16:06:01.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2023-02-19T20:46:33.000Z (over 3 years ago)
Last Synced: 2023-10-20T22:46:47.957Z (over 2 years ago)
Topics: beautiful-soup, database, dictionary, python, scraper, tagalog, tagalog-dictionary, web-scraper, web-scraping
Language: Python
Homepage:
Size: 997 KB
Stars: 23
Watchers: 1
Forks: 14
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Tagalog Dictionary Scraper :ledger: [![Tweet](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/intent/tweet?text=Check%20out%20Tagalog%20Dictionary%20Scraper!%20Ating%20pag-ibayuhin%20ang%20ating%20talahuluganan.%20%40github%20https://github.com/raymelon/tagalog-dictionary-scraper)

> **_Ating pag-ibayuhin ang ating talahuluganan!_**

Collects [Tagalog](http://tagaloglang.com/) words from [tagalog.pinoydictionary.com](http://tagalog.pinoydictionary.com/), a database of [Tagalog](http://tagaloglang.com/) words powered by Cyberspace.ph Web Hosting. This script uses a common web scraping technique known as HTML parsing.

## 42,723 words (as of Feb 19, 2023)

**See the word list at `tagalog_dict.txt`**

![](https://reposs.herokuapp.com/?path=raymelon/tagalog-dictionary-scraper)

[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](http://www.gnu.org/licenses/gpl-3.0)

[![Build Status](https://travis-ci.org/raymelon/tagalog-dictionary-scraper.svg)](https://travis-ci.org/raymelon/tagalog-dictionary-scraper)

[![codecov](https://codecov.io/gh/raymelon/tagalog-dictionary-scraper/branch/master/graph/badge.svg)](https://codecov.io/gh/raymelon/tagalog-dictionary-scraper)

[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)]()

## API Resource

The scraped words are served through GitHub Pages and are accessible as REST resources.

**Host**

[https://raymelon.github.io/tagalog-dictionary-scraper/](https://raymelon.github.io/tagalog-dictionary-scraper/)

**Method**

GET

**Resources Available**

| Resource | Display      | Endpoint                                                                                                  |
| -------- | ------------ | --------------------------------------------------------------------------------------------------------- |
| `csv`    | `default`    | [/tagalog_dict.csv](https://raymelon.github.io/tagalog-dictionary-scraper/tagalog_dict.csv)               |
| `csv`    | `with lines` | [/tagalog_dict_lines.csv](https://raymelon.github.io/tagalog-dictionary-scraper/tagalog_dict_lines.csv)   |
| `json`   | `default`    | [/tagalog_dict.json](https://raymelon.github.io/tagalog-dictionary-scraper/tagalog_dict.json)             |
| `json`   | `with lines` | [/tagalog_dict_lines.json](https://raymelon.github.io/tagalog-dictionary-scraper/tagalog_dict_lines.json) |
| `txt`    | `default`    | [/tagalog_dict.txt](https://raymelon.github.io/tagalog-dictionary-scraper/tagalog_dict.txt)               |
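
As a rough illustration of consuming one of these resources, the sketch below downloads the `txt` endpoint with Python's standard library. It assumes the file is UTF-8 text with one word per line, which this README does not state explicitly. The names `parse_word_list` and `fetch_words` are hypothetical helpers, not part of the project.

```python
from urllib.request import urlopen

# Host taken from the table above.
BASE_URL = "https://raymelon.github.io/tagalog-dictionary-scraper/"


def parse_word_list(text):
    """Split raw text into a list of non-empty, stripped lines.

    Assumption: the word list holds one word per line.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_words(resource="tagalog_dict.txt", timeout=30):
    """GET a plain-text resource from the host and return its words.

    Assumption: the resource is encoded as UTF-8.
    """
    with urlopen(BASE_URL + resource, timeout=timeout) as response:
        return parse_word_list(response.read().decode("utf-8"))
```

Under those assumptions, `len(fetch_words())` would give the number of words currently published.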

## How is it done? :muscle:

Each webpage is loaded and parsed, and the words enclosed in the target `html` tag are extracted.

Included is a [`tagalog.pinoydictionary.com`](http://tagalog.pinoydictionary.com/) `html` [snippet](https://github.com/raymelon/tagalog-dictionary-scraper/blob/master/tagalog.pinoydictionary.com%20html%20snippet.html) containing the source of [`http://tagalog.pinoydictionary.com/list/a/`](http://tagalog.pinoydictionary.com/list/a/), which serves as a point of reference for how dictionary words are extracted from the page.

**Disclaimer:**

I do not own the `html` code cited above; it is owned by [tagalog.pinoydictionary.com](http://tagalog.pinoydictionary.com/).
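
To make the HTML parsing idea concrete, here is a minimal sketch that collects the text inside every occurrence of a given tag. It is not the project's code: it uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the sample markup and the `h2` tag are invented for illustration rather than taken from the site. `TagTextCollector` and `extract_words` are hypothetical names.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # greater than 0 while inside the target tag
        self.buffer = []    # text pieces of the element being read
        self.words = []     # finished entries, in document order

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                word = "".join(self.buffer).strip()
                if word:
                    self.words.append(word)
                self.buffer = []

    def handle_data(self, data):
        # Keep text only while inside the target tag (nested tags included).
        if self.depth > 0:
            self.buffer.append(data)


def extract_words(html, tag):
    """Return the text content of each `tag` element in `html`."""
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return parser.words


# Made-up markup for demonstration; NOT the real page structure.
SAMPLE_HTML = """
<div><h2 class="word"><a href="#">aba</a></h2><p>definition</p></div>
<div><h2 class="word"><a href="#">abaka</a></h2><p>definition</p></div>
"""

print(extract_words(SAMPLE_HTML, "h2"))
```

To adapt the sketch, the tag passed to `extract_words` would need to match the one that actually wraps each word in the included snippet.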

## How did the project start? :thought_balloon:

The main purpose of this project is to build a Tagalog dictionary database for [Scrabble ®](http://www.scrabble.com/), though the word list may serve other uses as well.

## Tools :pencil2:

- [Python3 v3.5+](https://www.python.org/) :snake:

- [beautifulsoup4 v4.5.1](https://www.crummy.com/software/BeautifulSoup/) :ramen: :package: for parsing html pages

```
python -m pip install -U pip beautifulsoup4
```

- [requests-futures v1.0.0](https://github.com/ross/requests-futures) :zap: for request concurrency

```
python -m pip install -U pip requests-futures
```

## Notes :pushpin:

- Run the scraper script [`collect_tagalog.py`](https://github.com/raymelon/tagalog-dictionary-scraper/blob/master/collect_tagalog.py)

- See the output of collected words at [`tagalog_dict.txt`](https://github.com/raymelon/tagalog-dictionary-scraper/blob/master/tagalog_dict.txt)

- Match the [`max_workers`](https://github.com/raymelon/tagalog-dictionary-scraper/blob/master/collect_tagalog.py#L57) value to the CPU and network capacity of the environment. See the [comment](https://github.com/raymelon/tagalog-dictionary-scraper/blob/master/collect_tagalog.py#L41-L56) for estimated values and expected download rates.
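
The effect of `max_workers` can be sketched with a bounded thread pool. This is an illustration only: it uses the standard-library `concurrent.futures.ThreadPoolExecutor` rather than requests-futures, and `fetch_all`, `fake_fetch`, and the `example.com` URLs are hypothetical stand-ins, not names from `collect_tagalog.py`.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Apply fetch(url) to each URL using at most max_workers threads.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real HTTP GET so the sketch runs offline.
    return "<html>page for " + url + "</html>"


# Placeholder URLs for demonstration only.
urls = ["https://example.com/list/" + letter + "/" for letter in "abc"]
pages = fetch_all(urls, fake_fetch, max_workers=2)
print(len(pages), "pages fetched")
```

A larger `max_workers` allows more requests in flight at once, which helps only while the network and CPU can keep up; beyond that point it mostly adds load on the remote server.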

## License [![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](http://www.gnu.org/licenses/gpl-3.0)

[GNU General Public License 3.0](https://www.gnu.org/licenses/gpl-3.0.en.html)
