https://github.com/raymelon/tagalog-dictionary-scraper
Builds a Tagalog dictionary by collecting Tagalog words from tagalog.pinoydictionary.com
beautiful-soup database dictionary python scraper tagalog tagalog-dictionary web-scraper web-scraping
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/raymelon/tagalog-dictionary-scraper
- Owner: raymelon
- License: gpl-3.0
- Created: 2016-11-13T16:06:01.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2023-02-19T20:46:33.000Z (over 3 years ago)
- Last Synced: 2023-10-20T22:46:47.957Z (over 2 years ago)
- Topics: beautiful-soup, database, dictionary, python, scraper, tagalog, tagalog-dictionary, web-scraper, web-scraping
- Language: Python
- Homepage:
- Size: 997 KB
- Stars: 23
- Watchers: 1
- Forks: 14
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Tagalog Dictionary Scraper :ledger:
> **_Ating pag-ibayuhin ang ating talahuluganan!_** (Let us further enrich our dictionary!)
Collects [Tagalog](http://tagaloglang.com/) words from [tagalog.pinoydictionary.com](http://tagalog.pinoydictionary.com/), a database of [Tagalog](http://tagaloglang.com/) words powered by Cyberspace.ph Web Hosting. This script uses a common web scraping technique known as HTML parsing.
## 42,723 words (as of Feb 19, 2023)
**See the word list at `tagalog_dict.txt`**

## API Resource
The scraped words are served through GitHub Pages as static files and can be retrieved as REST-style resources.
**Host**
[https://raymelon.github.io/tagalog-dictionary-scraper/](https://raymelon.github.io/tagalog-dictionary-scraper/)
**Method**
GET
**Resources Available**
| Resource | Display | Endpoint |
| -------- | ------------ | --------------------------------------------------------------------------------------------------------- |
| `csv` | `default` | [/tagalog_dict.csv](https://raymelon.github.io/tagalog-dictionary-scraper/tagalog_dict.csv) |
| `csv` | `with lines` | [/tagalog_dict_lines.csv](https://raymelon.github.io/tagalog-dictionary-scraper/tagalog_dict_lines.csv) |
| `json` | `default` | [/tagalog_dict.json](https://raymelon.github.io/tagalog-dictionary-scraper/tagalog_dict.json) |
| `json` | `with lines` | [/tagalog_dict_lines.json](https://raymelon.github.io/tagalog-dictionary-scraper/tagalog_dict_lines.json) |
| `txt` | `default` | [/tagalog_dict.txt](https://raymelon.github.io/tagalog-dictionary-scraper/tagalog_dict.txt) |
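
As a rough illustration of how these resources might be consumed, the sketch below builds the endpoint URLs from the table and downloads the plain-text list using only the standard library. The helper names (`endpoint_url`, `parse_word_list`, `fetch_words`) are hypothetical, and the parser assumes `tagalog_dict.txt` holds one word per line, which is not confirmed here.

```python
from urllib.request import urlopen

# Host taken from the section above; file names come from the resource table.
BASE_URL = "https://raymelon.github.io/tagalog-dictionary-scraper/"

# (resource, display) -> file name, mirroring the table row by row.
ENDPOINTS = {
    ("csv", "default"): "tagalog_dict.csv",
    ("csv", "with lines"): "tagalog_dict_lines.csv",
    ("json", "default"): "tagalog_dict.json",
    ("json", "with lines"): "tagalog_dict_lines.json",
    ("txt", "default"): "tagalog_dict.txt",
}


def endpoint_url(resource, display="default"):
    """Build the full URL for one row of the resource table."""
    return BASE_URL + ENDPOINTS[(resource, display)]


def parse_word_list(text):
    """Split plain text into words (assumption: one word per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_words(timeout=30):
    """GET the txt resource and return its words (requires network access)."""
    with urlopen(endpoint_url("txt"), timeout=timeout) as response:
        return parse_word_list(response.read().decode("utf-8"))
```

Calling `fetch_words()` performs the GET request. The CSV and JSON variants would need their own parsing once their layout has been inspected.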
## How is it done? :muscle:
Each webpage is loaded and parsed, and the words enclosed in the relevant HTML tag are extracted.
An HTML [snippet](https://github.com/raymelon/tagalog-dictionary-scraper/blob/master/tagalog.pinoydictionary.com%20html%20snippet.html) from [`tagalog.pinoydictionary.com`](http://tagalog.pinoydictionary.com/) is included. It contains the source of [`http://tagalog.pinoydictionary.com/list/a/`](http://tagalog.pinoydictionary.com/list/a/) and serves as a point of reference for how dictionary words are extracted from a page.
**Disclaimer:**
I do not own the HTML code cited above; it is owned by [tagalog.pinoydictionary.com](http://tagalog.pinoydictionary.com/).
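
The parsing step described in this section can be sketched as follows. To stay dependency-free, the example uses the standard library's `html.parser` rather than Beautiful Soup, so it is not the project's actual code. The `h2` tag, the `word` class, and the sample markup are all hypothetical stand-ins; the real page structure should be taken from the included snippet.

```python
from html.parser import HTMLParser


class WordExtractor(HTMLParser):
    """Collect the text of every element with a given tag name and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.words = []
        self._depth = 0    # greater than 0 while inside a matching element
        self._buffer = []  # text fragments of the element being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Count nested tags of the same name so the right end tag closes us.
            if tag == self.tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1
            if self._depth == 0:
                word = "".join(self._buffer).strip()
                if word:
                    self.words.append(word)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_words(html, tag="h2", css_class="word"):
    """Return the words found in `html` (tag and class are hypothetical)."""
    parser = WordExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.words


# Hypothetical markup for illustration only, not the site's real HTML.
SAMPLE_HTML = """
<div class="entry"><h2 class="word"><a href="#">aba</a></h2><p>definition</p></div>
<div class="entry"><h2 class="word"><a href="#">abaka</a></h2><p>definition</p></div>
"""

print(extract_words(SAMPLE_HTML))
```

With Beautiful Soup the same idea is usually shorter, but the selector would still have to match the real markup.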
## How did the project start? :thought_balloon:
The main purpose of this project is to build a Tagalog dictionary database for [Scrabble ®](http://www.scrabble.com/), though the word list can serve other uses as well.
## Tools :pencil2:
- [Python3 v3.5+](https://www.python.org/) :snake:
- [beautifulsoup4 v4.5.1](https://www.crummy.com/software/BeautifulSoup/) :ramen: :package: for parsing html pages
```
python -m pip install -U pip beautifulsoup4
```
- [requests-futures v1.0.0](https://github.com/ross/requests-futures) :zap: for request concurrency
```
python -m pip install -U pip requests-futures
```
## Notes :pushpin:
- Run the scraper script [`collect_tagalog.py`](https://github.com/raymelon/tagalog-dictionary-scraper/blob/master/collect_tagalog.py)
- See the output of collected words at [`tagalog_dict.txt`](https://github.com/raymelon/tagalog-dictionary-scraper/blob/master/tagalog_dict.txt)
- Match the [`max_workers`](https://github.com/raymelon/tagalog-dictionary-scraper/blob/master/collect_tagalog.py#L57) value to the CPU and network capacity of the environment. See the [comment](https://github.com/raymelon/tagalog-dictionary-scraper/blob/master/collect_tagalog.py#L41-L56) for estimated values and expected download rates.
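
To illustrate what `max_workers` controls, here is a minimal sketch of bounded concurrent downloads. It swaps in the standard library's `concurrent.futures.ThreadPoolExecutor` in place of `requests-futures`, so it is not the code in `collect_tagalog.py`. The function names and the default of 4 workers are placeholders chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_page(url, timeout=30):
    """Download one page and return its HTML as text (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=fetch_page, max_workers=4):
    """Fetch every URL with at most `max_workers` requests in flight.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

A higher `max_workers` allows more simultaneous requests, which helps only while the network and the remote server can keep up. Passing a different `fetch` callable makes the function easy to try without network access.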
## License
[GNU General Public License 3.0](https://www.gnu.org/licenses/gpl-3.0.en.html)