https://github.com/bact/constitution
Convert Thai constitution from PDF to plaintext and correct encoding glitches
law open-data pdf-conversion thai-language
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/bact/constitution
- Owner: bact
- License: apache-2.0
- Archived: true
- Created: 2018-02-03T23:04:55.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-02-03T23:31:40.000Z (over 8 years ago)
- Last Synced: 2025-05-06T21:55:29.355Z (about 1 year ago)
- Topics: law, open-data, pdf-conversion, thai-language
- Language: HTML
- Size: 1.73 MB
- Stars: 9
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# constitution
Convert the Thai constitution (early 2016 draft) from PDF to HTML
For an explanation, see the slides https://www.slideshare.net/arthit/pdf-plain-text and the note https://www.facebook.com/notes/10154493302702646
Convert the Thai constitution draft (early 2016) from PDF to plaintext and correct encoding glitches. It is crafted to work with a specific set of PDFs, mainly the one at https://www.parliament.go.th/ewtcommittee/ewt/draftconstitution2/download/article/article_20160129132217.pdf and the versions that followed. It is not guaranteed to work with other PDFs. As with web scraping, the script has to be tailored to a particular source.
- Use [Apache PDFBox](https://pdfbox.apache.org/) to convert the PDF to HTML
  - `java -jar pdfbox-app.jar ExtractText -html file.pdf file.html`
  - Converting directly to plaintext does not work: some Thai characters in the PDF use codepoints in the Private Use Area (PUA), and all PUA characters are discarded when converting to plaintext
- Convert Thai characters that are encoded as HTML entities to UTF-8. The same pass also maps the PUA codepoints to valid Thai codepoints.
```python
# Decimal PUA codepoint (as written in the HTML entity) -> standard Thai character
pua = {
    '63233': 'ิ', # 0xf701 Sara I
    '63234': 'ี', # 0xf702 Sara Ii
    '63235': 'ึ', # 0xf703 Sara Ue
    '63236': 'ื', # 0xf704 Sara Uee
    '63237': '่', # 0xf705 Mai Ek
    '63238': '้', # 0xf706 Mai Tho (on Po Pla)
    '63242': '่', # 0xf70a Mai Ek
    '63243': '้', # 0xf70b Mai Tho
    '63246': '์', # 0xf70e Thantakat
    '63248': 'ั', # 0xf710 Mai Han Akhat (on Po Pla)
    '63250': '็', # 0xf712 Mai Tai Khu (on Po Pla)
    '63251': '่', # 0xf713 Mai Ek
    '63252': '้', # 0xf714 Mai Tho
}
```
- Correct misordered Thai characters, e.g. tone mark + vowel --> vowel + tone mark
- Basic reformatting
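
The entity conversion and reordering steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the function names (`decode_entities`, `fix_order`, `to_text`) are hypothetical, the PUA table is only a subset of the one listed above, and the code assumes that the HTML uses decimal numeric entities and that the misordered pairs are always a tone mark directly followed by an above or below vowel. It does not strip HTML tags or handle named entities such as `&amp;`.

```python
import re

# Subset of the PUA table above: decimal codepoint (as a string) -> Thai character.
PUA = {
    "63233": "\u0e34",  # 0xf701 -> Sara I
    "63242": "\u0e48",  # 0xf70a -> Mai Ek
    "63243": "\u0e49",  # 0xf70b -> Mai Tho
    "63248": "\u0e31",  # 0xf710 -> Mai Han Akhat
}

# Assumption: the HTML uses decimal numeric entities such as &#3585;.
ENTITY = re.compile(r"&#(\d+);")

# Assumption: tone marks are U+0E48..U+0E4B; vowels written above or below
# the consonant are U+0E31 and U+0E34..U+0E39.
TONE_BEFORE_VOWEL = re.compile("([\u0e48-\u0e4b])([\u0e31\u0e34-\u0e39])")


def decode_entities(html):
    """Replace numeric entities with characters, remapping PUA codepoints."""
    def replace(match):
        code = match.group(1)
        if code in PUA:
            return PUA[code]
        codepoint = int(code)
        # Leave out-of-range references untouched rather than raise.
        return chr(codepoint) if codepoint <= 0x10FFFF else match.group(0)
    return ENTITY.sub(replace, html)


def fix_order(text):
    """Swap tone mark + vowel into vowel + tone mark."""
    return TONE_BEFORE_VOWEL.sub(r"\2\1", text)


def to_text(html):
    """Decode entities first, then repair the character order."""
    return fix_order(decode_entities(html))
```

For example, `to_text("&#3607;&#63242;&#3637;")` decodes to U+0E17, U+0E48, U+0E35 and then swaps the last two, yielding the correctly ordered sequence U+0E17, U+0E35, U+0E48. Decoding before reordering matters here, because the PUA tone marks only become recognisable as tone marks once they have been remapped.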
More explanation (in Thai): [slides](https://www.slideshare.net/arthit/pdf-plain-text), [notes](https://www.facebook.com/notes/10154493302702646)
Ideally, there should be no need for a script like this. All laws should be available in a search-friendly, machine-readable format.