https://github.com/sbbic/khmerlbdict

Khmer wordlist for line and word breaking


Host: GitHub
URL: https://github.com/sbbic/khmerlbdict
Owner: sbbic
License: MIT
Fork: true (silnrsi/khmerlbdict)
Created: 2016-02-09T09:35:59.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2016-02-09T09:35:41.000Z (about 9 years ago)
Last Synced: 2024-11-06T09:39:18.808Z (6 months ago)
Language: Makefile
Size: 397 KB
Stars: 3
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-khmer-language - Khmer LineBreaking Dictionary

README

# Khmer LineBreaking Dictionary

The aim of this project is to produce a frequency-based wordlist for line and word breaking in the
Khmer language. This will then be used in ICU (if they accept it).

Sources are:

* seafreq.txt. Taken from the SEALang Khmer frequency-based wordlist [http://sealang.net/project/list/]
* villages.txt. A list of all village and region names
* places.txt. Language, script, territory and exemplar city names taken from CLDR.
* names.txt. Various first and last names.
* KHOV.txt. Word list of the Khmer Bible Old Version.
* KHSV.txt. Word list of the Khmer Bible Standard Version.

The files are edited to remove bad data, for example villages called 'number1' or zero-width spaces. Terms like 'upper', 'lower' and 'eastern' are also removed from village and place names, as long as the remaining part of the name is at least 3 clusters long.
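
The cleanup rules above could be sketched roughly as follows in Python. This is an illustration only, not the project's actual tooling: the function names, the `terms` parameter, and the regular expression used to approximate a Khmer cluster (a base letter followed by any subscript consonants, dependent vowels or signs) are all assumptions.

```python
import re

ZWSP = "\u200b"  # zero-width space

# Assumption: a cluster is approximated as one base letter (U+1780-U+17B3)
# followed by any mix of coeng+letter pairs and dependent vowels/signs.
# Real Khmer cluster segmentation has more edge cases than this.
CLUSTER = re.compile(
    "[\u1780-\u17B3](?:\u17D2[\u1780-\u17B3]|[\u17B4-\u17D1\u17D3\u17DD])*"
)


def count_clusters(text):
    """Return the approximate number of Khmer clusters in text."""
    return len(CLUSTER.findall(text))


def clean_name(name, terms, min_clusters=3):
    """Drop zero-width spaces, then strip each term (hypothetical list of
    words such as 'upper' or 'lower') if enough of the name remains."""
    name = name.replace(ZWSP, "").strip()
    for term in terms:
        if term in name:
            # Naive substring removal; it does not check word boundaries.
            remainder = name.replace(term, "").strip()
            if count_clusters(remainder) >= min_clusters:
                name = remainder
    return name
```

For example, `clean_name(name, [upper])` returns the name without the `upper` term only when what is left has at least three clusters; a shorter name comes back unchanged apart from the zero-width-space removal.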

A program then calculates the log frequencies needed for CLDR and adds equivalences for bad spellings.
This means that badly spelled data that is hard to spot will still break correctly, and it will be up to a
spelling checker to sort that mess out.
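
As a rough illustration of this step, the Python sketch below turns raw word counts into log frequencies and then copies the value of a correct spelling onto its misspelled variants. The README does not state the exact formula or data layout, so the natural log of the relative frequency, the function names and the `variants` mapping are all assumptions.

```python
import math


def log_frequencies(counts):
    """Map each word to log(count / total).

    Assumed formula for illustration; the project's program may scale
    or round the values differently.
    """
    total = sum(counts.values())
    return {
        word: math.log(count / total)
        for word, count in counts.items()
        if count > 0
    }


def add_equivalences(freqs, variants):
    """Give each misspelled variant the value of its correct spelling.

    variants is a hypothetical mapping {misspelling: correct_spelling}.
    Existing entries are never overwritten.
    """
    result = dict(freqs)
    for variant, correct in variants.items():
        if correct in freqs and variant not in result:
            result[variant] = freqs[correct]
    return result
```

With this layout, a misspelled word ends up in the wordlist with the same weight as the correct form, so a breaker that looks words up in the list treats both the same way.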
