https://github.com/minthanthtoo/myanmar-collation-stats
Myanmar lexicon analyzer - Sorting and Segmentation
- Host: GitHub
- URL: https://github.com/minthanthtoo/myanmar-collation-stats
- Owner: minthanthtoo
- Created: 2015-08-15T10:18:09.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2021-08-11T05:36:41.000Z (over 3 years ago)
- Last Synced: 2024-07-31T20:30:20.479Z (9 months ago)
- Language: Java
- Homepage:
- Size: 367 KB
- Stars: 10
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- Awesome-Myanmar - Myanmar Collation Stats - Sorting and Segmentation | (Myanmar NLP)
README
Myanmar Collation Stats
=======================

Myanmar lexicon analyzer

Functions:

* Statistical analysis of Myanmar words in UTF-8 encoded plain text
* Phonological segmentation of Myanmar text into syllables
* Sorting of Myanmar words

Sample Code
===========

Analysis
The following code counts the words, syllables, letters, syllable-heads, syllable-tails, and types of letter-order used in the source text:

    // Analyze the source file and write the statistics next to it.
    File srcFile = new File("/path/to/file");
    Utils.toFile(Utils.toLexicon(srcFile), srcFile.getAbsolutePath() + ".analyzed.txt", Utils.LEX_TO_FILE_FLAG_WRITE_STATS);
Segmentation
Segmentation of Myanmar text into syllables can be done by the following code:

    File srcFile = new File("/path/to/file");
    Utils.toFile(Utils.toLexicon(srcFile), srcFile.getAbsolutePath() + ".segmented.txt", Utils.LEX_TO_FILE_FLAG_SEGMENTATION);
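
To make the idea of syllable segmentation concrete, here is a minimal Java sketch of one simplified rule: a new syllable starts at every consonant or independent vowel, unless that letter is immediately followed by asat (U+103A) or virama (U+1039), in which case it closes the preceding syllable. The class `SyllableSketch` and its methods are hypothetical names used only for illustration and are not part of this module. The rule itself is an assumption that agrees with the syllables shown elsewhere in this README; the module's own algorithm may well differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the module's implementation.
final class SyllableSketch {
    static final int VIRAMA = 0x1039; // stacking sign
    static final int ASAT = 0x103A;   // vowel killer

    // Assumption: consonants, independent vowels and GREAT SA can open a syllable.
    static boolean isBase(int cp) {
        return (cp >= 0x1000 && cp <= 0x102A) || cp == 0x103F;
    }

    static List<String> segment(String word) {
        int[] cps = word.codePoints().toArray();
        List<String> syllables = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < cps.length; i++) {
            int next = (i + 1 < cps.length) ? cps[i + 1] : -1;
            // A base letter followed by asat or virama belongs to the previous syllable.
            boolean closesPrevious = (next == ASAT || next == VIRAMA);
            if (isBase(cps[i]) && !closesPrevious && current.length() > 0) {
                syllables.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cps[i]);
        }
        if (current.length() > 0) {
            syllables.add(current.toString());
        }
        return syllables;
    }

    public static void main(String[] args) {
        // The word ယောက္ခမ written with escapes so the source stays ASCII.
        // Under this rule it splits into three syllables: ယောက္ + ခ + မ
        System.out.println(segment("\u101A\u1031\u102C\u1000\u1039\u1001\u1019"));
    }
}
```

The sketch ignores digits, punctuation and non-Myanmar text, which simply stay attached to the current syllable.
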
Myanmar Sorting
Sorting of Myanmar words according to dictionary rules can be done by the following code:

    // Sort the words of the source file and write them next to it.
    File srcFile = new File("/path/to/file");
    Utils.toFile(Utils.toLexicon(srcFile), srcFile.getAbsolutePath() + ".sorted.txt", Utils.LEX_TO_FILE_FLAG_SORT);
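Or, for more control, by the following customizable code:

    Lexicon lex = Utils.toLexicon(srcFile);
    lex.analyze();
    // Copy the analyzed words into a list, then sort them with the module's word comparator.
    List words = new ArrayList(lex.stats.words.values());
    Collections.sort(words, new LexComparator.WordComparator());

Theory
======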
Myanmar letters can be classified into:

* Consonants (C)
* Dependent vowels (v) and independent vowels (V)
* Medials (M)
* Finals (F)
* Symbols (each has a special sort order)

Further reading can be found here:
["Representing Myanmar in Unicode - Unicode Consortium"](http://unicode.org/notes/tn11)

Input
=====
Currently this module can read only simple word lists.
The word list must express each word in a single line.
You can write comments by starting a line with a _#_ character.
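
As an illustration of this format, the following Java sketch reads a word list with one word per line and skips lines that start with `#`. The class `WordListSketch` and its methods are hypothetical and not part of this module. Trimming whitespace and skipping blank lines are assumptions made for the sketch; the module's own reader may behave differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the module's reader.
final class WordListSketch {
    // Keep one word per line, dropping comment lines and (by assumption) blank lines.
    static List<String> parse(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            words.add(trimmed);
        }
        return words;
    }

    // Read a UTF-8 encoded word list from disk.
    static List<String> read(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```
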
The input file must have the `.list` extension and should be placed under `/data/wordlists/`.

Output
======

Analysis
The analysis result of each source file is written to a separate file whose name is the source filename extended with ".analyzed.txt".
The analysis result shows:

* Words and word count
* Syllables and syllable count
* Syllable-heads and syllable-head count
* Syllable-tails and syllable-tail count
* Letter-orders and their count
* Total analysis time in milliseconds
* Hex string of each letter (for debugging)

The following is a sample output of analysis.

    Words:
    ကကတစ် :1
    ကကုသန် :1
    ကကူရံ :1
    ...
    count: 2388

    letters:
    1000 က C :7043
    1001 ခ C :2171
    1002 ဂ C :292
    ...
    count: 65

    Syllables:
    ကက္ :6
    ကက် :16
    ကင်း :46
    ကင်္ :1
    ...
    count: 65

    Syllables Heads:
    က :4422
    ခ :2168
    ဂ :265
    ...
    count: 48

    Syllables Tails:
    က္ :114
    က် :684
    ဂ္ :7
    ...
    count: 494

    Analysis time: 663ms

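
Each entry under `letters:` in the sample pairs a hexadecimal code point with the letter, a class label and a count. The Java sketch below shows one way such lines could be produced. The class `LetterStatsSketch` is hypothetical and not part of this module, and the class labels are assigned from Unicode code point ranges as an assumption: in particular, grouping U+1036 to U+103A under F, treating U+1021 as an independent vowel, and labelling only U+104A to U+104F as symbols are choices made for the sketch, not the module's actual table.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the module's implementation.
final class LetterStatsSketch {
    // Rough class label by code point range (an assumption, see the text above).
    static String classify(int cp) {
        if ((cp >= 0x1000 && cp <= 0x1020) || cp == 0x103F) return "C"; // consonants
        if (cp >= 0x1021 && cp <= 0x102A) return "V";                   // independent vowels
        if (cp >= 0x102B && cp <= 0x1035) return "v";                   // dependent vowel signs
        if (cp >= 0x103B && cp <= 0x103E) return "M";                   // medials
        if (cp >= 0x1036 && cp <= 0x103A) return "F";                   // final-related signs
        if (cp >= 0x104A && cp <= 0x104F) return "S";                   // punctuation and symbols
        return "?";                                                     // digits and anything else
    }

    // Count every code point in the text, ordered by code point value.
    static Map<Integer, Integer> count(String text) {
        Map<Integer, Integer> counts = new TreeMap<>();
        text.codePoints().forEach(cp -> counts.merge(cp, 1, Integer::sum));
        return counts;
    }

    // Format one line as "<hex> <letter> <class> :<count>".
    static String format(int cp, int n) {
        String letter = new String(Character.toChars(cp));
        return String.format(Locale.ROOT, "%04x %s %s :%d", cp, letter, classify(cp), n);
    }

    public static void main(String[] args) {
        // Counts the letters of the word ကကတစ် (written with escapes).
        count("\u1000\u1000\u1010\u1005\u103A")
            .forEach((cp, n) -> System.out.println(format(cp, n)));
    }
}
```
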
Note:

* _Syllable_ = a combination of Myanmar letters; one or more syllables join to form a Myanmar word
* _Syllable-head_ = the main or first (in standard storage order) consonant or independent vowel in a syllable
* _Syllable-tail_ = the remaining part of a syllable after the syllable-head
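
Read literally, these definitions mean that a syllable can be split at its first stored code point. The short Java sketch below does exactly that. The class `HeadTailSketch` is a hypothetical name, and the sketch assumes that the syllable-head is always the first code point in storage order, which may not hold for every case the module handles.

```java
// Hypothetical illustration only; not the module's implementation.
final class HeadTailSketch {
    // Returns {head, tail}: the first code point and everything after it.
    static String[] split(String syllable) {
        if (syllable.isEmpty()) {
            return new String[] {"", ""};
        }
        int headEnd = syllable.offsetByCodePoints(0, 1);
        return new String[] {syllable.substring(0, headEnd), syllable.substring(headEnd)};
    }

    public static void main(String[] args) {
        // The syllable ကက် splits into the head က and the tail က်.
        String[] parts = split("\u1000\u1000\u103A");
        System.out.println(parts[0] + " + " + parts[1]);
    }
}
```
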
Rules
=====
We consider the following words to have different types of syllable: ယောက္ခမ, ယောက်ျား, ယောက်ဖ.
So we count them as different syllables:

    ယောက္ :5
    ယောက်ျား :20
    ယောက် :45

Purposes
========
This module can be used in NLP (Natural Language Processing) research in the following ways:

* Myanmar word segmentation (this module can identify syllables)
* Myanmar sorting (fully functional with all Burmese words; still being tested)
* Analysis of Myanmar letter frequency in a lexicon (support for a real lexicon, not just a word list, is still in progress)

Contact
=======
Don't hesitate to contact us.
We appreciate your feedback.
email: [email protected]