https://github.com/minthanthtoo/myanmar-collation-stats

Myanmar lexicon analyzer - Sorting and Segmentation

Host: GitHub
URL: https://github.com/minthanthtoo/myanmar-collation-stats
Owner: minthanthtoo
Created: 2015-08-15T10:18:09.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2021-08-11T05:36:41.000Z (over 3 years ago)
Last Synced: 2024-07-31T20:30:20.479Z (9 months ago)
Language: Java
Homepage:
Size: 367 KB
Stars: 10
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

Awesome-Myanmar - Myanmar Collation Stats - Sorting and Segmentation | (Myanmar NLP)

README

Myanmar Collation Stats
=======================

Myanmar lexicon analyzer

Functions:

* Statistical analysis of Myanmar words in UTF-8 encoded plain text
* Phonological segmentation of Myanmar text into syllables
* Sorting of Myanmar words

Sample Code
===========

Analysis
--------

The words, syllables, letters, syllable-heads, syllable-tails, and letter-order types used in the source text can be counted with the following code:

	// Requires java.io.File. The statistics are written next to the source file.
	File srcFile = new File("/path/to/file");
	Utils.toFile(Utils.toLexicon(srcFile), srcFile.getAbsolutePath() + ".analyzed.txt", Utils.LEX_TO_FILE_FLAG_WRITE_STATS);

	

Segmentation
------------

Myanmar text can be segmented into syllables with the following code:

	File srcFile = new File("/path/to/file");
	Utils.toFile(Utils.toLexicon(srcFile), srcFile.getAbsolutePath() + ".segmented.txt", Utils.LEX_TO_FILE_FLAG_SEGMENTATION);
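
To give a feel for what syllable segmentation involves, here is a minimal, self-contained sketch. It is not the project's algorithm. It assumes a simplified rule: a new syllable starts at every consonant or independent vowel unless that letter is immediately followed by asat (U+103A) or virama (U+1039), in which case it closes the previous syllable. The class and method names (`SyllableSketch`, `segment`, `canStartSyllable`) are hypothetical, and letters such as great sa (U+103F), digits and punctuation are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the project's API.
class SyllableSketch {
    private static final char VIRAMA = '\u1039'; // stacking sign
    private static final char ASAT = '\u103A';   // vowel killer

    // Assumption: consonants (U+1000..U+1021) and independent vowels
    // (U+1023..U+102A) are the only letters that can start a syllable.
    static boolean canStartSyllable(char c) {
        return (c >= '\u1000' && c <= '\u1021') || (c >= '\u1023' && c <= '\u102A');
    }

    // Splits a word before every possible syllable start that is not
    // immediately followed by asat or virama.
    static List<String> segment(String word) {
        List<String> syllables = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < word.length(); i++) {
            char next = (i + 1 < word.length()) ? word.charAt(i + 1) : '\0';
            boolean closesPrevious = next == ASAT || next == VIRAMA;
            if (canStartSyllable(word.charAt(i)) && !closesPrevious) {
                syllables.add(word.substring(start, i));
                start = i;
            }
        }
        if (start < word.length()) {
            syllables.add(word.substring(start));
        }
        return syllables;
    }

    public static void main(String[] args) {
        // The word is written with Unicode escapes to keep the source ASCII-only.
        System.out.println(segment("\u101A\u1031\u102C\u1000\u103A\u1016"));
    }
}
```

Under this rule the word ယောက်ဖ is split into ယောက် and ဖ, because the letter က is followed by asat and therefore stays with the first syllable.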

	

Myanmar Sorting
---------------

Myanmar words can be sorted according to dictionary rules with the following code:

	File srcFile = new File("/path/to/file");
	Utils.toFile(Utils.toLexicon(srcFile), srcFile.getAbsolutePath() + ".sorted.txt", Utils.LEX_TO_FILE_FLAG_SORT);

Or with the following customizable code:

	// Requires java.util.ArrayList, java.util.Collections and java.util.List.
	Lexicon lex = Utils.toLexicon(srcFile);
	lex.analyze();
	// Copy the analyzed words into a list and sort them in dictionary order.
	List words = new ArrayList(lex.stats.words.values());
	Collections.sort(words, new LexComparator.WordComparator());

Theory
======

Myanmar letters can be classified into:

* Consonants (C)
* Dependent vowels (v) and independent vowels (V)
* Medials (M)
* Finals (F)
* Symbols (each has a special sort order)

Further reading can be found here:

["Representing Myanmar in Unicode - Unicode Consortium"](http://unicode.org/notes/tn11)
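
The letter classes above can be approximated by Unicode code-point ranges. The sketch below is an assumption-laden illustration, not the project's classification table: the class name `LetterClassSketch` and method `classify` are hypothetical, U+1021 (letter A) is grouped with the consonants as is usual for Burmese, and asat (U+103A) and virama (U+1039) are reported as "F" only because they mark a final, which is this sketch's simplification. Every other code point is reported as "?".

```java
// Hypothetical helper for illustration only; not part of the project's API.
class LetterClassSketch {
    // Returns a one-letter class label for a code point in the Myanmar block.
    static String classify(int codePoint) {
        if (codePoint >= 0x1000 && codePoint <= 0x1021) return "C"; // consonants, including letter A
        if (codePoint >= 0x1023 && codePoint <= 0x102A) return "V"; // independent vowels
        if (codePoint >= 0x102B && codePoint <= 0x1032) return "v"; // dependent vowel signs
        if (codePoint >= 0x103B && codePoint <= 0x103E) return "M"; // medials ya, ra, wa, ha
        if (codePoint == 0x1039 || codePoint == 0x103A) return "F"; // assumption: virama and asat mark a final
        return "?"; // signs, digits, punctuation and anything outside the ranges above
    }

    public static void main(String[] args) {
        // Print the class of each letter in a sample word (escapes keep the source ASCII-only).
        String word = "\u1000\u103B\u1031\u102C\u1004\u103A\u1038";
        word.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp) + " " + classify(cp)));
    }
}
```

A real implementation would need a finer table, since the symbols each have their own sort order.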

Input
=====

Currently this module can read only simple word lists.

* The word list must contain one word per line.
* Comment lines start with a `#` character.
* The input file must have the `.list` extension and should be placed under `/data/wordlists/`.
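
As an illustration of the input format, the following sketch filters the lines of a word list into words. It is not the project's reader: the names `WordListSketch` and `parseWords` are hypothetical, and trimming whitespace and skipping blank lines are assumptions that the format description does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the project's API.
class WordListSketch {
    // Keeps one word per line and drops comment lines that start with '#'.
    static List<String> parseWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim(); // assumption: surrounding whitespace is not significant
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // assumption: blank lines are ignored, like comments
            }
            words.add(trimmed);
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "# sample word list",
                "\u1000\u1000\u1010\u1005\u103A",
                "",
                "\u1000\u1000\u102F\u101E\u1014\u103A");
        System.out.println(parseWords(lines).size()); // prints 2
    }
}
```

The lines of a real file could be loaded with `Files.readAllLines(path, StandardCharsets.UTF_8)` from `java.nio.file` before being passed to `parseWords`.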


Output
======

Analysis
--------

The analysis result for each source file is written to a separate file whose name is the source filename with `.analyzed.txt` appended.

The analysis result shows:

* Words and word count
* Syllables and syllable count
* Syllable-heads and syllable-head count
* Syllable-tails and syllable-tail count
* Letter-orders and their count
* Total analysis time in milliseconds
* Hex string of each letter (for debugging)

The following is a sample analysis output:

	Words:
	ကကတစ်	:1
	ကကုသန်	:1
	ကကူရံ	:1
	...
	count:	2388
	letters:
	1000	က	C	:7043
	1001	ခ	C	:2171
	1002	ဂ	C	:292
	...
	count:	65
	Syllables:
	ကက္	:6
	ကက်	:16
	ကင်း	:46
	ကင်္	:1
	...
	count:	65
	Syllables Heads:
	က	:4422
	ခ	:2168
	ဂ	:265
	...
	count:	48
	Syllables Tails:
	က္	:114
	က်	:684
	ဂ္	:7
	...
	count:	494
	Analysis time: 663ms
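
Letter counts like those in the `letters:` part of the sample could be gathered along the following lines. This is a minimal sketch, not the project's code: `LetterCountSketch`, `countLetters` and `formatLine` are hypothetical names, the line format only imitates the sample, and the letter-class column (such as `C`) is left out.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of the project's API.
class LetterCountSketch {
    // Counts how often each code point occurs; TreeMap keeps code-point order.
    static Map<Integer, Integer> countLetters(Iterable<String> words) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String word : words) {
            word.codePoints().forEach(cp -> counts.merge(cp, 1, Integer::sum));
        }
        return counts;
    }

    // Formats one line as: hex code point, the letter itself, then ":count".
    static String formatLine(int codePoint, int count) {
        String letter = new String(Character.toChars(codePoint));
        return String.format("%04X\t%s\t:%d", codePoint, letter, count);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = countLetters(Arrays.asList("\u1000\u1000\u1001", "\u1001\u1002"));
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            System.out.println(formatLine(e.getKey(), e.getValue()));
        }
        System.out.println("count:\t" + counts.size());
    }
}
```

The same `merge` pattern with a `Map<String, Integer>` would count words, syllables, syllable-heads or syllable-tails.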

Note:

_Syllable_ = a combination of Myanmar letters; one or more syllables join to form a Myanmar word

_Syllable-head_ = the main or first (in standard storage order) consonant or independent vowel in a syllable

_Syllable-tail_ = the remaining part of a syllable after the syllable-head
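
Read literally, these definitions mean that a syllable splits into its first stored letter and everything after it. The sketch below shows that split under the assumption that the head is always the first code point of an already segmented syllable. The names `HeadTailSketch`, `head` and `tail` are hypothetical and not part of the project.

```java
// Hypothetical helper for illustration only; not part of the project's API.
class HeadTailSketch {
    // Returns the first code point of the syllable as a string.
    static String head(String syllable) {
        if (syllable.isEmpty()) {
            return "";
        }
        return syllable.substring(0, syllable.offsetByCodePoints(0, 1));
    }

    // Returns everything after the first code point.
    static String tail(String syllable) {
        if (syllable.isEmpty()) {
            return "";
        }
        return syllable.substring(syllable.offsetByCodePoints(0, 1));
    }

    public static void main(String[] args) {
        // A consonant, a second consonant and asat, written with escapes.
        String syllable = "\u1000\u1000\u103A";
        System.out.println(head(syllable) + " + " + tail(syllable));
    }
}
```

For the syllable ကက် this gives the head က and the tail က်.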

Rules
=====

The words ယောက္ခမ, ယောက်ျား and ယောက်ဖ are considered to contain different types of syllable, so these are counted as different syllables:

	ယောက္	:5
	ယောက်ျား	:20
	ယောက်	:45

Purposes
========

This module can be used in NLP (Natural Language Processing) research in the following ways:

* Myanmar word segmentation (this module can identify syllables)
* Myanmar sorting (fully functional with all Burmese words; still being tested)
* Analysis of Myanmar letter frequency in a lexicon (support for a real lexicon, not just a word list, is still in progress)

Contact
=======

Don't hesitate to contact us. We appreciate your feedback.

email:[email protected]
