https://github.com/varnamproject/libvarnam

Deprecated. See https://github.com/varnamproject/govarnam

Introduction
============

`libvarnam` is a cross-platform, self-learning, open source library that supports transliteration and reverse transliteration for Indian languages. At its core is a C shared library providing algorithms and patterns for transliteration. `libvarnam` has a simple built-in learning module that learns words to improve the transliteration experience.

Installing libvarnam
====================

```shell
wget http://download.savannah.gnu.org/releases/varnamproject/libvarnam/source/libvarnam-$VERSION.tar.gz
tar -xvf libvarnam-$VERSION.tar.gz
cd libvarnam-$VERSION
cmake . && make
sudo make install
```

This installs the `libvarnam` shared libraries and the `varnamc` command-line utility. `varnamc` can be used to quickly try out varnam.

### Installation on Windows

On Windows, you can compile `libvarnam` using Visual Studio. Use the following `cmake` command to generate the project files:

```shell
cmake -DBUILD_TESTS=false -DBUILD_VST=false -DRUN_TESTS=false .
```

Usage
=====

### Transliterate

Usage: `varnamc -s lang_code -t word`

```shell
varnamc -s ml -t varnam
വർണം
വർണമേറിയത്
```

### Reverse Transliterate

Usage: `varnamc -s lang_code -r word`

```shell
varnamc -s ml -r വർണം
varnam
```

Word corpus
===========

`libvarnam` is a learning system. It works better with a word corpus. You can obtain the word corpus and make varnam learn all the words. This will enable `libvarnam` to provide intelligent suggestions.

Here is an example of loading the Malayalam word corpus:

```shell
mkdir words
cd words
wget http://download.savannah.gnu.org/releases/varnamproject/words/ml/ml.tar.gz
tar -xvf ml.tar.gz
varnamc -s ml --learn-from .
```

This can take some time, depending on how many words you are loading.

[More word corpora are available here](http://mirror.rackdc.com/savannah/varnamproject/words/).

The `--import-learnings-from` option imports files that already contain learned data. Importing these files takes much less time than learning from a word corpus.

What next?
==========

If you just want to use varnam for input, you have the following options:

- [Varnam on iBUS](https://github.com/varnamproject/libvarnam-ibus) - For Linux
- [Varnam online editor](https://www.varnamproject.com/editor) - Platform agnostic

If you are a programmer, you will be interested in `libvarnam`. You can use it to provide Indian language support in your applications. `libvarnam` can be used from different programming languages.

How Varnam works
================

1. Scheme files and symbol tables
2. Transliteration
3. Learning

## Scheme files and symbol tables

A scheme file maps English letters to their phonetic equivalent indic letters. All vowels, consonants and consonant clusters are mapped to their indic equivalents. Varnam uses the scheme file mapping to perform transliteration.

Scheme files are plain text but use a custom DSL to make the mapping easier. This DSL is implemented in Ruby, so a scheme file can contain any valid Ruby code. It also provides many helper functions to make the mapping easier.

The `schemes/` directory contains the scheme files for all supported languages. Each language is represented by its ISO language code.

### Symbol tables

The compiled version of a scheme file is called a *Varnam Symbol Table* (vst). Compilation is done using the `varnamc` command-line utility:

```shell
varnamc --compile schemes/ml
```

Symbol tables are a binary representation of the plain text scheme files. They also contain other metadata to make lookups easier.

`libvarnam` understands only the symbol table format. Because of this, every scheme file must be compiled into *vst* format before it can be used with varnam. To compile all scheme files present in the *schemes* directory, run:

```shell
make vst
```

#### Symbol table lookup

Varnam can be initialized with just the ISO language code. When this happens, varnam scans the following directories and tries to find a matching symbol table file. If one is found, it is loaded and used for all operations.

* "/usr/local/share/varnam/vst"
* "/usr/share/varnam/vst"
* "schemes"
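
The lookup above can be sketched roughly as follows. This is a minimal illustration, not libvarnam's actual code: the helper names `build_vst_path` and `find_symbol_table` are hypothetical, and the `<lang_code>.vst` file name is an assumption. Only the three directories come from the list above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Search directories, in the order listed above. */
static const char *const VST_DIRS[] = {
    "/usr/local/share/varnam/vst",
    "/usr/share/varnam/vst",
    "schemes",
};

/* Hypothetical helper: writes "<dir>/<lang>.vst" into buf.
 * The "<lang>.vst" naming is an assumption made for this sketch.
 * Returns 0 on success, -1 if the path does not fit in buf. */
int build_vst_path(const char *dir, const char *lang, char *buf, size_t size)
{
    assert(dir != NULL && lang != NULL && buf != NULL);
    int n = snprintf(buf, size, "%s/%s.vst", dir, lang);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* Hypothetical helper: fills buf with the first candidate path that can
 * be opened and returns 0; returns -1 if no directory has a match. */
int find_symbol_table(const char *lang, char *buf, size_t size)
{
    size_t count = sizeof(VST_DIRS) / sizeof(VST_DIRS[0]);
    for (size_t i = 0; i < count; i++) {
        if (build_vst_path(VST_DIRS[i], lang, buf, size) != 0)
            continue;
        FILE *f = fopen(buf, "rb"); /* existence check by trying to open */
        if (f != NULL) {
            fclose(f);
            return 0;
        }
    }
    return -1;
}
```

Calling `find_symbol_table("ml", buf, sizeof buf)` would, under these assumptions, return the first of the three directories that contains `ml.vst`.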

## Transliteration

```c
int varnam_transliterate(varnam *handle, const char *input, varray **output);
```

`varnam_transliterate` is the entry point for transliteration. Transliteration converts *input* to the phonetic equivalent indic text. It also provides a set of possible matches for the given input.

Transliteration performs the following steps under the hood:

1. Tokenization of the *input*. Varnam uses a greedy tokenizer that processes *input* from left to right. The tokenizer tries all possible combinations to generate the longest possible tokens for the given input. Tokens are generated using the symbol table provided to varnam.
2. Assembly of the generated tokens. Varnam computes all possible combinations of these tokens. For the input *malayalam*, varnam generates tokens such as *മ, ല, യാ, ളം ([ma], [la], [ya], [lam])* and many others. Once these tokens are generated, they are combined and tested against the learning model to discard garbage values and find the most used words. Words are sorted by frequency and returned to the caller.
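
The greedy, left-to-right matching can be illustrated with a small sketch. It assumes a toy four-entry table built from the `[ma], [la], [ya], [lam]` example above in place of a real symbol table, and it keeps only the single longest match at each position, whereas varnam explores many combinations and ranks them. The names `longest_match` and `greedy_transliterate` are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for a symbol table: pattern -> indic text.
 * The entries mirror the [ma], [la], [ya], [lam] example above. */
struct symbol { const char *pattern; const char *value; };

static const struct symbol SYMBOLS[] = {
    { "ma",  "മ"  },
    { "la",  "ല"  },
    { "ya",  "യാ" },
    { "lam", "ളം" },
};

/* Returns the symbol whose pattern is the longest prefix of input,
 * or NULL when no pattern matches at this position. */
const struct symbol *longest_match(const char *input)
{
    const struct symbol *best = NULL;
    size_t best_len = 0;
    for (size_t i = 0; i < sizeof(SYMBOLS) / sizeof(SYMBOLS[0]); i++) {
        size_t len = strlen(SYMBOLS[i].pattern);
        if (len > best_len && strncmp(input, SYMBOLS[i].pattern, len) == 0) {
            best = &SYMBOLS[i];
            best_len = len;
        }
    }
    return best;
}

/* Greedy left-to-right pass: emit the longest matching token, advance
 * past it, and copy unmatched bytes through unchanged.
 * Returns 0 on success, -1 if out is too small. */
int greedy_transliterate(const char *input, char *out, size_t size)
{
    assert(input != NULL && out != NULL && size > 0);
    size_t used = 0;
    out[0] = '\0';
    while (*input != '\0') {
        const struct symbol *sym = longest_match(input);
        const char *piece = sym ? sym->value : input;
        size_t piece_len = sym ? strlen(sym->value) : 1;
        size_t consumed = sym ? strlen(sym->pattern) : 1;
        if (used + piece_len + 1 > size)
            return -1;
        memcpy(out + used, piece, piece_len);
        used += piece_len;
        out[used] = '\0';
        input += consumed;
    }
    return 0;
}
```

With this table, the final position of *malayalam* matches both `la` and `lam`, and the longer `lam` wins, which is the behaviour the word "greedy" refers to.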

### Renderer

Most of the processing in varnam is language agnostic and should work fine for all Indian languages. However, language-specific fixes are sometimes required. Varnam handles this using *Renderers*. Any language can register renderers, and varnam invokes them just before rendering the final output. A renderer can contain language-specific rules that can't be generalized otherwise.
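
One way to picture the renderer idea is a table of per-language callbacks that run as a last step. The sketch below is purely illustrative: the `renderer_fn` signature, the registry, and `example_rule` are all hypothetical and do not reflect libvarnam's internal renderer structures.

```c
#include <assert.h>
#include <string.h>

#define MAX_RENDERERS 16

/* Hypothetical renderer signature: rewrites text in place (buffer
 * capacity size) just before the output is handed back. */
typedef void (*renderer_fn)(char *text, size_t size);

struct renderer { const char *lang; renderer_fn fn; };

static struct renderer registry[MAX_RENDERERS];
static size_t registry_count = 0;

/* Registers a language-specific hook. Returns 0, or -1 when full. */
int register_renderer(const char *lang, renderer_fn fn)
{
    assert(lang != NULL && fn != NULL);
    if (registry_count == MAX_RENDERERS)
        return -1;
    registry[registry_count].lang = lang;
    registry[registry_count].fn = fn;
    registry_count++;
    return 0;
}

/* Runs every renderer registered for lang; text for other languages is
 * left untouched. Returns the number of renderers invoked. */
int render(const char *lang, char *text, size_t size)
{
    int invoked = 0;
    for (size_t i = 0; i < registry_count; i++) {
        if (strcmp(registry[i].lang, lang) == 0) {
            registry[i].fn(text, size);
            invoked++;
        }
    }
    return invoked;
}

/* Made-up rule for this sketch: replace every '~' with '-'. */
void example_rule(char *text, size_t size)
{
    for (size_t i = 0; i < size && text[i] != '\0'; i++)
        if (text[i] == '~')
            text[i] = '-';
}
```

The point of the design is that the generic pipeline never needs to know about a specific language: it only calls `render` with the active language code.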

## Learning

```c
int varnam_learn(varnam *handle, const char *word);
```

Varnam can learn new words. The more words it learns, the better it performs. The learning process learns both the words and their patterns.

The learning process persists the following data:

1. Patterns: All English combinations that can be used to input the given indic text
2. Words: The indic text itself
3. Prefixes: Prefixes of patterns and words

When an indic word is learned, varnam tokenizes the word using the symbol table and tries to learn all possible patterns that can be used to input the word. Internally, varnam keeps a prefix tree and frequencies of all patterns. This storage structure allows varnam to retrieve matching words efficiently when a pattern is presented. Basic stemming is also performed while learning words.

When the same word/pattern combination is learned again, varnam updates the frequency with which it has seen this pattern. This frequency is used to sort and pick the best candidate while performing transliteration.
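
A prefix tree with frequency counts can be sketched in memory as below. This is an illustration of the data structure only, under several assumptions: patterns are limited to lowercase ASCII, each pattern stores a single word, and the names `node_new`, `trie_learn`, `trie_find` and `trie_free` are hypothetical. It says nothing about how varnam actually persists learned data.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 26 /* sketch handles lowercase ASCII patterns only */

/* One node of a hypothetical in-memory prefix tree. */
struct node {
    struct node *child[ALPHABET];
    const char *word;   /* indic text learned for this pattern, if any */
    unsigned frequency; /* times this pattern/word pair was learned */
};

struct node *node_new(void)
{
    return calloc(1, sizeof(struct node));
}

/* Records one sighting of pattern -> word. Returns the updated
 * frequency, or 0 on allocation failure or an unsupported character. */
unsigned trie_learn(struct node *root, const char *pattern, const char *word)
{
    assert(root != NULL && pattern != NULL && word != NULL);
    struct node *cur = root;
    for (; *pattern != '\0'; pattern++) {
        if (*pattern < 'a' || *pattern > 'z')
            return 0;
        int idx = *pattern - 'a';
        if (cur->child[idx] == NULL) {
            cur->child[idx] = node_new();
            if (cur->child[idx] == NULL)
                return 0;
        }
        cur = cur->child[idx];
    }
    cur->word = word; /* caller keeps word alive; one word per pattern */
    return ++cur->frequency;
}

/* Returns the node reached by prefix, or NULL if nothing learned starts
 * with it. All completions live in the returned subtree. */
const struct node *trie_find(const struct node *root, const char *prefix)
{
    const struct node *cur = root;
    for (; cur != NULL && *prefix != '\0'; prefix++) {
        if (*prefix < 'a' || *prefix > 'z')
            return NULL;
        cur = cur->child[*prefix - 'a'];
    }
    return cur;
}

void trie_free(struct node *n)
{
    if (n == NULL)
        return;
    for (int i = 0; i < ALPHABET; i++)
        trie_free(n->child[i]);
    free(n);
}
```

Learning the same pattern twice raises its frequency to 2, and a lookup for a shorter prefix such as `var` lands on an interior node whose subtree holds every learned pattern that starts with it.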

Learning can be initiated by calling Varnam APIs directly or using *varnamc*.

Input tools like ibus-engine will automatically learn the words that you are typing.

Learned data is kept in one of the following locations:

* APPDATA\varnam\suggestions (Windows)
* XDG_DATA_HOME/varnam/suggestions
* HOME/.local/share/varnam/suggestions
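
Resolving these locations might look like the sketch below. The precedence (`APPDATA` first, then `XDG_DATA_HOME`, then `HOME`) is an assumption, since the list above only says the data lives in one of these places, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Treats NULL and "" as unset. */
static int is_set(const char *value)
{
    return value != NULL && strlen(value) > 0;
}

/* Hypothetical helper: picks the suggestions directory from the given
 * environment values. The precedence APPDATA, then XDG_DATA_HOME, then
 * HOME is an assumption made for this sketch.
 * Returns 0 on success, -1 if nothing is set or buf is too small. */
int suggestions_dir(const char *appdata, const char *xdg_data_home,
                    const char *home, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    int n;
    if (is_set(appdata))
        n = snprintf(buf, size, "%s\\varnam\\suggestions", appdata);
    else if (is_set(xdg_data_home))
        n = snprintf(buf, size, "%s/varnam/suggestions", xdg_data_home);
    else if (is_set(home))
        n = snprintf(buf, size, "%s/.local/share/varnam/suggestions", home);
    else
        return -1;
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* Convenience wrapper that reads the real environment. */
int suggestions_dir_from_env(char *buf, size_t size)
{
    return suggestions_dir(getenv("APPDATA"), getenv("XDG_DATA_HOME"),
                           getenv("HOME"), buf, size);
}
```

Passing the environment values as parameters keeps the selection logic easy to exercise without modifying the process environment.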

Mozilla Public License
======================

Copyright (c) 2016 Navaneeth.K.N

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/.