https://github.com/liulalemx/felig-toolkit
A toolset for Amharic Language pre-processing. Includes an Amharic Stemmer, Transliterator, Stopword remover , Lexical analyzer, Corpus indexer and Term weighter.
amharic amharic-corpus amharic-nlp amharic-stemmer corpus lexical-analyzer linguistics stopword-removal transliterator
- Host: GitHub
- URL: https://github.com/liulalemx/felig-toolkit
- Owner: liulalemx
- License: mit
- Created: 2022-07-27T15:40:08.000Z (over 2 years ago)
- Default Branch: staging
- Last Pushed: 2023-05-27T10:16:03.000Z (over 1 year ago)
- Last Synced: 2024-10-29T00:44:36.310Z (21 days ago)
- Topics: amharic, amharic-corpus, amharic-nlp, amharic-stemmer, corpus, lexical-analyzer, linguistics, stopword-removal, transliterator
- Language: TypeScript
- Homepage: https://felig-toolkit-web.vercel.app/
- Size: 7.41 MB
- Stars: 29
- Watchers: 3
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Felig Toolkit
A toolset for Amharic language pre-processing
Felig Toolkit Web | Now with TypeScript support!
---
## What is felig-toolkit?
It is a toolset for Amharic Language pre-processing. It includes an Amharic Stemmer, Amharic Transliterator, Amharic Stopword remover, Amharic Lexical analyzer, Amharic Corpus indexer and Term weighter.
### Amharic Lexical Analyzer
Breaks down an Amharic corpus and returns tokens by removing whitespace, expanding abbreviations, removing numbers, breaking up hyphenated words, and removing punctuation (`፡ ። ! ?`...).
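As a rough illustration of the steps listed above (not the toolkit's actual implementation), a tokenizer can be sketched with two regular expressions. The name `tokenize` and the exact character ranges are assumptions made for this example. Abbreviation expansion is left out because it needs a lookup table.

```typescript
// Hypothetical sketch of the lexical analysis steps; not felig-toolkit's own code.
// Ethiopic punctuation sits at U+1360..U+1368, Ethiopic digits and numbers at U+1369..U+137C.
const NUMBERS = /[0-9\u1369-\u137C]+/g;
const PUNCTUATION = /[\u1360-\u1368!?.,;:"'()\[\]-]/g;

function tokenize(corpus: string): string[] {
  return corpus
    .replace(NUMBERS, " ")      // drop ASCII and Ethiopic numerals
    .replace(PUNCTUATION, " ")  // drop punctuation; hyphens become word breaks
    .split(/\s+/)               // split on any run of whitespace
    .filter((token) => token.length > 0);
}
```

Replacing punctuation with a space rather than deleting it keeps words apart when they are joined only by the Ethiopic word separator.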
### Amharic Stopword remover
Removes commonly occurring words that contribute nothing to the semantics of the corpus.
### Amharic Transliterator
Changes Unicode Amharic characters to ASCII. Example: `ልጆች -> ልጅኦች -> ljoc`. This tool implements two types of Amharic transliteration lookup tables.
- SERA (System for Ethiopic Representation in ASCII) - This system maps letters with similar sounds separately. Eg: `(ሀ፣ሐ፡ኀ)፣(ሰ፡ሠ)፡(ጸ፡ፀ)፡(ዐ፡አ)`. However, in practice these letters are used interchangeably, so using SERA would greatly decrease recall. **NOT RECOMMENDED!**
- Felig - Normalizes the redundant symbols into a common symbol. **RECOMMENDED!**
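The normalization idea can be sketched as a small lookup applied before transliteration. This is only an illustration: the table below covers just the base form of each redundant letter, the choice of canonical letter is an assumption, and `normalizeLetters` is a hypothetical name rather than part of the package.

```typescript
// Hypothetical normalization table: each redundant letter maps to one canonical
// homophone. Which letter counts as canonical is an assumption for this sketch.
const REDUNDANT = new Map<string, string>([
  ["ሐ", "ሀ"],
  ["ኀ", "ሀ"],
  ["ሠ", "ሰ"],
  ["ፀ", "ጸ"],
  ["ዐ", "አ"],
]);

function normalizeLetters(word: string): string {
  // Ethiopic letters are in the Basic Multilingual Plane, so splitting per
  // UTF-16 code unit yields one letter per element.
  return word
    .split("")
    .map((ch) => REDUNDANT.get(ch) ?? ch)
    .join("");
}
```

With this table, two spellings of the same word that differ only in a redundant letter collapse to a single form, which is what keeps recall from dropping.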
### Amharic Stemmer [LIVE DEMO](https://felig-toolkit-web.vercel.app/#demo)
Reduces the different morphological (e.g. inflectional or derivational) variations of Amharic word forms by taking an Amharic word and returning its stem through longest-match affix removal. Example:
`ልጆች -> ልጅኦች -> ljoc -> lj -> ልጅ`
### Amharic Corpus Indexer
Produces an index file for the stemmed words in a corpus and relates them with the files they are found in. It also stores their frequencies per file.
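Conceptually this is an inverted index. The sketch below shows one way such a structure could be built in memory. The type and function names are hypothetical, and the layout of the toolkit's real `.json` index file may differ.

```typescript
// Hypothetical in-memory inverted index: stem -> (file name -> occurrences in that file).
type InvertedIndex = Map<string, Map<string, number>>;

function buildIndex(docs: Map<string, string[]>): InvertedIndex {
  const index: InvertedIndex = new Map();
  docs.forEach((stems, file) => {
    stems.forEach((stem) => {
      // Fetch the postings for this stem, creating them on first sight.
      const postings = index.get(stem) ?? new Map<string, number>();
      postings.set(file, (postings.get(file) ?? 0) + 1);
      index.set(stem, postings);
    });
  });
  return index;
}
```

Each key of `docs` is a file name and each value is the list of stems found in that file, so one pass over the corpus yields both the file relations and the per-file frequencies.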
### Term Weighter
Calculates the weight of each word in the index file as the product of its length-normalized term frequency and its inverse document frequency (`tf*idf`).
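For reference, a common formulation of this weight is sketched below. The natural logarithm and the absence of smoothing are assumptions, since the toolkit's exact formula is not stated here, and `tfIdf` is a hypothetical helper name.

```typescript
// Sketch of a tf*idf weight under stated assumptions (natural log, no smoothing).
// tf:  occurrences of the term in the document divided by the document length.
// idf: log of (total number of documents / number of documents containing the term).
function tfIdf(
  termCount: number,
  docLength: number,
  totalDocs: number,
  docsWithTerm: number
): number {
  const tf = termCount / docLength;
  const idf = Math.log(totalDocs / docsWithTerm);
  return tf * idf;
}
```

A term that appears in every document gets an `idf` of zero, so its weight is zero no matter how often it occurs.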
## Installation
Felig Toolkit is available as a package on NPM for use in a Node application:
```bash
# NPM
npm install felig-toolkit
```
```bash
# YARN
yarn add felig-toolkit
```
```bash
# PNPM
pnpm install felig-toolkit
```
### Example
**note: this package uses es-modules**
```javascript
import felig_toolkit from 'felig-toolkit'
```
## What's Included
- `felig_transliterate(word,lang)`: takes a single word and its language (am/en) and returns the Felig-transliterated string.
- `sera_transliterate(word,lang)`: takes a single word and its language (am/en) and returns the SERA-transliterated string.
- `rmvStopwrd(corpus)`: takes an Amharic corpus text (sentence/paragraph/multiple paragraphs) and removes stop words.
- `lexAnalyze(corpus)`: takes an Amharic corpus text and returns a string of tokens.
- `stem(word)`: takes an Amharic word string and returns the stem as a string (async).
- `indexer(filesArray, outputIndexFilePath, type)`: takes an array of files and produces an index (`.json`) file. `(type= "doc" | "query")`
- `weigh_terms(indexFilePath, outputWeightedTermsPath, typeOfIndex)`: takes an index file and produces a file (`.json`) with weighted terms. `(typeOfIndex= "doc" | "query")`
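The functions above can be chained into a small pre-processing pipeline. The sketch below takes the toolkit as a parameter so that it does not depend on how the package exposes its functions. The `FeligLike` interface, the `preprocess` name, the return types, and the assumption that tokens are separated by whitespace are all guesses based on the descriptions above, not confirmed API details.

```typescript
// Assumed shape of the three functions used here, based on the list above.
// Return types are guesses; `await` below works whether a call is sync or async.
interface FeligLike {
  lexAnalyze(corpus: string): string | Promise<string>;
  rmvStopwrd(corpus: string): string | Promise<string>;
  stem(word: string): Promise<string>;
}

async function preprocess(toolkit: FeligLike, corpus: string): Promise<string[]> {
  const tokens = await toolkit.lexAnalyze(corpus);   // tokenize the raw text
  const content = await toolkit.rmvStopwrd(tokens);  // drop stop words
  // Assumption: the remaining tokens are separated by whitespace.
  const words = content.split(/\s+/).filter((w) => w.length > 0);
  return Promise.all(words.map((w) => toolkit.stem(w))); // stem each word
}
```

If the functions turn out to be properties of the default export, the call would look like `preprocess(felig_toolkit, text)`. Check the package's type definitions before relying on that.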
## How to use in Web apps
> Felig Toolkit does not work in the browser (it requires a Node.js environment).
> Use felig-toolkit on your server.
Example:
If you are using Next.js, you can use felig-toolkit in a Next server route handler (`/api/felig/route.ts`) and return the results to the client.
## Contributions
felig-toolkit is open to contributions, but it is recommended that you first create an issue or reply in a comment to let others know what you are working on.
## How to run locally
### Prerequisites
- [nodejs](https://nodejs.org/en/)
1. Clone the repository
1. Run `npm install`
1. Run `node index.js` in the root directory
## Attribution
The following academic papers were used to prepare these tools:
- [Girma Neshir Alemneh. "Amharic Light Stemmer". ResearchGate. Sep 2020.](https://www.researchgate.net/publication/344285263_Amharic_Light_Stemmer)
- [Genet Mezemir Fikremariam. "Automatic Stemming for Amharic text: An experiment using successor variety approach". AAU. Jan 2009.](http://etd.aau.edu.et/bitstream/handle/123456789/14590/Genet%20Mezemir.pdf?sequence=1&isAllowed=y)
- [Tessema Mindaye Mengistu. "Design and Implementation of Amharic Search Engine". ResearchGate. August 2007.](https://www.researchgate.net/publication/323384408_Design_and_Implementation_of_Amharic_Search_Engine)
- [Yitna Firdyiwek and Daniel Yaqob. "The System for Ethiopic Representation in ASCII". ResearchGate. Jan 1997.](https://www.researchgate.net/publication/2682324_The_System_for_Ethiopic_Representation_in_ASCII)