https://github.com/anoopkunchukuttan/indic_nlp_library
Resources and tools for Indian language Natural Language Processing
- Host: GitHub
- URL: https://github.com/anoopkunchukuttan/indic_nlp_library
- Owner: anoopkunchukuttan
- License: mit
- Created: 2014-10-15T01:56:20.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2024-04-18T07:09:23.000Z (9 months ago)
- Last Synced: 2024-04-25T01:02:27.766Z (9 months ago)
- Topics: indian-languages, natural-language-processing, python
- Language: Python
- Homepage: http://anoopkunchukuttan.github.io/indic_nlp_library/
- Size: 9.33 MB
- Stars: 530
- Watchers: 33
- Forks: 156
- Open Issues: 27
Metadata Files:
- Readme: README.md
- License: LICENSE
# Indic NLP Library
The goal of the Indic NLP Library is to build Python-based libraries for common text processing and Natural Language Processing tasks in Indian languages. Indian languages share many similarities in script, phonology, syntax, etc., and this library attempts to provide a general solution for the toolsets most commonly required for Indian language text.
The library provides the following functionalities:
- Text Normalization
- Script Information
- Word Tokenization and Detokenization
- Sentence Splitting
- Word Segmentation
- Syllabification
- Script Conversion
- Romanization
- Indicization

**Note**: _Shatanuvadak_ translation and _BrahmiNet_ transliteration APIs are no longer supported. You can use newer [IndicTrans](https://github.com/AI4Bharat/indicTrans) translation and [IndicXlit](https://github.com/AI4Bharat/IndicXlit) transliteration models we developed at [AI4Bharat](https://ai4bharat.iitm.ac.in). In fact, you can find many state-of-the-art datasets and models on the AI4Bharat homepage.
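
The script similarity mentioned above can be made concrete with a small sketch. This is not the library's implementation; it only assumes that the major Brahmi-derived scripts occupy aligned 128-code-point Unicode blocks, so many characters can be mapped across scripts by shifting their offset within the block. The names `convert_script` and `SCRIPT_BLOCK_START` are illustrative and do not come from the library.

```python
# Illustrative sketch only: NOT the Indic NLP Library's implementation.
# Assumption: each script below occupies a 128-code-point Unicode block
# with a shared layout, so a character's offset within its block can be
# reused in another script's block.

SCRIPT_BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80

# Danda and double danda live in the Devanagari block but are shared by the
# other scripts, so they are left untouched rather than shifted.
SHARED_PUNCTUATION = {"\u0964", "\u0965"}


def convert_script(text: str, source: str, target: str) -> str:
    """Shift every character of the source script into the target script's block."""
    src_start = SCRIPT_BLOCK_START[source]
    tgt_start = SCRIPT_BLOCK_START[target]
    converted = []
    for ch in text:
        offset = ord(ch) - src_start
        if ch not in SHARED_PUNCTUATION and 0 <= offset < BLOCK_SIZE:
            converted.append(chr(tgt_start + offset))
        else:
            # Characters outside the source block (spaces, Latin text) pass through.
            converted.append(ch)
    return "".join(converted)
```

For example, `convert_script("कमल", "devanagari", "bengali")` shifts each character by the distance between the two blocks. A real converter needs per-script exceptions, because some offsets are unassigned in some scripts (Tamil, for instance, has no aspirated consonants), which this sketch ignores.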
The data resources required by the Indic NLP Library are hosted in a separate repository. Some modules require these resources. You can download them from the [Indic NLP Resources](https://github.com/anoopkunchukuttan/indic_nlp_resources) project.
**If you are interested in Indian language NLP resources, you should check the [Indic NLP Catalog](https://github.com/indicnlpweb/indicnlp_catalog) for pointers.**
## Pre-requisites
- Python 3.x
- (For the Python 2.x version, check the tag `PYTHON_2.7_FINAL_JAN_2019`. Python 2.x is no longer actively supported, but we will try to maintain as much compatibility as possible.)
- [Indic NLP Resources](https://github.com/anoopkunchukuttan/indic_nlp_resources)
- [Urduhack](https://github.com/urduhack/urduhack): Needed only if Urdu normalization is required. It pulls in further dependencies such as TensorFlow.
- Other dependencies are listed in setup.py
## Configuration
- Installation from pip:

  `pip install indic-nlp-library`

- If you want to use the project from the GitHub repo, add the project to the Python path:
  - Clone this repository
  - Install dependencies: `pip install -r requirements.txt`
  - Run: `export PYTHONPATH=$PYTHONPATH:`

- In either case, export the path to the _Indic NLP Resources_ directory:

  Run: `export INDIC_RESOURCES_PATH=`
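
The configuration steps above can be checked from Python before any resource-backed module is used. The following is a minimal sketch: `resolve_resources_path` and `init_indic_nlp` are hypothetical helper names, not part of the library. The calls `common.set_resources_path` and `loader.load` are the initialization calls as I understand them from the project's example notebook; treat them as an assumption and confirm against the documentation.

```python
import os
from pathlib import Path

ENV_VAR = "INDIC_RESOURCES_PATH"  # environment variable named in the steps above


def resolve_resources_path(env=None) -> Path:
    """Return the resources directory named by INDIC_RESOURCES_PATH.

    Hypothetical helper: raises RuntimeError if the variable is unset or
    does not point to an existing directory.
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        raise RuntimeError(f"{ENV_VAR} is not set")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"{ENV_VAR} points to {path}, which is not a directory")
    return path


def init_indic_nlp(resources_path: Path) -> None:
    """Point the library at the resources and load them.

    Requires indic-nlp-library to be installed. The import is local so that
    the path check above still works without the library.
    """
    from indicnlp import common, loader

    common.set_resources_path(str(resources_path))
    loader.load()
```

Calling `init_indic_nlp(resolve_resources_path())` once at program start fails early with a readable message if the export step was skipped.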
## Usage
You can use the Python API to access all the features of the library. Many of the most common operations are also accessible via a unified commandline API.
### Getting Started
Check [this IPython Notebook](http://nbviewer.jupyter.org/url/anoopkunchukuttan.github.io/indic_nlp_library/doc/indic_nlp_examples.ipynb) for examples of how to use the Python API.
- You can find the Python 2.x Notebook [here](http://nbviewer.jupyter.org/url/anoopkunchukuttan.github.io/indic_nlp_library/doc/indic_nlp_examples_2_7.ipynb)
### Documentation
You can find detailed documentation [HERE](https://indic-nlp-library.readthedocs.io/en/latest)
This documents the Python API as well as the commandline reference.
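
As a quick orientation to the Python API, here is a minimal sketch that chains normalization, sentence splitting, word tokenization, and script conversion. The wrapper functions `preprocess` and `to_script` are hypothetical names introduced for illustration. The imported modules and calls reflect my understanding of the library's API, so verify signatures against the documentation linked above. The sketch requires `indic-nlp-library` to be installed.

```python
def preprocess(text: str, lang: str = "hi") -> dict:
    """Normalize text, split it into sentences, and tokenize each sentence."""
    # Imports are local so that defining this function does not require the library.
    from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
    from indicnlp.tokenize import indic_tokenize, sentence_tokenize

    # Text normalization for the given language code.
    normalizer = IndicNormalizerFactory().get_normalizer(lang)
    normalized = normalizer.normalize(text)

    # Sentence splitting, then word tokenization per sentence.
    sentences = sentence_tokenize.sentence_split(normalized, lang=lang)
    tokens = [indic_tokenize.trivial_tokenize(sentence, lang) for sentence in sentences]

    return {"normalized": normalized, "sentences": sentences, "tokens": tokens}


def to_script(text: str, source_lang: str, target_lang: str) -> str:
    """Convert text from the script of source_lang to the script of target_lang."""
    from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

    return UnicodeIndicTransliterator.transliterate(text, source_lang, target_lang)
```

A call such as `preprocess("यह एक वाक्य है।", lang="hi")` returns the normalized text together with its sentences and per-sentence token lists, and `to_script(text, "hi", "ta")` renders Hindi text in Tamil script.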
## Citing
If you use this library, please include the following citation:
```
@misc{kunchukuttan2020indicnlp,
author = "Anoop Kunchukuttan",
title = "{The IndicNLP Library}",
year = "2020",
howpublished={\url{https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf}}
}
```
You can find the document [HERE](docs/indicnlp.pdf)
## Website
`http://anoopkunchukuttan.github.io/indic_nlp_library`
## Author
Anoop Kunchukuttan ([[email protected]]([email protected]))
## Companies, Organizations, Projects using IndicNLP Library
- [AI4Bharat-IndicNLPSuite](https://indicnlp.ai4bharat.org)
- [The Classical Language Toolkit](http://cltk.org)
- [Microsoft NLP Recipes](https://github.com/microsoft/nlp-recipes)
- [Facebook M2M-100](https://github.com/pytorch/fairseq/tree/master/examples/m2m_100)
## Revision Log
0.81 : 26 May 2021
- Bug fix in version number extraction

0.80 : 24 May 2021
- Improved sentence splitting
- Bug fixes
- Support for Urdu Normalizer

0.71 : 03 Sep 2020
- Improved documentation
- Bug fixes

0.7 : 02 Apr 2020:
- Unified commandline
- Improved documentation
- Added setup.py

0.6 : 16 Dec 2019:
- New romanizer and indicizer
- Script Unifiers
- Improved script normalizers
- Added contrib directory for sample uses
- changed to MIT license

0.5 : 03 Jun 2019:
- Improved word tokenizer to handle dates and numbers.
- Added sentence splitter that can handle common prefixes/honorifics and uses some heuristics.
- Added detokenizer
- Added acronym transliterator that can convert English acronyms to Brahmi-derived scripts

0.4 : 28 Jan 2019: Ported to Python 3, and lots of feature additions since last release; primarily around script information, script similarity and syllabification.
0.3 : 21 Oct 2014: Supports morph-analysis between Indian languages
0.2 : 13 Jun 2014: Supports transliteration between Indian languages and tokenization of Indian languages
0.1 : 12 Mar 2014: Initial version. Supports text normalization.
## LICENSE
Indic NLP Library is released under the MIT license