{"id":19401019,"url":"https://github.com/google-research/nisaba","last_synced_at":"2025-04-24T07:30:26.754Z","repository":{"id":37078210,"uuid":"318338748","full_name":"google-research/nisaba","owner":"google-research","description":"Finite-state script normalization and processing utilities","archived":false,"fork":false,"pushed_at":"2024-10-16T16:54:49.000Z","size":2184,"stargazers_count":38,"open_issues_count":19,"forks_count":4,"subscribers_count":7,"default_branch":"main","last_synced_at":"2024-10-18T23:22:01.723Z","etag":null,"topics":["bengali","brahmic-scripts","devanagari","finite-state","finite-state-automata","finite-state-transducers","grammars","gujarati","gurmukhi","indic-languages","kannada","malayalam","oriya","pynini","sinhala","tamil","telugu","unicode","unicode-normalization","writing-systems"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-03T22:40:26.000Z","updated_at":"2024-10-13T01:24:20.000Z","dependencies_parsed_at":"2024-01-25T19:30:54.109Z","dependency_job_id":"df6c689e-3490-43b4-99cc-da3a13c8de6a","html_url":"https://github.com/google-research/nisaba","commit_stats":{"total_commits":402,"total_committers":16,"mean_commits":25.125,"dds":0.5796019900497513,"last_synced_commit":"529381726a74f45e580581d1d13ea61670211139"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/google-research%2Fnisaba","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fnisaba/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fnisaba/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fnisaba/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google-research","download_url":"https://codeload.github.com/google-research/nisaba/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250582770,"owners_count":21453911,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bengali","brahmic-scripts","devanagari","finite-state","finite-state-automata","finite-state-transducers","grammars","gujarati","gurmukhi","indic-languages","kannada","malayalam","oriya","pynini","sinhala","tamil","telugu","unicode","unicode-normalization","writing-systems"],"created_at":"2024-11-10T11:16:43.402Z","updated_at":"2025-04-24T07:30:26.198Z","avatar_url":"https://github.com/google-research.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![GitHub license](https://img.shields.io/badge/license-Apache2-blue.svg)](https://github.com/google-research/nisaba/blob/main/LICENSE)\n[![Paper](https://img.shields.io/badge/paper-EACL%202021-blue.svg)](https://www.aclweb.org/anthology/2021.eacl-demos.3/)\n[![Build Tests 
(Linux)](https://github.com/google-research/nisaba/workflows/linux/badge.svg)](https://github.com/google-research/nisaba/actions?query=workflow%3A%22linux%22)\n[![Build Tests (macOS)](https://github.com/google-research/nisaba/workflows/macos/badge.svg)](https://github.com/google-research/nisaba/actions?query=workflow%3A%22macos%22)\n\n# Nisaba\n\nNamed after Nisaba, the Sumerian goddess of writing and scribe of the gods\n(𒀭𒉀).\n\n![nisaba](etc/nisaba.png \"Source: The Pergamon Museum, Berlin, Germany\")\n\n## About\n\nCollection of finite-state transducer-based (FST) tools for visual\nnormalization, well-formedness, transliteration and NFC normalization of various\nscripts from South Asia and beyond. Nisaba provides these APIs in Python and\nC++. Currently supported script families:\n\n*   Brahmic scripts ([documentation](nisaba/scripts/brahmic/README.md)).\n*   Alphabets and *abjads* ([documentation](nisaba/scripts/abjad_alphabet/README.md)).\n*   Natural transliteration for Brahmic scripts ([documentation](nisaba/scripts/natural_translit/README.md)).\n\nNisaba primarily relies on [OpenGrm Pynini](http://pynini.opengrm.org/), which\nis a Python toolkit for finite-state grammar development. OpenGrm Pynini, like\nits C++ counterpart [Thrax](http://thrax.opengrm.org/), compiles grammars\nexpressed as strings, regular expressions, and context-dependent rewrite rules\ninto\n[weighted finite-state transducers](http://www.cs.nyu.edu/~mohri/pub/fla.pdf)\n(WFSTs). It uses the [OpenFst](http://openfst.org) library and its Python\nextension to create, access and manipulate compiled grammars.\n\n## Building and testing\n\nThis library will build on any system that supports the\n[Bazel](https://bazel.build/) multiplatform build and test tool. 
The\nfollowing examples assume the [Debian](https://www.debian.org/) Linux distribution,\nbut should also apply with minor modifications to other Linux and non-Linux\nplatforms that Bazel supports.\n\n### Prerequisites\n\n#### Bazel or Bazelisk\n\nYour operating system may permit easy installation of a pre-built Bazel\npackage, as the Debian-specific example below shows:\n\n```shell\nsudo apt-get install bazel\n```\n\nAlternatively, e.g., on macOS, a user-friendly Bazel launcher called\n[Bazelisk](https://github.com/bazelbuild/bazelisk) can be installed:\n\n```shell\nBAZEL=bazelisk-darwin-amd64\ncurl -LO \"https://github.com/bazelbuild/bazelisk/releases/latest/download/$BAZEL\"\nchmod +x $BAZEL\n```\n\nWhen using Bazelisk, simply replace the command `bazel` in the examples below\nwith `$BAZEL`.\n\n#### C++ and Python\n\nNisaba requires a modern C++ compiler that supports the C++17 standard (e.g., the\n[GCC 10](https://gcc.gnu.org/gcc-10/) release series) and Python3. Assuming\nthese are already present, the required dependencies are the Python3 development\nheaders and the Python3 package installer [pip](https://pip.pypa.io/en/stable/).\n\n```shell\nsudo apt-get install python3-dev\nsudo apt-get install python3-pip\n```\n\nExample Debian configuration: gcc (10.2.0), bazel (3.7.2), python3 (3.8.6) and\npip (20.1.1).\n\n### Getting and building the code\n\n1.  Locally, make sure you are in some sort of virtual environment (`venv`,\n    `virtualenv`, `conda`, etc).\n\n2.  Clone the repository (note that this example clones the main repository\n    rather than a fork, but a forked repo can be used as well):\n\n    ```shell\n    git clone https://github.com/google-research/nisaba.git\n    cd nisaba\n    ```\n\n3.  
Build all the targets using Bazel (this example uses optimized mode):\n\n    ```shell\n    bazel build -c opt ...\n    ```\n\n    The above command will build Nisaba artifacts using all the remote\n    repository dependencies, including OpenFst, Pynini and Thrax, that are\n    specified in the Bazel [WORKSPACE](WORKSPACE.bazel) file. The resulting\n    artifacts are located in the `bazel-bin/nisaba` directory.\n\n    If the above command fails due to missing Python prerequisites, please\n    install them using the `pip` Python package manager and try again:\n\n    ```shell\n    pip3 install --upgrade pip\n    pip3 install -r requirements.txt\n    ```\n\n4.  Make sure the small unit tests are passing:\n\n    ```shell\n    bazel test -c opt --test_size_filters=-large,-enormous ...\n    ```\n\n    The above command should produce something along the following lines:\n\n    ```shell\n      ...\n      //nisaba/scripts/brahmic:cc_test                                                 PASSED in 0.4s\n      //nisaba/scripts/brahmic:far_cc_test                                             PASSED in 0.2s\n      //nisaba/scripts/brahmic:far_test                                                PASSED in 2.0s\n      //nisaba/scripts/brahmic:fixed_test                                              PASSED in 0.2s\n      //nisaba/scripts/brahmic:fst_properties_test                                     PASSED in 2.3s\n      //nisaba/scripts/brahmic:iso_test                                                PASSED in 0.3s\n      //nisaba/scripts/brahmic:nfc_test                                                PASSED in 0.2s\n      //nisaba/scripts/brahmic:nfc_utf8_test                                           PASSED in 0.2s\n      //nisaba/scripts/brahmic:py_test                                                 PASSED in 2.1s\n      //nisaba/scripts/brahmic:util_test                                               PASSED in 1.9s\n      //nisaba/scripts/brahmic:visual_norm_test                            
            PASSED in 0.3s\n      //nisaba/scripts/brahmic:visual_norm_utf8_test                                   PASSED in 0.3s\n      //nisaba/scripts/brahmic:wellformed_test                                         PASSED in 0.2s\n      //nisaba/scripts/brahmic:wellformed_utf8_test                                    PASSED in 0.2s\n      ...\n    ```\n\n    You may also want to run *all* the tests, but depending on your host\n    configuration these may take a long time:\n\n    ```shell\n    bazel test -c opt ...\n    ```\n\n## Contributions\n\nNOTE: We don't accept pull requests (PRs) at the moment.\n\n## License\n\nNisaba is licensed under the terms of the Apache license. See [LICENSE](LICENSE)\nfor more information.\n\n## Citation\n\nIf you use this software in a publication, please cite the accompanying\n[paper](https://www.aclweb.org/anthology/2021.eacl-demos.3.pdf) from\n[EACL 2021](https://2021.eacl.org/):\n\n```bibtex\n@inproceedings{nisaba-eacl2021,\n    title = {Finite-state script normalization and processing utilities: The {N}isaba {B}rahmic library},\n    author = {Cibu Johny and Lawrence Wolf-Sonkin and Alexander Gutkin and Brian Roark},\n    booktitle = {16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021): System Demonstrations},\n    address = {[Online], Kyiv, Ukraine},\n    month = apr,\n    year = {2021},\n    pages = {14--23},\n    publisher = {Association for Computational Linguistics},\n    doi = {10.18653/v1/2021.eacl-demos.3},\n    url = {https://www.aclweb.org/anthology/2021.eacl-demos.3},\n}\n```\n\n## Mandatory disclaimer\n\nThis is not an official Google product.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fnisaba","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-research%2Fnisaba","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fnisaba/lists"}