{"id":47219029,"url":"https://github.com/buda-base/lucene-sa","last_synced_at":"2026-03-13T17:08:20.843Z","repository":{"id":37421328,"uuid":"95012078","full_name":"buda-base/lucene-sa","owner":"buda-base","description":"Lucene analyzer for Sanskrit","archived":false,"fork":false,"pushed_at":"2024-09-10T17:00:01.000Z","size":16541,"stargazers_count":4,"open_issues_count":6,"forks_count":2,"subscribers_count":5,"default_branch":"master","last_synced_at":"2026-01-19T19:43:58.109Z","etag":null,"topics":["lucene","lucene-analyzer","sanskrit"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/buda-base.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-21T14:18:56.000Z","updated_at":"2024-10-14T12:54:30.000Z","dependencies_parsed_at":"2022-08-18T18:41:27.373Z","dependency_job_id":null,"html_url":"https://github.com/buda-base/lucene-sa","commit_stats":null,"previous_names":["buddhistdigitalresourcecenter/lucene-sa"],"tags_count":19,"template":false,"template_full_name":null,"purl":"pkg:github/buda-base/lucene-sa","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buda-base%2Flucene-sa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buda-base%2Flucene-sa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buda-base%2Flucene-sa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buda-base%2Flucene-sa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/buda-base","download_url":"https://codeload.github.com/buda-base/lucene-sa/tar.
gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buda-base%2Flucene-sa/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30471140,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-13T11:00:43.441Z","status":"ssl_error","status_checked_at":"2026-03-13T11:00:23.173Z","response_time":60,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lucene","lucene-analyzer","sanskrit"],"created_at":"2026-03-13T17:08:20.119Z","updated_at":"2026-03-13T17:08:20.837Z","avatar_url":"https://github.com/buda-base.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lucene Analyzers for Sanskrit \n\nThis repository contains bricks to implement a full analyzer pipeline in Lucene:\n\n- filters to normalize and convert SLP1, Devanagari and IAST into SLP1\n- indexation in SLP1 or simplified IAST with no diacritics (for lenient search)\n- stopwords filter\n- a syllable-based tokenizer\n- a word tokenizer (that doesn't break compounds)\n\n## Installation through maven:\n\n```xml\n    \u003cdependency\u003e\n      \u003cgroupId\u003eio.bdrc.lucene\u003c/groupId\u003e\n      \u003cartifactId\u003elucene-sa\u003c/artifactId\u003e\n      \u003cversion\u003e1.1.1\u003c/version\u003e\n    \u003c/dependency\u003e\n```\n\n## 
Components\n\n### SanskritAnalyzer\n\n#### Constructors\n\n```\n    SanskritAnalyzer(String mode, String inputEncoding)\n```\n - `mode`: `space` (tokenize at spaces), `syl` (tokenize in syllables) or `word` (tokenize in words)\n - `inputEncoding`: `SLP` (SLP1 encoding), `deva` (Devanagari script) or `roman` (IAST)\n \n\n```\n    SanskritAnalyzer(String mode, String inputEncoding, String stopFilename)\n    \n```\n - `stopFilename`: path to the file, empty string (default list) or `null` (no stopwords)\n\n```\n    SanskritAnalyzer(String mode, String inputEncoding, boolean mergePrepositions, boolean filterGeminates, boolean normalizeAnusvara)\n```\n - `mergePrepositions`: concatenates the token containing a preposition with the next one if `true`\n - `filterGeminates`: normalizes geminates (see [below](#geminatenormalizingfilter)) if `true`, else keeps them as-is (default behavior)\n - `normalizeAnusvara`: normalizes anusvara (see [below](#anusvaranormalizer)) if `true`, else keeps it as-is (default behavior) \n \n```\n    SanskritAnalyzer(String mode, String inputEncoding, boolean mergePrepositions, boolean filterGeminates, String lenient)\n```\n - `lenient`: `index` or `query` (this information is required to select the correct filter pipeline) \n\nIn all configurations except when lenient is activated, the output tokens of the analyzers are always encoded in SLP1.\nLenient analyzers output a drastically simplified IAST (see below for details).\n\n#### Use cases\nThe following use cases are given as examples of possible configurations.\n\n##### 1. Regular search\nA text in IAST can be tokenized in syllables for indexing. The queries are in SLP and tokenized in words. The default stopwords list is applied.\n- Indexing:  `SanskritAnalyzer(\"syl\", \"roman\")`\n- Querying:  `SanskritAnalyzer(\"syl\", \"SLP\")`\n\n##### 2. Lenient search (syllables)\nA text in IAST is indexed in syllables and the queries are split into syllables in the same way. Geminates and anusvara are normalized. 
The lenient search is enabled by indicating either \"index\" or \"query\", thereby selecting the appropriate pipeline of filters.\n- Indexing:  `SanskritAnalyzer(\"syl\", \"roman\", false, true, \"index\")` or simpler: `SanskritAnalyzer.IndexLenientSyl()`\n- Querying:  `SanskritAnalyzer(\"syl\", \"roman\", false, false, \"query\")` or simpler: `SanskritAnalyzer.QueryLenientSyl()`\n\n### SkrtWordTokenizer (deprecated)\n\nThis tokenizer produces words through a Maximal Matching algorithm. It builds on top of [this Trie implementation](https://github.com/BuddhistDigitalResourceCenter/stemmer).\n\n### SkrtSyllableTokenizer\n\nProduces syllable tokens using the same syllabification rules found in Peter Scharf's [script](http://www.sanskritlibrary.org/Sanskrit/SanskritTransliterate/syllabify.html). \n\n### Stopword Filter\n\nThe [list of stopwords](src/main/resources/skrt-stopwords.txt) is [this list](https://gist.github.com/Akhilesh28/b012159a10a642ed5c34e551db76f236) encoded in SLP. The list must be formatted in the following way:\n\n - in SLP encoding\n - 1 word per line\n - empty lines (with and without comments), spaces and tabs are allowed\n - comments start with `#`\n - lines can end with a comment\n\n### GeminateNormalizingFilter\n\nGeminated consonants around an `r` or `y` are commonly found in old documents. 
These can be normalized in order to be found more easily.\n\nThis filter applies the following simplification rules:\n\n```  \n    CCr   →  Cr \n    rCC   →  rC\n    hCC   →  rC\n    ṛCC   →  ṛC\n    CCy   →  Cy\n```\n\n`C` is any consonant in the following list: [k g c j ṭ ḍ ṇ t d n p b m y v l s ś ṣ]\nThe second consonant can be the aspirated counterpart (e.g. `rtth`), in which case the consonant that is kept is the aspirated one.\nThus, \"arttha\" is normalized to \"artha\" and \"dharmma\" to \"dharma\".\n\n### AnusvaraNormalizer\n\nAnusvara is normalized to:\n- `n` before dentals\n- `ṇ` before retroflexes\n- `ñ` before palatals\n- `ṅ` before velars\n- `m` otherwise\n \n### Roman2SlpFilter\n\nTranscodes romanized Sanskrit input into SLP.\n\nFollowing the naming convention used by Peter Scharf, we use \"Roman\" instead of \"IAST\" to show that, on top of supporting the full IAST character set, we support the extra distinctions within Devanagari found in ISO 15919.\nIn this filter, a list of non-Sanskrit and non-Devanagari characters is deleted.\n\nSee [here](src/main/java/io/bdrc/lucene/sa/Roman2SlpFilter.java) for the details.\n\n### Slp2RomanFilter\n\nTranscodes SLP input into IAST.\n\nOutputs fully composed forms (single Unicode codepoints) instead of relying on extra codepoints for diacritics.\n\n### Deva2SlpFilter\n\nTranscodes Devanagari Sanskrit input into SLP.\n\nThis filter also normalizes non-Sanskrit Devanagari characters. 
Ex: क़ =\u003e क\n\n### Lenient Search Mode\n`SanskritAnalyzer` in lenient mode outputs tokens encoded in simplified Sanskrit instead of SLP.\n \nThe following transformations are applied to the IAST transcription:\n - all long vowels become short\n - all aspirated consonants become unaspirated\n - all remaining diacritics are removed\n - all geminates (or consonant + aspirate) become simple consonants\n\nIn the same spirit, the following informal conventions are modified:\n - `sh` (for `ś` or `ṣ`) becomes `s`\n - `v` becomes `b`\n - anusvaras are transformed into their equivalents\n\nIn terms of implementation, the input normalization happening in `Roman2SlpFilter` and `Deva2SlpFilter` is leveraged by always applying them first, then transforming SLP into *lenient Sanskrit*.  \nRelying on `Roman2SlpFilter` has the additional benefit of correctly dealing with capital letters by lower-casing the input.\n\n#### LenientCharFilter\nUsed at query time.\n\nExpects SLP as input.\nApplies the modifications listed above.\n\n#### LenientTokenFilter\nUsed at index time.\n\nExpects IAST as input. (`Slp2RomanFilter` can be used to achieve that)\nApplies the modifications listed above. 
\n\n## Building from source\n\n### Build the lexical resources for the Trie:\n\nThese steps need only be done once for a fresh clone of the repo. Alternatively, simply run the `initialize.sh` script.\n\n - make sure the submodules are initialized (`git submodule init`, then `git submodule update`), first from the root of the repo, then from `resources/sanskrit-stemming-data`\n - build lexical resources for the main trie: `cd resources/sanskrit-stemming-data/sandhify/ \u0026\u0026 python3 sandhifier.py`\n - build sandhi test tries: `cd resources/sanskrit-stemming-data/sandhify/ \u0026\u0026 python3 generate_test_tries.py`\n     if you encounter a `ModuleNotFoundError: No module named 'click'` you may need to `python3 -m pip install click`\n - update other test tries with lexical resources: `cd src/test/resources/tries \u0026\u0026 python3 update_tries.py`\n - compile the main trie: `mvn exec:java -Dexec.mainClass=\"io.bdrc.lucene.sa.BuildCompiledTrie\"` \n       (takes about 45 minutes on an average laptop). 
This step generally needs to be run only once \n       unless there are changes to the lexical resources for the main trie.\n       If this step is run initially then it is sufficient to use the second base command \n       line form below.\n\nThe base command line to build a jar is either:\n\n```\nmvn clean compile exec:java package\n```\n\nwhich will build the main trie if it has not been built as indicated above, or:\n\n```\nmvn clean compile package\n```\n\nif the main trie has already been built.\n\nThe following options modify the package step:\n\n- `-DincludeDeps=true` includes `io.bdrc.lucene:stemmer` in the produced jar file\n- `-DperformRelease=true` signs the jar file with gpg\n\nBe aware that only one analyzer jar should include `io.bdrc.lucene:stemmer` when more \nthan one of the BDRC analyzers is used together.\n\n## Acknowledgements\n\n - https://gist.github.com/Akhilesh28/b012159a10a642ed5c34e551db76f236\n - http://sanskritlibrary.org/software/transcodeFile.zip (more specifically roman_slp1.xml)\n - https://en.wikipedia.org/wiki/ISO_15919#Comparison_with_UNRSGN_and_IAST\n - http://unicode.org/charts/PDF/U0900.pdf\n\n## License\n\nThe code is Copyright 2017-2020 Buddhist Digital Resource Center, and is provided under [Apache License 2.0](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbuda-base%2Flucene-sa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbuda-base%2Flucene-sa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbuda-base%2Flucene-sa/lists"}