https://github.com/buda-base/lucene-bo
Lucene analyzer for Tibetan
- Host: GitHub
- URL: https://github.com/buda-base/lucene-bo
- Owner: buda-base
- License: apache-2.0
- Created: 2017-04-10T15:31:10.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2025-10-23T13:24:22.000Z (8 months ago)
- Last Synced: 2025-10-23T15:28:07.033Z (8 months ago)
- Topics: lucene-analyzer, tibetan
- Language: Java
- Homepage:
- Size: 356 KB
- Stars: 12
- Watchers: 7
- Forks: 3
- Open Issues: 21
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
# Lucene Analyzers for Tibetan
A collection of Lucene components (analyzers, tokenizers, and filters) for processing Tibetan text.
---
## Features
- **Transliteration Conversion:** Convert EWTS, DTS, or ALA-LC transliteration to Tibetan Unicode.
- **Unicode Normalization:** Normalize Tibetan Unicode characters for consistent search and analysis.
- **Affixed Particle Removal:** Remove obvious affixed particles at the end of syllables (e.g., `..འི`).
- **Syllable-Based Tokenization:** Tokenize text into syllables, with fallback to stack tokenization for non-standard syllables (useful for Sanskrit).
- **Phonetic Analyzers:** Search using phonetic representations, supporting both Tibetan and Latin-script queries.
---
## Installation
Add the following dependency to your Maven `pom.xml`:
```xml
<dependency>
    <groupId>io.bdrc.lucene</groupId>
    <artifactId>lucene-bo</artifactId>
    <version>2.2.0</version>
</dependency>
```
---
## Components
### TibetanAnalyzer
- Tokenizes input using `TibSyllableTokenizer`.
- Applies `TibAffixedFilter` and `StopFilter` with a predefined stop word list.
### Old Tibetan Normalization
Implements normalization patterns from Faggionato & Garrett (see [reference](https://ep.liu.se/konferensartikel.aspx?series=&issue=168&Article_No=3)), including:
- Gigu normalization
- Dadrag removal
- Medial འ removal in `TibAffixedFilter`
Patterns are based on [Normalize_Old_Tibetan.txt](https://github.com/tibetan-nlp/tibcg3/blob/master/Normalize_Old_Tibetan.txt).
### Lenient Character Normalization
- Normalizes retroflexes (e.g., ཊ → ཏ)
- Normalizes graphical variants (e.g., ཪ → ར)
- Removes achung
- Normalizes gigus
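
The lenient rules above amount to a per-character mapping. The following is a minimal sketch of that idea in plain Java, not the library's implementation: the class name `LenientNormalizerSketch` is hypothetical, only the two mappings given as examples above are included, and it assumes that "achung" refers to U+0F71 (ཱ) and that gigu normalization means folding the reversed gigu U+0F80 (ྀ) into the regular gigu U+0F72 (ི).

```java
import java.util.Map;

/** Hypothetical helper for illustration; not a class from lucene-bo. */
class LenientNormalizerSketch {
    // Illustrative subset of mappings (assumed code points, see lead-in).
    private static final Map<Character, Character> REPLACEMENTS = Map.of(
        '\u0F4A', '\u0F4F',  // U+0F4A retroflex TTA     -> U+0F4F TA
        '\u0F6A', '\u0F62',  // U+0F6A fixed-form RA     -> U+0F62 RA
        '\u0F80', '\u0F72'   // U+0F80 reversed gigu     -> U+0F72 gigu
    );

    // Assumption: "achung" here is the vowel sign AA, U+0F71.
    private static final char ACHUNG = '\u0F71';

    /** Returns the input with achung dropped and the mapped characters replaced. */
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == ACHUNG) {
                continue; // drop achung entirely
            }
            char mapped = REPLACEMENTS.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }
}
```

Applying the same normalization at index time and at query time is what makes a search for ཏ also match text spelled with ཊ.
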
### Tokenizers & Filters
- **TibSyllableTokenizer:** Produces syllable tokens (without tshek). Falls back to stack tokenization for non-standard syllables.
- **TibAffixedFilter:** Removes non-ambiguous affixed particles (e.g., འི, འོ, འིའོ, འམ, འང, འིས), preserving འ when necessary.
- **PaBaFilter:** Normalizes བ and བོ to པ and པོ. Should be used after `TibAffixedFilter`.
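
To show how these three components chain together, here is a simplified sketch in plain Java. It is not the code of `TibSyllableTokenizer`, `TibAffixedFilter`, or `PaBaFilter`: the class `SyllablePipelineSketch` and its methods are hypothetical, tokenization simply splits on tshek, shad, and whitespace (no stack fallback), affix stripping is naive suffix removal that does not restore འ where the real filter would preserve it, and the pa/ba step assumes the normalization applies to whole-syllable tokens.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the tokenizer and filter chain; not lucene-bo code. */
class SyllablePipelineSketch {
    // The affixed particles listed above, longest first so that the
    // four-character particle is tried before its two-character tail.
    private static final String[] AFFIXES = {
        "\u0F60\u0F72\u0F60\u0F7C", // U+0F60 U+0F72 U+0F60 U+0F7C ('i'o)
        "\u0F60\u0F72\u0F66",       // U+0F60 U+0F72 U+0F66 ('is)
        "\u0F60\u0F72",             // U+0F60 U+0F72 ('i)
        "\u0F60\u0F7C",             // U+0F60 U+0F7C ('o)
        "\u0F60\u0F58",             // U+0F60 U+0F58 ('am)
        "\u0F60\u0F44"              // U+0F60 U+0F44 ('ang)
    };

    /** Splits on tshek (U+0F0B, U+0F0C), shad (U+0F0D), and whitespace. */
    static List<String> tokenize(String text) {
        List<String> syllables = new ArrayList<>();
        for (String s : text.split("[\u0F0B\u0F0C\u0F0D\\s]+")) {
            if (!s.isEmpty()) {
                syllables.add(s);
            }
        }
        return syllables;
    }

    /** Removes one trailing affixed particle, but never the whole syllable. */
    static String stripAffix(String syllable) {
        for (String affix : AFFIXES) {
            if (syllable.length() > affix.length() && syllable.endsWith(affix)) {
                return syllable.substring(0, syllable.length() - affix.length());
            }
        }
        return syllable;
    }

    /** Maps the tokens U+0F56 (ba) and U+0F56 U+0F7C (bo) to pa and po. */
    static String paBa(String syllable) {
        if (syllable.equals("\u0F56")) return "\u0F54";
        if (syllable.equals("\u0F56\u0F7C")) return "\u0F54\u0F7C";
        return syllable;
    }

    /** Tokenize, then strip affixes, then normalize pa/ba, in that order. */
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String s : tokenize(text)) {
            out.add(paBa(stripAffix(s)));
        }
        return out;
    }
}
```

The ordering matters in this sketch for the same reason given above: a token such as བའི only becomes the bare བ that the pa/ba step recognizes once its affixed particle has been removed.
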
---
## Phonetic Analyzers
Tibetan spelling is highly opaque: many differently spelled syllables are pronounced the same way, so exact-spelling search misses homophones. This package provides:
- **Index Analyzer:** Converts Tibetan Unicode to an internal phonological notation.
- **Query Analyzer:** Converts Latin-script phonetic queries to the same internal notation.
**Example:**
| Input | Analyzer | Output tokens |
|----------------------|------------------|--------------------|
| སྒམ་པོ་པ། | index | gam, po, pa |
| རྒམ་པོ་པ། | index | gam, po, pa |
| Gampopa | query | gam, po, pa |
**Phonetic System:**
- No voicing distinction (`ka` = `ga`)
- No aspiration distinction (`ka` = `kha`)
- No tone distinction
- Oriented towards Standard Tibetan pronunciation
**Caveats:**
- Some syllables are ambiguous (e.g., བར་ can be `bar` or `war`)
- Pronunciation exceptions are mapped to normalized forms (e.g., "Khandro" → "khadro")
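
The point of having two analyzers is that matching happens in the shared internal notation, not in either script. The toy sketch below illustrates only that idea: the class `PhoneticMatchSketch` is hypothetical, and its lookup tables merely restate the three rows of the example table instead of performing any real phonological conversion.

```java
import java.util.List;
import java.util.Map;

/** Toy illustration only; the tables restate the example table above. */
class PhoneticMatchSketch {
    // Stand-in for the index analyzer: Tibetan Unicode -> phonetic tokens.
    static final Map<String, List<String>> INDEX_SIDE = Map.of(
        // U+0F66 U+0F92 U+0F58 ... (first row of the table)
        "\u0F66\u0F92\u0F58\u0F0B\u0F54\u0F7C\u0F0B\u0F54\u0F0D", List.of("gam", "po", "pa"),
        // U+0F62 U+0F92 U+0F58 ... (second row of the table)
        "\u0F62\u0F92\u0F58\u0F0B\u0F54\u0F7C\u0F0B\u0F54\u0F0D", List.of("gam", "po", "pa")
    );

    // Stand-in for the query analyzer: Latin-script query -> phonetic tokens.
    static final Map<String, List<String>> QUERY_SIDE = Map.of(
        "Gampopa", List.of("gam", "po", "pa")
    );

    /** True when both sides are known and produce the same token sequence. */
    static boolean matches(String indexedText, String latinQuery) {
        List<String> indexed = INDEX_SIDE.get(indexedText);
        List<String> query = QUERY_SIDE.get(latinQuery);
        return indexed != null && indexed.equals(query);
    }
}
```

Both Tibetan spellings match the single Latin query because all three inputs reduce to the same token sequence.
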
---
## Building
To build from source:
1. Initialize submodules:
   ```bash
   git submodule init
   git submodule update
   ```
2. Build the JAR:
   ```bash
   mvn clean compile exec:java package
   ```
**Build Options:**
- `-DincludeDeps=true`: Includes `io.bdrc.lucene:stemmer` and `io.bdrc.ewtsconverter:ewts-converter` in the JAR.
- `-DperformRelease=true`: Signs the JAR with GPG.
---
## Acknowledgements
Based on [lucene-analyzers](https://github.com/tibetan-nlp/lucene-analyzers).
---
## License
Copyright 2017 Buddhist Digital Resource Center
Licensed under the [Apache License 2.0](LICENSE).