https://github.com/sigpwned/uax29
Java implementation of UAX#29 text segmentation algorithm
- Host: GitHub
- URL: https://github.com/sigpwned/uax29
- Owner: sigpwned
- License: apache-2.0
- Created: 2022-08-21T22:51:07.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2024-01-17T16:58:53.000Z (over 2 years ago)
- Last Synced: 2025-01-20T14:53:17.835Z (over 1 year ago)
- Topics: java, text-segmentation, uax29, unicode
- Language: Java
- Size: 359 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# UAX29
A Java implementation of the [UAX #29 text segmentation algorithm](https://unicode.org/reports/tr29/), plus token types for URLs, emoji, email addresses, hashtags, cashtags, and mentions.
# Usage
The tokenizer produces the following token types:
* `ALPHANUM` -- A sequence of alphabetic and numeric characters, e.g., hello, test123
* `NUM` -- A number, e.g., 123
* `SOUTHEAST_ASIAN` -- A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
* `IDEOGRAPHIC` -- A single CJKV ideographic character
* `HIRAGANA` -- A single hiragana character
* `KATAKANA` -- A sequence of katakana characters
* `HANGUL` -- A sequence of Hangul characters
* `URL` -- A URL, e.g., https://www.example.com/
* `EMAIL` -- An email address or mailto link, e.g., info@example.com
* `EMOJI` -- A sequence of Emoji characters, e.g., 🙂
* `HASHTAG` -- A social media hashtag, e.g., #hashtag
* `CASHTAG` -- A social media cashtag, e.g., $CASH
* `MENTION` -- A social media mention, e.g., @twitter
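
To make the social-media and numeric categories above more concrete, here is a minimal, self-contained sketch that sorts already-split strings into a few of them using plain regular expressions. Everything in it is an assumption made for illustration: the `TokenKindSketch` class, its `Kind` enum, and the patterns are hypothetical and are not part of this library, whose tokenizer follows the full UAX #29 rules and will differ in edge cases.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration only; NOT part of the uax29 library. */
public class TokenKindSketch {
    /** A subset of the token type names listed above, plus OTHER for anything this sketch ignores. */
    public enum Kind { NUM, ALPHANUM, EMAIL, HASHTAG, CASHTAG, MENTION, OTHER }

    // Deliberately simplified patterns (assumptions, not the library's grammar).
    private static final Pattern NUM = Pattern.compile("\\p{Nd}+");
    private static final Pattern ALPHANUM = Pattern.compile("[\\p{L}\\p{Nd}]+");
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");
    private static final Pattern HASHTAG = Pattern.compile("#[\\p{L}\\p{Nd}_]+");
    private static final Pattern CASHTAG = Pattern.compile("\\$\\p{L}+");
    private static final Pattern MENTION = Pattern.compile("@[\\p{L}\\p{Nd}_]+");

    /** Returns the first kind whose pattern matches the whole string. */
    public static Kind classify(String s) {
        // NUM is checked before ALPHANUM because every all-digit string also matches ALPHANUM.
        if (NUM.matcher(s).matches()) return Kind.NUM;
        if (ALPHANUM.matcher(s).matches()) return Kind.ALPHANUM;
        if (EMAIL.matcher(s).matches()) return Kind.EMAIL;
        if (HASHTAG.matcher(s).matches()) return Kind.HASHTAG;
        if (CASHTAG.matcher(s).matches()) return Kind.CASHTAG;
        if (MENTION.matcher(s).matches()) return Kind.MENTION;
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        // The sample strings are the examples given in the list above.
        String[] samples = {"hello", "test123", "123", "info@example.com", "#hashtag", "$CASH", "@twitter"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + classify(sample));
        }
    }
}
```

The order of the checks matters in this sketch: a more specific pattern has to be tried before a more general one that would also accept the same string.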
To process text into tokens, use code like the following:

    try (UAX29URLEmailTokenizer tokenizer=new UAX29URLEmailTokenizer("example text")) {
        for(Token token=tokenizer.nextToken();token!=null;token=tokenizer.nextToken(token)) {
            // Process the token here
        }
    }