https://github.com/amake/script_detector_2
A simple utility for determining whether a string is Japanese, Simplified Chinese, Traditional Chinese, or Korean
https://github.com/amake/script_detector_2
cjk detector ruby ruby-gem script
Last synced: 11 months ago
JSON representation
A simple utility for determining whether a string is Japanese, Simplified Chinese, Traditional Chinese, or Korean
- Host: GitHub
- URL: https://github.com/amake/script_detector_2
- Owner: amake
- License: mit
- Created: 2021-08-22T12:47:22.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2023-02-27T12:02:17.000Z (almost 3 years ago)
- Last Synced: 2025-02-23T16:52:28.816Z (12 months ago)
- Topics: cjk, detector, ruby, ruby-gem, script
- Language: Ruby
- Homepage: https://rubygems.org/gems/script_detector_2
- Size: 137 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# ScriptDetector2
A simple utility for determining whether a string is Japanese, Simplified
Chinese, Traditional Chinese, or Korean. It is intended to be a more accurate
and more performant alternative to the [script_detector
gem](https://rubygems.org/gems/script_detector).
Unlike the original script_detector, this gem:
- Is optimized to reduce temporary garbage in favor of some constant memory
usage
- Uses the
[kUnihanCore2020](https://www.unicode.org/reports/tr38/#kUnihanCore2020)
property of the Unicode Unihan database to determine which characters belong
to which script (Unicode 14)
([details](http://www.unicode.org/L2/L2019/19388-unihan-core-2020.pdf))
- Uses [ISO 15924 script names](https://en.wikipedia.org/wiki/ISO_15924) in
symbol form as return values (instead of English strings)
## Installation
Add this line to your application's Gemfile:
```ruby
gem 'script_detector_2'
```
And then execute:
$ bundle install
Or install it yourself as:
$ gem install script_detector_2
## Usage
The main detection methods are:
- `ScriptDetector2.japanese?`
- `ScriptDetector2.chinese?`
- `ScriptDetector2.simplified_chinese?`
- `ScriptDetector2.traditional_chinese?`
- `ScriptDetector2.identify_script`
- `ScriptDetector2.identify_scripts`
Regexp patterns are used to identify the script to which Han characters belong.
These can be used directly as well:
- `ScriptDetector2::JAPANESE_PATTERN`: matches all Han characters in the
kUnihanCore2020 set marked as Japanese (J)
- `ScriptDetector2::SIMPLIFIED_CHINESE_PATTERN`: matches all Han characters in
the kUnihanCore2020 set marked as PRC (G)
- `ScriptDetector2::TRADITIONAL_CHINESE_PATTERN`: matches all Han characters in
the kUnihanCore2020 set marked as Hong Kong (H), Macau (M), or ROC (T)
- `ScriptDetector2::KOREAN_PATTERN`: matches all Han characters in the
kUnihanCore2020 set marked as ROK (K) or DPRK (P)
Each of the above patterns matches an entire string containing only Han
characters of the indicated script, i.e.
```ruby
ScriptDetector2::JAPANESE_PATTERN.match?('日本語') # => true
ScriptDetector2::JAPANESE_PATTERN.match?('你好') # => false
ScriptDetector2::JAPANESE_PATTERN.match?('Hello 日本語') # => false
```
To recreate the script_detector gem's extension of the String class, use the
supplied refinement like so:
```ruby
using ScriptDetector2::StringUtil
```
Then you can do:
```ruby
'こんにちは、世界!'.japanese? # => true
```
## Development
After checking out the repo, run `bin/setup` to install dependencies. Then, run
`rake test` to run the tests. You can also run `bin/console` for an interactive
prompt that will allow you to experiment.
To install this gem onto your local machine, run `bundle exec rake install`. To
release a new version, update the version number in `version.rb`, and then run
`bundle exec rake release`, which will create a git tag for the version, push
git commits and the created tag, and push the `.gem` file to
[rubygems.org](https://rubygems.org).
## Contributing
Bug reports and pull requests are welcome on GitHub at
https://github.com/amake/script_detector_2.
## License
The gem is available as open source under the terms of the [MIT
License](https://opensource.org/licenses/MIT).