Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/janlelis/unicode-emoji
Up-to-date Emoji Regex in Ruby
emoji emoji-unicode hacktoberfest regex ruby sequence unicode unicode-data
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/janlelis/unicode-emoji
- Owner: janlelis
- License: mit
- Created: 2017-04-08T11:54:20.000Z (almost 8 years ago)
- Default Branch: main
- Last Pushed: 2023-10-01T18:28:33.000Z (over 1 year ago)
- Last Synced: 2024-04-14T05:58:21.123Z (10 months ago)
- Topics: emoji, emoji-unicode, hacktoberfest, regex, ruby, sequence, unicode, unicode-data
- Language: Ruby
- Homepage: https://character.construction
- Size: 644 KB
- Stars: 142
- Watchers: 6
- Forks: 14
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: MIT-LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
README
# Unicode::Emoji [![[version]](https://badge.fury.io/rb/unicode-emoji.svg)](https://badge.fury.io/rb/unicode-emoji) [![[ci]](https://github.com/janlelis/unicode-emoji/workflows/Test/badge.svg)](https://github.com/janlelis/unicode-emoji/actions?query=workflow%3ATest)
Provides various sophisticated regular expressions to work with Emoji in strings,
incorporating the latest Unicode / Emoji standards.

Additional features:

- A categorized list of Emoji (RGI: Recommended for General Interchange)
- Retrieve Emoji properties info about specific codepoints (Emoji_Modifier, Emoji_Presentation, etc.)

Emoji version: **16.0** (September 2024)

CLDR version (used for sub-region flags): **46** (October 2024)
## Gemfile
```ruby
gem "unicode-emoji"
```

## Usage - Regex Matching
The gem includes multiple Emoji regexes, which are compiled out of various Emoji Unicode data sources.
```ruby
require "unicode/emoji"

string = "String which contains all types of Emoji sequences:

- Basic Emoji: 😴
- Textual Emoji with Emoji variation (VS16): ▶️
- Emoji with skin tone modifier: 🛌🏽
- Region flag: 🇵🇹
- Sub-Region flag: 🏴󠁧󠁢󠁳󠁣󠁴󠁿
- Keycap sequence: 2️⃣
- Skin tone modifier: 🏻
- Sequence using ZWJ (zero width joiner): 🤾🏽‍♀️

"

string.scan(Unicode::Emoji::REGEX) # => ["😴", "▶️", "🛌🏽", "🇵🇹", "🏴󠁧󠁢󠁳󠁣󠁴󠁿", "2️⃣", "🏻", "🤾🏽‍♀️"]
```

Depending on your exact use case, you can choose between multiple levels of Emoji detection:
### Main Regexes
Regex | Description | Example Matches | Example Non-Matches
------------------------------|-------------|-----------------|--------------------
`Unicode::Emoji::REGEX` | **Use this one if unsure!** Matches (non-textual) Basic Emoji and all kinds of *recommended* Emoji sequences (RGI/FQE) | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `🏻` | `🤾🏽‍♀`, `🏌‍♂️`, `😴︎`, `▶`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`, `1⃣`
`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) Basic Emoji and all kinds of *valid* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `🏻` | `😴︎`, `▶`, `🇵🇵`, `1`, `1⃣`
`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) Basic Emoji and all kinds of *well-formed* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `🇵🇵`, `🏻` | `😴︎`, `▶`, `1`, `1⃣`
`Unicode::Emoji::REGEX_POSSIBLE` | Matches all singleton Emoji, all kinds of Emoji sequences, and even non-Emoji singleton components like digits. Only exception: Unqualified keycap sequences are not matched | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `🇵🇵`, `😴︎`, `▶`, `🏻`, `1` | `1⃣`

#### Include Text Emoji
By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes (except in `REGEX_POSSIBLE`). However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix:
Regex | Description | Example Matches | Example Non-Matches
------------------------------|-------------|-----------------|--------------------
`Unicode::Emoji::REGEX_INCLUDE_TEXT` | `REGEX` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `😴︎`, `▶`, `1⃣`, `🏻` | `🤾🏽‍♀`, `🏌‍♂️`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`
`Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT` | `REGEX_VALID` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `😴︎`, `▶`, `1⃣`, `🏻` | `🇵🇵`, `1`
`Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT` | `REGEX_WELL_FORMED` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `🇵🇵`, `😴︎`, `▶`, `1⃣`, `🏻` | `1`

#### Minimally-qualified and Unqualified Sequences
Regex | Description | Example Matches | Example Non-Matches
------------------------------|-------------|-----------------|--------------------
`Unicode::Emoji::REGEX_INCLUDE_MQE` | Like `REGEX`, but additionally includes Emoji with missing Emoji Presentation Variation Selectors, where the first partial Emoji has all required Variation Selectors | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏻` | `🏌‍♂️`, `😴︎`, `▶`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`, `1⃣`
`Unicode::Emoji::REGEX_INCLUDE_MQE_UQE` | Like `REGEX`, but additionally includes Emoji with missing Emoji Presentation Variation Selectors | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🏻` | `😴︎`, `▶`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`, `1⃣`

[List of MQE and UQE Emoji sequences](https://character.construction/unqualified-emoji)
#### Singleton Regexes
Matches only simple one-codepoint (+ optional variation selector) Emoji:
Regex | Description | Example Matches | Example Non-Matches
------------------------------|-------------|-----------------|--------------------
`Unicode::Emoji::REGEX_BASIC` | Matches (non-textual) Basic Emoji, but no sequences at all | `😴`, `▶️`, `🏻` | `😴︎`, `▶`, `🛌🏽`, `🇵🇹`, `🇵🇵`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `1`
`Unicode::Emoji::REGEX_TEXT` | Matches only textual singleton Emoji | `😴︎`, `▶` | `😴`, `▶️`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `1`

Here is a list of all Emoji that can be matched using the two regexes: [character.construction/emoji-vs-text](https://character.construction/emoji-vs-text). The `REGEX_BASIC` regex also matches [visual Emoji components](https://character.construction/emoji-components) (skin tone modifiers and hair components).
While `REGEX_BASIC` is part of the above regexes, `REGEX_TEXT` is only included in the `*_INCLUDE_TEXT` or `*_UQE` variants.
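
To make the difference between emoji and text presentation concrete, here is a minimal sketch that relies only on Ruby's built-in Unicode properties, not on this gem. The names `TEXT_SELECTOR`, `EMOJI_SELECTOR` and `presentation` are made up for illustration, and the helper only handles a single codepoint with an optional variation selector.

```ruby
# Illustrative sketch using core Ruby only; this is not the gem's own logic.
TEXT_SELECTOR  = "\u{FE0E}" # VS15, requests text presentation
EMOJI_SELECTOR = "\u{FE0F}" # VS16, requests emoji presentation

# Hypothetical helper: classifies one codepoint (plus optional selector)
# as :emoji, :text or :none.
def presentation(string)
  base, selector = string.chars
  return :none  unless base.match?(/\p{Emoji}/)
  return :emoji if selector == EMOJI_SELECTOR
  return :text  if selector == TEXT_SELECTOR

  # Without a selector, the default presentation of the character decides
  base.match?(/\p{Emoji_Presentation}/) ? :emoji : :text
end

presentation("\u{1F634}")         # => :emoji (sleeping face, emoji by default)
presentation("\u{1F634}\u{FE0E}") # => :text  (forced to text presentation)
presentation("\u{25B6}")          # => :text  (play button, text by default)
presentation("\u{25B6}\u{FE0F}")  # => :emoji (forced to emoji presentation)
presentation("A")                 # => :none
```

Note that `\p{Emoji}` is also true for digits, so this helper is broader than `REGEX_TEXT`, which according to the table above does not match `1`.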
### Comparison
1) Fully-qualified RGI Emoji ZWJ sequence
2) Minimally-qualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selectors, but not in the first Emoji character)
3) Unqualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selector, including in the first Emoji character). Unqualified Emoji include all basic Emoji in Text Presentation (see column 11/12).
4) Non-RGI Emoji ZWJ sequence
5) Valid Region made from a pair of Regional Indicators
6) Any Region made from a pair of Regional Indicators
7) RGI Flag Emoji Tag Sequences (England, Scotland, Wales)
8) Valid Flag Emoji Tag Sequences (any known subdivision)
9) Any Emoji Tag Sequences (any tag sequence with any base)
10) Basic Default Emoji Presentation Characters or Text characters with Emoji Presentation Selector
11) Basic Default Text Presentation Characters or Basic Emoji with Text Presentation Selector
12) Non-Emoji (unqualified) keycap

Regex | 1 RGI/FQE | 2 RGI/MQE | 3 RGI/UQE | 4 Non-RGI | 5 Valid Region | 6 Any Region | 7 RGI Tag | 8 Valid Tag | 9 Any Tag | 10 Basic Emoji | 11 Basic Text | 12 Text Keycap
-|-|-|-|-|-|-|-|-|-|-|-|-
REGEX | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ
REGEX INCLUDE TEXT | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ
REGEX INCLUDE MQE | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ
REGEX INCLUDE MQE UQE | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ
REGEX VALID | โ | โ | (โ )ยน | โ | โ | โ | โ | โ | โ | โ | โ | โ
REGEX VALID INCLUDE TEXT | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ
REGEX WELL FORMED | โ | โ | (โ )ยน | โ | โ | โ | โ | โ | โ | โ | โ | โ
REGEX WELL FORMED INCLUDE TEXT | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ
REGEX POSSIBLE | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ
REGEX BASIC | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ
REGEX TEXT | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ

¹ Matches all unqualified Emoji, except for textual singleton Emoji (see columns 11, 12)
See [spec files](/spec) for detailed examples about which regex matches which kind of Emoji.
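
Categories 1 to 3 differ only in where the Emoji Presentation Selector (U+FE0F) is missing. The following sketch, which uses plain string manipulation and no gem code, derives a minimally-qualified and an unqualified form from fully-qualified sequences. The variable names and the `hex` helper are hypothetical; the handball escape is the one used elsewhere in this README, and the golfing sequence is an additional illustration.

```ruby
# Illustrative only: plain string manipulation, no gem required.
EMOJI_VS = "\u{FE0F}" # Emoji Presentation Selector (VS16)

# Fully-qualified: woman playing handball, medium skin tone
fqe = "\u{1F93E 1F3FD 200D 2640 FE0F}"
# Dropping the trailing selector leaves the first character intact,
# which gives the minimally-qualified form
mqe = fqe.delete_suffix(EMOJI_VS)

# Fully-qualified: man golfing. The first character (U+1F3CC) has text
# presentation by default, so it needs a selector of its own.
golf_fqe = "\u{1F3CC FE0F 200D 2642 FE0F}"
# Removing the selector after the first character gives an unqualified form
golf_uqe = golf_fqe.sub(EMOJI_VS, "")

# Hypothetical helper: space-separated hex codepoints
def hex(string)
  string.codepoints.map { |cp| format("%04X", cp) }.join(" ")
end

puts hex(mqe)      # 1F93E 1F3FD 200D 2640
puts hex(golf_uqe) # 1F3CC 200D 2642 FE0F
```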
### Picking the Right Emoji Regex
- Usually you just want `REGEX` (recommended Emoji set, RGI)
- Use `REGEX_INCLUDE_MQE` or `REGEX_INCLUDE_MQE_UQE` if you want to catch Emoji sequences with missing Variation Selectors.
- If you want broader matching (any ZWJ sequences, more sub-region flags), choose `REGEX_VALID`
- If you need to match any region flag and any tag sequence, choose `REGEX_WELL_FORMED`
- Use the `_INCLUDE_TEXT` suffix with any of the above base regexes, if you want to also match basic textual Emoji
- And finally, there is also the option to use `REGEX_POSSIBLE`, which is a simplified test for possible Emoji, comparable to `REGEX_WELL_FORMED*`. It might produce false positives; however, the regex is less complex and is [suggested in the Unicode standard itself](https://www.unicode.org/reports/tr51/#EBNF_and_Regex) as a first check.

### Examples
Desc | Emoji | Escaped | `REGEX` (RGI/FQE) | `REGEX_INCLUDE_MQE` (RGI/MQE) | `REGEX_VALID` | `REGEX_WELL_FORMED` / `REGEX_POSSIBLE`
-----|-------|---------|---------------|-----------------------|-----------------------------------|-----------------
RGI ZWJ Sequence | 🤾🏽‍♀️ | `\u{1F93E 1F3FD 200D 2640 FE0F}` | ✅ | ✅ | ✅ | ✅
RGI ZWJ Sequence MQE | 🤾🏽‍♀ | `\u{1F93E 1F3FD 200D 2640}` | ❌ | ✅ | ✅ | ✅
Valid ZWJ Sequence, Non-RGI | 🤠‍🤢 | `\u{1F920 200D 1F922}` | ❌ | ❌ | ✅ | ✅
Known Region | 🇵🇹 | `\u{1F1F5 1F1F9}` | ✅ | ✅ | ✅ | ✅
Unknown Region | 🇵🇵 | `\u{1F1F5 1F1F5}` | ❌ | ❌ | ❌ | ✅
RGI Tag Sequence | 🏴󠁧󠁢󠁳󠁣󠁴󠁿 | `\u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F}` | ✅ | ✅ | ✅ | ✅
Valid Tag Sequence | 🏴󠁧󠁢󠁡󠁧󠁢󠁿 | `\u{1F3F4 E0067 E0062 E0061 E0067 E0062 E007F}` | ❌ | ❌ | ✅ | ✅
Well-formed Tag Sequence | 😴󠁧󠁢󠁡󠁡󠁡󠁿 | `\u{1F634 E0067 E0062 E0061 E0061 E0061 E007F}` | ❌ | ❌ | ❌ | ✅

Please see [the standard](https://www.unicode.org/reports/tr51/#Emoji_Sets) for more details, examples, explanations.
More info about valid vs. recommended Emoji can also be found in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).
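
The region and tag sequences in the table above follow a simple encoding, which the following sketch reproduces with core Ruby only. The helper names `region_flag` and `subdivision_flag` and the constants are hypothetical and not part of this gem. Being able to construct a sequence says nothing about whether it is valid or RGI; that is what the different regexes decide.

```ruby
# Illustrative helpers (hypothetical names), core Ruby only.
REGIONAL_INDICATOR_OFFSET = 0x1F1E6 - "A".ord # U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A
TAG_OFFSET = 0xE0000 # tag characters mirror ASCII at U+E0000 + code
CANCEL_TAG = 0xE007F # terminates a tag sequence
BLACK_FLAG = 0x1F3F4 # base character of the flag tag sequences

# "PT" -> pair of regional indicators. Any two letters give a
# well-formed pair, whether or not the region code exists.
def region_flag(code)
  code.upcase.each_char.map { |c| c.ord + REGIONAL_INDICATOR_OFFSET }.pack("U*")
end

# "gbsct" -> black flag + tag characters + cancel tag
def subdivision_flag(code)
  tags = code.downcase.each_char.map { |c| c.ord + TAG_OFFSET }
  [BLACK_FLAG, *tags, CANCEL_TAG].pack("U*")
end

region_flag("PT") == "\u{1F1F5 1F1F9}" # => true (known region)
region_flag("PP") == "\u{1F1F5 1F1F5}" # => true (unknown region)
subdivision_flag("gbsct") == "\u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F}" # => true
```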
### Emoji Property Regexes
Ruby includes native regex Emoji properties, as listed in the following table. You can also opt-in to use the `*_PROP_*` regexes to get the Emoji support level of this gem (instead of Ruby's).
Gem Regex (`Unicode::Emoji`'s Emoji support level) | Native Regex (Ruby's Emoji support level)
---------------------------------------------------|------------------------------------------
`Unicode::Emoji::REGEX_PROP_EMOJI` | `/\p{Emoji}/`
`Unicode::Emoji::REGEX_PROP_MODIFIER` | `/\p{EMod}/`
`Unicode::Emoji::REGEX_PROP_MODIFIER_BASE` | `/\p{EBase}/`
`Unicode::Emoji::REGEX_PROP_COMPONENT` | `/\p{EComp}/`
`Unicode::Emoji::REGEX_PROP_PRESENTATION` | `/\p{EPres}/`

#### Extended Pictographic Regex
`Unicode::Emoji::REGEX_PICTO` matches single codepoints with the **Extended_Pictographic** property. For example, it will match `✀` BLACK SAFETY SCISSORS.
`Unicode::Emoji::REGEX_PICTO_NO_EMOJI` matches single codepoints with the **Extended_Pictographic** property, but excludes Emoji characters.
See [character.construction/picto](https://character.construction/picto) for a list of all non-Emoji pictographic characters.
## Usage - List
Use `Unicode::Emoji::LIST` or the **list** method to get an ordered and categorized list of Emoji:
```ruby
Unicode::Emoji.list.keys
# => ["Smileys & Emotion", "People & Body", "Component", "Animals & Nature", "Food & Drink", "Travel & Places", "Activities", "Objects", "Symbols", "Flags"]

Unicode::Emoji.list("Food & Drink").keys
# => ["food-fruit", "food-vegetable", "food-prepared", "food-asian", "food-marine", "food-sweet", "drink", "dishware"]

Unicode::Emoji.list("Food & Drink", "food-asian")
# => ["🍱", "🍘", "🍙", "🍚", "🍛", "🍜", "🍝", "🍠", "🍢", "🍣", "🍤", "🍥", "🥮", "🍡", "🥟", "🥠", "🥡"]
```

Please note that categories might change with future versions of the Emoji standard, although this has not happened often.
A list of all Emoji (generated from this gem) can be found at [character.construction/emoji](https://character.construction/emoji).
## Usage - Properties Data
Allows you to access the codepoint data for a single character from Unicode's [emoji-data.txt](https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt) file:
```ruby
require "unicode/emoji"

Unicode::Emoji.properties "☝" # => ["Emoji", "Emoji_Modifier_Base"]
```

## Also See
- [Unicode® Technical Standard #51](https://www.unicode.org/reports/tr51/)
- [Emoji categories](https://unicode.org/emoji/charts/emoji-ordering.html)
- Ruby gem which displays [Emoji sequence names](https://github.com/janlelis/unicode-sequence_name) ([as website](https://character.construction/name))
- Part of [unicode-x](https://github.com/janlelis/unicode-x)

## MIT
- Copyright (C) 2017-2024 Jan Lelis. Released under the MIT license.
- Unicode data: https://www.unicode.org/copyright.html#Exhibit1