Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bsolomon1124/demoji
Accurately find/replace/remove emojis in text strings
emojis python python3 unicode
- Host: GitHub
- URL: https://github.com/bsolomon1124/demoji
- Owner: bsolomon1124
- License: apache-2.0
- Created: 2019-02-07T20:50:30.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2023-12-16T11:50:34.000Z (11 months ago)
- Last Synced: 2024-10-22T23:26:57.361Z (17 days ago)
- Topics: emojis, python, python3, unicode
- Language: Python
- Homepage: https://pypi.org/project/demoji/
- Size: 80.1 KB
- Stars: 158
- Watchers: 4
- Forks: 20
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
- awesome-starred - bsolomon1124/demoji - Accurately find/replace/remove emojis in text strings (python3)
README
# demoji
Accurately find or remove [emojis](https://en.wikipedia.org/wiki/Emoji) from a blob of text using
data from the Unicode Consortium's [emoji code repository](https://unicode.org/Public/emoji/).

[![License](https://img.shields.io/github/license/bsolomon1124/demoji.svg)](https://github.com/bsolomon1124/demoji/blob/master/LICENSE)
[![PyPI](https://img.shields.io/pypi/v/demoji.svg)](https://pypi.org/project/demoji/)
[![Status](https://img.shields.io/pypi/status/demoji.svg)](https://pypi.org/project/demoji/)
[![Python](https://img.shields.io/pypi/pyversions/demoji.svg)](https://pypi.org/project/demoji)

-------
## Major Changes in Version 1.x
Version 1.x of `demoji` bundles Unicode data in the package at install time rather than requiring
a download of the codes from unicode.org at runtime. Please see the [CHANGELOG.md](CHANGELOG.md)
for details, and familiarize yourself with the changes before updating from 0.x to 1.x.

To report any regressions, please [open a GitHub issue](https://github.com/bsolomon1124/demoji/issues/new?assignees=&labels=&template=bug_report.md&title=).
## Basic Usage
`demoji` exports several text-related functions for find-and-replace functionality with emojis:
```python
>>> import demoji
>>> tweet = """\
... #startspreadingthenews yankees win great start by 🎅🏾 going 5strong innings with 5k’s🔥 🐂
... solo homerun 🌋🌋 with 2 solo homeruns and👹 3run homerun… 🤡 🚣🏼 👨🏽‍⚖️ with rbi’s … 🔥🔥
... 🇲🇽 and 🇳🇮 to close the game🔥🔥!!!….
... WHAT A GAME!!..
... """
>>> demoji.findall(tweet)
{
"🔥": "fire",
"🌋": "volcano",
"👨🏽\u200d⚖️": "man judge: medium skin tone",
"🎅🏾": "Santa Claus: medium-dark skin tone",
"🇲🇽": "flag: Mexico",
"👹": "ogre",
"🤡": "clown face",
"🇳🇮": "flag: Nicaragua",
"🚣🏼": "person rowing boat: medium-light skin tone",
"🐂": "ox",
}
```

See [below](#reference) for function API.
## Command-line Use
You can use `demoji` or `python -m demoji` to replace emojis
in file(s) or stdin with their `:code:` equivalents:

```bash
$ cat out.txt
All done! ✨ 🍰 ✨

$ demoji out.txt
All done! :sparkles: :shortcake: :sparkles:

$ echo 'All done! ✨ 🍰 ✨' | demoji
All done! :sparkles: :shortcake: :sparkles:

$ demoji -
we didnt start the 🔥
we didnt start the :fire:
```

## Reference
```python
findall(string: str) -> Dict[str, str]
```

Find emojis within `string`. Return a mapping of `{emoji: description}`.

```python
findall_list(string: str, desc: bool = True) -> List[str]
```

Find emojis within `string`. Return a list (with possible duplicates).
If `desc` is True, the list contains description codes. If `desc` is False, the list contains emojis.

```python
replace(string: str, repl: str = "") -> str
```

Replace emojis in `string` with `repl`.

```python
replace_with_desc(string: str, sep: str = ":") -> str
```

Replace emojis in `string` with their description codes. The codes are surrounded by `sep`.

```python
last_downloaded_timestamp() -> datetime.datetime
```

Show the timestamp of last download for the emoji data bundled with the package.
## Footnote: Emoji Sequences
Numerous emojis that look like single Unicode characters are actually multi-character sequences. Examples:
- The keycap 2️⃣ is actually 3 characters: U+0032 (the ASCII digit 2), U+FE0F (variation selector), and U+20E3 (combining enclosing keycap).
- The flag of Scotland is 7 component characters, `b'\\U0001f3f4\\U000e0067\\U000e0062\\U000e0073\\U000e0063\\U000e0074\\U000e007f'` in full escaped notation.

(You can see any of these through `s.encode("unicode-escape")`.)
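The two examples in the list can be checked with the standard library alone. This is a small sketch that does not use `demoji`; the helper `code_points` is a hypothetical name introduced here for illustration.

```python
# The keycap and the flag of Scotland, written as explicit escapes.
keycap = "2\ufe0f\u20e3"
scotland = "\U0001f3f4\U000e0067\U000e0062\U000e0073\U000e0063\U000e0074\U000e007f"


def code_points(s):
    """Return the code points of s in U+XXXX notation (illustrative helper)."""
    return ["U+{:04X}".format(ord(ch)) for ch in s]


print(len(keycap), code_points(keycap))    # 3 ['U+0032', 'U+FE0F', 'U+20E3']
print(len(scotland))                       # 7
print(scotland.encode("unicode-escape"))   # the escaped bytes quoted in the list
```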
`demoji` is careful to handle this and should find the full sequences rather than their incomplete subcomponents.
The way it does this is to sort emoji codes by their length, and then compile a concatenated regular expression that will greedily search for longer emojis first, falling back to shorter ones if not found. This is not by any means a super-optimized way of searching, as it has O(N^2) properties, but the focus is on accuracy and completeness.
```python
>>> from pprint import pprint
>>> seq = """\
... I bet you didn't know that 🙋, 🙋‍♂️, and 🙋‍♀️ are three different emojis.
... """
>>> pprint(seq.encode('unicode-escape')) # Python 3
(b"I bet you didn't know that \\U0001f64b, \\U0001f64b\\u200d\\u2642\\ufe0f,"
b' and \\U0001f64b\\u200d\\u2640\\ufe0f are three different emojis.\\n')
```
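The longest-first strategy described above can be sketched with the standard library. This is an illustration of the technique under stated assumptions, not `demoji`'s actual source: `EMOJI_TABLE` is a hypothetical three-entry table standing in for the full Unicode data, and the function names are invented for this example.

```python
import re

# Hypothetical miniature table; the real package bundles the full Unicode list.
EMOJI_TABLE = {
    "\U0001f64b": "person raising hand",
    "\U0001f64b\u200d\u2642\ufe0f": "man raising hand",
    "\U0001f64b\u200d\u2640\ufe0f": "woman raising hand",
}


def compile_pattern(sequences, longest_first=True):
    """Join the sequences into one alternation, ordered by length."""
    ordered = sorted(sequences, key=len, reverse=longest_first)
    return re.compile("|".join(re.escape(seq) for seq in ordered))


def find_descriptions(text, pattern):
    """Look up the description of every match, in order of appearance."""
    return [EMOJI_TABLE[match] for match in pattern.findall(text)]


sample = "\U0001f64b, \U0001f64b\u200d\u2642\ufe0f, and \U0001f64b\u200d\u2640\ufe0f"

# Longest first: each full sequence is tried before its shorter prefix.
good = find_descriptions(sample, compile_pattern(EMOJI_TABLE))
print(good)

# Shortest first: the bare base emoji wins every time and the rest is missed.
naive = find_descriptions(sample, compile_pattern(EMOJI_TABLE, longest_first=False))
print(naive)
```

Python's `re` alternation takes the first branch that matches at a position, so the ordering of the branches is what makes the search prefer full sequences over their subcomponents.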