
https://github.com/hitchhicker/tweet_nlp_toolkit



![ci](https://github.com/hitchhicker/tweet_nlp_toolkit/actions/workflows/makefile.yml/badge.svg)

# tweet_nlp_toolkit
A Python toolkit for parsing and preprocessing tweets.

It can handle:
- mentions
- hashtags
- emojis
- emoticons
- emails
- HTML entities
- digits
- URLs
- punctuation
- custom words to filter

## Installation
```
python3 -m venv .env
source .env/bin/activate
python -m pip install -U pip
pip install tweet_nlp_toolkit
```
## Usage
### Text Parsing
```python
>>> from tweet_nlp_toolkit import parse_text
>>> text = parse_text("123 @hello #world www.url.com 😰 :) user@example.com")
>>> text.tokens
['123', '@hello', '#world', 'www.url.com', '😰', ':)', 'user@example.com']
>>> text.hashtags
['#world']
>>> text.mentions
['@hello']
>>> text.urls
['www.url.com']
>>> text.emojis
['😰']
>>> text.emoticons
[':)']
>>> text.digits
['123']
>>> text.emails
['user@example.com']
```
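
To make the idea of sorting tokens into categories concrete, here is a minimal standard-library sketch. It is not the library's implementation: the regular expressions and the helper names `classify` and `group_tokens` are assumptions made for illustration, and the real tokenizer likely covers many more cases (emojis, emoticons, punctuation).

```python
import re

# Illustrative patterns only; these are assumptions, not the library's rules.
PATTERNS = {
    "emails": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "urls": re.compile(r"^(https?://|www\.)\S+$"),
    "mentions": re.compile(r"^@\w+$"),
    "hashtags": re.compile(r"^#\w+$"),
    "digits": re.compile(r"^\d+$"),
}


def classify(token: str) -> str:
    """Return the first category whose pattern matches, else 'word'."""
    for name, pattern in PATTERNS.items():
        if pattern.match(token):
            return name
    return "word"


def group_tokens(text: str) -> dict:
    """Split on whitespace and bucket the tokens by category."""
    groups = {}
    for token in text.split():
        groups.setdefault(classify(token), []).append(token)
    return groups


groups = group_tokens("123 @hello #world www.url.com user@example.com")
print(groups["hashtags"])  # ['#world']
print(groups["emails"])    # ['user@example.com']
```

Checking the email pattern before the mention pattern is a deliberate choice in this sketch, so that an address is never mistaken for a handle.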
### Tagging entities
```python
>>> from tweet_nlp_toolkit import parse_text
>>> parse_text(
...     "123 @hello #world www.url.com 😰 :) user@example.com",
...     emojis="tag",
...     hashtags="tag",
...     mentions="tag",
... ).tokens
['123', '', '', 'www.url.com', '', ':)', 'user@example.com']
```
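
The per-entity actions (`None`, `"remove"`, `"tag"`) can be pictured with a small standalone sketch. The helper `apply_action` and the tag strings `<hashtag>` and `<mention>` are hypothetical placeholders chosen for this example; the tags the library actually emits may differ.

```python
from typing import List, Optional


def apply_action(tokens: List[str], prefix: str, action: Optional[str], tag: str) -> List[str]:
    """Apply 'remove' or 'tag' to tokens that start with `prefix`.

    An action of None leaves every token untouched.
    """
    if action not in (None, "remove", "tag"):
        raise ValueError(f"unsupported action: {action!r}")
    result = []
    for token in tokens:
        is_entity = token.startswith(prefix) and len(token) > len(prefix)
        if not is_entity or action is None:
            result.append(token)
        elif action == "tag":
            result.append(tag)  # hypothetical tag string
        # action == "remove": the token is dropped
    return result


tokens = ["123", "@hello", "#world", ":)"]
tagged = apply_action(tokens, "#", "tag", "<hashtag>")
removed = apply_action(tokens, "@", "remove", "<mention>")
print(tagged)   # ['123', '@hello', '<hashtag>', ':)']
print(removed)  # ['123', '#world', ':)']
```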

### Preprocessing
```python
>>> from tweet_nlp_toolkit import prep
>>> prep(
...     "123 @hello #world www.url.com 😰 :) user@example.com",
...     emojis="demojize",
...     mentions="remove",
...     hashtags="remove",
...     urls="remove",
...     digits="tag",
...     emails="remove",
... )
' :anxious_face_with_sweat: :)'
```

```python
>>> from tweet_nlp_toolkit import prep_file
>>> prep_file("input.txt", "output.txt")
```
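
As a rough picture of file-level preprocessing, the sketch below applies a transform to each line of an input file and writes the results to an output file. Whether `prep_file` works line by line is an assumption; `preprocess_file` and its `transform` argument are hypothetical names, and `str.lower` merely stands in for a real preprocessing function.

```python
import os
import tempfile
from typing import Callable


def preprocess_file(src: str, dst: str, transform: Callable[[str], str],
                    encoding: str = "utf-8") -> int:
    """Apply `transform` to each line of `src`, write to `dst`, return the line count."""
    count = 0
    with open(src, encoding=encoding) as fin, open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(transform(line.rstrip("\n")) + "\n")
            count += 1
    return count


# Demo in a temporary directory, with lowercasing as a stand-in transform.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "output.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("Hello World\nSECOND Tweet\n")
    n_lines = preprocess_file(src, dst, str.lower)
    with open(dst, encoding="utf-8") as f:
        output = f.read()

print(n_lines)  # 2
```

Streaming one line at a time keeps memory use flat for large tweet dumps, which is why the sketch avoids reading the whole file at once.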
### More
`parse_text`, `prep` and `prep_file` share the same parameters. `parse_text` returns an instance of `ParsedText`,
`prep` returns the preprocessed string, and `prep_file` preprocesses an input file and writes the result to an output file.
```
Parameters
----------
text: str
    The text to preprocess.
tokenizer: Callable[[str], List[Token]]
    The tokenizer to use.
encoding: str
    The encoding of the text.
    Default "utf-8"
remove_unencodable_char: bool
    A character that causes an encoding error is replaced with '�'.
    If True, every '�' is removed; otherwise each sequence of '�' is replaced by a single one.
    Default False
to_lower: bool
    Whether to convert the text to lowercase.
    Default True
strip_accents: bool
    Whether to remove accents from Latin characters.
    Default False
reduce_len: bool
    Whether to remove repeated character sequences.
    Default False
filters: set
    Tokens to filter out (case sensitive).
    Default None
emojis: Optional[str]
    How to handle emojis.
    Options:
        - "remove": remove all emojis
        - "tag": replace each emoji with a tag
        - "demojize": replace each emoji with its textual representation, e.g. :musical_keyboard:
          (list of emojis: https://www.webfx.com/tools/emoji-cheat-sheet/)
        - "emojize": replace each emoji with its Unicode representation, e.g. 😰
    Default None
hashtags: Optional[str]
    How to handle hashtags.
    Options:
        - "remove": remove all hashtags
        - "tag": replace each hashtag with a tag
    Default None
urls: Optional[str]
    How to handle URLs.
    Options:
        - "remove": remove all URLs
        - "tag": replace each URL with a tag
    Default None
mentions: Optional[str]
    How to handle mentions.
    Options:
        - "remove": remove all mentions
        - "tag": replace each mention with a tag
    Default None
digits: Optional[str]
    How to handle digits.
    Options:
        - "remove": remove all digits
        - "tag": replace each digit token with a tag
    Default None
emoticons: Optional[str]
    How to handle emoticons.
    Options:
        - "remove": remove all emoticons
        - "tag": replace each emoticon with a tag
    Default None
puncts: Optional[str]
    How to handle punctuation.
    Options:
        - "remove": remove all punctuation
        - "tag": replace each punctuation token with a tag
    Default None
emails: Optional[str]
    How to handle emails.
    Options:
        - "remove": remove all emails
        - "tag": replace each email with a tag
    Default None
html_tags: Optional[str]
    How to handle HTML tags.
    Options:
        - "remove": remove all HTML tags
    Default None
stop_words: Optional[str]
    How to handle stop words. Only English stop words are supported.
    Options:
        - "remove": remove all stop words
    Default None
```
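
The three normalization flags `remove_unencodable_char`, `strip_accents` and `reduce_len` can be approximated with the standard library, as sketched below. This is not the library's code: the function bodies are assumptions, and capping repeated characters at three is borrowed from the `reduce_len` option of NLTK's `TweetTokenizer` rather than confirmed for this toolkit.

```python
import re
import unicodedata

REPLACEMENT = "\ufffd"  # the '�' character produced by errors="replace"


def decode_text(raw: bytes, encoding: str = "utf-8",
                remove_unencodable_char: bool = False) -> str:
    """Decode bytes; bad bytes become '�', which is then removed or collapsed."""
    text = raw.decode(encoding, errors="replace")
    if remove_unencodable_char:
        return text.replace(REPLACEMENT, "")
    # Otherwise collapse each run of '�' into a single one.
    return re.sub(REPLACEMENT + "+", REPLACEMENT, text)


def strip_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def reduce_len(text: str) -> str:
    """Cap runs of the same character at three (assumed behavior)."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


raw = b"caf\xff\xfe ok"  # two bytes that are invalid in UTF-8
collapsed = decode_text(raw)
removed = decode_text(raw, remove_unencodable_char=True)
print(removed)                 # caf ok
print(strip_accents("café"))   # cafe
print(reduce_len("sooooo"))    # sooo
```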