https://github.com/hitchhicker/tweet_nlp_toolkit

Tweet NLP toolkit

Host: GitHub
URL: https://github.com/hitchhicker/tweet_nlp_toolkit
Owner: hitchhicker
License: apache-2.0
Created: 2022-02-26T15:33:39.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2023-04-26T22:53:56.000Z (about 2 years ago)
Last Synced: 2025-06-04T15:43:01.687Z (about 1 month ago)
Topics: nlp, nlp-library, preprocessing, python, social-media, toolkits, twitter
Language: Python
Homepage:
Size: 62.5 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE


![ci](https://github.com/hitchhicker/tweet_nlp_toolkit/actions/workflows/makefile.yml/badge.svg)

# tweet_nlp_toolkit

A toolkit for parsing and preprocessing tweets.

It can handle:

 - mentions
 - hashtags
 - emojis
 - emoticons
 - emails
 - HTML entities
 - digits
 - URLs
 - punctuation
 - custom words to filter

## Installation

```

python3 -m venv .env
source .env/bin/activate
python -m pip install -U pip
pip install tweet_nlp_toolkit

```

## Usage

### Text Parsing

```python

>>> from tweet_nlp_toolkit import parse_text
>>> text = parse_text("123 @hello #world www.url.com 😰 :) [email protected]")
>>> text.tokens
['123', '@hello', '#world', 'www.url.com', '😰', ':)', '[email protected]']
>>> text.hashtags
['#world']
>>> text.mentions
['@hello']
>>> text.urls
['www.url.com']
>>> text.emojis
['😰']
>>> text.emoticons
[':)']
>>> text.digits
['123']
>>> text.emails
['[email protected]']

```
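
To give a rough idea of what grouping tokens by entity type involves, here is a minimal standard-library sketch. It is not the toolkit's implementation: the `PATTERNS` table and the `classify` helper are hypothetical names, the patterns are deliberately simplistic, and tokens are split on whitespace only.

```python
import re

# Illustrative patterns only; a real tweet tokenizer handles many more edge cases.
PATTERNS = {
    "mentions": re.compile(r"^@\w+$"),
    "hashtags": re.compile(r"^#\w+$"),
    "digits": re.compile(r"^\d+$"),
    "urls": re.compile(r"^(?:https?://|www\.)\S+$"),
}


def classify(text):
    """Group whitespace-separated tokens by the first pattern they match."""
    groups = {name: [] for name in PATTERNS}
    for token in text.split():
        for name, pattern in PATTERNS.items():
            if pattern.match(token):
                groups[name].append(token)
                break  # a token belongs to at most one group
    return groups


groups = classify("123 @hello #world www.url.com see you")
```

Tokens that match no pattern, such as `see` and `you`, are simply left out of every group.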

### Tagging entities

```python

>>> from tweet_nlp_toolkit import parse_text
>>> parse_text(
...     "123 @hello #world www.url.com 😰 :) [email protected]",
...     emojis="tag",
...     hashtags="tag",
...     mentions="tag"
... ).tokens
['123', '', '', 'www.url.com', '', ':)', '[email protected]']

```
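
The difference between the `"tag"` and `"remove"` options can be sketched on a plain token list. This is only an illustration under stated assumptions: the `handle` helper, the `HASHTAG` pattern and the `"<hashtag>"` placeholder string are hypothetical and may differ from what the toolkit actually uses.

```python
import re

# Hypothetical pattern for a hashtag token.
HASHTAG = re.compile(r"^#\w+$")


def handle(tokens, pattern, action=None, tag="<hashtag>"):
    """Apply "remove" or "tag" to tokens matching `pattern`; None keeps them."""
    if action is None:
        return list(tokens)
    if action == "remove":
        return [t for t in tokens if not pattern.match(t)]
    if action == "tag":
        # The placeholder string is an assumption made for this sketch.
        return [tag if pattern.match(t) else t for t in tokens]
    raise ValueError(f"unknown action: {action!r}")


tokens = ["123", "@hello", "#world"]
tagged = handle(tokens, HASHTAG, action="tag")
removed = handle(tokens, HASHTAG, action="remove")
```

Tagging keeps the token count unchanged, which can matter when positions are used downstream, while removing shortens the list.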

### Preprocessing

```python

>>> from tweet_nlp_toolkit import prep
>>> prep(
...     "123 @hello #world www.url.com 😰 :) [email protected]",
...     emojis="demojize",
...     mentions="remove",
...     hashtags="remove",
...     urls="remove",
...     digits="tag",
...     emails="remove"
... )
' :anxious_face_with_sweat: :)'

```

```python
>>> from tweet_nlp_toolkit import prep_file
>>> prep_file("input.txt", "output.txt")

```
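
File preprocessing of this kind can be pictured as applying a string function to an input file and writing the results to an output file. The sketch below assumes line-by-line processing, which the toolkit may or may not do; `process_file` is a hypothetical helper, and `str.lower` stands in for a real preprocessing function.

```python
import tempfile
from pathlib import Path


def process_file(src, dst, transform, encoding="utf-8"):
    """Write transform(line) for every line of `src` to `dst`, one per line."""
    with open(src, encoding=encoding) as fin, open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(transform(line.rstrip("\n")) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.txt"
    dst = Path(tmp) / "output.txt"
    src.write_text("Hello World\nGOOD Morning\n", encoding="utf-8")
    process_file(src, dst, str.lower)  # str.lower stands in for a real preprocessor
    result = dst.read_text(encoding="utf-8")
```

Streaming one line at a time keeps memory use flat, even for large tweet dumps.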

### More

`parse_text`, `prep` and `prep_file` share the same parameters. `parse_text` returns an instance of `ParsedText`,
`prep` returns the preprocessed string, and `prep_file` preprocesses a file.

```

Parameters
----------
text: str
    The text to preprocess.
tokenizer: Callable[[str], List[Token]]
    The tokenizer to use.
encoding: str
    The encoding of the text.
    Default "utf-8".
remove_unencodable_char: bool
    A character that causes an encoding error is replaced with '�'.
    If True, every '�' is removed; otherwise each sequence of '�' is replaced by a single one.
    Default False
to_lower: bool
    Whether to convert the text to lowercase.
    Default True
strip_accents: bool
    Whether to remove accents from Latin characters.
    Default False
reduce_len: bool
    Whether to remove repeated character sequences.
    Default False
filters: set
    Tokens to filter out (case sensitive).
    Default None
emojis: Optional[str]
    How to handle emojis.
    Options:
        - "remove": remove all emojis
        - "tag": replace each emoji with a tag
        - "demojize": replace each emoji with its textual representation, e.g. :musical_keyboard:
            list of emojis: https://www.webfx.com/tools/emoji-cheat-sheet/
        - "emojize": replace each emoji with its Unicode representation, e.g. 😰
    Default None
hashtags: Optional[str]
    How to handle hashtags.
    Options:
        - "remove": remove all hashtags
        - "tag": replace each hashtag with a tag
    Default None
urls: Optional[str]
    How to handle URLs.
    Options:
        - "remove": remove all URLs
        - "tag": replace each URL with a tag
    Default None
mentions: Optional[str]
    How to handle mentions.
    Options:
        - "remove": remove all mentions
        - "tag": replace each mention with a tag
    Default None
digits: Optional[str]
    How to handle digits.
    Options:
        - "remove": remove all digits
        - "tag": replace each digit with a tag
    Default None
emoticons: Optional[str]
    How to handle emoticons.
    Options:
        - "remove": remove all emoticons
        - "tag": replace each emoticon with a tag
    Default None
puncts: Optional[str]
    How to handle punctuation.
    Options:
        - "remove": remove all punctuation
        - "tag": replace each punctuation mark with a tag
    Default None
emails: Optional[str]
    How to handle emails.
    Options:
        - "remove": remove all emails
        - "tag": replace each email with a tag
    Default None
html_tags: Optional[str]
    How to handle HTML tags.
    Options:
        - "remove": remove all HTML tags
    Default None
stop_words: Optional[str]
    How to handle stop words. Only English stop words are supported.
    Options:
        - "remove": remove all stop words
    Default None

```
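
To make the normalization options more concrete, here is a standard-library sketch of what `strip_accents`, `reduce_len` and `remove_unencodable_char` could amount to. It is not the toolkit's code. The function names mirror the parameters for readability only, the `max_run=3` limit is an assumption, and the hypothetical `decode` helper assumes the input arrives as raw bytes.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def reduce_len(text, max_run=3):
    """Shorten any run of one repeated character to at most `max_run` (assumed limit)."""
    return re.sub(r"(.)\1{%d,}" % max_run, lambda m: m.group(1) * max_run, text)


def decode(raw, encoding="utf-8", remove_unencodable_char=False):
    """Decode bytes; undecodable bytes become U+FFFD, then are dropped or merged."""
    text = raw.decode(encoding, errors="replace")
    if remove_unencodable_char:
        return text.replace("\ufffd", "")
    # Collapse each run of replacement characters into a single one.
    return re.sub("\ufffd+", "\ufffd", text)


plain = strip_accents("café déjà vu")
short = reduce_len("soooooo cool")
merged = decode(b"ab\xff\xfecd")
dropped = decode(b"ab\xff\xfecd", remove_unencodable_char=True)
```

Capping runs at three characters rather than one keeps words such as `cool` intact while still bounding elongated forms.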
