https://github.com/memgonzales/regex-tweet-tokenizer

General-purpose regex-based tweet tokenizer that employs pattern matching with a single regular expression

Host: GitHub
URL: https://github.com/memgonzales/regex-tweet-tokenizer
Owner: memgonzales
Created: 2022-06-19T14:48:11.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2023-01-23T16:12:51.000Z (over 3 years ago)
Last Synced: 2025-01-20T11:11:26.541Z (over 1 year ago)
Topics: natural-language-processing, natural-language-understanding, nlp, regex, regular-expression, tokenization, tokenizer, tweet-preprocessing, tweets
Language: Jupyter Notebook
Homepage:
Size: 13.3 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# RegEx-Based Tweet Tokenizer

![badge][badge-jupyter]

![badge-python](https://img.shields.io/badge/python-3670A0?style=flat&logo=python&logoColor=white)

![badge][badge-pandas]

![Twitter](https://img.shields.io/badge/Twitter-%231DA1F2.svg?style=flat&logo=Twitter&logoColor=white)

This project is a **general-purpose regular expression-based tokenizer for tweets**. To highlight the power and limitations of a purely regular expression-based approach, tokenization is performed by **pattern matching with a *single* regular expression**; conditional statements and substitutions are deliberately not used.
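As a rough illustration of what single-pattern tokenization can look like, the sketch below combines a few alternatives into one expression and collects every match with `findall`. The pattern, the `tokenize` helper, and the sample tweet are hypothetical and far simpler than the expression developed in the notebook; the standard `re` module is used here only to keep the sketch dependency-free.

```python
import re

# Hypothetical, heavily simplified pattern (not the notebook's expression).
# Alternatives are tried left to right, so more specific token types come first.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+        # URLs
  | [@#]\w+             # mentions and hashtags
  | \w+(?:'\w+)*        # words, including contractions
  | [^\w\s]             # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(tweet: str) -> list[str]:
    """Return every non-overlapping match of the single pattern, in order."""
    return TOKEN_PATTERN.findall(tweet)


print(tokenize("@nlp_fan don't miss #NLProc: https://example.com/a?b=1 :)"))
```

No conditional statements or substitutions are involved: whitespace is simply never matched, and the order of the alternatives decides how ambiguous text is split. This sketch breaks the emoticon `:)` into two tokens, the kind of case that would need a further alternative in the same expression.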

All the scripts are placed inside a [Jupyter notebook](https://github.com/memgonzales/regex-tweet-tokenizer/blob/master/RegEx-Based%20Tweet%20Tokenizer.ipynb), which also includes a detailed write-up covering the following:

- Definition of a token (and the underlying rationale)

- Design decisions in the implementation of the tokenizer

- Walkthrough of the implementation of the tokenizer

- Descriptive statistics of the corpus after tokenization

- Analysis of the power and limitations of the tokenizer

- Comparative analysis with the state-of-the-art [NLTK TweetTokenizer](https://www.nltk.org/api/nltk.tokenize.casual.html) 

- Performance (running time) of the tokenizer

- Analysis of the most frequent tokens
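The statistics, running-time, and frequency items above could be approached along the following lines. This is a minimal sketch under stated assumptions: the pattern is a simplified stand-in and the three tweets are made-up examples, not the notebook's expression or the course dataset.

```python
import re
import time
from collections import Counter

# Hypothetical stand-in pattern and sample tweets (assumptions for illustration).
PATTERN = re.compile(r"https?://\S+|[@#]\w+|\w+(?:'\w+)*|[^\w\s]")
tweets = [
    "good morning #coffee",
    "good coffee, good day @friend",
    "morning run done #coffee",
]

# Running time: wall-clock time spent tokenizing the whole corpus.
start = time.perf_counter()
tokenized = [PATTERN.findall(tweet) for tweet in tweets]
elapsed = time.perf_counter() - start

# Descriptive statistics: corpus size in tokens and number of distinct tokens.
counts = Counter(token for tokens in tokenized for token in tokens)
total_tokens = sum(counts.values())
vocabulary_size = len(counts)

print(f"{total_tokens} tokens, {vocabulary_size} distinct")
print("Most frequent token:", counts.most_common(1))
print(f"Tokenized {len(tweets)} tweets in {elapsed:.6f} s")
```

For a corpus of realistic size, repeating the timed section several times (for example with `timeit`) would give a steadier estimate than a single measurement.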

This project is a major course output for an introductory natural language processing class under Mr. Edward P. Tighe of the Department of Software Technology, De La Salle University.

## Built Using

This project is a Jupyter notebook that uses the following Python libraries and modules:

Library/Module | Description | License
-- | -- | --
[`pandas`](https://pandas.pydata.org/) | Provides functions for data analysis and manipulation | BSD 3-Clause "New" or "Revised" License
[`csv`](https://docs.python.org/3/library/csv.html) | Implements classes to read and write tabular data in CSV format | Python Software Foundation License
[`regex`](https://pypi.org/project/regex/) | Provides additional functionality over the standard [`re`](https://docs.python.org/3/library/re.html) module while maintaining backwards-compatibility | Apache License 2.0
[`nltk`](https://www.nltk.org/) (For comparative analysis of resulting tokenization) | Provides interfaces to corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning | Apache License 2.0

*The descriptions are taken from their respective websites.*

## Author

- Mark Edward M. Gonzales 


  mark_gonzales@dlsu.edu.ph 


  gonzales.markedward@gmail.com 


The [dataset of tweets](https://github.com/memgonzales/regex-tweet-tokenizer/blob/master/tweets_for_pa2.csv) was scraped by Mr. Edward P. Tighe of the Department of Software Technology, De La Salle University. All the tweets in this dataset are public tweets collected via the [Twitter API](https://developer.twitter.com/en/docs/twitter-api).

[badge-jupyter]: https://img.shields.io/badge/Jupyter-F37626.svg?&style=flat&logo=Jupyter&logoColor=white

[badge-pandas]: https://img.shields.io/badge/Pandas-2C2D72?style=flat&logo=pandas&logoColor=white

[badge-numpy]: https://img.shields.io/badge/Numpy-777BB4?style=flat&logo=numpy&logoColor=white

[badge-scipy]: https://img.shields.io/badge/SciPy-654FF0?style=flat&logo=SciPy&logoColor=white
