https://github.com/memgonzales/regex-tweet-tokenizer
General-purpose regex-based tweet tokenizer that employs pattern matching with a single regular expression
natural-language-processing natural-language-understanding nlp regex regular-expression tokenization tokenizer tweet-preprocessing tweets
Last synced: 12 months ago
- Host: GitHub
- URL: https://github.com/memgonzales/regex-tweet-tokenizer
- Owner: memgonzales
- Created: 2022-06-19T14:48:11.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2023-01-23T16:12:51.000Z (over 3 years ago)
- Last Synced: 2025-01-20T11:11:26.541Z (over 1 year ago)
- Topics: natural-language-processing, natural-language-understanding, nlp, regex, regular-expression, tokenization, tokenizer, tweet-preprocessing, tweets
- Language: Jupyter Notebook
- Homepage:
- Size: 13.3 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# RegEx-Based Tweet Tokenizer
![badge][badge-jupyter]

![badge][badge-pandas]

This project is a **general-purpose regular expression-based tokenizer for tweets**. To highlight the power and limitations of a purely regular expression-based approach, tokenization is performed by **pattern matching with a *single* regular expression**; conditional statements and substitutions are deliberately avoided.
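To make the single-pattern idea concrete, here is a minimal sketch using the standard `re` module. The pattern, the `tokenize` helper, and the sample tweet are illustrative assumptions and are much simpler than the expression developed in the notebook. Alternatives are ordered from most to least specific, so one `findall` call yields the tokens with no conditional statements or substitutions.

```python
import re

# Hypothetical pattern for illustration only; the notebook's expression is
# far more thorough. Alternatives are tried left to right at each position.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+          # URLs
  | [@#]\w+               # mentions and hashtags
  | [:;=][-']?[)(DPp]     # a few Western-style emoticons
  | \w+(?:'\w+)*          # words, including contractions
  | [^\w\s]               # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(tweet):
    """Return every non-overlapping match of the single pattern, in order."""
    return TOKEN_PATTERN.findall(tweet)


sample = "@nlp_fan I don't like regexes :) #NLP https://t.co/abc123"
print(tokenize(sample))
# ['@nlp_fan', 'I', "don't", 'like', 'regexes', ':)', '#NLP', 'https://t.co/abc123']
```

Because the pattern contains no capturing groups, `findall` returns whole matches. The order of the alternatives matters: moving the catch-all symbol branch ahead of the emoticon branch would split `:)` into two tokens.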
All the scripts are placed inside a [Jupyter notebook](https://github.com/memgonzales/regex-tweet-tokenizer/blob/master/RegEx-Based%20Tweet%20Tokenizer.ipynb), which also includes a detailed write-up covering the following:
- Definition of a token (and the underlying rationale)
- Design decisions in the implementation of the tokenizer
- Walkthrough of the implementation of the tokenizer
- Descriptive statistics of the corpus after tokenization
- Analysis of the power and limitations of the tokenizer
- Comparative analysis with the state-of-the-art [NLTK TweetTokenizer](https://www.nltk.org/api/nltk.tokenize.casual.html)
- Performance (running time) of the tokenizer
- Analysis of the most frequent tokens
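
As a rough illustration of the descriptive statistics and frequency analysis listed above, the sketch below counts tokens with `collections.Counter`. The stand-in pattern, the made-up sample tweets, and the choice to lowercase tokens before counting are all assumptions; the notebook works on the full dataset with its own pattern.

```python
import re
from collections import Counter

# Stand-in tokenizer (an assumption, far simpler than the notebook's pattern).
TOKEN_PATTERN = re.compile(r"[@#]?\w+|[^\w\s]")

# Made-up sample corpus; the notebook reads its tweets from a CSV file.
tweets = [
    "Good morning #Manila!",
    "good coffee, good day",
    "@friend see you in #Manila",
]


def token_frequencies(corpus):
    """Count case-folded tokens across every tweet in the corpus."""
    counts = Counter()
    for tweet in corpus:
        counts.update(token.lower() for token in TOKEN_PATTERN.findall(tweet))
    return counts


frequencies = token_frequencies(tweets)
total_tokens = sum(frequencies.values())   # number of token occurrences
unique_tokens = len(frequencies)           # number of distinct tokens

print(total_tokens, unique_tokens)         # 14 11
print(frequencies.most_common(2))          # [('good', 3), ('#manila', 2)]
```

Whether to fold case before counting is a design choice: folding merges `Good` and `good`, which suits frequency analysis but discards information that a tokenizer evaluation might want to keep.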
This project is a major course output for an introductory natural language processing class under Mr. Edward P. Tighe of the Department of Software Technology, De La Salle University.
## Built Using
This project is a Jupyter notebook, with the following Python libraries and modules used:
Library/Module | Description | License
-- | -- | --
[`pandas`](https://pandas.pydata.org/) | Provides functions for data analysis and manipulation | BSD 3-Clause "New" or "Revised" License
[`csv`](https://docs.python.org/3/library/csv.html) | Implements classes to read and write tabular data in CSV format | Python Software Foundation License
[`regex`](https://pypi.org/project/regex/) | Provides additional functionality over the standard [`re`](https://docs.python.org/3/library/re.html) module while maintaining backwards-compatibility | Apache License 2.0
[`nltk`](https://www.nltk.org/) (For comparative analysis of resulting tokenization) | Provides interfaces to corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning | Apache License 2.0
*The descriptions are taken from their respective websites.*
## Author
- Mark Edward M. Gonzales
mark_gonzales@dlsu.edu.ph
gonzales.markedward@gmail.com
The [dataset of tweets](https://github.com/memgonzales/regex-tweet-tokenizer/blob/master/tweets_for_pa2.csv) was scraped by Mr. Edward P. Tighe of the Department of Software Technology, De La Salle University. All the tweets in this dataset are public tweets collected via the [Twitter API](https://developer.twitter.com/en/docs/twitter-api).
[badge-jupyter]: https://img.shields.io/badge/Jupyter-F37626.svg?&style=flat&logo=Jupyter&logoColor=white
[badge-pandas]: https://img.shields.io/badge/Pandas-2C2D72?style=flat&logo=pandas&logoColor=white
[badge-numpy]: https://img.shields.io/badge/Numpy-777BB4?style=flat&logo=numpy&logoColor=white
[badge-scipy]: https://img.shields.io/badge/SciPy-654FF0?style=flat&logo=SciPy&logoColor=white