https://github.com/pngo1997/text-processing-tokenization
Simple text analysis and tokenization.
- Host: GitHub
- URL: https://github.com/pngo1997/text-processing-tokenization
- Owner: pngo1997
- Created: 2025-02-05T22:04:53.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-02-05T22:31:09.000Z (8 months ago)
- Last Synced: 2025-02-05T23:21:54.167Z (8 months ago)
- Topics: nltk, python, sentence-segmentation, text-preprocessing, tokenization, word-frequency, zipfs-law
- Language: Jupyter Notebook
- Homepage:
- Size: 185 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# 📊 Text Processing & Tokenization
## 📜 Overview
This project focuses on **text analysis and tokenization**, computing key statistics from a plain text document while following **custom tokenization rules**. The program processes input text to generate:
- 📑 **Total number of paragraphs**
- ✍️ **Total number of sentences**
- 📝 **Total number of words/tokens** (after tokenization)
- 🔍 **Total number of unique words/tokens**
- 📊 **Ranked word frequency counts** (sorted in descending order)

Additionally, the program visualizes **Zipf’s Law** by plotting word frequency against rank on a log-log chart.
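The notebook itself is not reproduced here; the following is a rough sketch of how these statistics could be computed, assuming paragraphs are separated by blank lines (the repo may define them differently) and that `tokens` is the output of the custom tokenizer described in the next section:

```python
import re
from nltk.tokenize import sent_tokenize

def summary_stats(text, tokens):
    """Compute the counts listed above from the raw text plus its token list."""
    # Assumption: paragraphs are blocks of text separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        "paragraphs": len(paragraphs),
        "sentences": len(sent_tokenize(text)),
        "tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
    }
```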
## 🚀 1️⃣ Implementation Details
### **Text Processing Rules**
✅ **Sentence Segmentation**: Uses `nltk.sent_tokenize()`.
✅ **Word Tokenization**: Uses `nltk.word_tokenize()`, then further processes tokens (see the sketch after these rules):
- **Punctuation Handling**:
- Leading/trailing punctuation is **separated** from the word but **kept** as its own token.
- Punctuation inside a word is **retained** (e.g., `$3.19 → "$", "3.19"`: the `$` is split off, while the decimal point stays inside the number).
- **Contractions Expansion**:
- `"don't" → "do", "not"`
- `"they'll" → "they", "will"`
- `"let's" → "let", "us"`
- `"I'm" → "I", "am"`
- Multi-contraction handling: `"shouldn't've" → "should", "not", "have"`.
✅ **Normalization**:
- Convert all words to **lowercase**.
- **No stemming or stopword removal**.
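
Below is a minimal sketch of these segmentation, tokenization, and normalization rules, assuming regex-based contraction expansion applied before `nltk.word_tokenize()`; the function name `tokenize` and the `CONTRACTIONS` table are illustrative, not the repo's own:

```python
import re
from nltk.tokenize import sent_tokenize, word_tokenize
# nltk.download("punkt")  # one-time download of the Punkt models used by sent_tokenize

# Illustrative expansion rules matching the examples above; the repo's exact rules may differ.
CONTRACTIONS = [
    (r"(?i)\blet's\b", "let us"),    # special case: "let's" -> "let us"
    (r"(?i)\bcan't\b", "can not"),   # irregular forms handled before the generic "n't" rule
    (r"(?i)\bwon't\b", "will not"),
    (r"(?i)n't\b", " not"),          # "don't" -> "do not"; "shouldn't've" -> "should not've", then "'ve" -> "have"
    (r"(?i)'ll\b", " will"),
    (r"(?i)'ve\b", " have"),
    (r"(?i)'re\b", " are"),
    (r"(?i)'m\b", " am"),
]

def tokenize(text):
    """Sentence-segment, expand contractions, word-tokenize, and lowercase."""
    tokens = []
    for sentence in sent_tokenize(text):
        for pattern, replacement in CONTRACTIONS:
            sentence = re.sub(pattern, replacement, sentence)
        # word_tokenize separates leading/trailing punctuation into its own tokens
        # while keeping internal punctuation, e.g. "$3.19" -> "$", "3.19".
        tokens.extend(token.lower() for token in word_tokenize(sentence))
    return tokens
```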
✅ **Word Frequency Calculation**:
- Compute **unique word counts**.
- Sort words **by frequency (descending)**, then **alphabetically (ascending)** when counts are tied (see the sketch below).
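
A minimal sketch of this ranking, assuming the `tokenize` output above (the name `ranked_frequencies` is illustrative):

```python
from collections import Counter

def ranked_frequencies(tokens):
    """Return (token, count) pairs sorted by count descending, ties broken alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Negating the count in the sort key produces descending frequency order while keeping the alphabetical tie-break ascending.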
## 🚀 2️⃣ Zipf’s Law Visualization
- Generate **log-log plot** of word frequency vs. rank.
- Expected **Zipfian curve** showing a power-law distribution.
- The Zipf’s Law curve shows token frequencies in descending order of rank: the first-ranked word (leftmost) has the highest count, and counts fall off steadily at lower ranks. The steep initial drop indicates that a small set of common words accounts for a large share of the text, while the long, flat tail covers the majority of the vocabulary: words that occur only rarely. The curve gives a comprehensive view of the word distribution in a text, which is a useful preliminary step before more advanced processing.
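
A minimal plotting sketch, assuming `matplotlib` and the `ranked_frequencies` output above:

```python
import matplotlib.pyplot as plt

def plot_zipf(ranked):
    """ranked: (token, count) pairs sorted by descending frequency."""
    ranks = range(1, len(ranked) + 1)
    freqs = [count for _, count in ranked]
    plt.loglog(ranks, freqs, marker=".", linestyle="none")
    plt.xlabel("Rank (log scale)")
    plt.ylabel("Frequency (log scale)")
    plt.title("Word frequency vs. rank (Zipf's Law)")
    plt.show()
```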
## 📌 Summary
✅ Implemented text processing & tokenization with custom rules.
✅ Expanded contractions, normalized words, and retained punctuation.
✅ Analyzed word frequency distribution and generated Zipf’s Law plot.