https://github.com/pngo1997/tweet-data-processing-query-performance-analysis
Analyzes large Tweet dataset (4.4M tweets) using SQL.
big-data python query sql text-processing
- Host: GitHub
- URL: https://github.com/pngo1997/tweet-data-processing-query-performance-analysis
- Owner: pngo1997
- Created: 2025-01-30T20:30:44.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-30T20:39:27.000Z (over 1 year ago)
- Last Synced: 2025-01-30T21:28:59.762Z (over 1 year ago)
- Topics: big-data, python, query, sql, text-processing
- Language: Jupyter Notebook
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🏗️ Tweet Data Processing & Query Performance Analysis
## 📜 Overview
This project **processes, stores, queries, and analyzes tweet data** from a large dataset (4.4M tweets, one day of tweet data). The tasks include **downloading tweets, storing them in a SQLite database, optimizing database operations, comparing query execution performance in SQL vs. Python, and exporting processed data in multiple formats (JSON, CSV)**.
## 🎯 Problem Explanation
Tasks are divided into three major sections:
1. **Processing & Storing Tweets:**
   - Populate a **3-table schema in SQLite** and measure execution time.
   - Optimize database inserts using **batching (`executemany`)**.
   - Compare execution time across different methods.
2. **Query Execution & Performance Analysis:**
   - Execute SQL queries vs. equivalent Python-based queries.
   - Analyze **linear scalability** of query execution.
   - Use **regular expressions** as an alternative to `json.loads()`.
3. **Data Export & Storage Format Comparison:**
   - Create a **materialized view** (emulated with `CREATE TABLE AS SELECT`, since SQLite has no native materialized views).
   - Export processed data to **JSON and CSV formats**.
   - Compare **file sizes** to evaluate the most efficient storage format.
## 🛠️ Implementation Details
### **1. Processing & Storing Tweets**
- **Downloaded tweet data** (130K & 650K tweets) and stored them in a text file.
- **Populated SQLite tables**:
  - `Tweet` (Tweet ID, User ID, Text, GeoFK)
  - `User` (User ID, Screen Name, Friends Count)
  - `Geo` (Geo ID, Longitude, Latitude)
- **Optimized insert operations**:
  - **Single inserts** vs. **batch inserts (`executemany`, batch size = 2500)**.
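
The single-insert vs. batch-insert comparison above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names, the `make_db`, `insert_single`, `insert_batched`, and `timed` helpers, and the synthetic rows are all assumptions made for the example. Only the three-table layout and the batch size of 2500 come from the description above.

```python
import sqlite3
import time

# Hypothetical schema: column names are assumed from the table descriptions.
SCHEMA = """
CREATE TABLE Geo   (GeoID INTEGER PRIMARY KEY, Longitude REAL, Latitude REAL);
CREATE TABLE User  (UserID INTEGER PRIMARY KEY, ScreenName TEXT, FriendsCount INTEGER);
CREATE TABLE Tweet (TweetID INTEGER PRIMARY KEY, UserID INTEGER, Text TEXT,
                    GeoFK INTEGER REFERENCES Geo(GeoID));
"""


def make_db():
    """Create a fresh in-memory database with the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def insert_single(conn, rows):
    """Insert one row per execute() call."""
    for row in rows:
        conn.execute("INSERT INTO Tweet VALUES (?, ?, ?, ?)", row)
    conn.commit()


def insert_batched(conn, rows, batch_size=2500):
    """Insert rows in chunks of batch_size using executemany()."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany("INSERT INTO Tweet VALUES (?, ?, ?, ?)", batch)
    conn.commit()


def timed(fn, *args):
    """Return the wall-clock seconds taken by fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


# Synthetic rows stand in for parsed tweets: (TweetID, UserID, Text, GeoFK).
rows = [(i, i % 1000, f"tweet {i}", None) for i in range(10_000)]

conn_single = make_db()
conn_batched = make_db()
t_single = timed(insert_single, conn_single, rows)
t_batched = timed(insert_batched, conn_batched, rows)
print(f"single: {t_single:.3f}s  batched: {t_batched:.3f}s")
```

Timings depend on the machine and on whether the database lives in memory or on disk, so the numbers printed here are not comparable to the project's measurements.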
### **2. Query Execution & Performance Analysis**
- **SQL Queries Executed (for each tweet batch):**
  - **Find average latitude per user:**
    ```sql
    -- Average latitude per user, computed two ways (AVG and SUM/COUNT)
    SELECT Tweet.UserID, AVG(latitude), SUM(latitude) / COUNT(latitude)
    FROM Tweet
    JOIN Geo ON Tweet.GeoFK = Geo.GeoID
    GROUP BY Tweet.UserID;
    ```
  - **Measure runtime across multiple executions (1x, 5x, 20x).**
- **Python Query Execution:**
  - Read & process tweets **without SQL**.
  - Compare execution time to SQL.
- **Regular Expression Approach:**
  - Extract UserID and Geo info using **regex instead of `json.loads()`**.
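
A rough sketch of the Python-only query and the regex alternative is shown below. The tweet layout (a `user` object whose first key is `id`, and a `geo` object whose `coordinates` list starts with the latitude) is an assumption made for this example, as are the sample lines and the `extract_json`, `extract_regex`, and `average_latitude` names. The regex patterns only work for that assumed layout and are not a general JSON parser.

```python
import json
import re
from collections import defaultdict

# Made-up tweets in an assumed layout; real tweet files may order keys differently.
LINES = [
    '{"id": 1, "user": {"id": 42, "screen_name": "a"}, "geo": {"type": "Point", "coordinates": [40.0, -75.0]}}',
    '{"id": 2, "user": {"id": 42, "screen_name": "a"}, "geo": {"type": "Point", "coordinates": [42.0, -71.0]}}',
    '{"id": 3, "user": {"id": 7, "screen_name": "b"}, "geo": null}',
]


def extract_json(line):
    """Return (user_id, latitude) using json.loads(), or None without geo data."""
    tweet = json.loads(line)
    geo = tweet.get("geo")
    if not geo:
        return None
    return tweet["user"]["id"], geo["coordinates"][0]


# Assumes "id" is the first key inside "user" and latitude is the first coordinate.
USER_RE = re.compile(r'"user":\s*\{"id":\s*(\d+)')
GEO_RE = re.compile(r'"geo":\s*\{[^}]*"coordinates":\s*\[\s*(-?\d+(?:\.\d+)?)')


def extract_regex(line):
    """Return (user_id, latitude) using regex search only, or None without geo data."""
    user = USER_RE.search(line)
    geo = GEO_RE.search(line)
    if not (user and geo):
        return None
    return int(user.group(1)), float(geo.group(1))


def average_latitude(lines, extract):
    """Python equivalent of AVG(latitude) ... GROUP BY UserID."""
    totals = defaultdict(lambda: [0.0, 0])  # user_id -> [sum, count]
    for line in lines:
        pair = extract(line)
        if pair is None:
            continue  # tweets without geo data are skipped, like the inner join
        user_id, latitude = pair
        totals[user_id][0] += latitude
        totals[user_id][1] += 1
    return {uid: total / count for uid, (total, count) in totals.items()}


print(average_latitude(LINES, extract_json))
print(average_latitude(LINES, extract_regex))
```

Passing the extractor as an argument keeps the aggregation identical for both approaches, so any timing difference comes from parsing alone.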
### **3. Data Export & Storage Format Comparison**
- **Created a `Tweet_Join` table** (joins Tweet, User, Geo).
- Exported processed data into JSON & CSV formats.
- File Size Comparison
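
The export step can be sketched as below, under stated assumptions. The column names, the sample rows, the output file names, and the `LEFT JOIN` on `Geo` (which keeps tweets without a location) are illustrative choices, not details confirmed by the project. Only the `Tweet_Join` table name and the `CREATE TABLE AS SELECT` approach come from the description above.

```python
import csv
import json
import os
import sqlite3
import tempfile

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema and sample rows, for illustration only.
CREATE TABLE Geo   (GeoID INTEGER PRIMARY KEY, Longitude REAL, Latitude REAL);
CREATE TABLE User  (UserID INTEGER PRIMARY KEY, ScreenName TEXT, FriendsCount INTEGER);
CREATE TABLE Tweet (TweetID INTEGER PRIMARY KEY, UserID INTEGER, Text TEXT, GeoFK INTEGER);
INSERT INTO Geo VALUES (1, -75.0, 40.0);
INSERT INTO User VALUES (42, 'a', 10), (7, 'b', 3);
INSERT INTO Tweet VALUES (1, 42, 'hello', 1), (2, 7, 'no location', NULL);

-- A table built from a SELECT acts as a snapshot of the join.
CREATE TABLE Tweet_Join AS
SELECT t.TweetID, t.UserID, u.ScreenName, u.FriendsCount, t.Text, g.Longitude, g.Latitude
FROM Tweet AS t
JOIN User AS u ON u.UserID = t.UserID
LEFT JOIN Geo AS g ON g.GeoID = t.GeoFK;
""")

cursor = conn.execute("SELECT * FROM Tweet_Join ORDER BY TweetID")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "tweet_join.json")
csv_path = os.path.join(out_dir, "tweet_join.csv")

# JSON: one object per row, so every record repeats the column names.
with open(json_path, "w", encoding="utf-8") as f:
    json.dump([dict(zip(columns, row)) for row in rows], f)

# CSV: the column names appear once, in the header row.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    writer.writerows(rows)

json_size = os.path.getsize(json_path)
csv_size = os.path.getsize(csv_path)
print(f"JSON: {json_size} bytes, CSV: {csv_size} bytes")
```

With this list-of-objects JSON layout the field names are written for every row, which is one reason the JSON file tends to come out larger than the CSV for the same data.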