{"id":25819027,"url":"https://github.com/pngo1997/tweet-data-processing-query-performance-analysis","last_synced_at":"2025-06-12T03:38:05.066Z","repository":{"id":275043129,"uuid":"924888353","full_name":"pngo1997/Tweet-Data-Processing-Query-Performance-Analysis","owner":"pngo1997","description":"Analyzes large Tweet dataset (4.4M tweets) using SQL.","archived":false,"fork":false,"pushed_at":"2025-01-30T20:39:27.000Z","size":0,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-30T21:28:59.762Z","etag":null,"topics":["big-data","python","query","sql","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pngo1997.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-30T20:30:44.000Z","updated_at":"2025-01-30T20:39:30.000Z","dependencies_parsed_at":"2025-01-30T21:29:01.427Z","dependency_job_id":"2215671f-1682-4fc7-8ef4-5e76df6545ac","html_url":"https://github.com/pngo1997/Tweet-Data-Processing-Query-Performance-Analysis","commit_stats":null,"previous_names":["pngo1997/tweet-data-processing-query-performance-analysis"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pngo1997%2FTweet-Data-Processing-Query-Performance-Analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pngo1997%2FTweet-Data-Processing-Query-Performance-Analysis/tags","releases_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/pngo1997%2FTweet-Data-Processing-Query-Performance-Analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pngo1997%2FTweet-Data-Processing-Query-Performance-Analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pngo1997","download_url":"https://codeload.github.com/pngo1997/Tweet-Data-Processing-Query-Performance-Analysis/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241122319,"owners_count":19913455,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","python","query","sql","text-processing"],"created_at":"2025-02-28T08:14:25.045Z","updated_at":"2025-02-28T08:14:25.473Z","avatar_url":"https://github.com/pngo1997.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🏗️ Tweet Data Processing \u0026 Query Performance Analysis  \n\n## 📜 Overview  \nThis project involves **processing, storing, querying, and analyzing tweet data** from a large dataset (4.4M tweets - one day of tweet data). The tasks include **downloading tweets, storing them in a SQLite database, optimizing database operations, comparing query execution performance in SQL vs. Python, and exporting processed data in multiple formats (JSON, CSV)**.  \n\n## 🎯 Problem Explanation  \nTasks are divided into three major sections:  \n\n1. **Processing \u0026 Storing Tweets:**  \n   - Populate a **3-table schema in SQLite** and measure execution time.  
\n   - Optimize database inserts using **batching (executemany)**.  \n   - Compare execution time across different methods.  \n\n2. **Query Execution \u0026 Performance Analysis:**  \n   - Execute SQL queries vs. equivalent Python-based queries.  \n   - Analyze **linear scalability** of query execution.  \n   - Implement **regular expressions** as an alternative to `json.loads()`.  \n\n3. **Data Export \u0026 Storage Format Comparison:**  \n   - Create a **materialized view** (using `CREATE TABLE AS SELECT`).  \n   - Export processed data to **JSON and CSV formats**.  \n   - Compare **file sizes** to evaluate the most efficient storage format.  \n\n## 🛠️ Implementation Details  \n\n### **1. Processing \u0026 Storing Tweets**  \n- **Downloaded tweet data** (130K \u0026 650K tweets) and stored them in a text file.  \n- **Populated SQLite tables**:  \n  - `Tweet` (Tweet ID, User ID, Text, GeoFK)  \n  - `User` (User ID, Screen Name, Friends Count)  \n  - `Geo` (Geo ID, Longitude, Latitude)  \n- **Optimized insert operations**:  \n  - **Single inserts** vs. **batch inserts (executemany, batch size = 2500)**.  \n\n\n### **2. Query Execution \u0026 Performance Analysis**  \n- **SQL Queries Executed (for each tweet batch):**  \n  - **Find average latitude per user:**  \n    ```sql\n    SELECT UserID, AVG(latitude), SUM(latitude)/COUNT(latitude)\n    FROM Tweet, Geo WHERE Tweet.GeoFK = Geo.GeoID \n    GROUP BY UserID;\n    ```\n  - **Measure runtime across multiple executions (1x, 5x, 20x).**  \n- **Python Query Execution:**  \n  - Read \u0026 process tweets **without SQL**.  \n  - Compare execution time to SQL.  \n- **Regular Expression Approach:**  \n  - Extract UserID and Geo info using **regex instead of `json.loads()`**.  \n\n\n### **3. Data Export \u0026 Storage Format Comparison**  \n- **Created a `Tweet_Join` table** (joins Tweet, User, Geo).  
\n- Exported the processed data to JSON \u0026 CSV formats.\n- Compared the file sizes of the two exports to evaluate the most efficient storage format.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpngo1997%2Ftweet-data-processing-query-performance-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpngo1997%2Ftweet-data-processing-query-performance-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpngo1997%2Ftweet-data-processing-query-performance-analysis/lists"}