https://github.com/pngo1997/sql-query-optimization-precedence-graph-analysis
Explores SQL query execution, indexing, and serializability in database transactions using SQLite and Python.
index python sql sqllite text-processing
- Host: GitHub
- URL: https://github.com/pngo1997/sql-query-optimization-precedence-graph-analysis
- Owner: pngo1997
- Created: 2025-01-30T20:25:30.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-30T20:40:09.000Z (over 1 year ago)
- Last Synced: 2025-02-28T14:13:31.570Z (over 1 year ago)
- Topics: index, python, sql, sqllite, text-processing
- Language: Jupyter Notebook
- Homepage:
- Size: 2.75 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🏗️ SQL Query Optimization & Precedence Graph Analysis
## 📜 Overview
This project explores **SQL query execution, indexing, and serializability in database transactions** using **SQLite** and **Python**. It includes **performance benchmarking**, **SQL vs. Python query execution time comparisons**, and **precedence graph analysis** for database scheduling.
## 🎯 Problem Explanation
### **Part 1: Query Execution and Performance Benchmarking**
This section compares **SQL query execution** with equivalent **Python-based queries** and measures **runtime performance**.
#### **1. Expanding the Database Schema**
- Created a **new Geo table** in addition to the existing `Tweet` and `User` tables.
- The `Geo` table includes:
  - **Primary Key (`geo_id`)** – Assigned based on location uniqueness.
  - **Type**
  - **Longitude**
  - **Latitude**
- **Linked the `Geo` table to the `Tweet` table** via a **foreign key**.
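A minimal sketch of how such a schema could look in SQLite, using Python's built-in `sqlite3` module. The column types, the `UNIQUE` constraint used to model "location uniqueness", the `Tweet` columns other than `id_str`, and the `get_geo_id` helper are all assumptions made for illustration; the notebook's actual schema may differ.

```python
import sqlite3

# Assumed schema: only geo_id, type, longitude and latitude are named in this README.
SCHEMA = """
CREATE TABLE Geo (
    geo_id    INTEGER PRIMARY KEY,
    type      TEXT,
    longitude REAL,
    latitude  REAL,
    UNIQUE (type, longitude, latitude)  -- one row per distinct location
);
CREATE TABLE Tweet (
    id_str TEXT PRIMARY KEY,
    text   TEXT,
    geo_id INTEGER REFERENCES Geo(geo_id)
);
"""

def get_geo_id(conn, geo_type, longitude, latitude):
    """Return the geo_id for a location, inserting a row only if it is new."""
    conn.execute(
        "INSERT OR IGNORE INTO Geo (type, longitude, latitude) VALUES (?, ?, ?)",
        (geo_type, longitude, latitude),
    )
    row = conn.execute(
        "SELECT geo_id FROM Geo WHERE type = ? AND longitude = ? AND latitude = ?",
        (geo_type, longitude, latitude),
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript(SCHEMA)

a = get_geo_id(conn, "Point", -87.63, 41.88)
b = get_geo_id(conn, "Point", -87.63, 41.88)  # same location, so the same key is reused
c = get_geo_id(conn, "Point", -73.99, 40.73)  # new location, new key
conn.execute("INSERT INTO Tweet VALUES (?, ?, ?)", ("1001", "hello", a))
```

With foreign keys enabled, inserting a `Tweet` row that points at a nonexistent `geo_id` raises `sqlite3.IntegrityError`.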
#### **2. Query Execution & Benchmarking**
- **SQL Query Execution & Timing:**
  a. Find tweets whose **tweet ID (`id_str`) contains "89" or "78"** anywhere in the value.
  b. Count the **unique values in the `friends_count` column**.
- **Equivalent Python-Based Query Execution & Timing:**
  - The same queries were executed **without SQL** by reading directly from the **CSV file**.
  - Execution times were compared against **SQL query performance**.
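The comparison above could be sketched roughly as follows. The sample rows, the flat `TweetData` table, and the `timed` helper are hypothetical stand-ins; the project's CSV layout and timing method are not shown in this README. Note that this sketch times the in-memory scan only, whereas a fuller comparison might also include the cost of reading the CSV file.

```python
import csv
import io
import sqlite3
import time

# Hypothetical sample rows standing in for the project's CSV file.
CSV_TEXT = """id_str,friends_count
1008912,10
2234567,25
5578001,10
3330000,40
"""

def timed(fn):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TweetData (id_str TEXT, friends_count INTEGER)")
conn.executemany("INSERT INTO TweetData VALUES (:id_str, :friends_count)", rows)

def sql_matching_ids():
    query = "SELECT id_str FROM TweetData WHERE id_str LIKE '%89%' OR id_str LIKE '%78%'"
    return [r[0] for r in conn.execute(query)]

def py_matching_ids():
    return [r["id_str"] for r in rows if "89" in r["id_str"] or "78" in r["id_str"]]

def sql_unique_friends():
    return conn.execute("SELECT COUNT(DISTINCT friends_count) FROM TweetData").fetchone()[0]

def py_unique_friends():
    return len({r["friends_count"] for r in rows})

for name, fn in [
    ("SQL substring match", sql_matching_ids),
    ("Python substring match", py_matching_ids),
    ("SQL distinct count", sql_unique_friends),
    ("Python distinct count", py_unique_friends),
]:
    _, seconds = timed(fn)
    print(f"{name}: {seconds * 1e6:.1f} microseconds")
```

A single run on a handful of rows is noisy; on real data it is usually worth repeating each query several times and comparing the best or median time.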
#### **3. Visualization (Scatter Plot)**
- Plotted **tweet lengths (first 60 tweets) vs. username lengths**.
- Created a **scatterplot** to analyze correlation.
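One possible way to build this plot is sketched below. The field names `text` and `screen_name` and both helper functions are assumptions; the README does not say which column is treated as the username. Matplotlib is imported inside `plot_lengths` so the length computation can be used on its own.

```python
def length_pairs(records, limit=60):
    """Return (tweet_length, username_length) for the first `limit` records."""
    return [(len(r["text"]), len(r["screen_name"])) for r in records[:limit]]

def plot_lengths(pairs):
    """Draw a scatter plot of tweet length against username length."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    xs, ys = zip(*pairs)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel("Tweet length (characters)")
    ax.set_ylabel("Username length (characters)")
    ax.set_title("Tweet length vs. username length")
    return fig

# Hypothetical sample records standing in for rows joined from Tweet and User.
records = [
    {"text": "hello world", "screen_name": "alice"},
    {"text": "SQL is fun", "screen_name": "bob_the_dba"},
]
pairs = length_pairs(records)
```

Calling `plot_lengths(pairs)` returns a Matplotlib figure that can be shown in a notebook or saved with `fig.savefig(...)`.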
### **Part 2: Query Optimization Using Indexes & Materialized Views**
This section improves query performance using **indexes and materialized views** in **SQLite**.
- **Indexes Created:**
  a. **Index on `userid`** in the `Tweet` table.
  b. **Composite index on (`friends_count`, `screen_name`)** in the `User` table.
- **Materialized View for Query Optimization:**
  - Since SQLite lacks **materialized view support**, a regular table was created with **`CREATE TABLE ... AS SELECT`** to store precomputed query results for **faster retrieval**.
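The two indexes and the materialized-view workaround might look like the sketch below. The index names, the `UserTweetCounts` table, its aggregate query, and the sample rows are hypothetical; only the indexed columns and the use of `CREATE TABLE AS` come from this README.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE User (id INTEGER PRIMARY KEY, screen_name TEXT, friends_count INTEGER);
CREATE TABLE Tweet (id_str TEXT PRIMARY KEY, userid INTEGER REFERENCES User(id), text TEXT);

INSERT INTO User VALUES (1, 'alice', 10), (2, 'bob', 25);
INSERT INTO Tweet VALUES ('1', 1, 'a'), ('2', 1, 'b'), ('3', 2, 'c');

-- a. Single-column index on Tweet.userid
CREATE INDEX idx_tweet_userid ON Tweet (userid);
-- b. Composite index on User (friends_count, screen_name)
CREATE INDEX idx_user_friends_name ON User (friends_count, screen_name);

-- Stand-in for a materialized view: a plain table holding a precomputed result
CREATE TABLE UserTweetCounts AS
SELECT u.id AS userid, u.screen_name, COUNT(t.id_str) AS tweet_count
FROM User AS u LEFT JOIN Tweet AS t ON t.userid = u.id
GROUP BY u.id, u.screen_name;
""")

# EXPLAIN QUERY PLAN reports whether a lookup uses the new index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Tweet WHERE userid = ?", (1,)
).fetchall()
for row in plan:
    print(row[-1])  # the last column is the human-readable plan detail

counts = dict(conn.execute("SELECT screen_name, tweet_count FROM UserTweetCounts"))
```

Unlike a true materialized view, a table built this way is a snapshot: it does not update when `User` or `Tweet` change, so it has to be dropped and rebuilt (or maintained by triggers) to stay current.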
### **Part 3: Precedence Graph & Serializability Analysis**
This section evaluates **database transaction schedules** for **conflict serializability** using **precedence graphs**.
- **Precedence Graph for Schedule 1:**
  - The schedule is **conflict serializable**.
  - **Equivalent Serial Schedule:** ``.
- **Precedence Graph for Schedule 2:**
  - The schedule is **NOT conflict serializable** because conflicting operations create a **cycle** in its precedence graph.
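The test behind both results can be sketched as follows: add an edge `Ti -> Tj` whenever an operation of `Ti` precedes a conflicting operation of `Tj` (same item, different transactions, at least one write), then check the graph for a cycle with a topological sort. The two schedules below are hypothetical examples, not the project's Schedule 1 and Schedule 2, which are not reproduced in this README.

```python
from collections import defaultdict

def precedence_edges(schedule):
    """Return the set of (Ti, Tj) edges implied by conflicting operations."""
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            # Conflict: different transactions, same item, at least one write.
            if ti != tj and item_i == item_j and "W" in (op_i, op_j):
                edges.add((ti, tj))
    return edges

def serial_order(schedule):
    """Return an equivalent serial order, or None if the graph has a cycle."""
    nodes = sorted({t for t, _, _ in schedule})
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for a, b in precedence_edges(schedule):
        successors[a].append(b)
        indegree[b] += 1

    # Kahn's algorithm: repeatedly emit a transaction with no incoming edges.
    ready = [n for n in nodes if indegree[n] == 0]
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order if len(order) == len(nodes) else None

# Hypothetical schedules as (transaction, operation, item) triples.
acyclic = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]
cyclic = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]

print(serial_order(acyclic))  # an order exists: conflict serializable
print(serial_order(cyclic))   # None: the graph contains a cycle
```

In the second schedule, `T1` reads `A` before `T2` writes it (edge `T1 -> T2`), and `T2` writes `A` before `T1` writes it (edge `T2 -> T1`), which forms the cycle.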
## 🚀 Technologies Used
- **SQLite** (for database management & indexing).
- **Python (Pandas, Matplotlib)** (for data querying, visualization & performance benchmarking).
- **SQL Query Optimization** (Indexes & Materialized Views).