https://github.com/mtholahan/postgresql-tuning-mini-project
Optimized PostgreSQL queries on a computer science publications dataset. Created tables, ingested CSVs, and wrote queries to analyze conferences, authors, and publication trends. Improved performance by designing indexes, refining join/filter logic, and evaluating execution plans with EXPLAIN, demonstrating query tuning and indexing strategies.
https://github.com/mtholahan/postgresql-tuning-mini-project
bootcamp data-engineering data-ingestion database etl indexing performance-tuning postgresql publications query-optimization research-papers springboard sql
Last synced: 9 days ago
JSON representation
Optimized PostgreSQL queries on a computer science publications dataset. Created tables, ingested CSVs, and wrote queries to analyze conferences, authors, and publication trends. Improved performance by designing indexes, refining join/filter logic, and evaluating execution plans with EXPLAIN, demonstrating query tuning and indexing strategies.
- Host: GitHub
- URL: https://github.com/mtholahan/postgresql-tuning-mini-project
- Owner: mtholahan
- Created: 2025-03-21T04:43:32.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-09-15T04:58:14.000Z (8 months ago)
- Last Synced: 2025-09-15T06:26:36.707Z (8 months ago)
- Topics: bootcamp, data-engineering, data-ingestion, database, etl, indexing, performance-tuning, postgresql, publications, query-optimization, research-papers, springboard, sql
- Homepage:
- Size: 37.1 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# PostgreSQL Tuning Mini Project
## π Abstract
This project focuses on query performance tuning in PostgreSQL, using a bibliographic dataset of computer science papers, authors, books, and conference proceedings. The goal was to practice query design, indexing strategies, and query plan analysis by answering a series of five analytical questions about conferences, authors, and publications.
The workflow included:
- Creating relational tables in PostgreSQL (articles, authors, books, inproceedings, proceedings, publications) and loading data from CSV extracts.
- Writing SQL queries to answer tasks such as:
- Finding conferences with 200+ papers in a decade.
- Identifying authors with at least 10 publications in both PVLDB and SIGMOD.
- Summarizing conference publications by decade from 1970β2019.
- Ranking the top authors in βdataβ-related venues.
- Listing June conferences with over 100 proceedings.
- Using EXPLAIN to study execution plans, compare queries with and without indexes, and evaluate cache effects.
- Optimizing queries through indexing, improved join logic, and filtering on indexed columns.
- Writing a report analyzing performance improvements, trade-offs, and index usage.
Deliverables include individual .sql files for each query and a written report documenting how indexes improved query performance. This project strengthened my ability to design efficient SQL, interpret query plans, and optimize workloads in PostgreSQL, all essential skills for production-scale analytics.
## π Requirements
- PostgreSQL 13+ installed locally
- pgAdmin or psql CLI
- DBLP dataset of computer science publications (provided via Python script/CSV export)
- Python script to download and parse dataset into CSV
- GitHub repo with SQL files + written report (Word/PDF)
## π§° Setup
- Install PostgreSQL and pgAdmin (or use psql CLI)
- Create database: CREATE DATABASE dblp;
- Create tables: Articles, Authors, Books, Inproceedings, Proceedings, Publications
- Run Python script to download and parse DBLP XML β CSVs (this is large file!)
- Import CSVs into corresponding Postgres tables using pgAdmin import or COPY
## π Dataset
- DBLP computer science publications dataset
- Parsed into CSVs for Articles, Authors, Books, Inproceedings, Proceedings, Publications
- Imported into Postgres for query + optimization tasks
## β±οΈ Run Steps
- Write queries to answer 5 rubric questions
- Run queries without indexes; capture EXPLAIN plans
- Create indexes to optimize joins/filters
- Re-run queries with indexes; capture new EXPLAIN plans
- Document improvements in Word/PDF report
## π Outputs
- 5 SQL queries answering rubric questions
- EXPLAIN query plans before and after indexing (see "Query_Plans_Before_and_After.xlsx")
- Written report comparing performance improvements
## π Deliverables
- [`Query_4-1.sql`](./deliverables/Query_4-1.sql)
- [`Query_4-2.sql`](./deliverables/Query_4-2.sql)
- [`Query_4-3.sql`](./deliverables/Query_4-3.sql)
- [`Query_4-4.sql`](./deliverables/Query_4-4.sql)
- [`Query_4-5.sql`](./deliverables/Query_4-5.sql)
- [`EXPLAIN_Query_Plans_Before_and_After.xlsx`](./deliverables/EXPLAIN_Query_Plans_Before_and_After.xlsx)
- [`PostgreSQL_Mini_Project_Report.pdf`](./deliverables/PostgreSQL_Mini_Project_Report.pdf)
- [`dblp_extract.py`](./deliverables/dblp_extract.py)
## π οΈ Architecture
- Single-node PostgreSQL database
- DBLP dataset imported into relational schema
- Queries benchmarked with and without indexing
## π Monitoring
- Used EXPLAIN to analyze query plans
- Compared execution cost before and after indexes
- Optionally observed caching effects
## β»οΈ Cleanup
- Drop dblp database if no longer needed
- Remove CSVs and parsed dataset
- Archive final Word/PDF report and SQL files in repo
*Generated automatically via Python + Jinja2 + SQL Server table `tblMiniProjectProgress` on 11-11-2025 15:31:10*