An open API service indexing awesome lists of open source software.

https://github.com/mtholahan/postgresql-tuning-mini-project

Optimized PostgreSQL queries on a computer science publications dataset. Created tables, ingested CSVs, and wrote queries to analyze conferences, authors, and publication trends. Improved performance by designing indexes, refining join/filter logic, and evaluating execution plans with EXPLAIN, demonstrating query tuning and indexing strategies.
https://github.com/mtholahan/postgresql-tuning-mini-project

bootcamp data-engineering data-ingestion database etl indexing performance-tuning postgresql publications query-optimization research-papers springboard sql

Last synced: 9 days ago
JSON representation

Optimized PostgreSQL queries on a computer science publications dataset. Created tables, ingested CSVs, and wrote queries to analyze conferences, authors, and publication trends. Improved performance by designing indexes, refining join/filter logic, and evaluating execution plans with EXPLAIN, demonstrating query tuning and indexing strategies.

Awesome Lists containing this project

README

          

# PostgreSQL Tuning Mini Project

## πŸ“– Abstract
This project focuses on query performance tuning in PostgreSQL, using a bibliographic dataset of computer science papers, authors, books, and conference proceedings. The goal was to practice query design, indexing strategies, and query plan analysis by answering a series of five analytical questions about conferences, authors, and publications.

The workflow included:

- Creating relational tables in PostgreSQL (articles, authors, books, inproceedings, proceedings, publications) and loading data from CSV extracts.

- Writing SQL queries to answer tasks such as:

- Finding conferences with 200+ papers in a decade.

- Identifying authors with at least 10 publications in both PVLDB and SIGMOD.

- Summarizing conference publications by decade from 1970–2019.

- Ranking the top authors in β€œdata”-related venues.

- Listing June conferences with over 100 proceedings.

- Using EXPLAIN to study execution plans, compare queries with and without indexes, and evaluate cache effects.

- Optimizing queries through indexing, improved join logic, and filtering on indexed columns.

- Writing a report analyzing performance improvements, trade-offs, and index usage.

Deliverables include individual .sql files for each query and a written report documenting how indexes improved query performance. This project strengthened my ability to design efficient SQL, interpret query plans, and optimize workloads in PostgreSQL, all essential skills for production-scale analytics.

## πŸ›  Requirements
- PostgreSQL 13+ installed locally

- pgAdmin or psql CLI

- DBLP dataset of computer science publications (provided via Python script/CSV export)

- Python script to download and parse dataset into CSV

- GitHub repo with SQL files + written report (Word/PDF)

## 🧰 Setup
- Install PostgreSQL and pgAdmin (or use psql CLI)

- Create database: CREATE DATABASE dblp;

- Create tables: Articles, Authors, Books, Inproceedings, Proceedings, Publications

- Run Python script to download and parse DBLP XML β†’ CSVs (this is large file!)

- Import CSVs into corresponding Postgres tables using pgAdmin import or COPY

## πŸ“Š Dataset
- DBLP computer science publications dataset

- Parsed into CSVs for Articles, Authors, Books, Inproceedings, Proceedings, Publications

- Imported into Postgres for query + optimization tasks

## ⏱️ Run Steps
- Write queries to answer 5 rubric questions

- Run queries without indexes; capture EXPLAIN plans

- Create indexes to optimize joins/filters

- Re-run queries with indexes; capture new EXPLAIN plans

- Document improvements in Word/PDF report

## πŸ“ˆ Outputs
- 5 SQL queries answering rubric questions

- EXPLAIN query plans before and after indexing (see "Query_Plans_Before_and_After.xlsx")

- Written report comparing performance improvements

## πŸ“Ž Deliverables

- [`Query_4-1.sql`](./deliverables/Query_4-1.sql)

- [`Query_4-2.sql`](./deliverables/Query_4-2.sql)

- [`Query_4-3.sql`](./deliverables/Query_4-3.sql)

- [`Query_4-4.sql`](./deliverables/Query_4-4.sql)

- [`Query_4-5.sql`](./deliverables/Query_4-5.sql)

- [`EXPLAIN_Query_Plans_Before_and_After.xlsx`](./deliverables/EXPLAIN_Query_Plans_Before_and_After.xlsx)

- [`PostgreSQL_Mini_Project_Report.pdf`](./deliverables/PostgreSQL_Mini_Project_Report.pdf)

- [`dblp_extract.py`](./deliverables/dblp_extract.py)

## πŸ› οΈ Architecture
- Single-node PostgreSQL database

- DBLP dataset imported into relational schema

- Queries benchmarked with and without indexing

## πŸ” Monitoring
- Used EXPLAIN to analyze query plans

- Compared execution cost before and after indexes

- Optionally observed caching effects

## ♻️ Cleanup
- Drop dblp database if no longer needed

- Remove CSVs and parsed dataset

- Archive final Word/PDF report and SQL files in repo

*Generated automatically via Python + Jinja2 + SQL Server table `tblMiniProjectProgress` on 11-11-2025 15:31:10*