{"id":49874346,"url":"https://github.com/mtholahan/postgresql-tuning-mini-project","last_synced_at":"2026-05-15T11:41:33.405Z","repository":{"id":283604181,"uuid":"952314711","full_name":"mtholahan/postgresql-tuning-mini-project","owner":"mtholahan","description":"Optimized PostgreSQL queries on a computer science publications dataset. Created tables, ingested CSVs, and wrote queries to analyze conferences, authors, and publication trends. Improved performance by designing indexes, refining join/filter logic, and evaluating execution plans with EXPLAIN, demonstrating query tuning and indexing strategies.","archived":false,"fork":false,"pushed_at":"2025-09-15T04:58:14.000Z","size":38,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-15T06:26:36.707Z","etag":null,"topics":["bootcamp","data-engineering","data-ingestion","database","etl","indexing","performance-tuning","postgresql","publications","query-optimization","research-papers","springboard","sql"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mtholahan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-03-21T04:43:32.000Z","updated_at":"2025-09-15T04:58:17.000Z","dependencies_parsed_at":"2025-09-15T06:26:42.550Z","dependency_job_id":"56a98454-791e-4070-8630-674bdfc6248c","html_url":"https://github.com/mtholahan/postgresql-tuning-mini-project","comm
it_stats":null,"previous_names":["mtholahan/sql_tuning","mtholahan/postgresql-tuning-mini-project"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mtholahan/postgresql-tuning-mini-project","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtholahan%2Fpostgresql-tuning-mini-project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtholahan%2Fpostgresql-tuning-mini-project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtholahan%2Fpostgresql-tuning-mini-project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtholahan%2Fpostgresql-tuning-mini-project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mtholahan","download_url":"https://codeload.github.com/mtholahan/postgresql-tuning-mini-project/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtholahan%2Fpostgresql-tuning-mini-project/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33066034,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-15T11:35:32.926Z","status":"ssl_error","status_checked_at":"2026-05-15T11:35:31.362Z","response_time":103,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bootcamp","data-engineering","data-ingestion","database","etl","indexing","performance-tuning","postgresql","publications","query-optimization","research-papers","springboard","sql"],"created_at":"2026-05-15T11:41:32.807Z","updated_at":"2026-05-15T11:41:33.396Z","avatar_url":"https://github.com/mtholahan.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# PostgreSQL Tuning Mini Project\r\n\r\n\r\n## 📖 Abstract\r\nThis project focuses on query performance tuning in PostgreSQL, using a bibliographic dataset of computer science papers, authors, books, and conference proceedings. 
The goal was to practice query design, indexing strategies, and query plan analysis by answering a series of five analytical questions about conferences, authors, and publications.\r\n\r\nThe workflow included:\r\n\r\n- Creating relational tables in PostgreSQL (articles, authors, books, inproceedings, proceedings, publications) and loading data from CSV extracts.\r\n- Writing SQL queries to answer tasks such as:\r\n  - Finding conferences with 200+ papers in a decade.\r\n  - Identifying authors with at least 10 publications in both PVLDB and SIGMOD.\r\n  - Summarizing conference publications by decade from 1970 to 2019.\r\n  - Ranking the top authors in “data”-related venues.\r\n  - Listing June conferences with over 100 proceedings.\r\n- Using EXPLAIN to study execution plans, compare queries with and without indexes, and evaluate cache effects.\r\n- Optimizing queries through indexing, improved join logic, and filtering on indexed columns.\r\n- Writing a report analyzing performance improvements, trade-offs, and index usage.\r\n\r\nDeliverables include individual .sql files for each query and a written report documenting how indexes improved query performance. 
This project strengthened my ability to design efficient SQL, interpret query plans, and optimize workloads in PostgreSQL, all essential skills for production-scale analytics.\r\n\r\n\r\n\r\n## 🛠 Requirements\r\n- PostgreSQL 13+ installed locally\r\n- pgAdmin or psql CLI\r\n- DBLP dataset of computer science publications (provided via Python script/CSV export)\r\n- Python script to download and parse the dataset into CSV\r\n- GitHub repo with SQL files + written report (Word/PDF)\r\n\r\n\r\n\r\n## 🧰 Setup\r\n- Install PostgreSQL and pgAdmin (or use psql CLI)\r\n- Create database: `CREATE DATABASE dblp;`\r\n- Create tables: Articles, Authors, Books, Inproceedings, Proceedings, Publications\r\n- Run Python script to download and parse DBLP XML → CSVs (this is a large file!)\r\n- Import CSVs into corresponding Postgres tables using pgAdmin import or `COPY`\r\n\r\n\r\n\r\n## 📊 Dataset\r\n- DBLP computer science publications dataset\r\n- Parsed into CSVs for Articles, Authors, Books, Inproceedings, Proceedings, Publications\r\n- Imported into Postgres for query + optimization tasks\r\n\r\n\r\n\r\n## ⏱️ Run Steps\r\n- Write queries to answer 5 rubric questions\r\n- Run queries without indexes; capture EXPLAIN plans\r\n- Create indexes to optimize joins/filters\r\n- Re-run queries with indexes; capture new EXPLAIN plans\r\n- Document improvements in Word/PDF report\r\n\r\n\r\n\r\n## 📈 Outputs\r\n- 5 SQL queries answering rubric questions\r\n- EXPLAIN query plans before and after indexing (see \"EXPLAIN_Query_Plans_Before_and_After.xlsx\")\r\n- Written report comparing performance improvements\r\n\r\n\r\n\r\n\r\n\r\n## 📎 Deliverables\r\n\r\n- [`Query_4-1.sql`](./deliverables/Query_4-1.sql)\r\n\r\n- [`Query_4-2.sql`](./deliverables/Query_4-2.sql)\r\n\r\n- [`Query_4-3.sql`](./deliverables/Query_4-3.sql)\r\n\r\n- [`Query_4-4.sql`](./deliverables/Query_4-4.sql)\r\n\r\n- [`Query_4-5.sql`](./deliverables/Query_4-5.sql)\r\n\r\n- 
[`EXPLAIN_Query_Plans_Before_and_After.xlsx`](./deliverables/EXPLAIN_Query_Plans_Before_and_After.xlsx)\r\n\r\n- [`PostgreSQL_Mini_Project_Report.pdf`](./deliverables/PostgreSQL_Mini_Project_Report.pdf)\r\n\r\n- [`dblp_extract.py`](./deliverables/dblp_extract.py)\r\n\r\n\r\n\r\n\r\n## 🛠️ Architecture\r\n- Single-node PostgreSQL database\r\n- DBLP dataset imported into relational schema\r\n- Queries benchmarked with and without indexing\r\n\r\n\r\n\r\n## 🔍 Monitoring\r\n- Used EXPLAIN to analyze query plans\r\n- Compared execution cost before and after indexes\r\n- Optionally observed caching effects\r\n\r\n\r\n\r\n## ♻️ Cleanup\r\n- Drop dblp database if no longer needed\r\n- Remove CSVs and parsed dataset\r\n- Archive final Word/PDF report and SQL files in repo\r\n\r\n\r\n*Generated automatically via Python + Jinja2 + SQL Server table `tblMiniProjectProgress` on 11-11-2025 15:31:10*","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmtholahan%2Fpostgresql-tuning-mini-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmtholahan%2Fpostgresql-tuning-mini-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmtholahan%2Fpostgresql-tuning-mini-project/lists"}