{"id":26777850,"url":"https://github.com/dorukalkan/pgdatahub","last_synced_at":"2026-04-20T13:05:47.210Z","repository":{"id":284591566,"uuid":"954150965","full_name":"dorukalkan/pgdatahub","owner":"dorukalkan","description":"Multi-format PostgreSQL ETL automation","archived":false,"fork":false,"pushed_at":"2025-07-06T10:44:21.000Z","size":27,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-07-06T11:34:45.520Z","etag":null,"topics":["automation","csv","data-import","data-pipeline","etl","excel","json","postgresql","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dorukalkan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-24T16:29:49.000Z","updated_at":"2025-07-06T10:44:24.000Z","dependencies_parsed_at":"2025-03-26T17:50:10.210Z","dependency_job_id":null,"html_url":"https://github.com/dorukalkan/pgdatahub","commit_stats":null,"previous_names":["dorukalkan/pgdatahub"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dorukalkan/pgdatahub","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dorukalkan%2Fpgdatahub","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dorukalkan%2Fpgdatahub/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dorukalkan%2Fpgdatahub/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dorukalkan%2Fpgdatahub/manifests","own
er_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dorukalkan","download_url":"https://codeload.github.com/dorukalkan/pgdatahub/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dorukalkan%2Fpgdatahub/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32048450,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T11:35:06.609Z","status":"ssl_error","status_checked_at":"2026-04-20T11:34:48.899Z","response_time":94,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","csv","data-import","data-pipeline","etl","excel","json","postgresql","python"],"created_at":"2025-03-29T05:18:13.671Z","updated_at":"2026-04-20T13:05:47.204Z","avatar_url":"https://github.com/dorukalkan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PostgreSQL Data Hub\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/dorukalkan/pgdatahub/blob/master/LICENSE)\n[![Python](https://img.shields.io/badge/python-3.6%2B-blue)](https://www.python.org/)\n[![Maintenance](https://img.shields.io/badge/maintained-yes-green.svg)](https://github.com/dorukalkan/pgdatahub/graphs/commit-activity)\n\npgdatahub is a multi-format PostgreSQL data import tool that automates ETL operations by 
processing various file formats (CSV, JSON, Excel) and importing them into a PostgreSQL database.\n\n## Overview\n\npgdatahub automates the process of importing data from different file formats into PostgreSQL databases. It handles the entire pipeline from detecting data files to creating properly structured database tables and importing the data. This tool is particularly useful for data analysts and engineers who need to quickly load multiple datasets into a PostgreSQL database without writing custom import scripts for each file format.\n\n## Features\n\n### File detection and processing\n- **Automatic file detection**: Scans the current directory for CSV, JSON, and Excel files\n- **Multi-format support**: Handles CSV, JSON, Excel (.xlsx, .xls, .xlsm, .xlsb, .odf, .ods, .odt)\n- **Multi-sheet Excel support**: Creates separate tables for each sheet in Excel workbooks\n\n### Data cleaning and transformation\n- **Standardized naming**: Converts column names to database-friendly formats\n- **Turkish character conversion**: Transliterates Turkish characters to Latin equivalents\n- **SQL data type mapping**: Maps pandas data types to appropriate SQL data types\n\n### Database creation\n- **Automatic schema generation**: Creates database schemas based on data structure\n- **Efficient data loading**: Uses PostgreSQL's COPY command for fast data insertion\n- **Configurable connection**: Connect to any PostgreSQL server via configuration file\n\n## Installation\n\n### Prerequisites\n- Python 3.6+\n- PostgreSQL server (local or remote)\n- pandas\n- psycopg2\n- openpyxl\n\n### Steps\n1. Clone the repository:\n   ```\n   git clone https://github.com/dorukalkan/pgdatahub.git\n   cd pgdatahub\n   ```\n\n2. Install dependencies:\n   ```\n   pip install -r requirements.txt\n   ```\n\n## Configuration\n\nDatabase connection settings are stored in the `config.json` file. 
**Note: When you first clone this repository, you will only have `config.template.json` (not the actual config file).**\n\n### Setting up configuration (first-time setup):\n\n1. Create a new text file in the same folder as the project and name it `config.json`\n2. Open `config.template.json` to view the template structure\n3. Copy the contents from the template and paste them into your new `config.json` file\n4. Replace the placeholder values with your actual PostgreSQL database credentials:\n   ```json\n   {\n       \"database\": {\n           \"host\": \"localhost\",\n           \"database\": \"your_database\",\n           \"user\": \"your_username\",\n           \"password\": \"your_password\",\n           \"port\": 5432\n       }\n   }\n   ```\n5. Save the file\n\n## Usage\n\n1. Make sure you have a PostgreSQL server running\n2. Update the `config.json` file with your database credentials\n3. Place your data files (CSV, JSON, Excel) in the same directory as `main.py`\n4. Run the script:\n\n```\npython main.py\n```\n\nThe script will:\n1. Move your original data files to an \"unprocessed_data\" directory\n2. Process each file and create appropriate dataframes\n3. Clean and standardize column names and file names\n4. Create database tables with appropriate schemas\n5. Import all data into your PostgreSQL database\n6. 
Move the processed CSV files to a \"processed_data\" directory\n\n## Sample data\n\nThe repository includes sample datasets in the `sample_data` directory that demonstrate the features of pgdatahub:\n\n- **Album-Records.json**: A JSON file demonstrating how JSON structures are converted to database tables\n- **Customer Data \u0026 Info.csv**: CSV file showing how special characters and spaces in headers are handled\n- **Product Sales \u0026 User Data.xlsx**: Excel file with multiple sheets, demonstrating how each sheet becomes a separate table\n\nThese files showcase key features including:\n- File and column name standardization (spaces and special characters → underscores)\n- Turkish character transliteration (ö, ç, ş, ğ, ü, ı → o, c, s, g, u, i)\n- Multi-sheet Excel processing\n- Type mapping from source formats to appropriate SQL data types\n\nTo try out pgdatahub with these sample files, simply copy them to the root directory and run the script.\n\n## Logging\n\npgdatahub includes comprehensive logging that saves information about each run to a timestamped log file. The logs include details about file processing, data cleaning, and database operations, making it easier to troubleshoot any issues.\n\n## Acknowledgements\n\nThis project was initially inspired by StrataScratch's [CSV File to Database Import Automation](https://github.com/Strata-Scratch/csv_to_db_automation) project. 
pgdatahub builds on that project and significantly extends its core functionality with:\n\n- Support for Excel file formats with multiple sheets\n- Support for JSON files\n- Enhanced data cleaning with regex functions\n- Turkish character transliteration\n- Extensive error handling and logging\n- Secure database configuration\n\nYou can check out [StrataScratch](https://www.stratascratch.com) for data science resources and watch their tutorials here:\n- [Solve Data Science Tasks In Python](https://youtu.be/wqBFgaMgFQA?feature=shared)\n- [Automating Your Data Science Tasks In Python](https://youtu.be/TDwy1lSjEZo?feature=shared)\n- [Applying Software Engineering Principles To Your Data Science Tasks In Python](https://youtu.be/N0aHeKyNEto?feature=shared)\n\n## Contact\n\nYou can reach me at [dorukalkan1.0@gmail.com](mailto:dorukalkan1.0@gmail.com) for any issues, questions, or suggestions.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](https://github.com/dorukalkan/pgdatahub/blob/master/LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdorukalkan%2Fpgdatahub","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdorukalkan%2Fpgdatahub","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdorukalkan%2Fpgdatahub/lists"}