{"id":29729552,"url":"https://github.com/machado000/sqlalchemy-psql-upsert","last_synced_at":"2025-07-25T04:13:30.763Z","repository":{"id":305017090,"uuid":"1021623014","full_name":"machado000/sqlalchemy-psql-upsert","owner":"machado000","description":"Upsert helper for PostgreSQL using SQLAlchemy","archived":false,"fork":false,"pushed_at":"2025-07-19T18:32:14.000Z","size":97,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-19T19:50:37.663Z","etag":null,"topics":["postgresql","python","sqlalchemy","upsert"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/machado000.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-17T17:20:13.000Z","updated_at":"2025-07-19T18:32:18.000Z","dependencies_parsed_at":"2025-07-19T19:50:43.421Z","dependency_job_id":null,"html_url":"https://github.com/machado000/sqlalchemy-psql-upsert","commit_stats":null,"previous_names":["machado000/sqlalchemy-upsert","machado000/sqlalchemy-psql-upsert"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/machado000/sqlalchemy-psql-upsert","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machado000%2Fsqlalchemy-psql-upsert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machado000%2Fsqlalchemy-psql-upsert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machado000%2Fsqlalchemy-psql-upsert/releases","man
ifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machado000%2Fsqlalchemy-psql-upsert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/machado000","download_url":"https://codeload.github.com/machado000/sqlalchemy-psql-upsert/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machado000%2Fsqlalchemy-psql-upsert/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266878713,"owners_count":23999624,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-24T02:00:09.469Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["postgresql","python","sqlalchemy","upsert"],"created_at":"2025-07-25T04:13:27.851Z","updated_at":"2025-07-25T04:13:30.758Z","avatar_url":"https://github.com/machado000.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SQLAlchemy PostgreSQL Upsert\n\n[![PyPI version](https://img.shields.io/pypi/v/sqlalchemy-psql-upsert)](https://pypi.org/project/sqlalchemy-psql-upsert/)\n[![License](https://img.shields.io/github/license/machado000/sqlalchemy-psql-upsert)](https://github.com/machado000/sqlalchemy-psql-upsert/blob/main/LICENSE)\n[![Issues](https://img.shields.io/github/issues/machado000/sqlalchemy-psql-upsert)](https://github.com/machado000/sqlalchemy-psql-upsert/issues)\n[![Last 
Commit](https://img.shields.io/github/last-commit/machado000/sqlalchemy-psql-upsert)](https://github.com/machado000/sqlalchemy-psql-upsert/commits/main)\n\nA high-performance Python library for PostgreSQL UPSERT operations with intelligent conflict resolution using temporary tables and an advanced CTE algorithm.\n\n\n## 🚀 Features\n\n- **Temporary Table Strategy**: Uses PostgreSQL temporary tables for efficient staging and conflict analysis\n- **Advanced CTE Logic**: Sophisticated Common Table Expression queries for multi-constraint conflict resolution\n- **Atomic MERGE Operations**: Single-transaction upserts using PostgreSQL 15+ MERGE statements\n- **Multi-constraint Support**: Handles primary keys, unique constraints, and composite constraints simultaneously\n- **Intelligent Conflict Resolution**: Automatically filters ambiguous conflicts and deduplicates data\n- **Automatic NaN to NULL Conversion**: Seamlessly converts pandas NaN values to PostgreSQL NULL values\n- **Schema Validation**: Automatic table and column validation before operations\n- **Comprehensive Logging**: Detailed debug information and progress tracking\n\n\n## 📦 Installation\n\n### Using Poetry (Recommended)\n```bash\npoetry add sqlalchemy-psql-upsert\n```\n\n### Using pip\n```bash\npip install sqlalchemy-psql-upsert\n```\n\n## ⚙️ Configuration\n\n### Database Privileges Requirements\n\nBesides SELECT, INSERT, and UPDATE permissions on target tables, this library requires the PostgreSQL `TEMPORARY` privilege to function properly.\n\n**Why Temporary Tables?**\n- **Isolation**: Staging data doesn't interfere with production tables during analysis\n- **Performance**: Bulk operations are faster on temporary tables\n- **Safety**: Failed operations don't leave partial data in target tables\n- **Atomicity**: Entire upsert operation happens in a single transaction\n\n### Environment Variables\n\nCreate a `.env` file or set the following environment variables:\n\n```bash\n# PostgreSQL Configuration\nPGHOST = 
localhost\nPGPORT = 5432\nPGDATABASE = your_database\nPGUSER = your_username\nPGPASSWORD = your_password\n```\n\n### Configuration Class\n\n```python\nfrom sqlalchemy_psql_upsert import PgConfig\n\n# Default configuration from environment\nconfig = PgConfig()\n\n# Manual configuration\nconfig = PgConfig(\n    host=\"localhost\",\n    port=\"5432\",\n    user=\"myuser\",\n    password=\"mypass\",\n    dbname=\"mydb\"\n)\n\nprint(config.uri())  # postgresql+psycopg2://myuser:mypass@localhost:5432/mydb\n```\n\n## 🛠️ Quick Start\n\n### Connection Testing\n\n```python\nfrom sqlalchemy_psql_upsert import test_connection\n\n# Test default connection\nsuccess, message = test_connection()\nif success:\n    print(\"✅ Database connection OK\")\nelse:\n    print(f\"❌ Connection failed: {message}\")\n```\n\n### Privileges Verification\n\n**Important**: This library requires `CREATE TEMP TABLE` privileges to function properly. The client automatically verifies these privileges during initialization.\n\n```python\nfrom sqlalchemy_psql_upsert import PostgresqlUpsert, PgConfig\nfrom sqlalchemy import create_engine\n\n# Test connection and privileges\nconfig = PgConfig()\n\ntry:\n    # This will automatically test temp table privileges\n    upserter = PostgresqlUpsert(config=config)\n    print(\"✅ Connection and privileges verified successfully\")\n    \nexcept PermissionError as e:\n    print(f\"❌ Privilege error: {e}\")\n    print(\"Solution: Grant temporary privileges with:\")\n    print(\"GRANT TEMPORARY ON DATABASE your_database TO your_user;\")\n    \nexcept Exception as e:\n    print(f\"❌ Connection failed: {e}\")\n```\n\n**Grant Required Privileges:**\n```sql\n-- As database administrator, grant temporary table privileges\nGRANT TEMPORARY ON DATABASE your_database TO your_user;\n\n-- Alternatively, grant more comprehensive privileges\nGRANT CREATE ON DATABASE your_database TO your_user;\n```\n\n### Basic Usage\n\n```python\nimport pandas as pd\nfrom sqlalchemy_psql_upsert 
import PostgresqlUpsert, PgConfig\n\n# Configure database connection\nconfig = PgConfig()  # Loads from environment variables\nupserter = PostgresqlUpsert(config=config)\n\n# Prepare your data\ndf = pd.DataFrame({\n    'id': [1, 2, 3],\n    'name': ['Alice', 'Bob', 'Charlie'],\n    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com']\n})\n\n# Perform upsert\nsuccess, affected_rows = upserter.upsert_dataframe(\n    dataframe=df,\n    table_name='users',\n    schema='public'\n)\n\nprint(f\"✅ Upserted {affected_rows} rows successfully\")\n```\n\n### Advanced Configuration\n\n```python\nfrom sqlalchemy import create_engine\n\n# Using custom SQLAlchemy engine\nengine = create_engine('postgresql://user:pass@localhost:5432/mydb')\nupserter = PostgresqlUpsert(engine=engine, debug=True)\n\n# Upsert with custom schema\nsuccess, affected_rows = upserter.upsert_dataframe(\n    dataframe=large_df,\n    table_name='products',\n    schema='inventory'\n)\n```\n\n### Data Type Handling\n\nThe library automatically handles pandas data type conversions:\n\n```python\nimport pandas as pd\nimport numpy as np\n\n# DataFrame with NaN values\ndf = pd.DataFrame({\n    'id': [1, 2, 3, 4],\n    'name': ['Alice', 'Bob', None, 'David'],      # None values\n    'score': [85.5, np.nan, 92.0, 88.1],         # NaN values\n    'active': [True, False, None, True]           # Mixed types with None\n})\n\n# All NaN and None values are automatically converted to PostgreSQL NULL\nupserter.upsert_dataframe(df, 'users')\n# Result: NaN/None → NULL in PostgreSQL\n```\n\n## 🔍 How It Works\n\n### Overview: Temporary Table + CTE + MERGE Strategy\n\nThis library uses a sophisticated 3-stage approach for reliable, high-performance upserts:\n\n1. **Temporary Table Staging**: Data is first loaded into a PostgreSQL temporary table\n2. **CTE-based Conflict Analysis**: Advanced Common Table Expressions analyze and resolve conflicts\n3. 
**Atomic MERGE Operation**: Single transaction applies all changes using PostgreSQL MERGE\n\n### Stage 1: Temporary Table Creation\n\n```sql\n-- Creates a temporary table with identical structure to target table\nCREATE TEMP TABLE \"temp_upsert_12345678\" (\n    id INTEGER,\n    email VARCHAR,\n    doc_type VARCHAR,\n    doc_number VARCHAR\n);\n\n-- Bulk insert all DataFrame data into temporary table\nINSERT INTO \"temp_upsert_12345678\" VALUES (...);\n```\n\n**Benefits:**\n- ✅ Isolates staging data from target table\n- ✅ Enables complex analysis without affecting production data\n- ✅ Automatic cleanup when session ends\n- ✅ High-performance bulk inserts\n\n### Stage 2: CTE-based Conflict Analysis\n\nThe library generates sophisticated CTEs to handle complex conflict scenarios:\n\n```sql\nWITH temp_with_conflicts AS (\n    -- Analyze each temp row against ALL constraints simultaneously\n    SELECT temp.*, temp.ctid,\n           COUNT(DISTINCT target.id) AS conflict_targets,\n           MAX(target.id) AS conflicted_target_id\n    FROM \"temp_upsert_12345678\" temp\n    LEFT JOIN \"public\".\"target_table\" target\n      ON (temp.id = target.id) OR              -- PK conflict\n         (temp.email = target.email) OR        -- Unique conflict\n         (temp.doc_type = target.doc_type AND  -- Composite unique conflict\n          temp.doc_number = target.doc_number)\n    GROUP BY temp.ctid, temp.id, temp.email, temp.doc_type, temp.doc_number\n),\nfiltered AS (\n    -- Filter out rows that conflict with multiple target rows (ambiguous)\n    SELECT * FROM temp_with_conflicts\n    WHERE conflict_targets \u003c= 1\n),\nranked AS (\n    -- Deduplicate temp rows targeting the same existing row\n    SELECT *,\n           ROW_NUMBER() OVER (\n               PARTITION BY COALESCE(conflicted_target_id, id)\n               ORDER BY ctid DESC\n           ) AS row_rank\n    FROM filtered\n),\nclean_rows AS (\n    -- Final dataset: only the latest row per target\n    SELECT id, 
email, doc_type, doc_number\n    FROM ranked\n    WHERE row_rank = 1\n)\n-- Ready for MERGE...\n```\n\n**CTE Logic Breakdown:**\n\n1. **`temp_with_conflicts`**: Joins the temporary table against the target table using ALL constraints simultaneously, counting how many existing rows each temp row conflicts with.\n\n2. **`filtered`**: Removes ambiguous rows that would conflict with multiple existing records (keeps rows with conflict_targets \u003c= 1) - [ _1 to many conflict_ ].\n\n3. **`ranked`**: When multiple temp rows target the same existing record, ranks them in descending insertion order using the temp table's `ctid` - [ _many to 1 conflict_ ].\n\n4. **`clean_rows`**: Keeps only the last row inserted for each conflict. This is the final, clean dataset, ready for the atomic upsert.\n\n### Stage 3: Atomic MERGE Operation\n\n```sql\nMERGE INTO \"public\".\"target_table\" AS tgt\nUSING clean_rows AS src\nON (tgt.id = src.id) OR \n   (tgt.email = src.email) OR \n   (tgt.doc_type = src.doc_type AND tgt.doc_number = src.doc_number)\nWHEN MATCHED THEN\n    UPDATE SET email = src.email, doc_type = src.doc_type, doc_number = src.doc_number\nWHEN NOT MATCHED THEN\n    INSERT (id, email, doc_type, doc_number)\n    VALUES (src.id, src.email, src.doc_type, src.doc_number);\n```\n\n**Benefits:**\n- ✅ Single atomic transaction\n- ✅ Handles all conflict types simultaneously\n- ✅ Automatic INSERT or UPDATE decision\n- ✅ No race conditions or partial updates\n\n### Constraint Detection\n\nThe library automatically analyzes your target table to identify:\n- **Primary key constraints**: Single or composite primary keys\n- **Unique constraints**: Single column unique constraints  \n- **Composite unique constraints**: Multi-column unique constraints\n\nAll constraint types are handled simultaneously in a single operation.\n\n\n## 🚨 Pros, Cons \u0026 Considerations\n\n### ✅ Advantages of Temporary Table + CTE + MERGE Approach\n\n**Performance Benefits:**\n- **Single Transaction**: Entire operation is atomic, no partial 
updates or race conditions\n- **Bulk Operations**: High-performance bulk inserts into temporary tables\n- **Efficient Joins**: PostgreSQL optimizes joins between temporary and main tables\n- **Minimal Locking**: Temporary tables don't interfere with concurrent operations\n\n**Reliability Benefits:**\n- **Comprehensive Conflict Resolution**: Handles all constraint types simultaneously\n- **Deterministic Results**: Same input always produces same output\n- **Automatic Cleanup**: Temporary tables are automatically dropped\n- **ACID Compliance**: Full transaction safety and rollback capability\n\n**Data Integrity Benefits:**\n- **Ambiguity Detection**: Automatically detects and skips problematic rows\n- **Deduplication**: Handles duplicate input data intelligently\n- **Constraint Validation**: PostgreSQL validates all constraints during MERGE\n\n### ❌ Limitations and Trade-offs\n\n**Resource Requirements:**\n- **Memory Usage**: All input data is staged in temporary tables (memory-resident)\n- **Temporary Space**: Requires sufficient temporary storage for staging tables\n- **Single-threaded**: No parallel processing (traded for reliability and simplicity)\n\n**PostgreSQL Specifics:**\n- **Version Dependency**: MERGE statement requires PostgreSQL 15+\n- **Session-based Temp Tables**: Temporary tables are tied to database sessions\n\n**Privilege-related Limitations:**\n- **Database Administrator**: May need DBA assistance to grant `TEMPORARY` privileges\n- **Shared Hosting**: Some cloud providers restrict temporary table creation\n- **Security Policies**: Corporate environments may restrict temporary table usage\n\n**Scale Considerations:**\n- **Large Dataset Handling**: Very large datasets (\u003e1M rows) may require memory tuning\n- **Transaction Duration**: Entire operation happens in one transaction (longer lock times)\n\n### 🎯 Best Practices\n\n**Memory Management:**\n```python\n# For large datasets, monitor memory usage\nimport pandas as pd\n\n# Consider chunking 
very large datasets manually if needed\ndef chunk_dataframe(df, chunk_size=50000):\n    for start in range(0, len(df), chunk_size):\n        # iloc slices rows by position, regardless of the DataFrame index\n        yield df.iloc[start:start + chunk_size]\n\n# Process in manageable chunks\nfor chunk in chunk_dataframe(large_df):\n    success, rows = upserter.upsert_dataframe(chunk, 'target_table')\n    print(f\"Processed chunk: {rows} rows\")\n```\n\n**Performance Optimization:**\n```python\n# Ensure proper indexing on conflict columns\n# CREATE INDEX idx_email ON target_table(email);\n# CREATE INDEX idx_composite ON target_table(doc_type, doc_number);\n\n# Use debug mode to monitor performance\nupserter = PostgresqlUpsert(engine=engine, debug=True)\n```\n\n**Error Handling:**\n```python\nimport logging\n\nlogger = logging.getLogger(__name__)\n\ntry:\n    success, affected_rows = upserter.upsert_dataframe(df, 'users')\n    logger.info(f\"Successfully upserted {affected_rows} rows\")\nexcept ValueError as e:\n    logger.error(f\"Validation error: {e}\")\nexcept Exception as e:\n    logger.error(f\"Upsert failed: {e}\")\n    # No manual rollback needed: the transaction is rolled back automatically\n```\n\n## 🤝 Contributing\n\nWe welcome contributions! Here's how to get started:\n\n1. **Fork the repository**\n2. **Create a feature branch**: `git checkout -b feature/amazing-feature`\n3. **Make your changes** and add tests\n4. **Run the test suite**: `pytest tests/ -v`\n5. 
**Submit a pull request**\n\n\n## 📝 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## 🙋 Support\n\n- **Issues**: [GitHub Issues](https://github.com/machado000/sqlalchemy-psql-upsert/issues)\n- **Documentation**: Check the docstrings and test files for detailed usage examples","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmachado000%2Fsqlalchemy-psql-upsert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmachado000%2Fsqlalchemy-psql-upsert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmachado000%2Fsqlalchemy-psql-upsert/lists"}