https://github.com/machado000/sqlalchemy-psql-upsert

Upsert helper for PostgreSQL using SQLAlchemy
postgresql python sqlalchemy upsert
Last synced: 11 months ago
Host: GitHub
URL: https://github.com/machado000/sqlalchemy-psql-upsert
Owner: machado000
License: mit
Created: 2025-07-17T17:20:13.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-07-19T18:32:14.000Z (11 months ago)
Last Synced: 2025-07-19T19:50:37.663Z (11 months ago)
Topics: postgresql, python, sqlalchemy, upsert
Language: Python
Homepage:
Size: 94.7 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README

# SQLAlchemy PostgreSQL Upsert

[![PyPI version](https://img.shields.io/pypi/v/sqlalchemy-psql-upsert)](https://pypi.org/project/sqlalchemy-psql-upsert/)

[![License](https://img.shields.io/github/license/machado000/sqlalchemy-psql-upsert)](https://github.com/machado000/sqlalchemy-psql-upsert/blob/main/LICENSE)

[![Issues](https://img.shields.io/github/issues/machado000/sqlalchemy-psql-upsert)](https://github.com/machado000/sqlalchemy-psql-upsert/issues)

[![Last Commit](https://img.shields.io/github/last-commit/machado000/sqlalchemy-psql-upsert)](https://github.com/machado000/sqlalchemy-psql-upsert/commits/main)

A high-performance Python library for PostgreSQL UPSERT operations with intelligent conflict resolution, using temporary tables and an advanced CTE algorithm.

## 🚀 Features

- **Temporary Table Strategy**: Uses PostgreSQL temporary tables for efficient staging and conflict analysis

- **Advanced CTE Logic**: Sophisticated Common Table Expression queries for multi-constraint conflict resolution

- **Atomic MERGE Operations**: Single-transaction upserts using PostgreSQL 15+ MERGE statements

- **Multi-constraint Support**: Handles primary keys, unique constraints, and composite constraints simultaneously

- **Intelligent Conflict Resolution**: Automatically filters ambiguous conflicts and deduplicates data

- **Automatic NaN to NULL Conversion**: Seamlessly converts pandas NaN values to PostgreSQL NULL values

- **Schema Validation**: Automatic table and column validation before operations

- **Comprehensive Logging**: Detailed debug information and progress tracking

## 📦 Installation

### Using Poetry (Recommended)

```bash

poetry add sqlalchemy-psql-upsert

```

### Using pip

```bash

pip install sqlalchemy-psql-upsert

```

## ⚙️ Configuration

### Database Privileges Requirements

Besides `SELECT`, `INSERT`, and `UPDATE` permissions on target tables, this library requires the PostgreSQL `TEMPORARY` privilege on the database so that it can create staging tables.

**Why Temporary Tables?**

- **Isolation**: Staging data doesn't interfere with production tables during analysis

- **Performance**: Bulk operations are faster on temporary tables

- **Safety**: Failed operations don't leave partial data in target tables

- **Atomicity**: Entire upsert operation happens in a single transaction

### Environment Variables

Create a `.env` file or set the following environment variables:

```bash

# PostgreSQL Configuration

PGHOST=localhost

PGPORT=5432

PGDATABASE=your_database

PGUSER=your_username

PGPASSWORD=your_password

```
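As a rough illustration of how these variables map onto a connection URI, the sketch below assembles one using only the standard library. `build_pg_uri` is a hypothetical helper written for this example, not part of the library's API, and the defaults for `PGHOST` and `PGPORT` are assumptions. It percent-encodes the user and password so that characters such as `@` or `/` do not break the URI.

```python
import os
from urllib.parse import quote_plus


def build_pg_uri(env=None):
    """Assemble a SQLAlchemy-style PostgreSQL URI from PG* variables.

    Hypothetical helper for illustration; not the library's implementation.
    """
    env = os.environ if env is None else env
    host = env.get("PGHOST", "localhost")  # assumed default
    port = env.get("PGPORT", "5432")       # assumed default
    dbname = env["PGDATABASE"]
    # Percent-encode credentials so reserved characters stay unambiguous.
    user = quote_plus(env["PGUSER"])
    password = quote_plus(env["PGPASSWORD"])
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{dbname}"


example_env = {
    "PGHOST": "localhost",
    "PGPORT": "5432",
    "PGDATABASE": "mydb",
    "PGUSER": "myuser",
    "PGPASSWORD": "p@ss/word",
}
print(build_pg_uri(example_env))
# postgresql+psycopg2://myuser:p%40ss%2Fword@localhost:5432/mydb
```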

### Configuration Class

```python

from sqlalchemy_psql_upsert import PgConfig

# Default configuration from environment

config = PgConfig()

# Manual configuration

config = PgConfig(

    host="localhost",

    port="5432",

    user="myuser",

    password="mypass",

    dbname="mydb"

)

print(config.uri())  # postgresql+psycopg2://myuser:mypass@localhost:5432/mydb

```

## 🛠️ Quick Start

### Connection Testing

```python

from sqlalchemy_psql_upsert import test_connection

# Test default connection

success, message = test_connection()

if success:

    print("✅ Database connection OK")

else:

    print(f"❌ Connection failed: {message}")

```

### Privileges Verification

**Important**: This library requires the `TEMPORARY` privilege (the right to run `CREATE TEMP TABLE`) to function properly. The client automatically verifies this privilege during initialization.

```python

from sqlalchemy_psql_upsert import PostgresqlUpsert, PgConfig

# Test connection and privileges

config = PgConfig()

try:

    # This will automatically test temp table privileges

    upserter = PostgresqlUpsert(config=config)

    print("✅ Connection and privileges verified successfully")

    

except PermissionError as e:

    print(f"❌ Privilege error: {e}")

    print("Solution: Grant temporary privileges with:")

    print("GRANT TEMPORARY ON DATABASE your_database TO your_user;")

    

except Exception as e:

    print(f"❌ Connection failed: {e}")

```

**Grant Required Privileges:**

```sql

-- As database administrator, grant temporary table privileges

GRANT TEMPORARY ON DATABASE your_database TO your_user;

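-- Alternatively, grant all database-level privileges (CREATE, CONNECT, TEMPORARY).

-- Note: CREATE on a database covers schemas and does not by itself include TEMPORARY.

GRANT ALL PRIVILEGES ON DATABASE your_database TO your_user;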

```

### Basic Usage

```python

import pandas as pd

from sqlalchemy_psql_upsert import PostgresqlUpsert, PgConfig

# Configure database connection

config = PgConfig()  # Loads from environment variables

upserter = PostgresqlUpsert(config=config)

# Prepare your data

df = pd.DataFrame({

    'id': [1, 2, 3],

    'name': ['Alice', 'Bob', 'Charlie'],

    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com']

})

# Perform upsert

success, affected_rows = upserter.upsert_dataframe(

    dataframe=df,

    table_name='users',

    schema='public'

)

if success:

    print(f"✅ Upserted {affected_rows} rows successfully")

```

### Advanced Configuration

```python

from sqlalchemy import create_engine

from sqlalchemy_psql_upsert import PostgresqlUpsert

# Using custom SQLAlchemy engine

engine = create_engine('postgresql://user:pass@localhost:5432/mydb')

upserter = PostgresqlUpsert(engine=engine, debug=True)

# Upsert into a custom schema (large_df is a pandas DataFrame prepared elsewhere)

success, affected_rows = upserter.upsert_dataframe(

    dataframe=large_df,

    table_name='products',

    schema='inventory'

)

```

### Data Type Handling

The library automatically handles pandas data type conversions:

```python

import pandas as pd

import numpy as np

# DataFrame with NaN values

df = pd.DataFrame({

    'id': [1, 2, 3, 4],

    'name': ['Alice', 'Bob', None, 'David'],      # None values

    'score': [85.5, np.nan, 92.0, 88.1],         # NaN values

    'active': [True, False, None, True]           # Mixed types with None

})

# All NaN and None values are automatically converted to PostgreSQL NULL

upserter.upsert_dataframe(df, 'users')

# Result: NaN/None → NULL in PostgreSQL

```
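To make the conversion concrete, here is a minimal standard-library sketch of the same idea applied to plain row dictionaries (the shape that `df.to_dict("records")` produces). `nan_to_none` is a hypothetical helper written for this example; the library's own conversion may work differently.

```python
import math


def nan_to_none(records):
    """Replace float NaN values with None in a list of row dicts."""
    cleaned = []
    for row in records:
        cleaned.append({
            key: None if isinstance(value, float) and math.isnan(value) else value
            for key, value in row.items()
        })
    return cleaned


rows = [
    {"id": 1, "name": "Alice", "score": 85.5},
    {"id": 2, "name": "Bob", "score": float("nan")},
    {"id": 3, "name": None, "score": 92.0},
]

# None is what database drivers send as SQL NULL; NaN would otherwise be
# stored as a floating-point NaN value in numeric columns.
print(nan_to_none(rows))
```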

## 🔍 How It Works

### Overview: Temporary Table + CTE + MERGE Strategy

This library uses a sophisticated 3-stage approach for reliable, high-performance upserts:

1. **Temporary Table Staging**: Data is first loaded into a PostgreSQL temporary table

2. **CTE-based Conflict Analysis**: Advanced Common Table Expressions analyze and resolve conflicts

3. **Atomic MERGE Operation**: Single transaction applies all changes using PostgreSQL MERGE

### Stage 1: Temporary Table Creation

```sql

-- Creates a temporary table with identical structure to target table

CREATE TEMP TABLE "temp_upsert_12345678" (

    id INTEGER,

    email VARCHAR,

    doc_type VARCHAR,

    doc_number VARCHAR

);

-- Bulk insert all DataFrame data into temporary table

INSERT INTO "temp_upsert_12345678" VALUES (...);

```

**Benefits:**

- ✅ Isolates staging data from target table

- ✅ Enables complex analysis without affecting production data

- ✅ Automatic cleanup when session ends

- ✅ High-performance bulk inserts
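The staging statement could be generated along these lines. This is a sketch under assumptions: `quote_ident` and `temp_table_sql` are hypothetical helpers, the 8-character suffix is borrowed from the example name above, and `LIKE` is used here as a shortcut for copying the target's column definitions rather than listing them explicitly as the SQL above does.

```python
import uuid


def quote_ident(name):
    """Double-quote a PostgreSQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def temp_table_sql(schema, table, suffix=None):
    """Return (temp_name, CREATE statement) for a staging table.

    Hypothetical helper for illustration; not the library's implementation.
    """
    # A random suffix keeps concurrent sessions from reusing the same name.
    suffix = suffix or uuid.uuid4().hex[:8]
    temp_name = f"temp_upsert_{suffix}"
    target = f"{quote_ident(schema)}.{quote_ident(table)}"
    sql = f"CREATE TEMP TABLE {quote_ident(temp_name)} (LIKE {target})"
    return temp_name, sql


name, sql = temp_table_sql("public", "users", suffix="12345678")
print(sql)
# CREATE TEMP TABLE "temp_upsert_12345678" (LIKE "public"."users")
```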

### Stage 2: CTE-based Conflict Analysis

The library generates sophisticated CTEs to handle complex conflict scenarios:

```sql

WITH temp_with_conflicts AS (

    -- Analyze each temp row against ALL constraints simultaneously

    SELECT temp.*, temp.ctid,

           COUNT(DISTINCT target.id) AS conflict_targets,

           MAX(target.id) AS conflicted_target_id

    FROM "temp_upsert_12345678" temp

    LEFT JOIN "public"."target_table" target

      ON (temp.id = target.id) OR              -- PK conflict

         (temp.email = target.email) OR        -- Unique conflict

         (temp.doc_type = target.doc_type AND  -- Composite unique conflict

          temp.doc_number = target.doc_number)

    GROUP BY temp.ctid, temp.id, temp.email, temp.doc_type, temp.doc_number

),

filtered AS (

    -- Filter out rows that conflict with multiple target rows (ambiguous)

    SELECT * FROM temp_with_conflicts

    WHERE conflict_targets <= 1

),

ranked AS (

    -- Deduplicate temp rows targeting the same existing row

    SELECT *,

           ROW_NUMBER() OVER (

               PARTITION BY COALESCE(conflicted_target_id, id)

               ORDER BY ctid DESC

           ) AS row_rank

    FROM filtered

),

clean_rows AS (

    -- Final dataset: only the latest row per target

    SELECT id, email, doc_type, doc_number

    FROM ranked

    WHERE row_rank = 1

)

-- Ready for MERGE...

```

**CTE Logic Breakdown:**

1. **`temp_with_conflicts`**: Joins temporary table against target table using ALL constraints simultaneously, counting how many existing rows each temp row conflicts with.

2. **`filtered`**: Removes ambiguous rows that would conflict with multiple existing records (keeps rows with `conflict_targets <= 1`) - [ _1 to many conflict_ ].

3. **`ranked`**: When multiple temp rows target the same existing record, ranks them in descending insertion order using the temp table's `ctid` - [ _many to 1 conflict_ ].

4. **`clean_rows`**: Keeps only the last row inserted for each conflict, producing the final clean dataset for the atomic upsert.
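The same filtering and deduplication rules can be mimicked in plain Python, which may help when reasoning about which rows survive. This is an illustrative model only: `resolve_conflicts` is a hypothetical function, list order stands in for `ctid` order, and the real work happens in SQL on the server.

```python
def resolve_conflicts(staged, target, constraints, pk="id"):
    """Mimic the filtered / ranked / clean_rows steps on lists of row dicts."""

    def conflicts(row, existing):
        # A conflict is a match on every column of at least one constraint.
        # None never matches, mirroring SQL NULL comparison semantics.
        return any(
            all(row[c] is not None and row[c] == existing[c] for c in cols)
            for cols in constraints
        )

    latest = {}  # dedup key -> last staged row seen for that key
    for row in staged:  # list order stands in for ctid order
        hits = {t[pk] for t in target if conflicts(row, t)}
        if len(hits) > 1:
            continue  # ambiguous: conflicts with several target rows
        # Same key as COALESCE(conflicted_target_id, id) in the SQL.
        key = next(iter(hits)) if hits else row[pk]
        latest[key] = row  # later rows overwrite earlier ones
    return list(latest.values())


target = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": "b@x.com"}]
staged = [
    {"id": 1, "email": "a@x.com"},   # matches target row 1 only
    {"id": 1, "email": "b@x.com"},   # id hits row 1, email hits row 2: dropped
    {"id": 3, "email": "c@x.com"},   # new row
    {"id": 3, "email": "d@x.com"},   # later duplicate of the new row: wins
    {"id": 1, "email": "a2@x.com"},  # later update for target row 1: wins
]
clean = resolve_conflicts(staged, target, constraints=[["id"], ["email"]])
print(clean)
```

In this example only two rows survive: the last update aimed at target row 1 and the last staged version of the new row 3.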

### Stage 3: Atomic MERGE Operation

```sql

MERGE INTO "public"."target_table" AS tgt

USING clean_rows AS src

ON (tgt.id = src.id) OR 

   (tgt.email = src.email) OR 

   (tgt.doc_type = src.doc_type AND tgt.doc_number = src.doc_number)

WHEN MATCHED THEN

    UPDATE SET email = src.email, doc_type = src.doc_type, doc_number = src.doc_number

WHEN NOT MATCHED THEN

    INSERT (id, email, doc_type, doc_number)

    VALUES (src.id, src.email, src.doc_type, src.doc_number);

```

**Benefits:**

- ✅ Single atomic transaction

- ✅ Handles all conflict types simultaneously

- ✅ Automatic INSERT or UPDATE decision

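- ✅ No partial updates: the statement applies fully or not at all (a concurrent insert of the same key by another session can still raise a unique violation, which aborts the MERGE)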

### Constraint Detection

The library automatically analyzes your target table to identify:

- **Primary key constraints**: Single or composite primary keys

- **Unique constraints**: Single column unique constraints  

- **Composite unique constraints**: Multi-column unique constraints

All constraint types are handled simultaneously in a single operation.
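Once the constraint columns are known, the match condition used in the join and MERGE statements can be derived mechanically. The sketch below assumes the constraints are already available as lists of column names (with SQLAlchemy they could be gathered from `inspect(engine).get_pk_constraint(...)` and `get_unique_constraints(...)`); `build_on_clause` is a hypothetical helper, not the library's API, and it assumes column names contain no double quotes.

```python
def build_on_clause(constraints, tgt="tgt", src="src"):
    """Build an OR-joined match condition from lists of constraint columns."""
    groups = []
    for cols in constraints:
        # Every column of a constraint must match, so join them with AND.
        pairs = [f'{tgt}."{col}" = {src}."{col}"' for col in cols]
        groups.append("(" + " AND ".join(pairs) + ")")
    # Any single constraint matching counts as a conflict, so join with OR.
    return " OR ".join(groups)


constraints = [["id"], ["email"], ["doc_type", "doc_number"]]
print(build_on_clause(constraints))
```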

## 🚨 Pros, Cons & Considerations

### ✅ Advantages of Temporary Table + CTE + MERGE Approach

**Performance Benefits:**

- **Single Transaction**: Entire operation is atomic, so other sessions never see a partially applied upsert

- **Bulk Operations**: High-performance bulk inserts into temporary tables

- **Efficient Joins**: PostgreSQL optimizes joins between temporary and main tables

- **Minimal Locking**: Temporary tables don't interfere with concurrent operations

**Reliability Benefits:**

- **Comprehensive Conflict Resolution**: Handles all constraint types simultaneously

- **Deterministic Results**: Same input always produces same output

- **Automatic Cleanup**: Temporary tables are automatically dropped

- **ACID Compliance**: Full transaction safety and rollback capability

**Data Integrity Benefits:**

- **Ambiguity Detection**: Automatically detects and skips problematic rows

- **Deduplication**: Handles duplicate input data intelligently

- **Constraint Validation**: PostgreSQL validates all constraints during MERGE

### ❌ Limitations and Trade-offs

**Resource Requirements:**

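- **Memory Usage**: The input DataFrame is held in client memory, and all of its rows are staged in a temporary table (buffered in session-local `temp_buffers` and spilled to disk when larger)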

- **Temporary Space**: Requires sufficient temporary storage for staging tables

- **Single-threaded**: No parallel processing (traded for reliability and simplicity)

**PostgreSQL Specifics:**

- **Version Dependency**: MERGE statement requires PostgreSQL 15+

- **Session-based Temp Tables**: Temporary tables are tied to database sessions
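A caller can guard against the version requirement before attempting an upsert. PostgreSQL reports its version as an integer through `SHOW server_version_num` (for example `150004` for 15.4); the sketch below only interprets that value, and `supports_merge` is a hypothetical helper rather than something the library exposes.

```python
def supports_merge(server_version_num):
    """Return True when the server is PostgreSQL 15 or newer.

    Accepts the value of `SHOW server_version_num` as str or int.
    """
    return int(server_version_num) >= 150000


print(supports_merge("150004"))  # True  (PostgreSQL 15.4)
print(supports_merge(140011))    # False (PostgreSQL 14.11)
```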

**Privilege-related Limitations:**

- **Database Administrator**: May need DBA assistance to grant `TEMPORARY` privileges

- **Shared Hosting**: Some cloud providers restrict temporary table creation

- **Security Policies**: Corporate environments may restrict temporary table usage

**Scale Considerations:**

- **Large Dataset Handling**: Very large datasets (>1M rows) may require memory tuning

- **Transaction Duration**: Entire operation happens in one transaction (longer lock times)

### 🎯 Best Practices

**Memory Management:**

```python

# For large datasets, monitor memory usage

import pandas as pd

# Consider chunking very large datasets manually if needed

def chunk_dataframe(df, chunk_size=50000):

    for start in range(0, len(df), chunk_size):

        yield df.iloc[start:start + chunk_size]  # positional row slice

# Process in manageable chunks

for chunk in chunk_dataframe(large_df):

    success, rows = upserter.upsert_dataframe(chunk, 'target_table')

    print(f"Processed chunk: {rows} rows")

```

**Performance Optimization:**

```python

# Ensure proper indexing on conflict columns

# CREATE INDEX idx_email ON target_table(email);

# CREATE INDEX idx_composite ON target_table(doc_type, doc_number);

# Use debug mode to monitor performance

upserter = PostgresqlUpsert(engine=engine, debug=True)

```

**Error Handling:**

```python

import logging

logger = logging.getLogger(__name__)

try:

    success, affected_rows = upserter.upsert_dataframe(df, 'users')

    logger.info(f"Successfully upserted {affected_rows} rows")

except ValueError as e:

    logger.error(f"Validation error: {e}")

except Exception as e:

    logger.error(f"Upsert failed: {e}")

    # No manual rollback needed: the transaction is rolled back automatically

```

## 🤝 Contributing

We welcome contributions! Here's how to get started:

1. **Fork the repository**

2. **Create a feature branch**: `git checkout -b feature/amazing-feature`

3. **Make your changes** and add tests

4. **Run the test suite**: `pytest tests/ -v`

5. **Submit a pull request**

## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙋 Support

- **Issues**: [GitHub Issues](https://github.com/machado000/sqlalchemy-psql-upsert/issues)

- **Documentation**: Check the docstrings and test files for detailed usage examples