https://github.com/machado000/sqlalchemy-psql-upsert
Upsert helper for PostgreSQL using SQLAlchemy
- Host: GitHub
- URL: https://github.com/machado000/sqlalchemy-psql-upsert
- Owner: machado000
- License: mit
- Created: 2025-07-17T17:20:13.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-07-19T18:32:14.000Z (11 months ago)
- Last Synced: 2025-07-19T19:50:37.663Z (11 months ago)
- Topics: postgresql, python, sqlalchemy, upsert
- Language: Python
- Homepage:
- Size: 94.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# SQLAlchemy PostgreSQL Upsert
A high-performance Python library for PostgreSQL UPSERT operations. It resolves conflicts by staging data in temporary tables and analyzing it with a CTE-based algorithm before merging.
## 🚀 Features
- **Temporary Table Strategy**: Uses PostgreSQL temporary tables for efficient staging and conflict analysis
- **Advanced CTE Logic**: Sophisticated Common Table Expression queries for multi-constraint conflict resolution
- **Atomic MERGE Operations**: Single-transaction upserts using PostgreSQL 15+ MERGE statements
- **Multi-constraint Support**: Handles primary keys, unique constraints, and composite constraints simultaneously
- **Intelligent Conflict Resolution**: Automatically filters ambiguous conflicts and deduplicates data
- **Automatic NaN to NULL Conversion**: Seamlessly converts pandas NaN values to PostgreSQL NULL values
- **Schema Validation**: Automatic table and column validation before operations
- **Comprehensive Logging**: Detailed debug information and progress tracking
## 📦 Installation
### Using Poetry (Recommended)
```bash
poetry add sqlalchemy-psql-upsert
```
### Using pip
```bash
pip install sqlalchemy-psql-upsert
```
## ⚙️ Configuration
### Database Privileges Requirements
Besides `SELECT`, `INSERT`, and `UPDATE` permissions on target tables, this library requires the PostgreSQL `TEMPORARY` privilege to function properly.
**Why Temporary Tables?**
- **Isolation**: Staging data doesn't interfere with production tables during analysis
- **Performance**: Bulk operations are faster on temporary tables
- **Safety**: Failed operations don't leave partial data in target tables
- **Atomicity**: Entire upsert operation happens in a single transaction
### Environment Variables
Create a `.env` file or set the following environment variables:
```bash
# PostgreSQL Configuration
PGHOST=localhost
PGPORT=5432
PGDATABASE=your_database
PGUSER=your_username
PGPASSWORD=your_password
```
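To illustrate how these variables map onto a connection URI, here is a minimal sketch. The helper name `build_pg_uri` and its defaults are assumptions made for this example; it is not the library's `PgConfig` implementation, only a stand-alone approximation of the same idea.

```python
import os
from urllib.parse import quote_plus


def build_pg_uri(env=None):
    """Assemble a SQLAlchemy-style PostgreSQL URI from PG* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    host = env.get("PGHOST", "localhost")  # assumed default
    port = env.get("PGPORT", "5432")       # assumed default
    dbname = env["PGDATABASE"]
    user = env["PGUSER"]
    # Percent-encode the password so characters like '@' or '/' do not break the URI
    password = quote_plus(env["PGPASSWORD"])
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{dbname}"


example_env = {
    "PGHOST": "localhost",
    "PGPORT": "5432",
    "PGDATABASE": "mydb",
    "PGUSER": "myuser",
    "PGPASSWORD": "p@ss/word",
}
print(build_pg_uri(example_env))
```

Passing a dictionary instead of reading `os.environ` directly keeps the sketch easy to try without touching your real environment.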
### Configuration Class
```python
from sqlalchemy_psql_upsert import PgConfig

# Default configuration from environment
config = PgConfig()

# Manual configuration
config = PgConfig(
    host="localhost",
    port="5432",
    user="myuser",
    password="mypass",
    dbname="mydb"
)
print(config.uri())  # postgresql+psycopg2://myuser:mypass@localhost:5432/mydb
```
## 🛠️ Quick Start
### Connection Testing
```python
from sqlalchemy_psql_upsert import test_connection

# Test default connection
success, message = test_connection()
if success:
    print("✅ Database connection OK")
else:
    print(f"❌ Connection failed: {message}")
```
### Privileges Verification
**Important**: This library requires `CREATE TEMP TABLE` privileges to function properly. The client automatically verifies these privileges during initialization.
```python
from sqlalchemy_psql_upsert import PostgresqlUpsert, PgConfig

# Test connection and privileges
config = PgConfig()
try:
    # This will automatically test temp table privileges
    upserter = PostgresqlUpsert(config=config)
    print("✅ Connection and privileges verified successfully")
except PermissionError as e:
    print(f"❌ Privilege error: {e}")
    print("Solution: Grant temporary privileges with:")
    print("GRANT TEMPORARY ON DATABASE your_database TO your_user;")
except Exception as e:
    print(f"❌ Connection failed: {e}")
```
**Grant Required Privileges:**
```sql
-- As database administrator, grant temporary table privileges
GRANT TEMPORARY ON DATABASE your_database TO your_user;
-- Alternatively, grant all database-level privileges (CREATE, CONNECT, TEMPORARY).
-- Note that CREATE on its own does not allow creating temporary tables.
GRANT ALL PRIVILEGES ON DATABASE your_database TO your_user;
```
### Basic Usage
```python
import pandas as pd
from sqlalchemy_psql_upsert import PostgresqlUpsert, PgConfig

# Configure database connection
config = PgConfig()  # Loads from environment variables
upserter = PostgresqlUpsert(config=config)

# Prepare your data
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com']
})

# Perform upsert
success, affected_rows = upserter.upsert_dataframe(
    dataframe=df,
    table_name='users',
    schema='public'
)
print(f"✅ Upserted {affected_rows} rows successfully")
```
### Advanced Configuration
```python
from sqlalchemy import create_engine
from sqlalchemy_psql_upsert import PostgresqlUpsert

# Using custom SQLAlchemy engine
engine = create_engine('postgresql://user:pass@localhost:5432/mydb')
upserter = PostgresqlUpsert(engine=engine, debug=True)

# Upsert with custom schema (large_df is a pandas DataFrame prepared beforehand)
success, affected_rows = upserter.upsert_dataframe(
    dataframe=large_df,
    table_name='products',
    schema='inventory'
)
```
### Data Type Handling
The library automatically handles pandas data type conversions:
```python
import pandas as pd
import numpy as np

# DataFrame with NaN values
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', None, 'David'],  # None values
    'score': [85.5, np.nan, 92.0, 88.1],      # NaN values
    'active': [True, False, None, True]       # Mixed types with None
})

# All NaN and None values are automatically converted to PostgreSQL NULL
upserter.upsert_dataframe(df, 'users')
# Result: NaN/None → NULL in PostgreSQL
```
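The conversion itself can be pictured with a small standard-library sketch. The function `nan_to_none` is a hypothetical name used only for illustration, and it assumes rows are available as plain dictionaries; how the library performs the conversion internally is not shown here.

```python
import math


def nan_to_none(records):
    """Return a copy of the row dicts with float NaN replaced by None (hypothetical helper)."""
    cleaned = []
    for row in records:
        cleaned.append({
            # NaN is a float that is not equal to itself; math.isnan detects it safely
            key: None if isinstance(value, float) and math.isnan(value) else value
            for key, value in row.items()
        })
    return cleaned


rows = [
    {"id": 1, "score": 85.5},
    {"id": 2, "score": float("nan")},
]
print(nan_to_none(rows))
```

Database drivers generally send Python `None` as SQL `NULL`, which is why replacing NaN with `None` before the insert is enough.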
## 🔍 How It Works
### Overview: Temporary Table + CTE + MERGE Strategy
This library uses a sophisticated 3-stage approach for reliable, high-performance upserts:
1. **Temporary Table Staging**: Data is first loaded into a PostgreSQL temporary table
2. **CTE-based Conflict Analysis**: Advanced Common Table Expressions analyze and resolve conflicts
3. **Atomic MERGE Operation**: Single transaction applies all changes using PostgreSQL MERGE
### Stage 1: Temporary Table Creation
```sql
-- Creates a temporary table with identical structure to target table
CREATE TEMP TABLE "temp_upsert_12345678" (
    id INTEGER,
    email VARCHAR,
    doc_type VARCHAR,
    doc_number VARCHAR
);

-- Bulk insert all DataFrame data into temporary table
INSERT INTO "temp_upsert_12345678" VALUES (...);
```
**Benefits:**
- ✅ Isolates staging data from target table
- ✅ Enables complex analysis without affecting production data
- ✅ Automatic cleanup when session ends
- ✅ High-performance bulk inserts
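
As a rough illustration of the staging step, the sketch below generates a uniquely named temporary table and the matching parameterized insert. The function `build_staging_sql`, the random suffix scheme, and the `:name` placeholder style are assumptions for this example, not the library's actual code.

```python
import uuid


def build_staging_sql(columns):
    """Build CREATE TEMP TABLE and INSERT statements for a staging table (hypothetical helper).

    columns maps column names to SQL type names, e.g. {"id": "INTEGER"}.
    """
    # A random 8-character suffix keeps concurrent sessions from clashing on names
    temp_name = f"temp_upsert_{uuid.uuid4().hex[:8]}"
    column_defs = ", ".join(f'"{name}" {sql_type}' for name, sql_type in columns.items())
    create_sql = f'CREATE TEMP TABLE "{temp_name}" ({column_defs})'
    column_list = ", ".join(f'"{name}"' for name in columns)
    # Named placeholders in the style accepted by sqlalchemy.text()
    placeholders = ", ".join(f":{name}" for name in columns)
    insert_sql = f'INSERT INTO "{temp_name}" ({column_list}) VALUES ({placeholders})'
    return temp_name, create_sql, insert_sql


name, create_sql, insert_sql = build_staging_sql({"id": "INTEGER", "email": "VARCHAR"})
print(create_sql)
print(insert_sql)
```

Executing the insert once with a list of row dictionaries lets the driver batch the rows, which is where the bulk-insert benefit comes from.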
### Stage 2: CTE-based Conflict Analysis
The library generates sophisticated CTEs to handle complex conflict scenarios:
```sql
WITH temp_with_conflicts AS (
    -- Analyze each temp row against ALL constraints simultaneously
    SELECT temp.*, temp.ctid,
           COUNT(DISTINCT target.id) AS conflict_targets,
           MAX(target.id) AS conflicted_target_id
    FROM "temp_upsert_12345678" temp
    LEFT JOIN "public"."target_table" target
        ON (temp.id = target.id) OR                  -- PK conflict
           (temp.email = target.email) OR            -- Unique conflict
           (temp.doc_type = target.doc_type AND      -- Composite unique conflict
            temp.doc_number = target.doc_number)
    GROUP BY temp.ctid, temp.id, temp.email, temp.doc_type, temp.doc_number
),
filtered AS (
    -- Filter out rows that conflict with multiple target rows (ambiguous)
    SELECT * FROM temp_with_conflicts
    WHERE conflict_targets <= 1
),
ranked AS (
    -- Deduplicate temp rows targeting the same existing row
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY COALESCE(conflicted_target_id, id)
               ORDER BY ctid DESC
           ) AS row_rank
    FROM filtered
),
clean_rows AS (
    -- Final dataset: only the latest row per target
    SELECT id, email, doc_type, doc_number
    FROM ranked
    WHERE row_rank = 1
)
-- Ready for MERGE...
```
**CTE Logic Breakdown:**
1. **`temp_with_conflicts`**: Joins the temporary table against the target table using ALL constraints simultaneously, counting how many existing rows each temp row conflicts with.
2. **`filtered`**: Removes ambiguous rows that would conflict with multiple existing records, keeping only rows with `conflict_targets <= 1` [ _1 to many conflict_ ].
3. **`ranked`**: When multiple temp rows target the same existing record, ranks them in descending insertion order using the temporary table's `ctid` [ _many to 1 conflict_ ].
4. **`clean_rows`**: Keeps only the last row inserted for each target, producing the final clean dataset for the atomic upsert.
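
The four steps above can be traced in plain Python on a tiny in-memory dataset. This is only a simulation of the logic under stated assumptions: the function names are hypothetical, list position stands in for `ctid`, and `None` stands in for SQL `NULL` (which never matches in an equality comparison).

```python
def conflicts(row, target_row, constraints):
    """True when the staged row matches the target row on at least one constraint."""
    return any(
        all(row[col] is not None and row[col] == target_row[col] for col in cols)
        for cols in constraints
    )


def clean_rows(staged, target, constraints, key="id"):
    # Step 1 (temp_with_conflicts): collect the target keys each staged row collides with
    analysed = []
    for position, row in enumerate(staged):
        hits = {t[key] for t in target if conflicts(row, t, constraints)}
        analysed.append((position, row, hits))

    # Step 2 (filtered): drop rows that collide with more than one target row
    filtered = [item for item in analysed if len(item[2]) <= 1]

    # Steps 3 and 4 (ranked, clean_rows): keep the last staged row per target
    latest = {}
    for position, row, hits in filtered:
        group = next(iter(hits)) if hits else row[key]  # COALESCE(conflicted_target_id, id)
        latest[group] = (position, row)                 # later rows overwrite earlier ones
    return [row for _, row in sorted(latest.values(), key=lambda item: item[0])]


constraints = [("id",), ("email",), ("doc_type", "doc_number")]
target = [
    {"id": 1, "email": "a@x.com", "doc_type": "ID", "doc_number": "111"},
    {"id": 2, "email": "b@x.com", "doc_type": "ID", "doc_number": "222"},
]
staged = [
    {"id": 1, "email": "b@x.com", "doc_type": "ID", "doc_number": "999"},   # hits rows 1 and 2: ambiguous
    {"id": 2, "email": "b@x.com", "doc_type": "ID", "doc_number": "333"},   # hits row 2, superseded below
    {"id": 2, "email": "b2@x.com", "doc_type": "ID", "doc_number": "444"},  # hits row 2, inserted last
    {"id": 3, "email": "c@x.com", "doc_type": "ID", "doc_number": "555"},   # no conflict: new row
]
result = clean_rows(staged, target, constraints)
print([row["doc_number"] for row in result])
```

Only the last update for target row 2 and the brand-new row 3 survive; the first staged row is skipped because it matches two different existing records.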
### Stage 3: Atomic MERGE Operation
```sql
-- clean_rows is the final CTE from Stage 2; its WITH clause precedes this statement
MERGE INTO "public"."target_table" AS tgt
USING clean_rows AS src
    ON (tgt.id = src.id) OR
       (tgt.email = src.email) OR
       (tgt.doc_type = src.doc_type AND tgt.doc_number = src.doc_number)
WHEN MATCHED THEN
    UPDATE SET email = src.email, doc_type = src.doc_type, doc_number = src.doc_number
WHEN NOT MATCHED THEN
    INSERT (id, email, doc_type, doc_number)
    VALUES (src.id, src.email, src.doc_type, src.doc_number);
```
**Benefits:**
- ✅ Single atomic transaction
- ✅ Handles all conflict types simultaneously
- ✅ Automatic INSERT or UPDATE decision
- ✅ No race conditions or partial updates
### Constraint Detection
The library automatically analyzes your target table to identify:
- **Primary key constraints**: Single or composite primary keys
- **Unique constraints**: Single column unique constraints
- **Composite unique constraints**: Multi-column unique constraints
All constraint types are handled simultaneously in a single operation.
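
To show how detected constraints could turn into the `ON` condition used in the join and the MERGE, here is a small sketch. The function `build_match_condition` and the shape of the `constraints` list are assumptions for illustration; in practice the column lists might come from SQLAlchemy's `Inspector.get_pk_constraint` and `Inspector.get_unique_constraints`.

```python
def build_match_condition(constraints, left="tgt", right="src"):
    """Combine constraint column lists into one OR-joined SQL condition (hypothetical helper)."""
    clauses = []
    for columns in constraints:
        # Every column of a composite constraint must match, hence AND within a clause
        parts = [f'{left}."{col}" = {right}."{col}"' for col in columns]
        clauses.append("(" + " AND ".join(parts) + ")")
    # A match on any single constraint counts as a conflict, hence OR between clauses
    return " OR ".join(clauses)


constraints = [["id"], ["email"], ["doc_type", "doc_number"]]
print(build_match_condition(constraints))
```

Quoting each column name keeps the generated SQL valid for identifiers that contain uppercase letters or reserved words.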
## 🚨 Pros, Cons & Considerations
### ✅ Advantages of Temporary Table + CTE + MERGE Approach
**Performance Benefits:**
- **Single Transaction**: Entire operation is atomic, no partial updates or race conditions
- **Bulk Operations**: High-performance bulk inserts into temporary tables
- **Efficient Joins**: PostgreSQL optimizes joins between temporary and main tables
- **Minimal Locking**: Temporary tables don't interfere with concurrent operations
**Reliability Benefits:**
- **Comprehensive Conflict Resolution**: Handles all constraint types simultaneously
- **Deterministic Results**: Same input always produces same output
- **Automatic Cleanup**: Temporary tables are automatically dropped
- **ACID Compliance**: Full transaction safety and rollback capability
**Data Integrity Benefits:**
- **Ambiguity Detection**: Automatically detects and skips problematic rows
- **Deduplication**: Handles duplicate input data intelligently
- **Constraint Validation**: PostgreSQL validates all constraints during MERGE
### ❌ Limitations and Trade-offs
**Resource Requirements:**
- **Memory Usage**: All input data is staged in temporary tables, which PostgreSQL keeps in session-local buffers (`temp_buffers`) and spills to disk once they outgrow them
- **Temporary Space**: Requires sufficient temporary storage for staging tables
- **Single-threaded**: No parallel processing (traded for reliability and simplicity)
**PostgreSQL Specifics:**
- **Version Dependency**: MERGE statement requires PostgreSQL 15+
- **Session-based Temp Tables**: Temporary tables are tied to database sessions
**Privilege-related Limitations:**
- **Database Administrator**: May need DBA assistance to grant `TEMPORARY` privileges
- **Shared Hosting**: Some cloud providers restrict temporary table creation
- **Security Policies**: Corporate environments may restrict temporary table usage
**Scale Considerations:**
- **Large Dataset Handling**: Very large datasets (>1M rows) may require memory tuning
- **Transaction Duration**: Entire operation happens in one transaction (longer lock times)
### 🎯 Best Practices
**Memory Management:**
```python
# For large datasets, monitor memory usage
import pandas as pd

# Consider chunking very large datasets manually if needed
def chunk_dataframe(df, chunk_size=50000):
    # Yield consecutive row slices of at most chunk_size rows
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

# Process in manageable chunks (large_df is a DataFrame prepared beforehand)
for chunk in chunk_dataframe(large_df):
    success, rows = upserter.upsert_dataframe(chunk, 'target_table')
    print(f"Processed chunk: {rows} rows")
```
**Performance Optimization:**
```python
# Ensure proper indexing on conflict columns
# CREATE INDEX idx_email ON target_table(email);
# CREATE INDEX idx_composite ON target_table(doc_type, doc_number);
# Use debug mode to monitor performance
upserter = PostgresqlUpsert(engine=engine, debug=True)
```
**Error Handling:**
```python
import logging

logger = logging.getLogger(__name__)

try:
    success, affected_rows = upserter.upsert_dataframe(df, 'users')
    logger.info(f"Successfully upserted {affected_rows} rows")
except ValueError as e:
    logger.error(f"Validation error: {e}")
except Exception as e:
    # No manual rollback is needed: the transaction is rolled back automatically
    logger.error(f"Upsert failed: {e}")
```
## 🤝 Contributing
We welcome contributions! Here's how to get started:
1. **Fork the repository**
2. **Create a feature branch**: `git checkout -b feature/amazing-feature`
3. **Make your changes** and add tests
4. **Run the test suite**: `pytest tests/ -v`
5. **Submit a pull request**
## 📝 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙋 Support
- **Issues**: [GitHub Issues](https://github.com/machado000/sqlalchemy-psql-upsert/issues)
- **Documentation**: Check the docstrings and test files for detailed usage examples