https://github.com/mrmcmullan/flycatcher
Define your schema once & for all — built for DataFrames, powered across Pydantic, Polars, and SQLAlchemy.
- Host: GitHub
- URL: https://github.com/mrmcmullan/flycatcher
- Owner: mrmcmullan
- License: mit
- Created: 2025-11-18T19:53:13.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-12-10T15:58:08.000Z (4 months ago)
- Last Synced: 2026-02-16T11:53:30.840Z (about 1 month ago)
- Topics: data-engineering, data-validation, dataframe, etl, orm, polars, pydantic, python, python3, schema, sqlalchemy, type-checking, validation
- Language: Python
- Homepage: https://mrmcmullan.github.io/flycatcher/
- Size: 398 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README

Define your schema once. Validate at scale. Stay columnar.
Built for DataFrames, powered across Pydantic, Polars, and SQLAlchemy.
---
Flycatcher is a **DataFrame-native schema layer** for Python. Define your data model once and generate optimized representations for every part of your stack:
- 🎯 **Pydantic models** for API validation & serialization
- ⚡ **Polars validators** for blazing-fast bulk validation
- 🗄️ **SQLAlchemy tables** for typed database access
**Built for modern data workflows:** Validate millions of rows at high speed, keep schema drift at zero, and stay columnar end-to-end.
## ❓ Why Flycatcher?
Modern Python data projects need **row-level validation** (Pydantic), **efficient bulk operations** (Polars), and **typed database queries** (SQLAlchemy). But maintaining separate schemas across this stack leads to duplication, drift, and manual juggling of row-oriented and columnar paradigms.
**Flycatcher solves this:** One schema definition → three optimized outputs.
```python
from flycatcher import Schema, Field, col, model_validator
class ProductSchema(Schema):
    id: int = Field(primary_key=True)
    name: str = Field(min_length=3, max_length=100)
    price: float = Field(gt=0)
    discount_price: float | None = Field(default=None, gt=0, nullable=True)

    @model_validator
    def check_discount():
        # Cross-field validation with the col() DSL
        return (
            col('discount_price') < col('price'),
            "Discount price must be less than regular price"
        )
# Generate three optimized representations
ProductModel = ProductSchema.to_pydantic() # → Pydantic BaseModel
ProductValidator = ProductSchema.to_polars_validator() # → Polars DataFrame validator
ProductTable = ProductSchema.to_sqlalchemy() # → SQLAlchemy Table
```
**Flycatcher lets you stay DataFrame-native without giving up the speed of Polars, the ergonomic validation of Pydantic, or the Pythonic power of SQLAlchemy**.
---
## 🚀 Quick Start
### Installation
```bash
pip install flycatcher
# or
uv add flycatcher
```
### Define Your Schema
```python
from datetime import datetime
from flycatcher import Schema, Field
class UserSchema(Schema):
    id: int = Field(primary_key=True)
    username: str = Field(min_length=3, max_length=50, unique=True)
    email: str = Field(pattern=r'^[^@]+@[^@]+\.[^@]+$', unique=True, index=True)
    age: int = Field(ge=13, le=120)
    is_active: bool = Field(default=True)
    created_at: datetime
```
### Use Pydantic for Row-Level Validation
Perfect for APIs, forms, and single-record validation:
```python
from datetime import datetime, timezone

User = UserSchema.to_pydantic()

# Validates constraints automatically via Pydantic
user = User(
    id=1,
    username="alice",
    email="alice@example.com",
    age=25,
    created_at=datetime.now(timezone.utc)
)

# Serialize to JSON/dict
print(user.model_dump_json())
```
### Use Polars for Bulk Validation
Perfect for DataFrame-level validation:
```python
import polars as pl
UserValidator = UserSchema.to_polars_validator()
# Validate 1M+ rows with blazing speed
df = pl.read_csv("users.csv")
validated_df = UserValidator.validate(df, strict=True)
validated_df.write_parquet("validated_users.parquet")
```
### Use SQLAlchemy for Database Operations
Perfect for typed queries and database interactions:
```python
from sqlalchemy import create_engine
UserTable = UserSchema.to_sqlalchemy(table_name="users")
engine = create_engine("postgresql://localhost/mydb")
# Type-safe queries
with engine.connect() as conn:
    result = conn.execute(
        UserTable.select()
        .where(UserTable.c.is_active == True)
        .where(UserTable.c.age >= 18)
    )
    for row in result:
        print(row)
```
---
## ✨ Key Features
### Rich Field Types & Constraints
Use standard Python types with `Field(...)` constraints:
| Python Type | Constraints | Example |
|-------------|-------------|---------|
| `int` | `ge`, `gt`, `le`, `lt`, `multiple_of` | `age: int = Field(ge=0, le=120)` |
| `float` | `ge`, `gt`, `le`, `lt` | `price: float = Field(gt=0)` |
| `str` | `min_length`, `max_length`, `pattern` | `email: str = Field(pattern=r'^[^@]+@...')` |
| `bool` | - | `is_active: bool = Field(default=True)` |
| `datetime` | `ge`, `gt`, `le`, `lt` | `created_at: datetime = Field(ge=datetime(2020, 1, 1))` |
| `date` | `ge`, `gt`, `le`, `lt` | `birth_date: date` |
**All fields support:** `nullable`, `default`, `description`
**SQLAlchemy-specific:** `primary_key`, `unique`, `index`, `autoincrement`
### Custom & Cross-Field Validation
Use the `col()` DSL for powerful field-level and cross-field validation that works across both Pydantic and Polars:
```python
from datetime import datetime
from flycatcher import Schema, Field, col, model_validator
class BookingSchema(Schema):
    email: str
    phone: str
    check_in: datetime = Field(ge=datetime(2024, 1, 1))
    check_out: datetime = Field(ge=datetime(2024, 1, 1))
    nights: int = Field(ge=1)

    @model_validator
    def check_dates():
        return (
            col('check_out') > col('check_in'),
            "Check-out must be after check-in"
        )

    @model_validator
    def check_phone_format():
        cleaned = col('phone').str.replace(r'[^\d]', '')
        return (cleaned.str.len_chars() == 10, "Phone must have 10 digits")

    @model_validator
    def check_minimum_stay():
        # For operations not yet in the DSL (like .is_in()), use the explicit
        # per-backend format. Note: .dt.month() is available in the DSL, but
        # .is_in() is not yet supported.
        import polars as pl
        return {
            'polars': (
                (~pl.col('check_in').dt.month().is_in([7, 8])) | (pl.col('nights') >= 3),
                "Minimum stay in July and August is 3 nights"
            ),
            'pydantic': lambda v: (
                v.check_in.month not in [7, 8] or v.nights >= 3,
                "Minimum stay in July and August is 3 nights"
            )
        }
```
### Validation Modes
Polars validation supports flexible error handling:
```python
# Strict mode: Raise on validation errors (default)
validated_df = UserValidator.validate(df, strict=True)
# Non-strict mode: Filter out invalid rows
valid_df = UserValidator.validate(df, strict=False)
# Show violations for debugging
validated_df = UserValidator.validate(df, strict=True, show_violations=True)
```
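The two modes correspond to the usual raise-versus-filter trade-off. Here is a stdlib-only sketch of that semantics on plain dict rows; the function and its signature are illustrative, not Flycatcher's API:

```python
from typing import Callable

def validate_rows(rows: list[dict], predicate: Callable[[dict], bool],
                  strict: bool = True) -> list[dict]:
    """Strict mode raises on any invalid row; non-strict drops invalid rows."""
    valid, invalid = [], []
    for row in rows:
        (valid if predicate(row) else invalid).append(row)
    if strict and invalid:
        raise ValueError(f"{len(invalid)} row(s) failed validation")
    return valid

rows = [{"age": 25}, {"age": -1}, {"age": 40}]
ok = validate_rows(rows, lambda r: r["age"] >= 0, strict=False)
print(len(ok))  # 2
```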
---
## 🎯 Complete Example: ETL Pipeline
```python
import polars as pl
from datetime import datetime
from flycatcher import Schema, Field, col, model_validator
from sqlalchemy import create_engine, MetaData
# 1. Define schema once
class OrderSchema(Schema):
    order_id: int = Field(primary_key=True)
    customer_email: str = Field(pattern=r'^[^@]+@[^@]+\.[^@]+$', index=True)
    amount: float = Field(gt=0)
    tax: float = Field(ge=0)
    total: float = Field(gt=0)
    created_at: datetime

    @model_validator
    def check_total():
        return (
            col('total') == col('amount') + col('tax'),
            "Total must equal amount + tax"
        )
# 2. Extract & Validate with Polars (handles millions of rows)
OrderValidator = OrderSchema.to_polars_validator()
df = pl.read_csv("orders.csv")
validated_df = OrderValidator.validate(df, strict=True)
# 3. Load to database with SQLAlchemy
OrderTable = OrderSchema.to_sqlalchemy(table_name="orders")
engine = create_engine("postgresql://localhost/analytics")
with engine.connect() as conn:
    conn.execute(OrderTable.insert(), validated_df.to_dicts())
    conn.commit()
```
✅ **Result:** Validated millions of rows, enforced business rules, and loaded to database — all from one schema definition.
---
## 🏗️ Design Philosophy
**One schema, three representations. Each optimized for its use case.**
```
        Schema Definition
               ↓
    ┌──────────┼──────────┐
    ↓          ↓          ↓
Pydantic    Polars   SQLAlchemy
    ↓          ↓          ↓
  APIs        ETL      Database
```
### What Flycatcher Does
✅ Single source of truth for schema definitions
✅ Generate optimized representations for different use cases
✅ Keep runtimes separate (no ORM ↔ DataFrame conversions)
✅ Use stable public APIs (Pydantic, Polars, SQLAlchemy)
### What Flycatcher Doesn't Do
❌ Mix row-oriented and columnar paradigms
❌ Create a "unified runtime" (that would be slow)
❌ Reinvent validation logic (delegates to proven libraries when possible)
❌ Depend on internal APIs
---
## ⚠️ Current Limitations (v0.1.0)
Flycatcher v0.1.0 is an **alpha release**. The core functionality is stable, but some advanced features are planned for future versions:
### Polars DSL
The `col()` DSL supports **basic operations** (`>`, `<`, `==`, `+`, etc.),
**numeric math operations** (`.abs()`, `.round()`, `.floor()`, `.ceil()`, `.sqrt()`, `.pow()`),
**limited string operations** (`.str.contains()`, `.str.starts_with()`, `.str.len_chars()`, etc.),
and a **limited datetime accessor** (`.dt.year()`, `.dt.month()`, `.dt.total_days(other)`, etc.).
The `col()` DSL does not yet cover the full range of Polars operations; additional operations will be added in future versions.
**Workaround**: Use the explicit format in `@model_validator`:
```python
import polars as pl

@model_validator
def check():
    return {
        'polars': (pl.col('field').is_null(), "Message"),
        'pydantic': lambda v: (v.field is None, "Message")
    }
```
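For intuition about why DSL coverage grows operation by operation: a `col()`-style DSL typically builds a deferred expression tree, and each operation must be translatable to every backend. Below is a minimal stdlib-only sketch of that pattern with a row-oriented evaluator; it illustrates the concept, not Flycatcher's actual implementation:

```python
class Col:
    """Deferred column reference: comparisons build expressions, not results."""
    def __init__(self, name: str):
        self.name = name
    def __lt__(self, other: "Col") -> "Expr":
        return Expr(self, "<", other)
    def __gt__(self, other: "Col") -> "Expr":
        return Expr(self, ">", other)

class Expr:
    """Expression tree node, evaluated only when handed to a backend."""
    def __init__(self, left: Col, op: str, right: Col):
        self.left, self.op, self.right = left, op, right
    def evaluate(self, row: dict) -> bool:
        # Row-oriented backend (Pydantic-style). A columnar backend would
        # instead translate this tree into a Polars expression.
        a, b = row[self.left.name], row[self.right.name]
        return a < b if self.op == "<" else a > b

def col(name: str) -> Col:
    return Col(name)

rule = col("discount_price") < col("price")   # builds a tree, evaluates nothing
print(rule.evaluate({"discount_price": 8.0, "price": 10.0}))  # True
```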
### Pydantic Features
- ❌ `@field_validator` - Only `@model_validator` is supported (coming in v0.2.0)
- ❌ Field aliases and computed fields (coming in v0.2.0+)
- ❌ Custom serialization options (coming in v0.2.0+)
**Workaround**: Use `@model_validator` for all validation needs.
### SQLAlchemy Features
- ❌ Foreign key relationships - Must be added manually after table generation (coming in v0.3.0+)
- ❌ Composite primary keys - Only single-field primary keys supported (coming in v0.3.0+)
- ❌ Function-based defaults (e.g., `default=func.now()`) - Only literal defaults supported
**Workaround**: Add relationships and composite keys manually in SQLAlchemy after table generation.
### Field Types
- ❌ Enum, UUID, JSON, Array field types (coming in v0.3.0+)
- ❌ Numeric/Decimal field type (coming in v0.3.0+)
**Workaround**: Use `String` with pattern validation or manual handling.
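As one example of that workaround, a UUID column can be modelled as a string with a pattern constraint until native UUID support lands. The regex below is the standard 8-4-4-4-12 hex layout; the commented `Field` usage mirrors the README's examples and is an assumption:

```python
import re

# Canonical 8-4-4-4-12 hex UUID layout
UUID_PATTERN = (
    r'^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}'
    r'-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
)

# Hypothetical Flycatcher usage, mirroring the Field examples above:
# user_uuid: str = Field(pattern=UUID_PATTERN)

print(bool(re.match(UUID_PATTERN, "123e4567-e89b-12d3-a456-426614174000")))  # True
print(bool(re.match(UUID_PATTERN, "not-a-uuid")))                            # False
```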
---
## 📊 Comparison
| Feature | Flycatcher | SQLModel | Patito |
|---------|-----------|----------|--------|
| Pydantic support | ✅ | ✅ | ✅ |
| Polars support | ✅ | ❌ | ✅ |
| SQLAlchemy support | ✅ | ✅ | ❌ |
| DataFrame-level DB ops | 🚧 (v0.2) | ❌ | ❌ |
| Cross-field validation | ✅ | ⚠️ (Pydantic only) | ⚠️ (Polars only) |
| Single schema definition | ✅ | ⚠️ (Pydantic + ORM hybrid) | ⚠️ (Pydantic + Polars hybrid) |
**Flycatcher** is the only library that generates optimized representations for all three systems while keeping them properly separated.
---
## 📚 Documentation
- **[Getting Started](https://mrmcmullan.github.io/flycatcher/)** - Installation and basics
- **[Tutorials](https://mrmcmullan.github.io/flycatcher/tutorials/)** - Step-by-step guides
- **[How-To Guides](https://mrmcmullan.github.io/flycatcher/how-to/)** - Solve specific problems
- **[API Reference](https://mrmcmullan.github.io/flycatcher/api/)** - Complete API documentation
- **[Explanations](https://mrmcmullan.github.io/flycatcher/explanations/)** - Deep dives and concepts
---
## 🛣️ Roadmap
### v0.1.0 (Released) 🚀
- [x] Core schema definition with metaclass
- [x] Field types with constraints (Integer, String, Float, Boolean, Datetime, Date)
- [x] Pydantic model generator
- [x] Polars DataFrame validator with bulk validation
- [x] SQLAlchemy table generator
- [x] Cross-field validators with DSL (`col()`)
- [x] Test suite with 70%+ coverage
- [x] Complete documentation site
- [x] PyPI publication
### v0.2.0 (In Progress) 🚧
**Theme:** Enhanced validation and database operations
- [ ] `@field_validator` support in addition to existing `@model_validator`
- [x] Enhanced Polars DSL: `.is_null()`, `.is_not_null()`, `.str.contains()`, `.str.starts_with()`, `.dt.month()`, `.dt.year()`, `.is_in([...])`, `.is_between()`
- [ ] Pydantic enhancements: field aliases, computed fields, custom serialization
- [ ] Enable inheritance of `Schema` to create subclasses with different fields
- [ ] For more details, see the [GitHub Milestone for v0.2.0](https://github.com/mrmcmullan/flycatcher/milestone/2)
### v0.3.0 (Planned)
- [ ] DataFrame-level queries (`Schema.query()`)
- [ ] Bulk write operations (`Schema.insert()`, `Schema.update()`, `Schema.upsert()`)
- [ ] Complete ETL loop staying columnar end-to-end
- [ ] Add PascalCase metaclass
- [ ] Additional Pydantic validation modes (`mode='before'`, `mode='wrap'`)
- [ ] For more details, see the [GitHub Milestone for v0.3.0](https://github.com/mrmcmullan/flycatcher/milestone/3)
### v0.4.0+ (Future)
**Theme:** Advanced field types and relationships
- [ ] Additional field types: Enum, UUID, JSON, Array, Numeric/Decimal, Time, Binary, Interval
- [ ] SQLAlchemy relationships: Foreign keys, composite primary keys
- [ ] SQLAlchemy function-based defaults (e.g., `default=func.now()`)
- [ ] JOIN support in queries
- [ ] Aggregations (GROUP BY, COUNT, SUM)
- [ ] Schema migrations helper
## 🤝 Contributing
Contributions are welcome! Please see our [Contributing Guide] for details.
---
## 📄 License
MIT License - see [LICENSE](https://github.com/mrmcmullan/flycatcher?tab=MIT-1-ov-file) for details.
---
## 💬 Community
- **[GitHub Issues](https://github.com/mrmcmullan/flycatcher/issues)** - Bug reports and feature requests
- **[GitHub Discussions](https://github.com/mrmcmullan/flycatcher/discussions)** - Questions and community discussion
- **[Documentation](https://mrmcmullan.github.io/flycatcher)** - Full guides and API reference
---
Built with ❤️ for the DataFrame generation