https://github.com/mrmcmullan/flycatcher
Define your schema once & for all — built for DataFrames, powered across Pydantic, Polars, and SQLAlchemy.
- Host: GitHub
- URL: https://github.com/mrmcmullan/flycatcher
- Owner: mrmcmullan
- License: mit
- Created: 2025-11-18T19:53:13.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-12-10T15:58:08.000Z (4 months ago)
- Last Synced: 2026-02-16T11:53:30.840Z (about 1 month ago)
- Topics: data-engineering, data-validation, dataframe, etl, orm, polars, pydantic, python, python3, schema, sqlalchemy, type-checking, validation
- Language: Python
- Homepage: https://mrmcmullan.github.io/flycatcher/
- Size: 398 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README

Define your schema once. Validate at scale. Stay columnar.
Built for DataFrames, powered across Pydantic, Polars, and SQLAlchemy.
---
Flycatcher is a **DataFrame-native schema layer** for Python. Define your data model once and generate optimized representations for every part of your stack:
- 🎯 **Pydantic models** for API validation & serialization
- ⚡ **Polars validators** for blazing-fast bulk validation
- 🗄️ **SQLAlchemy tables** for typed database access
**Built for modern data workflows:** Validate millions of rows at high speed, keep schema drift at zero, and stay columnar end-to-end.
## ❓ Why Flycatcher?
Modern Python data projects need **row-level validation** (Pydantic), **efficient bulk operations** (Polars), and **typed database queries** (SQLAlchemy). But maintaining separate schemas across this stack leads to duplication, drift, and manual juggling of row-oriented and columnar paradigms.
**Flycatcher solves this:** One schema definition → three optimized outputs.
```python
from flycatcher import Schema, Field, col, model_validator
class ProductSchema(Schema):
    id: int = Field(primary_key=True)
    name: str = Field(min_length=3, max_length=100)
    price: float = Field(gt=0)
    discount_price: float | None = Field(default=None, gt=0, nullable=True)

    @model_validator
    def check_discount():
        # Cross-field validation with the col() DSL
        return (
            col('discount_price') < col('price'),
            "Discount price must be less than regular price"
        )
# Generate three optimized representations
ProductModel = ProductSchema.to_pydantic() # → Pydantic BaseModel
ProductValidator = ProductSchema.to_polars_validator() # → Polars DataFrame validator
ProductTable = ProductSchema.to_sqlalchemy() # → SQLAlchemy Table
```
**Flycatcher lets you stay DataFrame-native without giving up the speed of Polars, the ergonomic validation of Pydantic, or the Pythonic power of SQLAlchemy**.
---
## 🚀 Quick Start
### Installation
```bash
pip install flycatcher
# or
uv add flycatcher
```
### Define Your Schema
```python
from datetime import datetime
from flycatcher import Schema, Field
class UserSchema(Schema):
    id: int = Field(primary_key=True)
    username: str = Field(min_length=3, max_length=50, unique=True)
    email: str = Field(pattern=r'^[^@]+@[^@]+\.[^@]+$', unique=True, index=True)
    age: int = Field(ge=13, le=120)
    is_active: bool = Field(default=True)
    created_at: datetime
```
### Use Pydantic for Row-Level Validation
Perfect for APIs, forms, and single-record validation:
```python
from datetime import datetime, timezone

User = UserSchema.to_pydantic()

# Validates constraints automatically via Pydantic
user = User(
    id=1,
    username="alice",
    email="alice@example.com",
    age=25,
    created_at=datetime.now(timezone.utc)
)

# Serialize to JSON/dict
print(user.model_dump_json())
```
### Use Polars for Bulk Validation
Perfect for DataFrame-level validation:
```python
import polars as pl
UserValidator = UserSchema.to_polars_validator()
# Validate 1M+ rows with blazing speed
df = pl.read_csv("users.csv")
validated_df = UserValidator.validate(df, strict=True)
validated_df.write_parquet("validated_users.parquet")
```
### Use SQLAlchemy for Database Operations
Perfect for typed queries and database interactions:
```python
from sqlalchemy import create_engine
UserTable = UserSchema.to_sqlalchemy(table_name="users")
engine = create_engine("postgresql://localhost/mydb")
# Type-safe queries
with engine.connect() as conn:
    result = conn.execute(
        UserTable.select()
        .where(UserTable.c.is_active == True)
        .where(UserTable.c.age >= 18)
    )
    for row in result:
        print(row)
```
---
## ✨ Key Features
### Rich Field Types & Constraints
Use standard Python types with `Field(...)` constraints:
| Python Type | Constraints | Example |
|-------------|-------------|---------|
| `int` | `ge`, `gt`, `le`, `lt`, `multiple_of` | `age: int = Field(ge=0, le=120)` |
| `float` | `ge`, `gt`, `le`, `lt` | `price: float = Field(gt=0)` |
| `str` | `min_length`, `max_length`, `pattern` | `email: str = Field(pattern=r'^[^@]+@...')` |
| `bool` | - | `is_active: bool = Field(default=True)` |
| `datetime` | `ge`, `gt`, `le`, `lt` | `created_at: datetime = Field(ge=datetime(2020, 1, 1))` |
| `date` | `ge`, `gt`, `le`, `lt` | `birth_date: date` |
**All fields support:** `nullable`, `default`, `description`
**SQLAlchemy-specific:** `primary_key`, `unique`, `index`, `autoincrement`
### Custom & Cross-Field Validation
Use the `col()` DSL for powerful field-level and cross-field validation that works across both Pydantic and Polars:
```python
from datetime import datetime
from flycatcher import Schema, Field, col, model_validator
class BookingSchema(Schema):
    email: str
    phone: str
    check_in: datetime = Field(ge=datetime(2024, 1, 1))
    check_out: datetime = Field(ge=datetime(2024, 1, 1))
    nights: int = Field(ge=1)

    @model_validator
    def check_dates():
        return (
            col('check_out') > col('check_in'),
            "Check-out must be after check-in"
        )

    @model_validator
    def check_phone_format():
        cleaned = col('phone').str.replace(r'[^\d]', '')
        return (cleaned.str.len_chars() == 10, "Phone must have 10 digits")

    @model_validator
    def check_minimum_stay():
        # For operations not yet in the DSL (like .is_in()), use the explicit
        # per-backend format. Note: .dt.month() is available in the DSL, but
        # .is_in() is not yet supported.
        import polars as pl
        return {
            'polars': (
                (~pl.col('check_in').dt.month().is_in([7, 8])) | (pl.col('nights') >= 3),
                "Minimum stay in July and August is 3 nights"
            ),
            'pydantic': lambda v: (
                v.check_in.month not in [7, 8] or v.nights >= 3,
                "Minimum stay in July and August is 3 nights"
            )
        }
```
### Validation Modes
Polars validation supports flexible error handling:
```python
# Strict mode: Raise on validation errors (default)
validated_df = UserValidator.validate(df, strict=True)
# Non-strict mode: Filter out invalid rows
valid_df = UserValidator.validate(df, strict=False)
# Show violations for debugging
validated_df = UserValidator.validate(df, strict=True, show_violations=True)
```
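The two modes correspond to the usual raise-versus-filter trade-off. Here is a stdlib-only sketch of that semantics on plain dict rows; the function and its signature are illustrative, not Flycatcher's API:

```python
from typing import Callable

def validate_rows(rows: list[dict], predicate: Callable[[dict], bool],
                  strict: bool = True) -> list[dict]:
    """Strict mode raises on any invalid row; non-strict drops invalid rows."""
    valid, invalid = [], []
    for row in rows:
        (valid if predicate(row) else invalid).append(row)
    if strict and invalid:
        raise ValueError(f"{len(invalid)} row(s) failed validation")
    return valid

rows = [{"age": 25}, {"age": -1}, {"age": 40}]
ok = validate_rows(rows, lambda r: r["age"] >= 0, strict=False)
print(len(ok))  # 2
```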
---
## 🎯 Complete Example: ETL Pipeline
```python
import polars as pl
from datetime import datetime
from flycatcher import Schema, Field, col, model_validator
from sqlalchemy import create_engine, MetaData
# 1. Define schema once
class OrderSchema(Schema):
    order_id: int = Field(primary_key=True)
    customer_email: str = Field(pattern=r'^[^@]+@[^@]+\.[^@]+$', index=True)
    amount: float = Field(gt=0)
    tax: float = Field(ge=0)
    total: float = Field(gt=0)
    created_at: datetime

    @model_validator
    def check_total():
        return (
            col('total') == col('amount') + col('tax'),
            "Total must equal amount + tax"
        )
# 2. Extract & Validate with Polars (handles millions of rows)
OrderValidator = OrderSchema.to_polars_validator()
df = pl.read_csv("orders.csv")
validated_df = OrderValidator.validate(df, strict=True)
# 3. Load to database with SQLAlchemy
OrderTable = OrderSchema.to_sqlalchemy(table_name="orders")
engine = create_engine("postgresql://localhost/analytics")
with engine.connect() as conn:
    conn.execute(OrderTable.insert(), validated_df.to_dicts())
    conn.commit()
```
✅ **Result:** Validated millions of rows, enforced business rules, and loaded to database — all from one schema definition.
---
## 🏗️ Design Philosophy
**One schema, three representations. Each optimized for its use case.**
```
        Schema Definition
               ↓
    ┌──────────┼──────────┐
    ↓          ↓          ↓
Pydantic    Polars   SQLAlchemy
    ↓          ↓          ↓
  APIs        ETL      Database
```
### What Flycatcher Does
✅ Single source of truth for schema definitions
✅ Generate optimized representations for different use cases
✅ Keep runtimes separate (no ORM ↔ DataFrame conversions)
✅ Use stable public APIs (Pydantic, Polars, SQLAlchemy)
### What Flycatcher Doesn't Do
❌ Mix row-oriented and columnar paradigms
❌ Create a "unified runtime" (that would be slow)
❌ Reinvent validation logic (delegates to proven libraries when possible)
❌ Depend on internal APIs
---
## ⚠️ Current Limitations (v0.1.0)
Flycatcher v0.1.0 is an **alpha release**. The core functionality is stable, but some advanced features are planned for future versions:
### Polars DSL
The `col()` DSL supports **basic operations** (`>`, `<`, `==`, `+`, etc.),
**numeric math operations** (`.abs()`, `.round()`, `.floor()`, `.ceil()`, `.sqrt()`, `.pow()`),
**limited string operations** (`.str.contains()`, `.str.starts_with()`, `.str.len_chars()`, etc.),
and a **limited datetime accessor** (`.dt.year()`, `.dt.month()`, `.dt.total_days(other)`, etc.).
The `col()` DSL does not yet cover the full range of Polars operations; additional operations will be added in future versions.
**Workaround**: Use the explicit format in `@model_validator`:
```python
import polars as pl

@model_validator
def check():
    return {
        'polars': (pl.col('field').is_null(), "Message"),
        'pydantic': lambda v: (v.field is None, "Message")
    }
```
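For intuition about why DSL coverage grows operation by operation: a `col()`-style DSL typically builds a deferred expression tree, and each operation must be translatable to every backend. Below is a minimal stdlib-only sketch of that pattern with a row-oriented evaluator; it illustrates the concept, not Flycatcher's actual implementation:

```python
class Col:
    """Deferred column reference: comparisons build expressions, not results."""
    def __init__(self, name: str):
        self.name = name
    def __lt__(self, other: "Col") -> "Expr":
        return Expr(self, "<", other)
    def __gt__(self, other: "Col") -> "Expr":
        return Expr(self, ">", other)

class Expr:
    """Expression tree node, evaluated only when handed to a backend."""
    def __init__(self, left: Col, op: str, right: Col):
        self.left, self.op, self.right = left, op, right
    def evaluate(self, row: dict) -> bool:
        # Row-oriented backend (Pydantic-style). A columnar backend would
        # instead translate this tree into a Polars expression.
        a, b = row[self.left.name], row[self.right.name]
        return a < b if self.op == "<" else a > b

def col(name: str) -> Col:
    return Col(name)

rule = col("discount_price") < col("price")   # builds a tree, evaluates nothing
print(rule.evaluate({"discount_price": 8.0, "price": 10.0}))  # True
```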
### Pydantic Features
- ❌ `@field_validator` - Only `@model_validator` is supported (coming in v0.2.0)
- ❌ Field aliases and computed fields (coming in v0.2.0+)
- ❌ Custom serialization options (coming in v0.2.0+)
**Workaround**: Use `@model_validator` for all validation needs.
### SQLAlchemy Features
- ❌ Foreign key relationships - Must be added manually after table generation (coming in v0.3.0+)
- ❌ Composite primary keys - Only single-field primary keys supported (coming in v0.3.0+)
- ❌ Function-based defaults (e.g., `default=func.now()`) - Only literal defaults supported
**Workaround**: Add relationships and composite keys manually in SQLAlchemy after table generation.
### Field Types
- ❌ Enum, UUID, JSON, Array field types (coming in v0.3.0+)
- ❌ Numeric/Decimal field type (coming in v0.3.0+)
**Workaround**: Use `String` with pattern validation or manual handling.
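As one example of that workaround, a UUID column can be modelled as a string with a pattern constraint until native UUID support lands. The regex below is the standard 8-4-4-4-12 hex layout; the commented `Field` usage mirrors the README's examples and is an assumption:

```python
import re

# Canonical 8-4-4-4-12 hex UUID layout
UUID_PATTERN = (
    r'^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}'
    r'-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
)

# Hypothetical Flycatcher usage, mirroring the Field examples above:
# user_uuid: str = Field(pattern=UUID_PATTERN)

print(bool(re.match(UUID_PATTERN, "123e4567-e89b-12d3-a456-426614174000")))  # True
print(bool(re.match(UUID_PATTERN, "not-a-uuid")))                            # False
```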
---
## 📊 Comparison
| Feature | Flycatcher | SQLModel | Patito |
|---------|-----------|----------|--------|
| Pydantic support | ✅ | ✅ | ✅ |
| Polars support | ✅ | ❌ | ✅ |
| SQLAlchemy support | ✅ | ✅ | ❌ |
| DataFrame-level DB ops | 🚧 (v0.2) | ❌ | ❌ |
| Cross-field validation | ✅ | ⚠️ (Pydantic only) | ⚠️ (Polars only) |
| Single schema definition | ✅ | ⚠️ (Pydantic + ORM hybrid) | ⚠️ (Pydantic + Polars hybrid) |
**Flycatcher** is the only library that generates optimized representations for all three systems while keeping them properly separated.
---
## 📚 Documentation
- **[Getting Started](https://mrmcmullan.github.io/flycatcher/)** - Installation and basics
- **[Tutorials](https://mrmcmullan.github.io/flycatcher/tutorials/)** - Step-by-step guides
- **[How-To Guides](https://mrmcmullan.github.io/flycatcher/how-to/)** - Solve specific problems
- **[API Reference](https://mrmcmullan.github.io/flycatcher/api/)** - Complete API documentation
- **[Explanations](https://mrmcmullan.github.io/flycatcher/explanations/)** - Deep dives and concepts
---
## 🛣️ Roadmap
### v0.1.0 (Released) 🚀
- [x] Core schema definition with metaclass
- [x] Field types with constraints (Integer, String, Float, Boolean, Datetime, Date)
- [x] Pydantic model generator
- [x] Polars DataFrame validator with bulk validation
- [x] SQLAlchemy table generator
- [x] Cross-field validators with DSL (`col()`)
- [x] Test suite with 70%+ coverage
- [x] Complete documentation site
- [x] PyPI publication
### v0.2.0 (In Progress) 🚧
**Theme:** Enhanced validation and database operations
- [ ] `@field_validator` support in addition to existing `@model_validator`
- [x] Enhanced Polars DSL: `.is_null()`, `.is_not_null()`, `.str.contains()`, `.str.starts_with()`, `.dt.month()`, `.dt.year()`, `.is_in([...])`, `.is_between()`
- [ ] Pydantic enhancements: field aliases, computed fields, custom serialization
- [ ] Enable inheritance of `Schema` to create subclasses with different fields
- [ ] For more details, see the [GitHub Milestone for v0.2.0](https://github.com/mrmcmullan/flycatcher/milestone/2)
### v0.3.0 (Planned)
- [ ] DataFrame-level queries (`Schema.query()`)
- [ ] Bulk write operations (`Schema.insert()`, `Schema.update()`, `Schema.upsert()`)
- [ ] Complete ETL loop staying columnar end-to-end
- [ ] Add PascalCase metaclass
- [ ] Additional Pydantic validation modes (`mode='before'`, `mode='wrap'`)
- [ ] For more details, see the [GitHub Milestone for v0.3.0](https://github.com/mrmcmullan/flycatcher/milestone/3)
### v0.4.0+ (Future)
**Theme:** Advanced field types and relationships
- [ ] Additional field types: Enum, UUID, JSON, Array, Numeric/Decimal, Time, Binary, Interval
- [ ] SQLAlchemy relationships: Foreign keys, composite primary keys
- [ ] SQLAlchemy function-based defaults (e.g., `default=func.now()`)
- [ ] JOIN support in queries
- [ ] Aggregations (GROUP BY, COUNT, SUM)
- [ ] Schema migrations helper
## 🤝 Contributing
Contributions are welcome! Please see our [Contributing Guide] for details.
---
## 📄 License
MIT License - see [LICENSE](https://github.com/mrmcmullan/flycatcher?tab=MIT-1-ov-file) for details.
---
## 💬 Community
- **[GitHub Issues](https://github.com/mrmcmullan/flycatcher/issues)** - Bug reports and feature requests
- **[GitHub Discussions](https://github.com/mrmcmullan/flycatcher/discussions)** - Questions and community discussion
- **[Documentation](https://mrmcmullan.github.io/flycatcher)** - Full guides and API reference
---
Built with ❤️ for the DataFrame generation