https://github.com/thomazyujibaba/ibarrow

Connect legacy ODBC databases (e.g., InterBase, Firebird) to modern Arrow/Pandas/Polars workflows via Rust and Python.
https://github.com/thomazyujibaba/ibarrow
apache arrow interbase odbc
Last synced: 2 months ago
JSON representation
Connect legacy ODBC databases (e.g., InterBase, Firebird) to modern Arrow/Pandas/Polars workflows via Rust and Python.
Host: GitHub
URL: https://github.com/thomazyujibaba/ibarrow
Owner: thomazyujibaba
License: mit
Created: 2025-09-23T18:07:51.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-11-24T07:43:53.000Z (6 months ago)
Last Synced: 2025-11-27T18:28:11.170Z (6 months ago)
Topics: apache, arrow, interbase, odbc
Language: Rust
Homepage:
Size: 74.2 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 9
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Roadmap: ROADMAP.md
Awesome Lists containing this project

README

          # ibarrow

High-performance ODBC to Arrow data conversion for Python, built with Rust.

## Features

- 🚀 **High Performance**: Built with Rust for maximum speed

- 🔄 **ODBC Integration**: Direct connection to any ODBC-compatible database

- 📊 **Arrow Format**: Native Apache Arrow support for efficient data processing

- 🐼 **Pandas/Polars Ready**: Seamless integration with popular Python data libraries

- 🛡️ **Type Safe**: Rust-powered reliability with Python convenience

- 🎯 **Two-Level API**: Simple wrappers for common use + raw functions for advanced control

## Installation

```bash

pip install ibarrow

```

## Repository

- **GitHub**: https://github.com/thomazyujibaba/ibarrow

- **PyPI**: https://pypi.org/project/ibarrow/

- **Documentation**: https://github.com/thomazyujibaba/ibarrow#readme

## Prerequisites

**Important**: You need an ODBC driver installed on your system for ibarrow to work.

### Windows

- **SQL Server**: [ODBC Driver for SQL Server](https://docs.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server)

- **PostgreSQL**: [psqlODBC](https://www.postgresql.org/ftp/odbc/versions/)

- **MySQL**: [MySQL Connector/ODBC](https://dev.mysql.com/downloads/connector/odbc/)

- **Oracle**: [Oracle Instant Client + ODBC](https://www.oracle.com/database/technologies/instant-client/winx64-64-downloads.html)

### Linux

- **SQL Server**: [Microsoft ODBC Driver for SQL Server on Linux](https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server)

- **PostgreSQL**: `sudo apt-get install odbc-postgresql` (Ubuntu/Debian) or `sudo yum install postgresql-odbc` (RHEL/CentOS)

- **MySQL**: `sudo apt-get install libmyodbc` (Ubuntu/Debian) or `sudo yum install mysql-connector-odbc` (RHEL/CentOS)

### macOS

- **Note**: macOS support is currently not available. Please use Windows or Linux for now.

### Verify ODBC Installation

You can verify your ODBC drivers are installed by checking the system:

**Windows:**

```cmd

# Check installed drivers

odbcad32.exe

```

**Linux/macOS:**

```bash

# List available drivers

odbcinst -q -d

```

## API Architecture

ibarrow provides a **two-level API** designed for different user needs:

### 🎯 **High-Level API (Recommended for 95% of users)**

- **`query_polars()`**: Direct Polars DataFrame (zero-copy, fastest)

- **`query_pandas()`**: Direct Pandas DataFrame (maximum compatibility)

### 🔧 **Low-Level API (For advanced users)**

- **`query_arrow_ipc()`**: Raw Arrow IPC bytes (maximum compatibility)

- **`query_arrow_c_data()`**: Raw Arrow C Data Interface (maximum performance)

### 📋 **When to Use Each Level**

| User Type | Recommended Function | Use Case |

|-----------|---------------------|----------|

| **Beginners** | `query_polars()` | 95% of cases - simple and fast |

| **Pandas Users** | `query_pandas()` | When you need Pandas compatibility |

| **Advanced Users** | `query_arrow_ipc()` | When you need raw Arrow data |

| **Performance Critical** | `query_arrow_c_data()` | When you need maximum control |

## Quick Start

### 🔗 **Connection Methods**

ibarrow supports two ways to connect to databases:

#### **Method 1: DSN (Data Source Name)**

```python

# Requires pre-configured DSN in ODBC Data Sources

conn = ibarrow.connect(

    dsn="my_database_dsn",

    user="username", 

    password="password"

)

```

#### **Method 2: Direct Connection String (Recommended)**

```python

# Direct connection like pyodbc - no DSN configuration needed

conn = ibarrow.connect(

    dsn="DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;",

    user="username",

    password="password"

)

```

### 🚀 **Recommended Usage (95% of cases)**

```python

import ibarrow

# Option 1: Using DSN (Data Source Name)

conn = ibarrow.connect(

    dsn="your_dsn",

    user="username",

    password="password"

)

# Option 2: Using direct connection string (like pyodbc)

conn = ibarrow.connect(

    dsn="DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;",

    user="username",

    password="password"

)

# Query and get Polars DataFrame

df = conn.query_polars("SELECT * FROM your_table")

print(df)

```

### With Custom Batch Size

```python

import ibarrow

# Create config with custom batch size

config = ibarrow.QueryConfig(batch_size=2000)

# Create connection with configuration

conn = ibarrow.connect(

    dsn="your_dsn",

    user="username",

    password="password",

    config=config

)

# Query with custom batch size

arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

```

### Advanced Configuration

```python

import ibarrow

# Create custom configuration

config = ibarrow.QueryConfig(

    batch_size=2000,           # Rows per batch

    read_only=True,            # Read-only connection

    connection_timeout=30,      # Connection timeout in seconds

    query_timeout=60,          # Query timeout in seconds

    max_text_size=32768,       # Max text field size

    max_binary_size=16384,     # Max binary field size

    isolation_level="READ_COMMITTED"  # Transaction isolation

)

# Create connection with configuration

conn = ibarrow.connect(

    dsn="your_dsn",

    user="username",

    password="password",

    config=config

)

# Use the connection

arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

```

### Direct DataFrame Integration

```python

import ibarrow

# Direct conversion to Polars DataFrame (uses pl.read_ipc internally)

# Create connection

conn = ibarrow.connect(

    dsn="your_dsn",

    user="username",

    password="password"

)

# Get Polars DataFrame

df_polars = conn.query_polars("SELECT * FROM your_table")

# Get Pandas DataFrame

df_pandas = conn.query_pandas("SELECT * FROM your_table")

print(df_polars)

print(df_pandas)

```

### ⚡ Zero-Copy Performance (Arrow C Data Interface)

For maximum performance, use the Arrow C Data Interface functions that completely eliminate serialization:

```python

import ibarrow

import polars as pl

import pyarrow as pa

# Zero-copy conversion to Polars DataFrame (fastest)

# Create connection

conn = ibarrow.connect(

    dsn="your_dsn",

    user="username",

    password="password"

)

# Get Polars DataFrame directly

df_polars = conn.query_arrow_c_data("SELECT * FROM your_table", return_dataframe=True)

# Or get raw PyCapsules for manual control

schema_capsule, array_capsule = conn.query_arrow_c_data("SELECT * FROM your_table")

# Convert to PyArrow Table using zero-copy

schema = pa.Schema._import_from_c(schema_capsule)

array = pa.Array._import_from_c(array_capsule)

table = pa.Table.from_arrays([array], schema=schema)

# Convert to Polars

df = pl.from_arrow(table)

```

**Arrow C Data Interface Benefits:**

- 🚀 **Zero serialization**: Data passes directly via pointers

- 💾 **Zero copies**: Eliminates memory overhead

- ⚡ **Maximum speed**: Ideal for large datasets

- 🔄 **Compatibility**: Works with PyArrow, Polars, Pandas

### Manual Arrow IPC Usage

```python

import ibarrow

import polars as pl

# Get raw Arrow IPC bytes

# Create connection

conn = ibarrow.connect(

    dsn="your_dsn",

    user="username",

    password="password"

)

# Get Arrow IPC bytes

arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

# Convert to Polars DataFrame manually

df = pl.read_ipc(arrow_bytes)

print(df)

```

## API Reference

### `ibarrow.connect(dsn, user, password, config=None)`

Creates a connection object for database operations.

**Parameters:**

- `dsn` (str): ODBC Data Source Name or full connection string

  - **DSN format**: `"your_dsn"` (requires pre-configured DSN)

  - **Connection string format**: `"DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;"` (direct connection)

- `user` (str): Database username

- `password` (str): Database password

- `config` (QueryConfig, optional): Configuration object

**Returns:** `IbarrowConnection` object

**Connection String Examples:**

```python

# SQL Server

dsn = "DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;"

# PostgreSQL

dsn = "DRIVER={PostgreSQL};SERVER=localhost;PORT=5432;DATABASE=mydb;"

# MySQL

dsn = "DRIVER={MySQL ODBC 8.0 Driver};SERVER=localhost;PORT=3306;DATABASE=mydb;"

# Oracle

dsn = "DRIVER={Oracle in OraClient19Home1};DBQ=localhost:1521/XE;"

```

### `query_arrow_ipc(sql)`

Execute a SQL query and return Arrow IPC bytes.

**Parameters:**

- `sql` (str): SQL query to execute

**Returns:** `bytes` - Arrow IPC format data

**Raises:**

- `PyConnectionError`: Database connection issues

- `PySQLError`: SQL syntax or execution errors

- `PyArrowError`: Arrow data processing errors

### `conn.query_polars(sql)`

Execute a SQL query and return a Polars DataFrame directly.

**Parameters:** Same as `query_arrow_ipc`

**Returns:** `polars.DataFrame` - Ready-to-use DataFrame

**Note:** Uses `pl.read_ipc()` directly with bytes for optimal performance.

### `query_pandas(sql)`

Execute a SQL query and return a Pandas DataFrame directly.

**Parameters:** Same as `query_arrow_ipc`

**Returns:** `pandas.DataFrame` - Ready-to-use DataFrame

**Note:** Converts Arrow IPC to Pandas via PyArrow for compatibility.

### `QueryConfig`

Configuration class for advanced query settings.

**Parameters:**

- `batch_size` (int, optional): Number of rows per batch for processing (default: 1000)

- `read_only` (bool, optional): Read-only connection to avoid locks (default: True)

- `connection_timeout` (int, optional): Connection timeout in seconds

- `query_timeout` (int, optional): Query timeout in seconds

- `max_text_size` (int, optional): Maximum text field size in bytes (default: 65536)

- `max_binary_size` (int, optional): Maximum binary field size in bytes (default: 65536)

- `isolation_level` (str, optional): Transaction isolation level. Supported values: "read_uncommitted", "read_committed", "repeatable_read", "serializable", "snapshot"

### Configuration Benefits

- **`batch_size`**: Controls memory usage and performance. Larger batches = more memory but faster processing

- **`read_only`**: Prevents locks and improves performance for read-only operations (effective only if ODBC driver supports this flag)

- **`connection_timeout`**: Protects against hanging connections

- **`query_timeout`**: Prevents long-running queries from blocking

- **`max_text_size`**: Handles large text fields (VARCHAR, TEXT) efficiently

- **`max_binary_size`**: Handles large binary data (BLOB, VARBINARY) efficiently

- **`isolation_level`**: Controls transaction isolation for concurrent access

### Implementation Notes

- **`read_only`**: Currently implemented via ODBC connection string (`ReadOnly=1`). 

- **`batch_size`**: Controls how many rows are fetched per batch from the database, avoiding row-by-row fetching for better performance.

- **`query_timeout`**: Implemented via statement handle using `stmt.set_query_timeout()`, which is more reliable than connection string timeouts.

- **`isolation_level`**: Standardized mapping from common names (e.g., "read_committed") to driver-specific ODBC connection string values (e.g., "Isolation Level=ReadCommitted").

- **`query_polars`**: Uses Arrow IPC stream with `pl.read_ipc()` for maximum compatibility and performance.

- **Native Types**: Always preserves ODBC native types (INT, DECIMAL, FLOAT) as Arrow native types (Int64Array, Float64Array), avoiding expensive string conversions for maximum performance.

- **Pipelining**: Always processes data in streaming fashion, writing each batch immediately as it's fetched. This keeps memory usage constant (e.g., 10MB) regardless of dataset size (even 80GB+).

## Performance Comparison

### Serialization vs Zero-Copy

| Method | Level | Serialization | Memory Copies | Performance | Ideal Use |

|--------|-------|-------------|---------------|-------------|-----------|

| **`query_polars`** | **High** | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐⭐ | **95% of cases (recommended)** |

| **`query_pandas`** | **High** | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐ | Pandas compatibility |

| `query_arrow_ipc` | Low | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐ | Maximum compatibility |

| `query_arrow_c_data` | Low | **Zero** | **Zero** | **⭐⭐⭐⭐⭐** | Maximum performance |

### Typical Benchmarks (1M rows)

```

query_polars:         ~200ms  (Arrow IPC + polars.read_ipc) ⭐ RECOMMENDED

query_pandas:         ~600ms  (Arrow IPC + pyarrow + pandas)

query_arrow_ipc:      ~500ms  (Arrow IPC serialization)

query_arrow_c_data:   ~50ms   (zero-copy via pointers)

```

### 🚀 **Built-in Performance Optimizations**

**Native Types (Always Enabled):**

```

- INT columns → Int64Array (native Arrow types)

- DECIMAL columns → DecimalArray (native Arrow types)  

- FLOAT columns → Float64Array (native Arrow types)

- Performance: ~200ms for 1M numeric rows (Arrow IPC)

```

**Pipelining (Always Enabled):**

```

- Memory usage: Constant (~10MB) regardless of dataset size

- Processing: Streaming (fetch + write immediately)

- Latency: Lower (Python can start consuming data before completion)

- Example: 80GB dataset uses only ~10MB RAM

```

### When to Use Each Method

#### 🎯 **High-Level API (Recommended)**

- **`query_polars()`**: **95% of cases** - Simple, fast, zero-copy

- **`query_pandas()`**: When you need Pandas compatibility

#### 🔧 **Low-Level API (Advanced)**

- **`query_arrow_ipc()`**: Maximum compatibility, save to disk

- **`query_arrow_c_data()`**: Maximum performance, full control over data

## Error Handling

The library provides specific exception types for different error scenarios:

```python

import ibarrow

try:

    # Create connection

    conn = ibarrow.connect(dsn, user, password)

    

    # Query with batch size

    df = conn.query_polars(sql)

except ibarrow.PyConnectionError as e:

    print(f"Connection failed: {e}")

except ibarrow.PySQLError as e:

    print(f"SQL error: {e}")

except ibarrow.PyArrowError as e:

    print(f"Arrow processing error: {e}")

```

## Requirements

- Python 3.8+

- ODBC driver for your database

- Rust toolchain (for development)

## Development

### Setup

```bash

# Clone the repository

git clone https://github.com/thomazyujibaba/ibarrow.git

cd ibarrow

# Install maturin

pip install maturin[patchelf]

# Install in development mode

maturin develop

```

### Running Tests

```bash

# Install test dependencies

pip install pytest pytest-cov

# Run tests

pytest tests/ -v

```

### Building

```bash

# Build wheel

maturin build --release

# Build and install

maturin develop

```

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Troubleshooting

### Common ODBC Issues

**"Driver not found" errors:**

- Ensure the ODBC driver is properly installed

- Check that the driver name in your DSN matches exactly

- Verify the driver architecture (32-bit vs 64-bit) matches your Python installation

**Connection timeout errors:**

- Check network connectivity to the database server

- Verify firewall settings

- Ensure the database server is running and accessible

**Permission errors:**

- Verify database credentials

- Check user permissions on the database

- Ensure the ODBC driver has necessary privileges

**Performance issues:**

- Adjust `batch_size` in `QueryConfig` for optimal memory usage

- Use `read_only=True` for read-only operations

- Consider connection pooling for high-frequency queries

## Support

For issues and questions, please use the [GitHub Issues](https://github.com/thomazyujibaba/ibarrow/issues) page.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/thomazyujibaba/ibarrow

Awesome Lists containing this project

README