https://github.com/thomazyujibaba/ibarrow
Connect legacy ODBC databases (e.g., InterBase, Firebird) to modern Arrow/Pandas/Polars workflows via Rust and Python.
- Host: GitHub
- URL: https://github.com/thomazyujibaba/ibarrow
- Owner: thomazyujibaba
- License: MIT
- Created: 2025-09-23T18:07:51.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-11-24T07:43:53.000Z (5 months ago)
- Last Synced: 2025-11-27T18:28:11.170Z (4 months ago)
- Topics: apache, arrow, interbase, odbc
- Language: Rust
- Homepage:
- Size: 74.2 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 9
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Roadmap: ROADMAP.md
# ibarrow
High-performance ODBC to Arrow data conversion for Python, built with Rust.
## Features
- 🚀 **High Performance**: Built with Rust for maximum speed
- 🔄 **ODBC Integration**: Direct connection to any ODBC-compatible database
- 📊 **Arrow Format**: Native Apache Arrow support for efficient data processing
- 🐼 **Pandas/Polars Ready**: Seamless integration with popular Python data libraries
- 🛡️ **Type Safe**: Rust-powered reliability with Python convenience
- 🎯 **Two-Level API**: Simple wrappers for common use + raw functions for advanced control
## Installation
```bash
pip install ibarrow
```
## Repository
- **GitHub**: https://github.com/thomazyujibaba/ibarrow
- **PyPI**: https://pypi.org/project/ibarrow/
- **Documentation**: https://github.com/thomazyujibaba/ibarrow#readme
## Prerequisites
**Important**: You need an ODBC driver manager (e.g., unixODBC on Linux) and an ODBC driver for your database installed on your system for ibarrow to work.
### Windows
- **SQL Server**: [ODBC Driver for SQL Server](https://docs.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server)
- **PostgreSQL**: [psqlODBC](https://www.postgresql.org/ftp/odbc/versions/)
- **MySQL**: [MySQL Connector/ODBC](https://dev.mysql.com/downloads/connector/odbc/)
- **Oracle**: [Oracle Instant Client + ODBC](https://www.oracle.com/database/technologies/instant-client/winx64-64-downloads.html)
### Linux
- **SQL Server**: [Microsoft ODBC Driver for SQL Server on Linux](https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server)
- **PostgreSQL**: `sudo apt-get install odbc-postgresql` (Ubuntu/Debian) or `sudo yum install postgresql-odbc` (RHEL/CentOS)
- **MySQL**: `sudo apt-get install libmyodbc` (Ubuntu/Debian) or `sudo yum install mysql-connector-odbc` (RHEL/CentOS)
### macOS
- **Note**: macOS support is currently not available. Please use Windows or Linux for now.
### Verify ODBC Installation
You can verify your ODBC drivers are installed by checking the system:
**Windows:**
```cmd
rem Open the ODBC Data Source Administrator to inspect installed drivers
odbcad32.exe
```
**Linux:**
```bash
# List available drivers
odbcinst -q -d
```
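If you prefer to check from Python, a small stdlib-only sketch can read driver names from an `odbcinst.ini`-style file. This is illustrative and not part of ibarrow; the sample file content and path handling are assumptions, and on a real Linux system you would point it at `/etc/odbcinst.ini`:

```python
import configparser
import os
import tempfile

def installed_odbc_drivers(path):
    """Return driver section names from an odbcinst.ini-style file."""
    ini = configparser.ConfigParser()
    ini.read(path)  # a missing file simply yields no sections
    return [s for s in ini.sections() if s.lower() != "odbc"]

# Self-contained demo against a sample file; replace with
# /etc/odbcinst.ini on a real system.
sample = "[PostgreSQL]\nDriver=/usr/lib/psqlodbcw.so\n"
with tempfile.NamedTemporaryFile("w", suffix=".ini", delete=False) as f:
    f.write(sample)
drivers = installed_odbc_drivers(f.name)
os.unlink(f.name)
print(drivers)  # ['PostgreSQL']
```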
## API Architecture
ibarrow provides a **two-level API** designed for different user needs:
### 🎯 **High-Level API (Recommended for 95% of users)**
- **`query_polars()`**: Direct Polars DataFrame (fastest high-level option)
- **`query_pandas()`**: Direct Pandas DataFrame (maximum compatibility)
### 🔧 **Low-Level API (For advanced users)**
- **`query_arrow_ipc()`**: Raw Arrow IPC bytes (maximum compatibility)
- **`query_arrow_c_data()`**: Raw Arrow C Data Interface (maximum performance)
### 📋 **When to Use Each Level**
| User Type | Recommended Function | Use Case |
|-----------|---------------------|----------|
| **Beginners** | `query_polars()` | 95% of cases - simple and fast |
| **Pandas Users** | `query_pandas()` | When you need Pandas compatibility |
| **Advanced Users** | `query_arrow_ipc()` | When you need raw Arrow data |
| **Performance Critical** | `query_arrow_c_data()` | When you need maximum control |
## Quick Start
### 🔗 **Connection Methods**
ibarrow supports two ways to connect to databases:
#### **Method 1: DSN (Data Source Name)**
```python
# Requires pre-configured DSN in ODBC Data Sources
conn = ibarrow.connect(
    dsn="my_database_dsn",
    user="username",
    password="password"
)
```
#### **Method 2: Direct Connection String (Recommended)**
```python
# Direct connection like pyodbc - no DSN configuration needed
conn = ibarrow.connect(
    dsn="DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;",
    user="username",
    password="password"
)
```
### 🚀 **Recommended Usage (95% of cases)**
```python
import ibarrow

# Option 1: Using DSN (Data Source Name)
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Option 2: Using direct connection string (like pyodbc)
conn = ibarrow.connect(
    dsn="DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;",
    user="username",
    password="password"
)

# Query and get a Polars DataFrame
df = conn.query_polars("SELECT * FROM your_table")
print(df)
```
### With Custom Batch Size
```python
import ibarrow

# Create a config with a custom batch size
config = ibarrow.QueryConfig(batch_size=2000)

# Create a connection with the configuration
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password",
    config=config
)

# Query with the custom batch size
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")
```
### Advanced Configuration
```python
import ibarrow

# Create a custom configuration
config = ibarrow.QueryConfig(
    batch_size=2000,                   # Rows per batch
    read_only=True,                    # Read-only connection
    connection_timeout=30,             # Connection timeout in seconds
    query_timeout=60,                  # Query timeout in seconds
    max_text_size=32768,               # Max text field size
    max_binary_size=16384,             # Max binary field size
    isolation_level="read_committed"   # Transaction isolation
)

# Create a connection with the configuration
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password",
    config=config
)

# Use the connection
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")
```
### Direct DataFrame Integration
```python
import ibarrow

# Create a connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get a Polars DataFrame (uses pl.read_ipc internally)
df_polars = conn.query_polars("SELECT * FROM your_table")

# Get a Pandas DataFrame
df_pandas = conn.query_pandas("SELECT * FROM your_table")

print(df_polars)
print(df_pandas)
```
### ⚡ Zero-Copy Performance (Arrow C Data Interface)
For maximum performance, use the Arrow C Data Interface functions that completely eliminate serialization:
```python
import ibarrow
import polars as pl
import pyarrow as pa

# Create a connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Zero-copy conversion to a Polars DataFrame (fastest)
df_polars = conn.query_arrow_c_data("SELECT * FROM your_table", return_dataframe=True)

# Or get raw PyCapsules for manual control
schema_capsule, array_capsule = conn.query_arrow_c_data("SELECT * FROM your_table")

# Convert to a PyArrow Table using zero-copy
schema = pa.Schema._import_from_c(schema_capsule)
array = pa.Array._import_from_c(array_capsule)
table = pa.Table.from_arrays([array], schema=schema)

# Convert to Polars
df = pl.from_arrow(table)
```
**Arrow C Data Interface Benefits:**
- 🚀 **Zero serialization**: Data passes directly via pointers
- 💾 **Zero copies**: Eliminates memory overhead
- ⚡ **Maximum speed**: Ideal for large datasets
- 🔄 **Compatibility**: Works with PyArrow, Polars, Pandas
### Manual Arrow IPC Usage
```python
import ibarrow
import polars as pl

# Create a connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get raw Arrow IPC bytes
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

# Convert to a Polars DataFrame manually
df = pl.read_ipc(arrow_bytes)
print(df)
```
## API Reference
### `ibarrow.connect(dsn, user, password, config=None)`
Creates a connection object for database operations.
**Parameters:**
- `dsn` (str): ODBC Data Source Name or full connection string
- **DSN format**: `"your_dsn"` (requires pre-configured DSN)
- **Connection string format**: `"DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;"` (direct connection)
- `user` (str): Database username
- `password` (str): Database password
- `config` (QueryConfig, optional): Configuration object
**Returns:** `IbarrowConnection` object
**Connection String Examples:**
```python
# SQL Server
dsn = "DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;"
# PostgreSQL
dsn = "DRIVER={PostgreSQL};SERVER=localhost;PORT=5432;DATABASE=mydb;"
# MySQL
dsn = "DRIVER={MySQL ODBC 8.0 Driver};SERVER=localhost;PORT=3306;DATABASE=mydb;"
# Oracle
dsn = "DRIVER={Oracle in OraClient19Home1};DBQ=localhost:1521/XE;"
```
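Since ODBC connection strings are just key=value pairs joined by semicolons, a tiny helper can reduce typos when assembling them. This helper is purely illustrative and not part of ibarrow's API; note that some drivers use non-standard keys (e.g., Oracle's `DBQ`), which you would pass explicitly:

```python
def build_conn_str(driver: str, **params: str) -> str:
    """Assemble an ODBC connection string from a driver name and key=value pairs."""
    parts = ["DRIVER={%s}" % driver]
    parts += [f"{key.upper()}={value}" for key, value in params.items()]
    return ";".join(parts) + ";"

dsn = build_conn_str("PostgreSQL", server="localhost", port="5432", database="mydb")
print(dsn)  # DRIVER={PostgreSQL};SERVER=localhost;PORT=5432;DATABASE=mydb;
```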
### `conn.query_arrow_ipc(sql)`
Execute a SQL query and return Arrow IPC bytes.
**Parameters:**
- `sql` (str): SQL query to execute
**Returns:** `bytes` - Arrow IPC format data
**Raises:**
- `PyConnectionError`: Database connection issues
- `PySQLError`: SQL syntax or execution errors
- `PyArrowError`: Arrow data processing errors
### `conn.query_polars(sql)`
Execute a SQL query and return a Polars DataFrame directly.
**Parameters:** Same as `query_arrow_ipc`
**Returns:** `polars.DataFrame` - Ready-to-use DataFrame
**Note:** Uses `pl.read_ipc()` directly with bytes for optimal performance.
### `conn.query_pandas(sql)`
Execute a SQL query and return a Pandas DataFrame directly.
**Parameters:** Same as `query_arrow_ipc`
**Returns:** `pandas.DataFrame` - Ready-to-use DataFrame
**Note:** Converts Arrow IPC to Pandas via PyArrow for compatibility.
### `QueryConfig`
Configuration class for advanced query settings.
**Parameters:**
- `batch_size` (int, optional): Number of rows per batch for processing (default: 1000)
- `read_only` (bool, optional): Read-only connection to avoid locks (default: True)
- `connection_timeout` (int, optional): Connection timeout in seconds
- `query_timeout` (int, optional): Query timeout in seconds
- `max_text_size` (int, optional): Maximum text field size in bytes (default: 65536)
- `max_binary_size` (int, optional): Maximum binary field size in bytes (default: 65536)
- `isolation_level` (str, optional): Transaction isolation level. Supported values: "read_uncommitted", "read_committed", "repeatable_read", "serializable", "snapshot"
### Configuration Benefits
- **`batch_size`**: Controls memory usage and performance. Larger batches = more memory but faster processing
- **`read_only`**: Prevents locks and improves performance for read-only operations (effective only if ODBC driver supports this flag)
- **`connection_timeout`**: Protects against hanging connections
- **`query_timeout`**: Prevents long-running queries from blocking
- **`max_text_size`**: Handles large text fields (VARCHAR, TEXT) efficiently
- **`max_binary_size`**: Handles large binary data (BLOB, VARBINARY) efficiently
- **`isolation_level`**: Controls transaction isolation for concurrent access
### Implementation Notes
- **`read_only`**: Currently implemented via ODBC connection string (`ReadOnly=1`).
- **`batch_size`**: Controls how many rows are fetched per batch from the database, avoiding row-by-row fetching for better performance.
- **`query_timeout`**: Implemented via statement handle using `stmt.set_query_timeout()`, which is more reliable than connection string timeouts.
- **`isolation_level`**: Standardized mapping from common names (e.g., "read_committed") to driver-specific ODBC connection string values (e.g., "Isolation Level=ReadCommitted").
- **`query_polars`**: Uses Arrow IPC stream with `pl.read_ipc()` for maximum compatibility and performance.
- **Native Types**: Always preserves ODBC native types (INT, DECIMAL, FLOAT) as Arrow native types (Int64Array, Float64Array), avoiding expensive string conversions for maximum performance.
- **Pipelining**: Always processes data in streaming fashion, writing each batch immediately as it's fetched. This keeps memory usage constant (e.g., 10MB) regardless of dataset size (even 80GB+).
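As an illustration of the isolation-level mapping described above, a minimal sketch might look like the following. The names here are hypothetical; the real driver-specific handling lives in ibarrow's Rust layer:

```python
# Hypothetical sketch of the standardized isolation-level mapping;
# not ibarrow's internal code.
ISOLATION_LEVELS = {
    "read_uncommitted": "ReadUncommitted",
    "read_committed": "ReadCommitted",
    "repeatable_read": "RepeatableRead",
    "serializable": "Serializable",
    "snapshot": "Snapshot",
}

def isolation_clause(level: str) -> str:
    """Translate a common name into an ODBC connection-string fragment."""
    try:
        return f"Isolation Level={ISOLATION_LEVELS[level.lower()]}"
    except KeyError:
        raise ValueError(f"unsupported isolation level: {level!r}") from None

print(isolation_clause("read_committed"))  # Isolation Level=ReadCommitted
```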
## Performance Comparison
### Serialization vs Zero-Copy
| Method | Level | Serialization | Memory Copies | Performance | Ideal Use |
|--------|-------|-------------|---------------|-------------|-----------|
| **`query_polars`** | **High** | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐⭐ | **95% of cases (recommended)** |
| **`query_pandas`** | **High** | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐ | Pandas compatibility |
| `query_arrow_ipc` | Low | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐ | Maximum compatibility |
| `query_arrow_c_data` | Low | **Zero** | **Zero** | **⭐⭐⭐⭐⭐** | Maximum performance |
### Typical Benchmarks (1M rows)
```
query_polars: ~200ms (Arrow IPC + polars.read_ipc) ⭐ RECOMMENDED
query_pandas: ~600ms (Arrow IPC + pyarrow + pandas)
query_arrow_ipc: ~500ms (Arrow IPC serialization)
query_arrow_c_data: ~50ms (zero-copy via pointers)
```
### 🚀 **Built-in Performance Optimizations**
**Native Types (Always Enabled):**
```
- INT columns → Int64Array (native Arrow types)
- DECIMAL columns → DecimalArray (native Arrow types)
- FLOAT columns → Float64Array (native Arrow types)
- Performance: ~200ms for 1M numeric rows (Arrow IPC)
```
**Pipelining (Always Enabled):**
```
- Memory usage: Constant (~10MB) regardless of dataset size
- Processing: Streaming (fetch + write immediately)
- Latency: Lower (Python can start consuming data before completion)
- Example: 80GB dataset uses only ~10MB RAM
```
### When to Use Each Method
#### 🎯 **High-Level API (Recommended)**
- **`query_polars()`**: **95% of cases** - simple and fast
- **`query_pandas()`**: When you need Pandas compatibility
#### 🔧 **Low-Level API (Advanced)**
- **`query_arrow_ipc()`**: Maximum compatibility, save to disk
- **`query_arrow_c_data()`**: Maximum performance, full control over data
## Error Handling
The library provides specific exception types for different error scenarios:
```python
import ibarrow

try:
    # Create a connection and run the query
    conn = ibarrow.connect(dsn, user, password)
    df = conn.query_polars(sql)
except ibarrow.PyConnectionError as e:
    print(f"Connection failed: {e}")
except ibarrow.PySQLError as e:
    print(f"SQL error: {e}")
except ibarrow.PyArrowError as e:
    print(f"Arrow processing error: {e}")
```
## Requirements
- Python 3.8+
- ODBC driver for your database
- Rust toolchain (for development)
## Development
### Setup
```bash
# Clone the repository
git clone https://github.com/thomazyujibaba/ibarrow.git
cd ibarrow
# Install maturin
pip install "maturin[patchelf]"
# Install in development mode
maturin develop
```
### Running Tests
```bash
# Install test dependencies
pip install pytest pytest-cov
# Run tests
pytest tests/ -v
```
### Building
```bash
# Build wheel
maturin build --release
# Build and install
maturin develop
```
## License
MIT License - see [LICENSE](LICENSE) file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Troubleshooting
### Common ODBC Issues
**"Driver not found" errors:**
- Ensure the ODBC driver is properly installed
- Check that the driver name in your DSN matches exactly
- Verify the driver architecture (32-bit vs 64-bit) matches your Python installation
**Connection timeout errors:**
- Check network connectivity to the database server
- Verify firewall settings
- Ensure the database server is running and accessible
**Permission errors:**
- Verify database credentials
- Check user permissions on the database
- Ensure the ODBC driver has necessary privileges
**Performance issues:**
- Adjust `batch_size` in `QueryConfig` for optimal memory usage
- Use `read_only=True` for read-only operations
- Consider connection pooling for high-frequency queries
## Support
For issues and questions, please use the [GitHub Issues](https://github.com/thomazyujibaba/ibarrow/issues) page.