https://github.com/legout/pydala2
Poor man's data lake: a simple API to efficiently query your Parquet datasets using DuckDB or Polars
- Host: GitHub
- URL: https://github.com/legout/pydala2
- Owner: legout
- License: MIT
- Created: 2023-08-04T11:19:46.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2025-09-24T16:10:28.000Z (9 months ago)
- Last Synced: 2025-10-10T03:27:11.957Z (8 months ago)
- Topics: duckdb, fsspec, local, localcache, object-storage, pandas, polars, pyarrow, python
- Language: Python
- Homepage:
- Size: 7.06 MB
- Stars: 6
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# PyDala2
[PyPI version](https://badge.fury.io/py/pydala2)
[License: MIT](https://opensource.org/licenses/MIT)
[Documentation](https://pydala2.readthedocs.io)
## Overview
PyDala2 is a high-performance Python library, built on Apache Arrow, for managing Parquet datasets with advanced metadata capabilities. Its features include:
- Smart dataset management with metadata optimization
- Multi-format support (Parquet, CSV, JSON)
- Multi-backend integration (Polars, PyArrow, DuckDB, Pandas)
- Advanced querying with predicate pushdown
- Schema management with automatic validation
- Performance optimization with caching and partitioning
- Catalog system for centralized dataset management
## Key Features
- **High Performance**: Built on Apache Arrow with optimized memory usage and processing speed
- **Smart Dataset Management**: Efficient Parquet handling with metadata optimization and caching
- **Multi-backend Support**: Seamlessly switch between Polars, PyArrow, DuckDB, and Pandas
- **Advanced Querying**: SQL-like filtering with predicate pushdown for maximum efficiency
- **Schema Management**: Automatic validation, evolution, and tracking of data schemas
- **Performance Optimization**: Built-in caching, compression, and intelligent partitioning
- **Type Safety**: Comprehensive validation and error handling throughout the library
- **Catalog System**: Centralized dataset management across namespaces
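To make "predicate pushdown" and "partitioning" more concrete: in a Hive-style layout, each partition value is encoded in a directory name such as `category=A/`, so a filter on the partition column can rule out whole directories before any Parquet file is opened. The following is a library-free sketch of that idea only. It is not PyDala2's implementation, it assumes a Hive-style directory layout, and the helper names and file paths are hypothetical.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict[str, str]:
    """Extract key=value pairs from the directory components of a Hive-style path."""
    values = {}
    for part in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        key, sep, value = part.partition("=")
        if sep:
            values[key] = value
    return values


def prune_partitions(paths: list[str], column: str, allowed: set[str]) -> list[str]:
    """Keep only files whose partition value for `column` is in `allowed`.

    Files that carry no such partition key are kept, because they cannot be
    ruled out from the path alone.
    """
    kept = []
    for path in paths:
        value = partition_values(path).get(column)
        if value is None or value in allowed:
            kept.append(path)
    return kept


# Hypothetical file listing for a dataset partitioned by `category`.
files = [
    "data/my_dataset/category=A/part-0.parquet",
    "data/my_dataset/category=B/part-0.parquet",
    "data/my_dataset/category=C/part-0.parquet",
]

# For a filter like category IN ('A', 'B'), the category=C file is never opened.
selected = prune_partitions(files, "category", {"A", "B"})
print(selected)
```

Predicates on non-partition columns (for example `value > 50`) cannot be resolved from paths; query engines typically check those against the min/max statistics stored in each Parquet file.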
## Quick Start
### Installation
```bash
# Install PyDala2
pip install pydala2
# Install with all optional dependencies
pip install pydala2[all]
# Install with specific backends
pip install pydala2[polars,duckdb]
```
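Because the backends are optional extras, it can help to check which ones are importable in the current environment before choosing an export format. The snippet below is a small sketch using only the standard library; the `available_backends` helper is a hypothetical name, and the module list simply mirrors the backends named in this README.

```python
from importlib.util import find_spec

# Import names of the backends listed in this README.
BACKENDS = ("pyarrow", "polars", "duckdb", "pandas")


def available_backends(names=BACKENDS) -> dict[str, bool]:
    """Map each module name to whether it can be imported in this environment."""
    return {name: find_spec(name) is not None for name in names}


for name, installed in available_backends().items():
    print(f"{name:8s} {'ok' if installed else 'missing'}")
```

A missing entry usually means the corresponding extra (for example `pydala2[polars]`) was not installed.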
### Basic Usage
```python
from pydala import ParquetDataset
import pandas as pd
# Create a dataset
dataset = ParquetDataset("data/my_dataset")
# Write data
data = pd.DataFrame({
    'id': range(100),
    'category': ['A', 'B', 'C'] * 33 + ['A'],
    'value': [i * 2 for i in range(100)],
})
dataset.write_to_dataset(
    data=data,
    partition_cols=['category'],
)
# Read with filtering - automatic backend selection
result = dataset.filter("category IN ('A', 'B') AND value > 50")
# Export to different formats
df_polars = result.table.to_polars() # or use shortcut: result.t.pl
df_pandas = result.table.df # or result.t.df
duckdb_rel = result.table.ddb # or result.t.ddb
```
### Using Different Backends
```python
import polars as pl  # needed for the pl.col / pl.mean expressions below

# `dataset` is the ParquetDataset created in the Basic Usage example.
# PyDala2 provides automatic backend selection
# Just access data in your preferred format:
# Polars LazyFrame (recommended for performance)
lazy_df = dataset.table.pl  # or dataset.t.pl
result = (
    lazy_df
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.mean("value"))
    .collect()
)
# DuckDB (for SQL queries)
result = dataset.ddb_con.sql("""
    SELECT category, AVG(value) AS avg_value
    FROM dataset
    GROUP BY category
""").to_arrow_table()
# PyArrow Table (for columnar operations)
table = dataset.table.arrow # or dataset.t.arrow
# Pandas DataFrame (for compatibility)
df_pandas = dataset.table.df # or dataset.t.df
# Direct export methods
df_polars = dataset.table.to_polars(lazy=False)
table = dataset.table.to_arrow()
df_pandas = dataset.table.to_pandas()
```
### Catalog Management
```python
from pydala import Catalog
# Create catalog from YAML configuration
catalog = Catalog("catalog.yaml")
# YAML configuration example:
# tables:
#   sales_2023:
#     path: "/data/sales/2023"
#     filesystem: "local"
#   customers:
#     path: "/data/customers"
#     filesystem: "local"
# Query across datasets using automatic table loading
result = catalog.query("""
    SELECT
        s.*,
        c.customer_name,
        c.segment
    FROM sales_2023 s
    JOIN customers c ON s.customer_id = c.id
    WHERE s.date >= '2023-01-01'
""")
# Or access datasets directly
sales_dataset = catalog.get_dataset("sales_2023")
filtered_sales = sales_dataset.filter("amount > 1000")
```
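`Catalog("catalog.yaml")` needs the configuration file to exist first. As a minimal sketch, the YAML example from the comments in the snippet above can be generated with the standard library alone. The layout is copied from that example; the `write_catalog_config` helper is a hypothetical name, it only handles plain string values, and the catalog documentation should be consulted for the full set of supported keys.

```python
from pathlib import Path
import tempfile


def write_catalog_config(path: Path, tables: dict[str, dict[str, str]]) -> str:
    """Render `tables` in the README's YAML layout and write it to `path`.

    Only simple string values are supported; values are wrapped in double
    quotes without any escaping.
    """
    lines = ["tables:"]
    for name, options in tables.items():
        lines.append(f"  {name}:")
        for key, value in options.items():
            lines.append(f'    {key}: "{value}"')
    text = "\n".join(lines) + "\n"
    path.write_text(text, encoding="utf-8")
    return text


tables = {
    "sales_2023": {"path": "/data/sales/2023", "filesystem": "local"},
    "customers": {"path": "/data/customers", "filesystem": "local"},
}

# Written to a temporary directory here; point this at your project in practice.
config_path = Path(tempfile.mkdtemp()) / "catalog.yaml"
config_text = write_catalog_config(config_path, tables)
print(config_text)
```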
## Documentation
Comprehensive documentation is available at [pydala2.readthedocs.io](https://pydala2.readthedocs.io):
### Getting Started
- [Installation Guide](https://pydala2.readthedocs.io/getting-started)
- [Quick Start Tutorial](https://pydala2.readthedocs.io/quick-start)
### User Guide
- [Basic Usage](https://pydala2.readthedocs.io/user-guide/basic-usage)
- [Data Operations](https://pydala2.readthedocs.io/user-guide/data-operations)
- [Performance Optimization](https://pydala2.readthedocs.io/user-guide/performance)
- [Catalog Management](https://pydala2.readthedocs.io/user-guide/catalog-management)
- [Schema Management](https://pydala2.readthedocs.io/user-guide/schema-management)
### API Reference
- [Core Classes](https://pydala2.readthedocs.io/api/core)
- [Dataset Classes](https://pydala2.readthedocs.io/api/datasets)
- [Table Operations](https://pydala2.readthedocs.io/api/table)
- [Metadata Management](https://pydala2.readthedocs.io/api/metadata)
- [Catalog System](https://pydala2.readthedocs.io/api/catalog)
- [Filesystem](https://pydala2.readthedocs.io/api/filesystem)
- [Utilities](https://pydala2.readthedocs.io/api/utilities)
### Advanced Topics
- [Performance Tuning](https://pydala2.readthedocs.io/advanced/performance-tuning)
- [Integration Patterns](https://pydala2.readthedocs.io/advanced/integration)
- [Deployment Guide](https://pydala2.readthedocs.io/advanced/deployment)
- [Troubleshooting](https://pydala2.readthedocs.io/advanced/troubleshooting)
## Contributing
Contributions are welcome! Please see our [Contributing Guide](https://pydala2.readthedocs.io/contributing) for details.
## License
[MIT License](LICENSE)