https://github.com/bigmb/mb_pandas
Personal addition to pandas data ETL for faster and better performance
- Host: GitHub
- URL: https://github.com/bigmb/mb_pandas
- Owner: bigmb
- Created: 2022-11-20T05:40:51.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-12-16T09:48:20.000Z (5 months ago)
- Last Synced: 2025-02-17T07:44:13.726Z (3 months ago)
- Language: Python
- Size: 91.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
# mb_pandas
A Python package providing enhanced pandas functionality with async support and optimized operations.
## Features
- **Asynchronous DataFrame Loading**: Load large CSV and Parquet files efficiently using async I/O
- **Optimized DataFrame Merging**: Merge large DataFrames using chunking or Dask
- **Data Type Conversions**: Convert between string representations and Python objects
- **DataFrame Profiling**: Generate detailed profiling reports and comparisons
- **Data Transformation**: Various utilities for DataFrame transformations

## Installation
```bash
pip install mb_pandas
```

## Dependencies
- Python >= 3.8
- numpy
- pandas
- colorama

## Modules
### transform.py
Functions for DataFrame transformations and merging operations.
```python
from mb_pandas.transform import merge_chunk, merge_dask, check_null, remove_unnamed, rename_columns

# Merge large DataFrames in chunks
result = merge_chunk(df1, df2, chunksize=10000)

# Merge using Dask for distributed computing
result = merge_dask(df1, df2)

# Check and handle null values
df = check_null('data.csv', fillna=True)

# Remove unnamed columns
df = remove_unnamed(df)

# Rename a column
df = rename_columns(data, 'labels2', 'labels')
```
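For intuition, the Dask-backed merge pattern looks roughly like this. This is a minimal sketch, assuming the inputs start as in-memory pandas DataFrames; the helper name `merge_with_dask` and the `npartitions` default are illustrative, not mb_pandas's actual implementation:
```python
import dask.dataframe as dd
import pandas as pd

def merge_with_dask(df1: pd.DataFrame, df2: pd.DataFrame, npartitions: int = 8) -> pd.DataFrame:
    # Wrap the in-memory frames as partitioned Dask DataFrames so the
    # join is executed partition by partition instead of all at once.
    ddf1 = dd.from_pandas(df1, npartitions=npartitions)
    ddf2 = dd.from_pandas(df2, npartitions=npartitions)
    # compute() materializes the lazy result back into a pandas DataFrame.
    return ddf1.merge(ddf2).compute()
```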
### dfload.py
Asynchronous DataFrame loading utilities.
```python
from mb_pandas.dfload import load_any_df

# Load any supported file format
df = load_any_df('data.csv')
df = load_any_df('data.parquet')

# Convert string columns to Python objects
df = load_any_df('data.csv', literal_ast_columns=['json_col'])
```

### aio.py
Asynchronous I/O utilities.
```python
from mb_pandas.aio import read_text, srun

# Read a file asynchronously
content = await read_text('file.txt', context_vars={'async': True})

# Run an async function synchronously
result = srun(async_function, *args)
```
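A helper like `srun` typically follows the standard pattern of driving a coroutine to completion on a fresh event loop. Here is a minimal sketch of that pattern, assuming no loop is already running; the name `run_sync` is hypothetical, and mb_pandas's internals may differ:
```python
import asyncio

def run_sync(async_func, *args, **kwargs):
    # Build the coroutine, then block until it finishes on a new event loop.
    return asyncio.run(async_func(*args, **kwargs))

async def fetch_value() -> int:
    await asyncio.sleep(0.1)
    return 42

print(run_sync(fetch_value))  # 42
```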
### convert_data.py
Data type conversion utilities.
```python
from mb_pandas.convert_data import convert_string_to_list, convert_string_to_dict, convert_string_to_type

# Convert string representations to lists
df = convert_string_to_list(df, 'list_column')

# Convert string representations to dictionaries
df = convert_string_to_dict(df, 'dict_column')

# Convert strings to a specific type
df = convert_string_to_type(df, 'number_column', int)
```
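Conceptually, these conversions boil down to applying `ast.literal_eval` over a column of string literals. A self-contained sketch of that pattern in plain pandas (not mb_pandas's actual code):
```python
import ast
import pandas as pd

df = pd.DataFrame({'list_column': ['[1, 2, 3]', '[4, 5]']})
# literal_eval safely parses Python literals, unlike eval.
df['list_column'] = df['list_column'].apply(ast.literal_eval)
print(df['list_column'].iloc[0])  # [1, 2, 3]
```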
### profiler.py
DataFrame profiling and comparison utilities.
```python
from mb_pandas.profiler import create_profile, profile_compare

# Generate a profiling report
create_profile(df, 'report.html', target=['target_column'])

# Compare two DataFrames
profile_compare(df1, df2, 'comparison.html')
```

## Key Functions
### merge_chunk(df1, df2, chunksize=10000)
Merge two DataFrames in chunks to handle large datasets efficiently (the chunking pattern is sketched at the end of this section).

### merge_dask(df1, df2)
Merge two DataFrames using Dask for improved performance with large datasets.

### load_any_df(file_path, show_progress=True)
Load DataFrames from various file formats with progress tracking.

### convert_string_to_list(df, column)
Convert string representations of lists in a DataFrame column to actual lists.

### create_profile(df, profile_name='report.html')
Generate a detailed profiling report for a DataFrame.
## Error Handling
All functions include comprehensive error handling with descriptive messages:
```python
try:
    df = load_any_df('data.csv')
except ValueError as e:
    print(f"Error loading file: {e}")
```

## Logging
Most functions accept an optional logger parameter for operation tracking:
```python
import logging
logger = logging.getLogger()
df = load_any_df('data.csv', logger=logger)
```

## Performance Tips
1. Use `merge_chunk` for large merges whose data still fits in memory
2. Use `merge_dask` for very large datasets that benefit from distributed computing
3. Enable `show_progress=True` to monitor long-running operations
4. Use `minimal=True` in profiling for large datasets
5. Consider sampling large datasets before profiling (see the sketch below)
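Putting tips 4 and 5 together, a hedged sketch: whether `create_profile` accepts `minimal=True` directly is an assumption based on tip 4, and `n` and `random_state` are just illustrative values.
```python
from mb_pandas.profiler import create_profile

# Profile a random sample instead of the full frame to keep the
# report fast on large datasets.
sample = df.sample(n=100_000, random_state=42)
create_profile(sample, 'report.html', minimal=True)  # minimal=True assumed per tip 4
```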