# spinesUtils

*Accelerate your Python development workflow*

## Overview

**spinesUtils** is a powerful library that provides ready-to-use features and utilities for Python development to shorten the project development cycle. Our goal is to help developers focus on solving their core problems instead of reimplementing common functionality.

## Features

- [x] **Logging functionality** - High-performance logging tools with zero learning curve
- [x] **Type checking and parameter validation** - Robust validation decorators
- [x] **CSV file reading acceleration** - Performance-optimized data loading
- [x] **Imbalanced data classifiers** - Specialized ML tools for imbalanced datasets
- [x] **Pandas DataFrame data compression** - Memory optimization for large datasets
- [x] **DataFrame insight tools** - Quick data analysis and visualization
- [x] **Large data train-test splitting** - Efficient data partitioning for ML pipelines
- [x] **Intuitive timer** - Feature-rich yet easy-to-use precision timer

This library is currently undergoing rapid iteration. If you encounter any issues with its functionalities, feel free to [raise an issue](https://github.com/BirchKwok/spinesUtils/issues).

## Installation

You can install spinesUtils from PyPI:

```bash
pip install spinesUtils
```

## Usage Examples

### Logger

The Logger class provides convenient logging while avoiding handler conflicts with Python's built-in logging module.

```python
# Load the Logger class (also exported under the alias FastLogger)
from spinesUtils.logging import Logger

# Create a logger named "MyLogger" with no file handler.
# You can pass a file path via `fp`; if it is None, logs are not written to a file.
logger = Logger(name="MyLogger", fp=None, level="DEBUG")

logger.log("This is an info log emitted by the log function.", level='INFO')
logger.debug("This is a debug message.")
logger.info("This is an info message.")
logger.warning("This is a warning message.")
logger.error("This is an error message.")
logger.critical("This is a critical message.")
```

#### Performance Comparison

FastLogger vs Python's standard logging library (1 million messages, 20 threads):

| Metric | Standard logging | FastLogger | Improvement |
|--------|-----------------|------------|-------------|
| Total time (seconds) | 17.73 | 0.82 | 21.58x faster |
| Messages per second | 56,389 | 1,216,862 | 21.58x higher |
| Write speed (MB/s) | 6.94 | 14.04 | 2.02x faster |
| Average message size (bytes) | 129.00 | 12.10 | 10.66x smaller |
| Total log file size (MB) | 123.02 | 11.54 | 10.66x smaller |

*Test environment: MacBook Pro (Apple Silicon M1 Pro, 32GB RAM)*
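If you want to reproduce a throughput figure like the ones above, a minimal sketch using only the standard library is shown below. The function name `benchmark_std_logging` is hypothetical, and the numbers you get will vary with hardware, handler configuration, and message size:

```python
import logging
import time

def benchmark_std_logging(n_messages=10_000, path="bench.log"):
    """Time the standard logging module writing n_messages to a file
    and return the throughput in messages per second."""
    logger = logging.getLogger("bench")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, mode="w")
    logger.addHandler(handler)

    start = time.perf_counter()
    for i in range(n_messages):
        logger.info("message %d", i)
    handler.close()
    logger.removeHandler(handler)
    elapsed = time.perf_counter() - start

    return n_messages / elapsed

print(f"{benchmark_std_logging():,.0f} msg/s")
```

Running the same loop against FastLogger gives you a like-for-like comparison on your own machine.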

### Type Checking and Parameter Validation

Ensure your functions receive the correct input types and values:

```python
from spinesUtils.asserts import ParameterTypeAssert, ParameterValuesAssert, generate_function_kwargs

# Check parameter types
@ParameterTypeAssert({
    'a': (int, float),
    'b': (int, float)
})
def add(a, b):
    return a + b

# Check parameter values
@ParameterValuesAssert({
    'a': lambda x: x > 0,
    'b': lambda x: x > 0
})
def divide(a, b):
    return a / b

# Build the keyword arguments that a call to `add` would bind
params = generate_function_kwargs(add, a=1, b=2)
```
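To see how a decorator in this style works under the hood, here is a minimal stdlib-only sketch of the technique. The name `type_assert` is hypothetical and this is not the library's actual implementation:

```python
import inspect
from functools import wraps

def type_assert(expected):
    """Minimal parameter-type-checking decorator: `expected` maps
    parameter names to a type or tuple of types, and a TypeError is
    raised before the wrapped function runs if any argument mismatches."""
    def decorator(func):
        sig = inspect.signature(func)

        @wraps(func)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            for name, value in bound.arguments.items():
                if name in expected and not isinstance(value, expected[name]):
                    raise TypeError(
                        f"{func.__name__}(): parameter {name!r} expects "
                        f"{expected[name]}, got {type(value).__name__}"
                    )
            return func(*args, **kwargs)
        return wrapper
    return decorator

@type_assert({'a': (int, float), 'b': (int, float)})
def add(a, b):
    return a + b

print(add(1, 2))   # 3
# add('1', 2) would raise TypeError
```

Binding arguments via `inspect.signature` means positional and keyword calls are both validated against the same parameter names.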

### CSV Reading Acceleration

Read large CSV files efficiently:

```python
from spinesUtils import read_csv

df = read_csv(
    fp='/path/to/your/file.csv',
    sep=',',                  # same as pandas read_csv's sep
    turbo_method='polars',    # backend used to speed up load time
    chunk_size=None,          # can be an integer when using the pandas backend
    transform2low_mem=True,   # compress dtypes to save memory
    verbose=False
)
```
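The `chunk_size` idea — processing a file too large for memory in fixed-size pieces — can be sketched with the standard library alone. The helper name `iter_csv_chunks` is hypothetical, not part of spinesUtils:

```python
import csv

def iter_csv_chunks(fp, chunk_size=10_000, sep=","):
    """Yield lists of at most chunk_size row-dicts each, so a CSV file
    larger than memory can be processed incrementally."""
    with open(fp, newline="") as f:
        reader = csv.DictReader(f, delimiter=sep)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

# Usage: process each chunk independently
# for chunk in iter_csv_chunks("/path/to/your/file.csv", chunk_size=100_000):
#     ...
```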

### Classifiers for Imbalanced Data

Handle imbalanced datasets effectively:

```python
from spinesUtils.models import MultiClassBalanceClassifier
from sklearn.ensemble import RandomForestClassifier

classifier = MultiClassBalanceClassifier(
    base_estimator=RandomForestClassifier(n_estimators=100),
    n_classes=3,
    random_state=0,
    verbose=0
)

# Fit and predict as you would with any scikit-learn estimator
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
```
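One common technique behind balanced classifiers is random oversampling: duplicating minority-class samples until every class matches the largest one. The sketch below illustrates that idea with the standard library only; `oversample` is a hypothetical helper, not the algorithm MultiClassBalanceClassifier actually uses:

```python
import random
from collections import Counter

def oversample(X, y, seed=0):
    """Randomly duplicate minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 1, 2]
X_b, y_b = oversample(X, y)
print(Counter(y_b))   # each class now has 4 samples
```

Oversampling keeps all original data but can overfit duplicated points; class weighting is the usual alternative.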

### DataFrame Data Compression

Optimize memory usage for large DataFrames:

```python
from spinesUtils import transform_dtypes_low_mem

# Compress a single DataFrame
transform_dtypes_low_mem(df, verbose=True, inplace=True)

# Batch compress multiple DataFrames
from spinesUtils import transform_batch_dtypes_low_mem
transform_batch_dtypes_low_mem([df1, df2, df3, df4], verbose=True, inplace=True)
```
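The core idea of this kind of compression is downcasting each column to the narrowest dtype that can hold its value range. A minimal sketch of that decision for signed integers, with a hypothetical helper name (not the library's code):

```python
def smallest_int_dtype(lo, hi):
    """Return the name of the narrowest signed integer dtype that can
    represent every value in [lo, hi]."""
    for name, bits in (("int8", 8), ("int16", 16), ("int32", 32), ("int64", 64)):
        if -(2 ** (bits - 1)) <= lo and hi < 2 ** (bits - 1):
            return name
    raise ValueError("range too wide for a 64-bit integer")

print(smallest_int_dtype(0, 200))      # int16
print(smallest_int_dtype(-100, 100))   # int8
```

Applied columnwise, an int64 column whose values fit in int8 shrinks to one eighth of its original memory footprint.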

### DataFrame Insight Tools

Quickly analyze your data:

```python
from spinesUtils import df_preview, classify_samples_dist

# Get comprehensive DataFrame insights
df_insight = df_preview(df)
```

### Data Splitting Utilities

Efficiently split large datasets:

```python
from spinesUtils import train_test_split_bigdata, train_test_split_bigdata_df
from spinesUtils.feature_tools import get_x_cols

# Return numpy arrays
X_train, X_valid, X_test, y_train, y_valid, y_test = train_test_split_bigdata(
    df=df,
    x_cols=get_x_cols(df, y_col='target_column'),
    y_col='target_column',
    shuffle=True,
    return_valid=True,
    train_size=0.8,
    valid_size=0.5
)

# Return pandas DataFrames
train_df, valid_df, test_df = train_test_split_bigdata_df(
    df=df,
    x_cols=get_x_cols(df, y_col='target_column'),
    y_col='target_column',
    shuffle=True,
    return_valid=True,
    train_size=0.8,
    valid_size=0.5
)
```
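The three-way split semantics above (`train_size` taken first, then `valid_size` splitting the remainder) can be sketched with index slicing and the standard library. `split_indices` is a hypothetical helper, not the library's implementation:

```python
import random

def split_indices(n, train_size=0.8, valid_size=0.5, shuffle=True, seed=0):
    """Return (train, valid, test) index lists: train_size of all rows
    go to train, and the remainder is split by valid_size."""
    idx = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(idx)
    n_train = int(n * train_size)
    rest = idx[n_train:]
    n_valid = int(len(rest) * valid_size)
    return idx[:n_train], rest[:n_valid], rest[n_valid:]

train, valid, test = split_indices(100)
print(len(train), len(valid), len(test))   # 80 10 10
```

Working with index lists rather than copies of the data is what keeps this approach cheap for large datasets.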

### Timer Utility

Time your code execution simply:

```python
from spinesUtils.timer import Timer

# As a context manager
with Timer().session() as t:
    # Your code here
    t.sleep(1)
    print(f"Step 1 time: {t.last_timestamp_diff():.2f}s")

    # Mark a middle point
    t.middle_point()

    # More code
    t.sleep(2)
    print(f"Step 2 time: {t.last_timestamp_diff():.2f}s")

print(f"Total time: {t.total_elapsed_time():.2f}s")

# Or use it manually
timer = Timer()
timer.start()
# Your code here
timer.end()
print(f"Elapsed: {timer.total_elapsed_time():.2f}s")
```