https://github.com/johntocci/nullaxe
Nullaxe is a powerful and user-friendly Python library designed for cleaning and preprocessing data. It works seamlessly with both pandas and polars DataFrames, making it a versatile tool for data scientists and developers.
data data-analysis data-science datacleaning pandas polars python
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/johntocci/nullaxe
- Owner: JohnTocci
- License: mit
- Created: 2025-09-07T20:17:02.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-10-13T00:22:07.000Z (8 months ago)
- Last Synced: 2025-12-15T14:04:39.726Z (6 months ago)
- Topics: data, data-analysis, data-science, datacleaning, pandas, polars, python
- Language: Python
- Homepage:
- Size: 201 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Nullaxe
**Nullaxe** is a comprehensive, high-performance data cleaning and preprocessing library for Python, designed to work seamlessly with both **pandas** and **polars** DataFrames. With its intuitive, chainable API, Nullaxe transforms the traditionally tedious process of data cleaning into an elegant, readable workflow.
---
## Key Features
- **Fluent, Chainable API**: Clean your data in a single, readable chain of commands
- **Dual Backend Support**: Works effortlessly with both pandas and polars DataFrames
- **Comprehensive Cleaning**: From basic cleaning to advanced data extraction and transformation
- **Display Formatting Pipeline**: Format columns for presentation (currency, percentages, thousands separators, date formatting, truncation, title-cased headers)
- **Intelligent Outlier Detection**: Multiple methods including IQR and Z-score analysis
- **Advanced Data Extraction**: Extract emails, phone numbers, and custom patterns with regex
- **Smart Type Handling**: Automatic type inference and standardization
- **Performance Optimized**: Designed for speed and memory efficiency
- **Extensible**: Easily add custom cleaning functions to your pipeline
---
## Installation
Install Nullaxe easily with pip:
```bash
pip install nullaxe
```
**Requirements:**
- Python 3.8+
- pandas >= 1.0
- polars >= 0.19
---
## Quick Start
Here's how to transform messy data into clean, analysis-ready datasets:
```python
import pandas as pd
import nullaxe as nlx

# Create a messy sample dataset
data = {
    'First Name': [' John ', 'Jane', ' Peter', 'JOHN', None],
    'Last Name': ['Smith', 'Doe', 'Jones', 'Smith', 'Brown'],
    'Age': [28, 34, None, 28, 45],
    'Email': ['john@email.com', 'invalid-email', 'peter@test.org', 'john@email.com', None],
    'Phone': ['123-456-7890', '(555) 123-4567', 'not-a-phone', '123.456.7890', '+1-800-555-0199'],
    'Salary': ['$70,000', '80000', '$65,000.50', '$70,000', '€75,000'],
    'Active': ['True', 'False', 'yes', 'TRUE', 'N'],
    'Notes': [' Important client ', '', ' Follow up ', None, 'VIP']
}
df = pd.DataFrame(data)

# Clean the entire dataset with a single chain
clean_df = (
    nlx(df)
    .clean_column_names()              # Standardize column names
    .fill_missing(value='Unknown')     # Fill missing values
    .remove_whitespace()               # Strip leading/trailing whitespace
    .remove_duplicates()               # Remove duplicate rows
    .standardize_booleans()            # Convert boolean-like values
    .extract_email()                   # Extract email addresses
    .extract_phone_numbers()           # Extract phone numbers
    .extract_and_clean_numeric()       # Extract numeric values from strings
    .drop_single_value_columns()       # Remove columns with only one value
    .remove_outliers(method='iqr')     # Remove outlier rows (IQR method)
    .format_for_display(               # Format for presentation
        rules={
            'salary': {'type': 'currency', 'symbol': '$', 'decimals': 2},
            'age': {'type': 'thousands'},
        },
        column_case='title'
    )
    .to_df()                           # Return the cleaned, formatted DataFrame
)

print(clean_df.head())
```
---
## Complete API Reference
### Initialization
```python
import nullaxe as nlx
# Initialize with any DataFrame
cleaner = nlx(df) # Works with pandas or polars DataFrames
```
### Column Name Standardization
Transform column names to consistent formats:
```python
# General column cleaning with case conversion
.clean_column_names(case='snake') # Options: 'snake', 'camel', 'pascal', 'kebab', 'title', 'lower', 'screaming_snake'
# Specific case conversions
.snakecase() # column_name
.camelcase() # columnName
.pascalcase() # ColumnName
.kebabcase() # column-name
.titlecase() # Column Name
.lowercase() # column name
.screaming_snakecase() # COLUMN_NAME
```
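As a rough illustration of what these conversions produce, the sketch below splits a header into words and rejoins them in each listed style. It is plain Python, not Nullaxe's implementation: `split_words` and `to_case` are hypothetical helpers, and the library may treat edge cases (acronyms, digits, symbols) differently.

```python
import re
from typing import List


def split_words(name: str) -> List[str]:
    """Split a header such as ' First Name ' or 'firstName' into lowercase words."""
    # Insert a space at lower-to-upper boundaries so 'firstName' becomes 'first Name'.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Treat any run of non-alphanumeric characters as a separator.
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]


def to_case(name: str, case: str = "snake") -> str:
    """Rejoin the words of `name` in the requested case style."""
    words = split_words(name)
    if case == "snake":
        return "_".join(words)
    if case == "kebab":
        return "-".join(words)
    if case == "screaming_snake":
        return "_".join(words).upper()
    if case == "camel":
        return words[0] + "".join(w.capitalize() for w in words[1:]) if words else ""
    if case == "pascal":
        return "".join(w.capitalize() for w in words)
    if case == "title":
        return " ".join(w.capitalize() for w in words)
    if case == "lower":
        return " ".join(words)
    raise ValueError(f"unknown case: {case!r}")


print(to_case(" First Name ", "snake"))   # first_name
print(to_case("firstName", "pascal"))     # FirstName
```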
### Data Deduplication
Remove duplicate data efficiently:
```python
.remove_duplicates() # Remove duplicate rows across all columns
```
### Missing Data Management
Handle missing values with precision:
```python
# Fill missing values
.fill_missing(value=0) # Fill all columns with 0
.fill_missing(value='Unknown', subset=['name']) # Fill specific columns
# Drop missing values
.drop_missing() # Drop rows with any missing values
.drop_missing(how='all') # Drop rows where all values are missing
.drop_missing(thresh=3) # Keep rows with at least 3 non-null values
.drop_missing(axis='columns') # Drop columns with missing values
.drop_missing(subset=['name', 'email']) # Consider only specific columns
```
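The `how`, `thresh`, and `subset` options resemble those of pandas' `dropna`. The sketch below shows one plausible reading of their row-wise semantics on a list of dictionaries. `drop_missing_rows` is a hypothetical stand-in, and treating the options exactly like pandas does is an assumption, not something the library documentation above confirms.

```python
def drop_missing_rows(rows, how="any", thresh=None, subset=None):
    """Filter rows by missing values (None), loosely following pandas' dropna rules."""
    kept = []
    for row in rows:
        # Only the columns in `subset` are inspected; default is every column.
        cols = subset if subset is not None else list(row)
        non_null = sum(row.get(c) is not None for c in cols)
        if thresh is not None:
            keep = non_null >= thresh          # keep rows with at least `thresh` values
        elif how == "all":
            keep = non_null > 0                # drop only if every inspected value is missing
        else:
            keep = non_null == len(cols)       # how="any": drop if any value is missing
        if keep:
            kept.append(row)
    return kept


rows = [
    {"name": "Jane", "email": "jane@example.com", "age": 34},
    {"name": "Peter", "email": None, "age": None},
    {"name": None, "email": None, "age": None},
]

print(len(drop_missing_rows(rows)))                   # 1
print(len(drop_missing_rows(rows, how="all")))        # 2
print(len(drop_missing_rows(rows, thresh=2)))         # 1
print(len(drop_missing_rows(rows, subset=["name"])))  # 2
```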
### Text and Whitespace Cleaning
Clean and standardize text data:
```python
.remove_whitespace() # Remove leading/trailing whitespace
.replace_text('old', 'new') # Replace text in all columns
.replace_text('old', 'new', subset=['name']) # Replace in specific columns
.remove_punctuation() # Remove punctuation marks
.remove_punctuation(subset=['description']) # Remove from specific columns
```
### Column Management
Manage DataFrame structure:
```python
.drop_single_value_columns() # Remove columns with only one unique value
.remove_unwanted_rows_and_cols() # Remove rows/cols with unwanted values
.remove_unwanted_rows_and_cols(   # Custom unwanted values
    unwanted_values=['', 'N/A', 'NULL']
)
```
### Outlier Detection and Handling
Sophisticated outlier management:
```python
# General outlier handling
.handle_outliers() # Default: IQR method, factor=1.5
.handle_outliers(method='zscore', factor=2.0) # Z-score method
.handle_outliers(subset=['salary', 'age']) # Specific columns only
# Cap outliers (replace with threshold values)
.cap_outliers() # Cap using IQR method
.cap_outliers(method='zscore', factor=2.5) # Cap using Z-score
# Remove outlier rows entirely
.remove_outliers() # Remove rows with outliers
.remove_outliers(method='iqr', factor=1.5) # Custom parameters
```
**Outlier Detection Methods:**
- **IQR (Interquartile Range)**: `Q1 - factor*IQR` to `Q3 + factor*IQR`
- **Z-Score**: Values beyond `factor` standard deviations from the mean
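
To make the two definitions concrete, here is a small standard-library sketch that computes both sets of bounds and then either removes or caps values outside them. The helper names are hypothetical, and details such as the quartile interpolation method and the use of the population standard deviation are assumptions; Nullaxe may compute them differently.

```python
import statistics


def iqr_bounds(values, factor=1.5):
    """Return (lower, upper) fences: Q1 - factor*IQR and Q3 + factor*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def zscore_bounds(values, factor=3.0):
    """Return (lower, upper) limits at `factor` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std; the sample std is an equally valid choice
    return mean - factor * std, mean + factor * std


def drop_outside(values, lower, upper):
    """Remove values outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


def cap_to(values, lower, upper):
    """Clamp values to [lower, upper] instead of removing them."""
    return [min(max(v, lower), upper) for v in values]


salaries = [50, 52, 54, 56, 58, 60, 62, 64, 500]

low, high = iqr_bounds(salaries)
print(low, high)                          # 42.0 74.0
print(drop_outside(salaries, low, high))  # the 500 entry is removed
print(cap_to(salaries, low, high)[-1])    # 74.0

# The extreme value inflates the mean and standard deviation, so with
# factor=3.0 it still falls inside the Z-score limits; factor=2.0 flags it.
print(zscore_bounds(salaries, factor=3.0)[1] > 500)  # True
print(zscore_bounds(salaries, factor=2.0)[1] > 500)  # False
```

The last two lines show why the choice of method and `factor` matters on small samples: a single extreme value can mask itself under the Z-score rule while the IQR fences still catch it.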
### Data Type Standardization
Convert and standardize data types:
```python
# Boolean standardization
.standardize_booleans() # Convert 'yes/no', 'true/false', etc.
.standardize_booleans(
    true_values=['yes', 'y', '1', 'true'],    # Custom true values
    false_values=['no', 'n', '0', 'false'],   # Custom false values
    columns=['active', 'verified']            # Specific columns
)
```
**Default Boolean Mappings:**
- **True**: 'true', '1', 't', 'yes', 'y', 'on'
- **False**: 'false', '0', 'f', 'no', 'n', 'off'
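
A minimal sketch of how such a mapping could be applied to individual values, using the default lists above. `to_bool` is a hypothetical helper; case-insensitive matching and passing unrecognized values through unchanged are assumptions about the behavior, not confirmed details of `standardize_booleans`.

```python
TRUE_VALUES = ("true", "1", "t", "yes", "y", "on")
FALSE_VALUES = ("false", "0", "f", "no", "n", "off")


def to_bool(value, true_values=TRUE_VALUES, false_values=FALSE_VALUES):
    """Map a boolean-like value to True/False; return it unchanged if unrecognized."""
    if isinstance(value, bool):
        return value
    # Normalize both the input and the lookup lists so 'Yes', ' YES ' and 'yes' all match.
    key = str(value).strip().lower()
    if key in {str(v).lower() for v in true_values}:
        return True
    if key in {str(v).lower() for v in false_values}:
        return False
    return value


print([to_bool(v) for v in ["True", "False", "yes", "TRUE", "N"]])
# [True, False, True, True, False]
print(to_bool("Agree", true_values=["Agree"], false_values=["Disagree"]))  # True
print(to_bool("maybe"))  # maybe
```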
### Advanced Data Extraction
Extract structured data from unstructured text:
```python
# Email extraction
.extract_email() # Extract emails from all columns
.extract_email(subset=['contact_info']) # From specific columns
# Phone number extraction
.extract_phone_numbers() # Extract phone numbers
.extract_phone_numbers(subset=['contact']) # From specific columns
# Numeric data extraction and cleaning
.extract_and_clean_numeric() # Extract numbers from text
.extract_and_clean_numeric(subset=['prices']) # From specific columns
# Custom regex extraction (interactive)
.extract_with_regex() # Prompts for regex pattern
.extract_with_regex(subset=['text_column']) # From specific columns
# Combined numeric cleaning
.clean_numeric() # Extract + outlier handling
.clean_numeric(method='zscore', factor=2.0) # Custom outlier parameters
```
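For intuition, the sketch below shows the kind of regex-based extraction these methods describe, applied to values from the Quick Start dataset. The patterns and helper names are illustrative assumptions, not the ones Nullaxe uses; the numeric helper also assumes commas are thousands separators. Phone numbers are left out because their formats vary too much for a short example.

```python
import re

# A deliberately simple email pattern; real-world validation is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def find_email(text):
    """Return the first email-like substring, or None if there is none."""
    if text is None:
        return None
    match = EMAIL_RE.search(str(text))
    return match.group(0) if match else None


def find_number(text):
    """Return the first number in a string such as '$65,000.50' as a float, or None."""
    if text is None:
        return None
    cleaned = str(text).replace(",", "")  # assumes ',' is a thousands separator
    match = NUMBER_RE.search(cleaned)
    return float(match.group(0)) if match else None


print(find_email("Contact: peter@test.org today"))  # peter@test.org
print(find_email("invalid-email"))                  # None
print(find_number("$65,000.50"))                    # 65000.5
print(find_number("€75,000"))                       # 75000.0
```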
### Display / Presentation Formatting (NEW in 0.3.0)
Format cleaned data for reports, dashboards, exports:
```python
.format_for_display(
    rules={
        'price': {'type': 'currency', 'symbol': '$', 'decimals': 2},
        'growth': {'type': 'percentage', 'decimals': 1},
        'volume': {'type': 'thousands'},
        'description': {'type': 'truncate', 'length': 30},
        'event_date': {'type': 'datetime', 'format': '%B %d, %Y'}
    },
    column_case='title'  # or None to preserve original column names
)
```
Supported rule types:
- `currency`: symbol + thousands + decimal precision
- `percentage`: multiplies by 100 + suffix `%`
- `thousands`: adds thousands separators, removes trailing `.0` for whole floats
- `truncate`: shortens long text and appends `...`
- `datetime`: parses and formats date/time strings
You can also call the function directly:
```python
from nullaxe.functions import format_for_display
formatted = format_for_display(df, rules=..., column_case='title')
```
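The rule types can be approximated with Python's built-in string formatting. The sketch below applies one rule to one value and is only an approximation of the described behavior: `format_value` is hypothetical, the default values are guesses, the `input_format` key is invented for the example, and whether `length` counts the trailing `...` is an assumption.

```python
from datetime import datetime


def format_value(value, rule):
    """Format a single value according to a rule dict like those passed to format_for_display."""
    kind = rule["type"]
    if kind == "currency":
        decimals = rule.get("decimals", 2)
        return f"{rule.get('symbol', '$')}{value:,.{decimals}f}"
    if kind == "percentage":
        decimals = rule.get("decimals", 1)
        return f"{value * 100:.{decimals}f}%"
    if kind == "thousands":
        # Whole floats lose their trailing '.0'; other floats keep their decimals.
        if float(value).is_integer():
            return f"{int(value):,}"
        return f"{value:,}"
    if kind == "truncate":
        length = rule.get("length", 30)
        return value if len(value) <= length else value[:length] + "..."
    if kind == "datetime":
        # 'input_format' is a hypothetical key: the sketch needs to know how to parse the string.
        parsed = datetime.strptime(value, rule.get("input_format", "%Y-%m-%d"))
        return parsed.strftime(rule["format"])
    raise ValueError(f"unknown rule type: {kind!r}")


print(format_value(1234.5, {"type": "currency", "symbol": "$", "decimals": 2}))  # $1,234.50
print(format_value(0.256, {"type": "percentage", "decimals": 1}))                # 25.6%
print(format_value(1234567.0, {"type": "thousands"}))                            # 1,234,567
print(format_value("abcdefghij", {"type": "truncate", "length": 5}))             # abcde...
print(format_value("2025-09-07", {"type": "datetime", "format": "%B %d, %Y"}))   # September 07, 2025
```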
### Output
```python
.to_df() # Return the cleaned DataFrame
```
---
## Advanced Usage Examples
### Real-World Data Cleaning Pipeline
```python
import pandas as pd
import nullaxe as nlx

# Load messy customer data
df = pd.read_csv('messy_customer_data.csv')

# Comprehensive cleaning + formatting pipeline
clean_customers = (
    nlx(df)
    .clean_column_names(case='snake')
    .fill_missing(value='Not Provided')
    .remove_whitespace()
    .standardize_booleans(columns=['is_active', 'newsletter_opt_in'])
    .extract_email(subset=['contact_info'])
    .extract_phone_numbers(subset=['contact_info'])
    .extract_and_clean_numeric(subset=['revenue', 'age'])
    .remove_outliers(method='iqr', factor=2.0, subset=['revenue'])
    .drop_single_value_columns()
    .remove_duplicates()
    .format_for_display(
        rules={
            'revenue': {'type': 'currency', 'symbol': '$', 'decimals': 2},
            'age': {'type': 'thousands'},
            'signup_date': {'type': 'datetime', 'format': '%Y-%m-%d'}
        },
        column_case='title'
    )
    .to_df()
)
```
### Financial Data Processing
```python
financial_clean = (
    nlx(transactions_df)
    .clean_column_names(case='snake')
    .fill_missing(value=0, subset=['amount'])
    .extract_and_clean_numeric(subset=['amount', 'fee'])
    .standardize_booleans(columns=['is_recurring'])
    .cap_outliers(method='zscore', factor=3.0, subset=['amount'])
    .remove_whitespace()
    .format_for_display(
        rules={'amount': {'type': 'currency', 'symbol': '$', 'decimals': 2}},
        column_case='title'
    )
    .to_df()
)
```
### Survey Data Standardization
```python
survey_clean = (
    nlx(survey_df)
    .clean_column_names(case='snake')
    .standardize_booleans(
        true_values=['Yes', 'Y', 'Agree', 'True', '1'],
        false_values=['No', 'N', 'Disagree', 'False', '0']
    )
    .fill_missing(value='No Response')
    .remove_whitespace()
    .drop_single_value_columns()
    .format_for_display(
        rules={'age': {'type': 'thousands'}},
        column_case='title'
    )
    .to_df()
)
```
---
## Method Chaining Benefits
Nullaxe's chainable API provides several advantages:
1. **Readability**: Each step is clear and self-documenting
2. **Maintainability**: Easy to add, remove, or reorder operations
3. **Performance**: Optimized internal operations reduce memory overhead
4. **Flexibility**: Mix and match operations based on your data's needs
```python
# Traditional approach (verbose and hard to follow)
df = remove_duplicates(df)
df = fill_missing(df, value='Unknown')
df = standardize_booleans(df)
df = remove_outliers(df, method='iqr')
# Nullaxe approach (clean and readable)
df = (nlx(df)
      .remove_duplicates()
      .fill_missing(value='Unknown')
      .standardize_booleans()
      .remove_outliers(method='iqr')
      .format_for_display(rules={'value': {'type': 'currency'}}, column_case='title')
      .to_df())
```
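The chaining pattern itself is simple: every cleaning method changes the wrapped data and returns the wrapper, and a final call hands the data back. The toy class below illustrates this on a list of dictionaries. `Cleaner` and `to_rows` are hypothetical names for this sketch, which does not reflect how the `Nullaxe` class is written or how it handles DataFrames.

```python
class Cleaner:
    """Toy chainable wrapper over a list of row dicts (illustration only)."""

    def __init__(self, rows):
        self._rows = [dict(r) for r in rows]  # copy so the caller's data is not mutated

    def remove_whitespace(self):
        for row in self._rows:
            for key, value in row.items():
                if isinstance(value, str):
                    row[key] = value.strip()
        return self  # returning self is what makes chaining possible

    def fill_missing(self, value):
        for row in self._rows:
            for key in row:
                if row[key] is None:
                    row[key] = value
        return self

    def remove_duplicates(self):
        seen, unique = set(), []
        for row in self._rows:
            fingerprint = tuple(sorted(row.items()))  # order-independent row identity
            if fingerprint not in seen:
                seen.add(fingerprint)
                unique.append(row)
        self._rows = unique
        return self

    def to_rows(self):
        """End the chain and return the cleaned rows."""
        return self._rows


raw = [{"name": " John ", "city": None}, {"name": "John", "city": None}]
result = (
    Cleaner(raw)
    .remove_whitespace()
    .fill_missing("Unknown")
    .remove_duplicates()
    .to_rows()
)
print(result)  # [{'name': 'John', 'city': 'Unknown'}]
```

Note that step order matters: here the duplicate is only detected because whitespace is stripped first.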
---
## Performance Tips
1. **Use polars for large datasets** - Nullaxe automatically optimizes for polars' performance
2. **Chain operations efficiently** - Nullaxe minimizes intermediate copies
3. **Specify subsets** - Process only the columns you need
4. **Choose an outlier method that fits the data** - IQR is robust to extreme values; Z-score works best on roughly normally distributed data
```python
# Performance-optimized pipeline
result = (
    nlx(large_df)
    .remove_duplicates()
    .drop_single_value_columns()
    .fill_missing(value=0, subset=['numeric_cols'])
    .remove_outliers(method='iqr', subset=['revenue'])
    .format_for_display(rules={'revenue': {'type': 'currency'}}, column_case=None)
    .to_df()
)
```
---
## Testing and Quality Assurance
Nullaxe includes comprehensive test coverage with 118+ test cases covering:
- pandas and polars compatibility
- Edge cases and error handling
- Performance optimization
- Data integrity preservation
- Type safety and validation
- Presentation formatting (currency, percentage, thousands, truncation, datetime, column casing)
Run tests locally:
```bash
git clone https://github.com/johntocci/nullaxe
cd nullaxe
pip install -e ".[dev]"
pytest tests/
```
---
## Contributing
We welcome contributions! Nullaxe is designed to be extensible and community-driven.
### How to Contribute
1. **Fork the repository** on GitHub
2. **Create a feature branch**: `git checkout -b feature/amazing-feature`
3. **Add your changes** with comprehensive tests
4. **Follow the coding standards** (black formatting, type hints)
5. **Run the test suite**: `pytest tests/`
6. **Submit a pull request** with a clear description
### Development Setup
```bash
# Clone and setup development environment
git clone https://github.com/johntocci/nullaxe
cd nullaxe
pip install -e ".[dev]"
# Run tests
pytest tests/
# Format code
black src/ tests/
```
### Adding New Functions
Nullaxe's modular architecture makes it easy to add new cleaning functions:
1. Create your function in `src/nullaxe/functions/`
2. Add it to the imports in `src/nullaxe/functions/__init__.py`
3. Add a corresponding method to the `Nullaxe` class
4. Write comprehensive tests in `tests/`
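
The steps above can be sketched roughly as follows. Everything here is hypothetical: `mask_values`, the `Pipeline` stand-in class, and the use of lists of dictionaries instead of DataFrames are simplifications chosen to keep the example self-contained, and the real `Nullaxe` class may wire functions in differently.

```python
# Step 1: a standalone function (it would live in a module under src/nullaxe/functions/).
def mask_values(rows, column, mask="***"):
    """Return new rows with every non-null value in `column` replaced by `mask`."""
    masked = []
    for row in rows:
        new_row = dict(row)
        if new_row.get(column) is not None:
            new_row[column] = mask
        masked.append(new_row)
    return masked


# Step 2 would re-export it from the package, e.g. `from .masking import mask_values`
# in functions/__init__.py (the module name `masking` is made up for this sketch).


# Step 3: a thin method that delegates to the function and returns self for chaining.
class Pipeline:  # stand-in for the Nullaxe class
    def __init__(self, rows):
        self._rows = rows

    def mask_values(self, column, mask="***"):
        self._rows = mask_values(self._rows, column, mask)
        return self

    def to_rows(self):
        return self._rows


# Step 4: a pytest-style test covering the normal case and the null case.
def test_mask_values_keeps_nulls():
    rows = [{"email": "a@b.com"}, {"email": None}]
    result = Pipeline(rows).mask_values("email").to_rows()
    assert result == [{"email": "***"}, {"email": None}]
    assert rows[0]["email"] == "a@b.com"  # the input rows are left untouched
```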
---
## Changelog
- Migration from `sanex`: replace `import sanex as sx` with `import nullaxe as nlx`, and `sx(` with `nlx(`
### Version 0.3.0
- Added `format_for_display` function + chain method for presentation formatting
- Added support for currency, percentage, thousands, truncate, datetime formatting
- Title-case header option integrated into formatting step
- Refactored internal formatting for pandas + polars parity
- Expanded test suite (now 118+ tests) including display formatting
- Improved thousands formatting (no trailing .0 on whole floats)
### Version 0.2.0
- Added comprehensive data extraction capabilities
- Enhanced outlier detection with multiple methods
- Improved text processing and punctuation removal
- Fixed boolean standardization edge cases
- Resolved missing data handling in complex workflows
- Performance optimizations for large datasets
- Comprehensive documentation updates
### Version 0.1.0
- Initial release with core cleaning functionality
- Chainable API implementation
- pandas and polars support
---
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
## Acknowledgments
- Built with love for the data science community
- Inspired by the need for simple, powerful data cleaning tools
- Thanks to all contributors and users who help improve Nullaxe
---
**Made with love by [John Tocci](https://github.com/johntocci)**
[Star us on GitHub](https://github.com/johntocci/nullaxe) | [Report Issues](https://github.com/johntocci/nullaxe/issues) | [Request Features](https://github.com/johntocci/nullaxe/issues)