{"id":34039478,"url":"https://github.com/johntocci/nullaxe","last_synced_at":"2026-04-06T01:31:12.500Z","repository":{"id":314020359,"uuid":"1052303886","full_name":"JohnTocci/Nullaxe","owner":"JohnTocci","description":"Nullaxe is a powerful and user-friendly Python library designed for cleaning and preprocessing data. It works seamlessly with both pandas and polars DataFrames, making it a versatile tool for data scientists and developers. ","archived":false,"fork":false,"pushed_at":"2025-10-13T00:22:07.000Z","size":206,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-15T14:04:39.726Z","etag":null,"topics":["data","data-analysis","data-science","datacleaning","pandas","polars","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JohnTocci.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-07T20:17:02.000Z","updated_at":"2025-10-12T03:03:52.000Z","dependencies_parsed_at":"2025-09-10T05:11:04.430Z","dependency_job_id":"1e47cc45-9f74-4aa3-888c-988b55e0a7b6","html_url":"https://github.com/JohnTocci/Nullaxe","commit_stats":null,"previous_names":["johntocci/nullaxe","johntocci/sanex"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/JohnTocci/Nullaxe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JohnTocci%2FNullaxe","tags_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JohnTocci%2FNullaxe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JohnTocci%2FNullaxe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JohnTocci%2FNullaxe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JohnTocci","download_url":"https://codeload.github.com/JohnTocci/Nullaxe/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JohnTocci%2FNullaxe/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31456591,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T21:22:52.476Z","status":"ssl_error","status_checked_at":"2026-04-05T21:22:51.943Z","response_time":75,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-analysis","data-science","datacleaning","pandas","polars","python"],"created_at":"2025-12-13T21:40:12.698Z","updated_at":"2026-04-06T01:31:12.492Z","avatar_url":"https://github.com/JohnTocci.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eNullaxe\u003c/h1\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n[![PyPI version](https://img.shields.io/pypi/v/nullaxe.svg)](https://pypi.org/project/nullaxe/)\n[![Python 
3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n\u003c/div\u003e\n\n**Nullaxe** is a comprehensive, high-performance data cleaning and preprocessing library for Python, designed to work seamlessly with both **pandas** and **polars** DataFrames. With its intuitive, chainable API, Nullaxe transforms the traditionally tedious process of data cleaning into an elegant, readable workflow.\n\n---\n\n## Key Features\n\n- **Fluent, Chainable API**: Clean your data in a single, readable chain of commands\n- **Dual Backend Support**: Works effortlessly with both pandas and polars DataFrames\n- **Comprehensive Cleaning**: From basic cleaning to advanced data extraction and transformation\n- **Display Formatting Pipeline**: Format columns for presentation (currency, percentages, thousands separators, date formatting, truncation, title-cased headers)\n- **Intelligent Outlier Detection**: Multiple methods including IQR and Z-score analysis\n- **Advanced Data Extraction**: Extract emails, phone numbers, and custom patterns with regex\n- **Smart Type Handling**: Automatic type inference and standardization\n- **Performance Optimized**: Designed for speed and memory efficiency\n- **Extensible**: Easily add custom cleaning functions to your pipeline\n\n---\n\n## Installation\n\nInstall Nullaxe easily with pip:\n\n```bash\npip install nullaxe\n```\n\n**Requirements:**\n- Python 3.8+\n- pandas \u003e= 1.0\n- polars \u003e= 0.19\n\n---\n\n## Quick Start\n\nHere's how to transform messy data into clean, analysis-ready datasets:\n\n```python\nimport pandas as pd\nimport nullaxe as nlx\n\n# Create a messy sample dataset\ndata = {\n    'First Name': ['  John  ', 'Jane', '  Peter', 'JOHN', None],\n    'Last Name': ['Smith', 
'Doe', 'Jones', 'Smith', 'Brown'],\n    'Age': [28, 34, None, 28, 45],\n    'Email': ['john@email.com', 'invalid-email', 'peter@test.org', 'john@email.com', None],\n    'Phone': ['123-456-7890', '(555) 123-4567', 'not-a-phone', '123.456.7890', '+1-800-555-0199'],\n    'Salary': ['$70,000', '80000', '$65,000.50', '$70,000', '€75,000'],\n    'Active': ['True', 'False', 'yes', 'TRUE', 'N'],\n    'Notes': ['  Important client  ', '', '   Follow up   ', None, 'VIP']\n}\ndf = pd.DataFrame(data)\n\n# Clean the entire dataset with a single chain\nclean_df = (\n    nlx(df)\n    .clean_column_names()                    # Standardize column names\n    .fill_missing(value='Unknown')           # Fill missing values\n    .remove_whitespace()                     # Clean whitespace\n    .remove_duplicates()                     # Remove duplicate rows\n    .standardize_booleans()                  # Convert boolean-like values\n    .extract_email()                         # Extract email addresses\n    .extract_phone_numbers()                 # Extract phone numbers\n    .extract_and_clean_numeric()             # Extract numeric values from strings\n    .drop_single_value_columns()             # Remove columns with only one value\n    .remove_outliers(method='iqr')           # Handle outliers\n    .format_for_display(                     # NEW: Format for presentation\n        rules={\n            'salary': {'type': 'currency', 'symbol': '$', 'decimals': 2},\n            'age': {'type': 'thousands'},\n        },\n        column_case='title'\n    )\n    .to_df()                                 # Return the cleaned, formatted DataFrame\n)\n\nprint(clean_df.head())\n```\n\n---\n\n## Complete API Reference\n\n### Initialization\n\n```python\nimport nullaxe as nlx\n\n# Initialize with any DataFrame\ncleaner = nlx(df)  # Works with pandas or polars DataFrames\n```\n\n### Column Name Standardization\n\nTransform column names to consistent formats:\n\n```python\n# General column cleaning 
with case conversion\n.clean_column_names(case='snake')  # Options: 'snake', 'camel', 'pascal', 'kebab', 'title', 'lower', 'screaming_snake'\n\n# Specific case conversions\n.snakecase()                       # column_name\n.camelcase()                       # columnName\n.pascalcase()                      # ColumnName\n.kebabcase()                       # column-name\n.titlecase()                       # Column Name\n.lowercase()                       # column name\n.screaming_snakecase()             # COLUMN_NAME\n```\n\n### Data Deduplication\n\nRemove duplicate data efficiently:\n\n```python\n.remove_duplicates()               # Remove duplicate rows across all columns\n```\n\n### Missing Data Management\n\nHandle missing values with precision:\n\n```python\n# Fill missing values\n.fill_missing(value=0)                           # Fill all columns with 0\n.fill_missing(value='Unknown', subset=['name'])  # Fill specific columns\n\n# Drop missing values\n.drop_missing()                                  # Drop rows with any missing values\n.drop_missing(how='all')                         # Drop rows where all values are missing\n.drop_missing(thresh=3)                          # Keep rows with at least 3 non-null values\n.drop_missing(axis='columns')                    # Drop columns with missing values\n.drop_missing(subset=['name', 'email'])          # Consider only specific columns\n```\n\n### Text and Whitespace Cleaning\n\nClean and standardize text data:\n\n```python\n.remove_whitespace()                             # Remove leading/trailing whitespace\n.replace_text('old', 'new')                      # Replace text in all columns\n.replace_text('old', 'new', subset=['name'])     # Replace in specific columns\n.remove_punctuation()                            # Remove punctuation marks\n.remove_punctuation(subset=['description'])      # Remove from specific columns\n```\n\n### Column Management\n\nManage DataFrame 
structure:\n\n```python\n.drop_single_value_columns()                     # Remove columns with only one unique value\n.remove_unwanted_rows_and_cols()                 # Remove rows/cols with unwanted values\n.remove_unwanted_rows_and_cols(                  # Custom unwanted values\n    unwanted_values=['', 'N/A', 'NULL']\n)\n```\n\n### Outlier Detection and Handling\n\nSophisticated outlier management:\n\n```python\n# General outlier handling\n.handle_outliers()                               # Default: IQR method, factor=1.5\n.handle_outliers(method='zscore', factor=2.0)    # Z-score method\n.handle_outliers(subset=['salary', 'age'])       # Specific columns only\n\n# Cap outliers (replace with threshold values)\n.cap_outliers()                                  # Cap using IQR method\n.cap_outliers(method='zscore', factor=2.5)       # Cap using Z-score\n\n# Remove outlier rows entirely\n.remove_outliers()                               # Remove rows with outliers\n.remove_outliers(method='iqr', factor=1.5)       # Custom parameters\n```\n\n**Outlier Detection Methods:**\n- **IQR (Interquartile Range)**: `Q1 - factor*IQR` to `Q3 + factor*IQR`\n- **Z-Score**: Values beyond `factor` standard deviations from the mean\n\n### Data Type Standardization\n\nConvert and standardize data types:\n\n```python\n# Boolean standardization\n.standardize_booleans()                          # Convert 'yes/no', 'true/false', etc.\n.standardize_booleans(\n    true_values=['yes', 'y', '1', 'true'],       # Custom true values\n    false_values=['no', 'n', '0', 'false'],     # Custom false values\n    columns=['active', 'verified']              # Specific columns\n)\n```\n\n**Default Boolean Mappings:**\n- **True**: 'true', '1', 't', 'yes', 'y', 'on'\n- **False**: 'false', '0', 'f', 'no', 'n', 'off'\n\n### Advanced Data Extraction\n\nExtract structured data from unstructured text:\n\n```python\n# Email extraction\n.extract_email()                                 # Extract emails from all 
columns\n.extract_email(subset=['contact_info'])          # From specific columns\n\n# Phone number extraction\n.extract_phone_numbers()                         # Extract phone numbers\n.extract_phone_numbers(subset=['contact'])       # From specific columns\n\n# Numeric data extraction and cleaning\n.extract_and_clean_numeric()                     # Extract numbers from text\n.extract_and_clean_numeric(subset=['prices'])    # From specific columns\n\n# Custom regex extraction (interactive)\n.extract_with_regex()                            # Prompts for regex pattern\n.extract_with_regex(subset=['text_column'])      # From specific columns\n\n# Combined numeric cleaning\n.clean_numeric()                                 # Extract + outlier handling\n.clean_numeric(method='zscore', factor=2.0)      # Custom outlier parameters\n```\n\n### Display / Presentation Formatting (NEW in 0.3.0)\n\nFormat cleaned data for reports, dashboards, exports:\n\n```python\n.format_for_display(\n    rules={\n        'price': {'type': 'currency', 'symbol': '$', 'decimals': 2},\n        'growth': {'type': 'percentage', 'decimals': 1},\n        'volume': {'type': 'thousands'},\n        'description': {'type': 'truncate', 'length': 30},\n        'event_date': {'type': 'datetime', 'format': '%B %d, %Y'}\n    },\n    column_case='title'  # or None to preserve original column names\n)\n```\n\nSupported rule types:\n- `currency`: symbol + thousands + decimal precision\n- `percentage`: multiplies by 100 + suffix `%`\n- `thousands`: adds thousands separators, removes trailing `.0` for whole floats\n- `truncate`: shortens long text and appends `...`\n- `datetime`: parses and formats date/time strings\n\nYou can also call the function directly:\n```python\nfrom nullaxe.functions import format_for_display\nformatted = format_for_display(df, rules=..., column_case='title')\n```\n\n### Output\n\n```python\n.to_df()                                         # Return the cleaned 
DataFrame\n```\n\n---\n\n## Advanced Usage Examples\n\n### Real-World Data Cleaning Pipeline\n\n```python\nimport pandas as pd\nimport nullaxe as nlx\n\n# Load messy customer data\ndf = pd.read_csv('messy_customer_data.csv')\n\n# Comprehensive cleaning + formatting pipeline\nclean_customers = (\n    nlx(df)\n    .clean_column_names(case='snake')\n    .fill_missing(value='Not Provided')\n    .remove_whitespace()\n    .standardize_booleans(columns=['is_active', 'newsletter_opt_in'])\n    .extract_email(subset=['contact_info'])\n    .extract_phone_numbers(subset=['contact_info'])\n    .extract_and_clean_numeric(subset=['revenue', 'age'])\n    .remove_outliers(method='iqr', factor=2.0, subset=['revenue'])\n    .drop_single_value_columns()\n    .remove_duplicates()\n    .format_for_display(\n        rules={\n            'revenue': {'type': 'currency', 'symbol': '$', 'decimals': 2},\n            'age': {'type': 'thousands'},\n            'signup_date': {'type': 'datetime', 'format': '%Y-%m-%d'}\n        },\n        column_case='title'\n    )\n    .to_df()\n)\n```\n\n### Financial Data Processing\n\n```python\nfinancial_clean = (\n    nlx(transactions_df)\n    .clean_column_names(case='snake')\n    .fill_missing(value=0, subset=['amount'])\n    .extract_and_clean_numeric(subset=['amount', 'fee'])\n    .standardize_booleans(columns=['is_recurring'])\n    .cap_outliers(method='zscore', factor=3.0, subset=['amount'])\n    .remove_whitespace()\n    .format_for_display(\n        rules={'amount': {'type': 'currency', 'symbol': '$', 'decimals': 2}},\n        column_case='title'\n    )\n    .to_df()\n)\n```\n\n### Survey Data Standardization\n\n```python\nsurvey_clean = (\n    nlx(survey_df)\n    .clean_column_names(case='snake')\n    .standardize_booleans(\n        true_values=['Yes', 'Y', 'Agree', 'True', '1'],\n        false_values=['No', 'N', 'Disagree', 'False', '0']\n    )\n    .fill_missing(value='No Response')\n    .remove_whitespace()\n    .drop_single_value_columns()\n   
 .format_for_display(\n        rules={'age': {'type': 'thousands'}},\n        column_case='title'\n    )\n    .to_df()\n)\n```\n\n---\n\n## Method Chaining Benefits\n\nNullaxe's chainable API provides several advantages:\n\n1. **Readability**: Each step is clear and self-documenting\n2. **Maintainability**: Easy to add, remove, or reorder operations\n3. **Performance**: Optimized internal operations reduce memory overhead\n4. **Flexibility**: Mix and match operations based on your data's needs\n\n```python\n# Traditional approach (verbose and hard to follow)\ndf = remove_duplicates(df)\ndf = fill_missing(df, value='Unknown')\ndf = standardize_booleans(df)\ndf = remove_outliers(df, method='iqr')\n\n# Nullaxe approach (clean and readable)\ndf = (nlx(df)\n      .remove_duplicates()\n      .fill_missing(value='Unknown')\n      .standardize_booleans()\n      .remove_outliers(method='iqr')\n      .format_for_display(rules={'value': {'type': 'currency'}}, column_case='title')\n      .to_df())\n```\n\n---\n\n## Performance Tips\n\n1. **Use polars for large datasets** - Nullaxe automatically optimizes for polars' performance\n2. **Chain operations efficiently** - Nullaxe minimizes intermediate copies\n3. **Specify subsets** - Process only the columns you need\n4. 
**Choose appropriate outlier methods** - IQR is faster, Z-score is more sensitive\n\n```python\n# Performance-optimized pipeline\nresult = (\n    nlx(large_df)\n    .remove_duplicates()\n    .drop_single_value_columns()\n    .fill_missing(value=0, subset=['numeric_cols'])\n    .remove_outliers(method='iqr', subset=['revenue'])\n    .format_for_display(rules={'revenue': {'type': 'currency'}}, column_case=None)\n    .to_df()\n)\n```\n\n---\n\n## Testing and Quality Assurance\n\nNullaxe includes comprehensive test coverage with 118+ test cases covering:\n\n- pandas and polars compatibility\n- Edge cases and error handling\n- Performance optimization\n- Data integrity preservation\n- Type safety and validation\n- Presentation formatting (currency, percentage, thousands, truncation, datetime, column casing)\n\nRun tests locally:\n```bash\ngit clone https://github.com/johntocci/nullaxe\ncd nullaxe\npip install -e .[dev]\npytest tests/\n```\n\n---\n\n## Contributing\n\nWe welcome contributions! Nullaxe is designed to be extensible and community-driven.\n\n### How to Contribute\n\n1. **Fork the repository** on GitHub\n2. **Create a feature branch**: `git checkout -b feature/amazing-feature`\n3. **Add your changes** with comprehensive tests\n4. **Follow the coding standards** (black formatting, type hints)\n5. **Run the test suite**: `pytest tests/`\n6. **Submit a pull request** with a clear description\n\n### Development Setup\n\n```bash\n# Clone and setup development environment\ngit clone https://github.com/johntocci/nullaxe\ncd nullaxe\npip install -e .[dev]\n\n# Run tests\npytest tests/\n\n# Format code\nblack src/ tests/\n```\n\n### Adding New Functions\n\nNullaxe's modular architecture makes it easy to add new cleaning functions:\n\n1. Create your function in `src/nullaxe/functions/`\n2. Add it to the imports in `src/nullaxe/functions/__init__.py`\n3. Add a corresponding method to the `Nullaxe` class\n4. 
Write comprehensive tests in `tests/`\n\n---\n\n## Changelog\n\n- Migration from `sanex` (the project's previous name): replace `import sanex as sx` with `import nullaxe as nlx` and `sx(` with `nlx(`\n\n### Version 0.3.0\n- Added `format_for_display` function + chain method for presentation formatting\n- Added support for currency, percentage, thousands, truncate, datetime formatting\n- Title-case header option integrated into formatting step\n- Refactored internal formatting for pandas + polars parity\n- Expanded test suite (now 118+ tests) including display formatting\n- Improved thousands formatting (no trailing .0 on whole floats)\n\n### Version 0.2.0\n- Added comprehensive data extraction capabilities\n- Enhanced outlier detection with multiple methods\n- Improved text processing and punctuation removal\n- Fixed boolean standardization edge cases\n- Resolved missing data handling in complex workflows\n- Performance optimizations for large datasets\n- Comprehensive documentation updates\n\n### Version 0.1.0\n- Initial release with core cleaning functionality\n- Chainable API implementation\n- pandas and polars support\n\n---\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n---\n\n## Acknowledgments\n\n- Built with love for the data science community\n- Inspired by the need for simple, powerful data cleaning tools\n- Thanks to all contributors and users who help improve Nullaxe\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\n**Made with love by [John Tocci](https://github.com/johntocci)**\n\n[Star us on GitHub](https://github.com/johntocci/nullaxe) | [Report Issues](https://github.com/johntocci/nullaxe/issues) | [Request 
Features](https://github.com/johntocci/nullaxe/issues)\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjohntocci%2Fnullaxe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjohntocci%2Fnullaxe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjohntocci%2Fnullaxe/lists"}