{"id":38565034,"url":"https://github.com/aborruso/csvnorm","last_synced_at":"2026-02-08T09:20:14.324Z","repository":{"id":332832700,"uuid":"957861461","full_name":"aborruso/csvnorm","owner":"aborruso","description":"A Python CLI tool for validating and normalizing CSV files","archived":false,"fork":false,"pushed_at":"2026-01-20T19:03:40.000Z","size":880,"stargazers_count":1,"open_issues_count":1,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-21T01:57:38.224Z","etag":null,"topics":["cli","csv","data-validation","duckdb","etl","normalization","python"],"latest_commit_sha":null,"homepage":"https://aborruso.github.io/csvnorm/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aborruso.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-03-31T09:00:07.000Z","updated_at":"2026-01-20T19:03:45.000Z","dependencies_parsed_at":null,"dependency_job_id":"5ed056e1-401b-43f1-8aaa-8836dce1e363","html_url":"https://github.com/aborruso/csvnorm","commit_stats":null,"previous_names":["aborruso/prepare_data","aborruso/csvnorm"],"tags_count":36,"template":false,"template_full_name":null,"purl":"pkg:github/aborruso/csvnorm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborruso%2Fcsvnorm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborruso%2Fcsvnorm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborruso%2Fcsvnorm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborruso%2Fcsvnorm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aborruso","download_url":"https://codeload.github.com/aborruso/csvnorm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborruso%2Fcsvnorm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28990697,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-01T20:57:35.821Z","status":"ssl_error","status_checked_at":"2026-02-01T20:57:29.580Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","csv","data-validation","duckdb","etl","normalization","python"],"created_at":"2026-01-17T07:52:58.399Z","updated_at":"2026-02-08T09:20:14.316Z","avatar_url":"https://github.com/aborruso.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![PyPI version](https://badge.fury.io/py/csvnorm.svg)](https://pypi.org/project/csvnorm/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)\n[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/aborruso/csvnorm)\n\n# csvnorm\n\nA command-line utility to validate and normalize CSV files for initial exploration.\n\n## Version 1.0 Breaking Change\n\n**If upgrading from v0.x:** The default output has changed from file to stdout for better Unix composability.\n\n```bash\n# v0.x behavior\ncsvnorm data.csv              # Created data.csv in current directory\n\n# v1.0 behavior (NEW)\ncsvnorm data.csv              # Outputs to stdout\ncsvnorm data.csv -o data.csv  # Explicitly save to file\ncsvnorm data.csv \u003e data.csv   # Or use shell redirect\n```\n\nThis follows the Unix philosophy and matches tools like `jq`, `csvkit`, and `xsv`.\n\n## Installation\n\nRecommended (uv):\n\n```bash\nuv tool install csvnorm\n```\n\nOr with pip:\n\n```bash\npip install csvnorm\n```\n\n## Purpose\n\nThis tool prepares CSV files for **basic exploratory data analysis (EDA)**, not for complex transformations. It focuses on achieving a clean, standardized baseline format that allows you to quickly assess data quality and structure before designing more sophisticated ETL pipelines.\n\n**What it does:**\n- Validates CSV structure and reports errors\n- Normalizes encoding to UTF-8 when needed\n- Normalizes delimiters and field names\n- Creates a consistent starting point for data exploration\n\n**What it doesn't do:**\n- Complex data transformations or business logic\n- Type inference or data validation beyond structure\n- Heavy processing or aggregations\n\n## Features\n\n- **CSV Validation**: Checks for common CSV errors and inconsistencies using DuckDB\n- **Delimiter Normalization**: Converts all field separators to standard commas (`,`)\n- **Field Name Normalization**: Converts column headers to snake_case format\n- **Encoding Normalization**: Auto-detects encoding and converts to UTF-8 when needed (ASCII is already UTF-8 compatible)\n- **Processing Summary**: Displays comprehensive statistics (rows, columns, file sizes) and error details\n- **Error Reporting**: Exports detailed error file for invalid rows with summary panel\n- **Remote URL Support**: Process CSV files directly from HTTP/HTTPS URLs without downloading (unless `--fix-mojibake` is used)\n\n## Usage\n\n```bash\ncsvnorm input.csv [options]\ncsvnorm -                    # read from stdin\n```\n\n**By default, csvnorm writes to stdout** for easy piping and composability with other Unix tools. Use `-` as input to read from stdin.\n\n### Options\n\n| Option | Description |\n|--------|-------------|\n| `-o, --output-file PATH` | Write to file instead of stdout |\n| `-f, --force` | Force overwrite of existing output file (when `-o` is specified) |\n| `-k, --keep-names` | Keep original column names (disable snake_case) |\n| `-d, --delimiter CHAR` | Set custom output delimiter (default: `,`) |\n| `-s, --skip-rows N` | Skip first N rows of input file (useful for metadata/comments) |\n| `--fix-mojibake [N]` | Fix mojibake using ftfy (optional sample size `N`; use `0` to force repair) |\n| `--strict` | Exit with error code 1 if any validation errors occur (fail-fast mode) |\n| `--check` | Validate CSV without processing or normalizing (exit code 0=valid, 1=invalid) |\n| `--download-remote` | Download remote CSV locally before processing (needed for remote .zip/.gz) |\n| `-V, --verbose` | Enable verbose output for debugging |\n| `-v, --version` | Show version number |\n| `-h, --help` | Show help message |\n\n### Examples\n\n```bash\n# Default: output to stdout\ncsvnorm data.csv\n\n# Read from stdin\ncat data.csv | csvnorm -\ncurl -s https://example.com/data.csv | csvnorm - -o clean.csv\ncsvnorm - --check \u003c data.csv\n\n# Preview first rows\ncsvnorm data.csv | head -20\n\n# Pipe to other tools\ncsvnorm data.csv | csvcut -c name,age | csvstat\n\n# Save to file\ncsvnorm data.csv -o output.csv\n\n# Shell redirect\ncsvnorm data.csv \u003e output.csv\n\n# Process remote CSV from URL\ncsvnorm \"https://raw.githubusercontent.com/aborruso/csvnorm/refs/heads/main/test/Trasporto%20Pubblico%20Locale%20Settore%20Pubblico%20Allargato%20-%20Indicatore%202000-2020%20Trasferimenti%20Correnti%20su%20Entrate%20Correnti.csv\" -o output.csv\n\n# Process remote compressed CSV (download first, then handle gzip/zip locally)\ncsvnorm \"https://example.com/data.csv.gz\" --download-remote -o output.csv\n\n# Custom delimiter\ncsvnorm data.csv -d ';' -o output.csv\n\n# Keep original headers\ncsvnorm data.csv --keep-names -o output.csv\n\n# Skip first 2 rows (metadata or comments)\ncsvnorm data.csv --skip-rows 2 -o output.csv\n\n# Force overwrite with verbose output\ncsvnorm data.csv -f -V -o processed.csv\n\n# Fix mojibake using ftfy (default sample size)\ncsvnorm data.csv --fix-mojibake -o fixed.csv\n\n# Fix mojibake with custom sample size\ncsvnorm data.csv --fix-mojibake 4000 -o fixed.csv\n\n# Force mojibake repair even with low badness score\ncsvnorm data.csv --fix-mojibake 0 -o fixed.csv\n\n# Fail-fast mode: exit with error if validation errors occur\ncsvnorm data.csv --strict \u003e output.csv || echo \"Validation failed!\"\n\n# Use in pipelines where data quality is critical\ncsvnorm remote_data.csv --strict | other_tool || handle_error\n\n# Quick validation check (no processing or output)\ncsvnorm data.csv --check \u0026\u0026 echo \"Valid CSV\" || echo \"Invalid CSV\"\n\n# Check remote CSV for validity\ncsvnorm https://example.com/data.csv --check\n\n# Use in CI/CD pipelines for validation\ncsvnorm raw_data.csv --check || exit 1\n```\n\n### Output\n\n**Default behavior (stdout):**\n- Writes normalized CSV to stdout\n- Progress and errors go to stderr\n- Validation errors (if any) are shown to stderr **before** the output data\n- Reject file saved to `./reject_errors.csv` in current working directory\n- Perfect for piping to other tools or shell redirection\n\n**File output (with `-o`):**\n- Creates a normalized CSV file at the specified path with:\n  - UTF-8 encoding\n  - Consistent field delimiters\n  - Normalized column names (unless `--keep-names` is specified)\n- Error report if any invalid rows are found (saved as `{output_name}_reject_errors.csv` in the same directory)\n- Shows success table with statistics (rows, columns, file sizes)\n- Supports absolute and relative paths\n- Any file extension is allowed (not limited to `.csv`)\n\n**Input file protection:**\n- csvnorm will **never** overwrite the input file, even with `--force`\n- If you try to use the same path for input and output, you'll get an error\n- Use `-o` to specify a different output path\n\n**Remote URLs:**\n- Encoding is handled automatically by DuckDB\n- If `--fix-mojibake` is enabled, the URL is downloaded to a temp file first\n\n**Mojibake repair (`--fix-mojibake [N]`):**\n- Mojibake is garbled text produced by decoding bytes with the wrong character encoding (e.g., `CittÃ ` instead of `Città`).\n- Enables optional mojibake repair using ftfy (for already-misdecoded text).\n- `N` is the sample size (number of characters) used by the detector; default is 5000.\n- The repair runs only when ftfy's badness heuristic flags the sample as \"bad.\"\n- Use `N=0` to force repair without detection (useful for files with low badness scores but visible mojibake).\n- **Note**: ftfy cannot recover bytes that were irreversibly lost in the original encoding. Replacement characters (`�`) may remain where data was corrupted beyond repair.\n- HTTP timeout is set to 30 seconds\n- Only public URLs are supported (no authentication)\n\nThe tool provides modern terminal output (shown only when using `-o` to write to a file) with:\n- Progress indicators for multi-step processing\n- Color-coded error messages with panels\n- Success summary table with statistics (rows, columns, file sizes)\n- Encoding conversion status (converted/no conversion/remote; ASCII is already UTF-8 compatible)\n- Error summary panel with reject count and error types when validation fails\n- ASCII art banner with `--version` and `-V` verbose mode\n\n**Success Example:** (shown only when using `-o`)\n```\n ✓ Success\n Input:        test/utf8_basic.csv\n Output:       output/utf8_basic.csv\n Encoding:     ascii (ASCII is UTF-8 compatible; no conversion needed)\n Rows:         2\n Columns:      3\n Input size:   42 B\n Output size:  43 B\n Headers:      normalized to snake_case\n```\n\n**Error Example:** (shown only when using `-o`)\n```\n ✓ Success\n Input:        test/malformed_rows.csv\n Output:       output/malformed_rows.csv\n Encoding:     ascii (ASCII is UTF-8 compatible; no conversion needed)\n Rows:         1\n Columns:      4\n Input size:   24 B\n Output size:  40 B\n Headers:      normalized to snake_case\n\n╭──────────────────────────── ! Validation Failed ─────────────────────────────╮\n│ Validation Errors:                                                           │\n│                                                                              │\n│ Rejected rows: 2                                                             │\n│                                                                              │\n│ Error types:                                                                 │\n│   • Expected Number of Columns: 3 Found: 2                                   │\n│   • Expected Number of Columns: 3 Found: 4                                   │\n│                                                                              │\n│ Details: output/malformed_rows_reject_errors.csv                             │\n╰──────────────────────────────────────────────────────────────────────────────╯\n```\n\n### Exit Codes\n\n| Code | Meaning |\n|------|---------|\n| 0 | Success |\n| 1 | Error (validation failed, file not found, etc.) |\n\n## Requirements\n\n- Python 3.9+\n- Dependencies (automatically installed):\n  - `charset-normalizer\u003e=3.0.0` - Encoding detection\n  - `duckdb\u003e=0.9.0` - CSV validation and normalization\n  - `ftfy\u003e=6.3.1` - Mojibake repair\n  - `rich\u003e=13.0.0` - Modern terminal output formatting\n  - `rich-argparse\u003e=1.0.0` - Enhanced CLI help formatting\n\nOptional extras:\n- `[dev]` - Development dependencies (`pytest\u003e=7.0.0`, `pytest-cov\u003e=4.0.0`, `ruff\u003e=0.1.0`)\n\n## Development\n\n### Setup\n\n```bash\ngit clone https://github.com/aborruso/csvnorm\ncd csvnorm\n\n# Create and activate venv with uv (recommended)\nuv venv\nsource .venv/bin/activate\nuv pip install -e \".[dev]\"\n\n# Or with pip\npip install -e \".[dev]\"\n```\n\n### Testing\n\n```bash\npytest tests/ -v\n```\n\n### Project Structure\n\n```\ncsvnorm/\n├── src/csvnorm/\n│   ├── __init__.py      # Package version\n│   ├── __main__.py      # python -m support\n│   ├── cli.py           # CLI argument parsing\n│   ├── core.py          # Main processing pipeline\n│   ├── encoding.py      # Encoding detection/conversion\n│   ├── validation.py    # DuckDB validation\n│   └── utils.py         # Helper functions\n├── tests/               # Test suite\n├── test/                # CSV fixtures\n└── pyproject.toml       # Package configuration\n```\n\n## Stay Updated\n\n### Get notified of new releases\n**Watch → Custom → ✓ Releases** to receive notifications for all new versions.\n\n### Get notified of breaking changes only\n**[Subscribe to Announcements](https://github.com/aborruso/csvnorm/discussions/categories/announcements)** to be notified only about:\n- Breaking changes (major version bumps)\n- Security updates\n- Important deprecation notices\n\nWe follow [Semantic Versioning](https://semver.org/):\n- **MAJOR** (e.g., 1.0.0 → 2.0.0): Breaking changes\n- **MINOR** (e.g., 1.0.0 → 1.1.0): New features, backward compatible\n- **PATCH** (e.g., 1.0.0 → 1.0.1): Bug fixes only\n\nSee [docs/COMMUNICATION.md](docs/COMMUNICATION.md) for details.\n\n## License\n\nMIT License (c) 2026 aborruso@gmail.com - See LICENSE file for details\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faborruso%2Fcsvnorm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faborruso%2Fcsvnorm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faborruso%2Fcsvnorm/lists"}