{"id":33926325,"url":"https://github.com/softwaresalt/csv-managed","last_synced_at":"2025-12-12T10:10:28.876Z","repository":{"id":319431034,"uuid":"1078042702","full_name":"softwaresalt/csv-managed","owner":"softwaresalt","description":"csv-managed is a Rust command-line utility for high‑performance exploration and transformation of CSV data at scale, emphasizing streaming, typed operations, and reproducible workflows via schema and index files.","archived":false,"fork":false,"pushed_at":"2025-12-04T03:37:02.000Z","size":716,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-06T08:55:23.157Z","etag":null,"topics":["big-data","cli-app","data-cleansing","data-engineering","data-standardization","data-transformation","data-wrangling","high-performance","ml-engineering"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/softwaresalt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-17T05:58:16.000Z","updated_at":"2025-11-28T11:56:46.000Z","dependencies_parsed_at":null,"dependency_job_id":"2b0c9874-f0f9-4fe4-813a-f57453e56f3b","html_url":"https://github.com/softwaresalt/csv-managed","commit_stats":null,"previous_names":["softwaresalt/csv-managed"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/softwaresalt/csv-managed","repository_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/softwaresalt%2Fcsv-managed","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/softwaresalt%2Fcsv-managed/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/softwaresalt%2Fcsv-managed/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/softwaresalt%2Fcsv-managed/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/softwaresalt","download_url":"https://codeload.github.com/softwaresalt/csv-managed/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/softwaresalt%2Fcsv-managed/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27680584,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-12T02:00:06.775Z","response_time":129,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","cli-app","data-cleansing","data-engineering","data-standardization","data-transformation","data-wrangling","high-performance","ml-engineering"],"created_at":"2025-12-12T10:10:28.200Z","updated_at":"2025-12-12T10:10:28.870Z","avatar_url":"https://github.com/softwaresalt.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# csv-managed\n\n`csv-managed` is a high‑performance Rust CLI for exploring, validating, transforming, and indexing 
very large CSV/TSV (and future delimited) datasets using streaming, typed schemas, and multi‑variant indexes.\n\n## Feature Matrix (Concise)\n\n| Area | Highlights |\n|------|-----------|\n| Delimiters \u0026 Encodings | Comma/tab/pipe/semicolon/custom; independent input/output encoding; stdin/stdout streaming |\n| Schema Discovery | Sample or full scan inference; diff, overrides, placeholder normalization, snapshots |\n| Header Detection | Automatic header/headerless with synthetic `field_#`; force via `--assume-header` |\n| Datatype Transformations | Ordered `datatype_mappings` chains (parse, round, trim, case) before final typing |\n| Decimal \u0026 Currency | Fixed `decimal(p,s)` (≤28 precision) and currency scale (2 or 4) enforcement |\n| Indexing \u0026 Sorting | Multi-variant B-Tree index; longest matching prefix acceleration; covering expansion |\n| Filtering \u0026 Derivation | Typed comparisons + Evalexpr expressions; temporal helpers; positional aliases |\n| Verification | Streaming per-cell type enforcement; tiered invalid reporting |\n| Statistics \u0026 Frequency | Numeric + temporal metrics; distinct counts with `--frequency` / `--top` |\n| Append \u0026 Pipelines | Multi-file union with schema consistency; efficient chained stdin workflows |\n| Boolean \u0026 Table Output | Configurable boolean formats; elastic preview/table rendering |\n| Snapshots | Layout/inference regression guard (`--snapshot`) |\n| Error \u0026 Logging | Contextual failures; debug logging for inference/index/mappings |\n\nExtended details moved to dedicated docs (see Documentation Map).\n\n---\n\n## Global Documentation Table of Contents\n\n### Core (This README)\n\n1. [Feature Matrix](#feature-matrix-concise)\n2. [Global Documentation TOC](#global-documentation-table-of-contents)\n3. [Quick Start](#quick-start)\n4. [Installation](#installation)\n5. [Core Concepts (Brief)](#core-concepts-brief)\n6. [Datatypes](#datatypes-supported)\n7. 
[Expressions Overview](#expressions--derived-logic-overview)\n8. [Indexes \u0026 Sorting Overview](#indexes--sorting-overview)\n9. [Streaming \u0026 Pipelines Overview](#streaming--pipelines-overview)\n10. [Command Guide](#command-guide-summary)\n11. [Advanced Topics](#advanced-topics)\n12. [Roadmap](#roadmap)\n13. [Contributing](#contributing)\n14. [License](#license)\n15. [Support](#support)\n\n### Deep Dives (Docs Directory)\n\n1. [Schema Inference Internals](docs/schema-inference.md)\n2. [Schema Command Examples](docs/schema-examples.md)\n3. [Datatype Mappings Deep Dive](docs/datatype-mappings.md)\n4. [Statistics \u0026 Frequency Deep Dive](docs/stats.md)\n5. [Header Detection \u0026 FAQ](docs/header-detection.md)\n6. [Naming Conventions](docs/naming-conventions.md)\n7. [Snapshots vs Verification](docs/snapshots-and-verification.md)\n8. [Expressions Reference \u0026 Extended Examples](docs/expressions.md)\n9. [Indexing \u0026 Sorting Guide](docs/indexing-and-sorting.md)\n10. [Pipelines \u0026 Multi-Stage Patterns](docs/pipelines.md)\n11. [Encoding Normalization](docs/encoding-normalization.md)\n12. [Boolean Formatting \u0026 Table Output](docs/boolean-formatting.md)\n13. [Operational Notes (Perf / Errors / Logging / Testing)](docs/operations.md)\n14. 
[CLI Help Reference](docs/cli-help.md)\n\n### Quick Cross-Reference\n\n| Capability | Primary Doc |\n|------------|-------------|\n| Inference algorithm details | [schema-inference](docs/schema-inference.md) |\n| Placeholder / NA handling | [schema-inference](docs/schema-inference.md), [schema-examples](docs/schema-examples.md) |\n| Decimal \u0026 Currency rules | [schema-inference](docs/schema-inference.md), [datatype-mappings](docs/datatype-mappings.md) |\n| Mapping strategies \u0026 strategies matrix | [datatype-mappings](docs/datatype-mappings.md) |\n| Overrides vs mappings vs replacements | [schema-examples](docs/schema-examples.md), [datatype-mappings](docs/datatype-mappings.md) |\n| Header detection heuristic | [header-detection](docs/header-detection.md) |\n| Naming / snake_case rationale | [naming-conventions](docs/naming-conventions.md) |\n| Snapshot vs verify comparison | [snapshots-and-verification](docs/snapshots-and-verification.md) |\n| Invalid reporting tiers | [snapshots-and-verification](docs/snapshots-and-verification.md) |\n| Index variant design \u0026 covering | [indexing-and-sorting](docs/indexing-and-sorting.md) |\n| Streaming pipeline safety (header shape) | [pipelines](docs/pipelines.md) |\n| Encoding normalization patterns | [encoding-normalization](docs/encoding-normalization.md) |\n| Boolean output modes | [boolean-formatting](docs/boolean-formatting.md) |\n| Statistics aggregation \u0026 frequency counting | [stats](docs/stats.md) |\n| Expressions functions, quoting, bucketing | [expressions](docs/expressions.md) |\n| Performance \u0026 logging guidance | [operations](docs/operations.md) |\n| CLI option reference | [cli-help](docs/cli-help.md) |\n\n\u003e Use this TOC as a hub: internal anchors for quick orientation; deep dives for authoritative detail.\n\n---\n\n## Documentation Map\n\n| Topic | Doc |\n|-------|-----|\n| Expressions (full reference \u0026 examples) | [expressions](docs/expressions.md) |\n| Indexing \u0026 Sorting 
internals | [indexing-and-sorting](docs/indexing-and-sorting.md) |\n| Multi-stage pipelines \u0026 header shape rules | [pipelines](docs/pipelines.md) |\n| Schema inference internals | [schema-inference](docs/schema-inference.md) |\n| Schema command usage examples | [schema-examples](docs/schema-examples.md) |\n| Header detection algorithm \u0026 FAQ | [header-detection](docs/header-detection.md) |\n| Naming conventions (snake_case rationale) | [naming-conventions](docs/naming-conventions.md) |\n| Snapshots vs verification + reporting tiers | [snapshots-and-verification](docs/snapshots-and-verification.md) |\n| Boolean formatting \u0026 table output | [boolean-formatting](docs/boolean-formatting.md) |\n| Encoding normalization pipelines | [encoding-normalization](docs/encoding-normalization.md) |\n| Datatype mappings \u0026 transformation strategies | [datatype-mappings](docs/datatype-mappings.md) |\n| Statistics \u0026 frequency metrics | [stats](docs/stats.md) |\n| Operational notes (performance, errors, logging, testing) | [operations](docs/operations.md) |\n| CLI flag reference (captured help output) | [cli-help](docs/cli-help.md) |\n\nRoadmap/backlog: see the [roadmap](.plan/backlog.md).\n\n---\n\n## Quick Start\n\n```powershell\n# 1. Infer schema\n./target/release/csv-managed.exe schema infer -i ./data/orders.csv -o ./data/orders-schema.yml --sample-rows 0\n# 2. Build indexes\n./target/release/csv-managed.exe index -i ./data/orders.csv -o ./data/orders.idx --spec default=order_date:asc,customer_id:asc --spec recent=order_date:desc -m ./data/orders-schema.yml\n# 3. Typed processing (filters / derives / sort)\n./target/release/csv-managed.exe process -i ./data/orders.csv -m ./data/orders-schema.yml -x ./data/orders.idx --index-variant default --sort order_date:asc,customer_id:asc --filter \"status = shipped\" --derive 'total_with_tax=amount*1.0825' --row-numbers -o ./data/orders_filtered.csv\n# 4. 
Stats (numeric \u0026 temporal)\n./target/release/csv-managed.exe stats -i ./data/orders.csv -m ./data/orders-schema.yml\n# 5. Frequency counts\n./target/release/csv-managed.exe stats -i ./data/orders.csv -m ./data/orders-schema.yml --frequency --top 10\n# 6. Preview (no file output allowed with --preview)\n./target/release/csv-managed.exe process -i ./data/orders.csv --preview --limit 15\n```\n\n\u003e See extended examples in collapsible sections throughout this README.\n\n---\n\n## Installation\n\n```bash\ncargo build --release\n```\n\nBinary (Windows): `target\\release\\csv-managed.exe`\n\nFrom crates.io:\n\n```bash\ncargo install csv-managed\n```\n\nLocal path dev install:\n\n```bash\ncargo install --path .\n```\n\nHelper command (wraps `cargo install`):\n\n```powershell\n./target/release/csv-managed.exe install --locked\n```\n\nEnvironment logging examples:\n\n```powershell\n$env:RUST_LOG='info'\n```\n\n```batch\nset RUST_LOG=info\n```\n\n---\n\n## Core Concepts (Brief)\n\nSchemas declare column order, types, optional renames, mapping chains, and replacements. Per-cell flow: raw → mappings → replacements → final parse. See `docs/schema-inference.md` and `docs/schema-examples.md`.\n\nHeader detection, naming guidance, and FAQ: `docs/header-detection.md`, `docs/naming-conventions.md`.\n\nOverrides vs mappings vs replacements decision table: `docs/schema-examples.md`.\n\nVerification tiers + snapshot comparison: `docs/snapshots-and-verification.md`.\n\n### Snapshot Internals (Deep Dive)\n\nSnapshot includes: header+type hash (SHA-256), textual inference table, observation summaries. Hash changes on any header reorder or type change. 
Regenerate intentionally after approved inference adjustments.\n\n---\n\n## Datatypes (Supported)\n\n| Type | Examples | Notes |\n|------|----------|-------|\n| String | any UTF‑8 | Post-mapping names usable in expressions |\n| Integer | `42`, `-7` | 64-bit signed |\n| Float | `3.14`, `2` | f64 (integers accepted) |\n| Boolean | `true/false`, `yes/no`, `1/0` | Input variants normalized; output format selectable |\n| Date | `2024-08-01`, `08/01/2024` | Canonical `YYYY-MM-DD` |\n| DateTime | `2024-08-01T13:45:00` | Naive (no TZ) |\n| Time | `06:00:00`, `14:30` | Canonical `HH:MM:SS` |\n| Currency | `$12.34`, `123.4567` | Enforce 2 or 4 scale; symbol stripped |\n| Decimal | `123.4567`, `(1,234.50)` | Fixed precision/scale ≤28 |\n| Guid | RFC 4122 hyphenated or 32hex | Case-insensitive |\n\n---\n\n## Expressions \u0026 Derived Logic (Overview)\n\nDerived columns: `--derive name=expr`  •  Filters: `--filter`, `--filter-expr`  •  Positional aliases: `c0, c1, ...`  •  `row_number` when `--row-numbers` enabled.\n\n### Quick Cheat Sheets\n\n| Pattern | Example | Description |\n|---------|---------|-------------|\n| Arithmetic | `total_with_tax=amount*1.0825` | Multiply numeric column |\n| Conditional flag | `high_value=if(amount\u003e1000,1,0)` | 1/0 indicator |\n| Date diff | `ship_lag=date_diff_days(shipped_at,ordered_at)` | Days between dates |\n| Time diff | `window=time_diff_seconds(end_time,start_time)` | Seconds difference |\n| Concat | `channel_tag=concat(channel,\"-\",region)` | Combine strings |\n| Guid passthrough | `id_copy=id` | Duplicate column |\n| Row number | `row_index=row_number` | Sequential index |\n\n| Aspect | `--filter` | `--filter-expr` |\n|--------|-----------|-----------------|\n| Operators | Basic typed comparisons | Full Evalexpr syntax |\n| Logic | AND via repetition | AND/OR, nested `if` |\n| Temporal helpers | Direct typed compare | `date_diff_days`, etc. 
|\n| Complexity | Concise | Arbitrary expression |\n\n### Full Expression Reference\n\n**Temporal Helpers**: `date_add`, `date_sub`, `date_diff_days`, `date_format`, `datetime_add_seconds`, `datetime_diff_seconds`, `datetime_format`, `datetime_to_date`, `datetime_to_time`, `time_add_seconds`, `time_diff_seconds`.\n\n**Pitfalls**:\n\n* PowerShell quoting: wrap the whole expression in single quotes and internal literals in double quotes.\n* `c0` is the first column (0-based). Verify mapping order.\n* `row_number` exists only if `--row-numbers` is set.\n* Use helpers, not raw string comparisons, for temporal correctness.\n* Mapping chains precede replacements, which precede the final parse; expressions see normalized values.\n\n**Function Index (alphabetical)**: `concat`, `date_add`, `date_diff_days`, `date_format`, `date_sub`, `datetime_add_seconds`, `datetime_diff_seconds`, `datetime_format`, `datetime_to_date`, `datetime_to_time`, `if`, `time_add_seconds`, `time_diff_seconds`.\n\n**Debugging**: Increase logging with `RUST_LOG=csv_managed=debug`. Future deep expression tracing may emit `expr:` prefixed debug lines.\n\n---\n\n## Indexes \u0026 Sorting (Overview)\n\nIndexes store byte offsets keyed by concatenated column values. A single `.idx` contains multiple named variants (different column sequences and directions). `process` chooses the variant with the longest matching prefix for a requested `--sort` unless `--index-variant` pins a specific one.\n\n**Building**:\n\n```powershell\n./target/release/csv-managed.exe index -i ./data/orders.csv -o ./data/orders.idx `\n  --spec default=order_date:asc,customer_id:asc `\n  --spec recent=order_date:desc -m ./data/orders-schema.yml\n```\n\n**Covering** (`--covering`): Generate systematic direction/prefix permutations from a concise pattern (e.g. 
`geo=date:asc|desc,customer:asc`).\n\nFallback: When no index variant matches the entire sort signature, an in-memory stable multi-column sort executes (transforms before and after the sort still stream where possible).\n\n---\n\n## Streaming \u0026 Pipelines (Overview)\n\nUse `-i -` to read from stdin; a schema is strongly recommended for typed semantics. Each stage must explicitly declare stdin usage. Avoid header shape changes between typed stages unless you also provide a matching updated schema.\n\nCore guidelines:\n\n* Keep early projections narrow.\n* Apply filters before sorting or heavy derives.\n* Normalize encodings up front (`--input-encoding` / `--output-encoding`).\n* Use `--preview --limit` for fast inspection; remove before chaining downstream.\n\n### Extended Pipeline Examples \u0026 Troubleshooting\n\n**Filter then stats**:\n\n```powershell\nGet-Content .\\tests\\data\\big_5_players_stats_2023_2024.csv |\n  .\\target\\release\\csv-managed.exe process -i - --schema .\\tests\\data\\big_5_players_stats-schema.yml `\n  --filter \"Performance_Gls \u003e= 10\" --limit 40 |\n  .\\target\\release\\csv-managed.exe stats -i - --schema .\\tests\\data\\big_5_players_stats-schema.yml -C Performance_Gls\n```\n\n**Append with one streamed input**:\n\n```powershell\nGet-Content .\\tests\\data\\big_5_players_stats_2023_2024.csv |\n  .\\target\\release\\csv-managed.exe append -i - -i .\\tmp\\big_5_preview.csv `\n  --schema .\\tests\\data\\big_5_players_stats-schema.yml -o .\\tmp\\players_union.csv\n```\n\n**Encoding normalization**:\n\n```powershell\nGet-Content .\\tmp\\big_5_windows1252.csv |\n  .\\target\\release\\csv-managed.exe process -i - --input-encoding windows-1252 `\n  --schema .\\tests\\data\\big_5_players_stats-schema.yml --columns Player --columns Squad --limit 5 --table\n```\n\n**Troubleshooting**:\n\n| Symptom | Cause | Fix |\n|---------|-------|-----|\n| Hang | Upstream not producing | Add `--preview --limit` to inspect |\n| Column not found | 
Rename/mapping changed | Re-check `schema columns` / header output |\n| Zero stats rows | Filters excluded all rows | Relax/remove filters |\n| Invalid datatype downstream | Schema mismatch | Supply correct schema per stage |\n\n---\n\n## Command Guide (Summary)\n\nConcise flag references; see concept sections for deep behavior.\n\n### schema\n\nProbe, infer, verify, list columns, diff and snapshot inference output.\n\n| Sub/Flag | Summary |\n|----------|---------|\n| `probe` | Inference preview table (no file) |\n| `infer` | Inference + optional write (`-o`) + diff/snapshot integration |\n| `verify` | Streaming type \u0026 replacement validation |\n| `columns` | Tabular listing of schema columns |\n| `--snapshot` | Layout regression guard |\n| `--diff \u003cschema\u003e` | Unified diff vs existing schema |\n| `--assume-header` | Override header detection |\n| `--mapping` | Emit mapping scaffold \u0026 snake_case suggestions |\n| `--replace-template` | Inject empty `replace` arrays |\n\n### process\n\nTransform \u0026 emit rows: filtering, derives, column selection, sorting (indexed or fallback), boolean formatting, row numbering, preview/table output.\n\n### stats\n\nNumeric \u0026 temporal summary metrics; `--frequency` for distinct counts; filter integration.\n\n### append\n\nConcatenate multiple CSV inputs enforcing header/schema consistency.\n\n### index\n\nBuild multi-variant B-tree index files (`--spec`, `--covering`) for accelerated sort alignment.\n\n### install\n\nWrapper around `cargo install csv-managed` (version / force / locked / root flags).\n\n### schema columns\n\nList schema-declared columns and datatypes (resolves renames).\n\n---\n\n## Advanced Topics\n\n### Performance Considerations\n\n* Indexed sort avoids retaining all rows in memory.\n* Early filtering diminishes downstream CPU \u0026 sort footprint.\n* Median requires buffering column values; limit wide median usage on huge datasets.\n* Decimal \u0026 currency parsing add overhead—declare 
only where needed.\n\n### Error Handling\n\n`anyhow` contexts annotate origin (I/O, parse, schema, expression). Fast failure on unknown columns, invalid expressions, header mismatches, precision overflow, unsupported mapping strategies.\n\n### Logging\n\nSet `RUST_LOG=csv_managed=debug` for phase insights (inference voting, index selection, mapping application). Higher verbosity may impact throughput; toggle only when diagnosing.\n\n### Testing\n\nRun `cargo test`. Integration tests cover inference, indexing, process flags, piping, stats. Use `assert_cmd` for pipeline locking. Add new tests for any behavior that changes output formatting (update snapshots intentionally).\n\n---\n\n## Roadmap\n\nSee the consolidated backlog \u0026 release planning in [.plan/backlog.md](.plan/backlog.md) for upcoming features (join redesign, primary key indexes, batch definition ingestion, additional file formats).\n\n---\n\n## Contributing\n\n1. Fork \u0026 branch (`feat/\u003cname\u003e`).  \n2. Add unit + integration tests.  \n3. `cargo fmt \u0026\u0026 cargo clippy \u0026\u0026 cargo test` must pass.  \n4. Update README sections or move features from roadmap to implemented list.  \n\n## License\n\nSee [LICENSE](https://github.com/softwaresalt/csv-managed?tab=MIT-1-ov-file#MIT-1-ov-file).\n\n## Support\n\nOpen issues for bugs, enhancements, or documentation gaps. Pull requests welcome.\n\nJoin the community conversation, ask questions, or propose ideas in **[GitHub Discussions](https://github.com/softwaresalt/csv-managed/discussions)**.\n\n---\n\n## Documentation Notes\n\nDeep-dive sections were removed from the README and relocated to `docs/`. 
Use the Documentation Map above for full references.\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoftwaresalt%2Fcsv-managed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoftwaresalt%2Fcsv-managed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoftwaresalt%2Fcsv-managed/lists"}