{"id":47821333,"url":"https://github.com/friendlymatthew/arrow-csv2","last_synced_at":"2026-04-03T19:10:13.897Z","repository":{"id":295701682,"uuid":"990333921","full_name":"friendlymatthew/arrow-csv2","owner":"friendlymatthew","description":"Vectorized CSV parsing for Apache Arrow","archived":false,"fork":false,"pushed_at":"2026-03-26T03:29:24.000Z","size":56,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-26T18:23:36.994Z","etag":null,"topics":["arrow","csv","datafusion","object-storage","rust","simd"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/friendlymatthew.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-05-26T00:46:40.000Z","updated_at":"2026-03-26T03:29:28.000Z","dependencies_parsed_at":"2026-03-19T04:03:47.224Z","dependency_job_id":null,"html_url":"https://github.com/friendlymatthew/arrow-csv2","commit_stats":null,"previous_names":["friendlymatthew/csv","friendlymatthew/simdcsv"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/friendlymatthew/arrow-csv2","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/friendlymatthew%2Farrow-csv2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/friendlymatthew%2Farrow-csv2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/friendlymatthew%2Fa
rrow-csv2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/friendlymatthew%2Farrow-csv2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/friendlymatthew","download_url":"https://codeload.github.com/friendlymatthew/arrow-csv2/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/friendlymatthew%2Farrow-csv2/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31372198,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-03T17:53:18.093Z","status":"ssl_error","status_checked_at":"2026-04-03T17:53:17.617Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arrow","csv","datafusion","object-storage","rust","simd"],"created_at":"2026-04-03T19:10:12.723Z","updated_at":"2026-04-03T19:10:13.891Z","avatar_url":"https://github.com/friendlymatthew.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# arrow-csv2\n\nThis project researches how far we can push CSV-to-Arrow performance, from single threaded decoding to parallel ingestion from object store, and offers a performant parallel file opener for Datafusion and vectorized CSV decoder for Arrow.\n\n# Background\n\nTo read a CSV file in parallel, you split it into byte ranges and assign 
each range to a thread. Since a split point almost certainly lands mid-row, each thread seeks forward to the next newline to find a record boundary. This is the approach used by state-of-the-art databases like DuckDB's [parallel CSV file reader](https://github.com/duckdb/duckdb/pull/6977).\n\nHowever, this assumes every newline is a record boundary, and [RFC 4180](https://www.rfc-editor.org/rfc/rfc4180.html) allows newlines inside quoted fields (e.g. `a,\"b\\nc\",d`). A split point inside a quoted field causes threads to treat a quoted newline as a record boundary, producing silently wrong results like dropped rows ([#13787](https://github.com/duckdb/duckdb/issues/13787), [#7578](https://github.com/duckdb/duckdb/issues/7578), [#13047](https://github.com/duckdb/duckdb/issues/13047)) or incorrect field values ([#9036](https://github.com/duckdb/duckdb/pull/9036)).\n\nThis project solves the problem by framing CSV quote tracking as a monoid, which makes it safe to parallelize. The only state needed to determine whether a newline is a record boundary is whether the parser is currently inside a quoted field, which is just the parity of the quote count before that position. _Quote parity forms a monoid under XOR with identity false, which means partitions can be classified independently and combined in any order_. The combination step itself is sequential, since each partition's true starting state depends on the accumulated parity of all preceding partitions. In practice this is trivially cheap, and because the operation is associative it could be restructured as a parallel prefix scan if needed.\n\nOnce the correct newline bitsets are selected, the resolver finds the first record-boundary newline in each partition to determine record-aligned byte ranges. Each partition then parses its range independently, producing Arrow record batches in parallel.\n\n# Features\n\nThis project hooks into the DataFusion pipeline at the `FileSource` level. 
It offers `ParallelCsvSource` and `ParallelCsvOpener`, which implement DataFusion's `FileSource` and `FileOpener` traits.\n\nThe file opener uses a custom CSV decoder that follows the API of the existing `arrow-csv` `Decoder`.\n\n# Status\n\nOn an M4 MacBook, `arrow-csv2` reads the full ClickBench `hits.csv` dataset (82GB, uncompressed) in **21.861s (3.46GB/s)** with the default settings (64MB partitions, concurrency 16).\n\nBenchmarks on a 100MB ClickBench slice (M4 MacBook):\n\n**DISCLAIMERS**\n\n- For correct RFC 4180 parsing, DataFusion falls back to single-threaded parsing (204ms). arrow-csv2 achieves 36ms at 16 partitions while maintaining correctness, a **5.7x speedup**.\n- DuckDB numbers include Arrow IPC conversion overhead. (I'm also generally dubious about these numbers.)\n- arrow-csv2 is RFC 4180 compliant at all partition counts.\n\n| Partitions/Threads | arrow-csv2 | DataFusion    | DuckDB        |\n| ------------------ | ---------- | ------------- | ------------- |\n| 1                  | 199ms      | 205ms (1.03x) | 600ms (3.02x) |\n| 2                  | 107ms      | 122ms (1.14x) | 458ms (4.28x) |\n| 4                  | 61ms       | 74ms (1.21x)  | 387ms (6.34x) |\n| 8                  | 38ms       | 49ms (1.29x)  | 338ms (8.89x) |\n| 12                 | 37ms       | 50ms (1.35x)  | 330ms (8.92x) |\n| 16                 | 36ms       | 54ms (1.50x)  | 324ms (9.00x) |\n\n# Usage\n\nAt the moment, `arrow-csv2` uses NEON intrinsics (sorry).\n\n```sh\n# run the full 82GB uncompressed ClickBench dataset\ncargo r --bin parse_clickbench --release\n\n# run benchmarks (uses a 100MB slice of the ClickBench dataset)\n./download_clickbench.sh\ncargo r --bin slice_clickbench\ncargo bench\n\n# tests (includes roundtripping with arrow-csv!)\ncargo t\n```\n\n# 
Reading\n\nhttps://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/\u003cbr\u003e\nhttps://www.rfc-editor.org/rfc/rfc4180.html\u003cbr\u003e\nhttps://arxiv.org/pdf/1902.08318\u003cbr\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffriendlymatthew%2Farrow-csv2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffriendlymatthew%2Farrow-csv2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffriendlymatthew%2Farrow-csv2/lists"}