{"id":18812160,"url":"https://github.com/drorspei/arrow-csv-benchmark","last_synced_at":"2026-01-11T18:30:15.141Z","repository":{"id":132911781,"uuid":"304091761","full_name":"drorspei/arrow-csv-benchmark","owner":"drorspei","description":"Short benchmark for arrow's read_csv","archived":false,"fork":false,"pushed_at":"2021-02-27T19:26:03.000Z","size":146,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-12-30T00:14:30.926Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/drorspei.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-14T17:44:42.000Z","updated_at":"2020-10-14T18:10:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"d75dc863-1c7c-4883-92e8-18198dfccb2f","html_url":"https://github.com/drorspei/arrow-csv-benchmark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorspei%2Farrow-csv-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorspei%2Farrow-csv-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorspei%2Farrow-csv-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorspei%2Farrow-csv-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/drorspei","download_ur
l":"https://codeload.github.com/drorspei/arrow-csv-benchmark/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239748247,"owners_count":19690232,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T23:29:59.254Z","updated_at":"2026-01-11T18:30:15.094Z","avatar_url":"https://github.com/drorspei.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# arrow-csv-benchmark\nShort benchmark for arrow's read_csv\n\n# Why\n\nI made this repo after experiencing low read speeds (0.5 GiB/s) on real work CSVs.\n\n# What this does\n\n1. Generates a big CSV with many string, float, and null columns, using joblib for parallelization,\n2. Puts the CSV into a `BytesIO` object,\n3. Calls `pyarrow.csv.read_csv` a few times on the CSV bytes.\n\nThe Dockerfile sets up a minimal container for running the benchmark.\n\n# My Results\n\nRunning this on an otherwise idle Azure machine of size `Standard E48s_v3 (48 vcpus, 384 GiB memory)` with Linux (Ubuntu 18.04) consistently shows speeds below 1 GiB/s, and often below 0.5 GiB/s.\n\nIncluded in the repo are profiling dumps, made manually with py-spy. I started each one 5 seconds after the beginning of a `read_csv` call and stopped it after about 15 seconds. This was always more than 5 seconds before the `read_csv` call finished.\n\nIf the profiles are to be trusted, considerable time is spent in the shared pointer's lock mechanisms. As for reading the bytes, I'm not sure what goes into it or why it takes time.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdrorspei%2Farrow-csv-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdrorspei%2Farrow-csv-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdrorspei%2Farrow-csv-benchmark/lists"}