{"id":13730302,"url":"https://github.com/lighttransport/nanocsv","last_synced_at":"2025-04-11T19:40:55.407Z","repository":{"id":50097754,"uuid":"188210725","full_name":"lighttransport/nanocsv","owner":"lighttransport","description":"Multithreaded header only C++11 CSV parser","archived":false,"fork":false,"pushed_at":"2024-03-12T18:03:48.000Z","size":1054,"stargazers_count":29,"open_issues_count":0,"forks_count":3,"subscribers_count":8,"default_branch":"devel","last_synced_at":"2025-03-25T15:34:27.796Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lighttransport.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-05-23T10:15:52.000Z","updated_at":"2024-01-05T22:38:12.000Z","dependencies_parsed_at":"2024-01-31T01:03:37.117Z","dependency_job_id":"b75cc59b-9e9b-4f83-b381-9276d7e24d5a","html_url":"https://github.com/lighttransport/nanocsv","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lighttransport%2Fnanocsv","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lighttransport%2Fnanocsv/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lighttransport%2Fnanocsv/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lighttransport%2Fnanocsv/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lighttransport","download_ur
l":"https://codeload.github.com/lighttransport/nanocsv/tar.gz/refs/heads/devel","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248468272,"owners_count":21108787,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T02:01:13.030Z","updated_at":"2025-04-11T19:40:55.381Z","avatar_url":"https://github.com/lighttransport.png","language":"C","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"# NanoCSV, Faster C++11 multithreaded header-only CSV parser\n\n![C/C++ CI](https://github.com/lighttransport/nanocsv/workflows/C/C++%20CI/badge.svg)\n\nNanoCSV is a fast C++11 multithreaded header-only CSV parser that depends only on the STL.\nNanoCSV is designed for CSV data with numeric values.\n\n![tty](img/tty.gif)\n\n\n## Status\n\nIn development.\nUsing NanoCSV in production is not recommended at the moment.\n\n## Requirements\n\n* C++11 compiler (with `thread` support)\n\n## Usage\n\n```c++\n\n// Define this in only **one** C++ file.\n#define NANOCSV_IMPLEMENTATION\n#include \"nanocsv.h\"\n\n#include \u003ccstdlib\u003e   // std::atoi, EXIT_SUCCESS, EXIT_FAILURE\n#include \u003ciostream\u003e  // std::cout\n#include \u003cstring\u003e\n\nint main(int argc, char **argv)\n{\n  if (argc \u003c 2) {\n    std::cout \u003c\u003c \"csv_parser_example input.csv (num_threads) (delimiter)\\n\";\n  }\n\n  std::string filename(\"./data/array-4-5.csv\");\n  int num_threads = -1; // -1 = use all system threads\n  char delimiter = ' '; // delimiter character.\n\n  if (argc \u003e 1) {\n    filename = argv[1];\n  }\n\n  if (argc \u003e 2) {\n    num_threads = std::atoi(argv[2]);\n  }\n\n  if (argc \u003e 3) {\n    delimiter = argv[3][0];\n  }\n\n  
nanocsv::ParseOption\u003cfloat\u003e option;\n  option.delimiter = delimiter;\n  option.req_num_threads = num_threads;\n  option.verbose = true; // verbose messages will be stored in `warn`.\n  option.ignore_header = true; // Parse header(the first line. default = true).\n\n  std::string warn;\n  std::string err;\n\n  nanocsv::CSV\u003cfloat\u003e csv;\n\n  bool ret = nanocsv::ParseCSVFromFile(filename, option, \u0026csv, \u0026warn, \u0026err);\n\n  if (!warn.empty()) {\n    std::cout \u003c\u003c \"WARN: \" \u003c\u003c warn \u003c\u003c \"\\n\";\n  }\n\n\n  if (!ret) {\n\n    if (!err.empty()) {\n      std::cout \u003c\u003c \"ERROR: \" \u003c\u003c err \u003c\u003c \"\\n\";\n    }\n\n    return EXIT_FAILURE;\n  }\n\n  std::cout \u003c\u003c \"num records(rows) = \" \u003c\u003c csv.num_records \u003c\u003c \"\\n\";\n  std::cout \u003c\u003c \"num fields(columns) = \" \u003c\u003c csv.num_fields \u003c\u003c \"\\n\";\n\n  // values is a 1D array of length [num_records * num_fields]\n  // std::cout \u003c\u003c csv.values[4 * csv.num_fields + 3] \u003c\u003c \"\\n\";\n\n  // header strings are stored in `csv.header`\n  if (!option.ignore_header) {\n    for (size_t i = 0; i \u003c csv.header.size(); i++) {\n      std::cout \u003c\u003c csv.header[i] \u003c\u003c \"\\n\";\n    }\n  }\n\n\n  return EXIT_SUCCESS;\n}\n```\n\n## NaN, Inf\n\nnanocsv supports parsing\n\n* `nan`, `-nan` as NaN, -NaN\n* `inf`, `-inf` as Inf, -Inf\n\n## Support for N/A and null value\n\nBy default, missing values (e.g. N/A (including invalid numeric strings), NaN) are replaced by `nan`, and null (empty) values (e.g. 
\"\") are replaced by `nan`.\n\nYou can control the behavior with the following parameters in `ParseOption`.\n\n* `replace_na` : Replace N/A and NaN values?\n  * `na_value` : The value substituted for N/A and NaN values\n* `replace_null` : Replace null (empty) values?\n  * `null_value` : The value substituted for null values\n\n## Parse Text CSV\n\nParsing text CSV (each field is just a string) is also supported.\n(It uses a different API. See the source code for details.)\n\n## Compiler options\n\n* NANOCSV_NO_IO : Disable I/O (file access, stdio, mmap).\n* NANOCSV_WITH_RYU : Use the ryu library to parse floating-point strings. https://github.com/ulfjack/ryu . This gives precise handling of floating-point values.\n  * NANOCSV_WITH_RYU_NOINCLUDE: Do not include Ryu header files in `nanocsv.h`. This is useful when you want to include Ryu header files outside of `nanocsv.h`.\n\n\n## TODO\n\n* [ ] Support UTF-8\n  * [x] Detect BOM header\n  * [ ] Validate UTF-8 string\n* [ ] Support UTF-16 and UTF-32?\n* [ ] mmap based API\n* [ ] Reduce memory usage. Currently nanocsv allocates some memory for intermediate buffer.\n* [ ] Robust error handling.\n* [x] Support header.\n* [x] Support comment lines (lines starting with `#`)\n* [ ] Support different numbers of fields among records\n* [ ] Parse complex values (e.g. 
`3.0 + 4.2j`)\n* [ ] Parse special values like `#INF`, `#NAN`.\n  * https://docs.microsoft.com/en-us/cpp/c-runtime-library/format-specification-syntax-printf-and-wprintf-functions?view=vs-2019\n* [ ] Use floaxie https://github.com/aclex/floaxie for better floating point string parsing.\n* [ ] CSV writer.\n* [ ] Write tests.\n* [ ] Remove libm(`pow`) dependency.\n\n## Performance\n\nThe dataset is 8192 x 4096, 800 MB in file size (generated by `tools/gencsv/gen.py`).\n\n* Threadripper 1950X\n* DDR4 2666 64 GB memory\n\n![perf](img/perf-chart.png)\n\n### 1 thread\n\n```\ntotal parsing time: 3833.33 ms\n  line detection : 1264.99 ms\n  alloc buf      : 0.016351 ms\n  parse          : 2508.83 ms\n  construct      : 55.726 ms\n```\n\n### 16 threads\n\n```\ntotal parsing time: 545.646 ms\n  line detection : 159.078 ms\n  alloc buf      : 0.077979 ms\n  parse          : 337.207 ms\n  construct      : 46.7815 ms\n```\n\n\n### 23 threads\n\n23 threads are used here because they are faster than 32 threads on the 1950X.\n\n```\ntotal parsing time: 494.849 ms\n  line detection : 127.176 ms\n  alloc buf      : 0.050988 ms\n  parse          : 314.287 ms\n  construct      : 50.7568 ms\n```\n\nRoughly **7.7 times faster** than single-threaded parsing.\n\n### Note on memory consumption\n\nNot sure, but it should not exceed 3 * filesize, so about 2.4 GB at most for this dataset.\n\n### In python\n\nUsing `numpy.loadtxt` to load the data takes 23.4 secs.\n\nNanoCSV parsing with 23 threads is roughly **40 times faster** than `numpy.loadtxt`.\n\n## References\n\n* RFC 4180 https://www.ietf.org/rfc/rfc4180.txt\n\n## License\n\nMIT License\n\n### Third-party license\n\n* stack_container : Copyright (c) 2006-2008 The Chromium Authors. BSD-style license.\n* acutest : MIT license. 
Used for unit testing.\n* ryu : Apache 2.0 or Boost 1.0 dual license.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flighttransport%2Fnanocsv","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flighttransport%2Fnanocsv","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flighttransport%2Fnanocsv/lists"}