https://github.com/lighttransport/nanocsv

Multithreaded header only C++11 CSV parser

Host: GitHub
URL: https://github.com/lighttransport/nanocsv
Owner: lighttransport
License: other
Created: 2019-05-23T10:15:52.000Z (over 5 years ago)
Default Branch: devel
Last Pushed: 2024-03-12T18:03:48.000Z (10 months ago)
Last Synced: 2024-08-04T02:09:27.075Z (6 months ago)
Language: C
Size: 1.01 MB
Stars: 29
Watchers: 9
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

# NanoCSV, Faster C++11 multithreaded header-only CSV parser

![C/C++ CI](https://github.com/lighttransport/nanocsv/workflows/C/C++%20CI/badge.svg)

NanoCSV is a fast, multithreaded, header-only C++11 CSV parser that depends only on the STL.

NanoCSV is designed for CSV data with numeric values.

![tty](img/tty.gif)

## Status

In development.

Not recommended to use NanoCSV in production at the moment.

## Requirements

* C++11 compiler (with `thread` support)

## Usage

```c++
#include <cstdlib>
#include <iostream>
#include <string>

// Define this in only **one** C++ file.
#define NANOCSV_IMPLEMENTATION
#include "nanocsv.h"

int main(int argc, char **argv)
{
  if (argc < 2) {
    // No input given: show usage, then fall back to the default file below.
    std::cout << "csv_parser_example input.csv (num_threads) (delimiter)\n";
  }

  std::string filename("./data/array-4-5.csv");
  int num_threads = -1; // -1 = use all system threads
  char delimiter = ' '; // delimiter character.

  if (argc > 1) {
    filename = argv[1];
  }
  if (argc > 2) {
    num_threads = std::atoi(argv[2]);
  }
  if (argc > 3) {
    delimiter = argv[3][0];
  }

  nanocsv::ParseOption option;
  option.delimiter = delimiter;
  option.req_num_threads = num_threads;
  option.verbose = true; // verbose messages will be stored in `warn`.
  option.ignore_header = true; // Ignore the header (the first line). default = true.

  std::string warn;
  std::string err;

  nanocsv::CSV csv;
  bool ret = nanocsv::ParseCSVFromFile(filename, option, &csv, &warn, &err);

  if (!warn.empty()) {
    std::cout << "WARN: " << warn << "\n";
  }

  if (!ret) {
    if (!err.empty()) {
      std::cout << "ERROR: " << err << "\n";
    }
    return EXIT_FAILURE;
  }

  std::cout << "num records(rows) = " << csv.num_records << "\n";
  std::cout << "num fields(columns) = " << csv.num_fields << "\n";

  // values are stored as a 1D array of length [num_records * num_fields]
  // std::cout << csv.values[4 * csv.num_fields + 3] << "\n";

  // header strings are stored in `csv.header`
  if (!option.ignore_header) {
    for (size_t i = 0; i < csv.header.size(); i++) {
      std::cout << csv.header[i] << "\n";
    }
  }

  return EXIT_SUCCESS;
}
```
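
The usage example indexes `csv.values` as `row * num_fields + column`, which suggests a flat, row-major layout. The following is a minimal sketch of that indexing, assuming row-major order. `cell_at` and `column` are hypothetical helpers written for illustration; they are not part of nanocsv and work on any `std::vector`.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of nanocsv): read one cell from a flat,
// row-major array holding num_records * num_fields values.
template <typename T>
inline T cell_at(const std::vector<T> &values, std::size_t num_fields,
                 std::size_t row, std::size_t col) {
  assert(col < num_fields);
  assert(row * num_fields + col < values.size());
  return values[row * num_fields + col];
}

// Hypothetical helper: copy one column out of the flat array by stepping
// through it with a stride of num_fields.
template <typename T>
inline std::vector<T> column(const std::vector<T> &values,
                             std::size_t num_fields, std::size_t col) {
  assert(num_fields > 0 && col < num_fields);
  std::vector<T> out;
  for (std::size_t i = col; i < values.size(); i += num_fields) {
    out.push_back(values[i]);
  }
  return out;
}
```

With a parsed `csv`, a call such as `cell_at(csv.values, csv.num_fields, 4, 3)` would read the same element as the commented-out line in the usage example, provided the layout assumption holds.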

## NaN, Inf

nanocsv supports parsing

* `nan`, `-nan` as NaN, -NaN

* `inf`, `-inf` as Inf, -Inf

## Support for N/A and null value

By default, missing values (e.g. N/A, including invalid numeric strings, and NaN) are replaced by `nan`, and null (empty) values (e.g. "") are also replaced by `nan`.

You can control this behavior with the following parameters in `ParseOption`.

* `replace_na` : Replace N/A, NaN value?

  * `na_value` : The value to be replaced for N/A, NaN value

* `replace_null` : Replace null(empty) value?

  * `null_value` : The value to be replaced for null value
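
To make the replacement rules concrete, here is a standalone sketch of the semantics described above. It is not nanocsv's implementation: `ReplacePolicy` and `convert_field` are hypothetical names, the field types are assumed to be `bool` and `double`, and returning `false` when a replacement flag is off is also an assumption. Numeric conversion uses the standard `std::strtod`, which accepts `nan`, `inf` and their negative forms.

```c++
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <limits>
#include <string>

// Hypothetical mirror of the four options listed above; the real field
// types in nanocsv's ParseOption may differ.
struct ReplacePolicy {
  bool replace_na = true;
  double na_value = std::numeric_limits<double>::quiet_NaN();
  bool replace_null = true;
  double null_value = std::numeric_limits<double>::quiet_NaN();
};

// Convert one field to a double. Returns false when the field is N/A or null
// and the matching replace_* flag is off (an assumed behavior).
inline bool convert_field(const std::string &field, const ReplacePolicy &policy,
                          double *out) {
  assert(out != nullptr);
  if (field.empty()) {  // null (empty) value
    if (!policy.replace_null) return false;
    *out = policy.null_value;
    return true;
  }
  char *end = nullptr;
  const double v = std::strtod(field.c_str(), &end);
  const bool consumed_all = (end == field.c_str() + field.size());
  if (!consumed_all || std::isnan(v)) {  // invalid numeric string, or NaN
    if (!policy.replace_na) return false;
    *out = policy.na_value;
    return true;
  }
  *out = v;  // ordinary number, or +/-inf
  return true;
}
```

In this sketch, setting `na_value` to `-1.0` would turn both `N/A` and `nan` fields into `-1.0`, while `-inf` passes through unchanged.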

## Parse Text CSV

Parsing text CSV (where each field is just a string) is also supported.

(This uses a different API. See the source code for details.)

## Compiler options

* NANOCSV_NO_IO : Disable I/O(file access, stdio, mmap).

* NANOCSV_WITH_RYU : Use ryu library to parse floating-point string. https://github.com/ulfjack/ryu . This will give precise handling of floating point values.

  * NANOCSV_WITH_RYU_NOINCLUDE: Do not include Ryu header files in `nanocsv.h`. This is useful when you want to include Ryu header files outside of `nanocsv.h`.

## TODO

* [ ] Support UTF-8

  * [x] Detect BOM header

  * [ ] Validate UTF-8 string

* [ ] Support UTF-16 and UTF-32?

* [ ] mmap based API

* [ ] Reduce memory usage. Currently nanocsv allocates some memory for intermediate buffer.

* [ ] Robust error handling.

* [x] Support header.

* [x] Support comment lines (a line starting with `#`)

* [ ] Support different numbers of fields among records.

* [ ] Parse complex value(e.g. `3.0 + 4.2j`)

* [ ] Parse special value like `#INF`, `#NAN`.

  * https://docs.microsoft.com/en-us/cpp/c-runtime-library/format-specification-syntax-printf-and-wprintf-functions?view=vs-2019

* [ ] Use floaxie https://github.com/aclex/floaxie for better floating point string parsing.

* [ ] CSV writer.

* [ ] Write tests.

* [ ] Remove libm(`pow`) dependency.

## Performance

Dataset is 8192 x 4096, 800 MB in file size(generated by `tools/gencsv/gen.py`)

* Threadripper 1950X

* DDR4 2666 64 GB memory

![perf](img/perf-chart.png)
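
The timing breakdowns below list a "line detection" phase before parsing. One common way to parallelize such a phase is to split the buffer into contiguous chunks and let each thread record the newline offsets in its own chunk. The sketch below illustrates that idea only; it is an assumption, not a description of nanocsv's internals. `find_newlines` is a hypothetical function, and it ignores newlines inside quoted fields.

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Sketch only: return the byte offset of every '\n' in `buf`, scanning with
// `num_threads` workers. Each worker reads its own chunk and writes to its
// own vector, so no locking is needed.
inline std::vector<std::size_t> find_newlines(const std::string &buf,
                                              int num_threads) {
  if (num_threads <= 0) {
    // Follows the "-1 = use all system threads" convention of the usage example.
    num_threads = static_cast<int>(std::thread::hardware_concurrency());
  }
  num_threads = std::max(1, num_threads);  // hardware_concurrency() may return 0

  const std::size_t n = static_cast<std::size_t>(num_threads);
  const std::size_t chunk = (buf.size() + n - 1) / n;  // ceil(size / n)
  std::vector<std::vector<std::size_t>> per_thread(n);
  std::vector<std::thread> workers;

  for (std::size_t t = 0; t < n; t++) {
    workers.emplace_back([&buf, &per_thread, chunk, t]() {
      const std::size_t begin = std::min(buf.size(), t * chunk);
      const std::size_t end = std::min(buf.size(), begin + chunk);
      for (std::size_t i = begin; i < end; i++) {
        if (buf[i] == '\n') per_thread[t].push_back(i);
      }
    });
  }
  for (std::thread &w : workers) w.join();

  // Chunks are in file order, so concatenation keeps the offsets sorted.
  std::vector<std::size_t> offsets;
  for (const std::vector<std::size_t> &v : per_thread) {
    offsets.insert(offsets.end(), v.begin(), v.end());
  }
  assert(std::is_sorted(offsets.begin(), offsets.end()));
  return offsets;
}
```

Once the offsets are known, each record is an independent byte range, which is what lets a later parse phase hand different records to different threads.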

### 1 thread

```
total parsing time: 3833.33 ms
  line detection : 1264.99 ms
  alloc buf      : 0.016351 ms
  parse          : 2508.83 ms
  construct      : 55.726 ms
```

### 16 threads

```
total parsing time: 545.646 ms
  line detection : 159.078 ms
  alloc buf      : 0.077979 ms
  parse          : 337.207 ms
  construct      : 46.7815 ms
```

### 23 threads

The 23-thread result is shown because 23 threads are faster than 32 threads on the 1950X.

```
total parsing time: 494.849 ms
  line detection : 127.176 ms
  alloc buf      : 0.050988 ms
  parse          : 314.287 ms
  construct      : 50.7568 ms
```

Roughly **7.7 times faster** than single-thread parsing.

### Note on memory consumption

Not measured precisely, but it should not exceed 3 * filesize, so roughly 2.4 GB for the 800 MB dataset above.

### In python

Using `numpy.loadtxt` to load the data takes 23.4 secs.

23-threaded nanocsv parsing is roughly **40 times faster** than `numpy.loadtxt`.
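
The speedup figures in this section can be rechecked from the reported timings. The sketch below does only that arithmetic; `speedup`, `parallel_efficiency` and the constant names are hypothetical, and the constants are copied from the measurements above.

```c++
#include <cassert>

// Ratio of a baseline time to an improved time (both in the same unit).
inline double speedup(double baseline_ms, double improved_ms) {
  assert(baseline_ms > 0.0 && improved_ms > 0.0);
  return baseline_ms / improved_ms;
}

// Speedup divided by thread count; 1.0 would be perfect linear scaling.
inline double parallel_efficiency(double speedup_value, int num_threads) {
  assert(num_threads > 0);
  return speedup_value / num_threads;
}

// Total parsing times copied from the measurements above.
const double kOneThreadMs = 3833.33;
const double k16ThreadsMs = 545.646;
const double k23ThreadsMs = 494.849;
const double kNumpyLoadtxtMs = 23.4 * 1000.0;  // 23.4 secs
```

With these inputs, `speedup(kOneThreadMs, k16ThreadsMs)` is about 7.0 and `speedup(kOneThreadMs, k23ThreadsMs)` is about 7.7, matching the figure quoted earlier. `speedup(kNumpyLoadtxtMs, k23ThreadsMs)` comes out at about 47, so "roughly 40 times" is a conservative rounding. Parallel efficiency is about 0.44 at 16 threads, which indicates sub-linear scaling.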

## References

* RFC 4180 https://www.ietf.org/rfc/rfc4180.txt

## License

MIT License

### Third-party license

* stack_container : Copyright (c) 2006-2008 The Chromium Authors. BSD-style license.

* acutest : MIT license. Used for unit tester.

* ryu : Apache 2.0 or Boost 1.0 dual license.