https://github.com/lukaswagner/csv-parser
Quick, multi-threaded CSV parser with focus on handling huge files.
- Host: GitHub
- URL: https://github.com/lukaswagner/csv-parser
- Owner: lukaswagner
- License: mit
- Created: 2021-01-29T12:39:12.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2022-05-13T13:00:41.000Z (over 2 years ago)
- Last Synced: 2024-10-08T01:38:17.542Z (3 months ago)
- Topics: csv, js, parser, ts
- Language: TypeScript
- Homepage: https://csv.lwgnr.dev
- Size: 840 KB
- Stars: 3
- Watchers: 1
- Forks: 2
- Open Issues: 17
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# csv-parser
Quick, multi-threaded CSV parser with focus on handling huge files.
![build+lint badge](https://github.com/lukaswagner/csv-parser/actions/workflows/ci.yml/badge.svg)
## Features
- **Quick:** Loads a 400MB CSV (5 float columns, 10M rows) in 8 seconds (on an i7-4790K). [Papa Parse](https://www.papaparse.com/), which claims to be the fastest CSV parser in the browser, took twice as long to parse the same file on the same system, and that was with everything parsed as strings. With type parsing enabled, it took over a minute.
- **Supports preprocessing the data:** Loading the data first and processing it later can waste RAM by storing columns which you don't actually need. Instead, you can specify generator functions to create custom columns from the parsed input data. The input data can then be discarded immediately, which reduces memory usage.
- **Data is returned in chunks:** Especially for very large files (multiple GB), you may want to work with the available data before everything is parsed. This also allows the usage of infinite data streams.
- **Sensible data storage:** All scalar data is stored as [typed arrays](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Typed_arrays) with `ArrayBuffer`s/`SharedArrayBuffer`s as underlying storage. This has multiple advantages:
  - Lower memory footprint: You can choose the required byte size of your buffer. Depending on your requirements, this can drastically reduce memory usage in comparison to an array of numbers (each being a 64-bit float). But even when using a `Float64Array`, the memory usage tends to be lower, as each `number` in the equivalent `Array` typically measures [more than 8 bytes](https://blog.dashlane.com/how-is-data-stored-in-v8-js-engine-memory/#BlogArticle-HowisdatastoredinV8JSenginememory?-PrimitiveTypes).
  - Easier usage of low-level interfaces, such as sending data to the [GPU with WebGL](https://developer.mozilla.org/en-US/docs/Web/API/WebGLRenderingContext/bufferData).
  - `SharedArrayBuffer` allows multiple threads (workers) to access the data without duplicating it. This can be enabled by setting the `sharedArrayBuffer` option to `true`. You'll need to add [two security headers](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer#security_requirements) when hosting your website.
## Usage
See the [`apps` directory](https://github.com/lukaswagner/csv-parser/tree/master/apps) for multiple example implementations using the parser.
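The apps remain the reference for the parser's actual API. Independent of that API, the storage model described under Features can be sketched with standard JavaScript typed arrays alone. The snippet below is a minimal illustration, not the library's implementation: `allocateColumn`, `appendMagnitude`, and the `Chunk` shape are hypothetical names invented for this example. It allocates a 32-bit float column (optionally backed by a `SharedArrayBuffer`), fills it chunk by chunk with a value derived from two input columns, and shows that a second view on the same buffer shares the memory instead of copying it.

```typescript
// Hypothetical helper (not part of csv-parser): allocate storage for one column.
function allocateColumn(rows: number, shared: boolean): Float32Array {
    const bytes = rows * Float32Array.BYTES_PER_ELEMENT; // 4 bytes per value
    // SharedArrayBuffer is only defined in cross-origin-isolated pages and in Node.js.
    const buffer = shared && typeof SharedArrayBuffer !== 'undefined'
        ? new SharedArrayBuffer(bytes)
        : new ArrayBuffer(bytes);
    return new Float32Array(buffer);
}

// Hypothetical chunk shape: two parsed input columns of equal length.
interface Chunk {
    x: Float32Array;
    y: Float32Array;
}

// Writes a derived column (vector length) for one chunk and returns the new
// fill offset. The chunk's input columns are not kept afterwards.
function appendMagnitude(target: Float32Array, offset: number, chunk: Chunk): number {
    for (let i = 0; i < chunk.x.length; i++) {
        target[offset + i] = Math.hypot(chunk.x[i], chunk.y[i]);
    }
    return offset + chunk.x.length;
}

const magnitude = allocateColumn(4, true);
let filled = 0;
filled = appendMagnitude(magnitude, filled, {
    x: Float32Array.of(3, 6),
    y: Float32Array.of(4, 8),
});
filled = appendMagnitude(magnitude, filled, {
    x: Float32Array.of(5, 0),
    y: Float32Array.of(12, 1),
});

// A second view on the same buffer reads the same memory without a copy.
const sameData = new Float32Array(magnitude.buffer);
```

Four values occupy 16 bytes here. Because only the derived column is retained, the two input columns of each chunk can be garbage-collected as soon as the chunk has been processed.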
## Note regarding huge files on Google Chrome
Chrome limits each tab to 4GB of RAM, even when running on 64-bit machines. This means that when parsing big files, your tab may crash with a `STATUS_BREAKPOINT` error message. Example: When parsing a 2GB file with 5 32-bit float columns and 50M rows, the parsed arrays will measure roughly 1GB. During parsing, however, the memory consumption can approach 4GB due to intermediate values being created. As a workaround for this limitation, you can use Firefox, which does allow using more than 4GB of RAM.
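The 1GB figure in the example follows directly from the column layout. A quick check, using only the standard `BYTES_PER_ELEMENT` constants (the function name `parsedSizeBytes` is made up for this illustration):

```typescript
// Size in bytes of `columns` typed-array columns holding `rows` values each.
function parsedSizeBytes(rows: number, columns: number, bytesPerValue: number): number {
    return rows * columns * bytesPerValue;
}

// 50M rows, 5 columns, 32-bit floats (4 bytes per value).
const float32Bytes = parsedSizeBytes(50_000_000, 5, Float32Array.BYTES_PER_ELEMENT);
// The same data stored as 64-bit floats (8 bytes per value) would be twice as large.
const float64Bytes = parsedSizeBytes(50_000_000, 5, Float64Array.BYTES_PER_ELEMENT);

const float32Gigabytes = float32Bytes / 1e9;     // decimal gigabytes
const float32Gibibytes = float32Bytes / 2 ** 30; // binary gibibytes

console.log(float32Gigabytes, float32Gibibytes.toFixed(2), float64Bytes / 1e9);
```

This covers only the final arrays. The intermediate values created during parsing are not modeled here, and they are what pushes the total towards the 4GB limit.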