Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mzaks/mojo-csv
- Host: GitHub
- URL: https://github.com/mzaks/mojo-csv
- Owner: mzaks
- License: mit
- Created: 2023-10-27T05:14:54.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-03T12:53:16.000Z (7 months ago)
- Last Synced: 2024-07-25T01:57:41.880Z (5 months ago)
- Language: Mojo
- Size: 2.45 MB
- Stars: 50
- Watchers: 2
- Forks: 3
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- mojo-is-awesome - Mojo CSV - read and write csv data
- awesome-mojo-max-mlir - mzaks/mojo-csv: This library provides facilities to read and write data in CSV format according to [RFC-4180](https://www.rfc-editor.org/rfc/rfc4180) (File Processing)
README
# MOJO-CSV
This library provides facilities to read and write data in CSV format according to [RFC-4180](https://www.rfc-editor.org/rfc/rfc4180)
## Writing data to CSV format
To convert data into CSV format, the user needs to create an instance of `CsvBuilder`. The `CsvBuilder` has two instantiation options:
1. Instantiate the builder with a column count: `CsvBuilder(3)`
2. Instantiate the builder with column names: `CsvBuilder("a", "b", "c")`

After the builder is instantiated, values can be pushed through the following API:
- `fn push[D: DType](inout self, value: SIMD[D, 1]):` Pushes a numeric value.
- `fn push(inout self, s: String, consider_escaping: Bool = True):` Pushes a string value. By default, the value is examined for special characters to determine whether it needs to be escaped.
- `fn push[T: AnyType, to_str: fn(v:T) -> String](inout self, value: T, consider_escaping: Bool = False):` Pushes a value of any type, provided that a function converting the type into a `String` is passed as a compile-time parameter. The `consider_escaping` argument acts as described above, but defaults to `False`.
- `fn push_empty(inout self):` Functionally the same as `push("")`.

Based on the provided number of columns, the pushed values are escaped, if needed and desired, and concatenated with `,` or `\r\n` according to RFC-4180. `fn fill_up_row(inout self):` fills up the current row with empty values if needed.
To get the CSV formatted data, the user needs to call `fn finish(owned self) -> String:`, which returns the desired string and destroys the builder. The `finish` method internally calls `fill_up_row` and appends `\r\n` to the end of the file, making sure that the resulting string is valid according to RFC-4180.
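The writing behaviour described above can be sketched in Python. This is an illustrative analogue and not the Mojo library's code: the names `SketchCsvBuilder` and `escape_field` are hypothetical, and the escaping rule shown (wrap in double quotes and double any embedded quote when the value contains `,`, `"`, CR or LF) is the one from RFC-4180, assumed here to match what `consider_escaping` checks for.

```python
# Illustrative Python analogue of the described builder behaviour.
# NOT the Mojo library's implementation; all names here are hypothetical.

SPECIAL_CHARS = (",", '"', "\r", "\n")


def escape_field(text):
    """Quote a field per RFC 4180 if it contains a special character."""
    if any(ch in text for ch in SPECIAL_CHARS):
        return '"' + text.replace('"', '""') + '"'
    return text


class SketchCsvBuilder:
    def __init__(self, column_count):
        self.column_count = column_count
        self._parts = []
        self._pushed = 0  # total number of values pushed so far

    def push(self, value, consider_escaping=True):
        text = str(value)
        if consider_escaping:
            text = escape_field(text)
        # A separator precedes every value except the very first one:
        # CRLF at the start of a new row, a comma otherwise.
        if self._pushed > 0:
            at_row_start = self._pushed % self.column_count == 0
            self._parts.append("\r\n" if at_row_start else ",")
        self._parts.append(text)
        self._pushed += 1

    def push_empty(self):
        self.push("", consider_escaping=False)

    def fill_up_row(self):
        # Pad the current row with empty values until it is complete.
        while self._pushed % self.column_count != 0:
            self.push_empty()

    def finish(self):
        self.fill_up_row()
        return "".join(self._parts) + "\r\n"


builder = SketchCsvBuilder(3)
for value in ("a", "b", "c", 1, 'say "hi"'):
    builder.push(value)
csv_text = builder.finish()
# csv_text == 'a,b,c\r\n1,"say ""hi""",\r\n'
```

Only five values are pushed into a three-column builder, so `finish` pads the second row with one empty value before appending the final `\r\n`.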
### Note:
Pushing string values with `consider_escaping` set to `True` is up to 10x slower, but ensures that the resulting CSV is valid. If the user is certain that the provided string does not contain special characters, they should set the `consider_escaping` argument to `False`.
## Reading CSV formatted data
To read a CSV string, the user needs to instantiate a `CsvTable` with the string. By default, `CsvTable` uses SIMD-based tokenization, which is about 20% faster than the non-SIMD one. However, the user can opt out of SIMD-based tokenization by setting the instantiation argument `with_simd` to `False`.
After the `CsvTable` is instantiated, the user can examine the number of columns and the number of rows by accessing the `column_count` field and calling the `fn row_count(self) -> Int:` method.
To get values from the table, the user can call the `fn get(self, row: Int, column: Int) -> String:` method, which returns the already unescaped string value.
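To make "already unescaped" concrete, here is a minimal Python sketch of a scalar (non-SIMD) RFC-4180 tokenizer with a table wrapper. It is a Python analogue and not the library's tokenizer: `parse_csv` and `SketchCsvTable` are hypothetical names, and taking the column count from the first row is an assumption made for the sketch.

```python
# Illustrative scalar tokenizer; NOT the Mojo library's implementation.
# All names here are hypothetical.

def parse_csv(text):
    """Split RFC 4180 text into rows of unescaped field strings."""
    rows, row, field = [], [], []
    in_quotes = False
    pending = False  # True once the current row has consumed any character
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"' and i + 1 < n and text[i + 1] == '"':
                field.append('"')  # doubled quote is a literal quote
                i += 1
            elif ch == '"':
                in_quotes = False  # closing quote
            else:
                field.append(ch)  # commas and line breaks are data here
            pending = True
        elif ch == "\r" or ch == "\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1  # treat CRLF as a single line break
            row.append("".join(field))
            rows.append(row)
            row, field, pending = [], [], False
        else:
            if ch == '"':
                in_quotes = True  # opening quote
            elif ch == ",":
                row.append("".join(field))
                field = []
            else:
                field.append(ch)
            pending = True
        i += 1
    if pending:  # last row without a trailing line break
        row.append("".join(field))
        rows.append(row)
    return rows


class SketchCsvTable:
    def __init__(self, text):
        self._rows = parse_csv(text)
        # Assumption for this sketch: the first row defines the column count.
        self.column_count = len(self._rows[0]) if self._rows else 0

    def row_count(self):
        return len(self._rows)

    def get(self, row, column):
        return self._rows[row][column]


table = SketchCsvTable('a,b,c\r\n1,"say ""hi""",\r\n')
value = table.get(1, 1)
# value == 'say "hi"' (surrounding quotes removed, doubled quotes collapsed)
```

The sketch stores every field eagerly as a Python string for clarity; it says nothing about how the library lays out or indexes its tokens.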
## Benchmarks
To evaluate the performance characteristics of the library, we provide two CSV examples (downloaded from https://www.stats.govt.nz/large-datasets/csv-files-for-download/, file names `Subnational-period-life-tables-2017-2019-CSV.csv` and `balance-of-payments-and-international-investment-position-june-2023-quarter.csv`).
Based on these files and the benchmark test we ran on an Apple M1 Mac mini, we expect the library to be able to parse/tokenize 1 GiB in under 3 seconds. Iterating over all values as strings should take under 3.5 seconds.
Writing 1 GiB of data without escaping consideration should take under 4 seconds and with escaping considerations under 35 seconds.
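As a reading aid, the time bounds above can be converted into minimum throughput figures. The short Python calculation below uses only the numbers quoted in this section; the labels and the helper name `min_throughput_mib_s` are made up for the example, and because the inputs are upper bounds on time, the results are lower bounds on throughput and not measured values.

```python
# Convert the quoted "1 GiB in under N seconds" bounds into MiB/s.
# Helper and label names are illustrative only.

MIB_PER_GIB = 1024

# Upper time bounds in seconds for processing 1 GiB, as quoted above.
bounds_seconds = {
    "parse/tokenize": 3.0,
    "iterate values as strings": 3.5,
    "write without escaping check": 4.0,
    "write with escaping check": 35.0,
}


def min_throughput_mib_s(seconds):
    """Lower bound on throughput implied by an upper bound on time."""
    return MIB_PER_GIB / seconds


for name, seconds in bounds_seconds.items():
    print(f"{name}: more than {min_throughput_mib_s(seconds):.1f} MiB/s")
# parse/tokenize: more than 341.3 MiB/s
# iterate values as strings: more than 292.6 MiB/s
# write without escaping check: more than 256.0 MiB/s
# write with escaping check: more than 29.3 MiB/s

escaping_slowdown = (
    bounds_seconds["write with escaping check"]
    / bounds_seconds["write without escaping check"]
)
# escaping_slowdown == 8.75
```

The ratio of the two write bounds, 8.75, is in line with the "up to 10x slower" figure given in the note on `consider_escaping`.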