
https://github.com/vadimkantorov/fasttsv

TSV parser for Python in pure vectorized NumPy code

numpy parsing simd tsv vectorized-code


# fasttsv
TSV parser for Python using only NumPy ops.

This is not production-ready code, just a primer on a branchless parsing technique using vectorized code.

#### TODO
- support strings
- support negative integers and floats

# Approach
1. Read the whole file into a byte array in memory
2. Find positions of tabs and decimal points
3. Compute digit count for every field
4. For the integer case, given the maximum number of digits in the file, precompute, for every possible digit count, the integer that ends at each position
5. For every field, use its digit count to index into the precomputed array of parsed integers
6. Assemble values for the real-valued columns: the integer and fractional parts are neighboring parsed integers
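
The steps above can be illustrated with a minimal sketch. It is not the repository's actual code: the function names `parse_uint_fields` and `parse_real_fields` are hypothetical, and the sketch assumes the input contains only non-negative numbers, has no empty fields, and ends with a newline. The only Python loop runs over digit counts (a small number), never over fields or bytes.

```python
import numpy as np

def parse_uint_fields(data: bytes):
    """Parse every run of digits in `data`; returns (values, digit_counts).

    Assumptions (illustrative, not from the original project): no empty
    fields, no signs, and the buffer ends with a separator such as a newline.
    """
    buf = np.frombuffer(data, dtype=np.uint8)
    is_digit = (buf >= ord("0")) & (buf <= ord("9"))
    digits = np.where(is_digit, buf.astype(np.int64) - ord("0"), 0)

    # Every non-digit byte (tab, newline, decimal point) terminates a field.
    sep = np.flatnonzero(~is_digit)
    starts = np.concatenate(([0], sep[:-1] + 1))
    counts = sep - starts            # digit count of each field
    ends = sep - 1                   # position of each field's last digit

    max_digits = int(counts.max())
    # table[k - 1, i] is the integer formed by the k bytes ending at position i.
    table = np.zeros((max_digits, buf.size), dtype=np.int64)
    table[0] = digits
    for k in range(1, max_digits):
        shifted = np.zeros_like(digits)
        shifted[k:] = digits[:-k]    # the digit k positions to the left
        table[k] = table[k - 1] + shifted * 10 ** k

    # Select, for each field, the entry matching its own digit count.
    return table[counts - 1, ends], counts

def parse_real_fields(data: bytes) -> np.ndarray:
    """Assemble reals, assuming every field has exactly one decimal point."""
    values, counts = parse_uint_fields(data)
    whole, frac = values[0::2], values[1::2]
    return whole + frac / 10.0 ** counts[1::2]

ints, _ = parse_uint_fields(b"12\t345\n6\t7890\n")
table_2x2 = ints.reshape(2, -1)      # two rows, because the input has two newlines
reals = parse_real_fields(b"3.25\t10.5\n")
```

The precomputed table takes `max_digits × file_size` integers, so this sketch trades memory for the absence of per-field branching.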

# Features, scope and limitations
1. Supports integer, real (only decimal point notation) and utf-8 string columns (quotes not supported)
2. Uses only NumPy operations, so it can be extended to GPU using JAX or PyTorch. It can also run on Pyodide.

# Further reading
For some truly fascinating vectorized parsing check out [simdjson](https://github.com/lemire/simdjson) and [csvmonkey](https://github.com/dw/csvmonkey).