https://github.com/vadimkantorov/fasttsv
TSV parser for Python in pure vectorized NumPy code
- Host: GitHub
- URL: https://github.com/vadimkantorov/fasttsv
- Owner: vadimkantorov
- Created: 2019-06-06T10:53:02.000Z
- Default Branch: master
- Last Pushed: 2019-06-15T12:25:01.000Z
- Last Synced: 2024-11-13T10:48:16.422Z
- Topics: numpy, parsing, simd, tsv, vectorized-code
- Language: Python
- Homepage:
- Size: 24.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# fasttsv
TSV parser for Python using only NumPy ops. This is not production-ready code, just a primer on a branchless parsing technique using vectorized code.
#### TODO
- support strings
- support negative integers and floats
# Approach
1. Read the whole file into a byte array in memory
2. Find positions of tabs and decimal points
3. Compute digit count for every field
4. For the integer case, given the maximum number of digits of any field in the file, precompute the parsed integer ending at each position for every possible digit count
5. For every field, use its computed digit count to index into the array of precomputed parsed integers
6. Assemble values for the real-valued columns: the integral and fractional parts are neighboring parsed integers
# Features, scope and limitations
1. Supports integer, real (only decimal point notation) and utf-8 string columns (quotes not supported)
2. Uses only NumPy methods, and can be extended to GPU using JAX or PyTorch. It can also run on Pyodide.
# Further reading
For some truly fascinating vectorized parsing check out [simdjson](https://github.com/lemire/simdjson) and [csvmonkey](https://github.com/dw/csvmonkey).
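# Illustrative sketch
The steps listed under Approach can be sketched in a few lines of NumPy. This is a minimal illustration written for this README, not the repository's actual code: the names `parse_uint_table`, `parse_real_table`, `_parse_fields`, `_as_buffer` and `_num_columns` are hypothetical. It assumes ASCII digits only, non-negative values, a rectangular table, fields of at most 18 digits, and, for the real-valued case, exactly one decimal point in every field. The only Python loop runs over digit counts, never over bytes or fields.

```python
import numpy as np

TAB, NEWLINE, POINT, ZERO = 9, 10, 46, 48  # ASCII codes


def _as_buffer(data):
    # Step 1: the whole input as a byte array (a trailing newline is assumed/added).
    if not data.endswith(b"\n"):
        data += b"\n"
    return np.frombuffer(data, dtype=np.uint8)


def _num_columns(buf):
    # Assumption: every row has as many tabs as the first row.
    first_newline = int(np.flatnonzero(buf == NEWLINE)[0])
    return int(np.count_nonzero(buf[:first_newline] == TAB)) + 1


def _parse_fields(buf, is_sep):
    """Return (values, ndigits) for the digit run terminated by each separator."""
    sep_pos = np.flatnonzero(is_sep)                      # step 2: separator positions
    starts = np.concatenate(([0], sep_pos[:-1] + 1))
    ndigits = sep_pos - starts                            # step 3: digit count per field
    max_digits = int(ndigits.max())
    digits = np.where(is_sep, 0, buf.astype(np.int64) - ZERO)
    n = len(buf)
    # Left-pad with zeros so that looking k bytes back never runs off the array.
    padded = np.concatenate((np.zeros(max_digits, dtype=np.int64), digits))
    # Step 4: table[k, p] is the integer formed by the k bytes ending at position p.
    table = np.zeros((max_digits + 1, n), dtype=np.int64)
    for k in range(1, max_digits + 1):                    # loop over digit counts only
        lo = max_digits - (k - 1)
        table[k] = table[k - 1] + padded[lo:lo + n] * 10 ** (k - 1)
    # Step 5: pick, for each field, the entry matching its own digit count.
    return table[ndigits, sep_pos - 1], ndigits


def parse_uint_table(data):
    buf = _as_buffer(data)
    values, _ = _parse_fields(buf, (buf == TAB) | (buf == NEWLINE))
    return values.reshape(-1, _num_columns(buf))


def parse_real_table(data):
    # Step 6: treating '.' as a separator makes the integral and fractional
    # parts of each field neighboring parsed integers.
    buf = _as_buffer(data)
    is_sep = (buf == TAB) | (buf == NEWLINE) | (buf == POINT)
    values, ndigits = _parse_fields(buf, is_sep)
    whole, frac = values[0::2], values[1::2]
    reals = whole + frac / 10.0 ** ndigits[1::2]
    return reals.reshape(-1, _num_columns(buf))
```

For example, `parse_uint_table(b"1\t23\t456\n7890\t12\t3\n")` yields a 2x3 integer array. Note that the lookup table in this sketch holds `max_digits + 1` rows of the file's length, so it trades memory for the absence of branches.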