https://github.com/orottier/text-index
Scan a text file, build a sorted index, persist it, query it
- Host: GitHub
- URL: https://github.com/orottier/text-index
- Owner: orottier
- Created: 2019-03-03T18:52:58.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-03-26T17:33:56.000Z (over 5 years ago)
- Last Synced: 2024-10-31T11:35:36.392Z (15 days ago)
- Language: Rust
- Size: 48.8 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# text_index
Blazing fast CSV file indexing, persisting, and querying
## What?
This utility starts making sense when you are dealing with 20GB+ CSV files that need some manual pre-processing before moving into your database, or that you just want to tinker with.
You can build an index for any of your CSV columns, as text, integer, or float type. The index is stored on disk and typically takes about 3% of the original size for a text column. Parsing is performed with the excellent `csv` crate, processing about 800K records/sec (all threads combined) on my 2014 MacBook when indexing a 64-column file.
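To make the idea of a sorted column index concrete, here is a minimal sketch of one. It is not the tool's actual implementation: the `Entry` type and `build_int_index` function are hypothetical, it uses only the standard library with a naive comma split instead of the `csv` crate (so quoted fields containing commas are not handled), and it keeps everything in memory rather than persisting to disk.

```rust
use std::io::{self, BufRead};

/// Hypothetical index entry: the parsed key plus the byte offset at which
/// the record's line starts in the input.
#[derive(Debug, PartialEq)]
pub struct Entry {
    pub key: i64,
    pub offset: u64,
}

/// Scan `reader` line by line, parse column `column` (1-based, as in the CLI)
/// as an integer, and return the entries sorted by key.
/// Assumption: fields never contain quoted commas, so a plain split suffices.
/// Rows whose field is missing or not an integer (e.g. a header) are skipped.
pub fn build_int_index<R: BufRead>(mut reader: R, column: usize) -> io::Result<Vec<Entry>> {
    let field_idx = column.checked_sub(1).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "column numbers start at 1")
    })?;
    let mut entries = Vec::new();
    let mut offset = 0u64;
    let mut line = String::new();
    loop {
        line.clear();
        let bytes_read = reader.read_line(&mut line)?;
        if bytes_read == 0 {
            break; // end of input
        }
        let field = line
            .trim_end_matches(&['\r', '\n'][..])
            .split(',')
            .nth(field_idx);
        if let Some(Ok(key)) = field.map(|f| f.trim().parse::<i64>()) {
            entries.push(Entry { key, offset });
        }
        offset += bytes_read as u64;
    }
    // Sort by key; ties keep file order because offsets only increase.
    entries.sort_by_key(|e| (e.key, e.offset));
    Ok(entries)
}
```

Storing only a key and a byte offset per record is one plausible reason an index can be much smaller than the file it describes: a lookup returns offsets, and the full rows are read from the original file.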
Querying should typically take on the order of 100 ms for equality lookups (ranges are slower). The index is stored sharded, so lookup times should not increase dramatically with input size.
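The sketch below illustrates why sharding can keep lookups cheap as the input grows. The README does not describe the real on-disk shard layout, so the `Shard` type, `make_shards`, and `lookup_eq` are hypothetical and work in memory. An equality lookup first binary-searches the shard boundaries, then binary-searches only inside the shards that can contain the key.

```rust
/// Hypothetical shard: a run of (key, byte offset) pairs sorted by key.
/// Shards themselves are ordered by their first key.
pub struct Shard {
    pub first_key: i64,
    pub entries: Vec<(i64, u64)>,
}

/// Split an already sorted entry list into shards of at most `shard_size` entries.
pub fn make_shards(sorted: &[(i64, u64)], shard_size: usize) -> Vec<Shard> {
    sorted
        .chunks(shard_size.max(1))
        .map(|chunk| Shard {
            first_key: chunk[0].0,
            entries: chunk.to_vec(),
        })
        .collect()
}

/// Return the offsets of all entries whose key equals `key`.
pub fn lookup_eq(shards: &[Shard], key: i64) -> Vec<u64> {
    // The shard just before the first one starting at or after `key`
    // may still hold matching entries at its tail, so begin there.
    let start = shards
        .partition_point(|s| s.first_key < key)
        .saturating_sub(1);
    let mut offsets = Vec::new();
    for shard in &shards[start..] {
        if shard.first_key > key {
            break; // every later shard holds only larger keys
        }
        let lo = shard.entries.partition_point(|e| e.0 < key);
        let hi = shard.entries.partition_point(|e| e.0 <= key);
        offsets.extend(shard.entries[lo..hi].iter().map(|e| e.1));
    }
    offsets
}
```

With this layout, only the shards that overlap the requested key need to be touched. A persisted index could apply the same idea by loading just those shards from disk.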
## Usage
### Build the index
You can choose to index a column as text (str), integer (int) or floating point (float).
```
USAGE:
    text_index index <COLUMN> [TYPE]

OPTIONS:
    -t <THREADS>    Max number of THREADS
    -v              Verbose output (-v, -vv supported)

ARGS:
    <COLUMN>    Column number (starts at 1)
    <TYPE>      Type (str(default), int, float)
```
e.g. `text_index input.csv -t 4 index 1 str`
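The choice of type matters because it determines the sort order of the keys, and with it the result of range queries. The following sketch only illustrates that general point. The helper names are hypothetical, and how `text_index` itself orders typed keys is not documented here.

```rust
/// Sort the raw field values lexicographically, as a text (str) key would be.
pub fn sorted_as_str(values: &[&str]) -> Vec<String> {
    let mut keys: Vec<String> = values.iter().map(|v| v.to_string()).collect();
    keys.sort();
    keys
}

/// Parse the values as integers and sort numerically; unparsable values are dropped.
pub fn sorted_as_int(values: &[&str]) -> Vec<i64> {
    let mut keys: Vec<i64> = values
        .iter()
        .filter_map(|v| v.trim().parse::<i64>().ok())
        .collect();
    keys.sort_unstable();
    keys
}

/// Parse the values as floats and sort with a total order, so NaN cannot
/// break the comparison; unparsable values are dropped.
pub fn sorted_as_float(values: &[&str]) -> Vec<f64> {
    let mut keys: Vec<f64> = values
        .iter()
        .filter_map(|v| v.trim().parse::<f64>().ok())
        .collect();
    keys.sort_by(|a, b| a.total_cmp(b));
    keys
}
```

For the values `9`, `10`, and `100`, text ordering yields `10`, `100`, `9`, while integer ordering yields `9`, `10`, `100`. A query such as `gt 50` therefore only behaves numerically on an index keyed by a numeric type.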
### Query the index
```
USAGE:
    text_index filter <COLUMN> <OPERATOR> <VALUE> [VALUE2]

ARGS:
    <COLUMN>      Column number (starts at 1)
    <OPERATOR>    Operator (eq, lt, le, gt, ge, in, pre (starts with))
    <VALUE>       Value
    <VALUE2>      Value2 (when operator is `in`)
```
e.g. `text_index input.csv filter 1 eq "search_string"`
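Each of the operators listed above selects one contiguous range of a sorted index, so a query can be answered with binary searches instead of a full scan. The sketch below shows this for text keys held in memory. The `Op` enum and `matching_range` are hypothetical names, and treating `in` as a range that includes both bounds is an assumption, since the usage text does not say.

```rust
use std::ops::Range;

/// Operators from the `filter` usage text. `In` carries its upper bound (Value2).
pub enum Op<'a> {
    Eq,
    Lt,
    Le,
    Gt,
    Ge,
    In(&'a str),
    Pre,
}

/// Return the index range of `keys` (sorted ascending) that matches `op value`.
pub fn matching_range(keys: &[String], op: &Op, value: &str) -> Range<usize> {
    let lower = keys.partition_point(|k| k.as_str() < value); // first key >= value
    let upper = keys.partition_point(|k| k.as_str() <= value); // first key > value
    match op {
        Op::Eq => lower..upper,
        Op::Lt => 0..lower,
        Op::Le => 0..upper,
        Op::Gt => upper..keys.len(),
        Op::Ge => lower..keys.len(),
        Op::In(value2) => {
            // Assumption: both bounds are inclusive.
            let end = keys.partition_point(|k| k.as_str() <= *value2);
            lower..end.max(lower) // empty range if value2 < value
        }
        Op::Pre => {
            // Keys sharing a prefix sit together, directly at or after the prefix itself.
            let end = keys.partition_point(|k| k.as_str() < value || k.starts_with(value));
            lower..end
        }
    }
}
```

Equality needs two binary searches and returns a short range, whereas range operators can return a large share of the index. That is consistent with the note that range queries are slower.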
## The future
- Support more text file formats, such as newline-delimited JSON or log files
- Multithreaded querying
- Support gzipped input files
- Swap friendly indexing (limit memory usage)