https://github.com/mostlygeek/go-csv-gz-test

How can we filter .csv.gz files?

Host: GitHub
URL: https://github.com/mostlygeek/go-csv-gz-test
Owner: mostlygeek
Created: 2018-05-02T18:41:03.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2018-05-03T17:59:48.000Z (about 8 years ago)
Last Synced: 2025-04-14T17:12:26.903Z (about 1 year ago)
Language: Go
Size: 7.81 KB
Stars: 5
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

README

This was a fun little team project to see how fast we can filter
S3 inventory .csv.gz files.

## Other implementations:

* [mythmon's Rust implementation](https://github.com/mythmon/rust-gz-csv-test)
* [peterbe's Python implementation](https://gist.github.com/peterbe/f147fd093aef43304a5c7e0a89c1ea0a) + [blog](https://www.peterbe.com/plog/fastest-python-datetime-parser)

## Usage

```
# get some working data, downloads 1GB from S3 into testdata/ subdirectory
> ./download.sh

# Processing one file at a time
> go run ./filter.go

# Processing in parallel (workers = num cpus)
> GOPAR=1 go run ./filter.go
```

## My results (on my late 2017 13" MBP)

```
Strategy: One file at a time ...
Total: 31521045, Matched: 710093, Ratio: 2.25%
Time: 52.740166887s
```

```
Strategy: Parallel, 4 Workers ...
Total: 31521045, Matched: 710093, Ratio: 2.25%
Time: 27.207802611s
```
