# unzip-csv
High-performance (i.e. multi-threaded) unpacking and processing of CSV files directly from ZIP archives.

Projects using this library will benefit from:
* parallel unzip and processing of files (in streaming fashion)
* per-file adapters (based on file-name)
* hooks for pre- and post-processing
* cutting large files into segments for further parallel processing
* unzip specific files over-the-wire (with HTTP Range)
* parallel, on-demand download
* extract specific files without downloading the full archive

Cutting files into segments (splitting at newlines, as sketched below) assumes that

* line order is not important,
* newlines can be detected by reading the file backwards - this holds for popular character encodings like [UTF-8], ASCII and ISO-8859-1,
* CSV entries contain no line breaks, not even line breaks wrapped in quotes.
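
The boundary search itself is simple; the sketch below (plain Java with illustrative class and method names, not this library's internal code) picks segment offsets so that every segment ends on a complete line, by scanning forward from each target offset to the next newline.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Conceptual illustration only: compute segment boundaries that always fall just
// after a newline, so each segment contains whole lines.
public class SegmentBoundaries {

    public static List<Long> boundaries(RandomAccessFile file, long targetSegmentSize) throws IOException {
        List<Long> offsets = new ArrayList<>();
        offsets.add(0L);

        long length = file.length();
        long next = targetSegmentSize;
        while (next < length) {
            file.seek(next);
            int b;
            // advance to the first newline at or after the target offset
            while ((b = file.read()) != -1 && b != '\n') {
                // keep scanning
            }
            long boundary = file.getFilePointer(); // just past the newline (or end of file)
            if (boundary >= length) {
                break;
            }
            offsets.add(boundary);
            next = boundary + targetSegmentSize;
        }
        offsets.add(length); // final boundary
        return offsets;
    }
}
```

Each consecutive pair of offsets then delimits one segment that can be parsed on its own thread.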

If cutting into segments is not possible, a [parallel reader](src/main/java/com/github/skjolber/unzip/csv/ParallelReader.java) is an alternative: it decodes bytes to characters and parses in parallel.

Bugs, feature suggestions and help requests can be filed with the [issue-tracker].

## Obtain
The project is implemented in Java and built using [Maven]. It is available from the central Maven repository.

Example dependency config:

```xml
<dependency>
    <groupId>com.github.skjolber.unzip-csv</groupId>
    <artifactId>unzip-csv</artifactId>
    <version>1.0.10</version>
</dependency>
```

# Usage
The top-level `FileEntryHandler` is passed to the `ZipFileEngine`:

```java
FileEntryHandler handler = ...; // your code here

ZipFileEngine engine = new ZipFileEngine(handler);
boolean success = engine.handle(new FileZipFileFactory(file));
```
The default thread count is one thread per core.

## Pre- and post-processing
Override the `FileEntryHandler` methods

```java
void beginFileCollection(String name);
void beginFileEntry(String name);
void endFileEntry(String name, ThreadPoolExecutor executor);
void endFileCollection(String name, ThreadPoolExecutor executor);
FileEntryStreamHandler getFileEntryStreamHandler(String name, long size, ThreadPoolExecutor executor);
```

for pre- or post-processing. Call the super method wherever one exists. Note the `ThreadPoolExecutor` argument, which allows additional work to be queued.
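
For illustration, here is a rough sketch of a handler that wraps an existing `FileEntryHandler` and adds simple pre- and post-processing (logging) around the callbacks. The import paths are assumed from the repository layout, the method signatures are copied from the list above, and the real interface may declare additional methods or checked exceptions; adjust accordingly.

```java
import java.util.concurrent.ThreadPoolExecutor;

// assumed package names, based on the repository layout
import com.github.skjolber.unzip.FileEntryHandler;
import com.github.skjolber.unzip.FileEntryStreamHandler;

// Sketch: delegate to an existing handler and add pre-/post-processing around the callbacks.
// If the real interface declares more methods than listed above, forward those as well.
public class LoggingFileEntryHandler implements FileEntryHandler {

    private final FileEntryHandler delegate;

    public LoggingFileEntryHandler(FileEntryHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void beginFileCollection(String name) {
        System.out.println("Begin file collection " + name);
        delegate.beginFileCollection(name);
    }

    @Override
    public void beginFileEntry(String name) {
        System.out.println("Begin file entry " + name);
        delegate.beginFileEntry(name);
    }

    @Override
    public void endFileEntry(String name, ThreadPoolExecutor executor) {
        delegate.endFileEntry(name, executor);
        // the executor can be used to queue follow-up work here
        System.out.println("End file entry " + name);
    }

    @Override
    public void endFileCollection(String name, ThreadPoolExecutor executor) {
        delegate.endFileCollection(name, executor);
        System.out.println("End file collection " + name);
    }

    @Override
    public FileEntryStreamHandler getFileEntryStreamHandler(String name, long size, ThreadPoolExecutor executor) {
        return delegate.getFileEntryStreamHandler(name, size, executor);
    }
}
```

Such a wrapper can then be passed to the `ZipFileEngine` in place of the original handler.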

## Sesseltjonna CSV parser
Implement a `CsvLineHandlerFactory` that returns a specific handler for each file in the ZIP archive.

Then use the provided adapters to implement a `FileEntryHandler`, as in [this example](src/test/java/com/github/skjolber/unzip/TestSesselTjonnaCsvFileEntryHandler.java).

## Univocity CSV parser
Implement a `CsvLineHandlerFactory` that returns a specific handler for each file in the ZIP archive. Then create a `DefaultUnivocityCsvFileEntryHandler` and pass it to the `ZipFileEngine`.
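
A minimal wiring sketch, assuming the `DefaultUnivocityCsvFileEntryHandler` constructor accepts the `CsvLineHandlerFactory` (check the project sources for the exact signature):

```java
CsvLineHandlerFactory csvLineHandlerFactory = ...; // your code here

// assumed constructor argument; see the project sources for the exact signature
FileEntryHandler handler = new DefaultUnivocityCsvFileEntryHandler(csvLineHandlerFactory);

ZipFileEngine engine = new ZipFileEngine(handler);
boolean success = engine.handle(new FileZipFileFactory(file));
```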

# Details
The main performance-enhancing features are
* parallelization: unzip and process files and/or parts of files concurrently
* unzipping remote files while downloading only the necessary parts

ZIP files carry a `central directory` at the end of the file, detailing the name and location of each file within. Parallel unzipping works better if the archive is compressed using the DEFLATE compression algorithm - see Apache [commons-compress](https://commons.apache.org/proper/commons-compress/zip.html) for additional details.
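
To illustrate the over-the-wire idea in isolation (generic `java.net.http` code, not this library's API), a client can fetch just the tail of a remote archive with an HTTP Range request, read the central directory from it, and then request byte ranges for the individual entries it needs. The URL below is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RangeRequestExample {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // hypothetical URL, for illustration only
        URI uri = URI.create("https://example.com/feed.zip");

        // fetch only the last 64 KB, which normally contains the ZIP central directory
        HttpRequest tail = HttpRequest.newBuilder(uri)
                .header("Range", "bytes=-65536")
                .build();

        HttpResponse<byte[]> response = client.send(tail, HttpResponse.BodyHandlers.ofByteArray());

        // status 206 Partial Content means the server honoured the Range header
        System.out.println(response.statusCode() + ": " + response.body().length + " bytes");
    }
}
```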

## Benchmarks
For simple [GTFS feeds] with an archive size of approximately 70 MB, unzipping and processing file segments in parallel (parsing the CSV lines) is approximately 1.5x-2x faster on my laptop (4 cores + hyperthreading) than a linear unzip.

Depending on your scenario, processing file segments in parallel may yield a considerably larger speedup, improving on the above result.

# Contact
If you have any questions or comments, please email me at [email protected].

Feel free to connect with me on [LinkedIn]; see also my [Github page].
## License
[Apache 2.0]

## Links
Other high-performance CSV parsers:

* [SimpleFlatMapper](https://simpleflatmapper.org/)
* [Univocity-parsers](https://github.com/uniVocity/univocity-parsers)

# History
- 1.0.10: Maintenance release
- 1.0.9: Update CSV parsers
- 1.0.8: Automatic module name for JDK9+.
- 1.0.7: Add optional Parallel reader (from SimpleFlatMapper project).

[GTFS feeds]: https://www.entur.org/dev/rutedata/
[Apache 2.0]: http://www.apache.org/licenses/LICENSE-2.0.html
[issue-tracker]: https://github.com/skjolber/unzip-csv/issues
[Maven]: http://maven.apache.org/
[LinkedIn]: http://lnkd.in/r7PWDz
[Github page]: https://skjolber.github.io
[UTF-8]: https://stackoverflow.com/questions/22257486/iterate-backwards-through-a-utf8-multibyte-string