https://github.com/mxmlnkn/indexed_bzip2
Fast parallel random access to bzip2 and gzip files in Python
- Host: GitHub
- URL: https://github.com/mxmlnkn/indexed_bzip2
- Owner: mxmlnkn
- License: apache-2.0
- Created: 2019-12-01T12:54:46.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2024-09-16T19:17:20.000Z (about 1 year ago)
- Last Synced: 2024-10-14T03:22:37.317Z (12 months ago)
- Topics: bzip2, cli, command-line, command-line-tool, cpp, cpp17-library, decompression, gzip, library, parallel, python, python-library, random-access
- Language: C++
- Homepage:
- Size: 30.7 MB
- Stars: 72
- Watchers: 6
- Forks: 2
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE-APACHE
- Citation: CITATION.cff
# Parallel Random Access to bzip2 and gzip
This repository contains the code for the [`indexed_bzip2`](python/indexed_bzip2) and [`rapidgzip`](python/rapidgzip) Python modules.
Both are built upon the same basic architecture to enable block-parallel decoding based on prefetching and caching.

# rapidgzip
This module provides:
- a `rapidgzip` command line tool for parallel decompression of gzip files with a similar command line interface to `gzip` so that it can be used as a replacement.
- a `rapidgzip.open` Python function for reading and seeking inside gzip files using multiple threads for a speedup of **21** over the built-in gzip module using a 12-core processor.

The random seeking support is similar to the one provided by [indexed_gzip](https://github.com/pauldmccarthy/indexed_gzip), and the parallel capabilities are effectively a working version of [pugz](https://github.com/Piezoid/pugz), which is only a proof of concept and only works with a limited subset of file contents, namely compressed files containing non-binary data (ASCII characters 0 to 127).

| Module | Bandwidth / (MB/s) | Speedup |
|-------------------------------------|--------------------|---------|
| gzip | 250 | 1 |
| rapidgzip with parallelization = 1 | 488 | 1.9 |
| rapidgzip with parallelization = 2 | 902 | 3.6 |
| rapidgzip with parallelization = 12 | 4463 | 17.7 |
| rapidgzip with parallelization = 24 | 5240 | 20.8 |

[See here for the extended Readme.](python/rapidgzip)
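
As a rough illustration of the reading and seeking described above, the following sketch opens a gzip file with `rapidgzip.open`, jumps to a decompressed offset, and reads a few bytes. The helper names `read_slice` and `write_sample` are hypothetical and not part of the module, and the fallback to the standard `gzip` module is only there so that the sketch also runs where rapidgzip is not installed.

```python
import gzip
import os

try:
    import rapidgzip  # optional dependency: pip install rapidgzip
except ImportError:
    rapidgzip = None  # fall back to the standard library below


def read_slice(path, offset, size):
    """Return up to `size` decompressed bytes starting at decompressed byte `offset`."""
    if rapidgzip is not None:
        # `parallelization` is the number of decoder threads, as in the table above.
        opened = rapidgzip.open(path, parallelization=os.cpu_count() or 1)
    else:
        opened = gzip.open(path, "rb")
    with opened as f:
        f.seek(offset)
        return f.read(size)


def write_sample(path, payload):
    """Hypothetical helper: create a small gzip file to experiment with."""
    with gzip.open(path, "wb") as f:
        f.write(payload)
```

Calling `write_sample("sample.gz", data)` followed by `read_slice("sample.gz", 1000, 16)` should return `data[1000:1016]`. The speedups in the table are only to be expected for large files; for a tiny sample like this, thread startup will likely dominate.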
There also exists a dedicated repository for rapidgzip [here](https://github.com/mxmlnkn/rapidgzip).
It was created for visibility reasons and in order to keep indexed_bzip2 and rapidgzip releases separate.
The main development will take place in [this](https://github.com/mxmlnkn/indexed_bzip2) repository while the rapidgzip repository will be updated at least for each release.
Issues regarding rapidgzip should be opened at [its repository](https://github.com/mxmlnkn/rapidgzip/issues).

A paper describing the implementation details and showing the scaling behavior with up to 128 cores has been submitted to and [accepted](https://www.hpdc.org/2023/program/technical-sessions/) in [ACM HPDC'23](https://www.hpdc.org/2023/), The 32nd International Symposium on High-Performance Parallel and Distributed Computing.
If you use this software for your scientific publication, please cite it as stated [here](python/rapidgzip#citation).
The author's version can be found [here]() and the accompanying presentation [here](results/Presentation-2023-06-22.pdf).

# indexed_bzip2
This module provides:
- an `ibzip2` command line tool to decompress bzip2 files in parallel with a similar command line interface to `bzip2` so that it can be used as a replacement.
- an `indexed_bzip2.open` Python function for reading and seeking inside bzip2 files using multiple threads for a speedup of **6** over the built-in `bz2` module using a 12-core processor.

The parallel decompression capabilities are similar to [lbzip2](https://lbzip2.org/) but with a more permissive license and with support for use as a library with random seeking capabilities similar to [seek-bzip2](https://github.com/galaxyproject/seek-bzip2).

| Module | Runtime / s | Bandwidth / (MB/s) | Speedup |
|-----------------------------------------|-------------|--------------------|---------|
| bz2 | 386 | 5.2 | 1 |
| indexed_bzip2 with parallelization = 1 | 472 | 4.2 | 0.8 |
| indexed_bzip2 with parallelization = 2 | 265 | 7.6 | 1.5 |
| indexed_bzip2 with parallelization = 12 | 64 | 31.4 | 6.1 |
| indexed_bzip2 with parallelization = 24 | 63 | 31.8 | 6.1 |

[See here for the extended Readme.](python/indexed_bzip2)
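
To make the random seeking concrete, here is a minimal sketch that fetches fixed-size records from a bzip2 file in arbitrary order by seeking to each decompressed offset. The function `read_records` and the fixed-size record layout are assumptions made for illustration only. The code falls back to the standard `bz2` module when `indexed_bzip2` is not installed, so the logic can be tried either way.

```python
import bz2
import os

try:
    import indexed_bzip2  # optional dependency: pip install indexed_bzip2
except ImportError:
    indexed_bzip2 = None  # fall back to the standard library below


def read_records(path, record_size, indices):
    """Return the fixed-size records at the given indices, in the given order."""
    if indexed_bzip2 is not None:
        # `parallelization` is the number of decoder threads, as in the table above.
        opened = indexed_bzip2.open(path, parallelization=os.cpu_count() or 1)
    else:
        opened = bz2.open(path, "rb")
    records = []
    with opened as f:
        for index in indices:
            # Offsets refer to positions in the decompressed stream.
            f.seek(index * record_size)
            records.append(f.read(record_size))
    return records
```

With the standard `bz2` module, every backward seek decompresses again from the start of the file, which is the cost that the seeking support described above is meant to avoid.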
# License
Licensed under either of
* Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.
### Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
additional terms or conditions.