{"id":16755448,"url":"https://github.com/mxmlnkn/indexed_bzip2","last_synced_at":"2025-03-21T19:11:50.430Z","repository":{"id":37555265,"uuid":"225161274","full_name":"mxmlnkn/indexed_bzip2","owner":"mxmlnkn","description":"Fast parallel random access to bzip2 and gzip files in Python","archived":false,"fork":false,"pushed_at":"2024-09-16T19:17:20.000Z","size":32182,"stargazers_count":72,"open_issues_count":3,"forks_count":2,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-10-14T03:22:37.317Z","etag":null,"topics":["bzip2","cli","command-line","command-line-tool","cpp","cpp17-library","decompression","gzip","library","parallel","python","python-library","random-access"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mxmlnkn.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-12-01T12:54:46.000Z","updated_at":"2024-09-27T08:20:27.000Z","dependencies_parsed_at":"2024-01-06T20:30:30.332Z","dependency_job_id":"b70640f1-b983-4af5-a4e2-9e592efccae6","html_url":"https://github.com/mxmlnkn/indexed_bzip2","commit_stats":null,"previous_names":[],"tags_count":43,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxmlnkn%2Findexed_bzip2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxmlnkn%2Findexed_bzip2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxmlnkn%2Findexed_bzip2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxmlnkn%2Findexed_bzip2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mxmlnkn","download_url":"https://codeload.github.com/mxmlnkn/indexed_bzip2/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244852671,"owners_count":20521154,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bzip2","cli","command-line","command-line-tool","cpp","cpp17-library","decompression","gzip","library","parallel","python","python-library","random-access"],"created_at":"2024-10-13T03:22:33.569Z","updated_at":"2025-03-21T19:11:50.407Z","avatar_url":"https://github.com/mxmlnkn.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# Parallel Random Access to bzip2 and gzip\n\n[![License](https://img.shields.io/badge/license-MIT-blue.svg)](http://opensource.org/licenses/MIT)\n[![C++ Code Checks](https://github.com/mxmlnkn/indexed_bzip2/actions/workflows/test-cpp.yml/badge.svg)](https://github.com/mxmlnkn/indexed_bzip2/actions/workflows/test-cpp.yml)\n[![codecov](https://codecov.io/gh/mxmlnkn/indexed_bzip2/branch/master/graph/badge.svg?token=94ZD4UTZQW)](https://codecov.io/gh/mxmlnkn/indexed_bzip2)\n![C++17](https://img.shields.io/badge/C++-17-blue.svg)\n[![Discord](https://img.shields.io/discord/783411320354766878?label=discord)](https://discord.gg/Wra6t6akh2)\n[![Telegram](https://img.shields.io/badge/Chat-Telegram-%2330A3E6)](https://t.me/joinchat/FUdXxkXIv6c4Ib8bgaSxNg)\n\n\u003c/div\u003e\n\nThis repository contains the code for the [`indexed_bzip2`](python/indexed_bzip2) and [`rapidgzip`](python/rapidgzip) Python modules.\nBoth are built upon the same basic architecture to enable block-parallel decoding based on prefetching and caching.\n\n\u003cdiv align=\"center\"\u003e\n\n# rapidgzip\n\n[![Changelog](https://img.shields.io/badge/Changelog-Markdown-blue)](https://github.com/mxmlnkn/indexed_bzip2/blob/master/python/rapidgzip/CHANGELOG.md)\n[![PyPI version](https://badge.fury.io/py/rapidgzip.svg)](https://badge.fury.io/py/rapidgzip)\n[![Python Version](https://img.shields.io/pypi/pyversions/rapidgzip)](https://pypi.org/project/rapidgzip/)\n[![PyPI Platforms](https://img.shields.io/badge/pypi-linux%20%7C%20macOS%20%7C%20Windows-brightgreen)](https://pypi.org/project/rapidgzip/)\n[![Downloads](https://static.pepy.tech/badge/rapidgzip/month)](https://pepy.tech/project/rapidgzip)\n\n![](https://raw.githubusercontent.com/mxmlnkn/indexed_bzip2/master/results/asciinema/rapidgzip-comparison.gif)\n\n\u003c/div\u003e\n\nThis module provides: \n - a `rapidgzip` command line tool for parallel decompression of gzip files with a similar command line interface to `gzip` so that it can be used as a replacement.\n - a `rapidgzip.open` Python method for reading and seeking inside gzip files using multiple threads for a speedup of **21** over the built-in gzip module using a 12-core processor.\n\nThe random seeking support is similar to the one provided by [indexed_gzip](https://github.com/pauldmccarthy/indexed_gzip) and the parallel capabilities are effectively a working version of [pugz](https://github.com/Piezoid/pugz), which is only a concept and only works with a limited subset of file contents, namely non-binary (ASCII characters 0 to 127) compressed files.\n\n| Module                              | Bandwidth / (MB/s) | Speedup |\n|-------------------------------------|--------------------|---------|\n| gzip                                |  250               |  1      |\n| rapidgzip with parallelization = 1  |  488               |  1.9    |\n| rapidgzip with parallelization = 2  |  902               |  3.6    |\n| rapidgzip with parallelization = 12 | 4463               | 17.7    |\n| rapidgzip with parallelization = 24 | 5240               | 20.8    |\n\n[See here for the extended Readme.](python/rapidgzip)\n\nThere also exists a dedicated repository for rapidgzip [here](https://github.com/mxmlnkn/rapidgzip).\nIt was created for visibility reasons and in order to keep indexed_bzip2 and rapidgzip releases separate.\nThe main development will take place in [this](https://github.com/mxmlnkn/indexed_bzip2) repository while the rapidgzip repository will be updated at least for each release.\nIssues regarding rapidgzip should be opened at [its repository](https://github.com/mxmlnkn/rapidgzip/issues).\n\nA paper describing the implementation details and showing the scaling behavior with up to 128 cores has been submitted to and [accepted](https://www.hpdc.org/2023/program/technical-sessions/) in [ACM HPDC'23](https://www.hpdc.org/2023/), The 32nd International Symposium on High-Performance Parallel and Distributed Computing.\nIf you use this software for your scientific publication, please cite it as stated [here](python/rapidgzip#citation).\nThe author's version can be found [here](\u003cresults/paper/Knespel, Brunst - 2023 - Rapidgzip - Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching.pdf\u003e) and the accompanying presentation [here](results/Presentation-2023-06-22.pdf).\n\n\n\u003cdiv align=\"center\"\u003e\n\n# indexed_bzip2\n\n[![Changelog](https://img.shields.io/badge/Changelog-Markdown-blue)](https://github.com/mxmlnkn/indexed_bzip2/blob/master/python/indexed_bzip2/CHANGELOG.md)\n[![PyPI version](https://badge.fury.io/py/indexed-bzip2.svg)](https://badge.fury.io/py/indexed-bzip2)\n[![Python Version](https://img.shields.io/pypi/pyversions/indexed_bzip2)](https://pypi.org/project/indexed-bzip2/)\n[![PyPI Platforms](https://img.shields.io/badge/pypi-linux%20%7C%20macOS%20%7C%20Windows-brightgreen)](https://pypi.org/project/indexed-bzip2/)\n[![Downloads](https://static.pepy.tech/badge/indexed-bzip2/month)](https://pepy.tech/project/indexed-bzip2)\n\u003cbr\u003e\n[![Conda Platforms](https://img.shields.io/conda/v/conda-forge/indexed_bzip2?color=brightgreen)](https://anaconda.org/conda-forge/indexed_bzip2)\n[![Conda Platforms](https://img.shields.io/conda/pn/conda-forge/indexed_bzip2?color=brightgreen)](https://anaconda.org/conda-forge/indexed_bzip2)\n\n\u003c/div\u003e\n\nThis module provides:\n  - an `ibzip2` command line tool to decompress bzip2 files in parallel with a similar command line interface to `bzip2` so that it can be used as a replacement.\n  - an `ibzip2.open` Python method for reading and seeking inside bzip2 files using multiple threads for a speedup of **6** over the built-in bzip2 module using a 12-core processor.\n\nThe parallel decompression capabilities are similar to [lbzip2](https://lbzip2.org/) but with a more permissive license and with support to be used as a library with random seeking capabilities similar to [seek-bzip2](https://github.com/galaxyproject/seek-bzip2).\n\n| Module                                  | Runtime / s | Bandwidth / (MB/s) | Speedup |\n|-----------------------------------------|-------------|--------------------|---------|\n| bz2                                     | 386         |  5.2               | 1       |\n| indexed_bzip2 with parallelization = 1  | 472         |  4.2               | 0.8     |\n| indexed_bzip2 with parallelization = 2  | 265         |  7.6               | 1.5     |\n| indexed_bzip2 with parallelization = 12 |  64         | 31.4               | 6.1     |\n| indexed_bzip2 with parallelization = 24 |  63         | 31.8               | 6.1     |\n\n[See here for the extended Readme.](python/indexed_bzip2)\n\n\n# License\n\nLicensed under either of\n\n * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)\n * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)\n\nat your option.\n\n### Contribution\n\nUnless you explicitly state otherwise, any contribution intentionally submitted\nfor inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any\nadditional terms or conditions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmxmlnkn%2Findexed_bzip2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmxmlnkn%2Findexed_bzip2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmxmlnkn%2Findexed_bzip2/lists"}