{"id":50659226,"url":"https://github.com/ftosoni/green-compressed-storage","last_synced_at":"2026-06-08T01:07:55.206Z","repository":{"id":345137382,"uuid":"1091628859","full_name":"ftosoni/green-compressed-storage","owner":"ftosoni","description":"An energy-optimised, high-density RocksDB solution for massive source code archives, leveraging Pareto-optimal compression to maximise throughput and energy efficiency.","archived":false,"fork":false,"pushed_at":"2026-03-17T20:22:54.000Z","size":31854,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-18T09:48:45.572Z","etag":null,"topics":["green-software","key-value-stores","large-data-management","lossless-compression","source-code-archival"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ftosoni.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-07T09:35:06.000Z","updated_at":"2026-03-17T20:22:59.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ftosoni/green-compressed-storage","commit_stats":null,"previous_names":["ftosoni/green-compressed-storage"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/ftosoni/green-compressed-storage","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ftosoni%2Fgreen-compressed-storage","tags_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ftosoni%2Fgreen-compressed-storage/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ftosoni%2Fgreen-compressed-storage/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ftosoni%2Fgreen-compressed-storage/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ftosoni","download_url":"https://codeload.github.com/ftosoni/green-compressed-storage/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ftosoni%2Fgreen-compressed-storage/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34043838,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-07T02:00:07.652Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["green-software","key-value-stores","large-data-management","lossless-compression","source-code-archival"],"created_at":"2026-06-08T01:07:54.414Z","updated_at":"2026-06-08T01:07:55.196Z","avatar_url":"https://github.com/ftosoni.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Green Compressed Storage\n\nAn energy-optimised key-value store for massive source code archives, delivering Pareto-optimal compression with order-of-magnitude gains in throughput and energy efficiency.\n\n## 
📝 Short Description\n\n**Green Compressed Storage** is an innovative, energy-aware key-value store designed to handle massive source code datasets. Built on **RocksDB**, it specialises in optimising the trade-off between space, time, and energy consumption. It achieves superior data density and high-speed retrieval by utilising finely tuned **zstd** configurations, making it ideal for large-scale archival and analysis of code.\n\n
## 📋 Prerequisites\n\nThe following dependencies are required to compile and run the project. The commands below assume a Debian- or Ubuntu-based system with `apt`.\n\n### Compilation Tools\n\nInstall a C++ compiler (Clang or G++) and CMake.\n\n```bash\nsudo apt install clang g++ gcc cmake\n```\n\n### Energy Profiling Tools (Perf)\n\nInstall the `perf` performance analysis tool suite, which is used for energy profiling.\n\n```bash\nsudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`\n```\n\n### Arrow\n\nThe project utilises Apache Arrow for handling Parquet data. The Arrow C++ libraries must be installed for compilation. If you have the official Arrow repository configured, use the following commands:\n\n```bash\nsudo apt update\nsudo apt install -y libarrow-dev libparquet-dev\n```\n\n### Python Dependencies\n\nThe supporting scripts require Python with the `pandas` and `pyarrow` libraries.\n\n```bash\nsudo apt install python3 python3-pip\npip3 install pandas pyarrow\n```\n\n
## 🛠️ Build Instructions\n\nThe project uses **CMake** for building. Ensure you have `cmake` and a C++ compiler installed.\n\n1.  **Fetch the submodules, create a build directory, and run CMake:**\n\n    ```bash\n    git submodule update --init --recursive\n    mkdir build\n    cd build\n    cmake ..\n    ```\n\n2.  **Compile the project:**\n\n    ```bash\n    make\n    ```\n\nThe main executable, `green-compressed-storage`, will be located in the build directory: `./build/` if you followed the steps above, or a directory like `./cmake-build-release/` depending on your build configuration.\n\n---\n\n
## 🏃 Example Run\n\nThis section demonstrates how to generate test keys, build the database, and execute single-get retrieval tests.\n\nThe `sample_data` directory contains sample code files (extracted from the [MediaWiki repository](https://www.mediawiki.org/wiki/Download)) in Parquet format for testing purposes. The Parquet file has two columns: `inverted_filepath` (key) and `content` (source code). This example uses the first as the key and the second as the value in the database.\n\n
### 1. Generate Query Keys\n\nFirst, **generate the key sample** for the uniform and power-law (Zipfian-like) distributions, as described in the accompanying paper. The sample keys will be written to a new Parquet file in the `sample_data` directory.\n\nThe command below samples 100 keys for uniform single-gets and 100 for power-law single-gets (named `single-get-zipf` in the code).\n\n```bash\ncd scripts\npython3 -u generate_query_data_shuffle.py ../sample_data/mediawiki10k.parquet inverted_filepath 0.0 42\ncd ..\n```\n\n\u003e **Note**: The parameter `0.0` ensures all selected keys are for retrieval, and `42` is the random seed for key selection.\n\n
### 2. Build the Database (Insert Operation)\n\nExecute the primary application to insert the data into the key-value store. This example uses Zstandard at compression level 6 with a block size of 64 KiB (65536 bytes).\n\n```bash\n# Run from the project root, so that PROJECT_PATH resolves to the repository directory\nPROJECT_PATH=$(pwd)\nEXECUTABLE_PATH=\"./build/green-compressed-storage\"\n\n$EXECUTABLE_PATH \\\n    --parquetfile=$PROJECT_PATH/sample_data/mediawiki10k \\\n    --db-path=$PROJECT_PATH/zstd_6_65536 \\\n    --key-column=inverted_filepath \\\n    --compression=zstd \\\n    --compression-level=6 \\\n    --block-size=65536 \\\n    --run-test=insert \\\n    --sampling-rate-zipf=1.5 \\\n    --sampling-rate=1.0 \\\n    --probability=0.0\n```\n\n
### 3. Run Retrieval Tests (Single-Gets)\n\nExecute single-get retrieval tests using the generated key sample.\n\n**Uniformly Distributed Keys:**\n\n```bash\n$EXECUTABLE_PATH \\\n    --parquetfile=$PROJECT_PATH/sample_data/mediawiki10k-s42 \\\n    --db-path=$PROJECT_PATH/zstd_6_65536 \\\n    --key-column=inverted_filepath \\\n    --compression=zstd \\\n    --compression-level=6 \\\n    --block-size=65536 \\\n    --run-test=single-get \\\n    --nt=0 \\\n    --sampling-rate-zipf=1.5 \\\n    --sampling-rate=1.0 \\\n    --probability=0.0\n```\n\n**Power-Law Distributed Keys:**\n\nTo test retrieval with power-law-distributed keys, substitute `--run-test=single-get` with **`--run-test=single-get-zipf`** in the command above.\n\n\u003e **Additional Test Modes:** You may also try **`--run-test=multi-get`** and **`--run-test=multi-get-zipf`** for multi-key retrieval tests.\n\n
### 4. Profiling Energy Consumption\n\nTo profile package-level energy consumption, ensure the **Perf suite** is installed on your system, then prepend `perf stat -a -e power/energy-pkg/` to the command of each test. The `power/energy-pkg/` event relies on RAPL energy counters, so it is only available on CPUs that expose them, and system-wide measurement (`-a`) typically requires root privileges.\n\n
## Citation\n\nIf you use this software or the data from our experiments in your research, please cite our paper:\n\n```bibtex\n@inproceedings{ferragina2026energy,\n  author    = {Ferragina, Paolo and Tosoni, Francesco},\n  title     = {The Energy-Throughput Trade-off in Lossless-Compressed Source Code Storage},\n  booktitle = {2026 IEEE International Conference on Software Analysis, Evolution and Reengineering - Companion (SANER-C)},\n  year      = {2026},\n  pages     = {157--164},\n  doi       = {10.1109/SANER-C67878.2026.00027},\n  publisher = {IEEE}\n}\n```\n\n
## Acknowledgements\n\nThis work was supported by the L'EMbeDS Department at the Sant'Anna School of Advanced Studies, Pisa, Italy. All the computations presented in the paper were performed using the GRICAD infrastructure ([https://gricad.univ-grenoble-alpes.fr](https://gricad.univ-grenoble-alpes.fr)), which is supported by Grenoble research communities. We thank SOS Gricad and the Software Heritage team for valuable insights, suggestions, and continuous support for our work.\n\n
## Licence\n\nThis project is licensed under the Apache 2.0 Licence; see the [LICENSE.md](LICENSE.md) file for details.\n\n-----\n\n*For any questions or further information regarding the experiments or the compressed key-value store design, please contact the authors at the Sant'Anna School of Advanced Studies.*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fftosoni%2Fgreen-compressed-storage","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fftosoni%2Fgreen-compressed-storage","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fftosoni%2Fgreen-compressed-storage/lists"}