https://github.com/ftosoni/green-compressed-storage
An energy-optimised, high-density RocksDB solution for massive source code archives, leveraging Pareto-optimal compression to maximise throughput and energy efficiency.
- Host: GitHub
- URL: https://github.com/ftosoni/green-compressed-storage
- Owner: ftosoni
- License: apache-2.0
- Created: 2025-11-07T09:35:06.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2026-03-17T20:22:54.000Z (3 months ago)
- Last Synced: 2026-03-18T09:48:45.572Z (3 months ago)
- Topics: green-software, key-value-stores, large-data-management, lossless-compression, source-code-archival
- Language: C++
- Homepage:
- Size: 30.4 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# Green Compressed Storage
An energy-optimised key-value store for massive source code archives, delivering Pareto-optimal compression with order-of-magnitude gains in throughput and energy efficiency.
## 📝 Short Description
**Green Compressed Storage** is an innovative, energy-aware key-value store designed to handle massive source code datasets. Built on **RocksDB**, it specialises in optimising the trade-off between space, time, and energy consumption. It achieves superior data density and high-speed retrieval by utilising finely-tuned **zstd** configurations, making it ideal for large-scale archival and analysis of code.
## 📋 Prerequisites
The following dependencies are required to compile and run the project.
Install the C++ compilers (Clang and G++), CMake, and the `perf` performance analysis tools.
### Compilation Tools
```bash
sudo apt install clang g++ gcc cmake
```
### Energy Profiling Tools (Perf)
```bash
sudo apt install linux-tools-common linux-tools-generic "linux-tools-$(uname -r)"
```
### Arrow
The project utilises Apache Arrow for handling Parquet data. The Arrow C++ libraries must be installed for compilation. If you have the official Arrow repository configured, use the following commands:
```bash
sudo apt update
sudo apt install -y libarrow-dev libparquet-dev
```
### Python Dependencies
The supporting scripts require Python 3 together with the `pandas` and `pyarrow` packages.
```bash
sudo apt install python3 python3-pip
pip3 install pandas pyarrow
```
## 🛠️ Build Instructions
The project uses **CMake** for building. Ensure you have `cmake` and a C++ compiler installed.
1. **Initialise the submodules, create a build directory, and run CMake:**
```bash
git submodule update --init --recursive
mkdir build
cd build
cmake ..
```
2. **Compile the project:**
```bash
make
```
The main executable, `green-compressed-storage`, will be located in the directory where you ran CMake: `./build/` if you followed the steps above, or a directory such as `./cmake-build-release/` under other build configurations.
---
## 🏃 Example Run
This section demonstrates how to build the database, generate test keys, and execute single-get retrieval tests.
The `sample_data` directory contains dummy code files (extracted from the [MediaWiki repository](https://www.mediawiki.org/wiki/Download)) in Parquet format for testing purposes. The file has two columns: `inverted_filepath` and `content` (source code). In this example, `inverted_filepath` is used as the key and `content` as the value in the database.
### 1. Generate Query Keys
First, you must **generate the key sample** for the uniform and power-law (Zipfian-like) distributions, as described in the accompanying paper. The sample keys will be written to a new Parquet file in the `sample_data` directory.
The command below samples 100 keys for uniform single-gets and 100 for power-law single-gets (named `single-get-zipf` in the code).
```bash
cd scripts
python3 -u generate_query_data_shuffle.py ../sample_data/mediawiki10k.parquet inverted_filepath 0.0 42
cd ..
```
> **Note**: The parameter `0.0` ensures all selected keys are for retrieval, and `42` is the random seed for key selection.
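
The internals of the script are not shown here. As a rough illustration of how the two query distributions differ, the sketch below draws keys uniformly and with power-law weights using only the Python standard library. The function names, the rank-based weighting `1 / rank**s`, the exponent `s`, and the toy key list are all assumptions made for illustration; this is not the logic of `generate_query_data_shuffle.py`.

```python
import random
from collections import Counter


def sample_uniform(keys, n, seed):
    """Draw n keys with replacement, each key equally likely."""
    rng = random.Random(seed)
    return rng.choices(keys, k=n)


def sample_power_law(keys, n, seed, s=1.5):
    """Draw n keys with replacement; the key at rank r gets weight 1 / r**s.

    The exponent s is a hypothetical parameter chosen for illustration.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, len(keys) + 1)]
    return rng.choices(keys, weights=weights, k=n)


if __name__ == "__main__":
    # Toy keys standing in for values of the `inverted_filepath` column.
    keys = [f"key-{i:04d}" for i in range(1000)]
    uniform = sample_uniform(keys, 100, seed=42)
    skewed = sample_power_law(keys, 100, seed=42)
    print("distinct keys (uniform):  ", len(set(uniform)))
    print("distinct keys (power law):", len(set(skewed)))
    print("most requested (power law):", Counter(skewed).most_common(3))
```

Under a power-law workload a few keys receive most of the requests, so the number of distinct keys in the sample is typically much smaller than in the uniform case.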
### 2. Build the Database (Insert Operation)
Execute the primary application to insert the data into the key-value store. This example uses Zstandard with level 6 and a block size of 64 KiB.
```bash
# Define the path to your project root (replace with the actual path)
PROJECT_PATH=$(pwd)
EXECUTABLE_PATH="./build/green-compressed-storage"
$EXECUTABLE_PATH \
--parquetfile=$PROJECT_PATH/sample_data/mediawiki10k \
--db-path=$PROJECT_PATH/zstd_6_65536 \
--key-column=inverted_filepath \
--compression=zstd \
--compression-level=6 \
--block-size=65536 \
--run-test=insert \
--sampling-rate-zipf=1.5 \
--sampling-rate=1.0 \
--probability=0.0
```
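
To give a feel for why the block size matters, the following sketch compresses a buffer in independent blocks of different sizes and compares the resulting ratios. It is only an analogy: `zlib` from the Python standard library stands in for Zstandard, the input is synthetic text rather than the sample dataset, and the helper name `compressed_size` is hypothetical. How `--block-size` is applied inside the store is not shown in this README.

```python
import zlib


def compressed_size(data, block_size, level=6):
    """Total bytes after compressing data in independent blocks of block_size."""
    total = 0
    for start in range(0, len(data), block_size):
        total += len(zlib.compress(data[start:start + block_size], level))
    return total


if __name__ == "__main__":
    # Synthetic, repetitive "source code"; not taken from the sample dataset.
    lines = [f"int value_{i % 97} = compute(value_{i % 13}, {i});\n" for i in range(20000)]
    data = "".join(lines).encode()
    for block_size in (4 * 1024, 16 * 1024, 64 * 1024):
        size = compressed_size(data, block_size)
        print(f"block {block_size:>6} B -> ratio {len(data) / size:.2f}")
```

Larger blocks give the compressor more context and usually a better ratio, but a single lookup then has to decompress more data, which is the kind of space/time trade-off this project tunes.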
### 3. Run Retrieval Tests (Single-Gets)
Execute single-get retrieval tests using the generated key sample.
**Uniformly Distributed Keys:**
```bash
$EXECUTABLE_PATH \
--parquetfile=$PROJECT_PATH/sample_data/mediawiki10k-s42 \
--db-path=$PROJECT_PATH/zstd_6_65536 \
--key-column=inverted_filepath \
--compression=zstd \
--compression-level=6 \
--block-size=65536 \
--run-test=single-get \
--nt=0 \
--sampling-rate-zipf=1.5 \
--sampling-rate=1.0 \
--probability=0.0
```
**Power Law Distributed Keys:**
To test retrieval with power law-distributed keys, simply substitute `--run-test=single-get` with **`--run-test=single-get-zipf`** in the command above.
> **Additional Test Modes:** You may also try **`--run-test=multi-get`** and **`--run-test=multi-get-zipf`** for multi-key retrieval tests.
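
Since the retrieval modes differ only in the `--run-test` value, the commands can be generated rather than copied by hand. The sketch below assembles the argument list from the flags used in this section. The helper name `build_command`, the `zstd_<level>_<block-size>` directory naming (inferred from `zstd_6_65536`), and the idea that the multi-get modes accept the same arguments as the single-get example are assumptions. Nothing is executed unless you pass the list to `subprocess.run` yourself.

```python
from pathlib import Path

RETRIEVAL_MODES = ("single-get", "single-get-zipf", "multi-get", "multi-get-zipf")


def build_command(project_path, run_test, level=6, block_size=64 * 1024,
                  executable="./build/green-compressed-storage"):
    """Return the argument list for one test run, mirroring the commands above."""
    root = Path(project_path)
    # The insert step reads the full dataset; retrieval reads the sampled keys.
    dataset = "mediawiki10k" if run_test == "insert" else "mediawiki10k-s42"
    db_dir = f"zstd_{level}_{block_size}"  # assumed naming convention
    args = [
        executable,
        f"--parquetfile={root / 'sample_data' / dataset}",
        f"--db-path={root / db_dir}",
        "--key-column=inverted_filepath",
        "--compression=zstd",
        f"--compression-level={level}",
        f"--block-size={block_size}",
        f"--run-test={run_test}",
    ]
    if run_test != "insert":
        args.append("--nt=0")  # appears only in the retrieval example above
    args += ["--sampling-rate-zipf=1.5", "--sampling-rate=1.0", "--probability=0.0"]
    return args


if __name__ == "__main__":
    for mode in RETRIEVAL_MODES:
        print(" ".join(build_command(".", mode)))
```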
### 4. Profiling Energy Consumption
To profile package-level energy consumption, make sure the **Perf suite** is installed, then prepend `perf stat -a -e power/energy-pkg/` to each test command. The `-a` flag collects system-wide counts, and the `power/energy-pkg/` event reports the energy drawn by the CPU package in Joules.
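
`perf stat` prints its summary to standard error, with the energy reading on a line ending in `Joules power/energy-pkg/`. As a sketch of how such a report could be post-processed, the parser below extracts the Joules and the elapsed seconds and derives the average power. The function name is hypothetical, and the sample text is a made-up placeholder in the usual `perf stat` layout, not a measurement from this project.

```python
import re

ENERGY_RE = re.compile(r"^\s*([\d.,]+)\s+Joules\s+power/energy-pkg/", re.MULTILINE)
ELAPSED_RE = re.compile(r"^\s*([\d.,]+)\s+seconds time elapsed", re.MULTILINE)


def parse_perf_energy(report):
    """Return (joules, seconds) parsed from a `perf stat` text report.

    Assumes '.' as the decimal mark and ',' as an optional thousands separator.
    """
    energy = ENERGY_RE.search(report)
    elapsed = ELAPSED_RE.search(report)
    if energy is None or elapsed is None:
        raise ValueError("report lacks an energy-pkg or elapsed-time line")
    joules = float(energy.group(1).replace(",", ""))
    seconds = float(elapsed.group(1).replace(",", ""))
    return joules, seconds


if __name__ == "__main__":
    # Placeholder report: the numbers are illustrative, not measured.
    sample = """
 Performance counter stats for 'system wide':

            250.00 Joules power/energy-pkg/

       5.000000000 seconds time elapsed
"""
    joules, seconds = parse_perf_energy(sample)
    print(f"{joules:.2f} J over {seconds:.2f} s = {joules / seconds:.2f} W average")
```

Because `-a` counts system-wide, the reading includes whatever else the machine was doing during the run, so measurements are most meaningful on an otherwise idle machine.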
## Citation
If you use this software or the data from our experiments in your research, please cite our paper:
```bibtex
@inproceedings{ferragina2026energy,
author = {Ferragina, Paolo and Tosoni, Francesco},
title = {The Energy-Throughput Trade-off in Lossless-Compressed Source Code Storage},
booktitle = {2026 IEEE International Conference on Software Analysis, Evolution and Reengineering - Companion (SANER-C)},
year = {2026},
pages = {157--164},
doi = {10.1109/SANER-C67878.2026.00027},
publisher = {IEEE}
}
```
## Acknowledgements
This work was supported by the L'EMbeDS Department at the Sant'Anna School of Advanced Studies, Pisa, Italy. All the computations presented in this paper were performed using the GRICAD infrastructure ([https://gricad.univ-grenoble-alpes.fr](https://gricad.univ-grenoble-alpes.fr)), which is supported by Grenoble research communities. We thank SOS Gricad and the Software Heritage team for valuable insights, suggestions, and continuous support for our work.
## Licence
This project is licensed under the Apache 2.0 Licence - see the [LICENSE.md](LICENSE.md) file for details.
-----
*For any questions or further information regarding the experiments or the compressed key-value store design, please contact the authors at the Sant'Anna School of Advanced Studies.*