https://github.com/multiqc/example-timeline

Example of using MultiQC Parquet intermediate files to plot timelines across many runs

Host: GitHub
URL: https://github.com/multiqc/example-timeline
Owner: MultiQC
Created: 2025-04-09T11:45:55.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-30T12:44:14.000Z (about 1 year ago)
Last Synced: 2025-05-30T17:41:55.423Z (about 1 year ago)
Language: Jupyter Notebook
Size: 283 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Apache Iceberg + Trino Performance Evaluation

This project sets up a development environment for comparing Apache Iceberg tables queried through Trino against plain Parquet files, for storing and querying MultiQC data directly in your S3 bucket.

## System Architecture

The setup includes:

- Direct connection to your AWS S3 bucket (s3://megaqc-test/)
- Trino as the SQL query engine
- Apache Iceberg for table format
- Hive Metastore for schema registry
- PostgreSQL for the metastore backend
- Jupyter Notebook for running queries and evaluations

## Getting Started

### Prerequisites

- Docker and Docker Compose
- Git
- AWS credentials with access to the s3://megaqc-test/ bucket

### Installation

1. Clone this repository:
```
git clone https://github.com/multiqc/example-timeline.git
cd example-timeline
```

4. Start the Docker containers:
```
docker compose up -d
```

5. Wait for all services to start. You can check the status with:
```
docker compose ps
```
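
If you prefer to script the wait instead of re-running `docker compose ps`, the following is a minimal sketch that polls the two published ports until they accept TCP connections. The helper name `wait_for_port` and the timeout values are illustrative assumptions, not part of this repository, and an open port only suggests that a service is up, not that it is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect shows the port is open; it does not
            # prove that the service behind it has finished starting.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage (ports as listed under "Accessing Services"):
#   wait_for_port("localhost", 8888, timeout=120)  # Jupyter Notebook
#   wait_for_port("localhost", 8080, timeout=120)  # Trino UI
```

The function returns a boolean rather than raising, so a calling script can decide whether to retry or to inspect the container logs.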

### Accessing Services

- Jupyter Notebook: http://localhost:8888
- Trino UI: http://localhost:8080

## Running the Evaluation

1. Open Jupyter Notebook at http://localhost:8888
2. Navigate to `notebooks/iceberg_evaluation.ipynb`
3. Run all cells to execute the performance comparison

## Performance Evaluation

The notebook compares:

1. **Storage Time**: How long it takes to write data to Parquet vs. Iceberg
2. **Query Performance**:
   - Filtering by metric name across all runs
   - Filtering by run_id and module_id
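
The two query patterns above could be timed roughly as follows. This is a sketch under assumptions, not the notebook's actual code: the table name, the `metric_name` column, and both helper functions are hypothetical, and `execute` stands for whatever callable runs SQL in your setup (for example a database cursor's `execute` method). Values are inlined into the SQL for brevity, which is only acceptable for trusted benchmark inputs.

```python
import time
from typing import Callable, Dict


def build_queries(table: str, metric: str, run_id: str, module_id: str) -> Dict[str, str]:
    """Build the two access patterns as SQL text (hypothetical schema)."""
    return {
        # Pattern 1: one metric across all runs.
        "by_metric": f"SELECT * FROM {table} WHERE metric_name = '{metric}'",
        # Pattern 2: one module within one run.
        "by_run_and_module": (
            f"SELECT * FROM {table} "
            f"WHERE run_id = '{run_id}' AND module_id = '{module_id}'"
        ),
    }


def time_queries(
    execute: Callable[[str], object],
    queries: Dict[str, str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Run each query `repeats` times and keep the best wall-clock time in seconds."""
    timings: Dict[str, float] = {}
    for name, sql in queries.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            execute(sql)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings
```

Keeping the best of several repeats reduces the influence of cold caches and container warm-up; reporting the median instead would be an equally reasonable choice.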

## Project Structure

- `docker-compose.yml` - Container configuration
- `trino/etc/` - Trino configuration files
- `notebooks/` - Jupyter notebooks for evaluation
- `exploring/` - Original Parquet test notebooks
- `.env` - Environment variables for AWS credentials

## Customization

- Adjust the dataset size in the notebook by changing `NUM_RUNS`, `NUM_MODULES`, etc.
- Modify query patterns to test different access patterns
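
As an illustration of how such size parameters might drive a synthetic dataset, here is a minimal sketch. `NUM_RUNS` and `NUM_MODULES` are the names mentioned above; `NUM_METRICS`, the row layout, and `generate_rows` are assumptions made for this example and may not match the notebook.

```python
import random
from typing import Dict, Iterator

NUM_RUNS = 100     # name taken from this README
NUM_MODULES = 5    # name taken from this README
NUM_METRICS = 10   # hypothetical: one of the "etc." parameters


def generate_rows(
    num_runs: int = NUM_RUNS,
    num_modules: int = NUM_MODULES,
    num_metrics: int = NUM_METRICS,
    seed: int = 0,
) -> Iterator[Dict[str, object]]:
    """Yield one row per (run, module, metric) combination with a random value."""
    rng = random.Random(seed)  # seeded so repeated benchmarks see identical data
    for r in range(num_runs):
        for m in range(num_modules):
            for k in range(num_metrics):
                yield {
                    "run_id": f"run_{r:04d}",
                    "module_id": f"module_{m:02d}",
                    "metric_name": f"metric_{k:02d}",
                    "value": rng.random(),
                }
```

The row count is the product of the three parameters, so doubling `NUM_RUNS` doubles the data volume, which makes it straightforward to see how write and query times scale.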

## Potential Benefits of Iceberg

- Schema evolution capabilities
- Time travel (querying data as of a specific point in time)
- Better handling of the small-files problem
- Transactional consistency
- Partition evolution
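
To make the time travel item concrete: Trino's Iceberg connector accepts a `FOR TIMESTAMP AS OF` clause on a table reference. The sketch below only builds such a query as text; the helper name and the table name in the usage comment are hypothetical, and the query is not executed here.

```python
from datetime import datetime, timezone


def time_travel_query(table: str, as_of: datetime) -> str:
    """Return a Trino SELECT that reads an Iceberg table as of `as_of`."""
    if as_of.tzinfo is None:
        # Trino expects a timestamp with time zone, so refuse naive datetimes.
        raise ValueError("as_of must be timezone-aware")
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} FOR TIMESTAMP AS OF TIMESTAMP '{stamp} UTC'"


# Example usage with a hypothetical table name:
#   time_travel_query("iceberg.multiqc.metrics",
#                     datetime(2025, 5, 1, 12, 0, tzinfo=timezone.utc))
```

Plain Parquet files have no equivalent: reading an older state would require keeping separate copies of the files yourself.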

## Troubleshooting

- If Trino fails to start, check the logs with `docker compose logs trino-coordinator`
- If connections to AWS S3 fail, verify your credentials in the `.env` file
- If the Hive Metastore isn't accessible, check that PostgreSQL is running correctly
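
For the credentials check, a small script can report which keys are missing or empty in `.env`. This is a sketch under an assumption: it expects the standard AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, which may differ from the names this project's `docker-compose.yml` actually reads. The helper functions are illustrative.

```python
from pathlib import Path
from typing import Dict, List

# Assumed key names (the standard AWS environment variables).
REQUIRED_KEYS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]


def read_env_file(path: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(path: str, required: List[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty in the file."""
    values = read_env_file(path)
    return [key for key in required if not values.get(key)]


# Example usage: print(missing_keys(".env"))
```

This only checks that values are present; it does not verify that the credentials grant access to the bucket.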

## Additional Resources

- [Apache Iceberg Documentation](https://iceberg.apache.org/)
- [Trino Documentation](https://trino.io/docs/current/)
- [AWS S3 Documentation](https://docs.aws.amazon.com/s3/)
