{"id":28450334,"url":"https://github.com/multiqc/example-timeline","last_synced_at":"2025-06-30T16:32:29.826Z","repository":{"id":287003181,"uuid":"963249986","full_name":"MultiQC/example-timeline","owner":"MultiQC","description":"Example of using MultiQC parquet intermediate files to plot timelines across many runs","archived":false,"fork":false,"pushed_at":"2025-05-30T12:44:14.000Z","size":290,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-05-30T17:41:55.423Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MultiQC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-09T11:45:55.000Z","updated_at":"2025-05-30T12:44:17.000Z","dependencies_parsed_at":"2025-06-03T05:01:33.095Z","dependency_job_id":null,"html_url":"https://github.com/MultiQC/example-timeline","commit_stats":null,"previous_names":["multiqc/example-timeline"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MultiQC/example-timeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MultiQC%2Fexample-timeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MultiQC%2Fexample-timeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MultiQC%2Fexample-timeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MultiQC%2Fexample-timeline/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MultiQC","download_url":"https://codeload.github.com/MultiQC/example-timeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MultiQC%2Fexample-timeline/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262810734,"owners_count":23367972,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-06T15:07:10.439Z","updated_at":"2025-06-30T16:32:29.813Z","avatar_url":"https://github.com/MultiQC.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Apache Iceberg + Trino Performance Evaluation\n\nThis project sets up a development environment for evaluating Apache Iceberg with Trino compared to traditional Parquet for storing and querying MultiQC data directly from your S3 bucket.\n\n## System Architecture\n\nThe setup includes:\n\n- Direct connection to your AWS S3 bucket (s3://megaqc-test/)\n- Trino as the SQL query engine\n- Apache Iceberg for table format\n- Hive Metastore for schema registry\n- PostgreSQL for the metastore backend\n- Jupyter Notebook for running queries and evaluations\n\n## Getting Started\n\n### Prerequisites\n\n- Docker and Docker Compose\n- Git\n- AWS credentials with access to the s3://megaqc-test/ bucket\n\n### Installation\n\n1. Clone this repository:\n   ```\n   git clone \u003crepository-url\u003e\n   cd example-timeline\n   ```\n\n4. Start the Docker containers:\n   ```\n   docker compose up -d\n   ```\n\n5. 
Wait for all services to start. You can check the status with:\n   ```\n   docker compose ps\n   ```\n\n### Accessing Services\n\n- Jupyter Notebook: http://localhost:8888\n- Trino UI: http://localhost:8080\n\n## Running the Evaluation\n\n1. Open Jupyter Notebook at http://localhost:8888\n2. Navigate to `notebooks/iceberg_evaluation.ipynb`\n3. Run all cells to execute the performance comparison\n\n## Performance Evaluation\n\nThe notebook compares:\n\n1. **Storage Time**: How long it takes to write data to Parquet vs. Iceberg\n2. **Query Performance**:\n   - Filtering by metric name across all runs\n   - Filtering by `run_id` and `module_id`\n\n## Project Structure\n\n- `docker-compose.yml` - Container configuration\n- `trino/etc/` - Trino configuration files\n- `notebooks/` - Jupyter notebooks for evaluation\n- `exploring/` - Original Parquet test notebooks\n- `.env` - Environment variables for AWS credentials\n\n## Customization\n\n- Adjust the dataset size in the notebook by changing `NUM_RUNS`, `NUM_MODULES`, etc.\n- Modify the queries to test different access patterns\n\n## Potential Benefits of Iceberg\n\n- Schema evolution capabilities\n- Time travel (querying data as of a specific point in time)\n- Better handling of the small-files problem\n- Transactional consistency\n- Partition evolution\n\n## Troubleshooting\n\n- If Trino fails to start, check the logs with `docker compose logs trino-coordinator`\n- If connections to AWS S3 fail, verify your credentials in the `.env` file\n- If the Hive Metastore isn't accessible, check that PostgreSQL is running correctly\n\n## Additional Resources\n\n- [Apache Iceberg Documentation](https://iceberg.apache.org/)\n- [Trino Documentation](https://trino.io/docs/current/)\n- [AWS S3 Documentation](https://docs.aws.amazon.com/s3/) 
","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmultiqc%2Fexample-timeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmultiqc%2Fexample-timeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmultiqc%2Fexample-timeline/lists"}