https://github.com/qasimkhan5x/tpc-di
Implementation of TPC-DI Benchmark V1.1.0 using Python for ETL into MySQL
- Host: GitHub
- URL: https://github.com/qasimkhan5x/tpc-di
- Owner: QasimKhan5x
- Created: 2023-12-08T14:30:48.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-30T10:53:26.000Z (over 1 year ago)
- Last Synced: 2025-01-21T04:07:53.684Z (4 months ago)
- Topics: data-integration, etl, mysql, pandas, tpc-di
- Language: Jupyter Notebook
- Homepage:
- Size: 167 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# TPC-DI
This repository contains the implementation of TPC-DI Benchmark V1.1.0 using Python for ETL into MySQL.
## Usage
Ensure you have installed the latest version of MySQL Community Server. Then create a new environment, preferably with `conda`, and install the required packages:

    conda create -n tpcdi python=3.10 -y
    conda activate tpcdi
    pip install -r requirements.txt

Next, create a MySQL database for the scale factor you want to run. Name it after the scale factor, for example `tpcdi_sf3` or `tpcdi_sf5`.
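The naming convention can be captured in a small helper. This is a minimal sketch, not part of the repository: the function names `database_name` and `create_database_sql` are hypothetical, and it only builds the SQL statement, which you would still need to execute with a MySQL client of your choice.

```python
def database_name(scale_factor: int) -> str:
    """Return a database name such as 'tpcdi_sf5' for scale factor 5."""
    if scale_factor < 1:
        raise ValueError("scale factor must be a positive integer")
    return f"tpcdi_sf{scale_factor}"


def create_database_sql(scale_factor: int) -> str:
    """Build a CREATE DATABASE statement for the given scale factor."""
    name = database_name(scale_factor)
    # Backticks quote the identifier; the generated name contains only
    # letters, digits, and underscores.
    return f"CREATE DATABASE IF NOT EXISTS `{name}`;"


print(create_database_sql(5))
# prints: CREATE DATABASE IF NOT EXISTS `tpcdi_sf5`;
```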
The entire implementation can be found in `etl.ipynb`. The notebook lets you execute the benchmark phases interactively. Alternatively, to run the benchmark from the command line, run the following:

    python -m scripts.historical
    python -m scripts.incremental

Running the scripts allows you to measure the elapsed time for each phase.
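If you prefer to record the timings from a single driver, the two commands can be wrapped as follows. This is a sketch under assumptions: `time_command` and `run_phases` are hypothetical helpers that are not part of the repository, and the wall-clock time measured here includes interpreter start-up, so it may differ slightly from any timing the scripts report themselves.

```python
import subprocess
import sys
import time

# Module names taken from the commands in this README.
PHASES = ["scripts.historical", "scripts.incremental"]


def time_command(argv):
    """Run argv to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def run_phases(phases=PHASES):
    """Run each phase as `python -m <module>` and return {module: seconds}."""
    timings = {}
    for module in phases:
        timings[module] = time_command([sys.executable, "-m", module])
    return timings
```

Call `run_phases()` from the repository root, with the `tpcdi` environment active, so that the `scripts` package can be imported.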
To prepare the `Audit` table for the automated audit phase, run the following:

    python -m scripts.audit

Then run the automated audit by executing `validation/tpcdi_audit.sql` against the benchmark database. If all tests pass, the benchmark has been implemented correctly.
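One way to execute the audit file is to pipe it into the `mysql` command-line client. The sketch below assumes that the client is on your `PATH`, that the user is `root`, and that the database is named as suggested earlier. `audit_command` and `run_audit` are hypothetical helpers, not part of the repository, and how the password prompt appears may depend on your setup.

```python
import subprocess


def audit_command(database, user="root"):
    """Build the mysql client invocation for the given database."""
    # --password without a value makes the client prompt for the password.
    return ["mysql", f"--user={user}", "--password", database]


def run_audit(database, sql_path="validation/tpcdi_audit.sql", user="root"):
    """Feed the audit SQL file to the mysql client and return its stdout."""
    with open(sql_path, "rb") as sql_file:
        result = subprocess.run(
            audit_command(database, user),
            stdin=sql_file,          # equivalent to `mysql ... < file.sql`
            stdout=subprocess.PIPE,  # capture the query results only
            check=True,              # raise if the client exits non-zero
        )
    return result.stdout.decode()
```

For example, `print(run_audit("tpcdi_sf5"))` would print whatever the audit queries return for the `tpcdi_sf5` database.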
## Future Work
- [ ] Prevent deadlocks when running visibility queries during incremental updates
- [ ] Ensure all tests pass in audit