https://github.com/johansolbakken/benchmark
A repository for benchmarking Optimistic Order-Preserving Hash Join (OOHJ) in MySQL using the Join Order Benchmark (JOB). Part of a Master’s thesis project, it includes scripts for setting up the database, running SQL queries, and evaluating performance.
https://github.com/johansolbakken/benchmark
benchmarking join-order-benchmark mysql query-optimization ruby sql
Last synced: about 1 month ago
JSON representation
A repository for benchmarking Optimistic Order-Preserving Hash Join (OOHJ) in MySQL using the Join Order Benchmark (JOB). Part of a Master’s thesis project, it includes scripts for setting up the database, running SQL queries, and evaluating performance.
- Host: GitHub
- URL: https://github.com/johansolbakken/benchmark
- Owner: johansolbakken
- License: mit
- Created: 2025-01-27T10:19:20.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-11T11:48:37.000Z (12 months ago)
- Last Synced: 2025-07-03T22:03:29.106Z (11 months ago)
- Topics: benchmarking, join-order-benchmark, mysql, query-optimization, ruby, sql
- Language: Ruby
- Homepage:
- Size: 130 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 2
-
Metadata Files:
- Readme: README.org
- License: LICENSE
Awesome Lists containing this project
README
#+title: Benchmarking for Optimistic Order-Preserving Hash Join (OOHJ)
#+BEGIN_QUOTE
[!WARNING]
This software is /not/ intended for production use and is intended for our own private research purposes. Although others may use the code, please be aware that it is prone to change without notice.
#+END_QUOTE
This repository is part of our Master's thesis, focused on benchmarking modifications made to MySQL. Our fork of MySQL, which implements the Optimistic Order-Preserving Hash Join (OOHJ) feature, is available at: https://github.com/johansolbakken/mysql-server/tree/oohj/oohj-iterator.
Link to Master Thesis will be added later.
* Requirements
#+begin_src
brew install ruby
gem install pq_query
gem install terminal-table
brew install hyperfine
#+END_SRC
* JOB from a fresh start
If MySQL has lost JOB data then do the following.
First, make sure MySQL is running. (Tips use =mtr --start-dirty= or =make run-dirty= to avoid resetting MySQL).
#+begin_src
make job-dataset # Download job dataset
make job-order-queries # Convert job queries to ORDER BY
# Ensure MySQL is running
make job-setup # Create databases
make inline-local # Allow inserting data from CSV
make job-feed # Feed data to database
make prepare # Enable hypergraph optimizer and disable stats_auto_recalc and set size-values for join-buffer and innodb-pool
make job-analyze # Analyze job tables
#+end_src
* Testing the Optimistic Hash Join
1. Start the MySQL server
2. Fill out config.yaml
#+begin_src shell
make test
#+end_src
This will compare .sql files in homemade_dataset folder against .expected files in homemade_dataset folder. It will also do other types tests defined in tests.yaml file.
These tests are passing under our current assumptions. As we improve optimistic order-preserving hash join these tests will fail and we need to take new snapshots of the new expected state.
* Commands
** Download JOB dataset
#+begin_src shell
make job-dataset
#+end_src
** Start MySQL
1. Download and build the MySQL source code.
2. Start MySQL using MySQL Test Run (MTR), MySQL's testing script:
First-time setup:
#+begin_src shell
cd build # Navigate to the MySQL build folder
./mysql-test/mtr --start
#+end_src
For subsequent runs, use:
#+begin_src shell
./mysql-test/mtr --start-dirty
#+end_src
3. Configure the path to the MySQL binary in `config.yaml` (e.g., `build/bin/mysql`).
** config.yaml
Copy the `config.yaml.example` file to `config.yaml` and update it with the path to your MySQL binary.
It is important that /path/ points to a mysql client executable.
** Setup the Database
Then, initialize the database by creating tables and indexes:
#+begin_src shell
make job-setup
#+end_src
** Load the Data
Feed the downloaded dataset into the database:
#+begin_src shell
make job-feed
#+end_src
** Prepare MySQL environment
This command sets the environment to ensure the environment is the same for every test.
#+begin_src shell
ruby bin/benchmark.rb --prepare-mysql
#+end_src
** Run SQL Queries
To run SQL queries, use the following commands:
- Execute a query:
#+begin_src shell
ruby bin/benchmark.rb --run ./job/1a.sql
#+end_src
- Execute a query with EXPLAIN ANALYZE to analyze execution:
#+begin_src shell
ruby bin/benchmark.rb --run ./job/1a.sql --analyze
#+end_src
- Execute a query with EXPLAIN FORMAT=TREE to analyze plan:
#+begin_src shell
ruby bin/benchmark.rb --run ./job/1a.sql --tree
#+end_src
* tests.yaml
The YAML configuration is structured under a top-level =tests= key that divides tests into two categories: *diff_test* and *contain_test*. Each category may include a global =setup= section to prepare the environment before running tests, followed by a list of test cases under the =tests= key. In *diff_test*, each test is defined with a =name=, an SQL file specified by the =sql= key, and an =expected= file for output comparison; tests can also have individual /setup/ commands. In *contain_test*, tests may include individual =setup= commands and verify outputs by checking for specific substrings using a =contains= list. To add a new test, choose the appropriate category based on whether you want a full output comparison or substring validation. Then, include any necessary setup commands and define the test with a unique =name=, the path to the SQL file, and either an =expected= file (for *diff_test*) or a =contains= list (for *contain_test*). Note that tests run /sequentially/, so the environment setup for one test may affect subsequent tests.
#+begin_src yaml
tests:
diff_test:
setup:
- "ruby ./bin/generate_sort_hashjoin_dataset.rb 10000 10000"
- "ruby ./bin/benchmark.rb --prepare-mysql"
tests:
- name: "Basic test"
sql: "./homemade_dataset/homemade.sql"
expected: "./homemade_dataset/homemade.expected"
- name: "Disable optimistic hash join"
sql: "./homemade_dataset/homemade_disabled.sql"
expected: "./homemade_dataset/homemade_disabled.expected"
contain_test:
# Global setup is optional here.
tests:
- name: "went_on_disk=false, n=100 m=100"
setup:
- "ruby ./bin/generate_sort_hashjoin_dataset.rb 100 100"
- "ruby ./bin/benchmark.rb --prepare-mysql"
sql: "./homemade_dataset/went_on_disk.sql"
contains:
- "(optimistic hash join!)"
- "(went_on_disk=false)"
#+end_src
* C++ Debugging Tools
** Header-only Logging File
The =debug/logger.h= is a class that can be used to fast log to a file.
Usage:
#+begin_src c++
#include "/absolute_path_to_benchmark/debug/logger.h"
static Logger* s_logger = nullptr;
ClassToTest::ClassToTest() {
s_logger = new Logger("~/path_to_output/log.txt");
}
void ClassToTest::functionToTest() {
// Lets write CSV information to the logger.
auto& logger = *s_logger;
while (someCondition) {
logger << logger.timestamp() << "," this->getSomeValue() << ",";
logger << this->getState() << "\n";
}
}
#+end_src
This class will delete the log-file on construction.
There is a =timestamp()= function for getting timestamps easily.
Currently using streams.
* Generate TPC-H for MacOS
#+begin_src shell
podman run --rm -it \
-v $(pwd):/src \
-w /src \
ubuntu:22.04 \
bash
# now in podman ubuntu
sudo apt update && sudo apt install -y gcc make ruby bison flex
ruby bin/build-tpc-h.rb
#+end_src
This will generate folders:
- =tpc-h-queries=
- =tpc-h-ddl=
- =tpc-h-dataset=
* Generate TPC-DS for MacOS
#+begin_src shell
# copy the Makefile.suite and add -fcommon to CFLAGS
CFLAGS = $(BASE_CFLAGS) -D$(OS) $($(OS)_CFLAGS) -fcommon
# Start podman
podman run --rm -it \
-v $(pwd):/src \
-w /src \
ubuntu:22.04 \
bash
# now in podman ubuntu
sudo apt update && sudo apt install -y gcc make ruby bison flex
ruby bin/build-tpc-ds.rb
#+end_src
* Join Order Benchmark Commands
** Setup database and indexes in MySQL
Requires MySQL to be running.
#+begin_src shell
make job-setup
#+end_src
To wipe database and recreate:
#+begin_src shell
ruby bin/job-setup.rb --force
#+end_src
** Download job dataset
Creates job-dataset folder.
#+begin_src shell
make job-dataset
#+end_src
The job-dataset folder contains all the data as csv files.
Do this before feeding.
** Feed job data
Feed data in job-dataset to MySQL database imdbload.
#+begin_src shell
make job-feed
#+end_src
** Convert queries: remove MIN(...)
The job files we were provided is altered such that each column is in a MIN aggregate.
We therefore have created scripts for removing MIN and additionally adding ORDER BY clauses.
To generate the queries without MIN or ORDER BY:
#+begin_src shell
make job-queries
#+end_src
To make ordered queries:
#+begin_src shell
make job-order-queries
#+end_src
** Run JOB queries
#+begin_src shell
make run-file DATABASE=imdbload FILE=./job-queries/10a.sql
#+end_src
** Delete JOB artifacts
#+begin_src shell
make job-clean
#+end_src
** Warmup MySQL
Essentially means loading all data into memory.
#+begin_src shell
make job-warmup
#+end_src
* Analyze
To analyze run the script:
#+begin_src shell
ruby bin/analyze.rb --job
#+end_src
* Check if any query fails for a database
#+begin_src shell
ruby ./bin/test-sql-files.rb --folder ./job-queries --database imdbload
#+end_src
* Run any file
#+begin_src shell
make run-file DATABASE=imdbload FILE=./job-queries/10b.sql
#+end_src
* Homemade Dataset
** Setup database and indexes in MySQL
Requires MySQL to be running.
#+begin_src shell
make homemade-setup
#+end_src
To wipe database and recreate:
#+begin_src shell
ruby bin/homemade-setup.rb --force
#+end_src
** Generate dataset
For instance with table a size and table b size set to 10000. Default for both is 10000.
#+begin_src shell
make homemade-dataset TABLE_A_SIZE=10000 TABLE_B_SIZE=10000
#+end_src
** Feed homemade data
#+begin_src shell
make homemade-feed
#+end_src
** Analyze tables
#+begin_src shell
make run-file DATABASE=homemade FILE=./sql/analyze_homemade.sql
#+end_src
** Count Optimistic Hash Join
`bin/count-ohj.rb`, counts occurrences of "optimistic hash join" in SQL execution plans. It works by:
#+begin_src sh
ruby bin/count-ohj.rb [--join-buffer-size SIZE]
#+end_src
Example (setting join buffer size to 16MB):
#+begin_src sh
ruby bin/count-ohj.rb --join-buffer-size 16777216
#+end_src
** Export query plan as DOT
Generates a graphical representation of a query plan from an input SQL file.
*Run with an SQL file:*
#+begin_src shell
ruby bin/benchmark.rb ./job/1a.sql
#+end_src
*Display the JSON output:*
#+begin_src shell
ruby bin/benchmark.rb --show-json ./job/1a.sql
#+end_src
*Specify a custom output PNG file:*
#+begin_src shell
ruby bin/benchmark.rb -o custom_plan.png ./job/1a.sql
#+end_src
*Keep the DOT file:*
#+begin_src shell
ruby bin/benchmark.rb -c ./job/1a.sql
#+end_src
*With hints*
#+begin_src shell
ruby bin/print-plan-as-graphwiz.rb ./job-order-queries/1a.sql -o./1a.png --hint "/*+ SET_OPTIMISM_FUNC(SIGMOID) */"
#+end_src
** Enable and disable indexes for JOB
#+begin_src bash
make job-index-enable
make job-index-disable
#+end_src
* TPC-H setup
- There must exist an folder in root of this repo called =tpc-h-tool= with =dbgen= and others.
#+begin_src shell
# Ensure you have data
ls tpc-h-tool/dbgen/*
# Generate tpc-h data
cargo run -p tpc-h-datagen
make experimental-setup # activate hypergraph optimizer
make tpc-h-setup # will setup both s1 and s10
make tpc-h-feed # will feed both
make tpc-h-analyze # this will analyze both s1 and s10
ruby bin/tpc-h-warmup.rb s1
rbuy bin/tpc-h-warmup.rb s10
#+end_src