https://github.com/johansolbakken/benchmark

A repository for benchmarking Optimistic Order-Preserving Hash Join (OOHJ) in MySQL using the Join Order Benchmark (JOB). Part of a Master’s thesis project, it includes scripts for setting up the database, running SQL queries, and evaluating performance.

Host: GitHub
URL: https://github.com/johansolbakken/benchmark
Owner: johansolbakken
License: mit
Created: 2025-01-27T10:19:20.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-06-11T11:48:37.000Z (about 1 year ago)
Last Synced: 2025-07-03T22:03:29.106Z (about 1 year ago)
Topics: benchmarking, join-order-benchmark, mysql, query-optimization, ruby, sql
Language: Ruby
Homepage:
Size: 130 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.org
- License: LICENSE


#+title: Benchmarking for Optimistic Order-Preserving Hash Join (OOHJ)

#+BEGIN_QUOTE
[!WARNING]
This software is /not/ intended for production use and is intended for our own private research purposes. Although others may use the code, please be aware that it is prone to change without notice.
#+END_QUOTE

This repository is part of our Master's thesis, focused on benchmarking modifications made to MySQL. Our fork of MySQL, which implements the Optimistic Order-Preserving Hash Join (OOHJ) feature, is available at: https://github.com/johansolbakken/mysql-server/tree/oohj/oohj-iterator.

A link to the Master's thesis will be added later.

* Requirements

#+begin_src shell
brew install ruby
gem install pg_query
gem install terminal-table
brew install hyperfine
#+end_src

* JOB from a fresh start

If MySQL has lost its JOB data, do the following.

First, make sure MySQL is running. (Tip: use =mtr --start-dirty= or =make run-dirty= to avoid resetting MySQL.)

#+begin_src shell
make job-dataset # Download job dataset
make job-order-queries # Convert job queries to ORDER BY

# Ensure MySQL is running
make job-setup # Create databases
make inline-local # Allow inserting data from CSV
make job-feed # Feed data to database
make prepare # Enable the hypergraph optimizer, disable stats_auto_recalc, and set the join buffer and InnoDB buffer pool sizes
make job-analyze # Analyze job tables
#+end_src

* Testing the Optimistic Hash Join

1. Start the MySQL server
2. Fill out config.yaml

#+begin_src shell
make test
#+end_src

This compares the output of the =.sql= files in the =homemade_dataset= folder against the corresponding =.expected= files in the same folder. It also runs the other types of tests defined in =tests.yaml=.

These tests pass under our current assumptions. As we improve the optimistic order-preserving hash join, they will start to fail, and we will need to take new snapshots of the new expected state.
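
The snapshot comparison described above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual test runner: the =diff_lines= helper and the sample strings are hypothetical.

#+begin_src ruby
# Hypothetical helper: compare actual query output against an expected snapshot.
# Returns [line_number, expected_line, actual_line] for every mismatching line.
def diff_lines(expected, actual)
  exp = expected.lines.map(&:chomp)
  act = actual.lines.map(&:chomp)
  max = [exp.length, act.length].max
  (0...max).each_with_object([]) do |i, mismatches|
    mismatches << [i + 1, exp[i], act[i]] unless exp[i] == act[i]
  end
end

# Made-up sample data standing in for a .expected file and real query output.
expected = "id\tvalue\n1\ta\n2\tb\n"
actual   = "id\tvalue\n1\ta\n2\tc\n"

mismatches = diff_lines(expected, actual)
if mismatches.empty?
  puts "PASS"
else
  mismatches.each do |line_no, exp, act|
    puts "line #{line_no}: expected #{exp.inspect}, got #{act.inspect}"
  end
end
#+end_src

When a plan change is intentional, the =.expected= file would be regenerated from the new output rather than edited by hand.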

* Commands

** Download JOB dataset

#+begin_src shell
make job-dataset
#+end_src

** Start MySQL

1. Download and build the MySQL source code.
2. Start MySQL using MySQL Test Run (MTR), MySQL's testing script:

First-time setup:

#+begin_src shell
cd build # Navigate to the MySQL build folder
./mysql-test/mtr --start
#+end_src

For subsequent runs, use:

#+begin_src shell
./mysql-test/mtr --start-dirty
#+end_src

3. Configure the path to the MySQL binary in =config.yaml= (e.g., =build/bin/mysql=).

** config.yaml

Copy the =config.yaml.example= file to =config.yaml= and update it with the path to your MySQL binary.

It is important that /path/ points to a mysql client executable.
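
A quick sanity check of the configuration could look like the sketch below. It assumes that =config.yaml= has a top-level =path= key; the real file may be structured differently, so check =config.yaml.example= first. The =load_mysql_client_path= helper is hypothetical.

#+begin_src ruby
require "yaml"

# Assumption: config.yaml has a top-level `path` key holding the mysql client path.
# The real file may contain other keys; see config.yaml.example.
def load_mysql_client_path(config_file)
  config = YAML.safe_load(File.read(config_file))
  config = {} unless config.is_a?(Hash)
  path = config["path"].to_s
  raise ArgumentError, "config is missing the `path` key" if path.empty?

  expanded = File.expand_path(path)
  unless File.file?(expanded) && File.executable?(expanded)
    raise ArgumentError, "#{expanded} is not an executable file"
  end
  expanded
end

# Only runs the check when a config.yaml exists in the current directory.
if File.exist?("config.yaml")
  begin
    puts "Using mysql client at #{load_mysql_client_path('config.yaml')}"
  rescue ArgumentError => e
    warn "config.yaml problem: #{e.message}"
  end
end
#+end_src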

** Setup the Database

Initialize the database by creating tables and indexes:

#+begin_src shell
make job-setup
#+end_src

** Load the Data

Feed the downloaded dataset into the database:

#+begin_src shell
make job-feed
#+end_src

** Prepare MySQL environment

This command configures MySQL so that the environment is the same for every test.

#+begin_src shell
ruby bin/benchmark.rb --prepare-mysql
#+end_src
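
As a rough idea of what such a preparation step involves, the sketch below builds the kind of statements implied by the =make prepare= description (hypergraph optimizer on, automatic statistics recalculation off, fixed buffer sizes). The exact statements issued by =bin/benchmark.rb --prepare-mysql= are not listed in this README, so both the =prepare_statements= helper and the sizes are assumptions.

#+begin_src ruby
# Sketch only: the statements bin/benchmark.rb --prepare-mysql actually issues
# are not documented here, and the sizes passed below are placeholder values.
def prepare_statements(join_buffer_size:, buffer_pool_size:)
  [
    "SET GLOBAL optimizer_switch = 'hypergraph_optimizer=on';",
    "SET GLOBAL innodb_stats_auto_recalc = OFF;",
    "SET GLOBAL join_buffer_size = #{Integer(join_buffer_size)};",
    "SET GLOBAL innodb_buffer_pool_size = #{Integer(buffer_pool_size)};"
  ]
end

puts prepare_statements(
  join_buffer_size: 16 * 1024 * 1024,       # 16 MiB, placeholder
  buffer_pool_size: 4 * 1024 * 1024 * 1024  # 4 GiB, placeholder
)
#+end_src

Note that =SET GLOBAL= changes to =optimizer_switch= and =join_buffer_size= only apply to sessions opened afterwards.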

** Run SQL Queries

To run SQL queries, use the following commands:

- Execute a query:
#+begin_src shell
ruby bin/benchmark.rb --run ./job/1a.sql
#+end_src

- Execute a query with EXPLAIN ANALYZE to analyze execution:
#+begin_src shell
ruby bin/benchmark.rb --run ./job/1a.sql --analyze
#+end_src

- Execute a query with EXPLAIN FORMAT=TREE to analyze plan:
#+begin_src shell
ruby bin/benchmark.rb --run ./job/1a.sql --tree
#+end_src
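
The difference between the three modes comes down to how the query is wrapped before it is sent to MySQL. The sketch below illustrates that idea; the =explain_query= helper is hypothetical, and how =bin/benchmark.rb= implements =--analyze= and =--tree= internally is an assumption.

#+begin_src ruby
# Hypothetical helper: wrap a query the way the --analyze and --tree options
# are described above. The real script may build its statements differently.
def explain_query(sql, mode: nil)
  body = sql.strip.sub(/;\s*\z/, "")
  case mode
  when :analyze then "EXPLAIN ANALYZE #{body};"
  when :tree    then "EXPLAIN FORMAT=TREE #{body};"
  when nil      then "#{body};"
  else raise ArgumentError, "unknown mode: #{mode.inspect}"
  end
end

# Illustrative query only; it is not one of the JOB queries.
query = "SELECT t.title FROM title AS t ORDER BY t.title;"
puts explain_query(query)
puts explain_query(query, mode: :analyze)
puts explain_query(query, mode: :tree)
#+end_src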

* tests.yaml

The YAML configuration is structured under a top-level =tests= key that divides tests into two categories: *diff_test* and *contain_test*. Each category may include a global =setup= section to prepare the environment before running tests, followed by a list of test cases under the =tests= key.

In *diff_test*, each test is defined with a =name=, an SQL file specified by the =sql= key, and an =expected= file for output comparison; tests can also have individual =setup= commands. In *contain_test*, tests may include individual =setup= commands and verify outputs by checking for specific substrings using a =contains= list.

To add a new test, choose the appropriate category based on whether you want a full output comparison or substring validation. Then, include any necessary setup commands and define the test with a unique =name=, the path to the SQL file, and either an =expected= file (for *diff_test*) or a =contains= list (for *contain_test*). Note that tests run /sequentially/, so the environment setup for one test may affect subsequent tests.

#+begin_src yaml
tests:
  diff_test:
    setup:
      - "ruby ./bin/generate_sort_hashjoin_dataset.rb 10000 10000"
      - "ruby ./bin/benchmark.rb --prepare-mysql"
    tests:
      - name: "Basic test"
        sql: "./homemade_dataset/homemade.sql"
        expected: "./homemade_dataset/homemade.expected"
      - name: "Disable optimistic hash join"
        sql: "./homemade_dataset/homemade_disabled.sql"
        expected: "./homemade_dataset/homemade_disabled.expected"

  contain_test:
    # Global setup is optional here.
    tests:
      - name: "went_on_disk=false, n=100 m=100"
        setup:
          - "ruby ./bin/generate_sort_hashjoin_dataset.rb 100 100"
          - "ruby ./bin/benchmark.rb --prepare-mysql"
        sql: "./homemade_dataset/went_on_disk.sql"
        contains:
          - "(optimistic hash join!)"
          - "(went_on_disk=false)"
#+end_src
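
To make the structure concrete, here is a small sketch that loads a trimmed copy of the configuration and lists what each test would check. It is not the repository's test runner; =plan_tests= and =contains_all?= are hypothetical helpers, and the YAML is inlined so the example is self-contained.

#+begin_src ruby
require "yaml"

# A trimmed copy of the structure described above.
TESTS_YAML = <<~YML
  tests:
    diff_test:
      setup:
        - "ruby ./bin/benchmark.rb --prepare-mysql"
      tests:
        - name: "Basic test"
          sql: "./homemade_dataset/homemade.sql"
          expected: "./homemade_dataset/homemade.expected"
    contain_test:
      tests:
        - name: "went_on_disk=false, n=100 m=100"
          setup:
            - "ruby ./bin/generate_sort_hashjoin_dataset.rb 100 100"
          sql: "./homemade_dataset/went_on_disk.sql"
          contains:
            - "(optimistic hash join!)"
            - "(went_on_disk=false)"
YML

# Hypothetical helper: flatten the config into one entry per test, in file order.
def plan_tests(config)
  config.fetch("tests").flat_map do |category, section|
    section.fetch("tests", []).map do |test|
      {
        category: category,
        name: test.fetch("name"),
        setup: test.fetch("setup", []), # per-test setup only
        sql: test.fetch("sql"),
        check: category == "diff_test" ? test.fetch("expected") : test.fetch("contains")
      }
    end
  end
end

# Substring check in the spirit of contain_test: every string must appear.
def contains_all?(output, substrings)
  substrings.all? { |s| output.include?(s) }
end

plan_tests(YAML.safe_load(TESTS_YAML)).each do |t|
  puts "#{t[:category]}: #{t[:name]} -> #{t[:check].inspect}"
end
#+end_src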

* C++ Debugging Tools

** Header-only Logging File

=debug/logger.h= provides a header-only =Logger= class for fast logging to a file.

Usage:

#+begin_src c++
#include "/absolute_path_to_benchmark/debug/logger.h"

static Logger* s_logger = nullptr;

ClassToTest::ClassToTest() {
  s_logger = new Logger("~/path_to_output/log.txt");
}

void ClassToTest::functionToTest() {
  // Let's write CSV information to the logger.
  auto& logger = *s_logger;

  while (someCondition) {
    logger << logger.timestamp() << "," << this->getSomeValue() << ",";
    logger << this->getState() << "\n";
  }
}
#+end_src

The class deletes any existing log file on construction.

There is a =timestamp()= function for getting timestamps easily.

The implementation currently uses C++ streams.

* Generate TPC-H for macOS

#+begin_src shell
podman run --rm -it \
  -v "$(pwd)":/src \
  -w /src \
  ubuntu:22.04 \
  bash
# Now inside the Ubuntu container (already root; the base image has no sudo)
apt update && apt install -y gcc make ruby bison flex
ruby bin/build-tpc-h.rb
#+end_src

This will generate folders:
- =tpc-h-queries=
- =tpc-h-ddl=
- =tpc-h-dataset=

* Generate TPC-DS for macOS

#+begin_src shell
# Copy Makefile.suite and add -fcommon to CFLAGS, so that the line reads:
#   CFLAGS = $(BASE_CFLAGS) -D$(OS) $($(OS)_CFLAGS) -fcommon

# Start podman
podman run --rm -it \
  -v "$(pwd)":/src \
  -w /src \
  ubuntu:22.04 \
  bash

# Now inside the Ubuntu container (already root; the base image has no sudo)
apt update && apt install -y gcc make ruby bison flex
ruby bin/build-tpc-ds.rb
#+end_src

* Join Order Benchmark Commands

** Setup database and indexes in MySQL

Requires MySQL to be running.

#+begin_src shell
make job-setup
#+end_src

To wipe the database and recreate it:

#+begin_src shell
ruby bin/job-setup.rb --force
#+end_src

** Download job dataset

Creates the =job-dataset= folder.

#+begin_src shell
make job-dataset
#+end_src

The =job-dataset= folder contains all the data as CSV files.

Do this before feeding.

** Feed job data

Feed the data in =job-dataset= into the MySQL database =imdbload=.

#+begin_src shell
make job-feed
#+end_src

** Convert queries: remove MIN(...)

The JOB query files we were provided are altered such that each selected column is wrapped in a MIN aggregate.

We have therefore created scripts that remove MIN and, optionally, add ORDER BY clauses.

To generate the queries without MIN and without ORDER BY:

#+begin_src shell
make job-queries
#+end_src

To make ordered queries:

#+begin_src shell
make job-order-queries
#+end_src
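
The transformation can be illustrated with a small regex-based sketch. This is not the repository's implementation (which may well use a real SQL parser): =strip_min= and =add_order_by= are hypothetical helpers, the query is a made-up example in the style of JOB, and ordering by the first selected column is an assumption, since this README does not say which columns the generated queries order by.

#+begin_src ruby
# Regex-based sketch; only handles simple MIN(column) expressions.
def strip_min(sql)
  sql.gsub(/\bMIN\s*\(\s*([^()]+?)\s*\)/i, '\1')
end

# Assumption: order by the first selected column. Which columns the real
# job-order-queries target uses is not stated in this README.
def add_order_by(sql)
  first_column = sql[/\bSELECT\s+([\w.]+)/i, 1]
  raise ArgumentError, "could not find a selected column" if first_column.nil?

  "#{sql.strip.sub(/;\s*\z/, '')}\nORDER BY #{first_column};"
end

# Made-up query in the style of the JOB files.
original = <<~SQL
  SELECT MIN(t.title) AS movie_title,
         MIN(t.production_year) AS movie_year
  FROM title AS t
  WHERE t.production_year > 2000;
SQL

puts add_order_by(strip_min(original))
#+end_src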

** Run JOB queries

#+begin_src shell
make run-file DATABASE=imdbload FILE=./job-queries/10a.sql
#+end_src

** Delete JOB artifacts

#+begin_src shell
make job-clean
#+end_src

** Warmup MySQL

Warming up essentially means loading all data into memory.

#+begin_src shell
make job-warmup
#+end_src

* Analyze

To analyze, run the script:

#+begin_src shell
ruby bin/analyze.rb --job
#+end_src

* Check if any query fails for a database
#+begin_src shell
ruby ./bin/test-sql-files.rb --folder ./job-queries --database imdbload
#+end_src

* Run any file

#+begin_src shell
make run-file DATABASE=imdbload FILE=./job-queries/10b.sql
#+end_src

* Homemade Dataset

** Setup database and indexes in MySQL

Requires MySQL to be running.

#+begin_src shell
make homemade-setup
#+end_src

To wipe the database and recreate it:

#+begin_src shell
ruby bin/homemade-setup.rb --force
#+end_src

** Generate dataset

For instance, with the sizes of table A and table B both set to 10000 (which is also the default for both):

#+begin_src shell
make homemade-dataset TABLE_A_SIZE=10000 TABLE_B_SIZE=10000
#+end_src

** Feed homemade data

#+begin_src shell
make homemade-feed
#+end_src

** Analyze tables

#+begin_src shell
make run-file DATABASE=homemade FILE=./sql/analyze_homemade.sql
#+end_src

** Count Optimistic Hash Join

=bin/count-ohj.rb= counts occurrences of "optimistic hash join" in SQL execution plans. Usage:

#+begin_src sh
ruby bin/count-ohj.rb [--join-buffer-size SIZE]
#+end_src

Example (setting join buffer size to 16MB):

#+begin_src sh
ruby bin/count-ohj.rb --join-buffer-size 16777216
#+end_src
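
The counting itself amounts to scanning plan text for the marker string. The sketch below shows the idea on a made-up plan; the plan text is illustrative only (real =EXPLAIN FORMAT=TREE= output will differ in detail), and =count_optimistic_hash_joins= is a hypothetical helper rather than the script's actual code.

#+begin_src ruby
# Made-up plan text; only the "(optimistic hash join!)" marker is taken
# from this README. Real plans will look different.
SAMPLE_PLAN = <<~PLAN_TEXT
  -> Inner hash join (optimistic hash join!) (went_on_disk=false)
      -> Table scan on a
      -> Hash
          -> Inner hash join (optimistic hash join!)
              -> Table scan on b
              -> Hash
                  -> Table scan on c
PLAN_TEXT

# Count case-insensitive occurrences of the marker in one plan.
def count_optimistic_hash_joins(plan_text)
  plan_text.scan(/optimistic hash join/i).length
end

puts count_optimistic_hash_joins(SAMPLE_PLAN)
#+end_src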

** Export query plan as DOT

Generates a graphical representation of a query plan from an input SQL file.

*Run with an SQL file:*
#+begin_src shell
ruby bin/benchmark.rb ./job/1a.sql
#+end_src

*Display the JSON output:*
#+begin_src shell
ruby bin/benchmark.rb --show-json ./job/1a.sql
#+end_src

*Specify a custom output PNG file:*
#+begin_src shell
ruby bin/benchmark.rb -o custom_plan.png ./job/1a.sql
#+end_src

*Keep the DOT file:*
#+begin_src shell
ruby bin/benchmark.rb -c ./job/1a.sql
#+end_src

*With hints:*
#+begin_src shell
ruby bin/print-plan-as-graphwiz.rb ./job-order-queries/1a.sql -o./1a.png --hint "/*+ SET_OPTIMISM_FUNC(SIGMOID) */"
#+end_src

** Enable and disable indexes for JOB

#+begin_src bash
make job-index-enable
make job-index-disable
#+end_src

* TPC-H setup

- There must be a folder called =tpc-h-tool= in the root of this repo, containing =dbgen= and the other TPC-H tools.

#+begin_src shell
# Ensure you have data
ls tpc-h-tool/dbgen/*

# Generate tpc-h data
cargo run -p tpc-h-datagen

make experimental-setup # activate hypergraph optimizer
make tpc-h-setup # will setup both s1 and s10
make tpc-h-feed # will feed both
make tpc-h-analyze # this will analyze both s1 and s10
ruby bin/tpc-h-warmup.rb s1
ruby bin/tpc-h-warmup.rb s10
#+end_src
