# RAPIDS GPU-BDB
## Disclaimer
gpu-bdb is derived from [TPCx-BB](http://www.tpc.org/tpcx-bb/). Any results based on gpu-bdb are considered unofficial results, and per [TPC](http://www.tpc.org/) policy, cannot be compared against official TPCx-BB results.
## Overview
The GPU Big Data Benchmark (gpu-bdb) is a [RAPIDS](https://rapids.ai) library-based benchmark for enterprises that includes 30 queries representing real-world ETL & ML workflows at various "scale factors": SF1000 is 1 TB of data, SF10000 is 10 TB. Each "query" is in fact a model workflow that can include SQL, user-defined functions, careful sub-setting and aggregation, and machine learning.
## Conda Environment Setup
We provide a conda environment definition specifying all RAPIDS dependencies needed to run our query implementations. To install and activate it:
```bash
CONDA_ENV="rapids-gpu-bdb"
conda env create --name $CONDA_ENV -f gpu-bdb/conda/rapids-gpu-bdb.yml
conda activate $CONDA_ENV
```
### Installing RAPIDS bdb Tools
This repository includes a small local module containing utility functions for running the queries. You can install it with the following:
```bash
cd gpu-bdb/gpu_bdb
python -m pip install .
```
This will install a package named `bdb-tools` into your Conda environment. It should look like this:
```bash
conda list | grep bdb
bdb-tools 0.2 pypi_0 pypi
```
Note that this Conda environment needs to be replicated or installed manually on all nodes, so that a `dask-cuda-worker` process can be started on each node.
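One simple way to do this is to create the environment from the same YAML definition on each worker node, for example over SSH. The snippet below is only a sketch: the hostnames are placeholders, and it assumes the repository is checked out at `~/gpu-bdb` and conda is on the PATH on every node.
```bash
# Sketch only: replace node01..node03 with your own worker hostnames.
# Assumes the repo is checked out at ~/gpu-bdb and conda is available on each node.
for node in node01 node02 node03; do
    ssh "$node" "conda env create --name rapids-gpu-bdb -f ~/gpu-bdb/conda/rapids-gpu-bdb.yml"
done
```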
## NLP Query Setup
Queries 10, 18, and 19 depend on two static files (`negativeSentiment.txt` and `positiveSentiment.txt`). As we cannot redistribute those files, you should [download the tpcx-bb toolkit](http://www.tpc.org/tpc_documents_current_versions/download_programs/tools-download-request5.asp?bm_type=TPCX-BB&bm_vers=1.3.1&mode=CURRENT-ONLY) and extract them to your data directory on your shared filesystem:
```bash
jar xf bigbenchqueriesmr.jar
cp tpcx-bb1.3.1/distributions/Resources/io/bigdatabenchmark/v1/queries/q10/*.txt ${DATA_DIR}/sentiment_files/
```
For Query 27, we rely on [spaCy](https://spacy.io/). To download the necessary language model after activating the Conda environment, run:
```bash
python -m spacy download en_core_web_sm
```
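As a quick optional check that the model is available before running Query 27, you can try loading it from the activated environment:
```bash
# Optional sanity check: fails if the en_core_web_sm model is not installed
python -c "import spacy; spacy.load('en_core_web_sm'); print('spaCy model OK')"
```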
## Starting Your Cluster
We use the `dask-scheduler` and `dask-cuda-worker` command line interfaces to start a Dask cluster. We provide a `cluster_configuration` directory with a bash script to help you set up an NVLink-enabled cluster using UCX.
Before running the script, you'll need to make changes specific to your environment (an illustrative example follows the list below).
In `cluster_configuration/cluster-startup.sh`:
- Update `GPU_BDB_HOME=...` to point to the location of this repository on disk.
- Update `CONDA_ENV_PATH=...` to refer to your conda environment path.
- Update `CONDA_ENV_NAME=...` to refer to the name of the conda environment you created, perhaps using the `yml` files provided in this repository.
- Update `INTERFACE=...` to refer to the relevant network interface present on your cluster.
- Update `CLUSTER_MODE="TCP"` to refer to your communication method, either "TCP" or "NVLINK". You can also configure this as an environment variable.
- You may also need to change the `LOCAL_DIRECTORY` and `WORKER_DIR` depending on your filesystem. Make sure that these point to a location to which you have write access and that `LOCAL_DIRECTORY` is accessible from all nodes.
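As a rough illustration, the edited variables might look like the following. Every value here is a placeholder (paths, environment name, network interface); consult the script itself for the full and authoritative set of options.
```bash
# Illustrative placeholder values only; adjust for your own system.
GPU_BDB_HOME=/home/$USER/gpu-bdb                              # path to this repository
CONDA_ENV_PATH=/home/$USER/miniconda3/envs/rapids-gpu-bdb     # your conda environment path
CONDA_ENV_NAME=rapids-gpu-bdb                                 # name of the environment you created
INTERFACE=ib0                                                 # network interface on your cluster
CLUSTER_MODE="NVLINK"                                         # or "TCP"
LOCAL_DIRECTORY=/shared/dask-local-directory                  # must be accessible from all nodes
WORKER_DIR=/tmp/gpu-bdb-workers                               # local scratch space for workers
```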
To start up the cluster on your scheduler node, please run the following from `gpu_bdb/cluster_configuration/`. This will spin up a scheduler and one Dask worker per GPU.
```bash
DASK_JIT_UNSPILL=True CLUSTER_MODE=NVLINK bash cluster-startup.sh SCHEDULER
```
Note: Don't use `DASK_JIT_UNSPILL` when running BlazingSQL queries.
Then run the following on every other node from `gpu_bdb/cluster_configuration/`.
```bash
bash cluster-startup.sh
```
This will spin up one Dask worker per GPU. If you are running on a single node, you will only need to run `bash cluster-startup.sh SCHEDULER`.
If you are using a Slurm cluster, please adapt the example Slurm setup in `gpu_bdb/benchmark_runner/slurm/` which uses `gpu_bdb/cluster_configuration/cluster-startup-slurm.sh`.
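As a very rough sketch of the shape such a batch script can take (everything here is a placeholder: node count, time limit, repository path, and the assumption that `cluster-startup-slurm.sh` handles both scheduler and worker startup; defer to the actual example in `gpu_bdb/benchmark_runner/slurm/`):
```bash
#!/bin/bash
#SBATCH --job-name=gpu-bdb
#SBATCH --nodes=2            # placeholder: number of nodes in the run
#SBATCH --time=04:00:00      # placeholder: wall-clock limit

# Assumes the repository sits on a shared filesystem at $HOME/gpu-bdb and that
# cluster-startup-slurm.sh brings up the scheduler and workers on each node.
cd "$HOME/gpu-bdb/gpu_bdb"
srun --ntasks-per-node=1 bash cluster_configuration/cluster-startup-slurm.sh &
sleep 60  # crude wait for the Dask cluster to come up

python benchmark_runner.py --config_file benchmark_runner/benchmark_config.yaml
```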
## Running the Queries
To run a query, starting from the repository root, go to the query-specific subdirectory. For example, to run q07:
```bash
cd gpu_bdb/queries/q07/
```
The queries assume that they can attach to a running Dask cluster. The cluster address and other benchmark configuration live in a YAML file (`gpu_bdb/benchmark_runner/benchmark_config.yaml`). You will need to fill this out as appropriate if you are not using the Slurm cluster configuration.
```bash
conda activate rapids-gpu-bdb
python gpu_bdb_query_07.py --config_file=../../benchmark_runner/benchmark_config.yaml
```
To profile a gpu-bdb query with Nsight Systems (`nsys`), change `start_local_cluster` in `benchmark_config.yaml` to `True` and run:
```bash
nsys profile -t cuda,nvtx python gpu_bdb_query_07_dask_sql.py --config_file=../../benchmark_runner/benchmark_config.yaml
```
Note: There is no need to start workers with `cluster-startup.sh`, as a `LocalCUDACluster` is started by the `attach_to_cluster` API.
## Performance Tracking
This repository includes optional performance-tracking automation using Google Sheets. To enable logging of query runtimes, set the following on the client node:
```bash
export GOOGLE_SHEETS_CREDENTIALS_PATH=
```
Then configure the `--sheet` and `--tab` arguments in `benchmark_config.yaml`.
### Running all of the Queries
The included `benchmark_runner.py` script will run all queries sequentially. Configuration for this type of end-to-end run is specified in `benchmark_runner/benchmark_config.yaml`.
To run all queries, `cd` to `gpu_bdb/` and run:
```bash
python benchmark_runner.py --config_file benchmark_runner/benchmark_config.yaml
```
By default, this will run each Dask query five times, and, if BlazingSQL queries are enabled in `benchmark_config.yaml`, each BlazingSQL query five times. You can control the number of repeats by changing the `N_REPEATS` variable in the script.
## BlazingSQL
BlazingSQL implementations of all queries are included. BlazingSQL currently supports communication via TCP. To run BlazingSQL queries, please follow the instructions above to create a cluster using `CLUSTER_MODE=TCP`.
## Data Generation
The RAPIDS queries expect [Apache Parquet](http://parquet.apache.org/) formatted data. We provide a [script](gpu_bdb/queries/load_test/gpu_bdb_load_test.py) which can be used to convert bigBench dataGen's raw CSV files to optimally sized Parquet partitions.
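Assuming the load test script follows the same invocation pattern as the per-query scripts (this is an assumption; check the script's argument handling for the authoritative options), a conversion run might look like:
```bash
# Assumed invocation pattern, mirroring the per-query scripts; verify against the
# script itself before relying on it.
cd gpu_bdb/queries/load_test/
python gpu_bdb_load_test.py --config_file=../../benchmark_runner/benchmark_config.yaml
```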