{"id":18400664,"url":"https://github.com/databricks/spark-sql-perf","last_synced_at":"2025-10-24T20:15:00.658Z","repository":{"id":30466345,"uuid":"34020308","full_name":"databricks/spark-sql-perf","owner":"databricks","description":null,"archived":false,"fork":false,"pushed_at":"2022-02-26T18:13:56.000Z","size":577,"stargazers_count":611,"open_issues_count":55,"forks_count":411,"subscribers_count":360,"default_branch":"master","last_synced_at":"2025-10-21T03:43:14.196Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databricks.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-04-15T21:45:41.000Z","updated_at":"2025-10-17T08:52:14.000Z","dependencies_parsed_at":"2022-08-07T15:15:52.884Z","dependency_job_id":null,"html_url":"https://github.com/databricks/spark-sql-perf","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/databricks/spark-sql-perf","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-sql-perf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-sql-perf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-sql-perf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-sql-perf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databricks","download_url":"https://codeload.github.com/databricks/spark-sql-perf/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-sql-perf/sbom","scorecard":{"id":324242,"data":{"date":"2024-06-17","repo":{"name":"github.com/databricks/spark-sql-perf","commit":"28d88190f6a5f6698d32eaf4092334c41180b806"},"scorecard":{"version":"v5.0.0-rc2-62-gda0f2b4e","commit":"da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8"},"score":4,"checks":[{"name":"Code-Review","score":7,"reason":"Found 21/30 approved changesets -- score normalized to 7","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#maintained"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#cii-best-practices"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: 
FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#signed-releases"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#dangerous-workflow"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#token-permissions"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#branch-protection"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#binary-artifacts"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#fuzzing"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#pinned-dependencies"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#vulnerabilities"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to 
analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#security-policy"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 29 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/da0f2b4ebca563a6f7f1ca2f4099c336f7ce8bd8/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-18T02:05:53.231Z","repository_id":30466345,"created_at":"2025-08-18T02:05:53.231Z","updated_at":"2025-08-18T02:05:53.231Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280858142,"owners_count":26403262,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-24T02:00:06.418Z","response_time":73,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T02:35:53.666Z","updated_at":"2025-10-24T20:15:00.613Z","avatar_url":"https://github.com/databricks.png","language":"Scala","readme":"# Spark SQL Performance Tests\n\n[![Build Status](https://travis-ci.org/databricks/spark-sql-perf.svg)](https://travis-ci.org/databricks/spark-sql-perf)\n\nThis is a performance testing framework for [Spark SQL](https://spark.apache.org/sql/) in [Apache Spark](https://spark.apache.org/) 2.2+.\n\n**Note: This README is still under development. Please also check our source code for more information.**\n\n# Quick Start\n\n## Running from command line.\n\n```\n$ bin/run --help\n\nspark-sql-perf 0.2.0\nUsage: spark-sql-perf [options]\n\n  -b \u003cvalue\u003e | --benchmark \u003cvalue\u003e\n        the name of the benchmark to run\n  -m \u003cvalue\u003e | --master \u003cvalue\n        the master url to use\n  -f \u003cvalue\u003e | --filter \u003cvalue\u003e\n        a filter on the name of the queries to run\n  -i \u003cvalue\u003e | --iterations \u003cvalue\u003e\n        the number of iterations to run\n  --help\n        prints this usage text\n        \n$ bin/run --benchmark DatasetPerformance\n```\n\nThe first run of `bin/run` will build the library.\n\n## Build\n\nUse `sbt package` or `sbt assembly` to build the library jar.  \nUse `sbt +package` to build for scala 2.11 and 2.12.\n\n## Local performance tests\nThe framework contains twelve benchmarks that can be executed in local mode. 
## Local performance tests

The framework contains twelve benchmarks that can be executed in local mode. They are organized into three classes, each targeting a different component of Spark:

* [DatasetPerformance](https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/DatasetPerformance.scala) compares the performance of the old RDD API with the newer DataFrame and Dataset APIs. These benchmarks can be launched with the command `bin/run --benchmark DatasetPerformance`.
* [JoinPerformance](https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/JoinPerformance.scala) compares the performance of joining tables of different sizes and shapes with different join types. These benchmarks can be launched with the command `bin/run --benchmark JoinPerformance`.
* [AggregationPerformance](https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/AggregationPerformance.scala) compares the performance of aggregating tables of different sizes using different aggregation types. These benchmarks can be launched with the command `bin/run --benchmark AggregationPerformance`.

# MLlib tests

To run the MLlib tests, run `bin/run-ml yamlfile`, where `yamlfile` is the path to a YAML configuration file describing the tests to run and their parameters.

# TPC-DS

## Set up a benchmark

Before running any query, a dataset needs to be set up by creating a `Benchmark` object. Generating the TPCDS data requires dsdgen to be built and available on the machines. We have a fork of dsdgen that you will need. The fork includes changes to generate TPCDS data to stdout, so that this library can pipe it directly to Spark without intermediate files. Therefore, this library will not work with the vanilla TPCDS kit.

The TPCDS kit needs to be installed on all cluster executor nodes under the same path! It can be found [here](https://github.com/databricks/tpcds-kit).

```
// Generate the data
build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d <dsdgenDir> -s <scaleFactor> -l <location> -f <format>"
```

```
// Create the specified database
sql(s"create database $databaseName")
// Create metastore tables in a specified database for your data.
// Once tables are created, the current database will be switched to the specified database.
tables.createExternalTables(rootDir, "parquet", databaseName, overwrite = true, discoverPartitions = true)
// Or, if you want to create temporary tables
// tables.createTemporaryTables(location, format)

// For CBO only, gather statistics on all columns:
tables.analyzeTables(databaseName, analyzeColumns = true)
```
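The snippet above assumes a `tables` object is already in scope. A sketch of how one might be constructed, using the `TPCDSTables` class from this repo's `tpcds` package (parameter names and values here are illustrative; verify them against the source):

```
// Sketch only -- verify the constructor signature in TPCDSTables.scala.
import com.databricks.spark.sql.perf.tpcds.TPCDSTables

val tables = new TPCDSTables(
  sqlContext,
  dsdgenDir = "/tmp/tpcds-kit/tools", // dsdgen fork location on every node
  scaleFactor = "1",                  // scale factor, in GB
  useDoubleForDecimal = false,        // true to replace DecimalType with DoubleType
  useStringForDate = false)           // true to replace DateType with StringType
```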
## Run benchmarking queries

After setup, use the `runExperiment` function to run benchmarking queries and record query execution times. Taking TPC-DS as an example, you can start an experiment with:

```
import com.databricks.spark.sql.perf.tpcds.TPCDS

val tpcds = new TPCDS (sqlContext = sqlContext)
// Set:
val databaseName = ... // name of database with TPCDS data.
val resultLocation = ... // place to write results
val iterations = 1 // how many iterations of queries to run.
val queries = tpcds.tpcds2_4Queries // queries to run.
val timeout = 24*60*60 // timeout, in seconds.
// Run:
sql(s"use $databaseName")
val experiment = tpcds.runExperiment(
  queries,
  iterations = iterations,
  resultLocation = resultLocation,
  forkThread = true)
experiment.waitForFinish(timeout)
```

By default, the experiment is started in a background thread. Every experiment run (i.e. every call of `runExperiment`) is identified by the timestamp of its start time. Performance results are stored as JSON in a sub-directory named after that timestamp under the configured results location `spark.sql.perf.results` (for example `/tmp/results/timestamp=1429213883272`).

## Retrieve results

While the experiment is running you can use `experiment.html` to get a summary, or `experiment.getCurrentResults` to get the complete current results. Once the experiment is complete, you can still access `experiment.getCurrentResults`, or you can load the results from disk.

```
// Get all experiments results.
val resultTable = spark.read.json(resultLocation)
resultTable.createOrReplaceTempView("sqlPerformance")
sqlContext.table("sqlPerformance")
// Get the result of a particular run by specifying the timestamp of that run.
sqlContext.table("sqlPerformance").filter("timestamp = 1429132621024")
// or
val specificResultTable = spark.read.json(experiment.resultPath)
```

You can get a basic summary by running:

```
experiment.getCurrentResults // or: spark.read.json(resultLocation).filter("timestamp = 1429132621024")
  .withColumn("Name", substring(col("name"), 2, 100))
  .withColumn("Runtime", (col("parsingTime") + col("analysisTime") + col("optimizationTime") + col("planningTime") + col("executionTime")) / 1000.0)
  .select('Name, 'Runtime)
```
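Building on the summary above, the same columns can be aggregated across iterations. An illustrative follow-up (column names are those used in the snippet above; nothing else is assumed):

```
// Illustrative: average total runtime per query across iterations.
import org.apache.spark.sql.functions._

experiment.getCurrentResults
  .withColumn("Runtime", (col("parsingTime") + col("analysisTime") +
    col("optimizationTime") + col("planningTime") + col("executionTime")) / 1000.0)
  .groupBy("name")
  .agg(avg("Runtime").as("avgRuntimeSeconds"))
  .orderBy("name")
  .show(200, truncate = false)
```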
# TPC-H

TPC-H can be run similarly to TPC-DS by replacing `tpcds` with `tpch`. Take a look at the data generator and the `tpch_run` notebook code below.

## Running in a Databricks workspace (or spark-shell)

There are example notebooks in `src/main/notebooks` for running TPCDS and TPCH in the Databricks environment. _These scripts can also be run from the spark-shell command line with minor modifications, using `:load file_name.scala`._

### TPC-multi_datagen notebook

This notebook (or Scala script) can be used to generate both TPCDS and TPCH data at selected scale factors. It is a newer version of the `tpcds_datagen` notebook below. To use it:

* Edit the config variables at the top of the script.
* Run the whole notebook.

### tpcds_datagen notebook

This notebook can be used to install dsdgen on all worker nodes, run data generation, and create the TPCDS database. Note that because of the way dsdgen is installed, it will not work on an autoscaling cluster, and `num_workers` has to be updated to the number of worker instances on the cluster. Data generation may also break if any of the workers is killed: the restarted worker container will no longer have `dsdgen`.

### tpcds_run notebook

This notebook can be used to run TPCDS queries.

For running parallel TPCDS streams:
* Create a cluster and attach the spark-sql-perf library to it.
* Create a Job using the notebook, attaching it to the created cluster as an "existing cluster".
* Allow concurrent runs of the created job.
* Launch the appropriate number of Runs of the Job to run in parallel on the cluster.

### tpch_run notebook

This notebook can be used to run TPCH queries. Data needs to be generated first.
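For reference, a TPC-H experiment analogous to the TPC-DS example earlier might look like the sketch below. The `TPCH` class lives in this repo's `tpch` package; the `queries` member name is an assumption to verify against the source:

```
// Sketch of a TPC-H run, mirroring the TPC-DS example above.
// `tpch.queries` is an assumed member name -- check the tpch package source.
import com.databricks.spark.sql.perf.tpch.TPCH

val tpch = new TPCH(sqlContext = sqlContext)
sql(s"use $databaseName")
val experiment = tpch.runExperiment(
  tpch.queries,
  iterations = 1,
  resultLocation = resultLocation,
  forkThread = true)
experiment.waitForFinish(24 * 60 * 60) // one day, in seconds
```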