{"id":30192298,"url":"https://github.com/qbeast-io/qbeast-spark","last_synced_at":"2025-08-12T23:01:52.662Z","repository":{"id":38312779,"uuid":"409655014","full_name":"Qbeast-io/qbeast-spark","owner":"Qbeast-io","description":"Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling.  Big Data, free from the unnecessary!","archived":false,"fork":false,"pushed_at":"2025-01-24T14:23:14.000Z","size":39136,"stargazers_count":228,"open_issues_count":37,"forks_count":24,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-06-07T23:53:00.864Z","etag":null,"topics":["big-data","data-lakehouse","datasource","sampling","scala","spark","spark-sql"],"latest_commit_sha":null,"homepage":"https://qbeast.io/qbeast-our-tech/","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Qbeast-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-09-23T15:54:12.000Z","updated_at":"2025-05-29T21:34:03.000Z","dependencies_parsed_at":"2024-01-22T12:36:25.435Z","dependency_job_id":"28b7fc7a-45d4-4633-8738-16687363a1a9","html_url":"https://github.com/Qbeast-io/qbeast-spark","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/Qbeast-io/qbeast-spark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qbeast-io%2Fqbeast-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qbeast-io%2Fqbeast-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/Qbeast-io%2Fqbeast-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qbeast-io%2Fqbeast-spark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Qbeast-io","download_url":"https://codeload.github.com/Qbeast-io/qbeast-spark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qbeast-io%2Fqbeast-spark/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270149337,"owners_count":24535727,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-12T02:00:09.011Z","response_time":80,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","data-lakehouse","datasource","sampling","scala","spark","spark-sql"],"created_at":"2025-08-12T23:01:02.273Z","updated_at":"2025-08-12T23:01:52.575Z","avatar_url":"https://github.com/Qbeast-io.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\t\u003cimg src=\"https://raw.githubusercontent.com/Qbeast-io/qbeast-spark/main/docs/images/Qbeast-spark.png\" alt=\"Qbeast spark project\"/\u003e\n\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n\n[![Users Documentation](https://img.shields.io/badge/-Users_Docs-lightgreen?style=for-the-badge\u0026logo=readthedocs)](./docs)\n[![Developers 
Documentation](https://img.shields.io/badge/_-Developer's_docs_(docs.qbeast.io)-ff7?style=for-the-badge\u0026logo=readthedocs)](https://docs.qbeast.io/)\n\u003cbr /\u003e\n[![API](https://img.shields.io/badge/-Check_the_API-orange?style=for-the-badge)](./docs/QbeastTable.md)\n[![Notebook](https://img.shields.io/badge/_-Jupyter_Notebook_example-0053B3?style=for-the-badge\u0026logo=jupyter)](./docs/sample_pushdown_demo.ipynb)\n\u003cbr /\u003e\n[![Slack](https://img.shields.io/badge/_-Slack-blue?style=for-the-badge\u0026logo=slack)](https://join.slack.com/t/qbeast-users/shared_invite/zt-w0zy8qrm-tJ2di1kZpXhjDq_hAl1LHw)\n[![Academy](https://img.shields.io/badge/_-Medium-yellowgreen?style=for-the-badge\u0026logo=medium)](https://qbeast.io/academy-courses-index/)\n[![Website](https://img.shields.io/badge/_-Website-dc005f?style=for-the-badge\u0026logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAE0AAABNCAMAAADU1xmCAAAABGdBTUEAALGPC/xhBQAAACBjSFJNAAB6JgAAgIQAAPoAAACA6AAAdTAAAOpgAAA6mAAAF3CculE8AAAC6FBMVEVHcEyCvU6qx1Ckx1CatGSNwk+qxlDmlouPuJHMwE/bvn7evU/geImbyVDxrDn2ulzjQYTkQYXjQYTjQYTjQYTjQYTjQYTjQYTjQYTjQYTjQYTjQYTeR4ihx1D3tlb3sUf4qkH3slbysETmQofjQYTjQYT3qjv2rk35u2LvwWr5t1r41JjjQYTjQYT5yJL6yIH6zIn83rL705j94rzjQYTjQYTKqpj337LjQYT3qFHkRYf847/lRYf95sX5slWd4dxZxrlCtp58vp2Yv5Kkp5G+up0Wt6k4uqlSvKfQ272o0YWt4NTTwVTa0IfjQYT72aP61KX1g7P6tdH0ibfzcaj6p8jzXZ36ncP7rs37utRNSX6mdqY3NnaVjJU8OndBQHlaVoHAtaVUUYBhXoRZVoJpZYVzbouHgJCNxE+eyFCqx1CzxFC6w1DEwVDMwE/Wvk/evVDpu1DjQYT4r034s1D4t1n5vWf5w3X6xn76yoX7z5H705n71Z7616P826z837T84bn848D958j+6dBfv6pzwKeBxKaPxqebyamkyqerzKezzKe7z6lKvKnd59ygzYWDvk/97daj1cPJ3813uE/+8uF1z8OBu5BusGtUpExor07+8Nv+9elfq4U3lU6sx1XB04Td163w37Ts8++ZymLl27HE0qnP1KvT1Kv++fH++vn3xc3drtH////5q273nk72ttPUnMn3pE74u9XJfLjRrc77za/75/LDdbb4qlD7wNj2xd29crXY1dHy8ub2lUzyr8+1bLLzrqn3qpL3moH2jGT1f0zzeEj0hEr1iEr1jUyvZrDzcU3yZ0fuqc3spMronsf31+jDgbPil8TekcGnX63Pg7rajcDVib6gXKzdn7CXVamJTKWhdbDJvs6sop3QqZShmJmmnZo+PHhhXYS0bLOknq+XkJack5exp55taYqRiJO3rKCBe4+JgZFKSHxSUH+yqcB8do2/s6NzbYp4cota
V4JoZIVjODowAAAAaHRSTlMAZfugL62AAyHGFtEH4HxyJl+Nrcnb4dK8onJIEVzEl2RFL1Tq/v7dpYZUOJj0IOi8jHZfHGoM5vn0M8M8qexNquLlz6du7f3538KNrumB27p+3I1wqlyWtOBK07r+mOAu39af/PbmWr6y1IwAAAABYktHRK0gYsIdAAAACXBIWXMAABYlAAAWJQFJUiTwAAAAB3RJTUUH5gMVDhs12bWt8QAAAAFvck5UAc+id5oAAAcCSURBVFjDrdh5XBRVHADwVx5jOMpeLHslyuxGwC6iDgtuhJWInWtadtCFWSndrrcdCmgiECW4ooblkRarA4gCIZCkqOWVZiomooZRaQoe6b+992Zn9piZBXN/f/GB3S/v+L3fOwC4xbhjambmnb16gyBFn8zMN996uy8RHK0f1t65KyQ4XC+svds/OFxvVnuvPxkUbgCrvT8wOF11a9OC0te+nBb6P74skyuUqjC1Wh2u0er0BgDu5rRBgs9GDBw8JFKyySSlNTp8wqSS8do0wTzcA2P69BlRg+6NDo2J9WVlZotDEGrCo8X5a5FRWJs5c9asWbNnz5k7ND5+2PARsXEEoLUmh0moJXjGTagBa4SPNnfevA8+/OjjjxPjDOG8kBSu0ig1KjXEjSTow2tWgUaAwSNFtPmJIYYwJBm1lIxblKQtgQLWTE6LEh3ryCgRbcF9JB2WpE0WVhFei5CYuggRLet+QBtQ460h/dwRYiWBdQCvDZFKhZgUoZY9DFKjHngQxkMoRo8enZo6hq1ISOsPKJu4RsQNEmo5I0alpaWNfXjhokWfLM7NXZKXX/DIo1N5LZIIs0hxZLRAy855LG3s2E8XYm0x1AoKP/uc16IBBedIJsGBmKECbWmRR1tSXJi3bBmvPU4AFUpmg9TgxcULNOdyqJUszi3Kyy1ZsXLVKl6DGIkTUiNZ08lof+2L0uWwbau/RLHSS3siOhYAWo04nXTBGDLUT3OWLkQ9XY0wTvtqzdp1iYiToXVskRg6QNAgJN5fW8+OW0nJilVI+3rNmrUbNqyDHFylFO6rhGYwJgBiuJ8GOW5O8/MLNn7z7VqsrUsMhctUizi9uGZ3OOSAiB3nq5UV8lrBRo+2YFws+v9QCxfFrPBPJjmc2yd9NVexmJaVNRIurQTJxslRbdXCgjt+03wfbXORUHMy5RVRJCDRvCrFNJSNDhmwySdUblrgo20p8tOWMoyzoqIqlG1cEi3EaHaCCOqprZWVrhxvbVt1rreW7WIYpgZqKSQwoCyRCzU8BHIQR02AWi3j9Na+25zLaxvKGBTlUKsbzE6rSFeVqM1WkPz0dqjVM0xZjpfW0Mhp32OLKd2BNLgJ6lEGC5eXkc1EaiLSmtA3nF5aQyPWChl31GBteggwoK3IJjpsdjijE7BWi75S5qX9UJyXX+xi2+XMySpntRh28gQDp2czx/rMdqzV4++5Sp281tBYs3NXc/PuPXv3/PjTvh2sBvcGM/yeQnQSZMA6ntWa6mtxQ8p4bX/DgYMwDuz11uDA6cSmwY40AhiedWuHfj586AgEXby2/5cDAi2F7ZTKX1OgzROuvIke7fDRo79W1rp47dhxgTYDABv8Ypi/hrqvhtpzvtqJlpYtLZx28reDB/f6agSQsc3oqXZ8G6+d2unfNgIng0Wqp4G1U7sEmk1Ms7O/9B83f+1Us/+4JbOdEskQg8+cimmtJ5v95pQSK5i4xts8+SbVttbTuzmtDuebXWxvsLElhERrYesR6ba1tlbvZrWaeXgtoCJiFmymSfi3BPX89q0ul6R2uq3tzNl9UKvIZsrxOlWLbqoqNgvTX4DFjWlqqqwUaseq286c+/0M5PaVO5kyXENQujmExzyFe6G2n29DmuuIn/ZHdVvHn3+dQ9rfZ3fBNZeD6xsu5cJbkp4tSeDF9vYLhfUuptZH29LWcfHiS5z2TyMqcVV17oIkskGTFnyAB+ntMDrWMy6P1lLdceny
ZW8NcXOqUth1JXoWwRs3hRvX3tl1hdNOVF+5evWSv3atkKmrcm8LJpE9Cyc1ykPUuM7OzgvHkbb5ytXr1/8V0a7loP2UhplgEj+JhLMpB15GWGfXjdOwVTdvXpDQrqG9PsBBBC8HI1xdryCuq+vGjfPnA2hj3N1RSRyS8G6vBSCjswfaJHjcDwtwRmL/FeprevfapBB3P5WSZ0tUMh3o3J7enTapn7vsWGhJzYqPsujcnt4VUEMYZWIvhdJhS8KXUMhlvBpAS4XnSiopcD+5QxxsHewsOfk1KW0Mqq7s5dUc+B5vZ6+kqAMZk0W1VNhL0sxdXnWBOffnlDT2XvfXkAVsnquwiQrMKdiPWXSozpDpk9/waKlT0H3ZYPa+oktd3bjQ4Q+bHGqdFV3AgCFjCooM9hGBViT5XvWNdGBOz71bWLR60utiB2G5Rvh0EG4IzNEazxOBRiFPltEG2qZPMIeLvEKgItLdcxylFn7J60et3edP5u5efEid0SERJqWNy6Qe5gl+99GIdUytkHlnUs/yBI+7Qa706bFFY+c3O0LpuIU84W+GyVSCHYZOrqd9RpvU3EqedD8UKp88ud2HPa+3JodJd7ua+2qPLGNyEJ4dZe400tAgGGHDL36KIL1Ng2SLw+JOtv8Aax/72g6rujwAAAAldEVYdGRhdGU6Y3JlYXRlADIwMjItMDMtMjFUMTQ6Mjc6NTArMDA6MDBUKA+xAAAAJXRFWHRkYXRlOm1vZGlmeQAyMDIyLTAzLTIxVDE0OjI3OjUwKzAwOjAwJXW3DQAAAABJRU5ErkJggg==)](https://qbeast.io)\n\n---\n\n**Qbeast Spark** is an Apache Spark extension that enhances data processing in [**Data Lakehouses**](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). It provides advanced **multi-dimensional filtering** and **efficient data sampling**, enabling faster and more accurate queries. The extension also maintains ACID properties for data integrity and reliability, making it ideal for handling large-scale data efficiently.\n\n[![apache-spark](https://img.shields.io/badge/apache--spark-3.5.x-blue)](https://spark.apache.org/releases/spark-release-3-5-0.html) \n[![apache-hadoop](https://img.shields.io/badge/apache--hadoop-3.3.x-blue)](https://hadoop.apache.org/release/3.3.1.html)\n[![delta-core](https://img.shields.io/badge/delta--core-3.1.0-blue)](https://github.com/delta-io/delta/releases/tag/v2.4.0)\n[![codecov](https://codecov.io/gh/Qbeast-io/qbeast-spark/branch/main/graph/badge.svg?token=8WO7HGZ4MW)](https://codecov.io/gh/Qbeast-io/qbeast-spark)\n\n\u003c/div\u003e\n\n## Features\n\n1. **Data Lakehouse** - Data lake with **ACID** properties, thanks to the underlying [Delta Lake](https://delta.io/) architecture\n\n\n2. 
**Multi-column indexing** - **Filter** your data by **multiple columns** using the Qbeast Format.\n   \n\n3. **Improved Sampling operator** - **Read** statistically significant **subsets** of files.\n   \n\n4. **Table Tolerance** - Model for the trade-off between sampling fraction and **query accuracy**.\n\n\n## Query example with Qbeast\n\n| ![Demo for Delta format GIF](docs/images/spark_delta_demo.gif) | ![Demo for Qbeast format GIF](docs/images/spark_qbeast_demo.gif) |\n|:---------------------------------------------------------------:|:---------------------------------------------------------------:|\n\nAs you can see above, the Qbeast Spark extension allows **faster** queries with statistically **accurate** sampling.\n\n| Format | Execution Time |   Result  |\n|--------|:--------------:|:---------:|\n| Delta  |  ~ 151.3 sec.  | 37.869383 |\n| Qbeast |   ~ 6.6 sec.   | 37.856333 |\n\nIn this example, **1% sampling** returns the result **more than 22 times faster** than the Delta format, with an **error of 0.034%**.\n\n## Documentation\nExplore the documentation for more details:\n- [Quickstart for Qbeast-Spark](./docs/Quickstart.md)\n- [Data Lakehouse with Qbeast Format](./docs/QbeastFormat.md)\n- [OTree Algorithm](./docs/OTreeAlgorithm.md)\n- [QbeastTable](./docs/QbeastTable.md)\n- [Columns To Index Selector](./docs/ColumnsToIndexSelector.md)\n- [Recommendations for different Cloud Storage systems](./docs/CloudStorages.md)\n- [Advanced configurations](./docs/AdvancedConfiguration.md)\n- [Qbeast Metadata](./docs/QbeastFormat.md)\n- [FAQ: Frequently Asked Questions](./docs/FAQ.md)\n\n# Quickstart\nYou can run the qbeast-spark application locally on your computer, or use the Docker image we have prepared with the dependencies.\nYou can find it in the [Packages section](https://github.com/orgs/Qbeast-io/packages?repo_name=qbeast-spark).\n\n### Pre: Install **Spark**\nDownload **Spark 3.5.0 with Hadoop 3.3.4**, extract it, and create the `SPARK_HOME` environment 
variable:\u003cbr /\u003e\n\n\u003e:information_source: **Note**: You can use Hadoop 2.7 if desired, but you may run into problems with some cloud providers' storage. Read more about it [here](docs/CloudStorages.md).\n\n```bash\nwget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz\n\ntar -xzvf spark-3.5.0-bin-hadoop3.tgz\n\nexport SPARK_HOME=$PWD/spark-3.5.0-bin-hadoop3\n```\n### 1. Launch a spark-shell\n\n**Inside the project folder**, launch a **spark shell** with the required dependencies:\n\n```bash\n$SPARK_HOME/bin/spark-shell \\\n--packages io.qbeast:qbeast-spark_2.12:0.7.0,io.delta:delta-spark_2.12:3.1.0 \\\n--conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension \\\n--conf spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog\n```\n\n### 2. Indexing a dataset\n\n**Read** the **CSV** source file placed inside the project.\n\n```scala\nval csvDF = spark.read.format(\"csv\").\n  option(\"header\", \"true\").\n  option(\"inferSchema\", \"true\").\n  load(\"./src/test/resources/ecommerce100K_2019_Oct.csv\")\n```\n\nIndex the dataset by writing it in the **qbeast** format, specifying the columns to index.\n\n```scala\nval tmpDir = \"/tmp/qbeast-spark\"\n\ncsvDF.write.\n  mode(\"overwrite\").\n  format(\"qbeast\").\n  option(\"columnsToIndex\", \"user_id,product_id\").\n  save(tmpDir)\n```\n\n#### SQL Syntax\nYou can create a Qbeast table with the help of `QbeastCatalog`.\n\n```scala\nspark.sql(\n  \"CREATE TABLE student (id INT, name STRING, age INT) \" +\n    \"USING qbeast OPTIONS ('columnsToIndex'='id')\")\n\n```\n\nUse **`INSERT INTO`** to add records to the new table. 
It will update the index **dynamically** as new data is inserted.\n\n```scala\nval studentsDF = Seq((1, \"Alice\", 34), (2, \"Bob\", 36)).toDF(\"id\", \"name\", \"age\")\n\nstudentsDF.write.mode(\"overwrite\").saveAsTable(\"visitor_students\")\n\n// AS SELECT FROM\nspark.sql(\"INSERT INTO table student SELECT * FROM visitor_students\")\n\n// VALUES\nspark.sql(\"INSERT INTO table student VALUES (3, 'Charlie', 37)\")\n\n// SHOW\nspark.sql(\"SELECT * FROM student\").show()\n// +---+-------+---+\n// | id|   name|age|\n// +---+-------+---+\n// |  1|  Alice| 34|\n// |  2|    Bob| 36|\n// |  3|Charlie| 37|\n// +---+-------+---+\n```\n\n### 3. Load the dataset\nLoad the newly indexed dataset.\n```scala\nval qbeastDF =\n  spark.\n    read.\n    format(\"qbeast\").\n    load(tmpDir)\n```\n\n### 4. Examine the Query plan for sampling\n**Sample the data** and notice how the sampler is converted into filters and pushed down to the source!\n\n```scala\nqbeastDF.sample(0.1).explain(true)\n```\nGo to the [Quickstart](./docs/Quickstart.md) or [notebook](docs/sample_pushdown_demo.ipynb) for more details.\n\n### 5. Interact with the format\n\nGet **insights** into the data using the `QbeastTable` interface!\n\n```scala\nimport io.qbeast.spark.QbeastTable\n\nval qbeastTable = QbeastTable.forPath(spark, tmpDir)\n\nqbeastTable.getIndexMetrics()\n\n```\n\n### 6. 
Optimize the table\n\n**Optimize** is an expensive operation that consists of **rewriting part of the files** to achieve a **better layout** and **improve query performance**.\n\nTo minimize the write amplification of this command, **we execute it on subsets of the table**, such as specific `Revision` IDs or specific files.\n\n\u003e Read more about `Revision` and find an example [here](./docs/QbeastFormat.md).\n\n#### Optimize API\nThere are three ways to execute the `optimize` operation:\n\n```scala\nqbeastTable.optimize() // Optimizes the latest available Revision.\n// This does NOT optimize previous Revisions.\n\nqbeastTable.optimize(2L) // Optimizes Revision number 2.\n\nqbeastTable.optimize(Seq(\"file1\", \"file2\")) // Optimizes the specified files.\n```\n\n**If you want to optimize the full table, you must loop through `revisions`**:\n\n```scala\nval revisions = qbeastTable.revisionsIDs() // Get all the Revision IDs available in the table.\nrevisions.foreach(revision =\u003e\n  qbeastTable.optimize(revision)\n)\n```\n\nGo to the [QbeastTable documentation](./docs/QbeastTable.md) for more detailed information.\n\n### 7. 
Visualize index\nUse the [Python index visualizer](./utils/visualizer/README.md) on your indexed table to visually examine the index structure and gather sampling metrics.\n\n# Dependencies and Version Compatibility\n| Version |   Spark   |  Hadoop   | Delta Lake |\n|-------|:---------:|:---------:|:----------:|\n| 0.1.0 |   3.0.0   |   3.2.0   |   0.8.0    |\n| 0.2.0 |   3.1.x   |   3.2.0   |   1.0.0    |\n| 0.3.x |   3.2.x   |   3.3.x   |   1.2.x    |\n| 0.4.x |   3.3.x   |   3.3.x   |   2.1.x    |\n| 0.5.x |   3.4.x   |   3.3.x   |   2.4.x    |\n| 0.6.x |   3.5.x   |   3.3.x   |   3.1.x    |\n| **0.7.x** | **3.5.x** | **3.3.x** | **3.1.x**  |\n\n\nCheck [here](https://docs.delta.io/latest/releases.html) for **Delta Lake** and **Apache Spark** version compatibility.\n\n# Contribution Guide\n\nSee the [Contribution Guide](./CONTRIBUTING.md) for more information.\n\n# License\nSee [LICENSE](./LICENSE).\n\n# Code of conduct\n\nSee the [Code of conduct](./CODE_OF_CONDUCT.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqbeast-io%2Fqbeast-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqbeast-io%2Fqbeast-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqbeast-io%2Fqbeast-spark/lists"}