Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
- Host: GitHub
- URL: https://github.com/qbeast-io/qbeast-spark
- Owner: Qbeast-io
- License: apache-2.0
- Created: 2021-09-23T15:54:12.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2025-01-24T14:23:14.000Z (11 months ago)
- Last Synced: 2025-06-07T23:53:00.864Z (7 months ago)
- Topics: big-data, data-lakehouse, datasource, sampling, scala, spark, spark-sql
- Language: Scala
- Homepage: https://qbeast.io/qbeast-our-tech/
- Size: 37.3 MB
- Stars: 228
- Watchers: 10
- Forks: 24
- Open Issues: 37
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
## README
[Documentation](./docs) · [docs.qbeast.io](https://docs.qbeast.io/) · [QbeastTable](./docs/QbeastTable.md) · [Sample pushdown demo notebook](./docs/sample_pushdown_demo.ipynb) · [Slack community](https://join.slack.com/t/qbeast-users/shared_invite/zt-w0zy8qrm-tJ2di1kZpXhjDq_hAl1LHw) · [Qbeast Academy](https://qbeast.io/academy-courses-index/) · [qbeast.io](https://qbeast.io)
---
**Qbeast Spark** is an Apache Spark extension that enhances data processing in [**Data Lakehouses**](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). It provides advanced **multi-dimensional filtering** and **efficient data sampling**, enabling faster and more accurate queries. The extension also maintains ACID properties for data integrity and reliability, making it ideal for handling large-scale data efficiently.
[Apache Spark 3.5.0](https://spark.apache.org/releases/spark-release-3-5-0.html) · [Apache Hadoop 3.3.1](https://hadoop.apache.org/release/3.3.1.html) · [Delta Lake 2.4.0](https://github.com/delta-io/delta/releases/tag/v2.4.0) · [Code coverage](https://codecov.io/gh/Qbeast-io/qbeast-spark)
## Features
1. **Data Lakehouse**: a data lake with **ACID** properties, thanks to the underlying [Delta Lake](https://delta.io/) architecture.
2. **Multi-column indexing**: **filter** your data on **multiple columns** using the Qbeast format (see the sketch after this list).
3. **Improved Sampling operator**: **read** statistically significant **subsets** of files.
4. **Table Tolerance**: a model for the trade-off between sampling fraction and **query accuracy**.
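For instance, once a table has been written in the qbeast format and indexed on `user_id` and `product_id` (as in the Quickstart below), filters on those columns can be pushed down so that files which cannot contain matching rows are skipped. A minimal sketch, with a hypothetical path and ID values:
```scala
// Minimal sketch: the path and the ID values below are placeholders.
val indexedDF = spark.read.format("qbeast").load("/tmp/qbeast-spark")
indexedDF
  .filter("user_id = 536365256 AND product_id = 8800802") // both columns are indexed
  .show()
```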
## Query example with Qbeast
As the comparison below shows, the Qbeast Spark extension allows **faster** queries with statistically **accurate** sampling.
| Format | Execution Time | Result |
|--------|:--------------:|:---------:|
| Delta | ~ 151.3 sec. | 37.869383 |
| Qbeast | ~ 6.6 sec. | 37.856333 |
In this example, **1% sampling** returns the result about **22x faster** than with the Delta format, with an **error of 0.034%**.
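The figures above correspond to an aggregation computed over a **1% sample** of the Qbeast table versus a full scan of the Delta table. A minimal sketch of a query of that shape (the paths and the `price` column are hypothetical, not the exact benchmark query):
```scala
import org.apache.spark.sql.functions.avg

// Sketch only: paths and column name are placeholders, not the benchmark setup.
val qbeastDF = spark.read.format("qbeast").load("/path/to/qbeast-table")
qbeastDF.sample(0.01).agg(avg("price")).show() // 1% sample: only the needed files are read

val deltaDF = spark.read.format("delta").load("/path/to/delta-table")
deltaDF.agg(avg("price")).show() // baseline: full scan of the table
```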
## Documentation
Explore the documentation for more details:
- [Quickstart for Qbeast-Spark](./docs/Quickstart.md)
- [Data Lakehouse with Qbeast Format](./docs/QbeastFormat.md)
- [OTree Algorithm](./docs/OTreeAlgorithm.md)
- [QbeastTable](./docs/QbeastTable.md)
- [Columns To Index Selector](./docs/ColumnsToIndexSelector.md)
- [Recommendations for different Cloud Storage systems](./docs/CloudStorages.md)
- [Advanced configurations](./docs/AdvancedConfiguration.md)
- [Qbeast Metadata](./docs/QbeastFormat.md)
- [FAQ: Frequently Asked Questions](./docs/FAQ.md)
# Quickstart
You can run the qbeast-spark application locally on your computer, or use a Docker image we have already prepared with the dependencies.
You can find it in the [Packages section](https://github.com/orgs/Qbeast-io/packages?repo_name=qbeast-spark).
### Pre: Install **Spark**
Download **Spark 3.5.0 with Hadoop 3.3.4**, extract it, and set the `SPARK_HOME` environment variable:
>:information_source: **Note**: You can use Hadoop 2.7 if desired, but you may run into issues with some cloud providers' storage; read more about it [here](docs/CloudStorages.md).
```bash
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzvf spark-3.5.0-bin-hadoop3.tgz
export SPARK_HOME=$PWD/spark-3.5.0-bin-hadoop3
```
### 1. Launch a spark-shell
**Inside the project folder**, launch a **spark shell** with the required dependencies:
```bash
$SPARK_HOME/bin/spark-shell \
--packages io.qbeast:qbeast-spark_2.12:0.7.0,io.delta:delta-spark_2.12:3.1.0 \
--conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog
```
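If you are building a standalone application instead of using `spark-shell`, the same extension and catalog settings can be passed when the `SparkSession` is created. A minimal sketch (the application name and local master are arbitrary choices; the qbeast and delta artifacts must be on the classpath, e.g. via `--packages` or your build tool):
```scala
import org.apache.spark.sql.SparkSession

// Same settings as the spark-shell flags above, applied programmatically.
val spark = SparkSession.builder()
  .appName("qbeast-quickstart") // arbitrary name
  .master("local[*]")           // local run; drop this when submitting to a cluster
  .config("spark.sql.extensions",
    "io.qbeast.spark.internal.QbeastSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
    "io.qbeast.spark.internal.sources.catalog.QbeastCatalog")
  .getOrCreate()
```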
### 2. Indexing a dataset
**Read** the **CSV** source file placed inside the project.
```scala
val csvDF = spark.read.format("csv").
option("header", "true").
option("inferSchema", "true").
load("./src/test/resources/ecommerce100K_2019_Oct.csv")
```
Index the dataset by writing it in the **qbeast** format, specifying the columns to index.
```scala
val tmpDir = "/tmp/qbeast-spark"
csvDF.write.
mode("overwrite").
format("qbeast").
option("columnsToIndex", "user_id,product_id").
save(tmpDir)
```
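Later batches can be appended to the same table with the standard `append` mode; the index is updated with the new data. A minimal sketch, assuming a hypothetical `newRowsDF` with the same schema as `csvDF`:
```scala
// Sketch: newRowsDF is a placeholder DataFrame sharing csvDF's schema.
newRowsDF.write.
  mode("append").
  format("qbeast").
  save(tmpDir)
```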
#### SQL Syntax
You can also create a Qbeast table with SQL through the `QbeastCatalog`.
```scala
spark.sql(
"CREATE TABLE student (id INT, name STRING, age INT) " +
"USING qbeast OPTIONS ('columnsToIndex'='id')")
```
Use **`INSERT INTO`** to add records to the new table. The index is updated **dynamically** as new data is inserted.
```scala
val studentsDF = Seq((1, "Alice", 34), (2, "Bob", 36)).toDF("id", "name", "age")
studentsDF.write.mode("overwrite").saveAsTable("visitor_students")
// AS SELECT FROM
spark.sql("INSERT INTO table student SELECT * FROM visitor_students")
// VALUES
spark.sql("INSERT INTO table student VALUES (3, 'Charlie', 37)")
// SHOW
spark.sql("SELECT * FROM student").show()
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Alice| 34|
| 2| Bob| 36|
| 3|Charlie| 37|
+---+-------+---+
```
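The table registered through the `QbeastCatalog` can also be read back with the DataFrame API, for example:
```scala
// Read the catalog table as a DataFrame and filter on the indexed column.
val studentDF = spark.table("student")
studentDF.filter("id = 3").show()
```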
### 3. Load the dataset
Load the newly indexed dataset.
```scala
val qbeastDF =
spark.
read.
format("qbeast").
load(tmpDir)
```
### 4. Examine the Query plan for sampling
**Sample the data** and notice how the sampler is converted into filters and pushed down to the data source!
```scala
qbeastDF.sample(0.1).explain(true)
```
Go to the [Quickstart](./docs/Quickstart.md) or [notebook](docs/sample_pushdown_demo.ipynb) for more details.
### 5. Interact with the format
Get **insights** into the table using the `QbeastTable` interface!
```scala
import io.qbeast.spark.QbeastTable
val qbeastTable = QbeastTable.forPath(spark, tmpDir)
qbeastTable.getIndexMetrics()
```
### 6. Optimize the table
**Optimize** is an expensive operation that consists of **rewriting part of the files** to achieve a **better layout** and **improve query performance**.
To minimize the write amplification of this command, **we execute it on subsets of the table**, such as specific `Revision IDs` or files.
> Read more about `Revision` and find an example [here](./docs/QbeastFormat.md).
#### Optimize API
These are the 3 ways of executing the `optimize` operation:
```scala
qbeastTable.optimize() // Optimizes the last Revision Available.
// This does NOT include previous Revision's optimizations.
qbeastTable.optimize(2L) // Optimizes the Revision number 2.
qbeastTable.optimize(Seq("file1", "file2")) // Optimizes the specific files
```
**If you want to optimize the full table, you must loop through `revisions`**:
```scala
val revisions = qbeastTable.revisionsIDs() // Get all the Revision ID's available in the table.
revisions.foreach(revision =>
qbeastTable.optimize(revision)
)
```
Go to [QbeastTable documentation](./docs/QbeastTable.md) for more detailed information.
### 7. Visualize index
Use [Python index visualizer](./utils/visualizer/README.md) for your indexed table to visually examine index structure and gather sampling metrics.
# Dependencies and Version Compatibility
| Version | Spark | Hadoop | Delta Lake |
|-------|:---------:|:---------:|:----------:|
| 0.1.0 | 3.0.0 | 3.2.0 | 0.8.0 |
| 0.2.0 | 3.1.x | 3.2.0 | 1.0.0 |
| 0.3.x | 3.2.x | 3.3.x | 1.2.x |
| 0.4.x | 3.3.x | 3.3.x | 2.1.x |
| 0.5.x | 3.4.x | 3.3.x | 2.4.x |
| 0.6.x | 3.5.x | 3.3.x | 3.1.x |
| **0.7.x** | **3.5.x** | **3.3.x** | **3.1.x** |
Check [here](https://docs.delta.io/latest/releases.html) for **Delta Lake** and **Apache Spark** version compatibility.
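To pin a compatible combination in a build instead of passing `--packages` on the command line, the published artifacts can be declared as dependencies. A minimal sbt sketch matching the 0.7.x row above:
```scala
// build.sbt fragment (sketch): versions taken from the 0.7.x row of the table above.
libraryDependencies ++= Seq(
  "io.qbeast"        %% "qbeast-spark" % "0.7.0",
  "io.delta"         %% "delta-spark"  % "3.1.0",
  "org.apache.spark" %% "spark-sql"    % "3.5.0" % Provided
)
```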
# Contribution Guide
See [Contribution Guide](./CONTRIBUTING.md) for more information.
# License
See [LICENSE](./LICENSE).
# Code of conduct
See [Code of conduct](./CODE_OF_CONDUCT.md).