https://github.com/qdrant/qdrant-spark

Qdrant's Apache Spark connector
Host: GitHub
URL: https://github.com/qdrant/qdrant-spark
Owner: qdrant
License: apache-2.0
Created: 2023-11-01T12:12:53.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2025-03-28T11:28:08.000Z (about 1 year ago)
Last Synced: 2025-06-08T09:08:34.996Z (11 months ago)
Language: Java
Homepage: https://qdrant.tech/documentation/frameworks/spark/
Size: 133 KB
Stars: 43
Watchers: 5
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Qdrant-Spark Connector

[Apache Spark](https://spark.apache.org/) is a distributed computing framework designed for big data processing and analytics. This connector enables [Qdrant](https://qdrant.tech/) to be a storage destination in Spark.

## Installation

To integrate the connector into your Spark environment, get the JAR file from one of the sources listed below.

> [!IMPORTANT]
> Ensure your system is running Java 8.

### GitHub Releases

The packaged `jar` file can be found [here](https://github.com/qdrant/qdrant-spark/releases).

### Building from source

To build the `jar` from source, you need [JDK@8](https://www.azul.com/downloads/#zulu) and [Maven](https://maven.apache.org/) installed.

Once the requirements have been satisfied, run the following command in the project root.

```bash
mvn package
```

The JAR file will be written into the `target` directory by default.

### Maven Central

Find the project on Maven Central [here](https://central.sonatype.com/artifact/io.qdrant/spark).

## Usage

### Creating a Spark session (Single-node) with Qdrant support

```python
from pyspark.sql import SparkSession

# The outer parentheses let the builder chain span several lines.
spark = (
    SparkSession.builder.config(
        "spark.jars",
        "spark-VERSION.jar",  # Specify the downloaded JAR file
    )
    .master("local[*]")
    .appName("qdrant")
    .getOrCreate()
)
```

### Loading data

> [!IMPORTANT]
> Before loading data with this connector, a collection must be [created](https://qdrant.tech/documentation/concepts/collections/#create-a-collection) in advance with the appropriate vector dimensions and configurations.

The connector supports ingesting multiple named/unnamed, dense/sparse vectors.
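Because the collection has to exist before the first write, it can help to see what a matching collection definition looks like. The sketch below only builds (and does not send) a `PUT /collections/{name}` request for Qdrant's REST API using the standard library. The helper name `create_collection_request`, the collection name, the vector names, and the dimension `384` are illustrative assumptions, and `6333` is assumed to be the REST port of a local instance.

```python
import json
import urllib.request


def create_collection_request(base_url, collection, dense=None, sparse=None):
    """Build (but do not send) a request that creates a Qdrant collection.

    dense:  either a (size, distance) tuple for the unnamed default vector,
            or a dict {vector_name: (size, distance)} for named vectors.
    sparse: iterable of sparse vector names.
    """
    body = {}
    if isinstance(dense, tuple):
        size, distance = dense
        body["vectors"] = {"size": size, "distance": distance}
    elif dense:
        body["vectors"] = {
            name: {"size": size, "distance": distance}
            for name, (size, distance) in dense.items()
        }
    if sparse:
        body["sparse_vectors"] = {name: {} for name in sparse}
    return urllib.request.Request(
        url=f"{base_url}/collections/{collection}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical collection with one named dense vector and one sparse vector.
req = create_collection_request(
    "http://localhost:6333",
    "my_collection",
    dense={"dense": (384, "Cosine")},
    sparse=["sparse"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send it to a running instance. The vector names defined here are the ones the connector options later refer to.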


#### Unnamed/Default vector

```python
<pyspark.sql.DataFrame>
   .write
   .format("io.qdrant.spark.Qdrant")
   .option("qdrant_url", "<QDRANT_GRPC_URL>")
   .option("collection_name", "<QDRANT_COLLECTION_NAME>")
   .option("embedding_field", "<EMBEDDING_FIELD_NAME>")  # Expected to be a field of type ArrayType(FloatType)
   .option("schema", <pyspark.sql.DataFrame>.schema.json())
   .mode("append")
   .save()
```

#### Named vector

```python
<pyspark.sql.DataFrame>
   .write
   .format("io.qdrant.spark.Qdrant")
   .option("qdrant_url", "<QDRANT_GRPC_URL>")
   .option("collection_name", "<QDRANT_COLLECTION_NAME>")
   .option("embedding_field", "<EMBEDDING_FIELD_NAME>")  # Expected to be a field of type ArrayType(FloatType)
   .option("vector_name", "<VECTOR_NAME>")
   .option("schema", <pyspark.sql.DataFrame>.schema.json())
   .mode("append")
   .save()
```

> #### NOTE
>
> The `embedding_field` and `vector_name` options are maintained for backward compatibility. It is recommended to use `vector_fields` and `vector_names` for named vectors as shown below.

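#### Multiple named vectors

```python
<pyspark.sql.DataFrame>
   .write
   .format("io.qdrant.spark.Qdrant")
   .option("qdrant_url", "<QDRANT_GRPC_URL>")
   .option("collection_name", "<QDRANT_COLLECTION_NAME>")
   .option("vector_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
   .option("vector_names", "<VECTOR_NAME>,<ANOTHER_VECTOR_NAME>")
   .option("schema", <pyspark.sql.DataFrame>.schema.json())
   .mode("append")
   .save()
```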
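The comma-separated `vector_fields` and `vector_names` values are paired by position, so it is easy to get them out of step when editing by hand. As a minimal sketch, a small helper can build both strings from a single mapping. The helper name `qdrant_options` and the column and vector names are hypothetical and not part of the connector.

```python
def qdrant_options(url, collection, schema_json, vectors=None, **extra):
    """Return connector options as a flat dict of strings.

    vectors maps a DataFrame column name to the Qdrant vector name it feeds.
    extra holds any further options, e.g. batch_size=128.
    """
    opts = {
        "qdrant_url": url,
        "collection_name": collection,
        "schema": schema_json,
    }
    if vectors:
        for name in list(vectors) + list(vectors.values()):
            if "," in name:
                raise ValueError(f"name must not contain a comma: {name!r}")
        # Dicts keep insertion order, so fields and names stay aligned.
        opts["vector_fields"] = ",".join(vectors.keys())
        opts["vector_names"] = ",".join(vectors.values())
    opts.update({key: str(value) for key, value in extra.items()})
    return opts


# Hypothetical columns "title_emb" and "body_emb" feeding vectors "title" and "body".
opts = qdrant_options(
    "http://localhost:6334",
    "my_collection",
    "{}",  # placeholder; in practice the output of df.schema.json()
    vectors={"title_emb": "title", "body_emb": "body"},
    batch_size=128,
)
print(opts["vector_fields"])
print(opts["vector_names"])
```

With a real DataFrame `df`, the resulting dict can be passed in one call through `df.write.format("io.qdrant.spark.Qdrant").options(**opts).mode("append").save()`.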

#### Sparse vectors

```python
<pyspark.sql.DataFrame>
   .write
   .format("io.qdrant.spark.Qdrant")
   .option("qdrant_url", "<QDRANT_GRPC_URL>")
   .option("collection_name", "<QDRANT_COLLECTION_NAME>")
   .option("sparse_vector_value_fields", "<COLUMN_NAME>")
   .option("sparse_vector_index_fields", "<COLUMN_NAME>")
   .option("sparse_vector_names", "<SPARSE_VECTOR_NAME>")
   .option("schema", <pyspark.sql.DataFrame>.schema.json())
   .mode("append")
   .save()
```
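A sparse vector is supplied as two parallel columns: one holding the indices of the non-zero entries and one holding their values. As an illustration of how the two columns relate, the sketch below splits a dense Python list into such a pair. The function name `to_sparse` is hypothetical, and in a Spark job the same logic would have to run per row (for example inside a UDF).

```python
def to_sparse(dense):
    """Split a dense vector into parallel (indices, values) lists.

    Only non-zero entries are kept; position i in `indices` pairs with
    position i in `values`.
    """
    indices, values = [], []
    for position, value in enumerate(dense):
        if value != 0.0:
            indices.append(position)
            values.append(float(value))
    return indices, values


indices, values = to_sparse([0.0, 1.5, 0.0, 0.0, 2.0])
print(indices)  # [1, 4]
print(values)   # [1.5, 2.0]
```

The `indices` list corresponds to the column named in `sparse_vector_index_fields` and the `values` list to the column named in `sparse_vector_value_fields`.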

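#### Multiple sparse vectors

```python
<pyspark.sql.DataFrame>
   .write
   .format("io.qdrant.spark.Qdrant")
   .option("qdrant_url", "<QDRANT_GRPC_URL>")
   .option("collection_name", "<QDRANT_COLLECTION_NAME>")
   .option("sparse_vector_value_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
   .option("sparse_vector_index_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
   .option("sparse_vector_names", "<SPARSE_VECTOR_NAME>,<ANOTHER_SPARSE_VECTOR_NAME>")
   .option("schema", <pyspark.sql.DataFrame>.schema.json())
   .mode("append")
   .save()
```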

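#### Combination of named dense and sparse vectors

```python
<pyspark.sql.DataFrame>
   .write
   .format("io.qdrant.spark.Qdrant")
   .option("qdrant_url", "<QDRANT_GRPC_URL>")
   .option("collection_name", "<QDRANT_COLLECTION_NAME>")
   .option("vector_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
   .option("vector_names", "<VECTOR_NAME>,<ANOTHER_VECTOR_NAME>")
   .option("sparse_vector_value_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
   .option("sparse_vector_index_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
   .option("sparse_vector_names", "<SPARSE_VECTOR_NAME>,<ANOTHER_SPARSE_VECTOR_NAME>")
   .option("schema", <pyspark.sql.DataFrame>.schema.json())
   .mode("append")
   .save()
```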

#### Multi-vectors

```python
<pyspark.sql.DataFrame>
   .write
   .format("io.qdrant.spark.Qdrant")
   .option("qdrant_url", "<QDRANT_GRPC_URL>")
   .option("collection_name", "<QDRANT_COLLECTION_NAME>")
   .option("multi_vector_fields", "<COLUMN_NAME>")
   .option("multi_vector_names", "<MULTI_VECTOR_NAME>")
   .option("schema", <pyspark.sql.DataFrame>.schema.json())
   .mode("append")
   .save()
```
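A multi-vector column holds a list of vectors per row, that is, a nested array of floats. The sketch below is one way to sanity-check a single value before writing, under the assumption that every inner vector must share one dimension (the size configured for that vector in the collection). The function name `multi_vector_dim` is hypothetical.

```python
def multi_vector_dim(rows):
    """Return the shared dimension of a multi-vector, or raise ValueError.

    `rows` is a list of inner vectors, each a list of floats.
    """
    if not rows:
        raise ValueError("a multi-vector needs at least one inner vector")
    dim = len(rows[0])
    if dim == 0:
        raise ValueError("inner vectors must not be empty")
    for position, row in enumerate(rows):
        if len(row) != dim:
            raise ValueError(
                f"inner vector {position} has length {len(row)}, expected {dim}"
            )
    return dim


# Three inner vectors of dimension 2, e.g. one embedding per token.
print(multi_vector_dim([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]))  # 2
```

Running such a check on a sample of rows before the write can surface ragged data earlier than a failed upsert would.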

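#### Multiple multi-vectors

```python
<pyspark.sql.DataFrame>
   .write
   .format("io.qdrant.spark.Qdrant")
   .option("qdrant_url", "<QDRANT_GRPC_URL>")
   .option("collection_name", "<QDRANT_COLLECTION_NAME>")
   .option("multi_vector_fields", "<COLUMN_NAME>,<ANOTHER_COLUMN_NAME>")
   .option("multi_vector_names", "<MULTI_VECTOR_NAME>,<ANOTHER_MULTI_VECTOR_NAME>")
   .option("schema", <pyspark.sql.DataFrame>.schema.json())
   .mode("append")
   .save()
```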

#### No vectors (entire DataFrame is stored as payload)

```python
<pyspark.sql.DataFrame>
   .write
   .format("io.qdrant.spark.Qdrant")
   .option("qdrant_url", "<QDRANT_GRPC_URL>")
   .option("collection_name", "<QDRANT_COLLECTION_NAME>")
   .option("schema", <pyspark.sql.DataFrame>.schema.json())
   .mode("append")
   .save()
```

## Databricks

> [!TIP]
> Check out our [example](https://qdrant.tech/documentation/examples/databricks/) of using the Spark connector with Databricks.

You can use the connector as a library in Databricks to ingest data into Qdrant.

- Go to the `Libraries` section in your cluster dashboard.

- Select `Install New` to open the library installation modal.

- Search for `io.qdrant:spark:VERSION` in the Maven packages and click `Install`.



## Datatype support

The appropriate Spark data types are mapped to the Qdrant payload based on the provided `schema`.
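The `schema` option carries the JSON form of the DataFrame schema, which is how the connector learns each column's type. To make that concrete, the sketch below parses a hand-written schema string in the layout Spark uses for `StructType` JSON and lists the columns that could serve as dense vector columns. The column names and the helper `float_array_columns` are hypothetical.

```python
import json

# Hand-written example of what df.schema.json() could return for a
# DataFrame with a string column and an array-of-float column.
SCHEMA_JSON = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "doc_id", "type": "string", "nullable": True, "metadata": {}},
        {
            "name": "embedding",
            "type": {"type": "array", "elementType": "float", "containsNull": True},
            "nullable": True,
            "metadata": {},
        },
    ],
})


def float_array_columns(schema_json):
    """Names of top-level columns typed as ArrayType(FloatType)."""
    schema = json.loads(schema_json)
    return [
        field["name"]
        for field in schema["fields"]
        if isinstance(field["type"], dict)
        and field["type"].get("type") == "array"
        and field["type"].get("elementType") == "float"
    ]


print(float_array_columns(SCHEMA_JSON))  # ['embedding']
```

Columns that are not used as vectors or as the ID, such as `doc_id` here, are the ones that end up in the point payload.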

## Options and Spark types

| Option                       | Description                                                                                           | Column DataType                   | Required |
| :--------------------------- | :---------------------------------------------------------------------------------------------------- | :-------------------------------- | :------- |
| `qdrant_url`                 | gRPC URL of the Qdrant instance, e.g. `http://localhost:6334`                                         | -                                 | ✅       |
| `collection_name`            | Name of the collection to write data into                                                             | -                                 | ✅       |
| `schema`                     | JSON string of the dataframe schema                                                                   | -                                 | ✅       |
| `embedding_field`            | Name of the column with the embeddings (Deprecated - Use `vector_fields` instead)                     | `ArrayType(FloatType)`            | ❌       |
| `id_field`                   | Name of the column with the point IDs. Points with the same IDs are overwritten. Default: Random UUID | `StringType` or `IntegerType`     | ❌       |
| `batch_size`                 | Max size of the upload batch. Default: 64                                                             | -                                 | ❌       |
| `retries`                    | Number of upload retries. Default: 3                                                                  | -                                 | ❌       |
| `api_key`                    | Qdrant API key for authentication                                                                     | -                                 | ❌       |
| `vector_name`                | Name of the vector in the collection.                                                                 | -                                 | ❌       |
| `vector_fields`              | Comma-separated names of columns holding the vectors.                                                 | `ArrayType(FloatType)`            | ❌       |
| `vector_names`               | Comma-separated names of vectors in the collection.                                                   | -                                 | ❌       |
| `sparse_vector_index_fields` | Comma-separated names of columns holding the sparse vector indices.                                   | `ArrayType(IntegerType)`          | ❌       |
| `sparse_vector_value_fields` | Comma-separated names of columns holding the sparse vector values.                                    | `ArrayType(FloatType)`            | ❌       |
| `sparse_vector_names`        | Comma-separated names of the sparse vectors in the collection.                                        | -                                 | ❌       |
| `multi_vector_fields`        | Comma-separated names of columns holding the multi-vector values.                                     | `ArrayType(ArrayType(FloatType))` | ❌       |
| `multi_vector_names`         | Comma-separated names of the multi-vectors in the collection.                                         | -                                 | ❌       |
| `shard_key_selector`         | Comma-separated names of custom shard keys to use during upsert.                                      | -                                 | ❌       |
| `wait`                       | Wait for each batch upsert to complete. `true` or `false`. Defaults to `true`.                        | -                                 | ❌       |
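Since points with the same ID are overwritten and the default ID is a random UUID, re-running a job without `id_field` adds new points instead of replacing the old ones. One possible way to make writes repeatable is to derive a deterministic UUID string from a natural key and store it in the column passed as `id_field`. The sketch below uses only the standard library; the function name `stable_point_id` and the key format are assumptions.

```python
import uuid


def stable_point_id(natural_key, namespace=uuid.NAMESPACE_URL):
    """Derive a deterministic UUID string (version 5) from a natural key.

    The same key always yields the same ID, so a repeated write targets
    the same point instead of creating a new one.
    """
    return str(uuid.uuid5(namespace, natural_key))


first = stable_point_id("docs/intro.md#chunk-0")
second = stable_point_id("docs/intro.md#chunk-0")
other = stable_point_id("docs/intro.md#chunk-1")
print(first == second)  # True
print(first == other)   # False
```

In PySpark this function could be wrapped with `pyspark.sql.functions.udf` to populate a `StringType` ID column before the write.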

## LICENSE

Apache 2.0 © [2024](https://github.com/qdrant/qdrant-spark/blob/master/LICENSE)