{"id":28493387,"url":"https://github.com/qdrant/qdrant-spark","last_synced_at":"2025-07-08T11:32:19.709Z","repository":{"id":204887259,"uuid":"712888460","full_name":"qdrant/qdrant-spark","owner":"qdrant","description":"Qdrant's Apache Spark connector","archived":false,"fork":false,"pushed_at":"2025-03-28T11:28:08.000Z","size":136,"stargazers_count":43,"open_issues_count":0,"forks_count":1,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-06-08T09:08:34.996Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://qdrant.tech/documentation/frameworks/spark/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/qdrant.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-01T12:12:53.000Z","updated_at":"2025-04-09T11:49:50.000Z","dependencies_parsed_at":"2023-12-12T16:27:45.849Z","dependency_job_id":"891203e7-091f-4b44-8143-767f6889459a","html_url":"https://github.com/qdrant/qdrant-spark","commit_stats":null,"previous_names":["qdrant/qdrant-spark"],"tags_count":22,"template":false,"template_full_name":null,"purl":"pkg:github/qdrant/qdrant-spark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2Fqdrant-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2Fqdrant-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2Fqdrant-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2Fqdrant-spark/manifests","owner_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/owners/qdrant","download_url":"https://codeload.github.com/qdrant/qdrant-spark/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2Fqdrant-spark/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264259884,"owners_count":23580900,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-08T09:08:38.738Z","updated_at":"2025-07-08T11:32:19.703Z","avatar_url":"https://github.com/qdrant.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Qdrant-Spark Connector\n\n[Apache Spark](https://spark.apache.org/) is a distributed computing framework designed for big data processing and analytics. 
This connector enables [Qdrant](https://qdrant.tech/) to be a storage destination in Spark.\n\n## Installation\n\nTo integrate the connector into your Spark environment, get the JAR file from one of the sources listed below.\n\n\u003e [!IMPORTANT]  \n\u003e Ensure your system is running Java 8.\n\n### GitHub Releases\n\nThe packaged `jar` file can be found [here](https://github.com/qdrant/qdrant-spark/releases).\n\n### Building from source\n\nTo build the `jar` from source, you need [JDK@8](https://www.azul.com/downloads/#zulu) and [Maven](https://maven.apache.org/) installed.\nOnce the requirements have been satisfied, run the following command in the project root.\n\n```bash\nmvn package\n```\n\nThe JAR file will be written into the `target` directory by default.\n\n### Maven Central\n\nFind the project on Maven Central [here](https://central.sonatype.com/artifact/io.qdrant/spark).\n\n## Usage\n\n### Creating a Spark session (Single-node) with Qdrant support\n\n```python\nfrom pyspark.sql import SparkSession\n\nspark = (\n    SparkSession.builder.config(\n        \"spark.jars\",\n        \"spark-VERSION.jar\",  # Specify the downloaded JAR file\n    )\n    .master(\"local[*]\")\n    .appName(\"qdrant\")\n    .getOrCreate()\n)\n```\n\n### Loading data\n\n\u003e [!IMPORTANT]\n\u003e Before loading the data using this connector, a collection has to be [created](https://qdrant.tech/documentation/concepts/collections/#create-a-collection) in advance with the appropriate vector dimensions and configurations.\n\nThe connector supports ingesting multiple named/unnamed, dense/sparse vectors.\n\n_Click each to expand._\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eUnnamed/Default vector\u003c/b\u003e\u003c/summary\u003e\n\n```python\n  \u003cpyspark.sql.DataFrame\u003e\n   .write\n   .format(\"io.qdrant.spark.Qdrant\")\n   .option(\"qdrant_url\", \u003cQDRANT_GRPC_URL\u003e)\n   .option(\"collection_name\", \u003cQDRANT_COLLECTION_NAME\u003e)\n   
.option(\"embedding_field\", \u003cEMBEDDING_FIELD_NAME\u003e)  # Expected to be a field of type ArrayType(FloatType)\n   .option(\"schema\", \u003cpyspark.sql.DataFrame\u003e.schema.json())\n   .mode(\"append\")\n   .save()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eNamed vector\u003c/b\u003e\u003c/summary\u003e\n\n```python\n  \u003cpyspark.sql.DataFrame\u003e\n   .write\n   .format(\"io.qdrant.spark.Qdrant\")\n   .option(\"qdrant_url\", \u003cQDRANT_GRPC_URL\u003e)\n   .option(\"collection_name\", \u003cQDRANT_COLLECTION_NAME\u003e)\n   .option(\"embedding_field\", \u003cEMBEDDING_FIELD_NAME\u003e)  # Expected to be a field of type ArrayType(FloatType)\n   .option(\"vector_name\", \u003cVECTOR_NAME\u003e)\n   .option(\"schema\", \u003cpyspark.sql.DataFrame\u003e.schema.json())\n   .mode(\"append\")\n   .save()\n```\n\n\u003e #### NOTE\n\u003e\n\u003e The `embedding_field` and `vector_name` options are maintained for backward compatibility. It is recommended to use `vector_fields` and `vector_names` for named vectors as shown below.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eMultiple named vectors\u003c/b\u003e\u003c/summary\u003e\n\n```python\n  \u003cpyspark.sql.DataFrame\u003e\n   .write\n   .format(\"io.qdrant.spark.Qdrant\")\n   .option(\"qdrant_url\", \"\u003cQDRANT_GRPC_URL\u003e\")\n   .option(\"collection_name\", \"\u003cQDRANT_COLLECTION_NAME\u003e\")\n   .option(\"vector_fields\", \"\u003cCOLUMN_NAME\u003e,\u003cANOTHER_COLUMN_NAME\u003e\")\n   .option(\"vector_names\", \"\u003cVECTOR_NAME\u003e,\u003cANOTHER_VECTOR_NAME\u003e\")\n   .option(\"schema\", \u003cpyspark.sql.DataFrame\u003e.schema.json())\n   .mode(\"append\")\n   .save()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eSparse vectors\u003c/b\u003e\u003c/summary\u003e\n\n```python\n  \u003cpyspark.sql.DataFrame\u003e\n   .write\n   .format(\"io.qdrant.spark.Qdrant\")\n   
.option(\"qdrant_url\", \"\u003cQDRANT_GRPC_URL\u003e\")\n   .option(\"collection_name\", \"\u003cQDRANT_COLLECTION_NAME\u003e\")\n   .option(\"sparse_vector_value_fields\", \"\u003cCOLUMN_NAME\u003e\")\n   .option(\"sparse_vector_index_fields\", \"\u003cCOLUMN_NAME\u003e\")\n   .option(\"sparse_vector_names\", \"\u003cSPARSE_VECTOR_NAME\u003e\")\n   .option(\"schema\", \u003cpyspark.sql.DataFrame\u003e.schema.json())\n   .mode(\"append\")\n   .save()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eMultiple sparse vectors\u003c/b\u003e\u003c/summary\u003e\n\n```python\n  \u003cpyspark.sql.DataFrame\u003e\n   .write\n   .format(\"io.qdrant.spark.Qdrant\")\n   .option(\"qdrant_url\", \"\u003cQDRANT_GRPC_URL\u003e\")\n   .option(\"collection_name\", \"\u003cQDRANT_COLLECTION_NAME\u003e\")\n   .option(\"sparse_vector_value_fields\", \"\u003cCOLUMN_NAME\u003e,\u003cANOTHER_COLUMN_NAME\u003e\")\n   .option(\"sparse_vector_index_fields\", \"\u003cCOLUMN_NAME\u003e,\u003cANOTHER_COLUMN_NAME\u003e\")\n   .option(\"sparse_vector_names\", \"\u003cSPARSE_VECTOR_NAME\u003e,\u003cANOTHER_SPARSE_VECTOR_NAME\u003e\")\n   .option(\"schema\", \u003cpyspark.sql.DataFrame\u003e.schema.json())\n   .mode(\"append\")\n   .save()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eCombination of named dense and sparse vectors\u003c/b\u003e\u003c/summary\u003e\n\n```python\n  \u003cpyspark.sql.DataFrame\u003e\n   .write\n   .format(\"io.qdrant.spark.Qdrant\")\n   .option(\"qdrant_url\", \"\u003cQDRANT_GRPC_URL\u003e\")\n   .option(\"collection_name\", \"\u003cQDRANT_COLLECTION_NAME\u003e\")\n   .option(\"vector_fields\", \"\u003cCOLUMN_NAME\u003e,\u003cANOTHER_COLUMN_NAME\u003e\")\n   .option(\"vector_names\", \"\u003cVECTOR_NAME\u003e,\u003cANOTHER_VECTOR_NAME\u003e\")\n   .option(\"sparse_vector_value_fields\", \"\u003cCOLUMN_NAME\u003e,\u003cANOTHER_COLUMN_NAME\u003e\")\n   .option(\"sparse_vector_index_fields\", 
\"\u003cCOLUMN_NAME\u003e,\u003cANOTHER_COLUMN_NAME\u003e\")\n   .option(\"sparse_vector_names\", \"\u003cSPARSE_VECTOR_NAME\u003e,\u003cANOTHER_SPARSE_VECTOR_NAME\u003e\")\n   .option(\"schema\", \u003cpyspark.sql.DataFrame\u003e.schema.json())\n   .mode(\"append\")\n   .save()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eMulti-vectors\u003c/b\u003e\u003c/summary\u003e\n\n```python\n  \u003cpyspark.sql.DataFrame\u003e\n   .write\n   .format(\"io.qdrant.spark.Qdrant\")\n   .option(\"qdrant_url\", \"\u003cQDRANT_GRPC_URL\u003e\")\n   .option(\"collection_name\", \"\u003cQDRANT_COLLECTION_NAME\u003e\")\n   .option(\"multi_vector_fields\", \"\u003cCOLUMN_NAME\u003e\")\n   .option(\"multi_vector_names\", \"\u003cMULTI_VECTOR_NAME\u003e\")\n   .option(\"schema\", \u003cpyspark.sql.DataFrame\u003e.schema.json())\n   .mode(\"append\")\n   .save()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eMultiple Multi-vectors\u003c/b\u003e\u003c/summary\u003e\n\n```python\n  \u003cpyspark.sql.DataFrame\u003e\n   .write\n   .format(\"io.qdrant.spark.Qdrant\")\n   .option(\"qdrant_url\", \"\u003cQDRANT_GRPC_URL\u003e\")\n   .option(\"collection_name\", \"\u003cQDRANT_COLLECTION_NAME\u003e\")\n   .option(\"multi_vector_fields\", \"\u003cCOLUMN_NAME\u003e,\u003cANOTHER_COLUMN_NAME\u003e\")\n   .option(\"multi_vector_names\", \"\u003cMULTI_VECTOR_NAME\u003e,\u003cANOTHER_MULTI_VECTOR_NAME\u003e\")\n   .option(\"schema\", \u003cpyspark.sql.DataFrame\u003e.schema.json())\n   .mode(\"append\")\n   .save()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eNo vectors - Entire dataframe is stored as payload\u003c/b\u003e\u003c/summary\u003e\n\n```python\n  \u003cpyspark.sql.DataFrame\u003e\n   .write\n   .format(\"io.qdrant.spark.Qdrant\")\n   .option(\"qdrant_url\", \"\u003cQDRANT_GRPC_URL\u003e\")\n   .option(\"collection_name\", \"\u003cQDRANT_COLLECTION_NAME\u003e\")\n   
.option(\"schema\", \u003cpyspark.sql.DataFrame\u003e.schema.json())\n   .mode(\"append\")\n   .save()\n```\n\n\u003c/details\u003e\n\n## Databricks\n\n\u003e [!TIP]\n\u003e Check out our [example](https://qdrant.tech/documentation/examples/databricks/) of using the Spark connector with Databricks.\n\nYou can use the connector as a library in Databricks to ingest data into Qdrant.\n\n- Go to the `Libraries` section in your cluster dashboard.\n- Select `Install New` to open the library installation modal.\n- Search for `io.qdrant:spark:VERSION` in the Maven packages and click `Install`.\n\n\u003cimg width=\"704\" alt=\"Screenshot 2024-04-28 at 11 34 17 AM\" src=\"https://github.com/qdrant/qdrant-spark/assets/46051506/0c1bd356-3fba-436a-90ce-d8ff39b02d1f\"\u003e\n\n## Datatype support\n\nThe appropriate Spark data types are mapped to the Qdrant payload based on the provided `schema`.\n\n## Options and Spark types\n\n| Option                       | Description                                                                                           | Column DataType                   | Required |\n| :--------------------------- | :---------------------------------------------------------------------------------------------------- | :-------------------------------- | :------- |\n| `qdrant_url`                 | gRPC URL of the Qdrant instance. 
Eg: \u003chttp://localhost:6334\u003e                                          | -                                 | ✅       |\n| `collection_name`            | Name of the collection to write data into                                                             | -                                 | ✅       |\n| `schema`                     | JSON string of the dataframe schema                                                                   | -                                 | ✅       |\n| `embedding_field`            | Name of the column with the embeddings (Deprecated - Use `vector_fields` instead)                     | `ArrayType(FloatType)`            | ❌       |\n| `id_field`                   | Name of the column with the point IDs. Points with the same IDs are overwritten. Default: Random UUID | `StringType` or `IntegerType`     | ❌       |\n| `batch_size`                 | Max size of the upload batch. Default: 64                                                             | -                                 | ❌       |\n| `retries`                    | Number of upload retries. Default: 3                                                                  | -                                 | ❌       |\n| `api_key`                    | Qdrant API key for authentication                                                                     | -                                 | ❌       |\n| `vector_name`                | Name of the vector in the collection.                                                                 | -                                 | ❌       |\n| `vector_fields`              | Comma-separated names of columns holding the vectors.                                                 | `ArrayType(FloatType)`            | ❌       |\n| `vector_names`               | Comma-separated names of vectors in the collection.                                                   
| -                                 | ❌       |\n| `sparse_vector_index_fields` | Comma-separated names of columns holding the sparse vector indices.                                   | `ArrayType(IntegerType)`          | ❌       |\n| `sparse_vector_value_fields` | Comma-separated names of columns holding the sparse vector values.                                    | `ArrayType(FloatType)`            | ❌       |\n| `sparse_vector_names`        | Comma-separated names of the sparse vectors in the collection.                                        | -                                 | ❌       |\n| `multi_vector_fields`        | Comma-separated names of columns holding the multi-vector values.                                     | `ArrayType(ArrayType(FloatType))` | ❌       |\n| `multi_vector_names`         | Comma-separated names of the multi-vectors in the collection.                                         | -                                 | ❌       |\n| `shard_key_selector`         | Comma-separated names of custom shard keys to use during upsert.                                      | -                                 | ❌       |\n| `wait`                       | Wait for each batch upsert to complete. `true` or `false`. Defaults to `true`.                        | -                                 | ❌       |\n\n## LICENSE\n\nApache 2.0 © [2024](https://github.com/qdrant/qdrant-spark/blob/master/LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqdrant%2Fqdrant-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqdrant%2Fqdrant-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqdrant%2Fqdrant-spark/lists"}