# spark-pinecone
The official [pinecone.io](https://pinecone.io) Spark connector.

## Features
- Please note that the connector's write operation is not atomic: if the job is stopped or fails partway through, some vectors may have been written while others were not. In practice this is rarely a problem, since Pinecone is an idempotent key-value store; re-running the job brings the index to the desired state without clearing it or computing a delta from the source data.
- The connector currently writes data into Pinecone from DataFrames with a specific schema (see the examples below), both in batch mode and as a Structured Streaming sink (see the stream upsert examples). If you need finer control in a streaming pipeline, you can also wrap the batch writer in `foreachBatch`, as sketched after this list.
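
If you'd rather drive the connector from `foreachBatch` (for example, to combine the upsert with other per-batch logic), a minimal Scala sketch might look like the following. It simply reuses the connector's batch write path shown later in this README; the option values and paths are placeholders, not real credentials.

```scala
import io.pinecone.spark.pinecone.{COMMON_SCHEMA, PineconeOptions}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object ForeachBatchSketch extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  // Placeholder credentials; replace with your own values.
  val pineconeOptions = Map(
    PineconeOptions.PINECONE_API_KEY_CONF -> "PINECONE_API_KEY",
    PineconeOptions.PINECONE_INDEX_NAME_CONF -> "PINECONE_INDEX_NAME"
  )

  // Upsert one micro-batch using the connector's batch writer.
  def upsertBatch(batchDf: DataFrame, batchId: Long): Unit =
    batchDf.write
      .options(pineconeOptions)
      .format("io.pinecone.spark.pinecone.Pinecone")
      .mode(SaveMode.Append)
      .save()

  spark.readStream
    .schema(COMMON_SCHEMA)
    .json("path/to/input/directory/") // placeholder input directory
    .writeStream
    .foreachBatch(upsertBatch _)
    .option("checkpointLocation", "path/to/checkpoint/dir")
    .start()
    .awaitTermination()
}
```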

## Support
This client currently supports Spark 3.5.0, Scala 2.12.x or 2.13.x, and Java 8+.
- For Scala 2.12, use `spark-pinecone_2.12.jar`: https://central.sonatype.com/artifact/io.pinecone/spark-pinecone_2.12.
- For Scala 2.13, use `spark-pinecone_2.13.jar`: https://central.sonatype.com/artifact/io.pinecone/spark-pinecone_2.13.

Make sure to add the correct JAR file to your project's dependencies according to your Scala version.
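
For example, with sbt the dependency might look like the following sketch (assuming the 1.2.0 release listed below; check Maven Central for the latest version):

```scala
// build.sbt (sketch): %% appends your project's Scala binary version (2.12 or 2.13)
libraryDependencies += "io.pinecone" %% "spark-pinecone" % "1.2.0"
```

With `spark-submit` or `pyspark`, the connector is typically passed via `--packages`, e.g. `--packages io.pinecone:spark-pinecone_2.13:1.2.0` (use the `_2.12` artifact for Scala 2.12).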

### Databricks and friends
Because Databricks runtimes bundle their own versions of several libraries, use the assembly (uber) jar from S3 to avoid dependency conflicts.
S3 paths for the assembly jar:
1. v1.2.0 (latest): s3://pinecone-jars/1.2.0/spark-pinecone-uberjar.jar
2. v1.1.0: s3://pinecone-jars/1.1.0/spark-pinecone-uberjar.jar
3. v1.0.0: s3://pinecone-jars/1.0.0/spark-pinecone-uberjar.jar
4. v0.2.2: s3://pinecone-jars/0.2.2/spark-pinecone-uberjar.jar
5. v0.2.1: s3://pinecone-jars/0.2.1/spark-pinecone-uberjar.jar
6. v0.1.4: s3://pinecone-jars/spark-pinecone-uberjar.jar

## Example
To connect to Pinecone with Spark, you'll need an API key from [your Pinecone console](https://app.pinecone.io): navigate to your project and click "API Keys" in the sidebar. The `sample.jsonl` file used in the examples below
can be found [here](https://github.com/pinecone-io/spark-pinecone/blob/main/src/it/resources/sample.jsonl).
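
Each record carries the fields defined by `COMMON_SCHEMA` in the examples below: a required `id` and dense `values`, plus optional `namespace`, `metadata` (a JSON string, since the schema declares it as `StringType`), and `sparse_values`. A hypothetical record, shown only to illustrate the shape and not the actual file contents, might look like:

```json
{"id": "v1", "namespace": "default", "values": [1.0, 2.0, 3.0], "metadata": "{\"genre\": \"comedy\"}", "sparse_values": {"indices": [0, 2], "values": [0.5, 0.25]}}
```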

### Batch upsert
Below are Python and Scala examples for batch upserting vectors into a Pinecone index.

#### Python
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ArrayType, FloatType, StringType, LongType

# Your API key, index name, and source tag
api_key = "PINECONE_API_KEY"
index_name = "PINECONE_INDEX_NAME"
source_tag = "PINECONE_SOURCE_TAG"

COMMON_SCHEMA = StructType([
    StructField("id", StringType(), False),
    StructField("namespace", StringType(), True),
    StructField("values", ArrayType(FloatType(), False), False),
    StructField("metadata", StringType(), True),
    StructField("sparse_values", StructType([
        StructField("indices", ArrayType(LongType(), False), False),
        StructField("values", ArrayType(FloatType(), False), False)
    ]), True)
])

# Initialize Spark
spark = SparkSession.builder.getOrCreate()

# Read the file and apply the schema
df = spark.read \
    .option("multiLine", True) \
    .option("mode", "PERMISSIVE") \
    .schema(COMMON_SCHEMA) \
    .json("src/test/resources/sample.jsonl")

# Show if the read was successful
df.show()

# Write the DataFrame to Pinecone in batches
df.write \
    .option("pinecone.apiKey", api_key) \
    .option("pinecone.indexName", index_name) \
    .option("pinecone.sourceTag", source_tag) \
    .format("io.pinecone.spark.pinecone.Pinecone") \
    .mode("append") \
    .save()
```

#### Scala
```scala
import io.pinecone.spark.pinecone.{COMMON_SCHEMA, PineconeOptions}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

object MainApp extends App {
  // Your API key and index name
  val apiKey = "PINECONE_API_KEY"
  val indexName = "PINECONE_INDEX_NAME"
  val sourceTag = "PINECONE_SOURCE_TAG"

  // Configure Spark to run locally with all available cores
  val conf = new SparkConf()
    .setMaster("local[*]")

  // Create a Spark session with the defined configuration
  val spark = SparkSession.builder().config(conf).getOrCreate()

  // Read the JSON file into a DataFrame, applying the COMMON_SCHEMA
  val df = spark.read
    .option("multiLine", value = true)
    .option("mode", "PERMISSIVE")
    .schema(COMMON_SCHEMA)
    .json("src/test/resources/sample.jsonl") // path to sample.jsonl

  // Define Pinecone options as a Map
  val pineconeOptions = Map(
    PineconeOptions.PINECONE_API_KEY_CONF -> apiKey,
    PineconeOptions.PINECONE_INDEX_NAME_CONF -> indexName,
    PineconeOptions.PINECONE_SOURCE_TAG_CONF -> sourceTag
  )

  // Show if the read was successful
  df.show(df.count().toInt)

  // Write the DataFrame to Pinecone using the defined options in batches
  df.write
    .options(pineconeOptions)
    .format("io.pinecone.spark.pinecone.Pinecone")
    .mode(SaveMode.Append)
    .save()
}
```

### Stream upsert
Below are Python and Scala examples for streaming upserts of vectors into a Pinecone index.

#### Python
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ArrayType, FloatType, StringType, LongType

# Your API key and index name
api_key = "PINECONE_API_KEY"
index_name = "PINECONE_INDEX_NAME"
source_tag = "PINECONE_SOURCE_TAG"

COMMON_SCHEMA = StructType([
    StructField("id", StringType(), False),
    StructField("namespace", StringType(), True),
    StructField("values", ArrayType(FloatType(), False), False),
    StructField("metadata", StringType(), True),
    StructField("sparse_values", StructType([
        StructField("indices", ArrayType(LongType(), False), False),
        StructField("values", ArrayType(FloatType(), False), False)
    ]), True)
])

# Initialize Spark session
spark = SparkSession.builder \
    .appName("StreamUpsertExample") \
    .config("spark.sql.shuffle.partitions", 3) \
    .master("local") \
    .getOrCreate()

# Read the stream of JSON files from the input directory, applying the schema
lines = spark.readStream \
    .option("multiLine", True) \
    .option("mode", "PERMISSIVE") \
    .schema(COMMON_SCHEMA) \
    .json("path/to/input/directory/")

# Write the stream to Pinecone using the defined options
upsert = lines.writeStream \
    .format("io.pinecone.spark.pinecone.Pinecone") \
    .option("pinecone.apiKey", api_key) \
    .option("pinecone.indexName", index_name) \
    .option("pinecone.sourceTag", source_tag) \
    .option("checkpointLocation", "path/to/checkpoint/dir") \
    .outputMode("append") \
    .start()

upsert.awaitTermination()
```

#### Scala
```scala
import io.pinecone.spark.pinecone.{COMMON_SCHEMA, PineconeOptions}
import org.apache.spark.sql.SparkSession

object MainApp extends App {
  // Your API key, index name, and source tag
  val apiKey = "PINECONE_API_KEY"
  val indexName = "PINECONE_INDEX_NAME"
  val sourceTag = "PINECONE_SOURCE_TAG"

  // Create a Spark session
  val spark = SparkSession.builder()
    .appName("StreamUpsertExample")
    .config("spark.sql.shuffle.partitions", 3)
    .master("local")
    .getOrCreate()

  // Read the JSON files from the input directory into a streaming DataFrame, applying the COMMON_SCHEMA
  val lines = spark.readStream
    .option("multiLine", value = true)
    .option("mode", "PERMISSIVE")
    .schema(COMMON_SCHEMA)
    .json("path/to/input/directory/")

  // Define Pinecone options as a Map
  val pineconeOptions = Map(
    PineconeOptions.PINECONE_API_KEY_CONF -> apiKey,
    PineconeOptions.PINECONE_INDEX_NAME_CONF -> indexName,
    PineconeOptions.PINECONE_SOURCE_TAG_CONF -> sourceTag
  )

  // Write the stream to Pinecone using the defined options
  val upsert = lines
    .writeStream
    .format("io.pinecone.spark.pinecone.Pinecone")
    .options(pineconeOptions)
    .option("checkpointLocation", "path/to/checkpoint/dir")
    .outputMode("append")
    .start()

  upsert.awaitTermination()
}
```