{"id":28665526,"url":"https://github.com/lancedb/lance-spark","last_synced_at":"2025-06-13T13:39:48.530Z","repository":{"id":286155399,"uuid":"958852388","full_name":"lancedb/lance-spark","owner":"lancedb","description":"Spark integrations for working with Lance datasets","archived":false,"fork":false,"pushed_at":"2025-05-26T23:21:32.000Z","size":1362,"stargazers_count":5,"open_issues_count":11,"forks_count":2,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-05-26T23:32:34.739Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lancedb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-01T21:43:46.000Z","updated_at":"2025-05-26T23:21:35.000Z","dependencies_parsed_at":"2025-05-26T23:33:29.725Z","dependency_job_id":null,"html_url":"https://github.com/lancedb/lance-spark","commit_stats":null,"previous_names":["lancedb/lance-spark"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lancedb/lance-spark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Flance-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Flance-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Flance-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Flance-spark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lancedb","
download_url":"https://codeload.github.com/lancedb/lance-spark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lancedb%2Flance-spark/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259654495,"owners_count":22891029,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-13T13:38:44.627Z","updated_at":"2025-06-13T13:39:48.516Z","avatar_url":"https://github.com/lancedb.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Apache Spark Connector for Lance\n\nThe Apache Spark Connector for Lance allows Apache Spark to efficiently read datasets stored in Lance format.\n\nLance is a modern columnar data format optimized for machine learning workflows and datasets,\nsupporting distributed, parallel scans, and optimizations such as column and filter pushdown to improve performance.\nAdditionally, Lance provides high-performance random access that is 100 times faster than Parquet \nwithout sacrificing scan performance.\n\nBy using the Apache Spark Connector for Lance, you can leverage Apache Spark's powerful data processing, SQL querying, \nand machine learning training capabilities on the AI data lake powered by Lance.\n\n## Features\n\nThe connector is built using the Spark DatasourceV2 (DSv2) API. 
\nPlease check [this presentation](https://www.slideshare.net/databricks/apache-spark-data-source-v2-with-wenchen-fan-and-gengliang-wang)\nto learn more about DSv2 features.\nSpecifically, you can use the Apache Spark Connector for Lance to:\n\n* **Query Lance Datasets**: Seamlessly query datasets stored in the Lance format using Spark.\n* **Distributed, Parallel Scans**: Leverage Spark's distributed computing capabilities to perform parallel scans on Lance datasets.\n* **Column and Filter Pushdown**: Optimize query performance by pushing down column selections and filters to the data source.\n\n## Installation\n\n### Requirements\n\n| Requirement | Supported Versions                         |\n|-------------|--------------------------------------------|\n| Java        | 8, 11, 17                                  |\n| Scala       | 2.12                                       |\n| Spark       | 3.5                                        |\n| OS          | Any OS that is supported by Lance Java SDK |\n\n### Maven Central\n\nThe connector packages are published to Maven Central under `com.lancedb` namespace:\n\n| Artifact Type | Name Pattern                                         | Description                                                                                                                                     | Example                     |\n|---------------|------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------|\n| Base Jar      | `lance-spark-base_\u003cscala_version\u003e`                   | Jar with logic shared by different versions of Spark Lance connectors, only intended for internal use.                                          
| lance-spark-base_2.12       |\n| Lean Jar      | `lance-spark-\u003cspark-version\u003e_\u003cscala_version\u003e`        | Jar with only the Spark Lance connector logic, suitable for building a Spark application which you will re-bundle later with other dependencies | lance-spark-3.5_2.12        |\n| Bundled Jar   | `lance-spark-bundle-\u003cspark-version\u003e_\u003cscala_version\u003e` | Jar with all necessary non-Spark dependencies, suitable for use directly in a Spark session                                                     | lance-spark-bundle-3.5_2.12 |\n\n## Quick Start\n\nLaunch `spark-shell` with the JAR that matches your Spark and Scala versions:\n\n```shell\nspark-shell --packages com.lancedb:lance-spark-bundle-3.5_2.12:0.0.1\n```\n\nExample usage:\n\n```java\nimport org.apache.spark.sql.Dataset;\nimport org.apache.spark.sql.Row;\nimport org.apache.spark.sql.SparkSession;\n\n// Start a local Spark session for the example.\nSparkSession spark = SparkSession.builder()\n    .appName(\"spark-lance-connector-test\")\n    .master(\"local\")\n    .getOrCreate();\n\n// Read a Lance dataset: \"db\" is the database directory, \"dataset\" is the dataset name.\nDataset\u003cRow\u003e data = spark.read()\n    .format(\"lance\")\n    .option(\"db\", \"/path/to/example_db\")\n    .option(\"dataset\", \"lance_example_dataset\")\n    .load();\n\n// Print up to 100 rows.\ndata.show(100);\n```\n\nMore examples can be found in [SparkConnectorReadTestBase](/lance-spark-base/src/test/java/com/lancedb/lance/spark/read/SparkConnectorReadTestBase.java).\n\n## Development Guide\n\n### Lance Java SDK Dependency\n\nThis package depends on the [Lance Java SDK](https://github.com/lancedb/lance/blob/main/java) and\n[Lance Catalog Java Modules](https://github.com/lancedb/lance-catalog/tree/main/java).\nBuild those repositories locally before building this repository.\nIf you have changes affecting those repositories,\nthe PR in `lancedb/lance-spark` will only pass CI after the PRs in `lancedb/lance` and `lancedb/lance-catalog` are merged.\n\n### Build Commands\n\nThis connector is built using Maven. 
To build everything:\n\n```shell\n./mvnw clean install\n```\n\nTo build everything without running tests:\n\n```shell\n./mvnw clean install -DskipTests\n```\n\n### Multi-Version Support\n\nWe offer the following build profiles for you to switch among different build versions:\n\n- scala-2.12\n- scala-2.13\n- spark-3.4\n- spark-3.5\n\nFor example, to use Scala 2.13:\n\n```shell\n./mvnw clean install -Pscala-2.13\n```\n\nTo build a specific version like Spark 3.4:\n\n```shell\n./mvnw clean install -Pspark-3.4\n```\n\nTo build only Spark 3.4:\n\n```shell\n./mvnw clean install -Pspark-3.4 -pl lance-spark-3.4 -am\n```\n\nUse the `shade-jar` profile to create the jar with all dependencies for Spark 3.4:\n\n```shell\n./mvnw clean install -Pspark-3.4 -Pshade-jar -pl lance-spark-3.4 -am\n```\n\n### Styling Guide\n\nWe use checkstyle and spotless to lint the code.\n\nTo verify checkstyle:\n\n```shell\n./mvnw checkstyle:check\n```\n\nTo verify spotless:\n\n```shell\n./mvnw spotless:check\n```\n\nTo apply spotless changes:\n\n```shell\n./mvnw spotless:apply\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flancedb%2Flance-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flancedb%2Flance-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flancedb%2Flance-spark/lists"}