Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/delta-io/delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
acid analytics big-data delta-lake spark
Last synced: 6 days ago
- Host: GitHub
- URL: https://github.com/delta-io/delta
- Owner: delta-io
- License: apache-2.0
- Created: 2019-04-22T18:56:51.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-12-31T14:00:48.000Z (12 days ago)
- Last Synced: 2025-01-02T08:53:31.873Z (11 days ago)
- Topics: acid, analytics, big-data, delta-lake, spark
- Language: Scala
- Homepage: https://delta.io
- Size: 34.7 MB
- Stars: 7,716
- Watchers: 218
- Forks: 1,739
- Open Issues: 873
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- stars - delta-io/delta - An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (HarmonyOS / Windows Manager)
- awesome-databricks - Delta Lake - Storage layer with ACID transactions. (External Resources / Repos)
- awesome-starred - delta-io/delta - An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (analytics)
- awesome-open-data-centric-ai - Delta Lake - An open-source storage framework that enables building a Lakehouse architecture. (Data versioning)
- awesome-mlops - Delta Lake - Storage layer that brings scalable, ACID transactions to Apache Spark and other engines. (Data Management)
- awesome-dataops - Delta Lake - An open source project that enables building a Lakehouse architecture on top of data lakes. (Data Serialization / Data Table Format)
- StarryDivineSky - delta-io/delta - dataframe, vega, etc. (Database Management Systems / Network Services: Other)
- awesome-production-machine-learning - Delta Lake - Delta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines. (Data Storage Optimisation)
- awesome-spark - Delta Lake - Storage layer with ACID transactions. (Packages / Storage)
- awesome-llmops - Delta-Lake (Data / Data Management)
- AiTreasureBox - delta-io/delta - An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (Repos)
README
[![Test](https://github.com/delta-io/delta/actions/workflows/test.yaml/badge.svg)](https://github.com/delta-io/delta/actions/workflows/test.yaml)
[![License](https://img.shields.io/badge/license-Apache%202-brightgreen.svg)](https://github.com/delta-io/delta/blob/master/LICENSE.txt)
[![PyPI](https://img.shields.io/pypi/v/delta-spark.svg)](https://pypi.org/project/delta-spark/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/delta-spark)](https://pypistats.org/packages/delta-spark)

Delta Lake is an open-source storage framework that enables building a [Lakehouse architecture](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python.
* See the [Delta Lake Documentation](https://docs.delta.io) for details.
* See the [Quick Start Guide](https://docs.delta.io/latest/quick-start.html) to get started with Scala, Java and Python.
* Note that this repo is one of many Delta Lake repositories in the [delta.io](https://github.com/delta-io) organization, including
[delta](https://github.com/delta-io/delta),
[delta-rs](https://github.com/delta-io/delta-rs),
[delta-sharing](https://github.com/delta-io/delta-sharing),
[kafka-delta-ingest](https://github.com/delta-io/kafka-delta-ingest), and
[website](https://github.com/delta-io/website).

The following are some of the more popular Delta Lake integrations; refer to [delta.io/integrations](https://delta.io/integrations/) for the complete list:
* [Apache Spark™](https://docs.delta.io/): This connector allows Apache Spark™ to read from and write to Delta Lake.
* [Apache Flink (Preview)](https://github.com/delta-io/delta/tree/master/connectors/flink): This connector allows Apache Flink to write to Delta Lake.
* [PrestoDB](https://prestodb.io/docs/current/connector/deltalake.html): This connector allows PrestoDB to read from Delta Lake.
* [Trino](https://trino.io/docs/current/connector/delta-lake.html): This connector allows Trino to read from and write to Delta Lake.
* [Delta Standalone](https://docs.delta.io/latest/delta-standalone.html): This library allows Scala and Java-based projects (including Apache Flink, Apache Hive, Apache Beam, and PrestoDB) to read from and write to Delta Lake.
* [Apache Hive](https://docs.delta.io/latest/hive-integration.html): This connector allows Apache Hive to read from Delta Lake.
* [Delta Rust API](https://docs.rs/deltalake/latest/deltalake/): This library gives Rust (with Python and Ruby bindings) low-level access to Delta tables and is intended to be used with data processing frameworks like datafusion, ballista, rust-dataframe, vega, etc.
Table of Contents
* [Latest binaries](#latest-binaries)
* [API Documentation](#api-documentation)
* [Compatibility](#compatibility)
  * [API Compatibility](#api-compatibility)
  * [Data Storage Compatibility](#data-storage-compatibility)
* [Roadmap](#roadmap)
* [Building](#building)
* [Transaction Protocol](#transaction-protocol)
* [Requirements for Underlying Storage Systems](#requirements-for-underlying-storage-systems)
* [Concurrency Control](#concurrency-control)
* [Reporting issues](#reporting-issues)
* [Contributing](#contributing)
* [License](#license)
* [Community](#community)

## Latest Binaries
See the [online documentation](https://docs.delta.io/latest/) for the latest release.
## API Documentation
* [Scala API docs](https://docs.delta.io/latest/delta-apidoc.html)
* [Java API docs](https://docs.delta.io/latest/api/java/index.html)
* [Python API docs](https://docs.delta.io/latest/api/python/index.html)

## Compatibility
The [Delta Standalone](https://docs.delta.io/latest/delta-standalone.html) library is a single-node Java library that can be used to read from and write to Delta tables. Specifically, this library provides APIs to interact with a table’s metadata in the transaction log, implementing the Delta Transaction Log Protocol to achieve the transactional guarantees of the Delta Lake format.

### API Compatibility
There are two types of APIs provided by the Delta Lake project.
- Direct Java/Scala/Python APIs - The classes and methods documented in the [API docs](https://docs.delta.io/latest/delta-apidoc.html) are considered stable public APIs. All other classes, interfaces, and methods that may be directly accessible in code are considered internal and are subject to change across releases.
- Spark-based APIs - You can read and write Delta tables through the `DataFrameReader`/`Writer` (i.e., `spark.read`, `df.write`, `spark.readStream`, and `df.writeStream`). Options to these APIs will remain stable within a major release of Delta Lake (e.g., 1.x.x).

See the [online documentation](https://docs.delta.io/latest/releases.html) for the releases and their compatibility with Apache Spark versions.

### Data Storage Compatibility
Delta Lake guarantees backward compatibility for all Delta Lake tables (i.e., newer versions of Delta Lake will always be able to read tables written by older versions of Delta Lake). However, we reserve the right to break forward compatibility as new features are introduced to the transaction protocol (i.e., an older version of Delta Lake may not be able to read a table produced by a newer version).
Breaking changes in the protocol are indicated by incrementing the minimum reader/writer version in the `Protocol` [action](https://github.com/delta-io/delta/blob/master/core/src/test/scala/org/apache/spark/sql/delta/ActionSerializerSuite.scala).
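The version gate described above can be pictured with a small sketch. It assumes the `Protocol` action carries a minimum reader version and a minimum writer version; the class, function names, and version numbers below are hypothetical and are not part of any Delta Lake API, and real clients may apply additional checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    """Minimum protocol versions a table demands (illustrative stand-in only)."""
    min_reader_version: int
    min_writer_version: int


def can_read(client_reader_version: int, table: Protocol) -> bool:
    # A client can read only if it implements at least the table's minimum reader version.
    return client_reader_version >= table.min_reader_version


def can_write(client_writer_version: int, table: Protocol) -> bool:
    # Likewise for writes, against the table's minimum writer version.
    return client_writer_version >= table.min_writer_version


# Hypothetical tables: one written by an older release, one by a newer release.
older_table = Protocol(min_reader_version=1, min_writer_version=1)
newer_table = Protocol(min_reader_version=2, min_writer_version=2)

# A hypothetical client that implements reader version 1 and writer version 1.
print(can_read(1, older_table))   # older table stays readable
print(can_read(1, newer_table))   # newer table may be out of reach
print(can_write(1, newer_table))
```

In this sketch, upgrading the client never makes an older table unreadable, which mirrors the backward-compatibility guarantee, while an outdated client is refused as soon as the table's minimum version moves past it.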
## Roadmap
* For the high-level Delta Lake roadmap, see [Delta Lake 2022H1 roadmap](http://delta.io/roadmap).
* For the detailed timeline, see the [project roadmap](https://github.com/delta-io/delta/milestones).

## Transaction Protocol
The [Delta Transaction Log Protocol](PROTOCOL.md) document provides a specification of the transaction protocol.
## Requirements for Underlying Storage Systems
Delta Lake ACID guarantees are predicated on the atomicity and durability guarantees of the storage system. Specifically, we require the storage system to provide the following.
1. **Atomic visibility**: There must be a way for a file to be visible in its entirety or not visible at all.
2. **Mutual exclusion**: Only one writer must be able to create (or rename) a file at the final destination.
3. **Consistent listing**: Once a file has been written in a directory, all future listings for that directory must return that file.

See the [online documentation on Storage Configuration](https://docs.delta.io/latest/delta-storage.html) for details.
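As a rough illustration of the mutual exclusion requirement, the sketch below uses a local directory as a stand-in for the storage system and relies on exclusive file creation so that only one writer can claim a given file name. The `try_commit` helper, the zero-padded file naming, and the payloads are assumptions made for the example; the sketch does not address atomic visibility, since a reader could observe the file before its content is fully written.

```python
import os
import tempfile


def try_commit(log_dir: str, version: int, payload: bytes) -> bool:
    """Try to create the file for `version`; return False if it already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_CREAT | O_EXCL makes creation fail when the file is already present,
        # so at most one writer can create the file at its final destination.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return True


with tempfile.TemporaryDirectory() as log_dir:
    first = try_commit(log_dir, 0, b'{"writer": "A"}')
    second = try_commit(log_dir, 0, b'{"writer": "B"}')
    # On a local filesystem the new file shows up in the next listing.
    listing = sorted(os.listdir(log_dir))

print(first, second, listing)
```

The second call loses the race and leaves the first writer's file untouched. Storage systems that lack an equivalent create-if-absent primitive need extra coordination to provide the same property.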
## Concurrency Control
Delta Lake ensures _serializability_ for concurrent reads and writes. Please see [Delta Lake Concurrency Control](https://docs.delta.io/latest/delta-concurrency.html) for more details.
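One way to picture how concurrent writers can end up in a single serial order is a retry loop over a versioned log, sketched below. This is a toy in-memory model written for illustration, not Delta Lake's implementation: the `InMemoryLog` class and `commit` function are hypothetical, and the sketch omits the check for logically conflicting changes that a real system would need before retrying.

```python
import threading
from typing import Dict


class InMemoryLog:
    """Toy commit log in which each version number can be written at most once."""

    def __init__(self) -> None:
        self._entries: Dict[int, str] = {}
        self._lock = threading.Lock()

    def latest_version(self) -> int:
        with self._lock:
            return max(self._entries, default=-1)

    def put_if_absent(self, version: int, writer: str) -> bool:
        with self._lock:
            if version in self._entries:
                return False
            self._entries[version] = writer
            return True

    def snapshot(self) -> Dict[int, str]:
        with self._lock:
            return dict(self._entries)


def commit(log: InMemoryLog, writer: str) -> int:
    """Claim the next version, retrying whenever another writer claimed it first."""
    while True:
        attempt = log.latest_version() + 1
        if log.put_if_absent(attempt, writer):
            return attempt
        # Another writer won this version; re-read the latest version and retry.


log = InMemoryLog()
threads = [threading.Thread(target=commit, args=(log, f"writer-{i}")) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

committed = log.snapshot()
print(sorted(committed))
```

Whatever the thread scheduling, every writer ends up owning exactly one version and the versions form an unbroken sequence, so the log reads as if the writes had happened one after another.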
## Reporting issues
We use [GitHub Issues](https://github.com/delta-io/delta/issues) to track community-reported issues. You can also [contact](#community) the community to get answers.
## Contributing
We welcome contributions to Delta Lake. See our [CONTRIBUTING.md](https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md) for more details.
We also adhere to the [Delta Lake Code of Conduct](https://github.com/delta-io/delta/blob/master/CODE_OF_CONDUCT.md).
## Building
Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
To compile, run

    build/sbt compile

To generate artifacts, run

    build/sbt package

To execute tests, run

    build/sbt test

To execute a single test suite, run

    build/sbt spark/'testOnly org.apache.spark.sql.delta.optimize.OptimizeCompactionSQLSuite'

To execute a single test within a single test suite, run

    build/sbt spark/'testOnly *.OptimizeCompactionSQLSuite -- -z "optimize command: on partitioned table - all partitions"'

Refer to [SBT docs](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html) for more commands.
## IntelliJ Setup
IntelliJ is the recommended IDE to use when developing Delta Lake. To import Delta Lake as a new project:
1. Clone Delta Lake into, for example, `~/delta`.
2. In IntelliJ, select `File` > `New Project` > `Project from Existing Sources...` and select `~/delta`.
3. Under `Import project from external model` select `sbt`. Click `Next`.
4. Under `Project JDK` specify a valid Java `1.8` JDK and opt to use SBT shell for `project reload` and `builds`.
5. Click `Finish`.

### Setup Verification
After waiting for IntelliJ to index, verify your setup by running a test suite in IntelliJ.
1. Search for and open `DeltaLogSuite`
2. Next to the class declaration, right click on the two green arrows and select `Run 'DeltaLogSuite'`

### Troubleshooting
If you see errors of the form
```
Error:(46, 28) object DeltaSqlBaseParser is not a member of package io.delta.sql.parser
import io.delta.sql.parser.DeltaSqlBaseParser._
...
Error:(91, 22) not found: type DeltaSqlBaseParser
val parser = new DeltaSqlBaseParser(tokenStream)
```
then follow these steps:
1. Compile using the SBT CLI: `build/sbt compile`.
2. Go to `File` > `Project Structure...` > `Modules` > `delta-spark`.
3. In the right panel under `Source Folders` remove any `target` folders, e.g. `target/scala-2.12/src_managed/main [generated]`
4. Click `Apply` and then re-run your test.

## License
Apache License 2.0, see [LICENSE](https://github.com/delta-io/delta/blob/master/LICENSE.txt).

## Community
The Delta Lake community communicates through the following channels.
* Public Slack Channel
  - [Register here](https://go.delta.io/slack)
  - [Login here](https://delta-users.slack.com/)
* [LinkedIn page](https://www.linkedin.com/company/deltalake)
* [YouTube channel](https://www.youtube.com/c/deltalake)
* Public [Mailing list](https://groups.google.com/forum/#!forum/delta-users)