https://github.com/eventual-inc/daft
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
big-data data-engineering data-science dataframe distributed-computing machine-learning python rust
- Host: GitHub
- URL: https://github.com/eventual-inc/daft
- Owner: Eventual-Inc
- License: apache-2.0
- Created: 2022-04-25T22:02:29.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2025-05-08T20:55:25.000Z (12 months ago)
- Last Synced: 2025-05-08T20:56:03.563Z (12 months ago)
- Topics: big-data, data-engineering, data-science, dataframe, distributed-computing, machine-learning, python, rust
- Language: Rust
- Homepage: https://getdaft.io
- Size: 26.8 MB
- Stars: 2,814
- Watchers: 20
- Forks: 207
- Open Issues: 345
Metadata Files:
- Readme: README.rst
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
|Banner|
|CI| |PyPI| |Latest Tag| |Coverage| |Slack|
`Website `_ • `Docs `_ • `Installation `_ • `Daft Quickstart `_ • `Community and Support `_
Daft: High-Performance Data Engine for AI and Multimodal Workloads
==================================================================
|TrendShift|
`Daft `_ is a high-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale.
* **Native multimodal processing:** Process images, audio, video, and embeddings alongside structured data in a single framework
* **Built-in AI operations:** Run LLM prompts, generate embeddings, and classify data at scale using OpenAI, Transformers, or custom models
* **Python-native, Rust-powered:** Skip the JVM complexity with Python at its core and Rust under the hood for blazing performance
* **Seamless scaling:** Start local, then scale to distributed clusters on `Ray `_ or `Kubernetes `_
* **Universal connectivity:** Access data anywhere (S3, GCS, Iceberg, Delta Lake, Hugging Face, Unity Catalog)
* **Out-of-box reliability:** Intelligent memory management and sensible defaults eliminate configuration headaches
Getting Started
---------------
Installation
^^^^^^^^^^^^
Install Daft with ``pip install daft``. Requires Python 3.10 or higher.
For more advanced installations (e.g. installing from source or with extra dependencies such as Ray and AWS utilities), please see our `Installation Guide `_.
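As a rough pre-install check, the sketch below verifies that the current interpreter meets the Python 3.10 minimum stated above. The names ``MIN_PYTHON`` and ``python_is_supported`` are hypothetical helpers for illustration only and are not part of Daft.

```python
import sys

# Minimum version taken from the installation note above ("Python 3.10 or higher").
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version meets the stated minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    if python_is_supported():
        print("Python version OK; run: pip install daft")
    else:
        found = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Python 3.10 or higher is required, found {found}")
```

Running the script with the interpreter you intend to use for Daft shows whether ``pip install daft`` is expected to work with it.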
Quickstart
^^^^^^^^^^
Get started in minutes with our `Quickstart `_ - load a real-world e-commerce dataset, process product images, and run AI inference at scale.
More Resources
^^^^^^^^^^^^^^
* `Examples `_ - see Daft in action with use cases across text, images, audio, and more
* `User Guide `_ - take a deep-dive into each topic within Daft
* `API Reference `_ - API reference for public classes/functions of Daft
Benchmarks
----------
|Benchmark Image|
To see the full benchmarks, detailed setup, and logs, check out our `benchmarking page `_.
Contributing
------------
We ❤️ developers! To start contributing to Daft, please read `CONTRIBUTING.md `_. This document describes the development lifecycle and toolchain for working on Daft. It also details how to add new functionality to the core engine and expose it through a Python API.
Here's a list of `good first issues `_ to get yourself warmed up with Daft. Comment in the issue to pick it up, and feel free to ask any questions!
Telemetry
---------
To help improve Daft, we collect non-identifiable data via Scarf (https://scarf.sh).
To disable this behavior, set the environment variable ``DO_NOT_TRACK=true``.
The data that we collect is:
1. **Non-identifiable:** Events are keyed by a session ID which is generated on import of Daft
2. **Metadata-only:** We do not collect any of our users’ proprietary code or data
3. **For development only:** We do not buy or sell any user data
Please see our `documentation `_ for more details.
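If exporting the variable in a shell is inconvenient, one option is to set it from Python. This is a minimal sketch: ``disable_daft_telemetry`` is a hypothetical helper, not a Daft API. Calling it before the first ``import daft`` is an assumption, based on the note above that the session ID is generated on import.

```python
import os


def disable_daft_telemetry():
    """Set DO_NOT_TRACK=true for this process and any child processes."""
    os.environ["DO_NOT_TRACK"] = "true"
    return os.environ["DO_NOT_TRACK"]


# Assumption: the variable should be in place before Daft is first imported,
# so call the helper at the very top of the program.
disable_daft_telemetry()
```

After this runs, a later ``import daft`` in the same process sees ``DO_NOT_TRACK=true`` in its environment.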
.. image:: https://static.scarf.sh/a.png?x-pxid=31f8d5ba-7e09-4d75-8895-5252bbf06cf6
Related Projects
----------------
+--------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| Engine       | Query Optimizer | Multimodal    | Distributed | Arrow Backed    | Vectorized Execution Engine | Out-of-core |
+==============+=================+===============+=============+=================+=============================+=============+
| Daft         | Yes             | Yes           | Yes         | Yes             | Yes                         | Yes         |
+--------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Pandas `_   | No              | Python object | No          | optional >= 2.0 | Some(Numpy)                 | No          |
+--------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Polars `_   | Yes             | Python object | No          | Yes             | Yes                         | Yes         |
+--------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Modin `_    | Yes             | Python object | Yes         | No              | Some(Pandas)                | Yes         |
+--------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Ray Data `_ | No              | Yes           | Yes         | Yes             | Some(PyArrow)               | Yes         |
+--------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `PySpark `_  | Yes             | No            | Yes         | Pandas UDF/IO   | Pandas UDF                  | Yes         |
+--------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Dask DF `_  | No              | Python object | Yes         | No              | Some(Pandas)                | Yes         |
+--------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
License
-------
Daft has an Apache 2.0 license - please see the LICENSE file.
.. |Quickstart Image| image:: https://github.com/Eventual-Inc/Daft/assets/17691182/dea2f515-9739-4f3e-ac58-cd96d51e44a8
    :alt: Dataframe code to load a folder of images from AWS S3 and create thumbnails
    :height: 256

.. |Benchmark Image| image:: https://raw.githubusercontent.com/Eventual-Inc/Daft/refs/heads/main/assets/benchmark.png
    :alt: AI Benchmarks

.. |Banner| image:: https://daft.ai/images/diagram.png
    :target: https://www.daft.ai
    :alt: Daft dataframes can load any data such as PDF documents, images, protobufs, csv, parquet and audio files into a table dataframe structure for easy querying

.. |CI| image:: https://github.com/Eventual-Inc/Daft/actions/workflows/pr-test-suite.yml/badge.svg
    :target: https://github.com/Eventual-Inc/Daft/actions/workflows/pr-test-suite.yml?query=branch:main
    :alt: GitHub Actions tests

.. |PyPI| image:: https://img.shields.io/pypi/v/daft.svg?label=pip&logo=PyPI&logoColor=white
    :target: https://pypi.org/project/daft
    :alt: PyPI

.. |Latest Tag| image:: https://img.shields.io/github/v/tag/Eventual-Inc/Daft?label=latest&logo=GitHub
    :target: https://github.com/Eventual-Inc/Daft/tags
    :alt: latest tag

.. |Coverage| image:: https://codecov.io/gh/Eventual-Inc/Daft/branch/main/graph/badge.svg?token=J430QVFE89
    :target: https://codecov.io/gh/Eventual-Inc/Daft
    :alt: Coverage

.. |Slack| image:: https://img.shields.io/badge/slack-@distdata-purple.svg?logo=slack
    :target: https://join.slack.com/t/dist-data/shared_invite/zt-3rh9jr9iv-tmmTNOlQpfvhEy2NTMWS_w
    :alt: slack community

.. |TrendShift| image:: https://trendshift.io/api/badge/repositories/8239
    :target: https://trendshift.io/repositories/8239
    :alt: Eventual-Inc/Daft | Trendshift
    :width: 250px
    :height: 55px