An open API service indexing awesome lists of open source software.

https://github.com/eventual-inc/daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://github.com/eventual-inc/daft

big-data data-engineering data-science dataframe distributed-computing machine-learning python rust

Last synced: 13 days ago
JSON representation

Distributed data engine for Python/SQL designed for the cloud, powered by Rust

Awesome Lists containing this project

README

          

|Banner|

|CI| |PyPI| |Latest Tag| |Coverage| |Slack|

`Website `_ • `Docs `_ • `Installation `_ • `Daft Quickstart `_ • `Community and Support `_

Daft: High-Performance Data Engine for AI and Multimodal Workloads
==================================================================

|TrendShift|

`Daft `_ is a high-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale.

* **Native multimodal processing:** Process images, audio, video, and embeddings alongside structured data in a single framework
* **Built-in AI operations:** Run LLM prompts, generate embeddings, and classify data at scale using OpenAI, Transformers, or custom models
* **Python-native, Rust-powered:** Skip the JVM complexity with Python at its core and Rust under the hood for blazing performance
* **Seamless scaling:** Start local, scale to distributed clusters on `Ray `_, `Kubernetes `_
* **Universal connectivity:** Access data anywhere (S3, GCS, Iceberg, Delta Lake, Hugging Face, Unity Catalog)
* **Out-of-box reliability:** Intelligent memory management and sensible defaults eliminate configuration headaches

Getting Started
---------------

Installation
^^^^^^^^^^^^

Install Daft with ``pip install daft``. Requires Python 3.10 or higher.

For more advanced installations (e.g. installing from source or with extra dependencies such as Ray and AWS utilities), please see our `Installation Guide `_

Quickstart
^^^^^^^^^^

Get started in minutes with our `Quickstart `_ - load a real-world e-commerce dataset, process product images, and run AI inference at scale.

More Resources
^^^^^^^^^^^^^^

* `Examples `_ - see Daft in action with use cases across text, images, audio, and more
* `User Guide `_ - take a deep-dive into each topic within Daft
* `API Reference `_ - API reference for public classes/functions of Daft

Benchmarks
----------
|Benchmark Image|

To see the full benchmarks, detailed setup, and logs, check out our `benchmarking page. `_

Contributing
------------

We ❤️ developers! To start contributing to Daft, please read `CONTRIBUTING.md `_. This document describes the development lifecycle and toolchain for working on Daft. It also details how to add new functionality to the core engine and expose it through a Python API.

Here's a list of `good first issues `_ to get yourself warmed up with Daft. Comment in the issue to pick it up, and feel free to ask any questions!

Telemetry
---------

To help improve Daft, we collect non-identifiable data via Scarf (https://scarf.sh).

To disable this behavior, set the environment variable ``DO_NOT_TRACK=true``.

The data that we collect is:

1. **Non-identifiable:** Events are keyed by a session ID which is generated on import of Daft
2. **Metadata-only:** We do not collect any of our users’ proprietary code or data
3. **For development only:** We do not buy or sell any user data

Please see our `documentation `_ for more details.

.. image:: https://static.scarf.sh/a.png?x-pxid=31f8d5ba-7e09-4d75-8895-5252bbf06cf6

Related Projects
----------------

+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| Engine | Query Optimizer | Multimodal | Distributed | Arrow Backed | Vectorized Execution Engine | Out-of-core |
+===================================================+=================+===============+=============+=================+=============================+=============+
| Daft | Yes | Yes | Yes | Yes | Yes | Yes |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Pandas `_ | No | Python object | No | optional >= 2.0 | Some(Numpy) | No |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Polars `_ | Yes | Python object | No | Yes | Yes | Yes |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Modin `_ | Yes | Python object | Yes | No | Some(Pandas) | Yes |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Ray Data `_ | No | Yes | Yes | Yes | Some(PyArrow) | Yes |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `PySpark `_ | Yes | No | Yes | Pandas UDF/IO | Pandas UDF | Yes |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Dask DF `_ | No | Python object | Yes | No | Some(Pandas) | Yes |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+

License
-------

Daft has an Apache 2.0 license - please see the LICENSE file.

.. |Quickstart Image| image:: https://github.com/Eventual-Inc/Daft/assets/17691182/dea2f515-9739-4f3e-ac58-cd96d51e44a8
:alt: Dataframe code to load a folder of images from AWS S3 and create thumbnails
:height: 256

.. |Benchmark Image| image:: https://raw.githubusercontent.com/Eventual-Inc/Daft/refs/heads/main/assets/benchmark.png
:alt: AI Benchmarks

.. |Banner| image:: https://daft.ai/images/diagram.png
:target: https://www.daft.ai
:alt: Daft dataframes can load any data such as PDF documents, images, protobufs, csv, parquet and audio files into a table dataframe structure for easy querying

.. |CI| image:: https://github.com/Eventual-Inc/Daft/actions/workflows/pr-test-suite.yml/badge.svg
:target: https://github.com/Eventual-Inc/Daft/actions/workflows/pr-test-suite.yml?query=branch:main
:alt: GitHub Actions tests

.. |PyPI| image:: https://img.shields.io/pypi/v/daft.svg?label=pip&logo=PyPI&logoColor=white
:target: https://pypi.org/project/daft
:alt: PyPI

.. |Latest Tag| image:: https://img.shields.io/github/v/tag/Eventual-Inc/Daft?label=latest&logo=GitHub
:target: https://github.com/Eventual-Inc/Daft/tags
:alt: latest tag

.. |Coverage| image:: https://codecov.io/gh/Eventual-Inc/Daft/branch/main/graph/badge.svg?token=J430QVFE89
:target: https://codecov.io/gh/Eventual-Inc/Daft
:alt: Coverage

.. |Slack| image:: https://img.shields.io/badge/slack-@distdata-purple.svg?logo=slack
:target: https://join.slack.com/t/dist-data/shared_invite/zt-3rh9jr9iv-tmmTNOlQpfvhEy2NTMWS_w
:alt: slack community

.. |TrendShift| image:: https://trendshift.io/api/badge/repositories/8239
:target: https://trendshift.io/repositories/8239
:alt: Eventual-Inc/Daft | Trendshift
:width: 250px
:height: 55px