{"id":51058216,"url":"https://github.com/quickwit-oss/tantivy-datafusion","last_synced_at":"2026-06-22T23:01:19.214Z","repository":{"id":365800590,"uuid":"1157052355","full_name":"quickwit-oss/tantivy-datafusion","owner":"quickwit-oss","description":null,"archived":false,"fork":false,"pushed_at":"2026-06-18T23:18:03.000Z","size":427,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-19T01:13:51.273Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/quickwit-oss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-02-13T11:20:24.000Z","updated_at":"2026-06-18T23:18:07.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/quickwit-oss/tantivy-datafusion","commit_stats":null,"previous_names":["quickwit-oss/tantivy-datafusion"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/quickwit-oss/tantivy-datafusion","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Ftantivy-datafusion","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Ftantivy-datafusion/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Ftantivy-datafusion/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Ftantivy-datafusion/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/quickwit-oss","download_url":"https://codeload.github.com/quickwit-oss/tantivy-datafusion/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Ftantivy-datafusion/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34668499,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-22T02:00:06.391Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-22T23:01:18.188Z","updated_at":"2026-06-22T23:01:19.208Z","avatar_url":"https://github.com/quickwit-oss.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tantivy-datafusion\n\n`tantivy-datafusion` exposes Tantivy indexes as DataFusion tables so callers can\nrun SQL over search indexes.\n\nThe crate is aimed at query engines that already own Tantivy indexes or split\nmetadata and want DataFusion execution, projection, filtering, aggregation, and\ndistributed-plan serialization around that data. It is not an Elasticsearch API\ncompatibility layer.\n\nStatus: alpha. The core execution path is tested, but public APIs may still\nchange as the Quickwit integration settles.\n\n## What It Provides\n\n- A `TantivyTableProvider` that presents one or more Tantivy indexes as a single\n  DataFusion table.\n- SQL projection and filtering over Tantivy fast fields.\n- A `full_text(column, query)` SQL UDF that pushes Tantivy query parsing and\n  full-text search into the table scan.\n- Optional `_score` and `_document` columns for scored search results and stored\n  document retrieval.\n- `AggPushdown`, a DataFusion physical optimizer rule that pushes supported\n  aggregations into Tantivy's native aggregation engine.\n- Split/runtime abstractions for distributed execution:\n  `SplitDescriptor`, `SplitRuntimeFactory`, `SyncExecutionPool`, and\n  `TantivyCodec`.\n\n## Data Model\n\nThe provider builds a DataFusion schema from Tantivy fast fields. It also adds\ninternal/search columns:\n\n- `_doc_id`: Tantivy document id within a segment.\n- `_segment_ord`: Tantivy segment ordinal.\n- `_score`: `Float32`, populated when a scored full-text query is active.\n- `_document`: stored Tantivy document serialized as JSON.\n\nFast fields are the main SQL column surface. Text fields can also participate in\n`full_text(...)` predicates when they are indexed in Tantivy.\n\n## Basic Usage\n\n```rust\nuse std::sync::Arc;\n\nuse datafusion::prelude::*;\nuse tantivy::Index;\nuse tantivy_datafusion::{full_text_udf, TantivyTableProvider};\n\nasync fn query_index(index: Index) -\u003e datafusion::common::Result\u003c()\u003e {\n    let ctx = SessionContext::new();\n\n    ctx.register_udf(full_text_udf());\n    ctx.register_table(\"docs\", Arc::new(TantivyTableProvider::new(index)))?;\n\n    let batches = ctx\n        .sql(\n            \"SELECT id, price, _score\n             FROM docs\n             WHERE full_text(category, 'electronics') AND price \u003e 2.0\n             ORDER BY _score DESC\n             LIMIT 10\",\n        )\n        .await?\n        .collect()\n        .await?;\n\n    for batch in batches {\n        println!(\"{batch:?}\");\n    }\n\n    Ok(())\n}\n```\n\nFor aggregation pushdown, register `AggPushdown` in the DataFusion session state:\n\n```rust\nuse std::sync::Arc;\n\nuse datafusion::execution::SessionStateBuilder;\nuse datafusion::prelude::*;\nuse tantivy_datafusion::{full_text_udf, AggPushdown, TantivyTableProvider};\n\nlet state = SessionStateBuilder::new()\n    .with_config(SessionConfig::new())\n    .with_default_features()\n    .with_physical_optimizer_rule(Arc::new(AggPushdown::new()))\n    .build();\n\nlet ctx = SessionContext::new_with_state(state);\nctx.register_udf(full_text_udf());\nctx.register_table(\"docs\", Arc::new(TantivyTableProvider::new(index)))?;\n```\n\nSupported pushdowns include common grouped and ungrouped aggregations that can be\nrepresented by Tantivy aggregation requests. Unsupported aggregate shapes fall\nback to normal DataFusion execution.\n\n## Multi-Split Execution\n\nFor local multi-index execution, use:\n\n```rust\nlet provider = TantivyTableProvider::from_local_splits(indexes)?;\nctx.register_table(\"docs\", Arc::new(provider))?;\n```\n\nFor distributed execution, integrations provide split metadata and runtime\nresolution:\n\n- `SplitDescriptor` carries serializable split metadata.\n- `SplitRuntimeFactory` prepares a worker-local `PreparedSplit`.\n- `TantivyCodec` serializes DataFusion physical plans containing Tantivy data\n  sources.\n- `SyncExecutionPool` lets the embedding runtime choose where synchronous\n  Tantivy query work runs.\n\nThese hooks are intended for Quickwit-style split scheduling where planning and\nexecution happen in different processes.\n\n## Current Scope\n\nThis repository is focused on SQL execution over Tantivy:\n\n- fast-field scans;\n- full-text predicates through `full_text(...)`;\n- score/document projection when requested;\n- schema evolution across splits;\n- aggregation pushdown where Tantivy can execute the aggregation directly;\n- distributed physical-plan serialization.\n\nOut of scope for this crate:\n\n- implementing the Elasticsearch REST API;\n- preserving Elasticsearch response formats;\n- serving as a general ES compatibility layer.\n\nHistorical design notes and integration plans live under `docs/`; they are not\nAPI contracts.\n\n## Development\n\nRun the standard checks:\n\n```bash\ncargo fmt --check\ncargo test\ncargo clippy --all-targets -- -D warnings\nRUSTDOCFLAGS=\"-D warnings\" cargo doc --no-deps\n```\n\nThe benchmark suite includes aggregation scenarios:\n\n```bash\ncargo bench --bench agg_bench\n```\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquickwit-oss%2Ftantivy-datafusion","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquickwit-oss%2Ftantivy-datafusion","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquickwit-oss%2Ftantivy-datafusion/lists"}