{"id":28613999,"url":"https://github.com/lakehq/sail","last_synced_at":"2026-04-14T11:00:59.443Z","repository":{"id":251005736,"uuid":"734174248","full_name":"lakehq/sail","owner":"lakehq","description":"LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.","archived":false,"fork":false,"pushed_at":"2026-04-12T05:42:02.000Z","size":13036,"stargazers_count":1303,"open_issues_count":189,"forks_count":83,"subscribers_count":12,"default_branch":"main","last_synced_at":"2026-04-12T06:18:13.448Z","etag":null,"topics":["arrow","artificial-intelligence","big-data","data","data-engineering","datafusion","distributed-computing","machine-learning","pyspark","python","rust","spark","sql"],"latest_commit_sha":null,"homepage":"https://lakesail.com","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lakehq.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-12-21T03:43:49.000Z","updated_at":"2026-04-12T00:19:11.000Z","dependencies_parsed_at":"2024-08-19T09:43:42.156Z","dependency_job_id":"1f2449fd-ba89-4437-ba5c-459239be8724","html_url":"https://github.com/lakehq/sail","commit_stats":null,"previous_names":["lakehq/sail"],"tags_count":38,"template":false,"template_full_name":null,"purl":"pkg:github/lakehq/sail","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lakehq%2Fsail","tags_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lakehq%2Fsail/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lakehq%2Fsail/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lakehq%2Fsail/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lakehq","download_url":"https://codeload.github.com/lakehq/sail/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lakehq%2Fsail/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31793225,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T02:24:21.117Z","status":"ssl_error","status_checked_at":"2026-04-14T02:24:20.627Z","response_time":153,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arrow","artificial-intelligence","big-data","data","data-engineering","datafusion","distributed-computing","machine-learning","pyspark","python","rust","spark","sql"],"created_at":"2025-06-12T01:10:43.658Z","updated_at":"2026-04-14T11:00:59.434Z","avatar_url":"https://github.com/lakehq.png","language":"Rust","readme":"# Sail\n\n[![Build 
Status](https://github.com/lakehq/sail/actions/workflows/build.yml/badge.svg?branch=main\u0026event=push)](https://github.com/lakehq/sail/actions)\n[![Codecov](https://codecov.io/gh/lakehq/sail/graph/badge.svg)](https://app.codecov.io/gh/lakehq/sail)\n[![PyPI Release](https://img.shields.io/pypi/v/pysail)](https://pypi.org/project/pysail/)\n[![Static Slack Badge](https://img.shields.io/badge/slack-LakeSail_Community-3762E0?logo=slack)](https://www.launchpass.com/lakesail-community/free)\n\nSail is an open-source **unified and distributed multimodal computation framework** created by [LakeSail](https://lakesail.com/).\n\nOur mission is to **unify batch processing, stream processing, and compute-intensive AI workloads**. Sail is a compute engine that is:\n\n- **Compatible** with the Spark Connect protocol, supporting the Spark SQL and DataFrame API with no code rewrites required.\n- **~4x faster** than Spark in benchmarks (up to 8x in specific workloads).\n- **94% cheaper** on infrastructure costs.\n- **100% Rust-native** with no JVM overhead, delivering memory safety, instant startup, and predictable performance.\n\n**🚀 Sail outperforms Spark, popular Spark accelerators, Databricks, and Snowflake on [ClickBench](https://go.lakesail.com/clickbench).**\n\n**💬 [Join our Slack community](https://www.launchpass.com/lakesail-community/free)** to ask questions, share feedback, and connect with other Sail users and contributors.\n\n## Documentation\n\nThe documentation of the latest Sail version can be found [here](https://docs.lakesail.com/sail/latest/).\n\n## Installation\n\n### Quick Start\n\nSail is available as a Python package on PyPI. 
You can install it along with PySpark in your Python environment.\n\n```bash\npip install pysail\npip install \"pyspark[connect]\"\n```\n\nAlternatively, since Spark 4.0, you can install the lightweight client package `pyspark-client`.\nThe `pyspark-connect` package, which is equivalent to `pyspark[connect]`, is also available since Spark 4.0.\n\n### Advanced Use Cases\n\nYou can install Sail from source to optimize performance for your specific hardware architecture. The detailed [Installation Guide](https://docs.lakesail.com/sail/latest/introduction/installation/) walks you through this process step-by-step.\n\nIf you need to deploy Sail in production environments, the [Deployment Guide](https://docs.lakesail.com/sail/latest/guide/deployment/) provides comprehensive instructions for deploying Sail on Kubernetes clusters and other infrastructure configurations.\n\n## Getting Started\n\n### Starting the Sail Server\n\n**Option 1: Command Line Interface.** You can start the local Sail server using the `sail` command.\n\n```bash\nsail spark server --port 50051\n```\n\n**Option 2: Python API.** You can start the local Sail server using the Python API.\n\n```python\nfrom pysail.spark import SparkConnectServer\n\nserver = SparkConnectServer(port=50051)\nserver.start(background=False)\n```\n\n**Option 3: Kubernetes.** You can deploy Sail on Kubernetes and run Sail in cluster mode for distributed processing.\nPlease refer to the [Kubernetes Deployment Guide](https://docs.lakesail.com/sail/latest/guide/deployment/kubernetes.html) for instructions on building the Docker image and writing the Kubernetes manifest YAML file.\n\n```bash\nkubectl apply -f sail.yaml\nkubectl -n sail port-forward service/sail-spark-server 50051:50051\n```\n\n### Connecting to the Sail Server\n\nOnce you have a running Sail server, you can connect to it in PySpark.\nNo changes are needed in your PySpark code!\n\n```python\nfrom pyspark.sql import SparkSession\n\nspark = 
SparkSession.builder.remote(\"sc://localhost:50051\").getOrCreate()\nspark.sql(\"SELECT 1 + 1\").show()\n```\n\nPlease refer to the [Getting Started](https://docs.lakesail.com/sail/latest/introduction/getting-started/) guide for further details.\n\n## Feature Highlights\n\n### Storage\n\nSail supports a variety of storage backends for reading and writing data. You can read more details in our [Storage Guide](https://docs.lakesail.com/sail/latest/guide/storage/).\n\nHere are the storage options supported:\n\n- AWS S3\n- Cloudflare R2\n- Azure\n- Google Cloud Storage\n- Hugging Face\n- HDFS\n- File systems\n- HTTP/HTTPS\n- In-memory storage\n\n### Lakehouse Formats\n\nSail provides native support for modern lakehouse table formats, offering reliable storage layers with strong data management guarantees and ensuring interoperability with existing datasets.\n\nPlease refer to the following guides for the supported formats:\n\n- [Delta Lake Guide](https://docs.lakesail.com/sail/latest/guide/formats/delta.html)\n- [Apache Iceberg Guide](https://docs.lakesail.com/sail/latest/guide/formats/iceberg.html)\n\n### Catalog Providers\n\nSail supports multiple catalog providers, such as the Apache Iceberg REST Catalog and Unity Catalog. 
You can manage datasets as external tables and integrate with broader data-platform ecosystems.\n\nFor more details on usage and best practices, see the [Catalog Guide](https://docs.lakesail.com/sail/latest/guide/catalog/).\n\n## Benchmark Results\n\nDerived TPC-H results show that Sail outperforms Apache Spark in every query:\n\n- **Execution Time**: ~4× faster across diverse SQL workloads.\n- **Hardware Cost**: 94% lower with significantly lower peak memory usage and zero shuffle spill.\n\n| Metric                     | Spark    | Sail            |\n| -------------------------- | -------- | --------------- |\n| Total Query Time           | 387.36 s | **102.75 s**    |\n| Query Speed-Up             | Baseline | **43% – 727%**  |\n| Peak Memory Usage          | 54 GB    | **22 GB (1 s)** |\n| Disk Write (Shuffle Spill) | \u003e 110 GB | **0 GB**        |\n\nThese results come from a derived TPC-H benchmark (22 queries, scale factor 100, Parquet format) on AWS `r8g.4xlarge` instances.\n\n![Query Time Comparison](https://github.com/lakehq/sail/raw/46d0520532f22e99de6d9ade6373a117216484ca/.github/images/query-time.svg)\n\nSee the full analysis and graphs on our [Benchmark Results](https://docs.lakesail.com/sail/latest/introduction/benchmark-results/) page.\n\n## Why Choose Sail?\n\nWhen Spark was invented over 15 years ago, it was revolutionary. 
It redefined distributed data processing and powered ETL, machine learning, and analytics pipelines across industries.\n\nBut Spark’s JVM-based architecture now struggles to meet modern demands for performance and cloud efficiency:\n\n- **Garbage collection pauses** introduce latency spikes.\n- **Serialization overhead** slows data exchange between JVM and Python.\n- **Heavy executors** drive up cloud costs and complicate scaling.\n- **Row-based processing** performs poorly on analytical workloads and leaves hardware efficiency untapped.\n- **Slow startup** delays workloads, hurting interactive and on-demand use cases.\n\nSail solves these problems with a modern, Rust-native design.\n\n### Sail is Spark-compatible\n\nSail offers a drop-in replacement for Spark SQL and the Spark DataFrame API. Existing PySpark code works out of the box once you connect your Spark client session to Sail over the Spark Connect protocol.\n\n- **Spark SQL Dialect Support.** A custom Rust parser (built with parser combinators and Rust procedural macros) covers Spark SQL syntax with production-grade accuracy.\n- **DataFrame API Support.** Spark DataFrame operations run on Sail with identical semantics.\n- **Python UDF, UDAF, UDWF, and UDTF Support.** Python, Pandas, and Arrow UDFs all follow the same conventions as Spark.\n\n### Sail’s Advantages over Spark\n\n- **Rust-Native Engine.** No garbage collection pauses, no JVM memory tuning, and low memory footprint.\n- **Columnar Format and Vectorized Execution.** Built on top of [Apache Arrow](https://github.com/apache/arrow) and [Apache DataFusion](https://github.com/apache/datafusion), the columnar in-memory format and SIMD instructions unlock blazing-fast query execution.\n- **Lightning-Fast Python UDFs.** Python code runs inside Sail with zero serialization overhead as Arrow array pointers enable zero-copy data sharing.\n- **Performant Data Shuffling.** Workers exchange Arrow columnar data directly, minimizing shuffle costs for joins and 
aggregations.\n- **Lightweight, Stateless Workers.** Workers start in seconds, consume only a few megabytes of memory at idle, and scale elastically to cut cloud costs and simplify operations.\n- **Concurrency and Memory Safety You Can Trust.** Rust’s ownership model prevents null pointers, race conditions, and unsafe memory access for unmatched reliability.\n\nCurious about how Sail stacks up against Spark? Explore our [Why Sail?](https://lakesail.com/why-sail/) page. Ready to bring your existing workloads over? Our [Migration Guide](https://docs.lakesail.com/sail/latest/introduction/migrating-from-spark/) shows you how.\n\n## Further Reading\n\n- [Architecture](https://docs.lakesail.com/sail/latest/concepts/architecture/) – Overview of Sail’s design for both local and cluster modes, and how it transitions seamlessly between them.\n- [Query Planning](https://docs.lakesail.com/sail/latest/concepts/query-planning/) – Detailed explanation of how Sail parses SQL and Spark relations, builds logical and physical plans, and handles execution for local and cluster modes.\n- [SQL](https://docs.lakesail.com/sail/latest/guide/sql/) and [DataFrame](https://docs.lakesail.com/sail/latest/guide/dataframe/) Features – Complete reference for Spark SQL and DataFrame API compatibility.\n- [LakeSail Blog](https://lakesail.com/blog/) – Updates on Sail releases, benchmarks, and technical insights.\n\n**✨Using Sail? [Tell us your story](https://lakesail.com/share-story/) and get free merch!✨**\n","funding_links":[],"categories":["Rust"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flakehq%2Fsail","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flakehq%2Fsail","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flakehq%2Fsail/lists"}