{"id":13487048,"url":"https://github.com/lakesoul-io/LakeSoul","last_synced_at":"2025-03-27T21:31:52.275Z","repository":{"id":36965871,"uuid":"442405131","full_name":"lakesoul-io/LakeSoul","owner":"lakesoul-io","description":"LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.","archived":false,"fork":false,"pushed_at":"2024-10-29T09:21:28.000Z","size":36113,"stargazers_count":2377,"open_issues_count":14,"forks_count":423,"subscribers_count":247,"default_branch":"main","last_synced_at":"2024-10-29T11:39:23.164Z","etag":null,"topics":["arrow","big-data","datafusion","datalake","flink","huggingface","lakehouse","lakesoul","postgresql","python","pytorch","rust","spark","sql","streaming","vectorized","velox"],"latest_commit_sha":null,"homepage":"https://lakesoul-io.github.io/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lakesoul-io.png","metadata":{"files":{"readme":"README-CN.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-28T08:53:11.000Z","updated_at":"2024-10-29T09:21:32.000Z","dependencies_parsed_at":"2023-11-06T03:24:25.860Z","dependency_job_id":"682021bf-c559-4818-bf0f-51cf3b8eaff9","html_url":"https://github.com/lakesoul-io/LakeSoul","commit_stats":{"total_commits":966,"total_committers":31,"mean_commits":"31.161290322580644","dds":0.7867494824016563,"last_synced_commit":"bb19636814b6c72706bca218e57ec929c979e457"},"previous_names":[],"tags_count":17,"template":
false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lakesoul-io%2FLakeSoul","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lakesoul-io%2FLakeSoul/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lakesoul-io%2FLakeSoul/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lakesoul-io%2FLakeSoul/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lakesoul-io","download_url":"https://codeload.github.com/lakesoul-io/LakeSoul/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245798594,"owners_count":20673902,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arrow","big-data","datafusion","datalake","flink","huggingface","lakehouse","lakesoul","postgresql","python","pytorch","rust","spark","sql","streaming","vectorized","velox"],"created_at":"2024-07-31T18:00:54.779Z","updated_at":"2025-03-27T21:31:52.225Z","avatar_url":"https://github.com/lakesoul-io.png","language":"Java","funding_links":[],"categories":["Lakehouse","Table of Contents","Java","大数据"],"sub_categories":["Lakehouse System"],"readme":"\u003c!--\nSPDX-FileCopyrightText: 2023 LakeSoul Contributors\n\nSPDX-License-Identifier: Apache-2.0\n--\u003e\n\n\u003cimg src='https://github.com/lakesoul-io/artwork/blob/main/horizontal/color/LakeSoul_Horizontal_Color.svg' alt=\"LakeSoul\" height='200'\u003e\n\n\u003cimg 
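src='https://github.com/lfai/artwork/blob/main/lfaidata-assets/lfaidata-project-badge/sandbox/color/lfaidata-project-badge-sandbox-color.svg' alt=\"LF AI \u0026 Data Sandbox Project\" height='180'\u003e\n\nLakeSoul is an open-source, cloud-native lakehouse framework featuring highly scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and unified batch and stream processing. LakeSoul supports reading and writing lakehouse tables from multiple compute engines, including Spark, Flink, Presto, and PyTorch, covering batch, streaming, MPP, and AI workloads. LakeSoul supports storage systems such as HDFS and S3.\n![LakeSoul Architecture](website/static/img/lakeSoulModel.png)\n\nLakeSoul was developed by DMetaSoul (数元灵科技) and was officially donated to the Linux Foundation AI \u0026 Data Foundation in May 2023, becoming a Sandbox incubation project under the foundation.\n\nLakeSoul is specifically optimized for row- and column-level incremental updates, high-concurrency ingestion, and batch scan reads of data on data lake cloud storage. Its cloud-native architecture, which separates compute from storage, makes deployment very simple while supporting very large data volumes at low cost.\n\nLakeSoul achieves high write throughput for upserts on hash-partitioned primary keys through an LSM-Tree-like approach. A highly optimized Merge on Read implementation also preserves read performance (see [Performance Comparison](https://lakesoul-io.github.io/zh-Hans/blog/2023/04/21/lakesoul-2.2.0-release)). LakeSoul manages metadata with PostgreSQL, which gives the metadata layer high scalability and high-concurrency transaction capability.\n\nLakeSoul implements its native metadata layer and IO layer in Rust and wraps them with C/Java/Python interfaces, so that it can integrate with a variety of big data and AI compute frameworks.\n\nLakeSoul supports concurrent streaming and batch reads and writes, with full compatibility with CDC semantics on both read and write. With features such as automatic schema evolution and exactly-once semantics, it makes building an end-to-end streaming data warehouse straightforward.\n\nLakeSoul supports multiple workspaces and user permission isolation. It uses Postgres RBAC and row-level security policies to isolate metadata permissions. Combined with Hadoop users and groups, physical data isolation can also be achieved. LakeSoul's permission isolation applies to SQL, Java, and Python jobs alike.\n\nLakeSoul supports automatic disaggregated compaction, automatic table lifecycle cleanup, and automatic redundant data cleanup, which reduces maintenance costs and improves ease of use.\n\nFor more features and a comparison with other products, see [Feature Introduction](https://lakesoul-io.github.io/zh-Hans/docs/intro)\n\n# Tutorials\n* [Connecting the Lakehouse to AI: Data Preprocessing and Model Training with Python](https://github.com/lakesoul-io/LakeSoul/tree/main/python/examples): LakeSoul seamlessly connects the lakehouse and AI to build a modern Data+AI architecture.\n* [Whole-Database CDC Ingestion Tutorial](https://lakesoul-io.github.io/zh-Hans/docs/Tutorials/flink-cdc-sink): LakeSoul uses Flink CDC to synchronize entire databases from MySQL and many other databases, with automatic table creation, automatic DDL changes, and an exactly-once guarantee.\n* [Flink SQL Tutorial](https://lakesoul-io.github.io/zh-Hans/docs/Usage%20Docs/flink-lakesoul-connector): LakeSoul supports Flink streaming and batch reads and writes. Streaming reads and writes fully support Flink Changelog semantics, including row-level streaming inserts, updates, and deletes.\n* [Multi-Stream Merge for Building Wide Tables Tutorial](https://lakesoul-io.github.io/zh-Hans/docs/Tutorials/mutil-stream-merge): LakeSoul natively supports automatically merging multiple streams that share the same primary key (other columns may differ) into a single table, eliminating joins.\n* [Data Update (Upsert) and Merge UDF 
Tutorial](https://lakesoul-io.github.io/zh-Hans/docs/Tutorials/upsert-and-merge-udf): examples of using Merge UDF to customize merge logic in LakeSoul.\n* [Snapshot Usage Tutorial](https://lakesoul-io.github.io/zh-Hans/docs/Tutorials/snapshot-manage): how to use LakeSoul snapshot read, rollback, and cleanup.\n* [Incremental Query Tutorial](https://lakesoul-io.github.io/zh-Hans/docs/Tutorials/incremental-query): how to use incremental queries in Spark (both streaming and batch modes are supported).\n\n# Documentation\n\n[Quick Start](https://lakesoul-io.github.io/zh-Hans/docs/Getting%20Started/setup-local-env)\n\n[Usage Docs](https://lakesoul-io.github.io/zh-Hans/docs/Usage%20Docs/setup-meta-env)\n\n# Feature Roadmap\n[Feature Roadmap](https://github.com/lakesoul-io/LakeSoul#feature-roadmap)\n\n# Community Guidelines\n[Community Guidelines](community-guideline-cn.md)\n\n# Feedback\n\nFeel free to report problems through issues or discussions.\n\n### WeChat Official Account\nFollow the \u003cu\u003e**元灵数智**\u003c/u\u003e official account, where we regularly publish articles on LakeSoul architecture and code walkthroughs, end-to-end case studies of algorithms in production, and more.\n\n### LakeSoul Developer Community WeChat Group\nJoin the LakeSoul developer community WeChat group to discuss any LakeSoul development questions: after following the official account, tap \"了解我们-用户交流\" (About Us - User Community) at the bottom to get the latest group QR code. Alternatively, scan the QR code below to add the assistant on WeChat and be invited to the group:\n\n![WeChat community group](website/static/img/wechat.png)\n\n# Contact Us\nSend an email to [lakesoul-technical-discuss@lists.lfaidata.foundation](mailto:lakesoul-technical-discuss@lists.lfaidata.foundation).\n\n# License\nLakeSoul is released under the Apache License v2.0.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flakesoul-io%2FLakeSoul","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flakesoul-io%2FLakeSoul","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flakesoul-io%2FLakeSoul/lists"}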