{"id":26943865,"url":"https://github.com/aixiasang/pylsm","last_synced_at":"2025-04-02T17:17:46.624Z","repository":{"id":285719875,"uuid":"959108321","full_name":"aixiasang/pyLsm","owner":"aixiasang","description":"PyLSM is a high-performance key-value storage engine built on the LSM-tree (Log-Structured Merge Tree) architecture and implemented in Python. Besides efficient key-value storage and retrieval, the project includes advanced features such as Bloom filters and leveled compaction.","archived":false,"fork":false,"pushed_at":"2025-04-02T09:29:36.000Z","size":66,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-02T10:32:23.472Z","etag":null,"topics":["kvsql","lsm","nosql","pylsm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aixiasang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-04-02T09:27:29.000Z","updated_at":"2025-04-02T09:30:23.000Z","dependencies_parsed_at":"2025-04-02T10:42:55.433Z","dependency_job_id":null,"html_url":"https://github.com/aixiasang/pyLsm","commit_stats":null,"previous_names":["aixiasang/pylsm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aixiasang%2FpyLsm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aixiasang%2FpyLsm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aixiasang%2FpyLsm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aixiasang%2FpyLsm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aixiasan
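g","download_url":"https://codeload.github.com/aixiasang/pyLsm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246856600,"owners_count":20844974,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["kvsql","lsm","nosql","pylsm"],"created_at":"2025-04-02T17:17:46.014Z","updated_at":"2025-04-02T17:17:46.618Z","avatar_url":"https://github.com/aixiasang.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PyLSM: An LSM-Tree Key-Value Storage Engine in Python\n\nPyLSM is a high-performance key-value storage engine built on the LSM-tree (Log-Structured Merge Tree) architecture and implemented in Python. Besides efficient key-value storage and retrieval, the project includes advanced features such as Bloom filters and leveled compaction.\n\n## 🌟 Core Features\n\n- 📦 LSM-tree architecture: a complete implementation of the LSM-tree data structure\n- 🚀 High-performance reads and writes: optimized read and write paths\n- 🔍 Range queries: efficient range scans\n- 🌸 Bloom filters: fewer unnecessary disk reads\n- 📊 Leveled compaction: a smart file-merging strategy\n- 🛠️ Command-line tool: an interactive interface for working with data\n- 💾 Data persistence: supports crash recovery\n- 🔄 Automatic compaction: files are merged automatically in the background\n\n## 📚 Educational Value\n\nThis project is particularly well suited to the following learning scenarios:\n\n1. **Database principles courses**:\n   - A working implementation of the LSM-tree data structure\n   - Core concepts of storage engines\n   - Data persistence mechanisms\n\n2. **System design courses**:\n   - High-performance system architecture\n   - Concurrency control implementation\n   - File system interaction\n\n3. 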
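**Advanced Python programming**:\n   - Object-oriented design patterns\n   - File I/O handling\n   - Performance optimization techniques\n\n## 🔧 Installation\n\n```bash\n# Clone the project\ngit clone https://github.com/aixiasang/pyLsm.git\ncd pyLsm\n\n# Install dependencies\npip install -r requirements.txt\n```\n\n## 🚀 Quick Start\n\n### Basic Operations\n\n```python\nfrom pylsm.db import DB\n\n# Create a database instance\ndb = DB(\"./test_db\")\n\n# Write key-value pairs\ndb.put(b\"hello\", b\"world\")\ndb.put(b\"name\", b\"PyLSM\")\n\n# Read a value\nvalue = db.get(b\"hello\")  # returns b\"world\"\nprint(f\"Read result: {value.decode()}\")\n\n# Range query\nprint(\"Iterating over all key-value pairs:\")\nfor key, value in db.range(b\"a\", b\"z\"):\n    print(f\"Key: {key.decode()}, Value: {value.decode()}\")\n\n# Delete a key\ndb.delete(b\"hello\")\n\n# Close the database\ndb.close()\n```\n\n### Using the Command-Line Tool\n\nPyLSM ships with a command-line tool for interactive use:\n\n```bash\npython -m pylsm.cli ./my_db\n```\n\nSupported commands:\n```\nopen [--no-create]  # Open the database\nclose              # Close the database\nput \u003ckey\u003e \u003cvalue\u003e  # Insert a key-value pair\nget \u003ckey\u003e          # Get a value\ndelete \u003ckey\u003e       # Delete a key\nscan               # Range scan\n  --start \u003ckey\u003e    # Start key\n  --end \u003ckey\u003e      # End key\n  --limit \u003cn\u003e      # Maximum number of results\ncompact            # Trigger compaction manually\ninfo               # Show database information\nhelp               # Show help\nexit               # Exit the CLI\n```\n\n## 📖 Understanding LSM Trees in Depth\n\n### How an LSM Tree Works\n\n1. **Write path**:\n   ```\n   MemTable (in-memory table)\n        ↓\n   Immutable MemTable\n        ↓\n   SSTable files (Level 0)\n        ↓\n   Leveled compaction (Level 1-N)\n   ```\n\n2. **Read path**:\n   ```\n   Look up key\n     ↓\n   Check MemTable → not found\n     ↓\n   Check Immutable MemTable → not found\n     ↓\n   Check Bloom filter\n     ↓\n   Check SSTable files level by level\n   ```\n\n### Core Components in Detail\n\n1. **MemTable (in-memory table)**\n   - Implementation: skip list\n   - Characteristics: fast reads and writes\n   - Source: `pylsm/memtable.py`\n\n2. **WAL (write-ahead log)**\n   - Purpose: ensures data durability\n   - Implementation: sequential writes to disk\n   - Source: `pylsm/wal.py`\n\n3. **SSTable (sorted string table)**\n   - Structure: data blocks + index block + metadata\n   - Characteristics: immutable, sorted storage\n   - Source: `pylsm/sstable.py`\n\n4. 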
**Bloom filter**\n   - Purpose: quickly checks whether a key may be present\n   - Principle: probabilistic data structure\n   - Source: `pylsm/bloom_filter.py`\n\n## 🔬 Advanced Features\n\n### Tuned Configuration Example\n\n```python\nfrom pylsm.db import DB, Options\n\n# Create tuned configuration options\noptions = Options(\n    # Set the MemTable size to 2 MB\n    memtable_size=2 * 1024 * 1024,\n    \n    # Bloom filter parameter (10 bits per key)\n    bloom_filter_bits=10,\n    \n    # Maximum number of levels (affects the compaction strategy)\n    max_level=7,\n    \n    # Set the Level 0 size to 4 MB\n    level0_size=4 * 1024 * 1024,\n    \n    # Size ratio between adjacent levels\n    size_ratio=10\n)\n\n# Create the database with the tuned configuration\ndb = DB(\"./optimized_db\", options=options)\n```\n\n### Batch Writes\n\n```python\n# Atomic batch write example\nwith db.batch_write() as batch:\n    for i in range(1000):\n        key = f\"user:{i}\".encode()\n        value = f\"data:{i}\".encode()\n        batch.put(key, value)\n```\n\n### Advanced Range Queries\n\n```python\n# Range query with a result limit\ndef range_query_with_limit(db, start_key, end_key, limit=10):\n    count = 0\n    print(f\"Query range: {start_key.decode()} to {end_key.decode()}\")\n    print(\"-\" * 40)\n    \n    for key, value in db.range(start_key, end_key):\n        print(f\"Key: {key.decode()}\")\n        print(f\"Value: {value.decode()}\")\n        print(\"-\" * 20)\n        \n        count += 1\n        if count \u003e= limit:\n            break\n            \n    print(f\"Returned {count} records\")\n```\n\n## 🎯 Performance Tuning Tips\n\n1. **Memory management**\n   - Choose a sensible MemTable size\n   - Limit cache usage\n   - Flush the MemTable to disk promptly\n\n2. **Compaction strategy**\n   - Choose an appropriate number of levels\n   - Set a reasonable size ratio\n   - Keep the number of files under control\n\n3. 
**Read optimization**\n   - Use Bloom filters\n   - Cache hot data\n   - Optimize the lookup path\n\n## 🔍 Debugging and Monitoring\n\n### Performance Analysis\n\n```python\nfrom pylsm.db import DB\nimport time\n\ndef benchmark_write(db, count=10000):\n    start_time = time.time()\n    \n    for i in range(count):\n        key = f\"bench:key:{i}\".encode()\n        value = f\"value:{i}\".encode()\n        db.put(key, value)\n        \n    duration = time.time() - start_time\n    ops_per_sec = count / duration\n    \n    print(f\"Wrote {count} records\")\n    print(f\"Total time: {duration:.2f} s\")\n    print(f\"Throughput: {ops_per_sec:.2f} ops/sec\")\n```\n\n### Monitoring Metrics\n\n```python\ndef print_db_stats(db):\n    print(\"Database status:\")\n    print(f\"- MemTable size: {db.memtable_size} bytes\")\n    print(f\"- SSTable file count: {len(db.sstables)}\")\n    print(f\"- Total records: {db.total_keys}\")\n    print(f\"- Bloom filter false positive rate: {db.false_positive_rate:.4f}\")\n```\n\n## 📝 Development Guidelines\n\n1. **Code style**\n   - Follow PEP 8\n   - Add detailed comments\n   - Use type hints\n\n2. **Test coverage**\n   - Unit tests\n   - Integration tests\n   - Performance tests\n\n3. **Error handling**\n   - Exception handling\n   - Logging\n   - Graceful degradation\n\n## 🤝 Contributing\n\nPull requests that improve the project are welcome! Suggested workflow:\n\n1. Fork this repository\n2. Create a feature branch\n3. Commit your changes\n4. Push to the branch\n5. Open a pull request\n\n## 📄 License\n\nThis project is released under the MIT License. See the [LICENSE](LICENSE) file for details.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faixiasang%2Fpylsm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faixiasang%2Fpylsm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faixiasang%2Fpylsm/lists"}