{"id":51026636,"url":"https://github.com/gitstq/smartcompress","last_synced_at":"2026-06-21T20:02:09.250Z","repository":{"id":362796759,"uuid":"1260828418","full_name":"gitstq/smartcompress","owner":"gitstq","description":"智能文本压缩与Token优化工具 - Smart Text Compression \u0026 Token Optimization for LLM Context","archived":false,"fork":false,"pushed_at":"2026-06-05T23:22:21.000Z","size":27,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-06T01:13:50.981Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gitstq.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-05T23:19:52.000Z","updated_at":"2026-06-05T23:22:02.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/gitstq/smartcompress","commit_stats":null,"previous_names":["gitstq/smartcompress"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/gitstq/smartcompress","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fsmartcompress","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fsmartcompress/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fsmartcompress/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fsmar
tcompress/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gitstq","download_url":"https://codeload.github.com/gitstq/smartcompress/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gitstq%2Fsmartcompress/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34623906,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-21T02:00:05.568Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-21T20:02:08.525Z","updated_at":"2026-06-21T20:02:09.244Z","avatar_url":"https://github.com/gitstq.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# 🗜️ SmartCompress\n\n**智能文本压缩与Token优化工具**\n\n[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/)\n[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)\n[![Tests](https://img.shields.io/badge/Tests-21%2F21%20passed-brightgreen)](tests/)\n\n[简体中文](#简体中文) | [繁體中文](#繁體中文) | [English](#english)\n\n\u003c/div\u003e\n\n---\n\n## 简体中文\n\n### 🎉 项目介绍\n\n**SmartCompress** 是一款专为大型语言模型（LLM）上下文窗口优化而设计的智能文本压缩工具。随着GPT-4、Claude等大模型的普及，上下文窗口的限制成为开发者和用户面临的核心痛点——当需要处理的文本超出模型容量时，要么被迫截断内容导致信息丢失，要么需要多次调用增加成本和延迟。\n\n本项目灵感来源于GitHub Trending上的 
[headroom](https://github.com/chopratejas/headroom) 项目，但我们进行了**完全独立自研开发**，在参考其产品逻辑的基础上，实现了多项差异化优化：\n\n- 🧠 **智能策略选择** - 自动识别文本类型并选择最优压缩策略\n- 📊 **精确Token计数** - 基于OpenAI tiktoken，精确计算压缩前后的token数量\n- 🌊 **流式大文件处理** - 支持超大文件的分块流式压缩，内存占用极低\n- 🔌 **可扩展架构** - 插件化策略设计，易于扩展新的压缩算法\n\n### ✨ 核心特性\n\n| 特性 | 说明 | 表情 |\n|------|------|------|\n| **🎯 五种压缩策略** | 摘要、关键词提取、语义去重、代码精简、智能混合 | 🧠 |\n| **📏 精确Token计数** | 支持cl100k_base等多种tokenizer | 📊 |\n| **📁 多格式支持** | 文本、代码、JSON、YAML、日志、配置文件等 | 📂 |\n| **🌊 流式处理** | 大文件分块压缩，内存友好 | 🌊 |\n| **💰 Token预算控制** | 设定最大token数，自动压缩至目标范围 | 💰 |\n| **🎨 美观CLI界面** | 基于Rich库的彩色命令行交互 | 🎨 |\n| **🔧 易于集成** | 既可作为命令行工具，也可作为Python库使用 | 🔧 |\n\n### 🚀 快速开始\n\n#### 环境要求\n\n- Python 3.8 或更高版本\n- pip 包管理器\n\n#### 安装步骤\n\n```bash\n# 从PyPI安装（即将发布）\npip install smartcompress\n\n# 或从源码安装\ngit clone https://github.com/gitstq/smartcompress.git\ncd smartcompress\npip install -e .\n```\n\n#### 基本使用\n\n```bash\n# 压缩单个文件\nsmartcompress compress input.txt --ratio 0.5\n\n# 查看文件token统计\nsmartcompress stats input.txt\n\n# 批量压缩目录\nsmartcompress batch ./logs --ratio 0.3 --output-dir ./compressed\n\n# 流式压缩大文件\nsmartcompress stream large.log --chunk-size 4000 --ratio 0.5\n```\n\n#### Python API 使用\n\n```python\nfrom smartcompress import SmartCompressor\n\n# 创建压缩器实例\ncompressor = SmartCompressor(strategy='hybrid')\n\n# 压缩文本\ntext = \"这是一段需要压缩的长文本...\" * 100\nresult = compressor.compress(text, target_ratio=0.5)\n\nprint(f\"原始Token: {result.original_tokens}\")\nprint(f\"压缩后Token: {result.compressed_tokens}\")\nprint(f\"压缩率: {result.reduction_ratio}%\")\n\n# 使用token预算控制\nresult = compressor.compress(text, max_tokens=1000)\n```\n\n### 📖 详细使用指南\n\n#### 压缩策略说明\n\n| 策略 | 适用场景 | 压缩方式 |\n|------|----------|----------|\n| `summarize` | 文章、报告、长文本 | 提取关键段落，去除冗余内容 |\n| `keyword` | 笔记、会议纪要、文档 | 提取关键词和关键短语 |\n| `dedup` | 日志、聊天记录、重复文本 | 语义去重，合并相似内容 |\n| `code` | 源代码、配置文件 | 去除注释和空行，保留核心逻辑 |\n| `hybrid` | 通用场景（**推荐**） | 自动识别文本类型并组合策略 |\n\n#### 命令行参数详解\n\n```bash\n# 压缩命令\nsmartcompress compress 
\u003c文件路径\u003e [选项]\n -s, --strategy 压缩策略 (summarize/keyword/dedup/code/hybrid)\n -r, --ratio 目标压缩率 (0-1)\n -m, --max-tokens 最大token限制\n -o, --output 输出文件路径\n --model Tokenizer模型 (默认: cl100k_base)\n -l, --language 代码语言（用于代码压缩）\n\n# 统计命令\nsmartcompress stats \u003c文件路径\u003e\n --model Tokenizer模型\n\n# 批量压缩\nsmartcompress batch \u003c目录路径\u003e [选项]\n -p, --pattern 文件匹配模式 (默认: *)\n --recursive 递归处理子目录\n -o, --output-dir 输出目录\n\n# 流式压缩\nsmartcompress stream \u003c文件路径\u003e [选项]\n -c, --chunk-size 每块token数 (默认: 4000)\n```\n\n### 💡 设计思路与迭代规划\n\n#### 技术选型原因\n\n- **Python 3.8+**: 兼顾语法特性与兼容性\n- **tiktoken**: OpenAI官方tokenizer，精确计算GPT系列模型的token数\n- **Click**: 成熟的Python CLI框架，支持丰富的命令行交互\n- **Rich**: 提供美观的终端输出，包括表格、进度条、面板等\n\n#### 后续功能迭代计划\n\n- [ ] 支持更多tokenizer（HuggingFace Tokenizers、SentencePiece等）\n- [ ] 增加基于LLM的智能摘要策略\n- [ ] 支持压缩质量评估与对比\n- [ ] Web UI界面\n- [ ] 支持更多文件格式（PDF、Word、Excel等）\n\n### 📦 打包与部署指南\n\n#### 本地开发\n\n```bash\n# 克隆仓库\ngit clone https://github.com/gitstq/smartcompress.git\ncd smartcompress\n\n# 创建虚拟环境\npython -m venv venv\nsource venv/bin/activate # Linux/Mac\n# 或 venv\\Scripts\\activate # Windows\n\n# 安装开发依赖\npip install -e \".[dev]\"\n\n# 运行测试\npytest tests/ -v\n```\n\n#### 打包发布\n\n```bash\n# 构建分发包\npython setup.py sdist bdist_wheel\n\n# 上传到PyPI\ntwine upload dist/*\n```\n\n### 🤝 贡献指南\n\n欢迎提交Issue和Pull Request！\n\n- **Bug报告**: 请提供复现步骤和错误日志\n- **功能建议**: 请描述使用场景和预期行为\n- **代码贡献**: 请遵循PEP 8规范，并确保测试通过\n\n### 📄 开源协议\n\n本项目采用 [MIT License](LICENSE) 开源协议。\n\n---\n\n## 繁體中文\n\n### 🎉 專案介紹\n\n**SmartCompress** 是一款專為大型語言模型（LLM）上下文視窗優化而設計的智慧文字壓縮工具。隨著 GPT-4、Claude 等大模型的普及，上下文視窗的限制成為開發者和使用者面臨的核心痛點。\n\n本專案靈感來源於 GitHub Trending 上的 headroom 專案，但我們進行了**完全獨立自研開發**，在參考其產品邏輯的基礎上，實現了多項差異化優化。\n\n### ✨ 核心特性\n\n- 🧠 **五種壓縮策略** - 摘要、關鍵詞提取、語義去重、程式碼精簡、智慧混合\n- 📊 **精確 Token 計數** - 支援 cl100k_base 等多種 tokenizer\n- 📁 **多格式支援** - 文字、程式碼、JSON、YAML、日誌、設定檔等\n- 🌊 **流式處理** - 大檔案分塊壓縮，記憶體友好\n- 💰 **Token 預算控制** - 設定最大 token 數，自動壓縮至目標範圍\n\n### 🚀 快速開始\n\n#### 安裝\n\n```bash\npip install 
smartcompress\n```\n\n#### 基本使用\n\n```bash\n# 壓縮單個檔案\nsmartcompress compress input.txt --ratio 0.5\n\n# 查看檔案 token 統計\nsmartcompress stats input.txt\n\n# 批次壓縮目錄\nsmartcompress batch ./logs --ratio 0.3 --output-dir ./compressed\n```\n\n#### Python API\n\n```python\nfrom smartcompress import SmartCompressor\n\ncompressor = SmartCompressor(strategy='hybrid')\nresult = compressor.compress(text, target_ratio=0.5)\nprint(f\"壓縮率: {result.reduction_ratio}%\")\n```\n\n### 📖 詳細使用指南\n\n#### 壓縮策略說明\n\n| 策略 | 適用場景 | 壓縮方式 |\n|------|----------|----------|\n| `summarize` | 文章、報告、長文本 | 提取關鍵段落 |\n| `keyword` | 筆記、會議紀要 | 提取關鍵詞和關鍵短語 |\n| `dedup` | 日誌、聊天記錄 | 語義去重 |\n| `code` | 原始碼、設定檔 | 去除註解和空行 |\n| `hybrid` | 通用場景（**推薦**） | 自動識別並組合策略 |\n\n### 💡 設計思路與迭代規劃\n\n#### 技術選型原因\n\n- **Python 3.8+**: 兼顧語法特性與相容性\n- **tiktoken**: OpenAI 官方 tokenizer\n- **Click**: 成熟的 Python CLI 框架\n- **Rich**: 提供美觀的終端輸出\n\n#### 後續功能迭代計劃\n\n- [ ] 支援更多 tokenizer\n- [ ] 增加基於 LLM 的智慧摘要策略\n- [ ] Web UI 介面\n- [ ] 支援更多檔案格式（PDF、Word 等）\n\n### 📦 打包與部署指南\n\n```bash\n# 本地開發\ngit clone https://github.com/gitstq/smartcompress.git\ncd smartcompress\npip install -e .\npytest tests/ -v\n```\n\n### 🤝 貢獻指南\n\n歡迎提交 Issue 和 Pull Request！\n\n### 📄 開源協議\n\n本專案採用 [MIT License](LICENSE) 開源協議。\n\n---\n\n## English\n\n### 🎉 Project Introduction\n\n**SmartCompress** is an intelligent text compression tool designed for Large Language Model (LLM) context window optimization. 
As GPT-4, Claude, and other large models become prevalent, context window limitations have become a core pain point for developers and users.\n\nInspired by the [headroom](https://github.com/chopratejas/headroom) project on GitHub Trending, we have developed this project **completely independently** with several differentiated optimizations.\n\n### ✨ Core Features\n\n- 🧠 **Five Compression Strategies** - Summarize, Keyword Extract, Semantic Deduplicate, Code Minify, Smart Hybrid\n- 📊 **Accurate Token Counting** - Supports cl100k_base and other tokenizers\n- 📁 **Multi-format Support** - Text, Code, JSON, YAML, Logs, Config files\n- 🌊 **Streaming Processing** - Chunk-based compression for large files, memory-friendly\n- 💰 **Token Budget Control** - Set max token limit, auto-compress to target range\n\n### 🚀 Quick Start\n\n#### Installation\n\n```bash\npip install smartcompress\n```\n\n#### Basic Usage\n\n```bash\n# Compress a single file\nsmartcompress compress input.txt --ratio 0.5\n\n# View file token statistics\nsmartcompress stats input.txt\n\n# Batch compress directory\nsmartcompress batch ./logs --ratio 0.3 --output-dir ./compressed\n```\n\n#### Python API\n\n```python\nfrom smartcompress import SmartCompressor\n\ncompressor = SmartCompressor(strategy='hybrid')\nresult = compressor.compress(text, target_ratio=0.5)\nprint(f\"Reduction: {result.reduction_ratio}%\")\n```\n\n### 📖 Detailed Usage Guide\n\n#### Compression Strategies\n\n| Strategy | Use Case | Method |\n|----------|----------|--------|\n| `summarize` | Articles, Reports, Long text | Extract key paragraphs |\n| `keyword` | Notes, Meeting minutes | Extract keywords and phrases |\n| `dedup` | Logs, Chat records | Semantic deduplication |\n| `code` | Source code, Config files | Remove comments and empty lines |\n| `hybrid` | General use (**Recommended**) | Auto-detect and combine strategies |\n\n### 💡 Design Philosophy \u0026 Roadmap\n\n#### Tech Stack Rationale\n\n- **Python 3.8+**: Balance of modern 
features and compatibility\n- **tiktoken**: Official OpenAI tokenizer\n- **Click**: Mature Python CLI framework\n- **Rich**: Beautiful terminal output\n\n#### Future Roadmap\n\n- [ ] Support for more tokenizers\n- [ ] LLM-based intelligent summarization\n- [ ] Web UI\n- [ ] Support for more file formats (PDF, Word, etc.)\n\n### 📦 Packaging \u0026 Deployment\n\n```bash\n# Local development\ngit clone https://github.com/gitstq/smartcompress.git\ncd smartcompress\npip install -e .\npytest tests/ -v\n```\n\n### 🤝 Contributing\n\nIssues and Pull Requests are welcome!\n\n### 📄 License\n\nThis project is licensed under the [MIT License](LICENSE).\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\nMade with ❤️ by SmartCompress Team\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgitstq%2Fsmartcompress","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgitstq%2Fsmartcompress","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgitstq%2Fsmartcompress/lists"}