https://github.com/sologuy/bookmarksummarizer
🧠 Turn Chrome bookmarks into a personal knowledge base with AI summaries. Supports OpenAI, Deepseek, Qwen, and Ollama.
- Host: GitHub
- URL: https://github.com/sologuy/bookmarksummarizer
- Owner: sologuy
- License: apache-2.0
- Created: 2025-03-25T02:11:02.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-03-25T02:30:06.000Z (2 months ago)
- Last Synced: 2025-04-08T04:36:57.980Z (about 2 months ago)
- Topics: ai-summarization, bookmark-manager, bookmark-organizer, chrome-bookmarks, chrome-extension, llm, personal-knowledge-management
- Language: Python
- Homepage:
- Size: 28.3 KB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README-CN.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# BookmarkSummarizer (书签大脑)
BookmarkSummarizer is a powerful tool that crawls your Chrome bookmarks, generates summaries using large language models, and turns them into a personal knowledge base. Easily search and utilize all your bookmarked web resources without manual organization.
## ✨ Key Features
- 🔍 **Smart Bookmark Crawling**: Automatically extract content from Chrome bookmarks
- 🤖 **AI Summary Generation**: Create high-quality summaries for each bookmark using large language models
- 🔄 **Parallel Processing**: Efficient multi-threaded crawling to significantly reduce processing time
- 🌐 **Multiple Model Support**: Compatible with OpenAI, Deepseek, Qwen, and Ollama models
- 💾 **Checkpoint Recovery**: Continue processing after interruptions without losing completed work
- 📊 **Detailed Logging**: Clear progress and status reports for monitoring and debugging

## 🚀 Quick Start
### Prerequisites
- Python 3.6+
- Chrome browser
- Internet connection
- Large language model API key (optional)

### Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/BookmarkSummarizer.git
cd BookmarkSummarizer
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Configure environment variables (create a `.env` file):
```
MODEL_TYPE=ollama # Options: openai, deepseek, qwen, ollama
API_KEY=your_api_key_here
API_BASE=http://localhost:11434 # Ollama local endpoint or other model API address
MODEL_NAME=llama2 # Or other supported model
MAX_TOKENS=1000
TEMPERATURE=0.3
```
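The README does not show how these variables are loaded at startup; a minimal sketch, assuming the common `python-dotenv` approach, is given below. The variable names come from the block above, but the loading code itself is an assumption, not the project's confirmed implementation.

```python
# Hypothetical configuration loader; python-dotenv is an assumed dependency,
# since the README does not say how .env is read.
import os

from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

MODEL_TYPE = os.getenv("MODEL_TYPE", "ollama")
API_KEY = os.getenv("API_KEY", "")
API_BASE = os.getenv("API_BASE", "http://localhost:11434")
MODEL_NAME = os.getenv("MODEL_NAME", "llama2")
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "1000"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.3"))
```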
### Usage

**Basic usage**:
```bash
python crawl.py
```

**Limit the number of bookmarks**:
```bash
python crawl.py --limit 10
```

**Set the number of parallel processing threads**:
```bash
python crawl.py --workers 10
```

**Skip summary generation**:
```bash
python crawl.py --no-summary
```

**Generate summaries from already crawled content**:
```bash
python crawl.py --from-json
```

## 📋 Detailed Features
### Bookmark Crawling
BookmarkSummarizer automatically reads all bookmarks from the Chrome bookmarks file and intelligently filters out ineligible URLs. It uses two strategies to crawl web content:
1. **Regular Crawling**: Uses the Requests library to capture content from most web pages
2. **Dynamic Content Crawling**: For dynamic web pages (such as Zhihu and similar platforms), it automatically switches to Selenium (sketched below)
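The crawling code itself is not included in this README; the snippet below is a minimal, hypothetical sketch of the two-stage strategy described above, trying a plain Requests fetch first and falling back to headless Selenium. The function names and the fallback heuristic are illustrative assumptions, not the project's actual API.

```python
# Hypothetical sketch of the two-stage crawl described above; function
# names and the fallback heuristic are assumptions, not the project's code.
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def fetch_static(url: str, timeout: int = 10) -> str:
    """First attempt: plain HTTP fetch with the Requests library."""
    resp = requests.get(url, timeout=timeout,
                        headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    return resp.text


def fetch_dynamic(url: str) -> str:
    """Fallback: render JavaScript-heavy pages with headless Chrome."""
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def crawl(url: str) -> str:
    """Try the cheap static fetch first; fall back to Selenium when the
    request fails or returns suspiciously little content."""
    try:
        html = fetch_static(url)
        if len(html) > 500:  # crude heuristic for a usable static page
            return html
    except requests.RequestException:
        pass
    return fetch_dynamic(url)
```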
### Summary Generation

BookmarkSummarizer uses advanced large language models to generate a high-quality summary for each bookmark's content, including (one plausible call is sketched after this list):
- Extracting key information and important concepts
- Preserving technical terms and key data
- Generating structured summaries for easier retrieval
- Supporting various mainstream large language models
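As a hedged illustration of such a call, here is what summarization could look like when `MODEL_TYPE` points at an OpenAI-compatible endpoint. The prompt wording, function name, and use of the `openai` client are assumptions based on the configuration shown earlier, not the project's actual code.

```python
# Hypothetical summarization call; the prompt, function name and use of the
# OpenAI-compatible chat API are assumptions, not the project's actual code.
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("API_KEY"),
                base_url=os.getenv("API_BASE"))


def summarize(title: str, content: str) -> str:
    # Trim the input roughly as MAX_INPUT_CONTENT_LENGTH suggests.
    content = content[: int(os.getenv("MAX_INPUT_CONTENT_LENGTH", "6000"))]
    resp = client.chat.completions.create(
        model=os.getenv("MODEL_NAME", "llama2"),
        max_tokens=int(os.getenv("MAX_TOKENS", "1000")),
        temperature=float(os.getenv("TEMPERATURE", "0.3")),
        messages=[
            {"role": "system",
             "content": "Summarize the page. Keep key concepts, technical "
                        "terms and important data, and produce a structured "
                        "summary suitable for later retrieval."},
            {"role": "user", "content": f"Title: {title}\n\n{content}"},
        ],
    )
    return resp.choices[0].message.content
```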
### Checkpoint Recovery

- Saves progress immediately after processing each bookmark
- Automatically skips previously processed bookmarks when restarted
- Ensures data safety even when processing large numbers of bookmarks (see the sketch after this list)
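The checkpoint format is not documented in this README; the sketch below shows the save-after-each-item, skip-on-restart pattern the list describes, reusing the `bookmarks_with_content.json` file name from the Output Files section below. The record fields and helper names are assumptions.

```python
# Minimal sketch of the checkpoint pattern described above: persist after
# every bookmark and skip finished URLs on restart. Record fields and the
# helpers (crawl, summarize) are assumptions, not the project's code.
import json
import os

CHECKPOINT = "bookmarks_with_content.json"


def load_done():
    """Return already-processed bookmarks keyed by URL, if any."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, encoding="utf-8") as f:
            return {item["url"]: item for item in json.load(f)}
    return {}


def save_done(done):
    with open(CHECKPOINT, "w", encoding="utf-8") as f:
        json.dump(list(done.values()), f, ensure_ascii=False, indent=2)


def process_all(bookmarks):
    done = load_done()
    for bm in bookmarks:
        if bm["url"] in done:        # resume: skip completed work
            continue
        content = crawl(bm["url"])                 # see crawling sketch above
        summary = summarize(bm["title"], content)  # see summary sketch above
        done[bm["url"]] = {**bm, "content": content, "summary": summary}
        save_done(done)              # checkpoint after every bookmark
```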
## 📁 Output Files

- `bookmarks.json`: Filtered bookmark list
- `bookmarks_with_content.json`: Bookmark data with content and summaries
- `failed_urls.json`: Failed URLs and the reason each one failed
## 🔧 Custom Configuration

In addition to command-line parameters, you can set the following environment variables through the `.env` file:
```
# Model Type Settings
MODEL_TYPE=ollama # openai, deepseek, qwen, ollama
API_KEY=your_api_key_here
API_BASE=http://localhost:11434
MODEL_NAME=llama2

# Content Processing Settings
MAX_TOKENS=1000 # Maximum number of tokens for summary generation
MAX_INPUT_CONTENT_LENGTH=6000 # Maximum length of input content
TEMPERATURE=0.3 # Randomness of summary generation

# Crawler Settings
BOOKMARK_LIMIT=0 # No limit by default
MAX_WORKERS=20 # Number of parallel worker threads
GENERATE_SUMMARY=true # Whether to generate summaries
```

## 🤝 Contributing
Pull Requests are welcome! For any issues or suggestions, please create an Issue.
## 📄 License
This project is licensed under the [Apache License 2.0](LICENSE).
## 🔮 Future Plans
- [ ] Add vector database support for semantic search
- [ ] Develop a web interface for visual management
- [ ] Support bookmark imports from more browsers
- [ ] Add scheduled update functionality to keep bookmark content current
- [ ] Support export to knowledge graphs