https://github.com/riii111/claude-readability-hook
✂️ HTML ➜ 📜 Text – tuned for AI prompts & token thrift
https://github.com/riii111/claude-readability-hook
bun claude-code fastapi
Last synced: 2 months ago
JSON representation
✂️ HTML ➜ 📜 Text – tuned for AI prompts & token thrift
- Host: GitHub
- URL: https://github.com/riii111/claude-readability-hook
- Owner: riii111
- Created: 2025-07-25T15:17:59.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-08-11T03:29:17.000Z (10 months ago)
- Last Synced: 2025-08-26T07:36:43.969Z (10 months ago)
- Topics: bun, claude-code, fastapi
- Language: TypeScript
- Homepage:
- Size: 521 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
Claude Readability Hook
✂️ HTML ➜ 📜 Text – tuned for AI prompts & token thrift
---
## 👩💻 TL;DR
| | What it does | Why you care |
|---|---|---|
| 🧹 **Trim the fluff** | Strips ads, nav & code fences | ⬇️ 40‑70 % token cut |
| 🕸️ **Any website** | Handles JS‑heavy SPA via headless Chromium | No "blank page" failures |
| 🧠 **Self‑tuning** | Scores every extraction & auto‑switches engine | Always picks the best text |
| ⚡ **Forum‑optimized** | Direct API integration for Reddit/StackOverflow | 2‑3× better content capture |
| 🔐 **Safe by default** | SSRF guard + DNS re‑resolve | Drop‑in for prod |
---
## 🏃♂️ Quick Start
```bash
git clone https://github.com/you/claude-readability-hook
cd claude-readability-hook
docker compose up -d # start gateway + extractor + renderer
curl -XPOST :7777/extract -d '{"url":"https://example.com"}' | jq '.text | length'
```
---
## 🏗️ Architecture (60‑sec view)
```mermaid
graph TD
Claude[Claude Hook] --> A
subgraph "Gateway"
A[SSRF Guard] --> B{Needs SSR?}
B -->|No| C[Trafilatura]
B -->|Yes| R[Playwright] --> C
C -->|Low score| D[Readability.js]
C --> Result[Result]
D --> Result
end
Result --> Claude
```
---
## 🚀 Feature Highlights
* **Smart engine switch** – Trafilatura ➜ Readability whenever score < 50
* **AMP / print rewrite** – auto‑fetches lightweight HTML variants
* **24 h LRU cache** – hit‑ratio metric exposed via Prometheus
* **OpenTelemetry hooks** – trace every extract / render call
### 🎯 Special Site Support
Optimized extraction for developer-focused platforms:
* **Stack Overflow** – Official API integration fetches question + top 5 answers (vote-sorted)
* **Reddit** – JSON endpoint captures post + top 20 comments + replies
* **Auto-fallback** – Falls back to standard pipeline if API fails
**Results**: 2-3× better content capture vs. generic HTML parsing
---
## 📋 REST API
| Verb | Path | Description |
|------|------|-------------|
| `POST` | `/extract` | Return `{title,text,engine,score,cached}` |
| `GET` | `/health` | Dependency & self check |
| `GET` | `/metrics` | Prometheus exposition |
Example request
```bash
curl -XPOST :7777/extract \
-H 'Content-Type: application/json' \
-d '{"url":"https://news.ycombinator.com/item?id=39237223"}'
```
---
## 📈 Key Metrics
```promql
# success rate per engine
rate(gateway_extract_total{success="true"}[5m]) by (engine)
# SSR usage %
sum(rate(gateway_extract_total{ssr="true"}[5m]))
/ sum(rate(gateway_extract_total[5m]))
# cache hit ratio
sum(rate(gateway_cache_total{op="hit"}[5m]))
/ sum(rate(gateway_cache_total{op=~"hit|miss"}[5m]))
```
---
## 🛠️ Local Dev
```bash
# Start all services
docker compose up -d
# Development with hot-reload
cd apps/gateway && bun install && bun run dev # Gateway
cd apps/extractor && uv sync && uv run python server.py # Extractor
```
> Cache & rate‑limit are disabled when `NODE_ENV=test`.
---
## 🗺️ Roadmap
* [ ] Chunk‑level summarization for giant docs
* [ ] PDF / EPUB source support
* [ ] Optional GPT‑4 "refine" post‑processor
---
## 🙏 Acknowledgements
Powered by **Trafilatura**, **Mozilla Readability**, and **Playwright**.