An open API service indexing awesome lists of open source software.

https://github.com/riii111/claude-readability-hook

✂️ HTML ➜ 📜 Text – tuned for AI prompts & token thrift
https://github.com/riii111/claude-readability-hook

bun claude-code fastapi

Last synced: 2 months ago
JSON representation

✂️ HTML ➜ 📜 Text – tuned for AI prompts & token thrift

Awesome Lists containing this project

README

          

Claude Readability Hook



✂️ HTML ➜ 📜 Text – tuned for AI prompts & token thrift






---

## 👩‍💻 TL;DR

| | What it does | Why you care |
|---|---|---|
| 🧹 **Trim the fluff** | Strips ads, nav & code fences | ⬇️ 40‑70 % token cut |
| 🕸️ **Any website** | Handles JS‑heavy SPA via headless Chromium | No "blank page" failures |
| 🧠 **Self‑tuning** | Scores every extraction & auto‑switches engine | Always picks the best text |
| ⚡ **Forum‑optimized** | Direct API integration for Reddit/StackOverflow | 2‑3× better content capture |
| 🔐 **Safe by default** | SSRF guard + DNS re‑resolve | Drop‑in for prod |

---

## 🏃‍♂️ Quick Start

```bash
git clone https://github.com/you/claude-readability-hook
cd claude-readability-hook
docker compose up -d # start gateway + extractor + renderer
curl -XPOST :7777/extract -d '{"url":"https://example.com"}' | jq '.text | length'
```

---

## 🏗️ Architecture (60‑sec view)

```mermaid
graph TD
Claude[Claude Hook] --> A
subgraph "Gateway"
A[SSRF Guard] --> B{Needs SSR?}
B -->|No| C[Trafilatura]
B -->|Yes| R[Playwright] --> C
C -->|Low score| D[Readability.js]
C --> Result[Result]
D --> Result
end
Result --> Claude
```

---

## 🚀 Feature Highlights

* **Smart engine switch** – Trafilatura ➜ Readability whenever score < 50
* **AMP / print rewrite** – auto‑fetches lightweight HTML variants
* **24 h LRU cache** – hit‑ratio metric exposed via Prometheus
* **OpenTelemetry hooks** – trace every extract / render call

### 🎯 Special Site Support

Optimized extraction for developer-focused platforms:

* **Stack Overflow** – Official API integration fetches question + top 5 answers (vote-sorted)
* **Reddit** – JSON endpoint captures post + top 20 comments + replies
* **Auto-fallback** – Falls back to standard pipeline if API fails

**Results**: 2-3× better content capture vs. generic HTML parsing

---

## 📋 REST API

| Verb | Path | Description |
|------|------|-------------|
| `POST` | `/extract` | Return `{title,text,engine,score,cached}` |
| `GET` | `/health` | Dependency & self check |
| `GET` | `/metrics` | Prometheus exposition |

Example request

```bash
curl -XPOST :7777/extract \
-H 'Content-Type: application/json' \
-d '{"url":"https://news.ycombinator.com/item?id=39237223"}'
```

---

## 📈 Key Metrics

```promql
# success rate per engine
rate(gateway_extract_total{success="true"}[5m]) by (engine)

# SSR usage %
sum(rate(gateway_extract_total{ssr="true"}[5m]))
/ sum(rate(gateway_extract_total[5m]))

# cache hit ratio
sum(rate(gateway_cache_total{op="hit"}[5m]))
/ sum(rate(gateway_cache_total{op=~"hit|miss"}[5m]))
```

---

## 🛠️ Local Dev

```bash
# Start all services
docker compose up -d

# Development with hot-reload
cd apps/gateway && bun install && bun run dev # Gateway
cd apps/extractor && uv sync && uv run python server.py # Extractor
```

> Cache & rate‑limit are disabled when `NODE_ENV=test`.

---

## 🗺️ Roadmap

* [ ] Chunk‑level summarization for giant docs
* [ ] PDF / EPUB source support
* [ ] Optional GPT‑4 "refine" post‑processor

---

## 🙏 Acknowledgements

Powered by **Trafilatura**, **Mozilla Readability**, and **Playwright**.