https://github.com/vinaes/md-succ-ai

URL to Markdown API — md.succ.ai
https://github.com/vinaes/md-succ-ai
ai-agents html-to-markdown markdown mcp playwright rag readability url-to-markdown web-scraping youtube-transcript
Last synced: 4 months ago
JSON representation
URL to Markdown API — md.succ.ai
Host: GitHub
URL: https://github.com/vinaes/md-succ-ai
Owner: vinaes
License: other
Created: 2026-02-13T21:36:30.000Z (5 months ago)
Default Branch: master
Last Pushed: 2026-02-25T18:50:57.000Z (5 months ago)
Last Synced: 2026-02-25T18:56:38.836Z (5 months ago)
Topics: ai-agents, html-to-markdown, markdown, mcp, playwright, rag, readability, url-to-markdown, web-scraping, youtube-transcript
Language: JavaScript
Homepage: https://md.succ.ai
Size: 359 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          


  

  



  Clean Markdown from any URL. Fast, accurate, agent-friendly.





  

  

  

  





  Quick Start •

  Features •

  API •

  How It Works •

  Self-Hosting •

  Monitoring •

  Security



---

> Convert any webpage, document, feed, or video to clean, readable Markdown. Built for AI agents, MCP tools, and RAG pipelines. Powered by [succ](https://succ.ai).

## Quick Start

```bash

# Markdown output

curl https://md.succ.ai/https://example.com

# JSON output

curl -H "Accept: application/json" https://md.succ.ai/https://example.com

# Documents (PDF, DOCX, XLSX, CSV)

curl https://md.succ.ai/https://example.com/report.pdf

# YouTube transcript

curl https://md.succ.ai/https://youtube.com/watch?v=dQw4w9WgXcQ

# RSS/Atom feed

curl https://md.succ.ai/https://blog.example.com/feed.xml

# LLM-optimized (30-50% fewer tokens)

curl "https://md.succ.ai/https://example.com?mode=fit"

# Batch convert

curl -X POST https://md.succ.ai/batch \

  -H "Content-Type: application/json" \

  -d '{"urls": ["https://example.com", "https://httpbin.org/html"]}'

```

> **That's it.** No API key, no signup, no SDK. Just prepend `https://md.succ.ai/` to any URL.

## Features

| Feature | Description |

|---------|-------------|

| **9-Pass Extraction** | Readability, Defuddle, Article Extractor, CSS selectors, Schema.org, Open Graph, text density, cleaned body — quality-checked at each step |

| **7 Formats** | HTML, PDF, DOCX, XLSX, CSV, YouTube transcripts, RSS/Atom feeds |

| **4-Tier Pipeline** | HTTP fetch → headless browser → LLM extraction → BaaS anti-bot bypass |

| **Batch Conversion** | Convert up to 50 URLs in one request with concurrent processing |

| **Async + Webhooks** | Submit long conversions and get results via polling or webhook callback |

| **Structured Extraction** | `/extract` — JSON schema in, structured data out (LLM-powered) |

| **Quality Scoring** | Each conversion scored 0-1 with A-F grade |

| **Fit Mode** | LLM-optimized output — pruned boilerplate, 30-50% fewer tokens |

| **Citation Links** | Numbered references with footer instead of inline links |

| **Redis Cache** | Two-layer caching (Redis + in-memory fallback), SHA-256 hashed keys |

| **Rate Limiting** | Per-IP via Redis atomic pipeline, CF-Connecting-IP aware |

| **Prometheus + Grafana** | 11 custom metrics, pre-provisioned dashboard, auto-scraped |

| **Structured Logging** | JSON logs via Pino, per-request correlation IDs |

| **OpenAPI Docs** | Interactive API reference at `/docs` (Scalar UI) |

Supported formats

| Format | Content-Type | Method |

|--------|-------------|--------|

| HTML | `text/html` | 9-pass extraction + Turndown |

| PDF | `application/pdf` | Text extraction via unpdf |

| DOCX | `application/vnd...wordprocessingml` | mammoth → HTML → Turndown |

| XLSX/XLS | `application/vnd...spreadsheetml` | SheetJS → Markdown tables |

| CSV | `text/csv` | SheetJS → Markdown table |

| YouTube | `youtube.com`, `youtu.be` | Transcript extraction via innertube API |

| RSS/Atom | `application/rss+xml`, `application/atom+xml` | Feed parsing with item metadata |

Documents are also detected by URL extension (`.pdf`, `.docx`, `.xlsx`, `.csv`) when `Content-Type` is `application/octet-stream`.

## API

**Base URL:** `https://md.succ.ai`

**Docs:** [`/docs`](https://md.succ.ai/docs) (interactive Scalar UI) | [`/openapi.json`](https://md.succ.ai/openapi.json) (OpenAPI 3.1 spec)

### Endpoints

| Method | Path | Description |

|--------|------|-------------|

| `GET` | `/{url}` | Convert URL to Markdown |

| `GET` | `/?url={url}` | Same, query param format |

| `POST` | `/extract` | Structured data extraction via LLM (JSON schema) |

| `POST` | `/batch` | Batch convert up to 50 URLs |

| `POST` | `/async` | Async conversion with optional webhook |

| `GET` | `/job/:id` | Poll async job status |

| `GET` | `/health` | Health check (includes Redis status) |

| `GET` | `/docs` | Interactive API reference |

| `GET` | `/openapi.json` | OpenAPI 3.1 spec |

### Query Parameters

| Parameter | Values | Description |

|-----------|--------|-------------|

| `url` | URL | Target URL (alternative to path format) |

| `links` | `citations` | Convert inline links to numbered references with footer |

| `mode` | `fit` | Prune boilerplate sections for smaller LLM context |

| `max_tokens` | number | Truncate output to N tokens (use with `mode=fit`) |

### Response Headers

| Header | Description |

|--------|-------------|

| `x-request-id` | Unique request correlation ID |

| `x-markdown-tokens` | Token count (cl100k_base) |

| `x-conversion-tier` | `fetch`, `browser`, `baas:scrapfly`, `llm`, `youtube`, `feed`, `document:pdf`, etc. |

| `x-conversion-time` | Total conversion time in ms |

| `x-extraction-method` | Extraction pass used (`readability`, `defuddle`, `browser-raw`, etc.) |

| `x-quality-score` | Quality score 0-1 |

| `x-quality-grade` | Quality grade A-F |

| `x-readability` | `true` if Readability extracted clean content |

| `x-cache` | `hit` or `miss` (Redis-backed) |

| `x-ratelimit-limit` | Max requests per window |

| `x-ratelimit-remaining` | Requests remaining in current window |

| `x-ratelimit-reset` | Window reset timestamp (Unix seconds) |

### Rate Limits

| Endpoint | Limit |

|----------|-------|

| `GET /*` | 60 req/min per IP |

| `POST /extract` | 10 req/min per IP |

| `POST /batch` | 5 req/min per IP |

| `POST /async` | 10 req/min per IP |

JSON response format

```json

{

  "title": "Example Domain",

  "url": "https://example.com",

  "content": "# Example Domain\n\nThis domain is for use in...",

  "fit_markdown": "# Example Domain\n\nThis domain is...",

  "fit_tokens": 20,

  "excerpt": "This domain is for use in documentation examples...",

  "tokens": 33,

  "tier": "fetch",

  "readability": true,

  "method": "readability",

  "quality": { "score": 0.85, "grade": "A" },

  "time_ms": 245

}

```

Batch conversion

```bash

curl -X POST https://md.succ.ai/batch \

  -H "Content-Type: application/json" \

  -d '{

    "urls": [

      "https://example.com",

      "https://httpbin.org/html",

      "https://github.com"

    ],

    "options": {

      "mode": "fit",

      "links": "citations"

    }

  }'

```

Returns an array of results. Up to 50 URLs, processed with 10-way concurrency. Per-URL 60s timeout.

Async conversion with webhook

```bash

# Submit async job

curl -X POST https://md.succ.ai/async \

  -H "Content-Type: application/json" \

  -d '{

    "url": "https://example.com",

    "callback_url": "https://your-server.com/webhook"

  }'

# → {"job_id": "abc12345", "status": "processing", "poll_url": "/job/abc12345"}

# Poll for result

curl https://md.succ.ai/job/abc12345

```

Webhook delivers JSON `POST` to `callback_url` on completion/failure. HTTPS required, 3 retries with exponential backoff. Private/internal addresses blocked (SSRF-safe).

Structured data extraction

```bash

curl -X POST https://md.succ.ai/extract \

  -H "Content-Type: application/json" \

  -d '{

    "url": "https://github.com/trending",

    "schema": {

      "type": "object",

      "properties": {

        "repositories": {

          "type": "array",

          "items": {

            "type": "object",

            "properties": {

              "name": { "type": "string" },

              "author": { "type": "string" },

              "description": { "type": "string" },

              "stars_today": { "type": "number" }

            }

          }

        }

      }

    }

  }'

```

Returns structured JSON matching the provided schema, extracted by LLM. Automatically retries with headless browser for SPA/JS-heavy sites when initial extraction returns empty data.

More examples

```bash

# Citation-style links (numbered references)

curl "https://md.succ.ai/?url=https://en.wikipedia.org/wiki/Markdown&links=citations"

# LLM-optimized output (pruned boilerplate)

curl "https://md.succ.ai/?url=https://htmx.org/docs/&mode=fit"

# Token limit

curl "https://md.succ.ai/?url=https://example.com&mode=fit&max_tokens=4000"

# RSS feed as markdown

curl https://md.succ.ai/https://hnrss.org/frontpage

```

## How It Works

4-tier conversion pipeline — each tier only activates if the previous one produced insufficient quality:

```

URL ──→ Cache hit? ──→ Return cached result (Redis, dynamic TTL)

         │

         ├─ YouTube? ──→ Transcript extraction (innertube API)

         │

         ├─ RSS/Atom feed? ──→ Feed parsing with item metadata

         │

         ├─ Document? (PDF, DOCX, XLSX, CSV)

         │   └─→ Document converter → Markdown

         │

         ├─ Tier 1: HTTP fetch + 9-pass extraction

         │   └─→ Readability → Defuddle → Article Extractor → CSS selectors

         │       → Schema.org → Open Graph → Text density → Body fallback

         │

         ├─ Tier 2: Camoufox headless browser (SPA/JS-heavy)

         │   └─→ Same 9-pass pipeline on rendered DOM

         │

         ├─ Tier 2.5: LLM extraction (quality < B)

         │   └─→ nano-gpt API → content extraction

         │

         └─ Tier 3: BaaS anti-bot bypass (CF Turnstile / quality < D)

             └─→ ScrapFly → ZenRows → ScrapingBee (rotation)

             └─→ Same 9-pass pipeline on returned HTML

```

Cloudflare challenge pages are detected automatically. When fetch gets a CF challenge, browser is skipped (saves IP), and BaaS providers handle the bypass.

When both LLM and BaaS are needed, they race in parallel — saves 30-45s vs sequential.

Caching

Two-layer cache system backed by Redis 7:

| Content | TTL | Key |

|---------|-----|-----|

| HTML pages | 5 min | `cache:{sha256(url+options)}` |

| Browser renders | 10 min | Same |

| YouTube transcripts | 1 hr | Same |

| Documents | 2 hr | Same |

| /extract results | 1 hr | `extract:{sha256(url)}:{sha256(schema)}` |

Cache keys use SHA-256 hashes to prevent poisoning via long/malicious URLs. Tracking parameters (UTM, fbclid, gclid, etc.) are stripped before hashing. Falls back to in-memory Map when Redis is unavailable.

Stack

| Component | Role |

|-----------|------|

| [Hono](https://hono.dev) | HTTP framework |

| [Pino](https://getpino.io) | Structured JSON logging |

| [Mozilla Readability](https://github.com/mozilla/readability) | Primary content extraction |

| [Defuddle](https://github.com/nicedoc/defuddle) | Obsidian team's content extraction |

| [@extractus/article-extractor](https://github.com/nicedoc/extractus) | Alternative extraction heuristics |

| [Turndown](https://github.com/mixmark-io/turndown) | HTML → Markdown conversion |

| [linkedom](https://github.com/WebReflection/linkedom) | Lightweight DOM parser |

| [Camoufox](https://github.com/daijro/camoufox) | Firefox fork with C++ anti-detection |

| [Redis](https://redis.io) + [ioredis](https://github.com/redis/ioredis) | Cache, rate limiting, job storage |

| [prom-client](https://github.com/siimon/prom-client) | Prometheus metrics |

| [unpdf](https://github.com/unjs/unpdf) | PDF text extraction |

| [mammoth](https://github.com/mwilliamson/mammoth.js) | DOCX → HTML conversion |

| [SheetJS](https://sheetjs.com) | XLSX/XLS/CSV parsing |

| [NanoGPT](https://nano-gpt.com) | LLM API for Tier 2.5 and /extract |

| [Ajv](https://ajv.js.org) | JSON Schema validation for /extract |

| [gpt-tokenizer](https://github.com/niieani/gpt-tokenizer) | cl100k_base token counting |

| [nanoid](https://github.com/ai/nanoid) | Request/job IDs |

## Self-Hosting

### Docker (recommended)

```bash

git clone https://github.com/vinaes/md-succ-ai.git

cd md-succ-ai

cp .env.example .env  # edit with your API keys and passwords

docker compose up -d

```

This starts four containers:

| Container | Purpose | Port |

|-----------|---------|------|

| **md-succ-ai** | API server with Camoufox browser fallback | 127.0.0.1:3100 |

| **md-succ-redis** | Redis 7 (cache, rate limiting, jobs) | internal |

| **md-succ-prometheus** | Prometheus metrics collector | internal |

| **md-succ-grafana** | Grafana dashboards | 127.0.0.1:3200 |

The API is available at `http://localhost:3100`.

### Local (without Docker)

```bash

npm install

npx camoufox-js fetch

npm start

```

> Redis is optional for local development. Without Redis, caching and rate limiting fall back to in-memory Map, and async jobs are unavailable.

Environment variables

| Variable | Default | Description |

|----------|---------|-------------|

| `PORT` | `3000` | Server port |

| `ENABLE_BROWSER` | `true` | Enable Camoufox browser fallback |

| `NODE_ENV` | `production` | Node environment |

| `REDIS_URL` | `redis://redis:6379` | Redis connection URL (with password in Docker) |

| `REDIS_PASSWORD` | — | Redis authentication password (required in Docker) |

| `GRAFANA_PASSWORD` | — | Grafana admin password (required in Docker) |

| `NANOGPT_API_KEY` | — | nano-gpt API key for LLM tier and /extract |

| `NANOGPT_MODEL` | `meta-llama/llama-3.3-70b-instruct` | LLM model for content extraction (Tier 2.5) |

| `NANOGPT_EXTRACT_MODEL` | same as `NANOGPT_MODEL` | LLM model for `/extract` endpoint |

| `SCRAPFLY_API_KEY` | — | [ScrapFly](https://scrapfly.io) anti-bot bypass (1000 credits/mo free) |

| `ZENROWS_API_KEY` | — | [ZenRows](https://zenrows.com) anti-bot bypass (1000 credits trial) |

| `SCRAPINGBEE_API_KEY` | — | [ScrapingBee](https://scrapingbee.com) anti-bot bypass (1000 credits one-time) |

BaaS providers are optional. When configured, they activate as Tier 3 for Cloudflare-protected sites. Providers are tried in order; if one hits rate limits, the next is used automatically.

Nginx reverse proxy

An example nginx config is in `nginx/md.succ.ai.conf`:

- Rate limiting: 10 req/s per IP, burst 20

- Connection limit: 10 concurrent per IP

- Proxy timeouts: 60s read (for browser renders)

- POST endpoints with appropriate body limits

- HSTS, security headers (nosniff, X-Frame-Options, Referrer-Policy)

- `/metrics` blocked (403)

- `/grafana/` proxied to Grafana container with WebSocket support

## Monitoring

The project ships with a full Prometheus + Grafana stack:

**Prometheus** scrapes the `/metrics` endpoint every 10s (internal Docker network only).

**Grafana** is pre-provisioned with a 15-panel dashboard:

- Request rate, response time percentiles (p50/p95/p99)

- Conversion tier distribution, cache hit rate

- Quality score distribution, tokens per conversion

- Rate limit rejections, async job status

- Browser pool utilization, webhook deliveries

- Node.js process metrics (CPU, memory, event loop lag)

Access Grafana at `https://your-domain/grafana/` (proxied via nginx).

### Custom Metrics

| Metric | Type | Labels |

|--------|------|--------|

| `http_requests_total` | Counter | method, route, status |

| `http_request_duration_seconds` | Histogram | method, route, status |

| `conversion_tier_total` | Counter | tier |

| `conversion_tokens` | Histogram | tier |

| `conversion_quality` | Histogram | tier |

| `cache_hits_total` | Counter | source |

| `cache_misses_total` | Counter | — |

| `rate_limit_rejections_total` | Counter | route |

| `browser_pool_active` | Gauge | — |

| `async_jobs_total` | Counter | status |

| `webhook_deliveries_total` | Counter | status |

Plus Node.js default metrics (CPU, memory, event loop, GC) via `prom-client`.

## Security

- **SSRF protection** — URL validation, DNS resolution checks (IPv4 + IPv6), redirect validation per hop, Camoufox route blocking, webhook callback DNS validation

- **Private IP blocking** — 127/8, 10/8, 172.16/12, 192.168/16, 169.254/16, CGNAT, cloud metadata hostnames, hex/octal IP formats, IPv6 mapped addresses

- **Input limits** — 5MB response size, 5 max redirects, content-type validation, body size limits per endpoint

- **Output sanitization** — Error messages stripped of internal paths/stack traces, URLs sanitized in responses

- **Cache security** — SHA-256 hashed keys (no URL poisoning), tracking params stripped, Redis LRU eviction (128MB cap)

- **Redis authentication** — `--requirepass` with password from .env, authenticated connection URL

- **API key safety** — BaaS API keys only used in outbound requests, never logged or exposed in responses

- **LLM hardening** — Prompt injection protection (HTML sanitization, document delimiters, output validation), schema field whitelist, blocked schema keywords ($ref, $defs, etc.)

- **Rate limiting** — Per-IP via Redis INCR+EXPIRE (atomic pipeline), CF-Connecting-IP support, in-memory fallback

- **Security headers** — HSTS, X-Content-Type-Options, X-Frame-Options, Referrer-Policy, Permissions-Policy

- **CDN integrity** — Subresource Integrity (SRI) on third-party scripts

- **Container security** — Non-root user (`mduser`), `no-new-privileges`, pinned image versions

- **CF challenge detection** — Cloudflare challenge pages detected and handled without wasting browser/BaaS credits

## Architecture

```

                    ┌──────────────────┐

                    │   Cloudflare     │

                    │   (TLS + CDN)    │

                    └────────┬─────────┘

                             │

                    ┌────────▼─────────┐

                    │   nginx          │

                    │   (rate limit,   │

                    │    HSTS, proxy)  │

                    └────────┬─────────┘

                             │

         ┌───────────────────┼───────────────────┐

         │                   │                   │

┌────────▼─────────┐ ┌──────▼───────┐ ┌─────────▼────────┐

│  md-succ-ai      │ │  Prometheus  │ │  Grafana         │

│  (Node 22, Hono) │ │  (scrape     │ │  (dashboards,    │

│  Camoufox       │ │   /metrics)  │ │   alerting)      │

│  BaaS clients    │ └──────────────┘ └──────────────────┘

│  Pino logging    │

└────────┬─────────┘

         │

┌────────▼─────────┐

│  Redis 7         │

│  (cache, rate    │

│   limit, jobs)   │

└──────────────────┘

```

## License

[FSL-1.1-Apache-2.0](LICENSE) — Free for non-competitive use. Apache 2.0 after 2 years.

> **Disclaimer:** Not affiliated with [NanoGPT](https://nano-gpt.com). LLM features use the NanoGPT API for pay-per-prompt model access.

---

Part of the [succ](https://succ.ai) ecosystem.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/vinaes/md-succ-ai

Awesome Lists containing this project

README