{"id":50444336,"url":"https://github.com/dotcommander/defuddle","last_synced_at":"2026-05-31T20:32:10.891Z","repository":{"id":353198136,"uuid":"1201678641","full_name":"dotcommander/defuddle","owner":"dotcommander","description":"Go library and CLI for extracting web page content — articles, metadata, and clean text from any URL","archived":false,"fork":false,"pushed_at":"2026-05-29T21:19:54.000Z","size":1254,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-29T23:10:32.266Z","etag":null,"topics":["cli","content-extraction","defuddle","go","html-parser","markdown","web-scraping"],"latest_commit_sha":null,"homepage":"https://pkg.go.dev/github.com/dotcommander/defuddle","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dotcommander.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-05T02:15:47.000Z","updated_at":"2026-05-29T21:20:14.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/dotcommander/defuddle","commit_stats":null,"previous_names":["dotcommander/defuddle"],"tags_count":16,"template":false,"template_full_name":null,"purl":"pkg:github/dotcommander/defuddle","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dotcommander%2Fdefuddle","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dotcommander%2Fdefuddle/tags","release
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dotcommander%2Fdefuddle/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dotcommander%2Fdefuddle/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dotcommander","download_url":"https://codeload.github.com/dotcommander/defuddle/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dotcommander%2Fdefuddle/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33748607,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","content-extraction","defuddle","go","html-parser","markdown","web-scraping"],"created_at":"2026-05-31T20:32:06.032Z","updated_at":"2026-05-31T20:32:10.882Z","avatar_url":"https://github.com/dotcommander.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\u003ca href=\"https://github.com/dotcommander/defuddle/releases\"\u003e\u003cimg src=\"https://img.shields.io/github/v/release/dotcommander/defuddle\" alt=\"Release\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/dotcommander/defuddle/actions\"\u003e\u003cimg 
src=\"https://github.com/dotcommander/defuddle/workflows/Test/badge.svg\" alt=\"Tests\"\u003e\u003c/a\u003e\n\u003ca href=\"https://goreportcard.com/report/github.com/dotcommander/defuddle\"\u003e\u003cimg src=\"https://goreportcard.com/badge/github.com/dotcommander/defuddle\" alt=\"Go Report Card\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pkg.go.dev/github.com/dotcommander/defuddle\"\u003e\u003cimg src=\"https://pkg.go.dev/badge/github.com/dotcommander/defuddle.svg\" alt=\"Go Reference\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n## Introduction\n\nDefuddle Go is a port of the [Defuddle](https://github.com/kepano/defuddle) TypeScript library. It extracts clean, readable content from any web page — stripping away navigation, ads, sidebars, and other clutter so you're left with just the article.\n\nAvailable as both a **Go library** and a drop-in **CLI tool** compatible with the original [Defuddle CLI](https://github.com/kepano/defuddle-cli).\n\n## Installation\n\n### CLI\n\nDownload a pre-built binary from the [releases page](https://github.com/dotcommander/defuddle/releases), or install with Go:\n\n```bash\ngo install github.com/dotcommander/defuddle/cmd/defuddle@latest\n```\n\n### Library\n\n```bash\ngo get github.com/dotcommander/defuddle\n```\n\n\u003e Requires Go 1.26 or higher.\n\n## Quick Start\n\n### CLI\n\n```bash\ndefuddle parse https://example.com/article\n```\n\nAdd `--markdown` or `--json` for different output formats:\n\n```bash\ndefuddle parse https://example.com/article --markdown\ndefuddle parse https://example.com/article --json\n```\n\n### Library — fetch and parse a URL\n\n```go\nimport (\n    \"context\"\n    \"fmt\"\n    \"log\"\n\n    \"github.com/dotcommander/defuddle\"\n)\n\nresult, err := defuddle.ParseFromURL(context.Background(), \"https://example.com/article\", nil)\nif err != nil {\n    log.Fatal(err)\n}\nfmt.Println(result.Title)\nfmt.Println(result.Content) // clean HTML\n```\n\n### Library — parse HTML you already have\n\n```go\nresult, err := 
defuddle.ParseFromString(ctx, htmlString, \u0026defuddle.Options{\n    URL: \"https://example.com/article\", // enables relative URL resolution\n})\n```\n\n### Lower-level API\n\nWhen you need to reuse the parsed document or configure options before parsing, use the two-step form:\n\n```go\nd, err := defuddle.NewDefuddle(htmlString, \u0026defuddle.Options{\n    URL:      \"https://example.com/article\",\n    Markdown: true,\n})\nif err != nil {\n    log.Fatal(err)\n}\n\nresult, err := d.Parse(ctx)\nif err != nil {\n    log.Fatal(err)\n}\n\nfmt.Printf(\"Title:       %s\\n\", result.Title)\nfmt.Printf(\"Author:      %s\\n\", result.Author)\nfmt.Printf(\"Published:   %s\\n\", result.Published)\nfmt.Printf(\"Language:    %s\\n\", result.Language)\nfmt.Printf(\"Word Count:  %d\\n\", result.WordCount)\nfmt.Printf(\"Content:     %s\\n\", result.Content) // Markdown when Markdown: true\n```\n\n## Extracting Content\n\n### From a URL\n\n`ParseFromURL` handles HTTP fetching, encoding detection, and parsing in one call:\n\n```go\nresult, err := defuddle.ParseFromURL(ctx, \"https://example.com/article\", \u0026defuddle.Options{\n    Markdown: true,\n})\n```\n\n### Multiple URLs (concurrent)\n\n```go\nurls := []string{\n    \"https://example.com/article-1\",\n    \"https://example.com/article-2\",\n}\n\nresults := defuddle.ParseFromURLs(ctx, urls, \u0026defuddle.Options{\n    MaxConcurrency: 10,\n    Markdown:       true,\n})\n\nfor _, r := range results {\n    if r.Err != nil {\n        log.Printf(\"failed %s: %v\", r.URL, r.Err)\n        continue\n    }\n    fmt.Printf(\"%s (%d words)\\n\", r.Result.Title, r.Result.WordCount)\n}\n```\n\n### Markdown Output\n\nSet `Markdown: true` to receive the extracted content as Markdown:\n\n```go\nresult, err := defuddle.ParseFromURL(ctx, url, \u0026defuddle.Options{Markdown: true})\nfmt.Println(result.Content) // Markdown\n```\n\nTo receive both HTML and Markdown in the same result:\n\n```go\nresult, err := defuddle.ParseFromURL(ctx, url, 
\u0026defuddle.Options{SeparateMarkdown: true})\nfmt.Println(result.Content)          // HTML\nfmt.Println(*result.ContentMarkdown) // Markdown\n```\n\n## Site-Specific Extractors\n\nDefuddle automatically detects popular platforms and applies specialized extraction logic. No configuration needed — if the URL matches, the right extractor activates.\n\n**Conversation**\n\n| Platform | Domains | Content Type |\n|----------|---------|-------------|\n| ChatGPT | `chatgpt.com` | Conversations with role-separated messages |\n| Claude | `claude.ai` | Conversations with human/assistant turns |\n| Grok | `grok.com`, `grok.x.ai`, `x.ai` | xAI conversations |\n| Gemini | `gemini.google.com` | Google AI conversations |\n\n**News**\n\n| Platform | Domains | Content Type |\n|----------|---------|-------------|\n| Substack | `substack.com` | Newsletter articles |\n| Medium | `medium.com` | Articles with publication metadata |\n| NYTimes | `nytimes.com` | News articles |\n| LWN | `lwn.net` | Linux Weekly News articles |\n\n**Social**\n\n| Platform | Domains | Content Type |\n|----------|---------|-------------|\n| X / Twitter (article) | `x.com`, `twitter.com` | Long-form articles (Draft.js) |\n| Twitter (legacy) | `x.com`, `twitter.com` | Tweets and threads |\n| Bluesky | `bsky.app` | Posts and threads |\n| Threads | `threads.com`, `threads.net` | Posts and threads |\n| LinkedIn | `linkedin.com` | Posts and articles |\n| X oEmbed | `publish.twitter.com`, `publish.x.com` | Embedded tweet markup |\n\n**Tech**\n\n| Platform | Domains | Content Type |\n|----------|---------|-------------|\n| YouTube | `youtube.com`, `youtu.be` | Video metadata and descriptions |\n| Reddit | `reddit.com`, `old.reddit.com`, `new.reddit.com` | Posts with comment trees |\n| Hacker News | `news.ycombinator.com` | Posts and threaded comment discussions |\n| GitHub | `github.com` | Issues and pull requests with comments |\n| Wikipedia | `*.wikipedia.org` | Article body with section structure |\n| C2 Wiki | 
`c2.com` | Wiki pages |\n| LeetCode | `leetcode.com` | Problem statements |\n\n**Catchall (DOM-signature — matches any host)**\n\n| Platform | Content Type |\n|----------|-------------|\n| Discourse | Forum topics and reply threads |\n| Mastodon | Posts and threads |\n\n23 extractors total: 4 conversation, 4 news, 6 social, 7 tech, 2 catchall.\n\n### Custom Extractors\n\nImplement the `BaseExtractor` interface to add support for any site.\n\nThree things to know before you write one:\n\n1. Registration order matters — the first matching extractor wins.\n2. `CanExtract()` runs before fallback content scoring. Return `false` to fall through to the generic pipeline.\n3. Setting `Variables[\"title\"]` and `Variables[\"author\"]` overrides the values in `Result.Title` / `Result.Author`.\n\n```go\ntype RecipeExtractor struct {\n    *extractors.ExtractorBase\n}\n\nfunc NewRecipeExtractor(doc *goquery.Document, url string, schema any) extractors.BaseExtractor {\n    return \u0026RecipeExtractor{ExtractorBase: extractors.NewExtractorBase(doc, url, schema)}\n}\n\nfunc (e *RecipeExtractor) Name() string { return \"RecipeExtractor\" }\n\n// CanExtract returns true only when the page has a recipe card — not every page on the host.\nfunc (e *RecipeExtractor) CanExtract() bool {\n    return e.GetDocument().Find(\"article.recipe-card\").Length() \u003e 0\n}\n\nfunc (e *RecipeExtractor) Extract() *extractors.ExtractorResult {\n    doc := e.GetDocument()\n\n    // ContentHTML is what becomes Result.Content.\n    content, _ := doc.Find(\"article.recipe-card\").Html()\n\n    title := strings.TrimSpace(doc.Find(\"h1.recipe-title\").Text())\n    author := strings.TrimSpace(doc.Find(\".recipe-author\").Text())\n\n    return \u0026extractors.ExtractorResult{\n        ContentHTML: content,\n        Variables: map[string]string{\n            \"title\":  title,\n            \"author\": author,\n            \"site\":   \"Recipe Site\",\n        },\n    }\n}\n```\n\nRegister it before parsing 
— typically in `init()` or application startup:\n\n```go\nextractors.Register(extractors.ExtractorMapping{\n    Patterns:  []any{\"recipes.example.com\"},\n    Extractor: NewRecipeExtractor,\n})\n```\n\n## Configuration\n\n### Options\n\nAll options have sensible defaults. Pass `nil` for zero-config extraction.\n\n```go\nopts := \u0026defuddle.Options{\n    // Output\n    Markdown:         false, // Return content as Markdown\n    SeparateMarkdown: false, // Return both HTML and Markdown\n\n    // Content selection\n    ContentSelector:  \"\",    // CSS selector override for main content\n    URL:              \"\",    // Source URL (used for link resolution and domain detection)\n\n    // Removal controls — pointer bools default to true when nil.\n    // Use defuddle.PtrBool(false) to explicitly disable.\n    RemoveExactSelectors:   nil, // Remove known clutter (ads, nav, social buttons)\n    RemovePartialSelectors: nil, // Remove probable clutter (class/id pattern matching)\n    RemoveHiddenElements:   nil, // Remove display:none and hidden elements\n    RemoveContentPatterns:  nil, // Remove boilerplate (breadcrumbs, related posts, etc.)\n    RemoveLowScoring:       nil, // Remove low-scoring non-content blocks\n    RemoveImages:           false, // Strip all images from output\n\n    // Element processing\n    ProcessCode:      false, // Normalize code blocks with language detection\n    ProcessImages:    false, // Optimize images (lazy-load resolution, srcset)\n    ProcessHeadings:  false, // Clean heading hierarchy\n    ProcessMath:      false, // Normalize MathJax/KaTeX formulas\n    ProcessFootnotes: false, // Standardize footnote format\n    ProcessRoles:     false, // Convert ARIA roles to semantic HTML\n\n    // HTTP (for ParseFromURL / ParseFromURLs)\n    Client:         nil, // Custom *requests.Client; default uses 30s timeout\n    MaxConcurrency: 5,   // Parallel limit for ParseFromURLs\n    Debug:          false,\n}\n```\n\n### Content 
Selector\n\nOverride automatic content detection with a CSS selector:\n\n```go\nresult, err := defuddle.ParseFromURL(ctx, url, \u0026defuddle.Options{\n    ContentSelector: \"article.post-body\",\n})\n```\n\n## The Extraction Pipeline\n\nDefuddle processes content through a multi-stage pipeline:\n\n```\nHTML Input\n |\n v\n1. Schema.org         -- Extract JSON-LD structured data\n2. Site Detection     -- Match URL to specialized extractor\n3. Shadow DOM         -- Flatten shadow roots and resolve React SSR\n4. Selector Removal   -- Strip known clutter by CSS selector\n5. Content Scoring    -- Score nodes and identify main content\n6. Content Patterns   -- Remove boilerplate (breadcrumbs, related posts, newsletters)\n7. Standardization    -- Normalize headings, footnotes, code blocks, images, math\n8. Markdown           -- Convert to Markdown (if requested)\n |\n v\nResult\n```\n\nThe pipeline includes an automatic retry cascade: if initial extraction yields fewer than 50 words, Defuddle progressively relaxes removal filters to recover content from heavily-decorated pages.\n\n## The Result Object\n\n| Field | Type | Description |\n|-------|------|-------------|\n| `Title` | `string` | Article title |\n| `Author` | `string` | Article author |\n| `Description` | `string` | Article description or summary |\n| `Domain` | `string` | Website domain |\n| `Favicon` | `string` | Website favicon URL |\n| `Image` | `string` | Main article image URL |\n| `Language` | `string` | BCP 47 language tag (e.g. 
`en`, `pt-BR`) |\n| `Published` | `string` | Publication date |\n| `Site` | `string` | Website name |\n| `Content` | `string` | Cleaned HTML (or Markdown if enabled) |\n| `ContentMarkdown` | `*string` | Markdown version (with `SeparateMarkdown`) |\n| `WordCount` | `int` | Word count of extracted content |\n| `ParseTime` | `int64` | Parse duration in milliseconds |\n| `SchemaOrgData` | `any` | Schema.org structured data |\n| `Variables` | `map[string]string` | Extractor-specific variables |\n| `MetaTags` | `[]MetaTag` | Document meta tags |\n| `ExtractorType` | `*string` | Which extractor was used |\n| `DebugInfo` | `*debug.Info` | Debug processing steps (with `Debug`) |\n\n## CLI Usage\n\nThe `defuddle` command provides a fast interface for content extraction, fully compatible with the original [TypeScript CLI](https://github.com/kepano/defuddle-cli).\n\n### Extracting Content\n\n```bash\n# From a URL\ndefuddle parse https://example.com/article\n\n# From a local file\ndefuddle parse article.html\n\n# From stdin (pipe HTML in)\ncurl -s https://example.com/article | defuddle parse\n\n# As Markdown\ndefuddle parse https://example.com/article --markdown\n\n# As JSON with all metadata\ndefuddle parse https://example.com/article --json\n\n# Extract a single field\ndefuddle parse https://example.com/article --property title\n```\n\n### Batch Processing\n\nRead one URL per line, output one JSON object per line (JSONL):\n\n```bash\ndefuddle batch \u003c urls.txt \u003e articles.jsonl\n\n# From a file, with markdown, 10 parallel fetches\ndefuddle batch --input urls.txt --markdown --concurrency 10 \u003e articles.jsonl\n\n# Bound total batch duration; --continue-on-error emits per-line error objects\ndefuddle batch --input urls.txt --timeout 2m --continue-on-error \u003e articles.jsonl\n```\n\n### Saving Output\n\n```bash\ndefuddle parse https://example.com/article --markdown --output article.md\n```\n\n### Authentication and Proxies\n\n```bash\n# Custom headers\ndefuddle 
parse https://example.com --header \"Authorization: Bearer token123\"\n\n# Through a proxy\ndefuddle parse https://example.com --proxy http://localhost:8080\n\n# Custom timeout\ndefuddle parse https://slow-site.com --timeout 120s\n```\n\n### All CLI Options\n\n| Option | Short | Description |\n|--------|-------|-------------|\n| `--output` | `-o` | Output file path (default: stdout) |\n| `--markdown` | `-m` | Convert content to Markdown |\n| `--json` | `-j` | Output as JSON with metadata |\n| `--property` | `-p` | Extract a specific property |\n| `--header` | `-H` | Custom header (repeatable) |\n| `--proxy` | | Proxy URL |\n| `--user-agent` | | Custom user agent |\n| `--timeout` | | Request timeout (default: 30s) |\n| `--content-selector` | | CSS selector for content root |\n| `--no-clutter-removal` | | Disable all clutter removal heuristics |\n| `--remove-images` | | Strip images from output |\n| `--debug` | | Enable debug output |\n\n## Limitations\n\nDefuddle works best on static, article-style HTML. Several categories of pages will produce poor or empty results:\n\n**JS-rendered pages.** If a site uses client-side rendering (React, Vue, Svelte without SSR), defuddle receives the shell HTML before JavaScript runs — usually near-empty. Pre-render with a headless browser and pipe the resulting HTML in: `playwright ... | defuddle parse -`.\n\n**Paywalled and login-gated content.** Defuddle fetches exactly what an unauthenticated request returns. For login-gated content, pass an authenticated `*requests.Client` with session cookies. For hard paywalls, you get the paywall HTML.\n\n**PDFs and binary content.** Any response whose `Content-Type` is not HTML, XML, or text returns `ErrNotHTML`. Sniff the content type before calling defuddle.\n\n**Large responses.** Responses over 5 MB return `ErrTooLarge`. This is intentional — defuddle is an article extractor, not a bulk downloader. 
The CLI applies the same 5 MiB cap to stdin and local HTML files, so all input paths share one ceiling.\n\n**CAPTCHA and bot-detection pages.** Defuddle returns whatever HTML the server sent. It does not solve CAPTCHAs or bypass bot-detection.\n\n**Non-article pages.** Content scoring is heuristic. Forum threads, comment sections, and listing pages without a site-specific extractor may return partial or noisy results.\n\nSee [docs/limitations.md](docs/limitations.md) for detailed workarounds.\n\n## Examples\n\nThe [`examples/`](./examples/) directory contains ready-to-run programs:\n\n```bash\ngo run ./examples/basic              # Simple extraction\ngo run ./examples/markdown           # HTML to Markdown\ngo run ./examples/advanced           # Full option usage\ngo run ./examples/extractors         # Site-specific extraction\ngo run ./examples/custom_extractor   # Building a custom extractor\n```\n\n## Testing\n\n```bash\n# Run all tests\ngo test ./...\n\n# With race detection\ngo test -race ./...\n\n# Benchmarks\ngo test -bench=. -benchmem ./...\n```\n\n## Credits\n\n- [Defuddle](https://github.com/kepano/defuddle) by Steph Ango ([@kepano](https://github.com/kepano)) — the original TypeScript library\n- [Defuddle CLI](https://github.com/kepano/defuddle-cli) by Steph Ango — the original CLI tool\n- Inspired by Mozilla's [Readability](https://github.com/mozilla/readability) algorithm\n\n## License\n\nDefuddle Go is open-source software licensed under the [MIT license](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdotcommander%2Fdefuddle","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdotcommander%2Fdefuddle","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdotcommander%2Fdefuddle/lists"}