{"id":26091488,"url":"https://github.com/kepano/defuddle","last_synced_at":"2026-04-21T22:05:44.600Z","repository":{"id":279872553,"uuid":"940076402","full_name":"kepano/defuddle","owner":"kepano","description":"Get the main content of any page as Markdown.","archived":false,"fork":false,"pushed_at":"2026-04-15T00:58:48.000Z","size":3653,"stargazers_count":6742,"open_issues_count":24,"forks_count":270,"subscribers_count":17,"default_branch":"main","last_synced_at":"2026-04-15T02:35:12.084Z","etag":null,"topics":["cli","defuddle","html","markdown","md"],"latest_commit_sha":null,"homepage":"https://defuddle.md","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kepano.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-02-27T15:20:42.000Z","updated_at":"2026-04-15T02:35:06.000Z","dependencies_parsed_at":"2025-02-28T07:39:48.052Z","dependency_job_id":"be45b653-bdbb-4f83-9c09-5bfb38b6037b","html_url":"https://github.com/kepano/defuddle","commit_stats":null,"previous_names":["kepano/defuddle"],"tags_count":38,"template":false,"template_full_name":null,"purl":"pkg:github/kepano/defuddle","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kepano%2Fdefuddle","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kepano%2Fdefuddle/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kepano%2Fdefuddle/releases","manifests_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/kepano%2Fdefuddle/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kepano","download_url":"https://codeload.github.com/kepano/defuddle/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kepano%2Fdefuddle/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31885013,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-16T11:36:10.202Z","status":"ssl_error","status_checked_at":"2026-04-16T11:36:09.652Z","response_time":69,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","defuddle","html","markdown","md"],"created_at":"2025-03-09T10:02:10.431Z","updated_at":"2026-04-21T22:05:44.595Z","avatar_url":"https://github.com/kepano.png","language":"TypeScript","readme":"\u003e de·​fud·dle /diˈfʌdl/ *transitive verb*  \n\u003e to remove unnecessary elements from a web page, and make it easily readable.\n\n**Beware! Defuddle is very much a work in progress!**\n\nDefuddle extracts the main content from web pages. It cleans up web pages by removing clutter like comments, sidebars, headers, footers, and other non-essential elements, leaving only the primary content.\n\n## Overview\n\nDefuddle takes a URL or HTML, finds the main content, and returns cleaned HTML or Markdown. 
Defuddle was created for the browser extension [Obsidian Web Clipper](https://github.com/obsidianmd/obsidian-clipper), but it is designed to run in any environment.\n\nDefuddle can be used as a replacement for [Mozilla Readability](https://github.com/mozilla/readability) with a few differences:\n\n- More forgiving, removes fewer uncertain elements.\n- Provides a consistent output for footnotes, math, code blocks, etc.\n- Uses a page's mobile styles to guess at unnecessary elements.\n- Extracts more metadata from the page, including schema.org data.\n\n## Usage\n\n### Browser\n\n```javascript\nimport Defuddle from 'defuddle';\n\n// Parse the current document\nconst defuddle = new Defuddle(document);\nconst result = defuddle.parse();\n\n// Access the content and metadata\nconsole.log(result.content);\nconsole.log(result.title);\nconsole.log(result.author);\n```\n\n### Node.js\n\n`defuddle/node` accepts a DOM `Document` from any implementation (JSDOM, linkedom, happy-dom, etc.).\n\n```javascript\nimport { parseHTML } from 'linkedom';\nimport { Defuddle } from 'defuddle/node';\n\nconst { document } = parseHTML(html);\nconst result = await Defuddle(document, 'https://example.com/article', {\n  markdown: true\n});\n\nconsole.log(result.content);\nconsole.log(result.title);\nconsole.log(result.author);\n```\n\nOr with JSDOM:\n\n```javascript\nimport { JSDOM } from 'jsdom';\nimport { Defuddle } from 'defuddle/node';\n\nconst dom = new JSDOM(html, { url: 'https://example.com/article' });\nconst result = await Defuddle(dom.window.document, 'https://example.com/article');\n```\n\n_Note: for `defuddle/node` to import properly, the module format in your `package.json` has to be set to `{ \"type\": \"module\" }`_\n\n### CLI\n\nDefuddle includes a command-line interface for parsing web pages directly from the terminal. 
You can run it with `npx` or [install it globally](#cli-installation).\n\n```bash\n# Parse a local HTML file\nnpx defuddle parse page.html\n\n# Parse a URL\nnpx defuddle parse https://example.com/article\n\n# Output as markdown\nnpx defuddle parse page.html --markdown\n\n# Output as JSON with metadata\nnpx defuddle parse page.html --json\n\n# Extract a specific property\nnpx defuddle parse page.html --property title\n\n# Save output to a file\nnpx defuddle parse page.html --output result.html\n\n# Enable debug mode\nnpx defuddle parse page.html --debug\n```\n\n#### CLI Options\n\n| Option | Alias | Description |\n|--------|-------|-------------|\n| `--output \u003cfile\u003e` | `-o` | Write output to a file instead of stdout |\n| `--markdown` | `-m` | Convert content to markdown format |\n| `--md` | | Alias for `--markdown` |\n| `--json` | `-j` | Output as JSON with metadata and content |\n| `--property \u003cname\u003e` | `-p` | Extract a specific property (e.g., title, description, domain) |\n| `--debug` | | Enable debug mode |\n| `--lang \u003ccode\u003e` | `-l` | Preferred language (BCP 47, e.g. 
`en`, `fr`, `ja`) |\n\n## Installation\n\n```bash\nnpm install defuddle\n```\n\nFor Node.js usage, install a DOM implementation:\n\n```bash\nnpm install linkedom\n```\n\nOr use JSDOM:\n\n```bash\nnpm install jsdom\n```\n\n### CLI installation\n\nTo use the `defuddle` command globally, install it with the `-g` flag:\n\n```bash\nnpm install -g defuddle\n```\n\nOr use `npx` to run the CLI without installing globally:\n\n```bash\nnpx defuddle parse https://example.com/article\n```\n\n## Response\n\nDefuddle returns an object with the following properties:\n\n| Property | Type | Description |\n|----------|------|-------------|\n| `author` | string | Author of the article |\n| `content` | string | Cleaned up string of the extracted content |\n| `description` | string | Description or summary of the article |\n| `domain` | string | Domain name of the website |\n| `favicon` | string | URL of the website's favicon |\n| `image` | string | URL of the article's main image |\n| `language` | string | Language of the page in [BCP 47](https://www.rfc-editor.org/info/bcp47) format (e.g. `en`, `en-US`) |\n| `metaTags` | object | Meta tags |\n| `parseTime` | number | Time taken to parse the page in milliseconds |\n| `published` | string | Publication date of the article |\n| `site` | string | Name of the website |\n| `schemaOrgData` | object | Raw schema.org data extracted from the page |\n| `title` | string | Title of the article |\n| `wordCount` | number | Total number of words in the extracted content |\n| `debug` | object | Debug info including content selector and removals (when `debug: true`) |\n\n## Bundles\n\nDefuddle is available in three different bundles:\n\n1. Core bundle (`defuddle`): The main bundle for browser usage. No dependencies.\n2. Full bundle (`defuddle/full`): Includes additional features for math equation parsing and Markdown conversion.\n3. Node.js bundle (`defuddle/node`): For Node.js environments. Accepts any DOM `Document` (e.g. 
from linkedom, JSDOM, or happy-dom). Includes full capabilities for math and Markdown conversion.\n\nThe core bundle is recommended for most use cases. It still handles math content, but doesn't include fallbacks for converting between MathML and LaTeX formats. The full bundle adds the ability to create reliable `\u003cmath\u003e` elements using `mathml-to-latex` and `temml` libraries.\n\n## Options\n\n| Option                   | Type    | Default | Description                                                               |\n| ------------------------ | ------- | ------- | ------------------------------------------------------------------------- |\n| `debug`                  | boolean | false   | Enable debug logging and return debug info in the response                |\n| `url`                    | string  |         | URL of the page being parsed                                              |\n| `markdown`               | boolean | false   | Convert `content` to Markdown                                             |\n| `separateMarkdown`       | boolean | false   | Keep `content` as HTML and return `contentMarkdown` as Markdown           |\n| `removeExactSelectors`   | boolean | true    | Remove elements matching exact selectors like ads, social buttons, etc.   |\n| `removePartialSelectors` | boolean | true    | Remove elements matching partial selectors like ads, social buttons, etc. |\n| `removeHiddenElements`   | boolean | true    | Remove elements hidden via CSS (display:none, visibility:hidden, etc.)    |\n| `removeLowScoring`       | boolean | true    | Remove non-content blocks by scoring (navigation, link lists, etc.)       |\n| `removeSmallImages`      | boolean | true    | Remove small images (icons, tracking pixels, etc.)                        |\n| `removeImages`           | boolean | false   | Remove images.                                                            
|\n| `standardize`            | boolean | true    | Standardize HTML (footnotes, headings, code blocks, etc.)                 |\n| `contentSelector`        | string  |         | CSS selector to use as the main content element, bypassing auto-detection |\n| `useAsync`               | boolean | true    | Allow async extractors to fetch from third-party APIs when no local content is available. |\n| `language`               | string  |         | Preferred language (BCP 47 tag, e.g. `en`, `fr`). Sets `Accept-Language` header and selects transcript language. |\n| `includeReplies`         | boolean \\| 'extractors' | 'extractors' | Include replies: `'extractors'` for site-specific extractors only, `true` for all, `false` for none. |\n\n## HTML standardization\n\nDefuddle attempts to standardize HTML elements to provide a consistent input for subsequent manipulation such as conversion to Markdown.\n\n### Headings\n\n- The first H1 or H2 heading is removed if it matches the title.\n- H1s are converted to H2s.\n- Anchor links in H1 to H6 elements are removed, leaving plain headings.\n\n### Code blocks\n\nCode blocks are standardized. 
If present, line numbers and syntax highlighting are removed, but the language is retained and added as a data attribute and class.\n\n```html\n\u003cpre\u003e\n  \u003ccode data-lang=\"js\" class=\"language-js\"\u003e\n    // code\n  \u003c/code\u003e\n\u003c/pre\u003e\n```\n\n### Footnotes\n\nInline references and footnotes are converted to a standard format:\n\n```html\nInline reference\u003csup id=\"fnref:1\"\u003e\u003ca href=\"#fn:1\"\u003e1\u003c/a\u003e\u003c/sup\u003e.\n\n\u003cdiv id=\"footnotes\"\u003e\n  \u003col\u003e\n    \u003cli class=\"footnote\" id=\"fn:1\"\u003e\n      \u003cp\u003e\n        Footnote content.\u0026nbsp;\u003ca href=\"#fnref:1\" class=\"footnote-backref\"\u003e↩\u003c/a\u003e\n      \u003c/p\u003e\n    \u003c/li\u003e\n  \u003c/ol\u003e\n\u003c/div\u003e\n```\n\n### Math\n\nMath elements, including MathJax and KaTeX, are converted to standard MathML:\n\n```html\n\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\" display=\"inline\" data-latex=\"a \\neq 0\"\u003e\n  \u003cmi\u003ea\u003c/mi\u003e\n  \u003cmo\u003e≠\u003c/mo\u003e\n  \u003cmn\u003e0\u003c/mn\u003e\n\u003c/math\u003e\n```\n\n### Callouts\n\nCallout and alert elements from various sources are standardized to blockquotes with a `data-callout` attribute. 
When converting to Markdown, these become [Obsidian-style callouts](https://help.obsidian.md/Editing+and+formatting/Callouts).\n\nSupported sources:\n- GitHub markdown alerts (`div.markdown-alert`)\n- Obsidian Publish callouts (`div.callout[data-callout]`)\n- Callout asides (`aside.callout-*`)\n- Bootstrap alerts (`div.alert.alert-*`)\n\nThe standardized HTML follows the [Obsidian Publish](https://help.obsidian.md/Editing+and+formatting/Callouts) format:\n\n```html\n\u003cdiv data-callout=\"info\" class=\"callout\"\u003e\n  \u003cdiv class=\"callout-title\"\u003e\n    \u003cdiv class=\"callout-title-inner\"\u003eInfo\u003c/div\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"callout-content\"\u003e\n    \u003cp\u003eThis is an informational callout.\u003c/p\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n```\n\nIn Markdown:\n\n```markdown\n\u003e [!info] Info\n\u003e This is an informational callout.\n```\n\n## Development\n\n### Build\n\nTo build the package, you'll need Node.js and npm installed. Then run:\n\n```bash\n# Install dependencies\nnpm install\n\n# Clean and build\nnpm run build\n```\n\n## Third-party services\n\nWhen using `parseAsync()`, if no content can be extracted from the local HTML, Defuddle may fetch content from third-party APIs as a fallback. This only happens when the page HTML contains no usable content (e.g. client-side rendered SPAs). 
You can disable this by setting `useAsync: false` in options.\n\n- [FxTwitter API](https://github.com/FixTweet/FxTwitter) — Used to extract X (Twitter) article content, which is not available in server-rendered HTML.\n\n## Debugging\n\n### Debug mode\n\nYou can enable debug mode by passing an options object when creating a new Defuddle instance:\n\n```typescript\nconst result = new Defuddle(document, { debug: true }).parse();\n\n// Access debug info\nconsole.log(result.debug.contentSelector); // CSS selector path of chosen main content element\nconsole.log(result.debug.removals);        // Array of removed elements with reasons\n```\n\nWhen debug mode is enabled:\n\n- Returns a `debug` field in the response with detailed information about content extraction\n- More verbose console logging about the parsing process\n- Preserves HTML class and id attributes that are normally stripped\n- Retains all data-* attributes\n- Skips div flattening to preserve document structure\n\nThe `debug` field contains:\n\n| Property | Type | Description |\n|----------|------|-------------|\n| `contentSelector` | string | CSS selector path of the chosen main content element |\n| `removals` | array | List of elements removed during processing |\n\nEach removal entry contains:\n\n| Property | Type | Description |\n|----------|------|-------------|\n| `step` | string | Pipeline step that removed the element (e.g. `removeLowScoring`, `removeBySelector`, `removeHiddenElements`) |\n| `selector` | string | CSS selector or pattern that matched (for selector-based removal) |\n| `reason` | string | Why the element was removed (e.g. 
`score: -20`, `display:none`) |\n| `text` | string | First 200 characters of the removed element's text content |\n\n### Pipeline toggles\n\nYou can disable individual pipeline steps to diagnose content extraction issues:\n\n```typescript\n// Skip content scoring to see if it's removing content incorrectly\nconst withoutScoring = new Defuddle(document, { removeLowScoring: false }).parse();\n\n// Skip hidden element removal (useful for CSS sidenote layouts)\nconst withHiddenElements = new Defuddle(document, { removeHiddenElements: false }).parse();\n\n// Skip small image removal\nconst withSmallImages = new Defuddle(document, { removeSmallImages: false }).parse();\n```\n\n### Content selector\n\nUse `contentSelector` to bypass Defuddle's auto-detection and specify the main content element directly:\n\n```typescript\nconst result = new Defuddle(document, {\n  contentSelector: 'article.post-content'\n}).parse();\n```\n\nIf the selector doesn't match any element, Defuddle falls back to auto-detection.\n","funding_links":[],"categories":["TypeScript","HTML"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkepano%2Fdefuddle","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkepano%2Fdefuddle","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkepano%2Fdefuddle/lists"}