{"id":13564654,"url":"https://github.com/romansky/dom-to-semantic-markdown","last_synced_at":"2025-05-14T11:10:55.740Z","repository":{"id":249786542,"uuid":"832343102","full_name":"romansky/dom-to-semantic-markdown","owner":"romansky","description":"DOM to Semantic-Markdown for use with LLMs","archived":false,"fork":false,"pushed_at":"2025-02-06T09:16:41.000Z","size":291,"stargazers_count":811,"open_issues_count":4,"forks_count":19,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-04-12T13:55:24.110Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/romansky.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-22T20:38:49.000Z","updated_at":"2025-04-12T12:41:50.000Z","dependencies_parsed_at":"2024-11-04T17:34:59.244Z","dependency_job_id":"516dfabe-15ca-477d-b778-cf2d3368c5ef","html_url":"https://github.com/romansky/dom-to-semantic-markdown","commit_stats":null,"previous_names":["romansky/dom-to-semantic-markdown"],"tags_count":33,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/romansky%2Fdom-to-semantic-markdown","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/romansky%2Fdom-to-semantic-markdown/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/romansky%2Fdom-to-semantic-markdown/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/romansky%2Fdom-to-semantic-markdown/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/romansky","download_url":"https://codeload.github.com/romansky/dom-to-semantic-markdown/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254129489,"owners_count":22019628,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T13:01:34.100Z","updated_at":"2025-05-14T11:10:50.727Z","avatar_url":"https://github.com/romansky.png","language":"TypeScript","funding_links":[],"categories":["others","TypeScript","A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"\u003ch1 align=\"center\"\u003e\n    \u003cimg width=\"100\" height=\"100\" src=\"d2m_color.svg\" alt=\"DOM to Semantic Markdown Logo\"\u003e\u003cbr\u003e\n    DOM to Semantic Markdown\n\u003c/h1\u003e\n\n[![CI](https://github.com/romansky/dom-to-semantic-markdown/actions/workflows/ci.yml/badge.svg)](https://github.com/romansky/dom-to-semantic-markdown/actions/workflows/ci.yml)\n[![npm version](https://badge.fury.io/js/dom-to-semantic-markdown.svg)](https://badge.fury.io/js/dom-to-semantic-markdown)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n\nThis library converts HTML DOM to a semantic Markdown format optimized for use with Large Language Models (LLMs). 
It\npreserves the semantic structure of web content, extracts essential metadata, and reduces token usage compared to raw\nHTML, making it easier for LLMs to understand and process information.\n\n## Key Features\n\n* **Semantic Structure Preservation:** Retains the meaning of HTML elements like `\u003cheader\u003e`, `\u003cfooter\u003e`, `\u003cnav\u003e`, and\n  more.\n* **Metadata Extraction:** Captures important metadata such as title, description, keywords, Open Graph tags, Twitter\n  Card tags, and JSON-LD data.\n* **Token Efficiency:** Optimizes for token usage through URL refification (converting URLs to reference-style links) and concise representation of content.\n* **Main Content Detection:** Automatically identifies and extracts the primary content section of a webpage.\n* **Table Column Tracking:** Adds unique identifiers to table columns, improving an LLM's ability to correlate data across\n  rows.\n\n## Special Feature Examples\n\nHere are examples showcasing the library's special features using the CLI tool:\n\n**1. Simple Content Extraction:**\n\n```bash\nnpx d2m@latest -u https://xkcd.com\n```\n\nThis command fetches `https://xkcd.com` and converts it to Markdown.\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to view the output\u003c/summary\u003e\n\n```markdown\n\n- [Archive](/archive)\n- [What If?](https://what-if.xkcd.com/)\n- [About](/about)\n- [Feed](/atom.xml)   • [Email](/newsletter/)\n- [TW](https://twitter.com/xkcd/)   • [FB](https://www.facebook.com/TheXKCD/)\n  • [IG](https://www.instagram.com/xkcd/)\n- [-Books-](/books/)\n- [What If? 2](/what-if-2/)\n- [WI?](/what-if/)   • [TE](/thing-explainer/)   • [HT](/how-to/)\n\n\u003ca href=\"/\"\u003e![xkcd.com logo](/s/0b7742.png)\u003c/a\u003e A webcomic of romance,\nsarcasm, math, and language. [Special 10th anniversary edition of WHAT IF?](https://xkcd.com/what-if/) —revised and\nannotated with brand-new illustrations and answers to important questions you never thought to ask—coming from\nNovember 2024. 
Preorder [here](https://bit.ly/WhatIf10th) ! Exam Numbers\n\n- [|\u003c](/1/)\n- [\u003c Prev](/2965/)\n- [Random](//c.xkcd.com/random/comic/)\n- [Next \u003e](about:blank#)\n- [\u003e|](/)\n\n![Exam Numbers](//imgs.xkcd.com/comics/exam_numbers.png)\n\n- [|\u003c](/1/)\n- [\u003c Prev](/2965/)\n- [Random](//c.xkcd.com/random/comic/)\n- [Next \u003e](about:blank#)\n- [\u003e|](/)\n\nPermanent link to this comic: [https://xkcd.com/2966/](https://xkcd.com/2966)\nImage URL (for\nhotlinking/embedding): [https://imgs.xkcd.com/comics/exam_numbers.png](https://imgs.xkcd.com/comics/exam_numbers.png)![Selected Comics](//imgs.xkcd.com/s/a899e84.jpg)\n\u003ca href=\"//xkcd.com/1732/\"\u003e![Earth temperature timeline](//imgs.xkcd.com/s/temperature.png)\u003c/a\u003e\n[RSS Feed](/rss.xml) - [Atom Feed](/atom.xml) - [Email](/newsletter/)\nComics I enjoy:\n[Three Word Phrase](http://threewordphrase.com/) , [SMBC](https://www.smbc-comics.com/) , [Dinosaur Comics](https://www.qwantz.com/) , [Oglaf](https://oglaf.com/) (\nnsfw), [A Softer World](https://www.asofterworld.com/) , [Buttersafe](https://buttersafe.com/) , [Perry Bible Fellowship](https://pbfcomics.com/) , [Questionable Content](https://questionablecontent.net/) , [Buttercup Festival](http://www.buttercupfestival.com/) , [Homestuck](https://www.homestuck.com/) , [Junior Scientist Power Hour](https://www.jspowerhour.com/)\nOther things:\n[Tips on technology and government](https://medium.com/civic-tech-thoughts-from-joshdata/so-you-want-to-reform-democracy-7f3b1ef10597) ,\n[Climate FAQ](https://www.nytimes.com/interactive/2017/climate/what-is-climate-change.html) , [Katharine Hayhoe](https://twitter.com/KHayhoe)\nxkcd.com is best viewed with Netscape Navigator 4.0 or below on a Pentium 3±1 emulated in Javascript on an Apple IIGS\nat a screen resolution of 1024x1. Please enable your ad blockers, disable high-heat drying, and remove your device\nfrom Airplane Mode and set it to Boat Mode. 
For security reasons, please leave caps lock on while browsing. This work is\nlicensed under\na [Creative Commons Attribution-NonCommercial 2.5 License](https://creativecommons.org/licenses/by-nc/2.5/).\n\nThis means you're free to copy and share these comics (but not to sell them). [More details](/license.html).\n```\n\n\u003c/details\u003e\n\n**2. Table Column Tracking:**\n\n```bash\nnpx d2m@latest -u https://softwareyoga.com/latency-numbers-everyone-should-know/ -t -e\n```\n\nThis command fetches the page, converts its main content to Markdown, and adds unique identifiers to table columns, aiding\nLLMs in understanding table structure.\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to view the output\u003c/summary\u003e\n\n```markdown\n\n# Latency Numbers Everyone Should Know\n\n## Latency\n\nIn a computer network, latency is defined as the amount of time it takes for a packet of data to get from one designated\npoint to another.\n\nIn more general terms, it is the amount of time between the cause and the observation of the effect.\n\nAs you would expect, latency is important, very important. As programmers, we all know reading from disk takes longer\nthan reading from memory or the fact that L1 cache is faster than the L2 cache.\n\nBut do you know the orders of magnitude by which these aspects are faster/slower compared to others?\n\n## Latency for common operations\n\nJeff Dean from Google studied exactly that and came up with figures for latency in various situations.\n\nWith improving hardware, the latency at the higher ends of the spectrum are reducing, but not enough to ignore them\ncompletely! For instance, to read 1MB sequentially from disk might have taken 20,000,000 ns a decade earlier and with\nthe advent of SSDs may probably take 1,000,000 ns today. But it is never going to surpass reading directly from memory.\n\nThe table below presents the latency for the most common operations on commodity hardware. 
These data are only\napproximations and will vary with the hardware and the execution environment of your code. However, they do serve their\nprimary purpose, which is to enable us make informed technical decisions to reduce latency.\n\nFor better comprehension of the multi-fold increase in latency, scaled figures in relation to L2 cache are also\nprovided by assuming that the L1 cache reference is 1 sec.\n\n**Scroll horizontally on the table in smaller screens**\n\n| Operation \u003c!-- col-0 --\u003e | Note \u003c!-- col-1 --\u003e | Latency \u003c!-- col-2 --\u003e | Scaled Latency \u003c!-- col-3 --\u003e |\n| --- | --- | --- | --- |\n| L1 cache reference \u003c!-- col-0 --\u003e | Level-1 cache, usually built onto the microprocessor chip itself. \u003c!-- col-1 --\u003e | 0.5 ns \u003c!-- col-2 --\u003e | Consider L1 cache reference duration is 1 sec \u003c!-- col-3 --\u003e |\n| Branch mispredict \u003c!-- col-0 --\u003e | During the execution of a program, CPU predicts the next set of instructions. Branch misprediction is when it makes the wrong prediction. Hence, the previous prediction has to be erased and new one calculated and placed on the execution stack. \u003c!-- col-1 --\u003e | 5 ns \u003c!-- col-2 --\u003e | 10 s \u003c!-- col-3 --\u003e |\n| L2 cache reference \u003c!-- col-0 --\u003e | Level-2 cache is memory built on a separate chip. \u003c!-- col-1 --\u003e | 7 ns \u003c!-- col-2 --\u003e | 14 s \u003c!-- col-3 --\u003e |\n| Mutex lock/unlock \u003c!-- col-0 --\u003e | Simple synchronization method used to ensure exclusive access to resources shared between many threads. \u003c!-- col-1 --\u003e | 25 ns \u003c!-- col-2 --\u003e | 50 s \u003c!-- col-3 --\u003e |\n| Main memory reference \u003c!-- col-0 --\u003e | Time to reference main memory i.e. RAM. 
\u003c!-- col-1 --\u003e | 100 ns \u003c!-- col-2 --\u003e | 3m 20s \u003c!-- col-3 --\u003e |\n| Compress 1K bytes with Snappy \u003c!-- col-0 --\u003e | Snappy is a fast data compression and decompression library written in C++ by Google and used in many Google projects like BigTable, MapReduce and other open source projects. \u003c!-- col-1 --\u003e | 3,000 ns \u003c!-- col-2 --\u003e | 1h 40 m \u003c!-- col-3 --\u003e |\n| Send 1K bytes over 1 Gbps network \u003c!-- col-0 --\u003e |  \u003c!-- col-1 --\u003e | 10,000 ns \u003c!-- col-2 --\u003e | 5h 33m 20s \u003c!-- col-3 --\u003e |\n| Read 1 MB sequentially from memory \u003c!-- col-0 --\u003e | Read from RAM. \u003c!-- col-1 --\u003e | 250,000 ns \u003c!-- col-2 --\u003e | 5d 18h 53m 20s \u003c!-- col-3 --\u003e |\n| Round trip within same datacenter \u003c!-- col-0 --\u003e | We can assume that the DNS lookup will be much faster within a datacenter than it is to go over an external router. \u003c!-- col-1 --\u003e | 500,000 ns \u003c!-- col-2 --\u003e | 11d 13h 46m 40s \u003c!-- col-3 --\u003e |\n| Read 1 MB sequentially from SSD disk \u003c!-- col-0 --\u003e | Assumes SSD disk. SSD boasts random data access times of 100000 ns or less. \u003c!-- col-1 --\u003e | 1,000,000 ns \u003c!-- col-2 --\u003e | 23d 3h 33m 20s \u003c!-- col-3 --\u003e |\n| Disk seek \u003c!-- col-0 --\u003e | Disk seek is method to get to the sector and head in the disk where the required data exists. \u003c!-- col-1 --\u003e | 10,000,000 ns \u003c!-- col-2 --\u003e | 231d 11h 33m 20s \u003c!-- col-3 --\u003e |\n| Read 1 MB sequentially from disk \u003c!-- col-0 --\u003e | Assumes regular disk, not SSD. Check the difference in comparison to SSD! \u003c!-- col-1 --\u003e | 20,000,000 ns \u003c!-- col-2 --\u003e | 462d 23h 6m 40s \u003c!-- col-3 --\u003e |\n| Send packet CA-\u003eNetherlands-\u003eCA \u003c!-- col-0 --\u003e | Round trip for packet data from U.S.A to Europe and back. 
\u003c!-- col-1 --\u003e | 150,000,000 ns \u003c!-- col-2 --\u003e | 3472d 5h 20m \u003c!-- col-3 --\u003e |\n\n### References:\n\n1. [Designs, Lessons and Advice from Building Large Distributed Systems](http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf)\n2. [Peter Norvig’s post on – Teach Yourself Programming in Ten Years](http://norvig.com/21-days.html#answers)\n```\n\n\u003c/details\u003e\n\n**3. Metadata Extraction (Basic):**\n\n```bash\nnpx d2m@latest -u https://xkcd.com -meta basic\n```\n\nThis command extracts basic metadata (title, description, keywords) and includes it in the Markdown output.\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to view the output\u003c/summary\u003e\n\n```markdown\n---\ntitle: \"xkcd: Exam Numbers\"\n---\n\n- [Archive](/archive)\n- [What If?](https://what-if.xkcd.com/)\n- [About](/about)\n- [Feed](/atom.xml)   • [Email](/newsletter/)\n- [TW](https://twitter.com/xkcd/)   • [FB](https://www.facebook.com/TheXKCD/)\n  • [IG](https://www.instagram.com/xkcd/)\n- [-Books-](/books/)\n- [What If? 2](/what-if-2/)\n- [WI?](/what-if/)   • [TE](/thing-explainer/)   • [HT](/how-to/)\n\n\u003ca href=\"/\"\u003e![xkcd.com logo](/s/0b7742.png)\u003c/a\u003e A webcomic of romance,\nsarcasm, math, and language. [Special 10th anniversary edition of WHAT IF?](https://xkcd.com/what-if/) —revised and\nannotated with brand-new illustrations and answers to important questions you never thought to ask—coming from\nNovember 2024. Preorder [here](https://bit.ly/WhatIf10th) ! 
Exam Numbers\n\n- [|\u003c](/1/)\n- [\u003c Prev](/2965/)\n- [Random](//c.xkcd.com/random/comic/)\n- [Next \u003e](about:blank#)\n- [\u003e|](/)\n\n![Exam Numbers](//imgs.xkcd.com/comics/exam_numbers.png)\n\n- [|\u003c](/1/)\n- [\u003c Prev](/2965/)\n- [Random](//c.xkcd.com/random/comic/)\n- [Next \u003e](about:blank#)\n- [\u003e|](/)\n\nPermanent link to this comic: [https://xkcd.com/2966/](https://xkcd.com/2966)\nImage URL (for\nhotlinking/embedding): [https://imgs.xkcd.com/comics/exam_numbers.png](https://imgs.xkcd.com/comics/exam_numbers.png)![Selected Comics](//imgs.xkcd.com/s/a899e84.jpg)\n\u003ca href=\"//xkcd.com/1732/\"\u003e![Earth temperature timeline](//imgs.xkcd.com/s/temperature.png)\u003c/a\u003e\n[RSS Feed](/rss.xml) - [Atom Feed](/atom.xml) - [Email](/newsletter/)\nComics I enjoy:\n[Three Word Phrase](http://threewordphrase.com/) , [SMBC](https://www.smbc-comics.com/) , [Dinosaur Comics](https://www.qwantz.com/) , [Oglaf](https://oglaf.com/) (\nnsfw), [A Softer World](https://www.asofterworld.com/) , [Buttersafe](https://buttersafe.com/) , [Perry Bible Fellowship](https://pbfcomics.com/) , [Questionable Content](https://questionablecontent.net/) , [Buttercup Festival](http://www.buttercupfestival.com/) , [Homestuck](https://www.homestuck.com/) , [Junior Scientist Power Hour](https://www.jspowerhour.com/)\nOther things:\n[Tips on technology and government](https://medium.com/civic-tech-thoughts-from-joshdata/so-you-want-to-reform-democracy-7f3b1ef10597) ,\n[Climate FAQ](https://www.nytimes.com/interactive/2017/climate/what-is-climate-change.html) , [Katharine Hayhoe](https://twitter.com/KHayhoe)\nxkcd.com is best viewed with Netscape Navigator 4.0 or below on a Pentium 3±1 emulated in Javascript on an Apple IIGS\nat a screen resolution of 1024x1. Please enable your ad blockers, disable high-heat drying, and remove your device\nfrom Airplane Mode and set it to Boat Mode. For security reasons, please leave caps lock on while browsing. 
This work is\nlicensed under\na [Creative Commons Attribution-NonCommercial 2.5 License](https://creativecommons.org/licenses/by-nc/2.5/).\n\nThis means you're free to copy and share these comics (but not to sell them). [More details](/license.html).\n```\n\n\u003c/details\u003e\n\n**4. Metadata Extraction (Extended):**\n\n```bash\nnpx d2m@latest -u https://xkcd.com -meta extended\n```\n\nThis command extracts extended metadata, including Open Graph, Twitter Card tags, and JSON-LD data, and includes it in\nthe Markdown output.\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to view the output\u003c/summary\u003e\n\n```markdown\n---\ntitle: \"xkcd: Exam Numbers\"\nopenGraph:\n  site_name: \"xkcd\"\n  title: \"Exam Numbers\"\n  url: \"https://xkcd.com/2966/\"\n  image: \"https://imgs.xkcd.com/comics/exam_numbers_2x.png\"\ntwitter:\n  card: \"summary_large_image\"\n---\n\n- [Archive](/archive)\n- [What If?](https://what-if.xkcd.com/)\n- [About](/about)\n- [Feed](/atom.xml)   • [Email](/newsletter/)\n- [TW](https://twitter.com/xkcd/)   • [FB](https://www.facebook.com/TheXKCD/)\n  • [IG](https://www.instagram.com/xkcd/)\n- [-Books-](/books/)\n- [What If? 2](/what-if-2/)\n- [WI?](/what-if/)   • [TE](/thing-explainer/)   • [HT](/how-to/)\n\n\u003ca href=\"/\"\u003e![xkcd.com logo](/s/0b7742.png)\u003c/a\u003e A webcomic of romance,\nsarcasm, math, and language. [Special 10th anniversary edition of WHAT IF?](https://xkcd.com/what-if/) —revised and\nannotated with brand-new illustrations and answers to important questions you never thought to ask—coming from\nNovember 2024. Preorder [here](https://bit.ly/WhatIf10th) ! 
Exam Numbers\n\n- [|\u003c](/1/)\n- [\u003c Prev](/2965/)\n- [Random](//c.xkcd.com/random/comic/)\n- [Next \u003e](about:blank#)\n- [\u003e|](/)\n\n![Exam Numbers](//imgs.xkcd.com/comics/exam_numbers.png)\n\n- [|\u003c](/1/)\n- [\u003c Prev](/2965/)\n- [Random](//c.xkcd.com/random/comic/)\n- [Next \u003e](about:blank#)\n- [\u003e|](/)\n\nPermanent link to this comic: [https://xkcd.com/2966/](https://xkcd.com/2966)\nImage URL (for\nhotlinking/embedding): [https://imgs.xkcd.com/comics/exam_numbers.png](https://imgs.xkcd.com/comics/exam_numbers.png)![Selected Comics](//imgs.xkcd.com/s/a899e84.jpg)\n\u003ca href=\"//xkcd.com/1732/\"\u003e![Earth temperature timeline](//imgs.xkcd.com/s/temperature.png)\u003c/a\u003e\n[RSS Feed](/rss.xml) - [Atom Feed](/atom.xml) - [Email](/newsletter/)\nComics I enjoy:\n[Three Word Phrase](http://threewordphrase.com/) , [SMBC](https://www.smbc-comics.com/) , [Dinosaur Comics](https://www.qwantz.com/) , [Oglaf](https://oglaf.com/) (\nnsfw), [A Softer World](https://www.asofterworld.com/) , [Buttersafe](https://buttersafe.com/) , [Perry Bible Fellowship](https://pbfcomics.com/) , [Questionable Content](https://questionablecontent.net/) , [Buttercup Festival](http://www.buttercupfestival.com/) , [Homestuck](https://www.homestuck.com/) , [Junior Scientist Power Hour](https://www.jspowerhour.com/)\nOther things:\n[Tips on technology and government](https://medium.com/civic-tech-thoughts-from-joshdata/so-you-want-to-reform-democracy-7f3b1ef10597) ,\n[Climate FAQ](https://www.nytimes.com/interactive/2017/climate/what-is-climate-change.html) , [Katharine Hayhoe](https://twitter.com/KHayhoe)\nxkcd.com is best viewed with Netscape Navigator 4.0 or below on a Pentium 3±1 emulated in Javascript on an Apple IIGS\nat a screen resolution of 1024x1. Please enable your ad blockers, disable high-heat drying, and remove your device\nfrom Airplane Mode and set it to Boat Mode. For security reasons, please leave caps lock on while browsing. 
This work is\nlicensed under\na [Creative Commons Attribution-NonCommercial 2.5 License](https://creativecommons.org/licenses/by-nc/2.5/).\n\nThis means you're free to copy and share these comics (but not to sell them). [More details](/license.html).\n```\n\n\u003c/details\u003e\n\n## Installation\n\n### Using npm\n\n```bash\nnpm install dom-to-semantic-markdown\n```\n\n### Using npx (CLI)\n\n```bash\n\u003e npx d2m@latest -h\nUsage: d2m [options]\n\nConvert DOM to Semantic Markdown\n\nOptions:\n  -V, --version                                      output the version number\n  -i, --input \u003cfile\u003e                                 Input HTML file\n  -o, --output \u003cfile\u003e                                Output Markdown file\n  -e, --extract-main                                 Extract main content\n  -u, --url \u003curl\u003e                                    URL to fetch HTML content from\n  -t, --track-table-columns                          Enable table column tracking for improved LLM data correlation\n  -meta, --include-meta-data \u003c\"basic\" | \"extended\"\u003e  Include metadata extracted from the HTML head\n  -h, --help                                         display help for command\n```\n\n## Usage\n\n### Browser\n\n```javascript\nimport {convertElementToMarkdown} from 'dom-to-semantic-markdown';\n\n// document.body is an Element, so use the Element-based entry point\nconst markdown = convertElementToMarkdown(document.body);\nconsole.log(markdown);\n```\n\n### Node.js\n\n```javascript\nimport {convertHtmlToMarkdown} from 'dom-to-semantic-markdown';\nimport {JSDOM} from 'jsdom';\n\nconst html = '\u003ch1\u003eHello, World!\u003c/h1\u003e\u003cp\u003eThis is a \u003cstrong\u003etest\u003c/strong\u003e.\u003c/p\u003e';\nconst dom = new JSDOM(html);\nconst markdown = convertHtmlToMarkdown(html, {overrideDOMParser: new dom.window.DOMParser()});\nconsole.log(markdown);\n```\n\n### CLI\n\n```bash\nd2m -i input.html -o output.md # Convert input.html to output.md\nd2m -u https://example.com -o output.md # Fetch and convert a 
webpage to Markdown\nd2m -i input.html -e # Extract main content from input.html\nd2m -i input.html -t # Enable table column tracking\nd2m -i input.html -meta basic # Include basic metadata\nd2m -i input.html -meta extended # Include extended metadata\n```\n\n## API\n\n### `convertHtmlToMarkdown(html: string, options?: ConversionOptions): string`\n\nConverts an HTML string to semantic Markdown.\n\n### `convertElementToMarkdown(element: Element, options?: ConversionOptions): string`\n\nConverts an HTML Element to semantic Markdown.\n\n### `ConversionOptions`\n\n* `websiteDomain?: string`: The domain of the website being converted.\n* `extractMainContent?: boolean`: Whether to extract only the main content of the page.\n* `refifyUrls?: boolean`: Whether to convert URLs to reference-style links.\n* `debug?: boolean`: Enable debug logging.\n* `overrideDOMParser?: DOMParser`: Custom DOMParser for Node.js environments.\n* `enableTableColumnTracking?: boolean`: Adds unique identifiers to table columns.\n* `overrideElementProcessing?: (element: Element, options: ConversionOptions, indentLevel: number) =\u003e SemanticMarkdownAST[] | undefined`:\n  Custom processing for HTML elements.\n* `processUnhandledElement?: (element: Element, options: ConversionOptions, indentLevel: number) =\u003e SemanticMarkdownAST[] | undefined`:\n  Handler for unknown HTML elements.\n* `overrideNodeRenderer?: (node: SemanticMarkdownAST, options: ConversionOptions, indentLevel: number) =\u003e string | undefined`:\n  Custom renderer for AST nodes.\n* `renderCustomNode?: (node: CustomNode, options: ConversionOptions, indentLevel: number) =\u003e string | undefined`:\n  Renderer for custom AST nodes.\n* `includeMetaData?: 'basic' | 'extended'`: Controls whether to include metadata extracted from the HTML head.\n    - `'basic'`: Includes standard meta tags like title, description, and keywords.\n    - `'extended'`: Includes basic meta tags, Open Graph tags, Twitter Card tags, and JSON-LD data.\n\n## 
Using the Output with LLMs\n\nThe semantic Markdown produced by this library is optimized for use with Large Language Models (LLMs). To use it effectively:\n\n1. Extract the Markdown content using the library.\n2. Start with a brief instruction or context for the LLM.\n3. Wrap the extracted Markdown in triple backticks (```).\n4. Follow the Markdown with your question or prompt.\n\nExample:\n\n````\nThe following is a semantic Markdown representation of a webpage. Please analyze its content:\n\n```markdown\n{paste your extracted markdown here}\n```\n\n{your question, e.g., \"What are the main points discussed in this article?\"}\n````\n\nThis format helps the LLM understand its task and the context of the content, enabling more accurate and relevant responses to your questions.\n\n## Contributing\n\nContributions are welcome! See the [CONTRIBUTING.md](CONTRIBUTING.md) file for details.\n\n## License\n\nThis project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fromansky%2Fdom-to-semantic-markdown","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fromansky%2Fdom-to-semantic-markdown","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fromansky%2Fdom-to-semantic-markdown/lists"}