{"id":17049065,"url":"https://github.com/toddself/dr-sax","last_synced_at":"2025-04-12T16:13:08.785Z","repository":{"id":57215897,"uuid":"19345378","full_name":"toddself/dr-sax","owner":"toddself","description":"An HTML to markdown converter that uses a sax based parser (htmlparser2)","archived":false,"fork":false,"pushed_at":"2015-08-17T16:29:02.000Z","size":471,"stargazers_count":17,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-10T16:08:54.987Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://toddself.github.io/presentations/dr-sax/demo","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/toddself.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-05-01T13:36:16.000Z","updated_at":"2024-12-20T20:25:10.000Z","dependencies_parsed_at":"2022-08-26T13:31:43.773Z","dependency_job_id":null,"html_url":"https://github.com/toddself/dr-sax","commit_stats":null,"previous_names":[],"tags_count":19,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/toddself%2Fdr-sax","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/toddself%2Fdr-sax/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/toddself%2Fdr-sax/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/toddself%2Fdr-sax/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/toddself","download_url":"https://codeload.github.com/toddself/dr-sax/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https:/
/github.com","kind":"github","repositories_count":248594136,"owners_count":21130314,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-14T09:53:46.929Z","updated_at":"2025-04-12T16:13:08.692Z","avatar_url":"https://github.com/toddself.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![build status](https://secure.travis-ci.org/toddself/dr-sax.png)](http://travis-ci.org/toddself/dr-sax)\n\n# Dr. SAX\n\nDr. SAX is an HTML to markdown converter that uses a [SAX-based parser](http://github.com/fb55/htmlparser2) to convert HTML to markdown. (SAX to MD.  SAX MD? DR SAX? GET IT?!)\n\n![your project name is bad and you should feel bad](http://i.imgur.com/qgxiLco.png)\n\nIt presents both a standard (non-streaming) and transform stream interface for converting HTML to markdown.\n\n## Live Demo\n[WYSWYG Editor with markdown export](http://toddself.github.io/presentations/dr-sax/demo)\n\n## Installing\n\n`npm install --save dr-sax`\n\n## Usage\n\n_Non-Streaming_\n\n```javascript\n\u003e var DrSax = require('dr-sax');\n\u003e var drsax = new DrSax();\n\u003e drsax.write('\u003cp\u003eWow, this is an \u003cb\u003eawesome\u003c/b\u003e HTML parser dude! You should \u003ca href=\"http://yahoo.com\"\u003esubmit it to yahoo!\u003c/a\u003e');\n\"\\n\\nWow, this is an **awesome** HTML parser dude! 
You should [submit it to yahoo!](http://yahoo.com)\\n\\n\"\n```\n\n_Streaming_\n\n```javascript\n\u003e var transform = require('dr-sax').stream();\n\u003e fs.createReadStream('input.html').pipe(transform).pipe(fs.createWriteStream('output.md'));\n```\n\n_Stripping out non-Markdown tags_\n\n```javascript\n\u003e var DrSax = require('dr-sax');\n\u003e var drsax = new DrSax({stripTags: true});\n\u003e drsax.write('\u003cspan class=\"txt-center\"\u003e\u003cb\u003ecentered text\u003c/b\u003e\u003c/span\u003e');\n\"**centered text**\"\n\u003e var drsax2 = new DrSax();\n\u003e drsax2.write('\u003cspan class=\"txt-center\"\u003e\u003cb\u003ecentered text\u003c/b\u003e\u003c/span\u003e');\n\"\u003cspan class=\\\"txt-center\\\"\u003e**centered text**\u003c/span\u003e\"\n```\n\n_Supply your own markdown dialect_\n\n```javascript\n\u003e var dialect = {b: {open: '__', close: '__'}};\n\u003e var DrSax = require('dr-sax');\n\u003e var drsax = new DrSax({dialect: dialect});\n\u003e drsax.write('\u003cp\u003eWow, this is an \u003cb\u003eawesome\u003c/b\u003e HTML parser dude! You should \u003ca href=\"http://yahoo.com\"\u003esubmit it to yahoo!\u003c/a\u003e');\n\"\\n\\nWow, this is an __awesome__ HTML parser dude! You should [submit it to yahoo!](http://yahoo.com)\\n\\n\"\n```\n\n## Why?\nThere are a few node.js based html to markdown converters available, why do we need another?\n\n1. [html-md](https://github.com/neocotic/html.md) and [upndown](https://github.com/netgusto/upndown) are both jsdom based for node.js. JSDOM is slow, and [has some memory issues when used in a loop](https://github.com/neocotic/html.md/pull/43)\n2. Others use regular expressions to parse your HTML. Why hello Zalgo! Nice to meet you today!\n\n## Benchmarking \u0026 Compliance\n\nBenchmarks are available in [Dr. 
Sax Benchmarks](https://github.com/toddself/dr-sax-benchmarks).\n\nHere are the results for:\n\n```\n\"dependencies\": {\n  \"benchmark\": \"~1.0.0\",\n  \"dr-sax\": \"~1.0.7\",\n  \"hammerdown\": \"0.0.18\",\n  \"html-md\": \"~3.0.2\",\n  \"html2markdown\": \"~1.1.0\",\n  \"pdc\": \"~0.1.2\",\n  \"to-markdown\": \"0.0.2\",\n  \"unmarked\": \"0.0.12\",\n  \"upndown\": \"~0.0.7\"\n}\n```\n\nOn a 2014 quad-core 3.5gHz Core i7 iMac running node 0.10.28\n\n(pdc is using Pandoc 1.12.3)\n\n```\n\u003e dr-sax-benchmarks@0.0.0 start /Users/todd/src/dr-sax-benchmarks\n\u003e node index\n\ndr sax x 5,838 ops/sec ±1.90% (92 runs sampled)\nhtmlmd x 228 ops/sec ±4.27% (75 runs sampled)\nupndown x 216 ops/sec ±6.05% (83 runs sampled)\nto-markdown x 6,696 ops/sec ±5.03% (90 runs sampled)\nhtml2markdown x 2,400 ops/sec ±5.08% (87 runs sampled)\nunmarked:\nhammerdown x 932 ops/sec ±6.56% (74 runs sampled)\npdc x 24.07 ops/sec ±17.90% (66 runs sampled)\nFastest is to-markdown\n```\n\nThe fastest is *not* Dr. Sax, but rather [`to-markdown`](https://github.com/domchristie/to-markdown). Alas, `to-markdown` does not handle malformed HTML well as it is based on a regular-expression type parser:\n\n```\n\u003e var drsax = new (require('dr-sax'))();\n\u003e var bs = '\u003cb\u003ethis is a totally\u003ci\u003eBroken\u003c/b\u003e string that I want parsed';\n\u003e drsax.write(bs);\n\n'**this is a totally_Broken_** string that I want parsed'\n\n\u003e var tomd = require('to-markdown').toMarkdown;\n\u003e tomd(bs);\n\n'**this is a totally\u003ci\u003eBroken** string that I want parsed'\n```\n\nBoth of the DOM based parsers ([html-md](https://github.com/neocotic/html.md) and [upndown](https://github.com/netgusto/upndown/)) handle that string identically to how Dr. 
Sax handles it.\n\n[unmarked](https://github.com/tcr/unmarked) does not seem to work correctly however:\n\n```\n\u003e var unmarked = require('unmarked');\nundefined\n\u003e unmarked.parse('\u003cb\u003etest\u003c/b\u003e');\n'test'\n```\n\n## Round Trip Conversion\n\nThe test [tests/throughput-compliance.js](throughput-compliance.js) attempts to test HTML -\u003e Markdown -\u003e HTML conversion using the following Markdown -\u003e HTML converters:\n\n* [Gruber](http://daringfireball.net/projects/markdown)\n* [Marked](https://github.com/chjj/marked)\n* [CommonMark](https://github.com/jgm/commonmark.js)\n* [Markdown-JS](https://github.com/evilstreak/markdown-js)\n\nCurrently Markdown-JS is considered (by me) non-conforming Markdown -\u003e HTML renderers due to its handling of block-level `\u003ciframe\u003e` tags. Its lack of conformance is not due to how Dr. Sax generates its output.  \n\nThese are being tracked by:\n* [markdown-js bug 212](https://github.com/evilstreak/markdown-js/issues/212)\n\nThe issue is pretty simple:\n\nGiven the following input\n\n```html\n\u003cp\u003eI am a \u003cstrong\u003eparagraph\u003c/strong\u003e of text\u003c/p\u003e\n\u003ciframe\u003e\u003c/iframe\u003e\n```\n\nDr Sax will create the following Markdown\n\n```markdown\nI am a **paragraph** of text\n\n\u003ciframe\u003e\u003c/iframe\u003e\n```\n\nBoth Gruber and Marked accept this input and regenerate the original input HTML. 
However, stmd and Markdown-js output:\n\n```html\n\u003cp\u003eI am a \u003cstrong\u003eparagraph\u003c/strong\u003e of test\u003c/p\u003e\n\n\u003cp\u003e\u003ciframe\u003e\u003c/iframe\u003e\u003c/p\u003e\n```\n\nWrapping the `\u003ciframe\u003e` tag in an extraneous `\u003cp\u003e` tag makes them very hard to style appropriately without doing crazy tricks, so I'm going to side with Gruber and Marked on this case and recommend them for rendering Markdown.\n\nHowever, there are some caveats and quirks being that Markdown is a whitespace significant language and HTML is not.\n\nThe primary munging occurs if your input is pretty-printed HTML.\n\n```html\n\u003ch1\u003eWhy use \u003ca href=\"https://github.com/toddself/dr-sax/\"\u003eDr. Sax\u003c/a\u003e\u003c/h1\u003e\n\u003col\u003e\n  \u003cli\u003eBecause you like puns!\u003c/li\u003e\n  \u003cli\u003eBecause you need speed\u003c/li\u003e\n\u003c/ol\u003e\n\u003cstrong\u003eThis is going to be bold!\u003c/strong\u003e\n\u003ch2\u003eKittens\u003c/h2\u003eLook at these funny little furry things!\n\u003ciframe width=\"560\" height=\"315\" src=\"//www.youtube.com/embed/h_hKJCe_-sI\" frameborder=\"0\" allowfullscreen\u003e\u003c/iframe\u003e\n```\n\nThis will convert to the following markdown\n\n```markdown\n# Why use [Dr. Sax](https://github.com/toddself/dr-sax/)\n\n1. Because you like puns!\n1. Because you need speed\n\n\n**This is going to be bold!**\n\n## Kittens\n\nLook at these funny little furry things!\u003ciframe width=\"560\" height=\"315\" src=\"//www.youtube.com/embed/h\u003cem\u003ehKJCe\u003c/em\u003e-sI\" frameborder=\"0\" allowfullscreen=\"\"\u003e\u003c/iframe\u003e\n```\n\nBut, will convert back to the following HTML (using [marked](https://github.com/chjj/marked))\n\n```html\n\u003ch1 id=\"why-use-dr-sax-https-github-com-toddself-dr-sax-\"\u003eWhy use \u003ca href=\"https://github.com/toddself/dr-sax/\"\u003eDr. 
Sax\u003c/a\u003e\u003c/h1\u003e\n\u003col\u003e\n\u003cli\u003eBecause you like puns!\u003c/li\u003e\n\u003cli\u003eBecause you need speed\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003e\u003cstrong\u003eThis is going to be bold!\u003c/strong\u003e\u003c/p\u003e\n\u003ch2 id=\"kittens\"\u003eKittens\u003c/h2\u003e\n\u003cp\u003eLook at these funny little furry things!\u003ciframe width=\"560\" height=\"315\" src=\"//www.youtube.com/embed/h_hKJCe_-sI\" frameborder=\"0\" allowfullscreen=\"\"\u003e\u003c/iframe\u003e\u003c/p\u003e\n```\n\nThe `tab` characters from the `\u003col\u003e` are missing, and the `\u003ciframe\u003e` and `\u003cstrong\u003e` tags are wrapped in paragraphs.  This is a result of how Markdown's dialect handles block-level elements like the `\u003col\u003e` and `\u003ch2\u003e` tags in the page.\n\nThere is a specially formatted test to verify round-trip results in the test suite for compliance.\n\n## Dialects\nCustom dialects can be supplied to the parser. You can get a general concept of how they're defined by looking at [dialect.js](dialect.js).\n\nA dialect is an object with the top-level keys being the HTML tags you're trying to convert. Each of these points to an object with an `open` and `close` key, which is the markdown token to insert instead of the `open`ing HTML tag, and the one use instead of the `close`ing tag. You can omit a tag by just using an empty string. If the tag is indentable (like `\u003cblockquote\u003e`), set that flag to `true`. If the tag is a block-level element, set that flag to `true` as well so that the correct line-spacing will be entered.  
If the tag requires attributes to be parsed, create a new key called `attrs`, which is an object explaining how to deal with the attributes for that tag.\n\n**e.g.**\nThe anchor tag is `\u003ca href=\"url\"\u003ecaptured text\u003c/a\u003e`, but in markdown you need `[captured text](url)`.\n\nThis is defined as:\n\n```javascript\n{\n  a: {\n    open: '',\n    close: '',\n    attrs: {\n      text: {\n        open: '[',\n        close: ']'\n      },\n      href: {\n        open: '(',\n        close: ')'\n      }\n    }\n  }\n}\n```\n\nThe nodes in `attrs` are processed in order -- so since the \"text\" (which is anything in the `captured text` section) needs to come first, we list it first. And since the text needs to be wrapped in `[` and `]`, we note those as the open and close.  Same goes for the `href` attribute. Since that completes the entire tag, we leave the `open` and `close` for the `\u003ca\u003e` itself empty, as it requires no additional tokens.\n\n## Testing\n\nThe test script will, by default, download the Markdown package from Gruber's site, unzip it, and spawn his parser as well. If you do not wish to test against Gruber, don't have a Perl interpreter installed, etc., you can skip these tests by setting `NOGRUBER=true` on the command line before running the tests.\n\n```\ngit clone git@github.com:toddself/dr-sax\ncd dr-sax\nnpm install\nnpm test\nNOGRUBER=true npm test\n```\n\n## License\nDr. Sax is ©2014 Todd Kennedy. Available for use under the [MIT License](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftoddself%2Fdr-sax","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftoddself%2Fdr-sax","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftoddself%2Fdr-sax/lists"}