{"id":42577647,"url":"https://github.com/ilyashusterman/doc-to-readable","last_synced_at":"2026-01-28T22:00:32.074Z","repository":{"id":304008984,"uuid":"1017540653","full_name":"ilyashusterman/doc-to-readable","owner":"ilyashusterman","description":"Universal document-to-markdown and section splitter for HTML, URLs, and PDFs.","archived":false,"fork":false,"pushed_at":"2025-07-15T12:26:29.000Z","size":38734,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-01T13:39:29.011Z","etag":null,"topics":["docs","document-conversion","documents","file-processing","html","javascript","json","markdown","nodejs","npm","rag","splitter"],"latest_commit_sha":null,"homepage":"https://ilyashusterman.github.io/doc-to-readable/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ilyashusterman.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-10T17:33:55.000Z","updated_at":"2025-07-17T08:23:50.000Z","dependencies_parsed_at":"2025-07-10T22:35:31.591Z","dependency_job_id":null,"html_url":"https://github.com/ilyashusterman/doc-to-readable","commit_stats":null,"previous_names":["ilyashusterman/doc-to-readable"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/ilyashusterman/doc-to-readable","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilyashusterman%2Fdoc-to-readable","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilyashusterman%2Fdoc-to-readable/tags","rele
ases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilyashusterman%2Fdoc-to-readable/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilyashusterman%2Fdoc-to-readable/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ilyashusterman","download_url":"https://codeload.github.com/ilyashusterman/doc-to-readable/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilyashusterman%2Fdoc-to-readable/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28853194,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-28T15:15:36.453Z","status":"ssl_error","status_checked_at":"2026-01-28T15:15:13.020Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docs","document-conversion","documents","file-processing","html","javascript","json","markdown","nodejs","npm","rag","splitter"],"created_at":"2026-01-28T22:00:31.169Z","updated_at":"2026-01-28T22:00:32.061Z","avatar_url":"https://github.com/ilyashusterman.png","language":"JavaScript","readme":"[![CI](https://github.com/ilyashusterman/doc-to-readable/actions/workflows/node.js.yml/badge.svg)](https://github.com/ilyashusterman/doc-to-readable/actions)\n[![npm version](https://badge.fury.io/js/doc-to-readable.svg)](https://www.npmjs.com/package/doc-to-readable)\n[![npm 
downloads](https://img.shields.io/npm/dm/doc-to-readable.svg)](https://www.npmjs.com/package/doc-to-readable)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![TypeScript](https://img.shields.io/badge/TypeScript-007ACC?logo=typescript\u0026logoColor=white)](https://www.typescriptlang.org/)\n[![Node.js](https://img.shields.io/badge/Node.js-43853D?logo=node.js\u0026logoColor=white)](https://nodejs.org/)\n\n# doc-to-readable\n\nUniversal document-to-markdown and section splitter for HTML, URLs, and PDFs.\n\n## Features\n- **Cross-platform:** Works in both Node.js and browser environments\n- Convert HTML, URLs, or PDFs to Markdown\n- Split Markdown into logical sections by headers\n- Works in Node.js and browser (PDF support is best in Node.js)\n- **High Performance**: Sub-second processing for most documents\n- **Memory Efficient**: Optimized for large files up to 2MB\n\n## Installation\n```sh\nnpm install doc-to-readable\n```\n\n## Usage\n\n### Convert to Markdown\n```js\nimport { docToMarkdown } from 'doc-to-readable';\n\n// From HTML string\nconst md = await docToMarkdown('\u003ch1\u003eHello\u003c/h1\u003e\u003cp\u003eWorld\u003c/p\u003e', { type: 'html' });\n\n// From URL\nconst mdFromUrl = await docToMarkdown('https://example.com', { type: 'url' });\n\n// From Markdown (returns as-is)\nconst mdFromMarkdown = await docToMarkdown('# Title\\nContent', { type: 'markdown' });\n```\n\n\n### Split into Sections\n```js\nimport { splitReadableDocs } from 'doc-to-readable';\n\n// From Markdown\nconst sections = await splitReadableDocs('# Title\\n\\nContent here\\n\\n## Subtitle\\n\\nMore content');\n// sections: [{ title: 'Title', content: 'Content here' }, { title: 'Subtitle', content: 'More content' }]\n\n// From HTML\nconst html = '\u003ch1\u003eTitle\u003c/h1\u003e\u003cp\u003eContent\u003c/p\u003e\u003ch2\u003eSubtitle\u003c/h2\u003e\u003cp\u003eMore\u003c/p\u003e';\nconst htmlSections = await 
splitReadableDocs(html, { type: 'html' });\n\n// From URL\nconst urlSections = await splitReadableDocs('https://example.com', { type: 'url' });\n```\n\n### PDF Support\n- For PDF files, convert to HTML first using the included helpers, then use `docToMarkdown` or `splitReadableDocs` with `{ type: 'html' }`.\n\n## API\n- `docToMarkdown(input: string, options: { type: 'url' | 'html' | 'markdown' }): Promise\u003cstring\u003e`\n  - If `type` is `'markdown'`, returns input as-is.\n  - If unsupported type, throws a Not Implemented error.\n- `splitReadableDocs(input: string, options?: { type?: 'markdown' | 'url' | 'html' }): Promise\u003cArray\u003c{ title: string | null, content: string }\u003e\u003e`\n  - If `type` is omitted or `'markdown'`, splits input as markdown.\n  - If `type` is `'html'` or `'url'`, converts to markdown first, then splits.\n- `pdfToHtmlFromBuffer(buffer: ArrayBuffer): Promise\u003cstring\u003e` - Convert PDF buffer to HTML\n\n### PDF Buffer to HTML\n```js\nimport { pdfToHtmlFromBuffer } from 'doc-to-readable';\n\n// Convert PDF buffer to HTML\nconst pdfBuffer = await fetch('document.pdf').then(res =\u003e res.arrayBuffer());\nconst html = await pdfToHtmlFromBuffer(pdfBuffer);\n\n// Then convert to markdown\nconst md = await docToMarkdown(html, { type: 'html' });\n```\n\n## Performance\n\nThe library is optimized for high performance across different file sizes. 
Here are benchmark results from our test suite:\n\n### Processing Speed\n\n| File Size | docToMarkdown | splitReadableDocs | Memory Usage |\n|-----------|---------------|-------------------|--------------|\n| 1KB       | 265ms         | 0ms               | 33MB RSS     |\n| 10KB      | 43ms          | 0ms               | 2MB RSS      |\n| 100KB     | 237ms         | 1ms               | 23MB RSS     |\n| 1000KB    | 2.7s          | 4ms               | 259MB RSS    |\n| 2MB       | 6.3s          | N/A               | 934MB RSS    |\n\n### Key Performance Features\n\n- **Ultra-fast splitting**: `splitReadableDocs` takes 4ms or less for documents up to 1000KB\n- **Linear scaling**: `docToMarkdown` processing time scales roughly linearly with file size above 100KB\n- **Memory efficient**: Optimized memory usage for large documents\n- **Size limits**: Built-in 2MB limit prevents memory issues\n- **Real-time ready**: Sub-second processing for documents up to 100KB\n\n### Performance Benchmarks\n\nThe library includes comprehensive benchmark tests that validate performance across:\n- **Small documents** (1-10KB): Sub-second processing\n- **Medium documents** (100KB): ~250ms processing\n- **Large documents** (1MB): ~3 seconds processing\n- **Very large documents** (2MB): ~6 seconds processing\n- **Edge cases**: Many sections, long paragraphs, oversized files\n\nRun benchmarks with:\n```sh\nnpm run test:benchmark\n```\n\n## Main Dependencies\n- [@mozilla/readability](https://github.com/mozilla/readability): Extracts main article content from HTML.\n- [turndown](https://github.com/mixmark-io/turndown): Converts HTML to Markdown.\n- [turndown-plugin-gfm](https://github.com/domchristie/turndown-plugin-gfm): GitHub Flavored Markdown support for Turndown.\n- [remark](https://github.com/remarkjs/remark): Markdown processing (used for splitting and parsing).\n- [dompurify](https://github.com/cure53/DOMPurify): Sanitizes HTML input.\n- [jsdom](https://github.com/jsdom/jsdom): Emulates browser DOM in Node.js for HTML 
parsing.\n- [pdf.js](https://github.com/mozilla/pdf.js): PDF to HTML conversion.\n\n▶️ **[Open Live on StackBlitz](https://stackblitz.com/edit/vitejs-vite-wkr9bmtk)**\n\n## License\nMIT \n\nPatch update: API and types for splitReadableDocs and docToMarkdown improved for clarity and flexibility. \n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Filyashusterman%2Fdoc-to-readable","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Filyashusterman%2Fdoc-to-readable","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Filyashusterman%2Fdoc-to-readable/lists"}