https://github.com/ilyashusterman/doc-to-readable
Universal document-to-markdown and section splitter for HTML, URLs, and PDFs.
- Host: GitHub
- URL: https://github.com/ilyashusterman/doc-to-readable
- Owner: ilyashusterman
- Created: 2025-07-10T17:33:55.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-07-15T12:26:29.000Z (8 months ago)
- Last Synced: 2025-10-01T13:39:29.011Z (5 months ago)
- Topics: docs, document-conversion, documents, file-processing, html, javascript, json, markdown, nodejs, npm, rag, splitter
- Language: JavaScript
- Homepage: https://ilyashusterman.github.io/doc-to-readable/
- Size: 36.9 MB
- Stars: 6
- Watchers: 0
- Forks: 0
- Open Issues: 0
# doc-to-readable
Universal document-to-markdown and section splitter for HTML, URLs, and PDFs.
## Features
- **Cross-platform:** Works in both Node.js and browser environments (PDF support is best in Node.js)
- Convert HTML, URLs, or PDFs to Markdown
- Split Markdown into logical sections by headers
- **High performance:** Sub-second processing for documents up to about 100KB (see Performance)
- **Memory-aware:** Accepts files up to 2MB; memory use grows with input size (see Performance)
## Installation
```sh
npm install doc-to-readable
```
## Usage
### Convert to Markdown
```js
import { docToMarkdown } from 'doc-to-readable';
// From an HTML string
const md = await docToMarkdown('<h1>Hello</h1><p>World</p>', { type: 'html' });
// From URL
const mdFromUrl = await docToMarkdown('https://example.com', { type: 'url' });
// From Markdown (returns as-is)
const mdFromMarkdown = await docToMarkdown('# Title\nContent', { type: 'markdown' });
```
### Split into Sections
```js
import { splitReadableDocs } from 'doc-to-readable';
// From Markdown
const sections = await splitReadableDocs('# Title\n\nContent here\n\n## Subtitle\n\nMore content');
// sections: [{ title: 'Title', content: 'Content here' }, { title: 'Subtitle', content: 'More content' }]
// From HTML
const html = '<h1>Title</h1><p>Content</p><h2>Subtitle</h2><p>More</p>';
const htmlSections = await splitReadableDocs(html, { type: 'html' });
// From URL
const urlSections = await splitReadableDocs('https://example.com', { type: 'url' });
```
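To make the section shape concrete, here is a minimal sketch of header-based splitting. It is not the library's implementation (the README lists remark for parsing); `splitByHeaders` is a hypothetical helper that only handles `#`-style headers, skips fenced code, and drops any text before the first header.

```js
// Hypothetical helper, not part of doc-to-readable: split Markdown on "#" headers.
const FENCE = /^\s*(`{3}|~{3})/;          // opening or closing code fence
const HEADER = /^(#{1,6})\s+(.+?)\s*$/;   // "# Title" through "###### Title"

function splitByHeaders(markdown) {
  const sections = [];
  let current = null;
  let inFence = false;

  for (const line of markdown.split('\n')) {
    // Track fenced code so a "# comment" inside code is not read as a header.
    if (FENCE.test(line)) inFence = !inFence;

    const header = inFence ? null : HEADER.exec(line);
    if (header) {
      if (current) sections.push(current);
      current = { title: header[2], lines: [] };
    } else if (current) {
      current.lines.push(line);
    }
    // Text before the first header is dropped in this sketch.
  }
  if (current) sections.push(current);

  return sections.map(({ title, lines }) => ({
    title,
    content: lines.join('\n').trim(),
  }));
}

const sketchSections = splitByHeaders(
  '# Title\n\nContent here\n\n## Subtitle\n\nMore content'
);
console.log(sketchSections);
// [ { title: 'Title', content: 'Content here' },
//   { title: 'Subtitle', content: 'More content' } ]
```

A real Markdown parser also handles underlined (setext) headers and nested structures, which this line-based sketch ignores.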
### PDF Support
- For PDF files, convert to HTML first using the included helpers (such as `pdfToHtmlFromBuffer`), then pass the resulting HTML to `docToMarkdown` or `splitReadableDocs` with `{ type: 'html' }`.
## API
- `docToMarkdown(input: string, options: { type: 'url' | 'html' | 'markdown' }): Promise<string>`
  - If `type` is `'markdown'`, returns the input as-is.
  - If `type` is unsupported, throws a Not Implemented error.
- `splitReadableDocs(input: string, options?: { type?: 'markdown' | 'url' | 'html' }): Promise<Array<{ title: string, content: string }>>`
  - If `type` is omitted or `'markdown'`, splits the input as Markdown.
  - If `type` is `'html'` or `'url'`, converts to Markdown first, then splits.
- `pdfToHtmlFromBuffer(buffer: ArrayBuffer): Promise<string>`: Converts a PDF buffer to HTML.
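Because `docToMarkdown` expects an explicit `type`, callers that handle mixed input may want to choose it automatically. The sketch below is one possible heuristic. `guessInputType` is a hypothetical helper, not part of the library, and its rules are assumptions that will misclassify some inputs, such as Markdown containing inline HTML tags.

```js
// Hypothetical helper, not part of doc-to-readable: pick a `type` for mixed input.
function guessInputType(input) {
  const text = input.trim();

  // A single absolute http(s) URL with no whitespace is treated as a URL.
  if (/^https?:\/\/\S+$/i.test(text)) return 'url';

  // A tag such as <p>, </div>, or <a href="..."> suggests HTML.
  // Markdown autolinks like <https://example.com> do not match this pattern.
  if (/<\/?[a-z][a-z0-9-]*(\s[^>]*)?\/?>/i.test(text)) return 'html';

  // Everything else is passed through as Markdown.
  return 'markdown';
}

console.log(guessInputType('https://example.com'));        // 'url'
console.log(guessInputType('<h1>Title</h1><p>Body</p>'));  // 'html'
console.log(guessInputType('# Title\n\nBody'));            // 'markdown'
```

The result can then be passed as `{ type: guessInputType(input) }` to `docToMarkdown` or `splitReadableDocs`.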
### PDF Buffer to HTML
```js
import { pdfToHtmlFromBuffer, docToMarkdown } from 'doc-to-readable';
// Load the PDF as an ArrayBuffer (in Node.js, fetch needs an absolute URL)
const pdfBuffer = await fetch('document.pdf').then(res => res.arrayBuffer());
const html = await pdfToHtmlFromBuffer(pdfBuffer);
// Then convert to markdown
const md = await docToMarkdown(html, { type: 'html' });
```
## Performance
The library is optimized for high performance across different file sizes. Here are benchmark results from our test suite:
### Processing Speed
| File Size | docToMarkdown | splitReadableDocs | Memory Usage |
|-----------|---------------|-------------------|--------------|
| 1KB | 265ms | 0ms | 33MB RSS |
| 10KB | 43ms | 0ms | 2MB RSS |
| 100KB | 237ms | 1ms | 23MB RSS |
| 1000KB | 2.7s | 4ms | 259MB RSS |
| 2MB | 6.3s | N/A | 934MB RSS |
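
One way to read the `docToMarkdown` column is as cost per kilobyte. The sketch below only does arithmetic on the figures in the table; treating the "2MB" row as 2048KB is an assumption.

```js
// Rows copied from the benchmark table; sizes in KB, docToMarkdown times in ms.
// Assumption: the "2MB" row is treated as 2048KB.
const benchmarkRows = [
  { sizeKB: 1, ms: 265 },
  { sizeKB: 10, ms: 43 },
  { sizeKB: 100, ms: 237 },
  { sizeKB: 1000, ms: 2700 },
  { sizeKB: 2048, ms: 6300 },
];

// Divide each time by its size, rounded to two decimals.
function costPerKB(rows) {
  return rows.map(({ sizeKB, ms }) => ({
    sizeKB,
    msPerKB: Number((ms / sizeKB).toFixed(2)),
  }));
}

for (const { sizeKB, msPerKB } of costPerKB(benchmarkRows)) {
  console.log(`${sizeKB}KB: ${msPerKB} ms/KB`);
}
// 1KB: 265 ms/KB
// 10KB: 4.3 ms/KB
// 100KB: 2.37 ms/KB
// 1000KB: 2.7 ms/KB
// 2048KB: 3.08 ms/KB
```

From 100KB upward the cost stays between roughly 2 and 3 ms per KB, while the 1KB row is far higher, which may reflect a fixed startup cost on the first run.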
### Key Performance Features
- **Fast splitting**: `splitReadableDocs` took 4ms or less in every benchmark above where it was measured
- **Roughly linear scaling**: From 100KB upward, `docToMarkdown` time grows approximately in proportion to file size
- **Memory usage**: RSS grows with input size, reaching about 934MB for a 2MB document
- **Size limits**: Built-in 2MB limit prevents memory issues
- **Real-time ready**: Sub-second processing for documents up to 100KB
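
The README does not say how an oversized input is reported, so a caller may prefer to check the size before converting. This is a minimal sketch: `checkInputSize` and `MAX_BYTES` are hypothetical names, and reading "2MB" as 2 × 1024 × 1024 UTF-8 bytes is an assumption.

```js
// Assumption: the 2MB limit is measured in UTF-8 bytes, at 2 * 1024 * 1024.
const MAX_BYTES = 2 * 1024 * 1024;

// Hypothetical helper, not part of doc-to-readable.
function checkInputSize(input, maxBytes = MAX_BYTES) {
  // TextEncoder is available globally in modern Node.js and in browsers.
  const bytes = new TextEncoder().encode(input).length;
  return { bytes, withinLimit: bytes <= maxBytes };
}

const sizeCheck = checkInputSize('<h1>Hello</h1><p>World</p>');
console.log(sizeCheck); // { bytes: 26, withinLimit: true }
```

Counting bytes and not `input.length` matters for non-ASCII text, where one character can take several bytes.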
### Performance Benchmarks
The library includes comprehensive benchmark tests that validate performance across:
- **Small documents** (1-10KB): Sub-second processing
- **Medium documents** (100KB): ~250ms processing
- **Large documents** (1MB): ~3 seconds processing
- **Very large documents** (2MB): ~6 seconds processing
- **Edge cases**: Many sections, long paragraphs, oversized files
Run benchmarks with:
```sh
npm run test:benchmark
```
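
To time the library on your own documents, a small harness along these lines may help. It is a sketch and not the project's benchmark suite: `measure` is a hypothetical helper, and the workload is a stand-in so the snippet runs without the package installed.

```js
// Hypothetical helper: time one call and report the RSS change where available.
async function measure(label, fn) {
  const hasProcess =
    typeof process !== 'undefined' && typeof process.memoryUsage === 'function';
  const rssBefore = hasProcess ? process.memoryUsage().rss : null;

  const start = performance.now();
  const result = await fn(); // handles both sync and async functions
  const ms = performance.now() - start;

  // RSS is only reported in Node.js; browsers get null.
  const rssDeltaMB = hasProcess
    ? (process.memoryUsage().rss - rssBefore) / (1024 * 1024)
    : null;

  return { label, ms, rssDeltaMB, result };
}

// Stand-in workload; swap in a real conversion when the library is installed.
measure('stand-in', () => 'x'.repeat(100000).length).then(({ label, ms }) => {
  console.log(`${label}: ${ms.toFixed(1)}ms`);
});
```

With the package installed, the stand-in can be replaced by `() => docToMarkdown(html, { type: 'html' })`. A single run is noisy, so repeating the measurement and discarding the first result gives a steadier figure.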
## Main Dependencies
- [@mozilla/readability](https://github.com/mozilla/readability): Extracts main article content from HTML.
- [turndown](https://github.com/mixmark-io/turndown): Converts HTML to Markdown.
- [turndown-plugin-gfm](https://github.com/domchristie/turndown-plugin-gfm): GitHub Flavored Markdown support for Turndown.
- [remark](https://github.com/remarkjs/remark): Markdown processing (used for splitting and parsing).
- [dompurify](https://github.com/cure53/DOMPurify): Sanitizes HTML input.
- [jsdom](https://github.com/jsdom/jsdom): Emulates browser DOM in Node.js for HTML parsing.
- [pdf.js](https://github.com/mozilla/pdf.js): PDF to HTML conversion.
▶️ **[Open Live on StackBlitz](https://stackblitz.com/edit/vitejs-vite-wkr9bmtk)**
## License
MIT
Patch update: API and types for splitReadableDocs and docToMarkdown improved for clarity and flexibility.