{"id":32655364,"url":"https://github.com/mensfeld/llm-docs-builder","last_synced_at":"2026-01-20T17:35:53.835Z","repository":{"id":304566899,"uuid":"1019180574","full_name":"mensfeld/llm-docs-builder","owner":"mensfeld","description":"Transform and optimize your markdown documentation for Large Language Models (LLMs) and RAG systems. Generate llms.txt automatically.","archived":false,"fork":false,"pushed_at":"2025-10-17T19:10:58.000Z","size":1756,"stargazers_count":6,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-10-18T16:26:20.468Z","etag":null,"topics":["ai","ai-documentation","context-window","documentation","large-language-models","llms","llms-txt","rag","ruby","text-processing","tokenization"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mensfeld.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-13T23:11:34.000Z","updated_at":"2025-10-17T19:10:32.000Z","dependencies_parsed_at":"2025-07-14T00:06:21.658Z","dependency_job_id":"6260432c-3c8b-4931-8edd-d4de6d0d44f6","html_url":"https://github.com/mensfeld/llm-docs-builder","commit_stats":null,"previous_names":["mensfeld/llms-txt-ruby","mensfeld/llm-docs-builder"],"tags_count":12,"template":false,"template_full_name":null,"purl":"pkg:github/mensfeld/llm-docs-builder","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mensfe
ld%2Fllm-docs-builder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mensfeld%2Fllm-docs-builder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mensfeld%2Fllm-docs-builder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mensfeld%2Fllm-docs-builder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mensfeld","download_url":"https://codeload.github.com/mensfeld/llm-docs-builder/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mensfeld%2Fllm-docs-builder/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281295811,"owners_count":26476759,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-27T02:00:05.855Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ai-documentation","context-window","documentation","large-language-models","llms","llms-txt","rag","ruby","text-processing","tokenization"],"created_at":"2025-10-31T10:01:12.225Z","updated_at":"2026-01-20T17:35:53.829Z","avatar_url":"https://github.com/mensfeld.png","language":"Ruby","funding_links":[],"categories":["Ruby"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"misc/logo_wide.png\" alt=\"llm-docs-builder logo\"\u003e\n\u003c/p\u003e\n\n# 
llm-docs-builder\n\n[![CI](https://github.com/mensfeld/llm-docs-builder/actions/workflows/ci.yml/badge.svg)](\n  https://github.com/mensfeld/llm-docs-builder/actions/workflows/ci.yml)\n\n**Optimize your documentation for LLMs and RAG systems. Reduce token consumption by 67-95%.**\n\nllm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, optimizes documents for LLM context windows, and enhances documents for RAG retrieval with hierarchical heading context and metadata.\n\n## The Problem\n\nWhen LLMs fetch documentation, they typically get HTML pages designed for humans - complete with navigation bars, footers, JavaScript, CSS, and other overhead. This wastes 70-90% of your context window on content that doesn't help answer questions.\n\n**Real-world results from [Karafka documentation](https://karafka.io/docs/) (10 pages analyzed):**\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"misc/diff.png\" alt=\"Karafka documentation optimization results\"\u003e\n\u003c/p\u003e\n\n**Average reduction: 83% fewer tokens**\n\n## Quick Start\n\n### Measure Your Current Token Waste\n\n```bash\n# Using Docker (no Ruby installation needed)\ndocker pull mensfeld/llm-docs-builder:latest\n\n# Compare your documentation page\ndocker run mensfeld/llm-docs-builder compare \\\n  --url https://yoursite.com/docs/getting-started.html\n```\n\n**Example output:**\n```\n============================================================\nContext Window Comparison\n============================================================\n\nHuman version:  127.4 KB (~32,620 tokens)\n  Source: https://karafka.io/docs/Pro-Virtual-Partitions/ (User-Agent: human)\n\nAI version:     46.3 KB (~11,854 tokens)\n  Source: https://karafka.io/docs/Pro-Virtual-Partitions/ (User-Agent: AI)\n\n------------------------------------------------------------\nReduction:      81.1 KB (64%)\nToken savings:  20,766 tokens (64%)\nFactor:     
    2.8x smaller\n============================================================\n```\n\n### Transform Your Documentation\n\n```bash\n# Single file\nllm-docs-builder transform --docs README.md\n\n# Fetch and transform a remote page\nllm-docs-builder transform --url https://yoursite.com/docs/page.html\n\n# Bulk transform with config\nllm-docs-builder bulk-transform --config llm-docs-builder.yml\n```\n\n**HTML to Markdown Conversion:** The transformer automatically detects and converts HTML content to clean markdown format. This works seamlessly with both local files and remote URLs, converting HTML tables, code blocks, and other elements into their markdown equivalents.\n\n## Installation\n\n### Docker (Recommended)\n\n```bash\ndocker pull mensfeld/llm-docs-builder:latest\nalias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'\n```\n\n### RubyGems\n\n```bash\ngem install llm-docs-builder\n```\n\n## Features\n\n### Automatic HTML to Markdown Conversion\n\nThe tool automatically detects and converts HTML content to clean markdown:\n- **HTML Tables** → Markdown tables\n- **HTML Code Blocks** → Fenced code blocks\n- **Figures \u0026 Captions** → Clean markdown equivalents\n- **Seamless Integration** - Works with local files and remote URLs without special configuration\n\n```bash\n# Transform HTML content automatically\nllm-docs-builder transform --docs page-with-html.md\nllm-docs-builder transform --url https://site.com/docs/api.html\n```\n\n### Measure and Compare\n\n```bash\n# Compare what your server sends to humans vs AI\nllm-docs-builder compare --url https://yoursite.com/docs/page.html\n\n# Compare remote HTML with local markdown\nllm-docs-builder compare \\\n  --url https://yoursite.com/docs/api.html \\\n  --file docs/api.md\n```\n\n### Generate llms.txt\n\n```bash\n# Create standardized documentation index\nllm-docs-builder generate --config llm-docs-builder.yml\n```\n\n## Configuration\n\n```yaml\n# llm-docs-builder.yml\ndocs: 
./docs\nbase_url: https://myproject.io\ntitle: My Project\ndescription: Brief description\nbody: Optional body content between description and sections\noutput: llms.txt\nsuffix: .llm\nverbose: false\n\n# Basic options\nconvert_urls: true\nremove_comments: true\nremove_badges: true\nremove_frontmatter: true\nnormalize_whitespace: true\n\n# Additional compression options\nremove_code_examples: false\nremove_images: true\nremove_blockquotes: true\nremove_duplicates: true\nremove_stopwords: false\nsimplify_links: true\ngenerate_toc: true\ncustom_instruction: \"This documentation is optimized for AI consumption\"\n\n# RAG enhancement options\nnormalize_headings: true          # Add hierarchical context to headings\nheading_separator: \" / \"          # Separator for heading hierarchy\ninclude_metadata: true            # Enable enhanced llms.txt metadata\ninclude_tokens: true              # Include token counts in llms.txt\ninclude_timestamps: true          # Include update timestamps in llms.txt\ninclude_priority: true            # Include priority labels in llms.txt\ncalculate_compression: false      # Calculate compression ratios (slower)\n\n# Exclusions\nexcludes:\n  - \"**/private/**\"\n  - \"**/drafts/**\"\n```\n\n## CLI Commands\n\n```bash\nllm-docs-builder compare [options]        # Measure token savings\nllm-docs-builder transform [options]      # Transform single file\nllm-docs-builder bulk-transform [options] # Transform directory\nllm-docs-builder generate [options]       # Generate llms.txt\nllm-docs-builder parse [options]          # Parse llms.txt\nllm-docs-builder validate [options]       # Validate llms.txt\nllm-docs-builder version                  # Show version\n```\n\n**Common options:**\n```\n-c, --config PATH    Configuration file\n-d, --docs PATH      Documentation path\n-o, --output PATH    Output file\n-u, --url URL        URL for comparison\n-v, --verbose        Detailed output\n```\n\n## Serving Optimized Docs to AI Bots\n\nAfter using 
`bulk-transform` with `suffix: .llm`, configure your web server to serve optimized versions to AI bots:\n\n**Apache (.htaccess):**\n```apache\nSetEnvIf User-Agent \"(?i)(openai|anthropic|claude|gpt)\" IS_LLM_BOT\nRewriteCond %{ENV:IS_LLM_BOT} !^$\nRewriteRule ^(.*)\\.md$ $1.llm.md [L]\n```\n\n**Nginx:**\n```nginx\nmap $http_user_agent $is_llm_bot {\n    default 0;\n    \"~*(?i)(openai|anthropic|claude|gpt)\" 1;\n}\n\nlocation ~ ^/docs/(.*)\\.md$ {\n    if ($is_llm_bot) {\n        rewrite ^(.*)\\.md$ $1.llm.md last;\n    }\n}\n```\n\n## Docker Usage\n\n```bash\n# Pull image\ndocker pull mensfeld/llm-docs-builder:latest\n\n# Compare (no volume needed for remote URLs)\ndocker run mensfeld/llm-docs-builder compare \\\n  --url https://yoursite.com/docs\n\n# Transform with volume mount\ndocker run -v $(pwd):/workspace mensfeld/llm-docs-builder \\\n  bulk-transform --config llm-docs-builder.yml\n```\n\n**CI/CD Example (GitHub Actions):**\n```yaml\n- name: Optimize documentation\n  run: |\n    docker run -v ${{ github.workspace }}:/workspace \\\n      mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml\n```\n\n## Compression Examples\n\n**Input markdown:**\n```markdown\n---\nlayout: docs\n---\n\n# API Documentation\n\n[![Build](badge.svg)](https://ci.com)\n\n\u003e Important: This is a note\n\n[Click here to see the complete API documentation](./api.md)\n\napi = API.new\n```\n\n**After transformation (with default options):**\n\n```markdown\n# API Documentation\n\n[complete API documentation](./api.md)\n\napi = API.new\n```\n\n**Token reduction:** ~40-60% depending on configuration\n\n## RAG Enhancement Features\n\n### Heading Normalization\n\nTransform headings to include hierarchical context, making each section self-contained for RAG retrieval:\n\n**Before:**\n```markdown\n# Configuration\n## Consumer Settings\n### auto_offset_reset\n\nControls behavior when no offset exists...\n```\n\n**After (with `normalize_headings: true`):**\n```markdown\n# 
Configuration\n## Configuration / Consumer Settings\n### Configuration / Consumer Settings / auto_offset_reset\n\nControls behavior when no offset exists...\n```\n\n**Why this matters for RAG:** When documents are chunked and retrieved independently, each section retains full context. An LLM seeing just the `auto_offset_reset` section knows it's about \"Configuration / Consumer Settings / auto_offset_reset\" not just generic \"auto_offset_reset\".\n\n```yaml\n# Enable in config\nnormalize_headings: true\nheading_separator: \" / \"  # Customize separator (default: \" / \")\n```\n\n### Enhanced llms.txt Metadata\n\nGenerate enriched llms.txt files with token counts, timestamps, and priority labels to help AI agents make better decisions:\n\n**Standard llms.txt:**\n```markdown\n- [Getting Started](https://myproject.io/docs/Getting-Started.md)\n- [Configuration](https://myproject.io/docs/Configuration.md)\n```\n\n**Enhanced llms.txt (with metadata enabled):**\n```markdown\n- [Getting Started](https://myproject.io/docs/Getting-Started.md): Quick start guide (tokens:450, updated:2025-10-13, priority:high)\n- [Configuration](https://myproject.io/docs/Configuration.md): Configuration options (tokens:2800, updated:2025-10-12, priority:high)\n- [Advanced Topics](https://myproject.io/docs/Advanced.md): Deep dive topics (tokens:5200, updated:2025-09-15, priority:medium)\n```\n\n**Benefits:**\n- AI agents can see token counts → load multiple small docs vs one large doc\n- Timestamps help prefer recent documentation\n- Priority signals guide which docs to fetch first\n- Compression ratios show optimization effectiveness\n\n```yaml\n# Enable in config\ninclude_metadata: true      # Master switch\ninclude_tokens: true        # Show token counts\ninclude_timestamps: true    # Show last modified dates\ninclude_priority: true      # Show priority labels (high/medium/low)\ncalculate_compression: true # Show compression ratios (slower, requires transformation)\n```\n\n**Note:** 
Metadata appears inside each entry's description field, in parentheses with comma separators, so the output stays compliant with the llms.txt specification.\n\n### Multi-Section Organization\n\nDocuments are automatically organized into multiple sections based on priority, following the llms.txt specification:\n\n**Priority-based categorization:**\n- **Documentation** (priority 1-3): Essential docs like README, getting started guides, user guides\n- **Examples** (priority 4-5): Tutorials and example files\n- **Optional** (priority 6-7): Advanced topics and reference documentation\n\n**Example output:**\n```markdown\n# My Project\n\n\u003e Project description\n\n## Documentation\n\n- [README](README.md): Main documentation\n- [Getting Started](getting-started.md): Quick start guide\n\n## Examples\n\n- [Basic Tutorial](tutorial.md): Step-by-step tutorial\n- [Code Examples](examples.md): Example code\n\n## Optional\n\n- [Advanced Topics](advanced.md): Deep dive into advanced features\n- [API Reference](reference.md): Complete API reference\n```\n\nEmpty sections are automatically omitted. 
The \"Optional\" section aligns with the llms.txt spec for marking secondary content that can be skipped when context windows are limited.\n\n### Body Content\n\nAdd custom body content between the description and documentation sections:\n\n```yaml\n# llm-docs-builder.yml\ntitle: My Project\ndescription: Brief description\nbody: |\n  This framework is built on Ruby and focuses on performance.\n  Key concepts: streaming, batching, and parallel processing.\ndocs: ./docs\n```\n\nThis produces:\n```markdown\n# My Project\n\n\u003e Brief description\n\nThis framework is built on Ruby and focuses on performance.\nKey concepts: streaming, batching, and parallel processing.\n\n## Documentation\n...\n```\n\n## Advanced Compression Options\n\nAll compression features can be used individually for fine-grained control:\n\n### Content Removal Options\n\n- `remove_frontmatter: true` - Remove YAML/TOML metadata blocks\n- `remove_comments: true` - Remove HTML comments (`\u003c!-- ... --\u003e`)\n- `remove_badges: true` - Remove badge/shield images (CI badges, version badges, etc.)\n- `remove_images: true` - Remove all image syntax\n- `remove_code_examples: true` - Remove fenced code blocks, indented code, and inline code\n- `remove_blockquotes: true` - Remove blockquote formatting (preserves content)\n- `remove_duplicates: true` - Remove duplicate paragraphs using fuzzy matching\n- `remove_stopwords: true` - Remove common stopwords from prose (preserves code blocks)\n\n### Content Enhancement Options\n\n- `generate_toc: true` - Generate table of contents from headings with anchor links\n- `custom_instruction: \"text\"` - Inject AI context message at document top\n- `simplify_links: true` - Simplify verbose link text (e.g., \"Click here to see the docs\" → \"docs\")\n- `convert_urls: true` - Convert `.html`/`.htm` URLs to `.md` format\n- `normalize_whitespace: true` - Reduce excessive blank lines and remove trailing whitespace\n\n## License\n\nAvailable as open source under the 
[MIT License](https://opensource.org/licenses/MIT).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmensfeld%2Fllm-docs-builder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmensfeld%2Fllm-docs-builder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmensfeld%2Fllm-docs-builder/lists"}