{"id":33926690,"url":"https://github.com/ragaeeb/flappa-doormal","last_synced_at":"2026-04-29T03:08:14.175Z","repository":{"id":328088542,"uuid":"1058995332","full_name":"ragaeeb/flappa-doormal","owner":"ragaeeb","description":"https://flappa-doormal.surge.sh","archived":false,"fork":false,"pushed_at":"2026-04-28T22:14:28.000Z","size":2091,"stargazers_count":0,"open_issues_count":5,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-28T22:16:43.114Z","etag":null,"topics":["arabic","paragraphs","segmentation","segmenting"],"latest_commit_sha":null,"homepage":"https://mintlify.com/ragaeeb/flappa-doormal","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ragaeeb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-09-17T20:55:09.000Z","updated_at":"2026-04-28T22:14:32.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ragaeeb/flappa-doormal","commit_stats":null,"previous_names":["ragaeeb/flappa-doormal"],"tags_count":50,"template":false,"template_full_name":null,"purl":"pkg:github/ragaeeb/flappa-doormal","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ragaeeb%2Fflappa-doormal","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ragaeeb%2Fflappa-doormal/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ragaeeb%2Fflappa-doormal/releases","manifests_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ragaeeb%2Fflappa-doormal/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ragaeeb","download_url":"https://codeload.github.com/ragaeeb/flappa-doormal/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ragaeeb%2Fflappa-doormal/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32408492,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T02:37:21.628Z","status":"ssl_error","status_checked_at":"2026-04-29T02:36:50.947Z","response_time":110,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arabic","paragraphs","segmentation","segmenting"],"created_at":"2025-12-12T10:28:37.995Z","updated_at":"2026-04-29T03:08:14.169Z","avatar_url":"https://github.com/ragaeeb.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# flappa-doormal\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"icon.png\" alt=\"flappa-doormal\" width=\"128\" height=\"128\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eDeclarative Arabic text segmentation library\u003c/strong\u003e\u003cbr/\u003e\n  Split pages of content into logical segments using human-readable patterns.\n\u003c/p\u003e\n\n\u003cp 
align=\"center\"\u003e\n  \u003ca href=\"https://flappa-doormal.surge.sh\"\u003e🚀 \u003cstrong\u003eLive Demo\u003c/strong\u003e\u003c/a\u003e •\n  \u003ca href=\"https://www.npmjs.com/package/flappa-doormal\"\u003e📦 npm\u003c/a\u003e •\n  \u003ca href=\"https://github.com/ragaeeb/flappa-doormal\"\u003e📚 GitHub\u003c/a\u003e\n\u003c/p\u003e\n\n[![wakatime](https://wakatime.com/badge/user/a0b906ce-b8e7-4463-8bce-383238df6d4b/project/384fa29d-72e8-4078-980f-45d363f10507.svg)](https://wakatime.com/badge/user/a0b906ce-b8e7-4463-8bce-383238df6d4b/project/384fa29d-72e8-4078-980f-45d363f10507)\n[![Node.js CI](https://github.com/ragaeeb/flappa-doormal/actions/workflows/build.yml/badge.svg)](https://github.com/ragaeeb/flappa-doormal/actions/workflows/build.yml) ![GitHub License](https://img.shields.io/github/license/ragaeeb/flappa-doormal)\n![GitHub Release](https://img.shields.io/github/v/release/ragaeeb/flappa-doormal)\n[![Size](https://deno.bundlejs.com/badge?q=flappa-doormal@latest)](https://bundlejs.com/?q=flappa-doormal%40latest)\n![typescript](https://badgen.net/badge/icon/typescript?icon=typescript\u0026label\u0026color=blue)\n![npm](https://img.shields.io/npm/v/flappa-doormal)\n![npm](https://img.shields.io/npm/dm/flappa-doormal)\n![GitHub issues](https://img.shields.io/github/issues/ragaeeb/flappa-doormal)\n![GitHub stars](https://img.shields.io/github/stars/ragaeeb/flappa-doormal?style=social)\n[![codecov](https://codecov.io/gh/ragaeeb/flappa-doormal/graph/badge.svg?token=RQ2BV4M9IS)](https://codecov.io/gh/ragaeeb/flappa-doormal)\n[![npm version](https://badge.fury.io/js/flappa-doormal.svg)](https://badge.fury.io/js/flappa-doormal)\n\n## Why This Library?\n\n### The Problem\n\nWorking with Arabic hadith and Islamic text collections requires splitting continuous text into segments (individual hadiths, chapters, verses). 
This traditionally means:\n\n- Writing complex Unicode regex patterns: `^[\\u0660-\\u0669]+\\s*[-–—ـ]\\s*`\n- Handling diacritic variations: `حَدَّثَنَا` vs `حدثنا`\n- Managing multi-page spans and page boundary tracking\n- Manually extracting hadith numbers, volume/page references\n\n### What Exists\n\n- **General regex libraries**: Don't understand Arabic text nuances\n- **NLP tokenizers**: Overkill for pattern-based segmentation\n- **Manual regex**: Error-prone, hard to maintain, no metadata extraction\n\n### The Solution\n\n**flappa-doormal** provides:\n\n✅ **Readable templates**: `{{raqms}} {{dash}}` instead of cryptic regex  \n✅ **Named captures**: `{{raqms:hadithNum}}` auto-extracts to `meta.hadithNum`  \n✅ **Fuzzy matching**: Auto-enabled for `{{bab}}`, `{{kitab}}`, `{{basmalah}}`, `{{fasl}}`, `{{naql}}` (override with `fuzzy: false`)  \n✅ **Content limits**: `maxPages` and `maxContentLength` (safety-hardened) control segment size  \n✅ **Page tracking**: Know which page each segment came from  \n✅ **Declarative rules**: Describe *what* to match, not *how*\n\n## Installation\n\n```bash\nnpm install flappa-doormal\n# or\nbun add flappa-doormal\n# or\nyarn add flappa-doormal\n```\n\n## Quick Start\n\n```typescript\nimport { segmentPages } from 'flappa-doormal';\n\n// Your pages from a hadith book\nconst pages = [\n  { id: 1, content: '٦٦٩٦ - حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ...' },\n  { id: 1, content: '٦٦٩٧ - أَخْبَرَنَا عُمَرُ قَالَ...' },\n  { id: 2, content: '٦٦٩٨ - حَدَّثَنِي مُحَمَّدٌ...' 
},\n];\n\nconst segments = segmentPages(pages, {\n  rules: [{\n    lineStartsAfter: ['{{raqms:num}} {{dash}} '],\n    split: 'at',\n  }]\n});\n\n// Result:\n// [\n//   { content: 'حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ...', from: 1, meta: { num: '٦٦٩٦' } },\n//   { content: 'أَخْبَرَنَا عُمَرُ قَالَ...', from: 1, meta: { num: '٦٦٩٧' } },\n//   { content: 'حَدَّثَنِي مُحَمَّدٌ...', from: 2, meta: { num: '٦٦٩٨' } }\n// ]\n```\n\n## Segment Validation\n\nUse `validateSegments()` to sanity-check segmentation output against the input pages and options. This is useful for detecting page attribution issues or maxPages violations before sending segments to downstream systems.\n\n```typescript\nimport { segmentPages, validateSegments } from 'flappa-doormal';\n\nconst segments = segmentPages(pages, { rules, maxPages: 0 });\nconst report = validateSegments(pages, { rules, maxPages: 0 }, segments);\n\nif (!report.ok) {\n  console.log(report.summary);\n  console.log(report.issues[0]);\n}\n```\n\nExample issue entry (truncated):\n\n```json\n{\n  \"type\": \"page_attribution_mismatch\",\n  \"severity\": \"error\",\n  \"segmentIndex\": 2,\n  \"expected\": { \"from\": 5 },\n  \"actual\": { \"from\": 4 },\n  \"evidence\": \"Content found in page 5, but segment.from=4.\"\n}\n```\n\n## Features\n\n### 1. Template Tokens\n\nReplace regex with readable tokens:\n\n| Token | Matches | Regex Equivalent |\n|-------|---------|------------------|\n| `{{raqms}}` | Arabic-Indic digits | `[\\\\u0660-\\\\u0669]+` |\n| `{{raqm}}` | Single Arabic digit | `[\\\\u0660-\\\\u0669]` |\n| `{{nums}}` | ASCII digits | `\\\\d+` |\n| `{{num}}` | Single ASCII digit | `\\\\d` |\n| `{{dash}}` | Dash variants | `[-–—ـ]` |\n| `{{harf}}` | Arabic letter | `[أ-ي]` |\n| `{{harfs}}` | Single-letter codes separated by spaces, with optional marks/tatweel on each isolated letter | e.g. `د ت س ي ق`, `هـ ث` |\n| `{{rumuz}}` | Source abbreviations (rijāl/takhrīj rumuz), incl. multi-code blocks | e.g. 
`خت ٤`, `خ سي`, `خ فق`, `د ت سي ق`, `دت عس ق` |\n| `{{numbered}}` | Hadith numbering `٢٢ - ` | `{{raqms}} {{dash}} ` |\n| `{{fasl}}` | Section markers | `فصل\\|مسألة` |\n| `{{tarqim}}` | Punctuation marks | `[.!?؟؛]` |\n| `{{bullet}}` | Bullet points | `[•*°]` |\n| `{{newline}}` | Newline character | `\\n` |\n| `{{naql}}` | Narrator phrases | `حدثنا\\|أخبرنا\\|...` |\n| `{{kitab}}` | \"كتاب\" (book) | `كتاب` |\n| `{{bab}}` | \"باب\" (chapter) | `باب` |\n| `{{basmalah}}` | \"بسم الله\" | `بسم الله` |\n| `{{hr}}` | Horizontal rule (5+ chars) | `[-–—ـ_=]{5,}` |\n\n#### Token Details\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eStructural markers\u003c/b\u003e\u003c/summary\u003e\n\n- **`{{kitab}}`** – Matches \"كتاب\" (Book). Used in hadith collections to mark major book divisions. Example: `كتاب الإيمان` (Book of Faith).\n- **`{{bab}}`** – Matches \"باب\" (Chapter). Example: `باب ما جاء في الصلاة` (Chapter on what came regarding prayer).\n- **`{{fasl}}`** – Matches \"فصل\" or \"مسألة\" (Section/Issue). Common in fiqh books.\n- **`{{basmalah}}`** – Matches \"بسم الله\" or \"﷽\". 
Commonly appears at the start of chapters, books, or documents.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eTransmission phrases (\u003ccode\u003enaql\u003c/code\u003e)\u003c/b\u003e\u003c/summary\u003e\n\n**`{{naql}}`** matches common hadith transmission phrases:\n- حدثنا (he narrated to us)\n- أخبرنا (he informed us)\n- حدثني (he narrated to me)\n- وحدثنا (and he narrated to us)\n- أنبأنا (he reported to us)\n- سمعت (I heard)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eSource abbreviations (\u003ccode\u003erumuz\u003c/code\u003e)\u003c/b\u003e\u003c/summary\u003e\n\n**`{{rumuz}}`** matches rijāl/takhrīj source abbreviations used in narrator biography books:\n- **All six books**: ع\n- **The four Sunan**: ٤\n- **Bukhari**: خ / خت / خغ / بخ / عخ / ز / ي\n- **Muslim**: م / مق / مت\n- **Nasa'i**: س / ن / ص / عس / سي / كن\n- **Abu Dawud**: د / مد / قد / خد / ف / فد / ل / دل / كد / غد / صد\n- **Tirmidhi**: ت / تم\n- **Ibn Majah**: ق / فق\n\nMatches blocks of codes separated by whitespace (e.g., `خ سي`, `خ فق`, `خت ٤`, `د ت سي ق`).\n\n\u003e **Note**: Single-letter rumuz like `ع` are only matched when they appear as standalone codes, not as the first letter of words like `عَن`.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eDigits\u003c/b\u003e\u003c/summary\u003e\n\n| Token | Matches | Example |\n|-------|---------|---------|\n| `{{raqms}}` | One or more Arabic-Indic digits (٠-٩) | `٦٦٩٦` in `٦٦٩٦ - حدثنا` |\n| `{{raqm}}` | Single Arabic-Indic digit | `٥` |\n| `{{nums}}` | One or more ASCII digits (0-9) | `123` |\n| `{{num}}` | Single ASCII digit | `5` |\n| `{{numbered}}` | Common hadith format: `{{raqms}} {{dash}} ` | `٢٢ - حدثنا` |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eDash variants\u003c/b\u003e\u003c/summary\u003e\n\n**`{{dash}}`** matches:\n- `-` (hyphen-minus U+002D)\n- `–` (en-dash U+2013)\n- `—` (em-dash U+2014)\n- 
`ـ` (tatweel U+0640, Arabic elongation character)\n\nExample: `٦٦٩٦ - حدثنا` or `٦٦٩٦ ـ حدثنا`\n\n\u003c/details\u003e\n\n#### Token Constants (TypeScript)\n\nFor better IDE support, use the `Token` constants instead of raw strings:\n\n```typescript\nimport { Token, withCapture } from 'flappa-doormal';\n\n// Instead of:\n{ lineStartsWith: ['{{kitab}}', '{{bab}}'] }\n\n// Use:\n{ lineStartsWith: [Token.KITAB, Token.BAB] }\n\n// With named captures:\nconst pattern = withCapture(Token.RAQMS, 'hadithNum') + ' ' + Token.DASH + ' ';\n// Result: '{{raqms:hadithNum}} {{dash}} '\n\n{ lineStartsAfter: [pattern], split: 'at' }\n// segment.meta.hadithNum will contain the matched number\n```\n\nAvailable constants: `Token.BAB`, `Token.BASMALAH`, `Token.BULLET`, `Token.DASH`, `Token.FASL`, `Token.HARF`, `Token.HARFS`, `Token.HR`, `Token.KITAB`, `Token.NAQL`, `Token.NUM`, `Token.NUMS`, `Token.NUMBERED`, `Token.RAQM`, `Token.RAQMS`, `Token.RUMUZ`, `Token.TARQIM`\n\n\n### 2. Named Capture Groups\n\nExtract metadata automatically with the `{{token:name}}` syntax:\n\n```typescript\n// Capture hadith number\n{ template: '^{{raqms:hadithNum}} {{dash}} ' }\n// Result: meta.hadithNum = '٦٦٩٦'\n\n// Capture volume and page\n{ template: '^{{raqms:vol}}/{{raqms:page}} {{dash}} ' }\n// Result: meta.vol = '٣', meta.page = '٤٥٦'\n\n// Capture rest of content\n{ template: '^{{raqms:num}} {{dash}} {{:text}}' }\n// Result: meta.num = '٦٦٩٦', meta.text = 'حَدَّثَنَا أَبُو بَكْرٍ'\n```\n\n### 3. Fuzzy Matching (Diacritic-Insensitive)\n\nMatch Arabic text regardless of harakat:\n\n```typescript\nconst rules = [{\n  fuzzy: true,\n  lineStartsAfter: ['{{kitab:book}} '],\n  split: 'at',\n}];\n\n// Matches both:\n// - 'كِتَابُ الصلاة' (with diacritics)\n// - 'كتاب الصيام' (without diacritics)\n```\n\n### 4. Pattern Types\n\n| Type | Marker in content? 
| Use case |\n|------|-------------------|----------|\n| `lineStartsWith` | ✅ Included | Keep marker, segment at boundary |\n| `lineStartsAfter` | ❌ Excluded | Strip marker, capture only content |\n| `lineEndsWith` | ✅ Included | Match patterns at end of line |\n| `template` | Depends | Custom pattern with full control |\n| `regex` | Depends | Raw regex for complex cases |\n| `dictionaryEntry` | ✅ Included | Serializable Arabic dictionary headword rule |\n\n#### Building UIs with Pattern Type Keys\n\nThe library exports `PATTERN_TYPE_KEYS` (a const array) and `PatternTypeKey` (a type) for building UIs that let users select pattern types:\n\n```typescript\nimport { PATTERN_TYPE_KEYS, type PatternTypeKey } from 'flappa-doormal';\n\n// PATTERN_TYPE_KEYS = ['lineStartsWith', 'lineStartsAfter', 'lineEndsWith', 'template', 'regex', 'dictionaryEntry']\n\n// Build a dropdown/select\nPATTERN_TYPE_KEYS.map(key =\u003e \u003coption value={key}\u003e{key}\u003c/option\u003e)\n\n// Type-safe validation\nconst isPatternKey = (k: string): k is PatternTypeKey =\u003e\n  (PATTERN_TYPE_KEYS as readonly string[]).includes(k);\n```\n\n### 4.1 Page-start Guard (avoid page-wrap false positives)\n\nWhen matching at line starts (e.g., `{{naql}}`), a new page can begin with a marker that is actually a **continuation** of the previous page (page wrap), not a true new segment.\n\nUse `pageStartGuard` to allow a rule to match at the start of a page **only if** the previous page’s last non-whitespace character matches a pattern (tokens supported):\n\n```typescript\nconst segments = segmentPages(pages, {\n  rules: [{\n    fuzzy: true,\n    lineStartsWith: ['{{naql}}'],\n    split: 'at',\n    // Only allow a split at the start of a new page if the previous page ended with sentence punctuation:\n    pageStartGuard: '{{tarqim}}'\n  }]\n});\n```\n\nThis guard applies **only at page starts**. 
Mid-page line starts are unaffected.\n\n#### Previous-Word Page-Start Stoplist\n\nFor dictionary-like content, page wraps can split a phrase across pages and create\nfalse positives at the top of the next page. Example:\n\n- Page N ends with `قال`\n- Page N+1 starts with `العجاج:`\n\nUse `pageStartPrevWordStoplist` to suppress page-start matches when the previous\npage's last Arabic word is in a stoplist. Matching is Arabic-normalized and\ndiacritic-insensitive.\n\n```typescript\nconst segments = segmentPages(pages, {\n  rules: [{\n    regex: '^(?\u003clemma\u003e[ء-غف-ي]+):',\n    split: 'at',\n    pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال']\n  }]\n});\n```\n\nIf the previous page ends with strong sentence punctuation (`.`, `!`, `?`, `؟`, `؛`),\nthe stoplist guard is skipped and the page-start match is allowed.\n\n#### Arabic Dictionary Helper\n\nUse `createArabicDictionaryEntryRule()` to build a conservative rule for Arabic\ndictionaries with lemma capture, stopword filtering, and page-wrap protection.\nThe helper now returns a serializable native `dictionaryEntry` rule rather than\nan eagerly-compiled regex blob:\n\n```typescript\nimport { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';\n\nconst rule = createArabicDictionaryEntryRule({\n  stopWords: ['وقيل', 'ويقال', 'قال', 'العجاج', 'أخاك'],\n  pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال'],\n  samePagePrevWordStoplist: ['جل'],\n  // Optional dictionary-specific shapes:\n  allowParenthesized: true,         // e.g. (عنبر) :\n  allowWhitespaceBeforeColon: true, // e.g. عنبر :\n  allowCommaSeparated: true,        // e.g. 
سبد، دبس:\n  midLineSubentries: false,         // line/page starts only\n});\n\nconst segments = segmentPages(pages, { rules: [rule] });\n```\n\nEquivalent direct JSON-authored rule:\n\n```typescript\nconst rule = {\n  dictionaryEntry: {\n    stopWords: ['وقيل', 'ويقال', 'قال', 'العجاج', 'أخاك'],\n    allowParenthesized: true,\n    allowWhitespaceBeforeColon: true,\n    allowCommaSeparated: true,\n    midLineSubentries: false,\n  },\n  pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال'],\n  samePagePrevWordStoplist: ['جل'],\n  meta: { type: 'entry' },\n};\n```\n\nBehavior:\n- Keeps the lemma marker in `segment.content`\n- Stores the matched lemma in `segment.meta.lemma`\n- Matches root entries at true line/page starts like `عز:` and `لع:`\n- Matches mid-line subentries conservatively when they begin with `و`\n- Supports disabling mid-line subentries entirely with `midLineSubentries: false`\n- Can match parenthesized headwords like `(عنبر) :` when enabled\n- Can match comma-separated headword lists like `سبد، دبس:` when enabled\n- Can suppress same-page false positives like `جلّ وعزّ:` with `samePagePrevWordStoplist`\n\n#### Dictionary Letter-Code Lines\n\nFor dictionary-specific letter-code lines like `ك ش ن` or `(هـ ث)`, use\n`{{harfs}}` and decide the metadata shape in client code:\n\n```typescript\nimport { getTokenPattern, segmentPages } from 'flappa-doormal';\n\nconst harfCodes = getTokenPattern('harfs').replaceAll('\\\\s+', '[ \\\\t]+');\n\nconst segments = segmentPages(pages, {\n  rules: [{\n    regex: `^(?:\\\\((?\u003churuf\u003e${harfCodes})\\\\)|(?\u003churuf\u003e${harfCodes}))$`,\n    split: 'at',\n    meta: { type: 'C' },\n  }],\n});\n```\n\nHere `huruf` is just a named capture group chosen by the client, not a built-in\nregex primitive.\n\nThis client-side rule can be used for:\n- chapter-adjacent code lines like `(هـ ث)`\n- consecutive bare code lines like `س ط ب` then `س د ر`\n\nThe `replaceAll('\\\\s+', '[ \\\\t]+')` step is intentional:\n- 
`{{harfs}}` itself uses `\\s+`\n- but when embedding it in a raw full-line regex, horizontal whitespace is usually\n  safer than unrestricted `\\s+`, because it prevents accidental matching across\n  newlines\n\n### 5. Auto-Escaping Brackets\n\nIn `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, and `template` patterns, parentheses `()` and square brackets `[]` are **automatically escaped**. This means you can write intuitive patterns without manual escaping:\n\n```typescript\n// Write this (clean and readable):\n{ lineStartsAfter: ['({{harf}}): '], split: 'at' }\n\n// Instead of this (verbose escaping):\n{ lineStartsAfter: ['\\\\({{harf}}\\\\): '], split: 'at' }\n```\n\n**Important**: Brackets inside `{{tokens}}` are NOT escaped - token patterns like `{{harf}}` which expand to `[أ-ي]` work correctly.\n\nFor full regex control (character classes, capturing groups), use the `regex` pattern type which does NOT auto-escape:\n\n```typescript\n// Character class [أب] matches أ or ب\n{ regex: '^[أب] ', split: 'at' }\n\n// Capturing group (test|text) matches either\n{ regex: '^(test|text) ', split: 'at' }\n\n// Named capture groups extract metadata from raw regex too!\n{ regex: '^(?\u003cnum\u003e[٠-٩]+)\\\\s+[أ-ي\\\\s]+:\\\\s*(.+)' }\n// meta.num = matched number, content = captured (.+) group\n```\n\n### 6. Page Constraints\n\nLimit rules to specific page ranges:\n\n```typescript\n{\n  lineStartsWith: ['## '],\n  split: 'at',\n  min: 10,    // Only pages 10+\n  max: 100,   // Only pages up to 100\n}\n```\n\n### 7. 
Max Content Length (Safety Hardened)\n\nSplit oversized segments based on character count:\n\n```typescript\n{\n  maxContentLength: 500, // Split after 500 characters\n  prefer: 'longer',      // Try to fill the character bucket\n  breakpoints: ['\\\\.'], // Recommended: split on punctuation within window\n}\n```\n\nThe library implements **safety hardening** for character-based splits:\n- **Safe Fallback**: If no breakpoint matches, it searches backward up to 100 characters for a delimiter (whitespace or punctuation) to avoid chopping words.\n- **Unicode Safety**: Automatically prevents splitting inside Unicode surrogate pairs (e.g., emojis), preventing text corruption.\n- **Validation**: `maxContentLength` must be at least **50**.\n\n### 7.1 Preprocessing\n\nApply text normalization transforms **before** segmentation rules are evaluated:\n\n```typescript\nsegmentPages(pages, {\n  preprocess: [\n    'removeZeroWidth',    // Strip invisible Unicode control characters\n    'condenseEllipsis',   // \"...\" → \"…\" (prevents {{tarqim}} false matches)\n    'fixTrailingWaw',     // \" و \" → \" و\" (joins waw to next word)\n  ],\n  rules: [...],\n});\n```\n\n**Available transforms:**\n\n| Transform | Effect | Use Case |\n|-----------|--------|----------|\n| `removeZeroWidth` | Strips U+200B–U+200F, U+202A–U+202E, U+2060–U+2064, U+FEFF | Invisible chars interfering with patterns |\n| `condenseEllipsis` | `...` → `…` | Prevent `{{tarqim}}` matching inside ellipsis |\n| `fixTrailingWaw` | ` و ` → ` و` | Fix OCR artifacts with detached waw |\n\n**Page constraints:**\n\n```typescript\npreprocess: [\n  'removeZeroWidth',                              // All pages\n  { type: 'condenseEllipsis', min: 100 },        // Pages 100+\n  { type: 'fixTrailingWaw', min: 50, max: 500 }, // Pages 50-500\n]\n```\n\n**`removeZeroWidth` modes:**\n\n```typescript\n// Default: strip entirely\n{ type: 'removeZeroWidth', mode: 'strip' }\n\n// Alternative: replace with space (preserves word 
boundaries)\n// Note: Won't insert space after existing whitespace (space, newline, tab)\n{ type: 'removeZeroWidth', mode: 'space' }\n```\n\n### 8. Advanced Structural Filters\n\nRefine rule matching with page-specific constraints:\n\n```typescript\n{\n  lineStartsWith: ['### '],\n  split: 'at',\n  // Range constraints\n  min: 10,    // Only match on pages 10 and above\n  max: 500,   // Only match on pages 500 and below\n  exclude: [50, [100, 110]], // Skip page 50 and range 100-110\n\n  // Negative lookahead: skip rule if content matches this pattern\n  // (e.g. skip chapter marker if it appears inside a table/list)\n  skipWhen: '^\\\\s*- ', \n}\n```\n\n### 9. Debugging \u0026 Logging\n\nPass an optional `logger` to trace segmentation decisions or enable `debug` to attach match metadata to segments:\n\n```typescript\nconst segments = segmentPages(pages, {\n  rules: [...],\n  debug: true, // Enables detailed match metadata\n  logger: {\n    debug: (msg, data) =\u003e console.log(`[DEBUG] ${msg}`, data),\n    info: (msg, data) =\u003e console.info(`[INFO] ${msg}`, data),\n    warn: (msg, data) =\u003e console.warn(`[WARN] ${msg}`, data),\n    error: (msg, data) =\u003e console.error(`[ERROR] ${msg}`, data),\n  }\n});\n\n// Helper to format debug reason\n// import { getSegmentDebugReason } from 'flappa-doormal';\n// console.log(getSegmentDebugReason(segments[0])); // \"Rule #0 (lineStartsWith) [idx:2] (Matched: '{{naql}}')\"\n```\n\n#### Debug Metadata (`_flappa`)\n\nWhen `debug: true` is enabled, the library attaches a `_flappa` object to each segment's `meta` property. 
This is extremely useful for understanding exactly why a segment was created and which pattern matched.\n\nThe metadata includes different fields based on the split reason:\n\n**1. Rule-based Splits**\nIf a segment was created by one of your `rules`:\n```json\n{\n  \"meta\": {\n    \"_flappa\": {\n      \"rule\": {\n        \"index\": 0,                // Index of the rule in your rules array\n        \"patternType\": \"lineStartsWith\", // The type of pattern that matched\n        \"wordIndex\": 2,            // Index of the specific pattern in the array\n        \"word\": \"{{naql}}\"         // The specific pattern string that matched\n      }\n    }\n  }\n}\n```\n\n**2. Breakpoint-based Splits**\nIf a segment was created by a `breakpoint` pattern (e.g. because it exceeded `maxPages` or `maxContentLength`):\n```json\n{\n  \"meta\": {\n    \"_flappa\": {\n      \"breakpoint\": {\n        \"index\": 0,         // Index of the breakpoint in your array\n        \"pattern\": \"\\\\.\",   // The pattern (or `regex`) that matched\n        \"kind\": \"pattern\",  // \"pattern\", \"regex\", or \"pageBoundary\"\n        \"wordIndex\": 1,     // Index in `words` array (if using `words` field)\n        \"word\": \"ثم \"       // The specific word that matched\n      }\n    }\n  }\n}\n```\n\n**3. Safety Fallback Splits (`maxContentLength`)**\nIf no rule or breakpoint matched and the library was forced to perform a safety fallback split:\n```json\n{\n  \"meta\": {\n    \"_flappa\": {\n      \"contentLengthSplit\": {\n        \"maxContentLength\": 5000,\n        \"splitReason\": \"whitespace\" // \"whitespace\", \"unicode_boundary\", or \"grapheme_cluster\"\n      }\n    }\n  }\n}\n```\n*   `whitespace`: Found a safe space/newline to split at.\n*   `unicode_boundary`: No whitespace found, split at a safe character boundary (avoiding surrogate pairs).\n*   `grapheme_cluster`: Split at a grapheme boundary (avoiding diacritic/ZWJ corruption).\n\n### 10. 
Page Joiners\n\nControl how text from different pages is stitched together:\n\n```typescript\n// Default: space ' ' joiner\n// Result: \"...end of page 1. Start of page 2...\"\nsegmentPages(pages, { pageJoiner: 'space' });\n\n// Result: \"...end of page 1.\\nStart of page 2...\"\nsegmentPages(pages, { pageJoiner: 'newline' });\n```\n\n### 11. Breakpoint Preferences\n\nWhen a segment exceeds `maxPages` or `maxContentLength`, breakpoints split it at the \"best\" available match:\n\n```typescript\n{\n  maxPages: 1, // Maximum page span per segment\n  breakpoints: ['{{tarqim}}'],\n  \n  // 'longer' (default): Greedy. Finds the match furthest in the window.\n  // Result: Segments stay close to the max limit.\n  prefer: 'longer', \n\n  // 'shorter': Conservative. Finds the first available match.\n  // Result: Segments split as early as possible.\n  // prefer: 'shorter',\n}\n```\n\n#### Breakpoint Pattern Behavior\n\nWhen a breakpoint pattern matches, the split position is controlled by the `split` option:\n\n\u003e ⚠️ **Split Defaults Differ**: Rules default to `split: 'at'`, while Breakpoints default to `split: 'after'`.\n\n```typescript\n{\n  breakpoints: [\n    // Default: split AFTER the match (match included in previous segment)\n    { pattern: '{{tarqim}}' },  // or { pattern: '{{tarqim}}', split: 'after' }\n    \n    // Alternative: split AT the match (match starts next segment)\n    { pattern: 'ولهذا', split: 'at' },\n  ],\n}\n```\n\n**`split: 'after'` (default)**\n- Previous segment **ENDS WITH** the matched text\n- New segment **STARTS AFTER** the matched text\n\n```typescript\n// Pattern \"ولهذا\" with split: 'after' on \"النص الأول ولهذا النص الثاني\"\n// - Segment 1: \"النص الأول ولهذا\"  (ends WITH match)\n// - Segment 2: \"النص الثاني\"        (starts AFTER match)\n```\n\n**`split: 'at'`**\n- Previous segment **ENDS BEFORE** the matched text\n- New segment **STARTS WITH** the matched text\n\n```typescript\n// Pattern \"ولهذا\" with split: 'at' on \"النص 
الأول ولهذا النص الثاني\"\n// - Segment 1: \"النص الأول\"         (ends BEFORE match)\n// - Segment 2: \"ولهذا النص الثاني\"  (starts WITH match)\n```\n\n\u003e **Note**: For empty pattern `''` (page boundary fallback), `split` is ignored since there is no matched text to include/exclude.\n\n**Pattern order matters** - the first matching pattern wins:\n\n```typescript\n{\n  // Patterns are tried in order\n  breakpoints: [\n    '\\\\.',        // Try punctuation first (no need for \\\\s* - segments are trimmed)\n    'ولهذا',      // Then try specific word\n    '',           // Finally, fall back to page boundary\n  ],\n}\n// If punctuation is found, \"ولهذا\" is never tried\n```\n\n\u003e **Note on lookahead patterns**: Zero-length patterns like `(?=X)` are not supported for breakpoints because they can cause non-progress scenarios. Use `{ pattern: 'X', split: 'at' }` instead to achieve \"split before X\" behavior.\n\n\u003e **Note on whitespace**: Segments are trimmed by default. With `split: 'at'`, if the match consists only of whitespace, it will be trimmed from the start of the next segment. This is usually desirable for delimiter patterns.\n\n\u003e **Tip: `\\s*` after punctuation is redundant**: Because segments are trimmed, `{{tarqim}}\\s*` produces **identical output** to `{{tarqim}}`. The trailing whitespace captured by `\\s*` gets trimmed anyway. 
Save yourself the extra characters!\n\n#### `pattern` vs `regex` Field\n\nBreakpoints support two pattern fields:\n\n| Field | Bracket escaping | Use case |\n|-------|-----------------|----------|\n| `pattern` | `()[]` auto-escaped | Simple patterns, token-friendly |\n| `regex` | None (raw regex) | Complex regex with groups, lookahead |\n\n```typescript\n// Use `pattern` for simple patterns (brackets are auto-escaped)\n{ pattern: '(a)', split: 'after' }   // Matches literal \"(a)\"\n{ pattern: '{{tarqim}}', split: 'after' }  // Token expansion works\n\n// Use `regex` for complex patterns with regex groups\n{ regex: '\\\\s+(?:ولهذا|وكذلك|فلذلك)', split: 'at' }  // Non-capturing group\n{ regex: '{{tarqim}}', split: 'after' }  // Tokens work in regex too!\n```\n\nIf both `pattern` and `regex` are specified, `regex` takes precedence.\n\n#### ⚠️ Mid-Word Matching Caveat\n\nBreakpoint patterns match **substrings**, not whole words. A pattern like `ولهذا` will match inside `مَولهذا`, causing a mid-word split:\n\n```typescript\n// Content: \"النص الأول مَولهذا النص\"\n// Pattern: { pattern: 'ولهذا', split: 'at' }\n// Result: \n// - Segment 1: \"النص الأول مَ\"  ← orphaned letter!\n// - Segment 2: \"ولهذا النص\"\n```\n\n**Solution**: Require whitespace before the pattern to ensure whole-word matching:\n\n```typescript\n// Single word - require preceding whitespace\n{ pattern: '\\\\s+ولهذا', split: 'at' }\n\n// Multiple words using alternation - each needs whitespace prefix\n{ pattern: '\\\\s+(?:ولهذا|وكذلك|فلذلك)', split: 'at' }\n```\n\n\u003e **Why not `\\b`?** JavaScript's `\\b` word boundary **does not work** with Arabic text. Since Arabic letters aren't considered \"word characters\" (`\\w` = `[a-zA-Z0-9_]`), using `\\b` will match **nothing** - not even standalone words. 
Always use `\\s+` prefix instead.\n\n#### The `words` Field (Simplified Word Breakpoints)\n\nFor breaking on multiple words, the `words` field provides a simpler syntax with automatic whitespace boundaries:\n\n```typescript\n{\n  breakpoints: [\n    // Instead of manually writing:\n    // { regex: '\\\\s+(?:فهذا|ثم|أقول)', split: 'at' }\n    \n    // Use the `words` field:\n    { words: ['فهذا', 'ثم', 'أقول'], min: 100 }\n  ],\n}\n```\n\n**Features:**\n- **Automatic `\\s+` prefix** so matches begin at the start of a word\n- **Defaults to `split: 'at'`** (can be overridden)\n- **Metacharacters auto-escaped** (literals match literally)\n- **Tokens supported** (`{{naql}}` expands as usual)\n- **Longest match first** (words sorted by length descending)\n\n```typescript\n// Override split behavior\n{ words: ['والله أعلم'], split: 'after' }  // Include phrase in previous segment\n\n// Use tokens in words\n{ words: ['{{naql}}', 'وكذلك'] }  // Token expansion works\n\n// Note: `words` cannot be combined with `pattern` or `regex`\n// Note: Empty `words: []` is filtered out (no-op), NOT treated as page-boundary fallback\n```\n\n**⚠️ Partial Word Matching**: The `words` field matches text that *starts with* the word, not complete words only. For example, `words: ['ثم']` will also match `ثمامة` (a name starting with ثم).\n\nTo match only complete words, add a **trailing space**:\n\n```typescript\n// ❌ Matches 'ثم' at the start of any word, including 'ثمامة'\n{ words: ['فهذا', 'ثم', 'أقول'] }\n\n// ✅ Matches only standalone words followed by space\n{ words: ['فهذا ', 'ثم ', 'أقول '] }\n```\n\n**Security note (ReDoS)**: Breakpoints (and raw `regex` rules) compile user-provided regular expressions. **Do not accept untrusted patterns** (e.g. from end users) without validation/sandboxing; some regexes can trigger catastrophic backtracking and hang the process.\n\n### 12. 
Occurrence Filtering\n\nControl which matches to use:\n\n```typescript\n{\n  lineEndsWith: ['\\\\.'],\n  split: 'after',\n  occurrence: 'last',  // Only split at LAST period on page\n}\n```\n\n## Use Cases\n\n### Simple Hadith Segmentation\n\nUse `{{numbered}}` for the common \"number - content\" format:\n\n```typescript\nconst segments = segmentPages(pages, {\n  rules: [{\n    lineStartsAfter: ['{{numbered}}'],\n    split: 'at',\n    meta: { type: 'hadith' }\n  }]\n});\n\n// Matches: ٢٢ - حدثنا, ٦٦٩٦ – أخبرنا, etc.\n// Content starts AFTER the number and dash\n```\n\n### Hadith Segmentation with Number Extraction\n\nFor capturing the hadith number, use explicit capture syntax:\n\n```typescript\nconst segments = segmentPages(pages, {\n  rules: [{\n    lineStartsAfter: ['{{raqms:hadithNum}} {{dash}} '],\n    split: 'at',\n    meta: { type: 'hadith' }\n  }]\n});\n\n// Each segment has:\n// - content: The hadith text (without number prefix)\n// - from/to: Page range\n// - meta: { type: 'hadith', hadithNum: '٦٦٩٦' }\n```\n\n### Volume/Page Reference Extraction\n\n```typescript\nconst segments = segmentPages(pages, {\n  rules: [{\n    lineStartsAfter: ['{{raqms:vol}}/{{raqms:page}} {{dash}} '],\n    split: 'at'\n  }]\n});\n\n// meta: { vol: '٣', page: '٤٥٦' }\n```\n\n### Chapter Detection with Fuzzy Matching\n\n```typescript\nconst segments = segmentPages(pages, {\n  rules: [{\n    fuzzy: true,\n    lineStartsAfter: ['{{kitab:book}} '],\n    split: 'at',\n    meta: { type: 'chapter' }\n  }]\n});\n\n// Matches \"كِتَابُ\" or \"كتاب\" regardless of diacritics\n```\n\n### Naql (Transmission) Phrase Detection\n\n```typescript\nconst segments = segmentPages(pages, {\n  rules: [{\n    fuzzy: true,\n    lineStartsWith: ['{{naql:phrase}}'],\n    split: 'at'\n  }]\n});\n\n// meta.phrase captures which narrator phrase was matched:\n// 'حدثنا', 'أخبرنا', 'حدثني', etc.\n```\n\n### Mixed Captured and Non-Captured Tokens\n\n```typescript\n// Only capture the number, not the 
letter\nconst segments = segmentPages(pages, {\n  rules: [{\n    lineStartsWith: ['{{raqms:num}} {{harf}} {{dash}} '],\n    split: 'at'\n  }]\n});\n\n// Input: '٥ أ - البند الأول'\n// meta: { num: '٥' }  // harf not captured (no :name suffix)\n```\n\n### Narrator Abbreviation Codes\n\nUse `{{rumuz}}` for matching rijāl/takhrīj source abbreviations (common in narrator biography books and takhrīj notes):\n\n```typescript\nconst segments = segmentPages(pages, {\n  rules: [{\n    lineStartsAfter: ['{{raqms:num}} {{rumuz}}:'],\n    split: 'at'\n  }]\n});\n\n// Matches: ١١١٨ ع: ...   /   ١١١٨ خ سي: ...  /  ١١١٨ خ فق: ...\n// meta: { num: '١١١٨' }\n// content: '...' (rumuz stripped)\n```\n\n**Supported codes**: Single-letter (`ع`, `خ`, `م`, `د`, etc.), two-letter (`خت`, `عس`, `سي`, etc.), digit `٤`, and the word `تمييز` (used in jarḥ wa taʿdīl books).\n\n\u003e **Note**: Single-letter rumuz like `ع` are only matched when they appear as standalone codes, not as the first letter of words like `عَن`. The pattern is diacritic-safe.\n\nIf your data uses *only single-letter codes separated by spaces* (e.g., `د ت س ي ق`), you can also use `{{harfs}}`.\n\n## Analysis Helpers (no LLM required)\n\nUse `analyzeCommonLineStarts(pages)` to discover common line-start signatures across a book, useful for rule authoring:\n\n```typescript\nimport { analyzeCommonLineStarts } from 'flappa-doormal';\n\nconst patterns = analyzeCommonLineStarts(pages);\n// [{ pattern: \"{{numbered}}\", count: 1234, examples: [...] }, ...]\n```\n\nYou can control **what gets analyzed** and **how results are ranked**:\n\n```typescript\nimport { analyzeCommonLineStarts } from 'flappa-doormal';\n\n// Top 20 most common line-start signatures (by frequency)\nconst topByCount = analyzeCommonLineStarts(pages, {\n  sortBy: 'count',\n  topK: 20,\n});\n\n// Only analyze markdown H2 headings (lines beginning with \"##\")\n// This shows what comes AFTER the heading marker (e.g. 
\"## {{bab}}\", \"## {{numbered}}\\\\[\", etc.)\nconst headingVariants = analyzeCommonLineStarts(pages, {\n  lineFilter: (line) =\u003e line.startsWith('##'),\n  sortBy: 'count',\n  topK: 40,\n});\n\n// Support additional prefix styles without changing library code\n// (e.g. markdown blockquotes \"\u003e\u003e ...\" + headings)\nconst quotedHeadings = analyzeCommonLineStarts(pages, {\n  lineFilter: (line) =\u003e line.startsWith('\u003e') || line.startsWith('#'),\n  prefixMatchers: [/^\u003e+/u, /^#+/u],\n  sortBy: 'count',\n  topK: 40,\n});\n```\n\nKey options:\n- `sortBy`: `'specificity'` (default) or `'count'` (highest frequency first)\n- `lineFilter`: restrict which lines are counted (e.g. only headings)\n- `prefixMatchers`: consume syntactic prefixes (default includes headings via `/^#+/u`) so you can see variations *after* the prefix\n- `normalizeArabicDiacritics`: `true` by default (helps token matching like `وأَخْبَرَنَا` → `{{naql}}`)\n- `whitespace`: how whitespace is represented in returned patterns:\n  - `'regex'` (default): uses `\\\\s*` placeholders between tokens\n  - `'space'`: uses literal single spaces (`' '`) between tokens (useful if you don't want `\\\\s` to later match newlines when reusing these patterns)\n\n**Note on brackets in returned patterns**:\n- `analyzeCommonLineStarts()` returns **template-like signatures**, not “ready-to-run regex”.\n- It intentionally **does not escape literal `()` / `[]`** in the returned `pattern` (e.g. 
`(ح)` stays `(ح)`).\n- If you paste these signatures into `lineStartsWith` / `lineStartsAfter` / `template`, that’s fine: those template pattern types **auto-escape `()[]`** outside `{{tokens}}`.\n- If you paste them into a raw `regex` rule, you may need to escape literal brackets yourself.\n\n### Repeating Sequence Analysis (continuous text)\n\nFor texts without line breaks (continuous prose), use `analyzeRepeatingSequences()`:\n\n```typescript\nimport { analyzeRepeatingSequences } from 'flappa-doormal';\n\nconst patterns = analyzeRepeatingSequences(pages, {\n  minElements: 2,\n  maxElements: 4,\n  minCount: 3,\n  topK: 20,\n});\n// [{ pattern: \"{{naql}}\\\\s*{{harf}}\", count: 42, examples: [...] }, ...]\n```\n\nKey options:\n- `minElements` / `maxElements`: N-gram size range (default 1-3)\n- `minCount`: Minimum occurrences to include (default 3)\n- `topK`: Maximum patterns to return (default 20)\n- `requireToken`: Only patterns containing `{{tokens}}` (default true)\n- `normalizeArabicDiacritics`: Ignore diacritics when matching (default true)\n\n## Analysis → Segmentation Workflow\n\nUse analysis functions to discover patterns, then pass to `segmentPages()`.\n\n### Example A: Continuous Text (No Punctuation)\n\nFor prose-like text without structural line breaks:\n\n```typescript\nimport { analyzeRepeatingSequences, segmentPages, type Page } from 'flappa-doormal';\n\n// Continuous Arabic text with narrator phrases\nconst pages: Page[] = [\n  { id: 1, content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي حدثنا خالد بن زيد عن علي' },\n  { id: 2, content: 'حدثنا سعيد بن جبير عن ابن عباس أخبرنا يوسف عن أنس' },\n];\n\n// Step 1: Discover repeating patterns\nconst patterns = analyzeRepeatingSequences(pages, { minCount: 2, topK: 10 });\n// [{ pattern: '{{naql}}', count: 5, examples: [...] 
}, ...]\n\n// Step 2: Build rules from discovered patterns\nconst rules = patterns.filter(p =\u003e p.count \u003e= 3).map(p =\u003e ({\n  lineStartsWith: [p.pattern],\n  split: 'at' as const,\n  fuzzy: true,\n}));\n\n// Step 3: Segment\nconst segments = segmentPages(pages, { rules });\n// [{ content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي', from: 1 }, ...]\n```\n\n### Example B: Structured Text (With Numbering)\n\nFor hadith-style numbered entries:\n\n```typescript\nimport { analyzeCommonLineStarts, segmentPages, type Page } from 'flappa-doormal';\n\n// Numbered hadith text\nconst pages: Page[] = [\n  { id: 1, content: '٦٦٩٦ - حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ\\n٦٦٩٧ - أَخْبَرَنَا عُمَرُ قَالَ' },\n  { id: 2, content: '٦٦٩٨ - حَدَّثَنِي مُحَمَّدٌ عَنْ عَائِشَةَ' },\n];\n\n// Step 1: Discover common line-start patterns\nconst patterns = analyzeCommonLineStarts(pages, { topK: 10, minCount: 2 });\n// [{ pattern: '{{raqms}}\\\\s*{{dash}}', count: 3, examples: [...] }, ...]\n\n// Step 2: Build rules (add named capture for hadith number)\nconst topPattern = patterns[0]?.pattern ?? 
'{{raqms}} {{dash}} ';\nconst rules = [{\n  lineStartsAfter: [topPattern.replace('{{raqms}}', '{{raqms:num}}')],\n  split: 'at' as const,\n  meta: { type: 'hadith' }\n}];\n\n// Step 3: Segment\nconst segments = segmentPages(pages, { rules });\n// [\n//   { content: 'حَدَّثَنَا أَبُو بَكْرٍ...', from: 1, meta: { type: 'hadith', num: '٦٦٩٦' } },\n//   { content: 'أَخْبَرَنَا عُمَرُ قَالَ', from: 1, meta: { type: 'hadith', num: '٦٦٩٧' } },\n//   { content: 'حَدَّثَنِي مُحَمَّدٌ...', from: 2, meta: { type: 'hadith', num: '٦٦٩٨' } },\n// ]\n```\n\n## Advanced: Metadata Extraction \u0026 Data Migration\n\nIf you already have pre-segmented data (e.g., records from a database or JSON file) and want to use **flappa-doormal's** token system to extract metadata and clean the content without further splitting, you can use the **Metadata Extraction** pattern.\n\nBy setting `maxPages: 0`, you guarantee a **1:1 mapping**: each input page produces exactly one output segment, regardless of how much text is on the page.\n\n### Example: Extracting multiple fields from pre-split records\n\n```typescript\nimport { segmentPages, type Page } from 'flappa-doormal';\n\nconst excerpts = [\n  { nass: '٧٠١٦ - ١ - ١ - فَقَصَّتْهَا حَفْصَةُ', id: 1 },\n  { nass: '٧٠١٧ (أ) - بَابُ الْقَيْدِ', id: 2 },\n  { nass: 'باب الصلاة - الفصل الأول', id: 3 },\n];\n\n// Convert your data to the Page format\nconst pages: Page[] = excerpts.map(e =\u003e ({ content: e.nass, id: e.id }));\n\nconst result = segmentPages(pages, {\n  maxPages: 0, // IMPORTANT: Guarantees each page stays isolated (no merging/splitting)\n  rules: [\n    // 1. Extract triple numbers: ٧٠١٦ - ١ - ١\n    {\n      lineStartsAfter: ['{{raqms:num}} {{dash}} {{raqms:num2}} {{dash}} {{raqms:num3}} '],\n    },\n    // 2. Extract number + indicator: ٧٠١٧ (أ)\n    {\n      lineStartsAfter: ['{{raqms:num}} ({{harf:indicator}}) {{dash}} '],\n    },\n    // 3. 
Mark chapters using fuzzy tokens\n    {\n      fuzzy: true,\n      lineStartsWith: ['{{bab}} '],\n      meta: { type: 'Chapter' },\n    },\n  ],\n});\n\n// Segment 0: { content: 'فَقَصَّتْهَا حَفْصَةُ', meta: { num: '٧٠١٦', num2: '١', num3: '١' }, ... }\n// Segment 1: { content: 'بَابُ الْقَيْدِ', meta: { num: '٧٠١٧', indicator: 'أ' }, ... }\n// Segment 2: { content: 'باب الصلاة - الفصل الأول', meta: { type: 'Chapter' }, ... }\n```\n\n### Why use this?\n- **Pattern Robustness**: Use `{{raqms}}`, `{{dash}}`, and `{{harf}}` instead of writing raw regex for every edge case.\n- **Prefix Cleaning**: `lineStartsAfter` automatically removes the matched pattern, leaving only the clean text.\n- **Deduplication**: Named captures like `{{raqms:num}}` automatically populate the `meta` object.\n- **Fuzzy Headers**: Use `fuzzy: true` to match headers like \"Book\" or \"Chapter\" regardless of Arabic diacritics.\n\n## Rule Optimization\n\nUse `optimizeRules()` to automatically merge compatible rules, remove duplicate patterns, and sort rules by specificity (longest patterns first):\n\n```typescript\nimport { optimizeRules } from 'flappa-doormal';\n\nconst rules = [\n  // These will be merged because meta/fuzzy options match\n  { lineStartsWith: ['{{kitab}}'], fuzzy: true, meta: { type: 'header' } },\n  { lineStartsWith: ['{{bab}}'], fuzzy: true, meta: { type: 'header' } },\n  \n  // This will be kept separate\n  { lineStartsAfter: ['{{numbered}}'], meta: { type: 'entry' } },\n];\n\nconst { rules: optimized, mergedCount } = optimizeRules(rules);\n\n// Result:\n// optimized[0] = { \n//   lineStartsWith: ['{{kitab}}', '{{bab}}'], \n//   fuzzy: true, \n//   meta: { type: 'header' } \n// }\n// optimized[1] = { lineStartsAfter: ['{{numbered}}'], ... 
}\n```\n\n## Rule Validation\n\nUse `validateRules()` to detect common mistakes in rule patterns before running segmentation:\n\n```typescript\nimport { validateRules } from 'flappa-doormal';\n\nconst issues = validateRules([\n  { lineStartsAfter: ['raqms:num'] },       // Missing {{}}\n  { lineStartsWith: ['{{unknown}}'] },      // Unknown token\n  { lineStartsAfter: ['## (rumuz:rumuz)'] } // Typo - should be {{rumuz:rumuz}}\n]);\n\n// issues[0]?.lineStartsAfter?.[0]?.type === 'missing_braces'\n// issues[1]?.lineStartsWith?.[0]?.type === 'unknown_token'\n// issues[2]?.lineStartsAfter?.[0]?.type === 'missing_braces'\n\n// To get a simple list of error strings for UI display:\nimport { formatValidationReport } from 'flappa-doormal';\n\nconst errors = formatValidationReport(issues);\n// [\n//   'Rule 1, lineStartsAfter: Missing {{}} around token \"raqms:num\"',\n//   'Rule 2, lineStartsWith: Unknown token \"{{unknown}}\"',\n//   ...\n// ]\n```\n\n**Checks performed:**\n- **Missing braces**: Detects token names like `raqms:num` without `{{}}`\n- **Unknown tokens**: Flags tokens inside `{{}}` that don't exist (e.g., `{{nonexistent}}`)\n- **Duplicates**: Finds duplicate patterns within the same rule\n\n## Token Mapping Utilities\n\nWhen building UIs for rule editing, it's often useful to separate the *token pattern* (e.g., `{{raqms}}`) from the *capture name* (e.g., `{{raqms:hadithNum}}`).\n\n```typescript\nimport { applyTokenMappings, stripTokenMappings } from 'flappa-doormal';\n\n// 1. Apply user-defined mappings to a raw template\nconst template = '{{raqms}} {{dash}}';\nconst mappings = [{ token: 'raqms', name: 'num' }];\n\nconst result = applyTokenMappings(template, mappings);\n// result = '{{raqms:num}} {{dash}}'\n\n// 2. 
Strip captures to get back to the canonical pattern\nconst raw = stripTokenMappings(result);\n// raw = '{{raqms}} {{dash}}'\n```\n\n## Prompting LLMs / Agents to Generate Rules (Shamela books)\n\n### Pre-analysis (no LLM required): generate “hints” from the book\n\nBefore prompting an LLM, you can quickly extract **high-signal pattern hints** from the book using:\n- `analyzeCommonLineStarts(pages, options)` (from `src/line-start-analysis.ts`): common **line-start signatures** (tokenized)\n- `analyzeTextForRule(text)` / `detectTokenPatterns(text)` (from `src/pattern-detection.ts`): turn a **single representative line** into a token template suggestion\n\nThese help the LLM avoid guessing and focus on the patterns actually present.\n\n#### Step 1: top line-start signatures (frequency-first)\n\n```typescript\nimport { analyzeCommonLineStarts } from 'flappa-doormal';\n\nconst top = analyzeCommonLineStarts(pages, {\n  sortBy: 'count',\n  topK: 40,\n  minCount: 10,\n});\n\nconsole.log(top.map((p) =\u003e ({ pattern: p.pattern, count: p.count, example: p.examples[0] })));\n```\n\nTypical output (example):\n\n```text\n[\n  { pattern: \"{{numbered}}\", count: 1200, example: { pageId: 50, line: \"١ - حَدَّثَنَا ...\" } },\n  { pattern: \"{{bab}}\",      count:  180, example: { pageId: 66, line: \"باب ...\" } },\n  { pattern: \"##\\\\s*{{bab}}\",count:  140, example: { pageId: 69, line: \"## باب ...\" } }\n]\n```\n\nIf you only want to analyze headings (to see what comes *after* `##`):\n\n```typescript\nconst headingVariants = analyzeCommonLineStarts(pages, {\n  lineFilter: (line) =\u003e line.startsWith('##'),\n  sortBy: 'count',\n  topK: 40,\n});\n```\n\n#### Step 2: convert a few representative lines into token templates\n\nPick 3–10 representative line prefixes from the book (often from the examples returned above) and run:\n\n```typescript\nimport { analyzeTextForRule } from 'flappa-doormal';\n\nconsole.log(analyzeTextForRule(\"٢٩- خ سي: أحمد بن حميد ...\"));\n// -\u003e 
{ template: \"{{raqms}}- {{rumuz}}: أحمد...\", patternType: \"lineStartsAfter\", fuzzy: false, ... }\n```\n\n#### Step 3: paste the “hints” into your LLM prompt\n\nWhen you prompt the LLM, include a short “Hints” section:\n- Top 20–50 `analyzeCommonLineStarts` patterns (with counts + 1–2 examples)\n- 3–10 `analyzeTextForRule(...)` results\n- A small sample of pages (not the full book)\n\nThen instruct the LLM to **prioritize rules that align with those hints**.\n\nYou can use an LLM to generate `SegmentationOptions` by pasting it a random subset of pages and asking it to infer robust segmentation rules. Here’s a ready-to-copy plain-text prompt:\n\n```text\nYou are helping me generate JSON configuration for a text-segmentation function called segmentPages(pages, options).\nIt segments Arabic book pages (e.g., Shamela) into logical segments (books/chapters/sections/entries/hadiths).\n\nI will paste a random subset of pages so you can infer patterns. 
Each page has:\n- id: page number (not necessarily consecutive)\n- content: plain text; line breaks are \\n\n\nOutput ONLY a JSON object compatible with SegmentationOptions (no prose, no code fences).\n\nSegmentationOptions shape:\n- rules: SplitRule[]\n- optional: maxPages, breakpoints, prefer\n\nSplitRule constraints:\n- Each rule must use exactly ONE of: lineStartsWith, lineStartsAfter, lineEndsWith, template, regex\n- Optional fields: split (\"at\" | \"after\"), meta, min, max, exclude, occurrence (\"first\" | \"last\"), fuzzy\n\nImportant behaviors:\n- lineStartsAfter matches at line start but strips the marker from segment.content.\n- Template patterns (lineStartsWith/After/EndsWith/template) auto-escape ()[] outside tokens.\n- Raw regex patterns do NOT auto-escape and can include groups, named captures, etc.\n\nAvailable tokens you may use in templates:\n- {{basmalah}}  (بسم الله / ﷽)\n- {{kitab}}     (كتاب)\n- {{bab}}       (باب)\n- {{fasl}}      (فصل | مسألة)\n- {{naql}}      (حدثنا/أخبرنا/... narration phrases)\n- {{raqm}}      (single Arabic-Indic digit)\n- {{raqms}}     (Arabic-Indic digits)\n- {{num}}       (single ASCII digit)\n- {{nums}}      (ASCII digits)\n- {{dash}}      (dash variants)\n- {{tarqim}}    (punctuation [. ! ? ؟ ؛])\n- {{harf}}      (Arabic letter)\n- {{harfs}}     (single-letter codes separated by spaces; e.g. 
\"د ت س ي ق\")\n- {{rumuz}}     (rijāl/takhrīj source abbreviations; matches blocks like \"خت ٤\", \"خ سي\", \"خ فق\")\n\nNamed captures:\n- {{raqms:num}} captures to meta.num\n- {{:name}} captures arbitrary text to meta.name\n\nYour tasks:\n1) Identify document structure from the sample:\n   - book headers (كتاب), chapter headers (باب), sections (فصل/مسألة), hadith numbering, biography entries, etc.\n2) Propose a minimal but robust ordered ruleset:\n   - Put most-specific rules first.\n   - Use fuzzy:true for Arabic headings where diacritics vary.\n   - Use lineStartsAfter when you want to remove the marker (e.g., hadith numbers, rumuz prefixes).\n3) Use constraints:\n   - Use min/max/exclude when front matter differs or specific pages are noisy.\n4) If segments can span many pages:\n   - Set maxPages and breakpoints.\n   - Suggested breakpoints (in order): \"{{tarqim}}\", \"\\\\n\", \"\" (page boundary)\n   - Prefer \"longer\" unless there’s a reason to prefer shorter segments.\n5) Capture useful metadata:\n   - For numbering patterns, capture the number into meta.num (e.g., {{raqms:num}}).\n\nExamples (what good answers look like):\n\nExample A: hadith-style numbered segments\nInput pages:\nPAGE 10:\n٣٤ - حَدَّثَنَا ...\\n... (rest of hadith)\nPAGE 11:\n٣٥ - حَدَّثَنَا ...\\n... 
(rest of hadith)\n\nGood JSON answer:\n{\n  \"rules\": [\n    {\n      \"lineStartsAfter\": [\"{{raqms:num}} {{dash}}\\\\s*\"],\n      \"split\": \"at\",\n      \"meta\": { \"type\": \"hadith\" }\n    }\n  ]\n}\n\nExample B: chapter markers + hadith numbers\nInput pages:\nPAGE 50:\nكتاب الصلاة\\nباب فضل الصلاة\\n١ - حَدَّثَنَا ...\\n...\nPAGE 51:\n٢ - حَدَّثَنَا ...\\n...\n\nGood JSON answer:\n{\n  \"rules\": [\n    { \"fuzzy\": true, \"lineStartsWith\": [\"{{kitab}}\"], \"split\": \"at\", \"meta\": { \"type\": \"book\" } },\n    { \"fuzzy\": true, \"lineStartsWith\": [\"{{bab}}\"], \"split\": \"at\", \"meta\": { \"type\": \"chapter\" } },\n    { \"lineStartsAfter\": [\"{{raqms:num}}\\\\s*{{dash}}\\\\s*\"], \"split\": \"at\", \"meta\": { \"type\": \"hadith\" } }\n  ]\n}\n\nExample C: narrator/rijāl entries with rumuz (codes) + colon\nInput pages:\nPAGE 257:\n٢٩- خ سي: أحمد بن حميد...\\nوكان من حفاظ الكوفة.\nPAGE 258:\n١٠٢- ق: تمييز ولهم شيخ آخر...\\n...\n\nGood JSON answer:\n{\n  \"rules\": [\n    {\n      \"lineStartsAfter\": [\"{{raqms:num}}\\\\s*{{dash}}\\\\s*{{rumuz}}:\\\\s*\"],\n      \"split\": \"at\",\n      \"meta\": { \"type\": \"entry\" }\n    }\n  ]\n}\n\nNow wait for the pages.\n```\n\n### Sentence-Based Splitting (Last Period Per Page)\n\n```typescript\nconst segments = segmentPages(pages, {\n  rules: [{\n    lineEndsWith: ['\\\\.'],\n    split: 'after',\n    occurrence: 'last',\n  }]\n});\n```\n\n### Multiple Rules with Priority\n\n```typescript\nconst segments = segmentPages(pages, {\n  rules: [\n    // First: Chapter headers (highest priority)\n    { fuzzy: true, lineStartsAfter: ['{{kitab:book}} '], split: 'at', meta: { type: 'chapter' } },\n    // Second: Sub-chapters\n    { fuzzy: true, lineStartsAfter: ['{{bab:section}} '], split: 'at', meta: { type: 'section' } },\n    // Third: Individual hadiths\n    { lineStartsAfter: ['{{raqms:num}} {{dash}} '], split: 'at', meta: { type: 'hadith' } },\n  ]\n});\n```\n\n## API Reference\n\n### 
`segmentPages(pages, options)`\n\nMain segmentation function.\n\n```typescript\nimport { segmentPages, type Page, type SegmentationOptions, type Segment } from 'flappa-doormal';\n\nconst pages: Page[] = [\n  { id: 1, content: 'First page content...' },\n  { id: 2, content: 'Second page content...' },\n];\n\nconst options: SegmentationOptions = {\n  // Optional preprocessing transforms (run before pattern matching)\n  // See \"7.1 Preprocessing\" section for details\n  preprocess: ['removeZeroWidth', 'condenseEllipsis'],\n  \n  rules: [\n    { lineStartsWith: ['## '], split: 'at' }\n  ],\n  // How to join content across page boundaries in OUTPUT segments:\n  // - 'space' (default): page boundaries become spaces\n  // - 'newline': preserve page boundaries as newlines\n  pageJoiner: 'newline',\n\n  // Breakpoint preferences for resizing oversized segments:\n  // - 'longer' (default): maximizes segment size within limits\n  // - 'shorter': minimizes segment size (splits at first match)\n  prefer: 'longer',\n\n  // Post-structural limit: split if segment spans more than 2 pages\n  maxPages: 2,\n\n  // Post-structural limit: split if segment exceeds 5000 characters\n  maxContentLength: 5000,\n\n  // Enable match metadata in segments (meta.debug)\n  debug: true,\n\n  // Custom logger for tracing\n  logger: {\n    info: (m) =\u003e console.log(m),\n    warn: (m) =\u003e console.warn(m),\n  }\n};\n\nconst segments: Segment[] = segmentPages(pages, options);\n```\n\n### `validateSegments(pages, options, segments, validationOptions?)`\n\nValidates that segments correctly map back to the source pages and adhere to constraints.\n\n```typescript\nimport { validateSegments } from 'flappa-doormal';\n\nconst report = validateSegments(pages, options, segments, {\n  // Optional: Max content length to search before falling back (default: 500)\n  // Segments longer than this are checked via fast path unless issues are found.\n  fullSearchThreshold: 1000, \n});\n```\n\nReturns a 
`SegmentValidationReport` containing:\n- `ok`: boolean\n- `summary`: counts of errors/warnings\n- `issues`: detailed list of problems (page attribution mismatch, maxPages violation, etc.)\n\n### `stripHtmlTags(html)`\n\nRemove all HTML tags from content, keeping only text.\n\n```typescript\nimport { stripHtmlTags } from 'flappa-doormal';\n\nconst text = stripHtmlTags('\u003cp\u003eHello \u003cb\u003eWorld\u003c/b\u003e\u003c/p\u003e');\n// Returns: 'Hello World'\n```\n\nFor more sophisticated HTML to Markdown conversion (like converting `\u003cspan data-type=\"title\"\u003e` to `## ` headers), you can implement your own function. Here's an example:\n\n```typescript\nconst htmlToMarkdown = (html: string): string =\u003e {\n    return html\n        // Convert title spans to markdown headers\n        .replace(/\u003cspan[^\u003e]*data-type=[\"']title[\"'][^\u003e]*\u003e(.*?)\u003c\\/span\u003e/gi, '## $1')\n        // Strip narrator links but keep text\n        .replace(/\u003ca[^\u003e]*href=[\"']inr:\\/\\/[^\"']*[\"'][^\u003e]*\u003e(.*?)\u003c\\/a\u003e/gi, '$1')\n        // Strip all remaining HTML tags\n        .replace(/\u003c[^\u003e]*\u003e/g, '');\n};\n```\n\n### `expandTokens(template)`\n\nExpand template tokens to regex pattern.\n\n```typescript\nimport { expandTokens } from 'flappa-doormal';\n\nconst pattern = expandTokens('{{raqms}} {{dash}}');\n// Returns: '[\\u0660-\\u0669]+ [-–—ـ]'\n```\n\n### `makeDiacriticInsensitive(text)`\n\nMake Arabic text diacritic-insensitive for fuzzy matching.\n\n```typescript\nimport { makeDiacriticInsensitive } from 'flappa-doormal';\n\nconst pattern = makeDiacriticInsensitive('حدثنا');\n// Returns regex pattern matching 'حَدَّثَنَا', 'حدثنا', etc.\n```\n\n### `TOKEN_PATTERNS`\n\nAccess available token definitions.\n\n```typescript\nimport { TOKEN_PATTERNS } from 'flappa-doormal';\n\nconsole.log(TOKEN_PATTERNS.narrated);\n// 'حدثنا|أخبرنا|حدثني|وحدثنا|أنبأنا|سمعت'\n```\n\n### Pattern Detection Utilities\n\nThese functions 
help auto-detect tokens in text, useful for building UI tools that suggest rule configurations from user-highlighted text.\n\n#### `detectTokenPatterns(text)`\n\nAnalyzes text and returns all detected token patterns with their positions.\n\n```typescript\nimport { detectTokenPatterns } from 'flappa-doormal';\n\nconst detected = detectTokenPatterns(\"٣٤ - حدثنا\");\n// Returns:\n// [\n//   { token: 'raqms', match: '٣٤', index: 0, endIndex: 2 },\n//   { token: 'dash', match: '-', index: 3, endIndex: 4 },\n//   { token: 'naql', match: 'حدثنا', index: 5, endIndex: 10 }\n// ]\n```\n\n#### `generateTemplateFromText(text, detected)`\n\nConverts text to a template string using detected patterns.\n\n```typescript\nimport { detectTokenPatterns, generateTemplateFromText } from 'flappa-doormal';\n\nconst text = \"٣٤ - \";\nconst detected = detectTokenPatterns(text);\nconst template = generateTemplateFromText(text, detected);\n// Returns: \"{{raqms}} {{dash}} \"\n```\n\n#### `suggestPatternConfig(detected)`\n\nSuggests the best pattern type and options based on detected patterns.\n\n```typescript\nimport { detectTokenPatterns, suggestPatternConfig } from 'flappa-doormal';\n\n// For numbered patterns (hadith-style)\nconst hadithDetected = detectTokenPatterns(\"٣٤ - \");\nsuggestPatternConfig(hadithDetected);\n// Returns: { patternType: 'lineStartsAfter', fuzzy: false, metaType: 'hadith' }\n\n// For structural patterns (chapter markers)\nconst chapterDetected = detectTokenPatterns(\"باب الصلاة\");\nsuggestPatternConfig(chapterDetected);\n// Returns: { patternType: 'lineStartsWith', fuzzy: true, metaType: 'bab' }\n```\n\n#### `analyzeTextForRule(text)`\n\nComplete analysis that combines detection, template generation, and config suggestion.\n\n```typescript\nimport { analyzeTextForRule } from 'flappa-doormal';\n\nconst result = analyzeTextForRule(\"٣٤ - حدثنا\");\n// Returns:\n// {\n//   template: \"{{raqms}} {{dash}} {{naql}}\",\n//   patternType: 'lineStartsAfter',\n//   fuzzy: 
false,\n//   metaType: 'hadith',\n//   detected: [...]\n// }\n\n// Use the result to build a rule:\nconst rule = {\n  [result.patternType]: [result.template],\n  split: 'at',\n  fuzzy: result.fuzzy,\n  meta: { type: result.metaType }\n};\n```\n\n### Expanding composite tokens (for adding named captures)\n\nSome tokens are **composites** (e.g. `{{numbered}}`), which are great for quick signatures but less convenient when you want to add named captures (e.g. capture the number).\n\nYou can expand composites back into their underlying template form:\n\n```typescript\nimport { expandCompositeTokensInTemplate } from 'flappa-doormal';\n\nconst base = expandCompositeTokensInTemplate('{{numbered}}');\n// base === '{{raqms}} {{dash}} '\n\n// Now you can add a named capture:\nconst withCapture = base.replace('{{raqms}}', '{{raqms:num}}');\n// withCapture === '{{raqms:num}} {{dash}} '\n```\n\n## Types\n\n### `SplitRule`\n\n```typescript\ntype SplitRule = {\n  // Pattern (choose one)\n  lineStartsWith?: string[];\n  lineStartsAfter?: string[];\n  lineEndsWith?: string[];\n  template?: string;\n  regex?: string;\n\n  // Split behavior\n  split?: 'at' | 'after';  // Default: 'at'\n  occurrence?: 'first' | 'last' | 'all';\n  fuzzy?: boolean;\n\n  // Constraints\n  min?: number;\n  max?: number;\n  exclude?: (number | [number, number])[]; // Single page or [start, end] range\n  pageStartGuard?: string;\n  pageStartPrevWordStoplist?: string[];\n  samePagePrevWordStoplist?: string[];\n  meta?: Record\u003cstring, unknown\u003e;\n};\n```\n\n### `Segment`\n\n```typescript\ntype Segment = {\n  content: string;\n  from: number;\n  to?: number;\n  meta?: Record\u003cstring, unknown\u003e;\n};\n```\n\n### `DetectedPattern`\n\nResult from pattern detection utilities.\n\n```typescript\ntype DetectedPattern = {\n  token: string;    // Token name (e.g., 'raqms', 'dash')\n  match: string;    // The matched text\n  index: number;    // Start index in original text\n  endIndex: number; // End 
index (exclusive)\n};\n```\n\n## Usage with Next.js / Node.js\n\n```typescript\n// app/api/segment/route.ts (Next.js App Router)\nimport { segmentPages } from 'flappa-doormal';\nimport { NextResponse } from 'next/server';\n\nexport async function POST(request: Request) {\n  const { pages, rules } = await request.json();\n  \n  const segments = segmentPages(pages, { rules });\n  \n  return NextResponse.json({ segments });\n}\n```\n\n```typescript\n// Node.js script\nimport { segmentPages, stripHtmlTags } from 'flappa-doormal';\n\nconst pages = rawPages.map((p, i) =\u003e ({\n  id: i + 1,\n  content: stripHtmlTags(p.html)\n}));\n\nconst segments = segmentPages(pages, {\n  rules: [{\n    lineStartsAfter: ['{{raqms:num}} {{dash}} '],\n    split: 'at'\n  }]\n});\n\nconsole.log(`Found ${segments.length} segments`);\n```\n\n## Development\n\n```bash\n# Install dependencies\nbun install\n\n# Run tests\nbun test\n\n# Build\nbun run build\n\n# Run performance test (generates 50K pages, measures segmentation speed/memory)\nbun run perf\n\n# Lint\nbunx biome lint .\n\n# Format\nbunx biome format --write .\n```\n\n## Design Decisions\n\n### Double-Brace Syntax `{{token}}`\n\nSingle braces conflict with regex quantifiers `{n,m}`. Double braces are visually distinct and match common template syntax (Handlebars, Mustache).\n\n### `lineStartsAfter` vs `lineStartsWith`\n\n- `lineStartsWith`: Keep marker in content (for detection only)\n- `lineStartsAfter`: Strip marker, capture only content (for clean extraction)\n\n### Fuzzy Applied at Token Level\n\nFuzzy transforms are applied to raw Arabic text *before* wrapping in regex groups. 
This prevents corruption of regex metacharacters like `(`, `)`, `|`.\n\n### Extracted Utilities\n\nComplex logic is intentionally split into small, independently testable modules:\n\n- `src/segmentation/match-utils.ts`: match filtering + capture extraction\n- `src/segmentation/rule-regex.ts`: SplitRule → compiled regex builder (`buildRuleRegex`, `processPattern`)\n- `src/segmentation/breakpoint-utils.ts`: breakpoint windowing/exclusion helpers, page boundary join normalization, and progressive prefix page detection for accurate `from`/`to` attribution\n- `src/segmentation/breakpoint-processor.ts`: breakpoint post-processing engine (applies breakpoints after structural segmentation)\n\n## Performance Notes\n\n### Memory Requirements\n\nThe library concatenates all pages into a single string for pattern matching across page boundaries. Memory usage scales linearly with total content size:\n\n| Pages | Avg Page Size | Approximate Memory |\n|-------|---------------|-------------------|\n| 1,000 | 5 KB | ~5 MB |\n| 6,000 | 5 KB | ~30 MB |\n| 40,000 | 5 KB | ~200 MB |\n\nFor typical book processing (up to 6,000 pages), memory usage is well within Node.js defaults. 
For very large books (40,000+ pages), make sure the Node.js heap limit is large enough, for example by raising it with `--max-old-space-size`.\n\n## For AI Agents\n\nSee [AGENTS.md](./AGENTS.md) for:\n- Architecture details and design patterns\n- Adding new tokens and pattern types\n- Algorithm explanations\n- Lessons learned during development\n\n## Demo\n\nAn interactive demo is available at [flappa-doormal.surge.sh](https://flappa-doormal.surge.sh).\n\nThe demo source code is located in the `demo/` directory and includes:\n- **Analysis**: Discover common line-start patterns in your text\n- **Pattern Detection**: Auto-detect tokens in text and get template suggestions\n- **Segmentation**: Apply rules and see segmented output with metadata\n\nTo run the demo locally:\n\n```bash\ncd demo\nbun install\nbun run dev\n```\n\nTo deploy updates:\n\n```bash\ncd demo\nbun run deploy\n```\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fragaeeb%2Fflappa-doormal","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fragaeeb%2Fflappa-doormal","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fragaeeb%2Fflappa-doormal/lists"}