{"id":36947015,"url":"https://github.com/linkml/linkml-reference-validator","last_synced_at":"2026-01-27T04:10:35.463Z","repository":{"id":327096650,"uuid":"1098504176","full_name":"linkml/linkml-reference-validator","owner":"linkml","description":"Validate that supporting text quotes in your data actually appear in their cited references","archived":false,"fork":false,"pushed_at":"2026-01-08T16:27:40.000Z","size":1280,"stargazers_count":10,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-13T12:22:34.285Z","etag":null,"topics":["agentic-ai","ai-guardrails","ai4curation","linkml","monarchinitiative","pubmed"],"latest_commit_sha":null,"homepage":"http://linkml.io/linkml-reference-validator/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/linkml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-17T19:22:56.000Z","updated_at":"2026-01-08T16:26:59.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/linkml/linkml-reference-validator","commit_stats":null,"previous_names":["linkml/linkml-reference-validator"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/linkml/linkml-reference-validator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkml%2Flinkml-reference-validator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/linkml%2Flinkml-reference-validator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkml%2Flinkml-reference-validator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkml%2Flinkml-reference-validator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/linkml","download_url":"https://codeload.github.com/linkml/linkml-reference-validator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkml%2Flinkml-reference-validator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28398154,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-13T14:36:09.778Z","status":"ssl_error","status_checked_at":"2026-01-13T14:35:19.697Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","ai-guardrails","ai4curation","linkml","monarchinitiative","pubmed"],"created_at":"2026-01-13T11:36:36.556Z","updated_at":"2026-01-16T08:47:45.967Z","avatar_url":"https://github.com/linkml.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 
linkml-reference-validator\n\n[![Tests](https://img.shields.io/badge/tests-91%20passing-brightgreen)]()\n[![Coverage](https://img.shields.io/badge/coverage-86%25-green)]()\n[![Python](https://img.shields.io/badge/python-3.10%2B-blue)]()\n\nValidate that supporting text quotes in your data actually appear in their cited references.\n\nThis tool fetches scientific publications (currently PubMed/PMC) and verifies that quoted text (`supporting_text`) can be found in the referenced document using deterministic substring matching.\n\n---\n\n## Quick Start\n\n### Installation\n\n```bash\n# Using uv (recommended)\nuv pip install linkml-reference-validator\n\n# Using pip\npip install linkml-reference-validator\n```\n\n### Basic Usage\n\n```bash\n# Validate a single quote against a reference\nlinkml-reference-validator validate text \\\n  \"protein functions in cell cycle regulation\" \\\n  PMID:12345678\n\n# Validate a data file using LinkML validation\nlinkml-validate -s schema.yaml data.yaml \\\n  --validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin\n```\n\n---\n\n## Why Use This Tool?\n\nScientific data often includes claims supported by quotes from publications. But how do you know the quotes are accurate?\n\n**Before:**\n```yaml\ngene_function:\n  gene: TP53\n  function: \"regulates cell cycle\"\n  evidence:\n    reference: PMID:12345678\n    supporting_text: \"TP53 is critical for cell cycle regulation\"  # Is this really in the paper?\n```\n\n**After validation:**\n```bash\n$ linkml-reference-validator validate text \\\n    \"TP53 is critical for cell cycle regulation\" \\\n    PMID:12345678\n\n✓ Valid: True\n✓ Supporting text validated successfully in PMID:12345678\n```\n\n---\n\n## CLI Commands\n\n\u003e **Note:** The CLI was restructured in v1.x to use nested commands (`validate text`, `validate data`, `cache reference`). 
The old hyphenated commands (`validate-text`, `validate-data`, `cache-reference`) still work for backward compatibility but are deprecated.\n\n### 1. `validate text` - Quick Single Quote Validation\n\nValidate a single quote against a reference without needing a schema.\n\n```bash\nlinkml-reference-validator validate text \u003cTEXT\u003e \u003cREFERENCE_ID\u003e [OPTIONS]\n```\n\n**Example:**\n```bash\n# Basic validation\nlinkml-reference-validator validate text \\\n  \"protein functions in cell cycle regulation\" \\\n  PMID:12345678\n\n# With editorial notes (ignored in matching)\nlinkml-reference-validator validate text \\\n  \"protein [X] functions in cell cycle regulation\" \\\n  PMID:12345678\n\n# Multi-part quote with omitted text\nlinkml-reference-validator validate text \\\n  \"protein functions ... cell cycle regulation\" \\\n  PMID:12345678\n```\n\n**Options:**\n- `--cache-dir PATH` - Directory for caching references (default: `references_cache`)\n- `--verbose` - Show detailed validation information\n- `--help` - Show help message\n\n**Exit Codes:**\n- `0` - Validation successful\n- `1` - Validation failed\n\n---\n\n### 2. 
`validate data` - Full Data File Validation\n\nValidate an entire data file using a LinkML schema.\n\n```bash\nlinkml-reference-validator validate data \u003cDATA_FILE\u003e --schema \u003cSCHEMA\u003e [OPTIONS]\n```\n\n**Example:**\n\n**Schema** (`gene_schema.yaml`):\n```yaml\nid: https://example.org/genes\nname: gene-schema\n\nclasses:\n  GeneFunction:\n    attributes:\n      gene:\n        range: string\n      function:\n        range: string\n      evidence:\n        range: Evidence\n\n  Evidence:\n    attributes:\n      reference:\n        range: Reference\n        implements:\n          - linkml:authoritative_reference  # Marks this as a reference field\n      supporting_text:\n        range: string\n        implements:\n          - linkml:excerpt  # Marks this as text to validate\n\n  Reference:\n    attributes:\n      id:\n        identifier: true\n        range: string\n      title:\n        range: string\n```\n\n**Data** (`gene_data.yaml`):\n```yaml\ngene: TP53\nfunction: \"regulates cell cycle\"\nevidence:\n  reference:\n    id: PMID:12345678\n    title: \"TP53 in cell cycle control\"\n  supporting_text: \"TP53 protein functions in cell cycle regulation\"\n```\n\n**Validation:**\n```bash\nlinkml-reference-validator validate data \\\n  gene_data.yaml \\\n  --schema gene_schema.yaml\n\n# Output:\nValidating gene_data.yaml against schema gene_schema.yaml\nCache directory: references_cache\n\n✓ All validations passed!\n```\n\n**Options:**\n- `--schema PATH` (required) - Path to LinkML schema\n- `--target-class CLASS` - Specific class to validate\n- `--cache-dir PATH` - Directory for caching references\n- `--verbose` - Show detailed output\n- `--help` - Show help message\n\n---\n\n### 3. 
`repair` - Automated Repair of Validation Errors\n\nAutomatically fix or flag supporting text validation errors based on confidence thresholds.\n\n```bash\n# Repair a single quote\nlinkml-reference-validator repair text \u003cTEXT\u003e \u003cREFERENCE_ID\u003e [OPTIONS]\n\n# Repair a data file (dry run by default)\nlinkml-reference-validator repair data \u003cDATA_FILE\u003e --schema \u003cSCHEMA\u003e [OPTIONS]\n```\n\n**Example - Single Quote:**\n```bash\n# Try to repair a quote with ASCII subscript\nlinkml-reference-validator repair text \"CO2 levels were measured\" PMID:12345678\n\n# Output:\n# ✓ Repaired successfully\n#   Original: CO2 levels were measured\n#   Repaired: CO₂ levels were measured\n#   Action: CHARACTER_NORMALIZATION\n#   Confidence: HIGH\n```\n\n**Example - Data File:**\n```bash\n# Dry run - show what would be changed\nlinkml-reference-validator repair data disease.yaml \\\n  --schema schema.yaml \\\n  --dry-run\n\n# Apply auto-fixes (creates backup)\nlinkml-reference-validator repair data disease.yaml \\\n  --schema schema.yaml \\\n  --no-dry-run\n\n# Custom output file\nlinkml-reference-validator repair data disease.yaml \\\n  --schema schema.yaml \\\n  --no-dry-run \\\n  --output repaired.yaml\n```\n\n**Repair Report Output:**\n```\n============================================================\nRepair Report\n============================================================\n\nHIGH CONFIDENCE FIXES (auto-applicable):\n  PMID:12345678 at evidence[0]:\n    Character normalization fix\n    'CO2 levels...' 
→ 'CO₂ levels...'\n\nSUGGESTED FIXES (review recommended):\n  PMID:23456789 at evidence[1]:\n    Inserted ellipsis between non-contiguous parts\n\nRECOMMENDED REMOVALS (low confidence):\n  PMID:34567890 at evidence[2]:\n    Similarity: 8%\n    Snippet: 'Fabricated text that...'\n\n------------------------------------------------------------\nSummary:\n  Total items: 5\n  Already valid: 2\n  Auto-fixes: 1\n  Suggestions: 1\n  Removals: 1\n  Unverifiable: 0\n```\n\n**Repair Strategies:**\n\n| Strategy | Confidence | Description |\n|----------|------------|-------------|\n| Character Normalization | HIGH | Fix Unicode/symbol differences (CO2→CO₂, +/-→±) |\n| Ellipsis Insertion | MEDIUM | Insert `...` between non-contiguous text parts |\n| Fuzzy Correction | VARIES | Suggest closest matching text from reference |\n| Removal | VERY_LOW | Flag fabricated/not-found text for manual removal |\n\n**Options:**\n- `--dry-run / --no-dry-run` - Show changes without applying (default: dry-run)\n- `--auto-fix-threshold FLOAT` - Minimum similarity for auto-fixes (default: 0.95)\n- `--output PATH` - Output file path (default: overwrite with backup)\n- `--config PATH` - Path to repair configuration file\n- `--cache-dir PATH` - Directory for caching references\n- `--verbose` - Show detailed output\n\n---\n\n### 4. 
`cache reference` - Pre-cache References\n\nDownload and cache references for offline use.\n\n```bash\nlinkml-reference-validator cache reference \u003cREFERENCE_ID\u003e [OPTIONS]\n```\n\n**Example:**\n```bash\n# Cache a single reference\nlinkml-reference-validator cache reference PMID:12345678\n\n# Output:\nFetching PMID:12345678...\n✓ Successfully cached PMID:12345678\n  Title: TP53 in cell cycle control\n  Authors: Smith J, Doe A, Johnson K\n  Content type: full_text_xml\n  Content length: 45231 characters\n```\n\n**Use Cases:**\n- Pre-fetch references before validation\n- Build offline reference library\n- Verify reference availability\n\n---\n\n## Integration with `linkml-validate`\n\nThe recommended way to use this tool is as a **LinkML validation plugin** with the standard `linkml-validate` command.\n\n### Setup\n\n**1. Install both packages:**\n```bash\nuv pip install linkml linkml-reference-validator\n```\n\n**2. Create your schema with interface markers:**\n```yaml\n# my_schema.yaml\nid: https://example.org/my-schema\nname: my-schema\n\nprefixes:\n  linkml: https://w3id.org/linkml/\n\nclasses:\n  Evidence:\n    attributes:\n      reference:\n        range: Reference\n        implements:\n          - linkml:authoritative_reference  # \u003c-- This marks it as a reference\n      supporting_text:\n        range: string\n        implements:\n          - linkml:excerpt  # \u003c-- This marks it as text to validate\n```\n\n**3. 
Validate using linkml-validate:**\n```bash\nlinkml-validate \\\n  --schema my_schema.yaml \\\n  --validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin \\\n  my_data.yaml\n```\n\n### Why Use the Plugin?\n\n✅ **Integrated validation** - Combines schema validation + reference validation in one command\n✅ **Standard LinkML workflow** - Uses familiar LinkML tools\n✅ **Flexible schema design** - Works with any schema using the interface pattern\n✅ **Rich error reporting** - Shows exactly where validation fails in your data\n\n---\n\n## Supported Reference Formats\n\n### Currently Supported\n\n#### PubMed IDs (PMID)\n```yaml\nreference:\n  id: PMID:12345678\n  supporting_text: \"protein functions in cells\"\n```\n\n**Fetches:**\n- Abstract (always)\n- Full text from PMC (when available)\n- Metadata (title, authors, journal, year, DOI)\n\n**ID Formats:**\n- `PMID:12345678`\n- `12345678` (assumes PMID)\n\n### Coming Soon\n\n- **DOI** - `DOI:10.1038/nature12345`\n- **URLs** - Web pages and online documents\n\n---\n\n## Schema Requirements\n\nFor the validator to work, your LinkML schema must:\n\n### 1. Mark Reference Fields\n\nUse `implements: [linkml:authoritative_reference]` on slots that contain references:\n\n```yaml\nclasses:\n  Evidence:\n    attributes:\n      reference:           # Can be nested object\n        range: Reference\n        implements:\n          - linkml:authoritative_reference\n```\n\n**OR** use a flat structure:\n\n```yaml\nclasses:\n  Evidence:\n    attributes:\n      reference_id:        # Can be flat string\n        range: string\n        implements:\n          - linkml:authoritative_reference\n```\n\n### 2. Mark Excerpt Fields\n\nUse `implements: [linkml:excerpt]` on slots containing quoted text:\n\n```yaml\nclasses:\n  Evidence:\n    attributes:\n      supporting_text:     # The quote to validate\n        range: string\n        implements:\n          - linkml:excerpt\n```\n\n### 3. 
Define Reference Structure\n\nIf using nested references, define the Reference class:\n\n```yaml\nclasses:\n  Reference:\n    attributes:\n      id:\n        identifier: true\n        range: string\n      title:              # Optional: validates if provided\n        range: string\n```\n\n---\n\n## Data Formats\n\n### Nested Reference (Recommended)\n\n```yaml\nevidence:\n  reference:\n    id: PMID:12345678\n    title: \"Study of Protein X\"\n  supporting_text: \"protein functions in cell cycle regulation\"\n```\n\n### Flat Reference ID\n\n```yaml\nevidence:\n  reference_id: PMID:12345678\n  supporting_text: \"protein functions in cell cycle regulation\"\n```\n\n### Multiple Evidence Items\n\n```yaml\nstatement:\n  text: \"Protein X has multiple functions\"\n  evidence:\n    - reference:\n        id: PMID:11111111\n      supporting_text: \"protein functions in cell cycle\"\n    - reference:\n        id: PMID:22222222\n      supporting_text: \"protein regulates DNA repair\"\n```\n\n---\n\n## Text Matching Syntax\n\n### Editorial Notes `[...]`\n\nUse square brackets for editorial insertions that should be ignored during matching:\n\n```yaml\nsupporting_text: \"protein [X] functions in cell cycle regulation\"\n# Matches: \"protein functions in cell cycle regulation\"\n# Ignores: \"X\"\n```\n\n**Use cases:**\n- `[sic]` - Original spelling\n- `[emphasis added]` - Added emphasis\n- `[gene name]` - Clarifications\n- `[...]` - Omitted content markers\n\n### Omitted Text `...`\n\nUse ellipsis for gaps in quoted text:\n\n```yaml\nsupporting_text: \"protein functions ... 
in cell cycle regulation\"\n# Matches both parts independently:\n# - \"protein functions\"\n# - \"in cell cycle regulation\"\n```\n\n**Requirements:**\n- Both parts must appear in the reference (order independent)\n- Each part must be a substring match after normalization\n\n### Text Normalization\n\nBefore matching, text is normalized:\n- Greek letters spelled out (α→alpha, β→beta, etc.)\n- Lowercased\n- Punctuation removed\n- Extra whitespace collapsed\n\n**Examples:**\n```\n\"T-Cell Receptor\"     → \"t cell receptor\"\n\"TP53 (p53) protein\"  → \"tp53 p53 protein\"\n\"α-catenin\"          → \"alpha catenin\"\n\"β-actin\"            → \"beta actin\"\n\"γ-tubulin\"          → \"gamma tubulin\"\n```\n\n**Greek Letter Support:**\n\nAll Greek letters (both uppercase and lowercase) are converted to their spelled-out English equivalents. This ensures:\n- **Bidirectional matching**: \"α-catenin\" in a query matches \"alpha-catenin\" in the reference, and vice versa\n- **Preserved distinctions**: \"α-catenin\" and \"β-catenin\" remain distinct (not collapsed to just \"catenin\")\n- **Consistent behavior**: Works with any Greek letter commonly used in biomedical nomenclature\n\n---\n\n## Caching\n\nReferences are automatically cached to disk to:\n- Speed up repeated validations\n- Reduce API calls to PubMed\n- Enable offline validation\n\n### Cache Structure\n\n```\nreferences_cache/\n├── PMID_12345678.md\n├── PMID_98765432.md\n└── PMC_7654321.md\n```\n\n### Cache File Format\n\nCache files are stored as Markdown with YAML frontmatter for easy readability and compatibility:\n\n```markdown\n---\nreference_id: PMID:12345678\ntitle: TP53 in cell cycle control\nauthors:\n- Smith J\n- Doe A\n- Johnson K\njournal: Nature\nyear: '2024'\ndoi: 10.1038/nature12345\ncontent_type: full_text_xml\n---\n\n# TP53 in cell cycle control\n**Authors:** Smith J, Doe A, Johnson K\n**Journal:** Nature (2024)\n**DOI:** [10.1038/nature12345](https://doi.org/10.1038/nature12345)\n\n## 
Content\n\n[Full text content follows...]\n```\n\n**Note:** The validator still supports reading legacy `.txt` format cache files for backward compatibility.\n\n### Cache Management\n\n```bash\n# Use custom cache directory\nlinkml-reference-validator validate text \\\n  \"quote\" PMID:123 \\\n  --cache-dir /path/to/cache\n\n# Pre-cache references\nlinkml-reference-validator cache reference PMID:12345678\n\n# Force re-fetch (bypass cache)\nlinkml-reference-validator cache reference PMID:12345678 --force\n```\n\n---\n\n## Examples\n\n### Example 1: Simple Gene Function Validation\n\n**Schema** (`gene.yaml`):\n```yaml\nid: https://example.org/genes\nname: gene-schema\n\nclasses:\n  GeneFunctionStatement:\n    tree_root: true\n    attributes:\n      gene_symbol:\n        range: string\n      function_description:\n        range: string\n      evidence:\n        range: Evidence\n\n  Evidence:\n    attributes:\n      reference:\n        range: Reference\n        implements:\n          - linkml:authoritative_reference\n      supporting_text:\n        range: string\n        implements:\n          - linkml:excerpt\n\n  Reference:\n    attributes:\n      id:\n        identifier: true\n```\n\n**Data** (`tp53.yaml`):\n```yaml\ngene_symbol: TP53\nfunction_description: \"tumor suppressor\"\nevidence:\n  reference:\n    id: PMID:12345678\n  supporting_text: \"TP53 functions as a tumor suppressor\"\n```\n\n**Validation:**\n```bash\nlinkml-validate \\\n  --schema gene.yaml \\\n  --validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin \\\n  tp53.yaml\n```\n\n---\n\n### Example 2: Quick Text Check\n\n```bash\n# Check if a quote is in a paper\nlinkml-reference-validator validate text \\\n  \"protein kinase activity regulates cell proliferation\" \\\n  PMID:12345678\n\n# With editorial note\nlinkml-reference-validator validate text \\\n  \"protein kinase [PKA] activity regulates cell proliferation\" \\\n  PMID:12345678\n\n# Multi-part 
quote\nlinkml-reference-validator validate text \\\n  \"protein kinase activity ... regulates cell proliferation\" \\\n  PMID:12345678\n```\n\n---\n\n### Example 3: Batch Validation with Multiple References\n\n**Data** (`gene_annotations.yaml`):\n```yaml\n- gene_symbol: BRCA1\n  annotations:\n    - function: \"DNA repair\"\n      evidence:\n        reference:\n          id: PMID:11111111\n        supporting_text: \"BRCA1 plays a critical role in DNA repair\"\n    - function: \"tumor suppressor\"\n      evidence:\n        reference:\n          id: PMID:22222222\n        supporting_text: \"BRCA1 functions as a tumor suppressor\"\n\n- gene_symbol: TP53\n  annotations:\n    - function: \"cell cycle regulation\"\n      evidence:\n        reference:\n          id: PMID:33333333\n        supporting_text: \"TP53 regulates cell cycle checkpoints\"\n```\n\n**Validation:**\n```bash\nlinkml-reference-validator validate data \\\n  gene_annotations.yaml \\\n  --schema gene_schema.yaml \\\n  --verbose\n\n# Output shows validation for each reference:\n# ✓ PMID:11111111 - \"BRCA1 plays a critical role in DNA repair\"\n# ✓ PMID:22222222 - \"BRCA1 functions as a tumor suppressor\"\n# ✓ PMID:33333333 - \"TP53 regulates cell cycle checkpoints\"\n```\n\n---\n\n## Validation Rules\n\n### What Passes\n\n✅ **Exact substring match** (after normalization)\n```yaml\nsupporting_text: \"protein functions in cells\"\nreference_content: \"The protein functions in cells during mitosis.\"\n# ✓ PASS - exact substring found\n```\n\n✅ **Multi-part match**\n```yaml\nsupporting_text: \"protein functions ... 
during mitosis\"\nreference_content: \"The protein functions in cells during mitosis.\"\n# ✓ PASS - both parts found\n```\n\n✅ **Editorial notes ignored**\n```yaml\nsupporting_text: \"protein [X] functions\"\nreference_content: \"The protein functions in cells.\"\n# ✓ PASS - [X] ignored in matching\n```\n\n✅ **Case and punctuation normalized**\n```yaml\nsupporting_text: \"T-Cell Receptor\"\nreference_content: \"The t cell receptor binds antigens.\"\n# ✓ PASS - normalized to \"t cell receptor\"\n```\n\n### What Fails\n\n❌ **Text not in reference**\n```yaml\nsupporting_text: \"protein inhibits apoptosis\"\nreference_content: \"The protein functions in cells.\"\n# ✗ FAIL - \"inhibits apoptosis\" not found\n```\n\n❌ **Partial multi-part match**\n```yaml\nsupporting_text: \"protein functions ... inhibits apoptosis\"\nreference_content: \"The protein functions in cells.\"\n# ✗ FAIL - second part not found\n```\n\n❌ **Reference not accessible**\n```yaml\nsupporting_text: \"any quote\"\nreference_id: PMID:99999999\n# ✗ FAIL - reference doesn't exist or can't be fetched\n```\n\n❌ **Title mismatch** (when title provided)\n```yaml\nreference:\n  id: PMID:12345678\n  title: \"Wrong Title\"\nsupporting_text: \"correct quote\"\n# ✗ FAIL - title doesn't match fetched reference\n```\n\n---\n\n## Configuration\n\n### Configuration File\n\nYou can create a `.linkml-reference-validator.yaml` file in your project root to configure validation behavior:\n\n```yaml\nvalidation:\n  cache_dir: references_cache\n  rate_limit_delay: 0.5\n\n  # Skip validation for specific prefixes (useful for unsupported reference types)\n  skip_prefixes:\n    - SRA           # Sequence Read Archive\n    - MGNIFY        # MGnify database\n    - BIOPROJECT    # NCBI BioProject (currently has API issues)\n\n  # Control severity for unfetchable references\n  unknown_prefix_severity: WARNING  # Options: ERROR, WARNING, INFO\n\n  # Map alternate prefixes to canonical ones\n  reference_prefix_map:\n    geo: GEO\n  
  NCBIGeo: GEO\n```\n\n### Configuration Options\n\n#### `skip_prefixes` (list of strings)\n\nList of reference prefixes to skip during validation. References with these prefixes will return `is_valid=True` with `INFO` severity, allowing validation to pass without blocking your workflow.\n\n**Use cases:**\n- Unsupported reference types (SRA, MGnify, etc.)\n- References that are temporarily unavailable\n- Third-party databases without registered handlers\n\n**Example:**\n```yaml\nvalidation:\n  skip_prefixes:\n    - SRA\n    - MGNIFY\n    - BIOPROJECT\n```\n\nWith this configuration:\n```bash\n# These will pass validation with INFO severity\nlinkml-reference-validator validate text \"some text\" SRA:PRJNA290729\n# ✓ Valid: True (INFO) - Skipping validation for reference with prefix 'SRA'\n\nlinkml-reference-validator validate text \"some text\" MGNIFY:MGYS00000596\n# ✓ Valid: True (INFO) - Skipping validation for reference with prefix 'MGNIFY'\n```\n\n#### `unknown_prefix_severity` (ERROR | WARNING | INFO)\n\nControl the severity level for references that cannot be fetched (unsupported prefix, network error, etc.). Default: `ERROR`\n\n**Options:**\n- `ERROR` (default) - Validation fails, blocking workflow\n- `WARNING` - Validation fails but with lower severity\n- `INFO` - Validation fails but logged as informational\n\n**Note:** `skip_prefixes` takes precedence over `unknown_prefix_severity`. 
If a prefix is in `skip_prefixes`, it will return `is_valid=True` with `INFO` severity regardless of this setting.\n\n**Example:**\n```yaml\nvalidation:\n  skip_prefixes:\n    - SRA              # These will be skipped (is_valid=True, INFO)\n  unknown_prefix_severity: WARNING  # Other unfetchable refs get WARNING\n```\n\nWith this configuration:\n```bash\n# SRA is skipped (from skip_prefixes)\nlinkml-reference-validator validate text \"text\" SRA:PRJNA290729\n# ✓ Valid: True (INFO) - Skipping validation\n\n# UNKNOWN prefix gets WARNING severity\nlinkml-reference-validator validate text \"text\" UNKNOWN:12345\n# ✗ Valid: False (WARNING) - Could not fetch reference\n```\n\n### Cache Directory\n\nDefault: `references_cache/` in current directory\n\n```bash\n# Custom cache location\nexport REFERENCE_CACHE_DIR=/path/to/cache\nlinkml-reference-validator validate text \"quote\" PMID:123\n\n# Or use CLI option\nlinkml-reference-validator validate text \"quote\" PMID:123 \\\n  --cache-dir /path/to/cache\n```\n\n### NCBI API Settings\n\nThe tool respects NCBI API rate limits (3 requests/second without API key).\n\n**Optional: Set email for NCBI Entrez (recommended):**\n```bash\nexport NCBI_EMAIL=\"your.email@example.com\"\n```\n\n**Optional: Use NCBI API key for higher rate limits:**\n```bash\nexport NCBI_API_KEY=\"your_api_key_here\"\n```\n\n---\n\n## Troubleshooting\n\n### Common Issues\n\n#### \"Could not fetch reference: PMID:12345678\"\n\n**Causes:**\n- PMID doesn't exist\n- Network connectivity issues\n- NCBI API temporarily unavailable\n\n**Solutions:**\n```bash\n# Verify PMID exists on PubMed\n# Check network connection\n# Try again later (NCBI may be down)\n```\n\n#### \"No content available for reference\"\n\n**Causes:**\n- Abstract not available\n- Article behind paywall (no PMC access)\n- Retracted article\n\n**Solutions:**\n```bash\n# Check if article has abstract on PubMed\n# Look for PMC full text availability\n# Try a different reference\n```\n\n#### 
\"Supporting text not found\"\n\n**Causes:**\n- Quote is incorrect or paraphrased\n- Text only in figures/tables (not extracted)\n- Text uses different terminology\n- Unicode characters normalized out\n\n**Solutions:**\n```bash\n# Verify exact quote from PDF/HTML\n# Try shorter, more specific quote\n# Check if text is in figure caption\n# Use editorial notes for differences: \"protein [X] functions\"\n```\n\n#### \"Query is empty after removing brackets\"\n\n**Cause:**\n- Entire supporting_text is in brackets: `\"[editorial note]\"`\n\n**Solution:**\n```yaml\n# Include actual quote text\nsupporting_text: \"protein functions [in cells]\"\n# Not just: \"[editorial note]\"\n```\n\n---\n\n## Performance\n\n### Benchmarks\n\n- **First validation:** ~2-3 seconds (includes fetch + cache)\n- **Cached validation:** ~10-50ms\n- **Batch validation:** ~50ms per reference (cached)\n\n### Tips for Speed\n\n1. **Pre-cache references:**\n   ```bash\n   # Cache all references before validation\n   for pmid in PMID:111 PMID:222 PMID:333; do\n     linkml-reference-validator cache reference $pmid\n   done\n   ```\n\n2. **Reuse cache directory:**\n   ```bash\n   # Share cache across projects\n   export REFERENCE_CACHE_DIR=~/.reference_cache\n   ```\n\n3. 
**Use verbose mode to see what's slow:**\n   ```bash\n   linkml-reference-validator validate data data.yaml \\\n     --schema schema.yaml \\\n     --verbose\n   ```\n\n---\n\n## Development\n\n### Installation for Development\n\n```bash\n# Clone repository\ngit clone https://github.com/linkml/linkml-reference-validator\ncd linkml-reference-validator\n\n# Install with dev dependencies\nuv sync --group dev\n\n# Run tests\njust test\n\n# Run specific test\nuv run pytest tests/test_cli.py::test_validate_text_command_success\n```\n\n### Project Structure\n\n```\nlinkml-reference-validator/\n├── src/linkml_reference_validator/\n│   ├── cli.py                    # CLI commands\n│   ├── models.py                 # Data models\n│   ├── validation/\n│   │   └── supporting_text_validator.py  # Core validation logic\n│   ├── etl/\n│   │   └── reference_fetcher.py  # Reference fetching\n│   └── plugins/\n│       └── reference_validation_plugin.py  # LinkML plugin\n├── tests/\n│   ├── fixtures/                 # Test reference files\n│   ├── test_cli.py              # CLI tests\n│   ├── test_e2e_integration.py  # End-to-end tests\n│   └── ...\n├── justfile                      # Development commands\n└── pyproject.toml               # Project configuration\n```\n\n### Running Tests\n\n```bash\n# All tests\njust test\n\n# Just pytest\njust pytest\n\n# With coverage\nuv run pytest --cov=src/linkml_reference_validator\n\n# Specific test file\nuv run pytest tests/test_cli.py\n\n# Doctests\njust doctest\n```\n\n---\n\n## API Usage (Python)\n\nWhile the CLI is recommended, you can also use the Python API:\n\n```python\nfrom linkml_reference_validator.validation.supporting_text_validator import (\n    SupportingTextValidator\n)\nfrom linkml_reference_validator.models import ReferenceValidationConfig\n\n# Create validator\nconfig = ReferenceValidationConfig(cache_dir=\"my_cache\")\nvalidator = SupportingTextValidator(config)\n\n# Validate text\nresult = validator.validate(\n    
supporting_text=\"protein functions in cell cycle regulation\",\n    reference_id=\"PMID:12345678\",\n)\n\nprint(result.is_valid)  # True/False\nprint(result.message)   # Validation message\n```\n\n---\n\n## Limitations\n\n### Current Limitations\n\n1. **PubMed only** - Currently only supports PMID references (DOI and URLs coming soon)\n2. **Text extraction** - Only extracts text from abstracts and main article text (not figures, tables, or supplementary materials)\n3. **Unicode normalization** - Greek letters are spelled out (e.g., α → alpha, β → beta), but other special symbols such as subscripts may be altered or dropped during normalization\n4. **No fuzzy matching in validation** - Validation uses deterministic substring matching only (intentional design choice); similarity scoring is used only by `repair` to suggest corrections\n5. **English-centric** - Text normalization assumes English text\n\n### Known Edge Cases\n\n- **Greek letters:** \"α-catenin\" is normalized to \"alpha catenin\", so it matches \"alpha-catenin\" but not \"catenin\" alone\n- **Chemical formulas:** \"H₂O\" becomes \"h o\" or \"h2o\"\n- **Hyphens:** \"T-cell\" matches \"t cell\"\n- **Abbreviations:** Must match exactly as they appear (normalized)\n\n---\n\n## Comparison to Other Tools\n\n### vs. Manual Verification\n\n| Manual | linkml-reference-validator |\n|--------|----------------------------|\n| ❌ Time consuming | ✅ Automated |\n| ❌ Error prone | ✅ Consistent |\n| ❌ Not scalable | ✅ Validates 100s of quotes |\n| ❌ Not reproducible | ✅ Cached, versioned |\n\n### vs. Fuzzy Matching Tools\n\nlinkml-reference-validator uses **deterministic substring matching**, not fuzzy matching:\n\n✅ **Predictable** - Same input always gives same result\n✅ **Explainable** - Easy to understand why validation passed/failed\n✅ **No false positives** - Won't accept paraphrased text\n✅ **Fast** - No complex similarity calculations\n\n---\n\n## Contributing\n\nContributions welcome! 
See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n### Priority Areas\n\n- [ ] DOI support\n- [ ] URL/webpage support\n- [ ] Better Unicode handling\n- [ ] Performance improvements for large batches\n- [ ] More comprehensive error messages\n\n---\n\n## Citation\n\nIf you use this tool in your research, please cite:\n\n```bibtex\n@software{linkml_reference_validator,\n  title = {linkml-reference-validator: Validation of supporting text from references},\n  author = {Mungall, Chris},\n  year = {2024},\n  url = {https://github.com/linkml/linkml-reference-validator}\n}\n```\n\n---\n\n## License\n\nApache 2.0 - see [LICENSE](LICENSE)\n\n---\n\n## Support\n\n- **Issues:** [GitHub Issues](https://github.com/linkml/linkml-reference-validator/issues)\n- **Discussions:** [GitHub Discussions](https://github.com/linkml/linkml-reference-validator/discussions)\n- **Documentation:** [Full Documentation](https://linkml.github.io/linkml-reference-validator)\n\n---\n\n## Related Projects\n\n- [LinkML](https://linkml.io/) - Modeling language for linked data\n- [linkml-validator](https://github.com/linkml/linkml) - Core LinkML validation\n- [ai-gene-reviews](https://github.com/monarch-initiative/ai-gene-reviews) - Inspiration for this project\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinkml%2Flinkml-reference-validator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flinkml%2Flinkml-reference-validator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinkml%2Flinkml-reference-validator/lists"}