An open API service indexing awesome lists of open source software.

https://github.com/linkml/linkml-reference-validator

Validate that supporting text quotes in your data actually appear in their cited references
https://github.com/linkml/linkml-reference-validator

agentic-ai ai-guardrails ai4curation linkml monarchinitiative pubmed

Last synced: 4 months ago
JSON representation

Validate that supporting text quotes in your data actually appear in their cited references

Awesome Lists containing this project

README

          

# linkml-reference-validator

[![Tests](https://img.shields.io/badge/tests-91%20passing-brightgreen)]()
[![Coverage](https://img.shields.io/badge/coverage-86%25-green)]()
[![Python](https://img.shields.io/badge/python-3.10%2B-blue)]()

Validate that supporting text quotes in your data actually appear in their cited references.

This tool fetches scientific publications (currently PubMed/PMC) and verifies that quoted text (`supporting_text`) can be found in the referenced document using deterministic substring matching.

---

## Quick Start

### Installation

```bash
# Using uv (recommended)
uv pip install linkml-reference-validator

# Using pip
pip install linkml-reference-validator
```

### Basic Usage

```bash
# Validate a single quote against a reference
linkml-reference-validator validate text \
"protein functions in cell cycle regulation" \
PMID:12345678

# Validate a data file using LinkML validation
linkml-validate -s schema.yaml data.yaml \
--validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin
```

---

## Why Use This Tool?

Scientific data often includes claims supported by quotes from publications. But how do you know the quotes are accurate?

**Before:**
```yaml
gene_function:
gene: TP53
function: "regulates cell cycle"
evidence:
reference: PMID:12345678
supporting_text: "TP53 is critical for cell cycle regulation" # Is this really in the paper?
```

**After validation:**
```bash
$ linkml-reference-validator validate text \
"TP53 is critical for cell cycle regulation" \
PMID:12345678

✓ Valid: True
✓ Supporting text validated successfully in PMID:12345678
```

---

## CLI Commands

> **Note:** The CLI was restructured in v1.x to use nested commands (`validate text`, `validate data`, `cache reference`). The old hyphenated commands (`validate-text`, `validate-data`, `cache-reference`) still work for backward compatibility but are deprecated.

### 1. `validate text` - Quick Single Quote Validation

Validate a single quote against a reference without needing a schema.

```bash
linkml-reference-validator validate text [OPTIONS]
```

**Example:**
```bash
# Basic validation
linkml-reference-validator validate text \
"protein functions in cell cycle regulation" \
PMID:12345678

# With editorial notes (ignored in matching)
linkml-reference-validator validate text \
"protein [X] functions in cell cycle regulation" \
PMID:12345678

# Multi-part quote with omitted text
linkml-reference-validator validate text \
"protein functions ... cell cycle regulation" \
PMID:12345678
```

**Options:**
- `--cache-dir PATH` - Directory for caching references (default: `references_cache`)
- `--verbose` - Show detailed validation information
- `--help` - Show help message

**Exit Codes:**
- `0` - Validation successful
- `1` - Validation failed

---

### 2. `validate data` - Full Data File Validation

Validate an entire data file using a LinkML schema.

```bash
linkml-reference-validator validate data --schema [OPTIONS]
```

**Example:**

**Schema** (`gene_schema.yaml`):
```yaml
id: https://example.org/genes
name: gene-schema

classes:
GeneFunction:
attributes:
gene:
range: string
function:
range: string
evidence:
range: Evidence

Evidence:
attributes:
reference:
range: Reference
implements:
- linkml:authoritative_reference # Marks this as a reference field
supporting_text:
range: string
implements:
- linkml:excerpt # Marks this as text to validate

Reference:
attributes:
id:
identifier: true
range: string
title:
range: string
```

**Data** (`gene_data.yaml`):
```yaml
gene: TP53
function: "regulates cell cycle"
evidence:
reference:
id: PMID:12345678
title: "TP53 in cell cycle control"
supporting_text: "TP53 protein functions in cell cycle regulation"
```

**Validation:**
```bash
linkml-reference-validator validate data \
gene_data.yaml \
--schema gene_schema.yaml

# Output:
Validating gene_data.yaml against schema gene_schema.yaml
Cache directory: references_cache

✓ All validations passed!
```

**Options:**
- `--schema PATH` (required) - Path to LinkML schema
- `--target-class CLASS` - Specific class to validate
- `--cache-dir PATH` - Directory for caching references
- `--verbose` - Show detailed output
- `--help` - Show help message

---

### 3. `repair` - Automated Repair of Validation Errors

Automatically fix or flag supporting text validation errors based on confidence thresholds.

```bash
# Repair a single quote
linkml-reference-validator repair text [OPTIONS]

# Repair a data file (dry run by default)
linkml-reference-validator repair data --schema [OPTIONS]
```

**Example - Single Quote:**
```bash
# Try to repair a quote with ASCII subscript
linkml-reference-validator repair text "CO2 levels were measured" PMID:12345678

# Output:
# ✓ Repaired successfully
# Original: CO2 levels were measured
# Repaired: CO₂ levels were measured
# Action: CHARACTER_NORMALIZATION
# Confidence: HIGH
```

**Example - Data File:**
```bash
# Dry run - show what would be changed
linkml-reference-validator repair data disease.yaml \
--schema schema.yaml \
--dry-run

# Apply auto-fixes (creates backup)
linkml-reference-validator repair data disease.yaml \
--schema schema.yaml \
--no-dry-run

# Custom output file
linkml-reference-validator repair data disease.yaml \
--schema schema.yaml \
--no-dry-run \
--output repaired.yaml
```

**Repair Report Output:**
```
============================================================
Repair Report
============================================================

HIGH CONFIDENCE FIXES (auto-applicable):
PMID:12345678 at evidence[0]:
Character normalization fix
'CO2 levels...' → 'CO₂ levels...'

SUGGESTED FIXES (review recommended):
PMID:23456789 at evidence[1]:
Inserted ellipsis between non-contiguous parts

RECOMMENDED REMOVALS (low confidence):
PMID:34567890 at evidence[2]:
Similarity: 8%
Snippet: 'Fabricated text that...'

------------------------------------------------------------
Summary:
Total items: 5
Already valid: 2
Auto-fixes: 1
Suggestions: 1
Removals: 1
Unverifiable: 0
```

**Repair Strategies:**

| Strategy | Confidence | Description |
|----------|------------|-------------|
| Character Normalization | HIGH | Fix Unicode/symbol differences (CO2→CO₂, +/-→±) |
| Ellipsis Insertion | MEDIUM | Insert `...` between non-contiguous text parts |
| Fuzzy Correction | VARIES | Suggest closest matching text from reference |
| Removal | VERY_LOW | Flag fabricated/not-found text for manual removal |

**Options:**
- `--dry-run / --no-dry-run` - Show changes without applying (default: dry-run)
- `--auto-fix-threshold FLOAT` - Minimum similarity for auto-fixes (default: 0.95)
- `--output PATH` - Output file path (default: overwrite with backup)
- `--config PATH` - Path to repair configuration file
- `--cache-dir PATH` - Directory for caching references
- `--verbose` - Show detailed output

---

### 4. `cache reference` - Pre-cache References

Download and cache references for offline use.

```bash
linkml-reference-validator cache reference [OPTIONS]
```

**Example:**
```bash
# Cache a single reference
linkml-reference-validator cache reference PMID:12345678

# Output:
Fetching PMID:12345678...
✓ Successfully cached PMID:12345678
Title: TP53 in cell cycle control
Authors: Smith J, Doe A, Johnson K
Content type: full_text_xml
Content length: 45231 characters
```

**Use Cases:**
- Pre-fetch references before validation
- Build offline reference library
- Verify reference availability

---

## Integration with `linkml-validate`

The recommended way to use this tool is as a **LinkML validation plugin** with the standard `linkml-validate` command.

### Setup

**1. Install both packages:**
```bash
uv pip install linkml linkml-reference-validator
```

**2. Create your schema with interface markers:**
```yaml
# my_schema.yaml
id: https://example.org/my-schema
name: my-schema

prefixes:
linkml: https://w3id.org/linkml/

classes:
Evidence:
attributes:
reference:
range: Reference
implements:
- linkml:authoritative_reference # <-- This marks it as a reference
supporting_text:
range: string
implements:
- linkml:excerpt # <-- This marks it as text to validate
```

**3. Validate using linkml-validate:**
```bash
linkml-validate \
--schema my_schema.yaml \
--validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin \
my_data.yaml
```

### Why Use the Plugin?

✅ **Integrated validation** - Combines schema validation + reference validation in one command
✅ **Standard LinkML workflow** - Uses familiar LinkML tools
✅ **Flexible schema design** - Works with any schema using the interface pattern
✅ **Rich error reporting** - Shows exactly where validation fails in your data

---

## Supported Reference Formats

### Currently Supported

#### PubMed IDs (PMID)
```yaml
reference:
id: PMID:12345678
supporting_text: "protein functions in cells"
```

**Fetches:**
- Abstract (always)
- Full text from PMC (when available)
- Metadata (title, authors, journal, year, DOI)

**ID Formats:**
- `PMID:12345678`
- `12345678` (assumes PMID)

### Coming Soon

- **DOI** - `DOI:10.1038/nature12345`
- **URLs** - Web pages and online documents

---

## Schema Requirements

For the validator to work, your LinkML schema must:

### 1. Mark Reference Fields

Use `implements: [linkml:authoritative_reference]` on slots that contain references:

```yaml
classes:
Evidence:
attributes:
reference: # Can be nested object
range: Reference
implements:
- linkml:authoritative_reference
```

**OR** use a flat structure:

```yaml
classes:
Evidence:
attributes:
reference_id: # Can be flat string
range: string
implements:
- linkml:authoritative_reference
```

### 2. Mark Excerpt Fields

Use `implements: [linkml:excerpt]` on slots containing quoted text:

```yaml
classes:
Evidence:
attributes:
supporting_text: # The quote to validate
range: string
implements:
- linkml:excerpt
```

### 3. Define Reference Structure

If using nested references, define the Reference class:

```yaml
classes:
Reference:
attributes:
id:
identifier: true
range: string
title: # Optional: validates if provided
range: string
```

---

## Data Formats

### Nested Reference (Recommended)

```yaml
evidence:
reference:
id: PMID:12345678
title: "Study of Protein X"
supporting_text: "protein functions in cell cycle regulation"
```

### Flat Reference ID

```yaml
evidence:
reference_id: PMID:12345678
supporting_text: "protein functions in cell cycle regulation"
```

### Multiple Evidence Items

```yaml
statement:
text: "Protein X has multiple functions"
evidence:
- reference:
id: PMID:11111111
supporting_text: "protein functions in cell cycle"
- reference:
id: PMID:22222222
supporting_text: "protein regulates DNA repair"
```

---

## Text Matching Syntax

### Editorial Notes `[...]`

Use square brackets for editorial insertions that should be ignored during matching:

```yaml
supporting_text: "protein [X] functions in cell cycle regulation"
# Matches: "protein functions in cell cycle regulation"
# Ignores: "X"
```

**Use cases:**
- `[sic]` - Original spelling
- `[emphasis added]` - Added emphasis
- `[gene name]` - Clarifications
- `[...]` - Omitted content markers

### Omitted Text `...`

Use ellipsis for gaps in quoted text:

```yaml
supporting_text: "protein functions ... in cell cycle regulation"
# Matches both parts independently:
# - "protein functions"
# - "in cell cycle regulation"
```

**Requirements:**
- Both parts must appear in the reference (order independent)
- Each part must be a substring match after normalization

### Text Normalization

Before matching, text is normalized:
- Greek letters spelled out (α→alpha, β→beta, etc.)
- Lowercased
- Punctuation removed
- Extra whitespace collapsed

**Examples:**
```
"T-Cell Receptor" → "t cell receptor"
"TP53 (p53) protein" → "tp53 p53 protein"
"α-catenin" → "alpha catenin"
"β-actin" → "beta actin"
"γ-tubulin" → "gamma tubulin"
```

**Greek Letter Support:**

All Greek letters (both uppercase and lowercase) are converted to their spelled-out English equivalents. This ensures:
- **Bidirectional matching**: "α-catenin" in a query matches "alpha-catenin" in the reference, and vice versa
- **Preserved distinctions**: "α-catenin" and "β-catenin" remain distinct (not collapsed to just "catenin")
- **Consistent behavior**: Works with any Greek letter commonly used in biomedical nomenclature

---

## Caching

References are automatically cached to disk to:
- Speed up repeated validations
- Reduce API calls to PubMed
- Enable offline validation

### Cache Structure

```
references_cache/
├── PMID_12345678.md
├── PMID_98765432.md
└── PMC_7654321.md
```

### Cache File Format

Cache files are stored as Markdown with YAML frontmatter for easy readability and compatibility:

```markdown
---
reference_id: PMID:12345678
title: TP53 in cell cycle control
authors:
- Smith J
- Doe A
- Johnson K
journal: Nature
year: '2024'
doi: 10.1038/nature12345
content_type: full_text_xml
---

# TP53 in cell cycle control
**Authors:** Smith J, Doe A, Johnson K
**Journal:** Nature (2024)
**DOI:** [10.1038/nature12345](https://doi.org/10.1038/nature12345)

## Content

[Full text content follows...]
```

**Note:** The validator still supports reading legacy `.txt` format cache files for backward compatibility.

### Cache Management

```bash
# Use custom cache directory
linkml-reference-validator validate text \
"quote" PMID:123 \
--cache-dir /path/to/cache

# Pre-cache references
linkml-reference-validator cache reference PMID:12345678

# Force re-fetch (bypass cache)
linkml-reference-validator cache reference PMID:12345678 --force
```

---

## Examples

### Example 1: Simple Gene Function Validation

**Schema** (`gene.yaml`):
```yaml
id: https://example.org/genes
name: gene-schema

classes:
GeneFunctionStatement:
tree_root: true
attributes:
gene_symbol:
range: string
function_description:
range: string
evidence:
range: Evidence

Evidence:
attributes:
reference:
range: Reference
implements:
- linkml:authoritative_reference
supporting_text:
range: string
implements:
- linkml:excerpt

Reference:
attributes:
id:
identifier: true
```

**Data** (`tp53.yaml`):
```yaml
gene_symbol: TP53
function_description: "tumor suppressor"
evidence:
reference:
id: PMID:12345678
supporting_text: "TP53 functions as a tumor suppressor"
```

**Validation:**
```bash
linkml-validate \
--schema gene.yaml \
--validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin \
tp53.yaml
```

---

### Example 2: Quick Text Check

```bash
# Check if a quote is in a paper
linkml-reference-validator validate text \
"protein kinase activity regulates cell proliferation" \
PMID:12345678

# With editorial note
linkml-reference-validator validate text \
"protein kinase [PKA] activity regulates cell proliferation" \
PMID:12345678

# Multi-part quote
linkml-reference-validator validate text \
"protein kinase activity ... regulates cell proliferation" \
PMID:12345678
```

---

### Example 3: Batch Validation with Multiple References

**Data** (`gene_annotations.yaml`):
```yaml
- gene_symbol: BRCA1
annotations:
- function: "DNA repair"
evidence:
reference:
id: PMID:11111111
supporting_text: "BRCA1 plays a critical role in DNA repair"
- function: "tumor suppressor"
evidence:
reference:
id: PMID:22222222
supporting_text: "BRCA1 functions as a tumor suppressor"

- gene_symbol: TP53
annotations:
- function: "cell cycle regulation"
evidence:
reference:
id: PMID:33333333
supporting_text: "TP53 regulates cell cycle checkpoints"
```

**Validation:**
```bash
linkml-reference-validator validate data \
gene_annotations.yaml \
--schema gene_schema.yaml \
--verbose

# Output shows validation for each reference:
# ✓ PMID:11111111 - "BRCA1 plays a critical role in DNA repair"
# ✓ PMID:22222222 - "BRCA1 functions as a tumor suppressor"
# ✓ PMID:33333333 - "TP53 regulates cell cycle checkpoints"
```

---

## Validation Rules

### What Passes

✅ **Exact substring match** (after normalization)
```yaml
supporting_text: "protein functions in cells"
reference_content: "The protein functions in cells during mitosis."
# ✓ PASS - exact substring found
```

✅ **Multi-part match**
```yaml
supporting_text: "protein functions ... during mitosis"
reference_content: "The protein functions in cells during mitosis."
# ✓ PASS - both parts found
```

✅ **Editorial notes ignored**
```yaml
supporting_text: "protein [X] functions"
reference_content: "The protein functions in cells."
# ✓ PASS - [X] ignored in matching
```

✅ **Case and punctuation normalized**
```yaml
supporting_text: "T-Cell Receptor"
reference_content: "The t cell receptor binds antigens."
# ✓ PASS - normalized to "t cell receptor"
```

### What Fails

❌ **Text not in reference**
```yaml
supporting_text: "protein inhibits apoptosis"
reference_content: "The protein functions in cells."
# ✗ FAIL - "inhibits apoptosis" not found
```

❌ **Partial multi-part match**
```yaml
supporting_text: "protein functions ... inhibits apoptosis"
reference_content: "The protein functions in cells."
# ✗ FAIL - second part not found
```

❌ **Reference not accessible**
```yaml
supporting_text: "any quote"
reference_id: PMID:99999999
# ✗ FAIL - reference doesn't exist or can't be fetched
```

❌ **Title mismatch** (when title provided)
```yaml
reference:
id: PMID:12345678
title: "Wrong Title"
supporting_text: "correct quote"
# ✗ FAIL - title doesn't match fetched reference
```

---

## Configuration

### Configuration File

You can create a `.linkml-reference-validator.yaml` file in your project root to configure validation behavior:

```yaml
validation:
cache_dir: references_cache
rate_limit_delay: 0.5

# Skip validation for specific prefixes (useful for unsupported reference types)
skip_prefixes:
- SRA # Sequence Read Archive
- MGNIFY # MGnify database
- BIOPROJECT # NCBI BioProject (currently has API issues)

# Control severity for unfetchable references
unknown_prefix_severity: WARNING # Options: ERROR, WARNING, INFO

# Map alternate prefixes to canonical ones
reference_prefix_map:
geo: GEO
NCBIGeo: GEO
```

### Configuration Options

#### `skip_prefixes` (list of strings)

List of reference prefixes to skip during validation. References with these prefixes will return `is_valid=True` with `INFO` severity, allowing validation to pass without blocking your workflow.

**Use cases:**
- Unsupported reference types (SRA, MGnify, etc.)
- References that are temporarily unavailable
- Third-party databases without registered handlers

**Example:**
```yaml
validation:
skip_prefixes:
- SRA
- MGNIFY
- BIOPROJECT
```

With this configuration:
```bash
# These will pass validation with INFO severity
linkml-reference-validator validate text "some text" SRA:PRJNA290729
# ✓ Valid: True (INFO) - Skipping validation for reference with prefix 'SRA'

linkml-reference-validator validate text "some text" MGNIFY:MGYS00000596
# ✓ Valid: True (INFO) - Skipping validation for reference with prefix 'MGNIFY'
```

#### `unknown_prefix_severity` (ERROR | WARNING | INFO)

Control the severity level for references that cannot be fetched (unsupported prefix, network error, etc.). Default: `ERROR`

**Options:**
- `ERROR` (default) - Validation fails, blocking workflow
- `WARNING` - Validation fails but with lower severity
- `INFO` - Validation fails but logged as informational

**Note:** `skip_prefixes` takes precedence over `unknown_prefix_severity`. If a prefix is in `skip_prefixes`, it will return `is_valid=True` with `INFO` severity regardless of this setting.

**Example:**
```yaml
validation:
skip_prefixes:
- SRA # These will be skipped (is_valid=True, INFO)
unknown_prefix_severity: WARNING # Other unfetchable refs get WARNING
```

With this configuration:
```bash
# SRA is skipped (from skip_prefixes)
linkml-reference-validator validate text "text" SRA:PRJNA290729
# ✓ Valid: True (INFO) - Skipping validation

# UNKNOWN prefix gets WARNING severity
linkml-reference-validator validate text "text" UNKNOWN:12345
# ✗ Valid: False (WARNING) - Could not fetch reference
```

### Cache Directory

Default: `references_cache/` in current directory

```bash
# Custom cache location
export REFERENCE_CACHE_DIR=/path/to/cache
linkml-reference-validator validate text "quote" PMID:123

# Or use CLI option
linkml-reference-validator validate text "quote" PMID:123 \
--cache-dir /path/to/cache
```

### NCBI API Settings

The tool respects NCBI API rate limits (3 requests/second without API key).

**Optional: Set email for NCBI Entrez (recommended):**
```bash
export NCBI_EMAIL="your.email@example.com"
```

**Optional: Use NCBI API key for higher rate limits:**
```bash
export NCBI_API_KEY="your_api_key_here"
```

---

## Troubleshooting

### Common Issues

#### "Could not fetch reference: PMID:12345678"

**Causes:**
- PMID doesn't exist
- Network connectivity issues
- NCBI API temporarily unavailable

**Solutions:**
```bash
# Verify PMID exists on PubMed
# Check network connection
# Try again later (NCBI may be down)
```

#### "No content available for reference"

**Causes:**
- Abstract not available
- Article behind paywall (no PMC access)
- Retracted article

**Solutions:**
```bash
# Check if article has abstract on PubMed
# Look for PMC full text availability
# Try a different reference
```

#### "Supporting text not found"

**Causes:**
- Quote is incorrect or paraphrased
- Text only in figures/tables (not extracted)
- Text uses different terminology
- Unicode characters normalized out

**Solutions:**
```bash
# Verify exact quote from PDF/HTML
# Try shorter, more specific quote
# Check if text is in figure caption
# Use editorial notes for differences: "protein [X] functions"
```

#### "Query is empty after removing brackets"

**Cause:**
- Entire supporting_text is in brackets: `"[editorial note]"`

**Solution:**
```yaml
# Include actual quote text
supporting_text: "protein functions [in cells]"
# Not just: "[editorial note]"
```

---

## Performance

### Benchmarks

- **First validation:** ~2-3 seconds (includes fetch + cache)
- **Cached validation:** ~10-50ms
- **Batch validation:** ~50ms per reference (cached)

### Tips for Speed

1. **Pre-cache references:**
```bash
# Cache all references before validation
for pmid in PMID:111 PMID:222 PMID:333; do
linkml-reference-validator cache reference $pmid
done
```

2. **Reuse cache directory:**
```bash
# Share cache across projects
export REFERENCE_CACHE_DIR=~/.reference_cache
```

3. **Use verbose mode to see what's slow:**
```bash
linkml-reference-validator validate data data.yaml \
--schema schema.yaml \
--verbose
```

---

## Development

### Installation for Development

```bash
# Clone repository
git clone https://github.com/linkml/linkml-reference-validator
cd linkml-reference-validator

# Install with dev dependencies
uv sync --group dev

# Run tests
just test

# Run specific test
uv run pytest tests/test_cli.py::test_validate_text_command_success
```

### Project Structure

```
linkml-reference-validator/
├── src/linkml_reference_validator/
│ ├── cli.py # CLI commands
│ ├── models.py # Data models
│ ├── validation/
│ │ └── supporting_text_validator.py # Core validation logic
│ ├── etl/
│ │ └── reference_fetcher.py # Reference fetching
│ └── plugins/
│ └── reference_validation_plugin.py # LinkML plugin
├── tests/
│ ├── fixtures/ # Test reference files
│ ├── test_cli.py # CLI tests
│ ├── test_e2e_integration.py # End-to-end tests
│ └── ...
├── justfile # Development commands
└── pyproject.toml # Project configuration
```

### Running Tests

```bash
# All tests
just test

# Just pytest
just pytest

# With coverage
uv run pytest --cov=src/linkml_reference_validator

# Specific test file
uv run pytest tests/test_cli.py

# Doctests
just doctest
```

---

## API Usage (Python)

While the CLI is recommended, you can also use the Python API:

```python
from linkml_reference_validator.validation.supporting_text_validator import (
SupportingTextValidator
)
from linkml_reference_validator.models import ReferenceValidationConfig

# Create validator
config = ReferenceValidationConfig(cache_dir="my_cache")
validator = SupportingTextValidator(config)

# Validate text
result = validator.validate(
supporting_text="protein functions in cell cycle regulation",
reference_id="PMID:12345678",
)

print(result.is_valid) # True/False
print(result.message) # Validation message
```

---

## Limitations

### Current Limitations

1. **PubMed only** - Currently only supports PMID references (DOI and URLs coming soon)
2. **Text extraction** - Only extracts text from abstracts and main article text (not figures, tables, or supplementary materials)
3. **Unicode normalization** - Greek letters and special symbols are removed during normalization (e.g., α → a, β → b)
4. **No fuzzy matching** - Uses deterministic substring matching only (intentional design choice)
5. **English-centric** - Text normalization assumes English text

### Known Edge Cases

- **Greek letters:** "α-catenin" matches "a catenin" or "catenin"
- **Chemical formulas:** "H₂O" becomes "h o" or "h2o"
- **Hyphens:** "T-cell" matches "t cell"
- **Abbreviations:** Must match exactly as they appear (normalized)

---

## Comparison to Other Tools

### vs. Manual Verification

| Manual | linkml-reference-validator |
|--------|----------------------------|
| ❌ Time consuming | ✅ Automated |
| ❌ Error prone | ✅ Consistent |
| ❌ Not scalable | ✅ Validates 100s of quotes |
| ❌ Not reproducible | ✅ Cached, versioned |

### vs. Fuzzy Matching Tools

linkml-reference-validator uses **deterministic substring matching**, not fuzzy matching:

✅ **Predictable** - Same input always gives same result
✅ **Explainable** - Easy to understand why validation passed/failed
✅ **No false positives** - Won't accept paraphrased text
✅ **Fast** - No complex similarity calculations

---

## Contributing

Contributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

### Priority Areas

- [ ] DOI support
- [ ] URL/webpage support
- [ ] Better Unicode handling
- [ ] Performance improvements for large batches
- [ ] More comprehensive error messages

---

## Citation

If you use this tool in your research, please cite:

```bibtex
@software{linkml_reference_validator,
title = {linkml-reference-validator: Validation of supporting text from references},
author = {Mungall, Chris},
year = {2024},
url = {https://github.com/linkml/linkml-reference-validator}
}
```

---

## License

Apache 2.0 - see [LICENSE](LICENSE)

---

## Support

- **Issues:** [GitHub Issues](https://github.com/linkml/linkml-reference-validator/issues)
- **Discussions:** [GitHub Discussions](https://github.com/linkml/linkml-reference-validator/discussions)
- **Documentation:** [Full Documentation](https://linkml.github.io/linkml-reference-validator)

---

## Related Projects

- [LinkML](https://linkml.io/) - Modeling language for linked data
- [linkml-validator](https://github.com/linkml/linkml) - Core LinkML validation
- [ai-gene-reviews](https://github.com/monarch-initiative/ai-gene-reviews) - Inspiration for this project