{"id":35621814,"url":"https://github.com/linkml/linkml-term-validator","last_synced_at":"2026-04-01T17:57:49.855Z","repository":{"id":327786738,"uuid":"1098500382","full_name":"linkml/linkml-term-validator","owner":"linkml","description":"Validating LinkML schemas and datasets that depend on external terms","archived":false,"fork":false,"pushed_at":"2026-03-19T01:17:25.000Z","size":1318,"stargazers_count":5,"open_issues_count":12,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-19T08:38:19.896Z","etag":null,"topics":["linkml","obofoundry","ontologies","validation"],"latest_commit_sha":null,"homepage":"http://linkml.io/linkml-term-validator/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/linkml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-17T19:16:04.000Z","updated_at":"2026-03-17T03:09:59.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/linkml/linkml-term-validator","commit_stats":null,"previous_names":["linkml/linkml-term-validator"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/linkml/linkml-term-validator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkml%2Flinkml-term-validator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkml%2Flinkml-term-validator/tags","releases_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/linkml%2Flinkml-term-validator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkml%2Flinkml-term-validator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/linkml","download_url":"https://codeload.github.com/linkml/linkml-term-validator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkml%2Flinkml-term-validator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31290718,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T13:12:26.723Z","status":"ssl_error","status_checked_at":"2026-04-01T13:12:25.102Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["linkml","obofoundry","ontologies","validation"],"created_at":"2026-01-05T07:00:28.028Z","updated_at":"2026-04-01T17:57:49.842Z","avatar_url":"https://github.com/linkml.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# linkml-term-validator\n\nValidating LinkML schemas and datasets that depend on external terms\n\nA collection of [LinkML ValidationPlugin](https://linkml.io/linkml/code/validator.html) implementations for validating ontology term references:\n\n1. **Schema Validation**: Validate `meaning` fields in enum permissible values\n2. 
**Data Validation**: Validate data against dynamic enums and binding constraints\n\n## Features\n\n* ✅ Three composable validation plugins for LinkML validator framework\n* ✅ Validates `meaning` fields in `permissible_values` in LinkML schemas\n* ✅ Validates data against dynamic enums (reachable_from, matches, concepts)\n* ✅ Validates binding constraints on nested object fields\n* ✅ Supports multiple ontology sources via [OAK (Ontology Access Kit)](https://github.com/INCATools/ontology-access-kit)\n* ✅ Multi-level caching (in-memory + file-based) for fast repeated validation\n* ✅ Configurable per-prefix validation via `oak_config.yaml`\n* ✅ Standalone CLI + LinkML validator integration\n* ✅ Tracks unknown ontology prefixes\n\n## Installation\n\n```bash\npip install linkml-term-validator\n```\n\nOr with `uv`:\n\n```bash\nuv add linkml-term-validator\n```\n\n\n## Quick Start\n\nFor interactive tutorials, see the [Jupyter notebooks](notebooks/) in the `notebooks/` directory.\n\n### Validate Schemas\n\nCheck that `meaning` fields in your schema reference valid ontology terms:\n\n```bash\nlinkml-term-validator validate-schema schema.yaml\n```\n\n### Validate Data\n\nValidate data instances against dynamic enums and binding constraints:\n\n```bash\nlinkml-term-validator validate-data data.yaml --schema schema.yaml\n```\n\nThe `validate-data` command checks:\n- **Dynamic enums** - values match `reachable_from`, `matches`, or `concepts` definitions\n- **Binding constraints** - nested object fields satisfy binding ranges\n- **Labels** (optional with `--labels`) - ontology term labels match\n\n## Examples\n\n### Schema Validation\n\nHere's a LinkML schema that uses ontology terms:\n\n```yaml\nid: https://example.org/my-schema\nname: my-schema\nprefixes:\n  GO: http://purl.obolibrary.org/obo/GO_\n  CHEBI: http://purl.obolibrary.org/obo/CHEBI_\n\nenums:\n  BiologicalProcessEnum:\n    description: Examples of biological processes\n    permissible_values:\n      
BIOLOGICAL_PROCESS:\n        title: biological process\n        meaning: GO:0008150\n      CELL_CYCLE:\n        title: cell cycle\n        meaning: GO:0007049\n\n  ChemicalEntityEnum:\n    description: Examples of chemical entities\n    permissible_values:\n      WATER:\n        title: water\n        meaning: CHEBI:15377\n      GLUCOSE:\n        title: glucose\n        meaning: CHEBI:17234\n```\n\nWhen you run validation:\n\n```bash\nlinkml-term-validator validate-schema my-schema.yaml\n```\n\nThe validator will:\n1. Check that `GO:0008150` exists and has label \"biological_process\" (or \"biological process\")\n2. Check that `GO:0007049` exists and has label \"cell cycle\"\n3. Check that `CHEBI:15377` exists and has label \"water\"\n4. Check that `CHEBI:17234` exists and has label \"glucose\"\n5. Report any mismatches or missing terms\n\n### Example Output\n\n```\nValidation Results for my-schema.yaml\n============================================================\nEnums checked: 2\nValues checked: 4\nMeanings validated: 4\n\n✅ No issues found!\n```\n\nOr if there's an issue:\n\n```\n⚠️  WARNING: Label mismatch\n    Enum: BiologicalProcessEnum\n    Value: BIOLOGICAL_PROCESS\n    Expected label: biological process\n    Found label: biological_process\n    Meaning: GO:0008150\n\nValidation Results for my-schema.yaml\n============================================================\nEnums checked: 2\nValues checked: 4\nMeanings validated: 4\n\nIssues found: 1\n  Warnings: 1\n  Errors: 0\n```\n\n### Data Validation\n\n#### Example 1: Dynamic Enums\n\nSchema with a dynamic enum using `reachable_from`:\n\n```yaml\nenums:\n  NeuronTypeEnum:\n    description: Any neuron type\n    reachable_from:\n      source_ontology: obo:cl\n      source_nodes:\n        - CL:0000540  # neuron\n      relationship_types:\n        - rdfs:subClassOf\n```\n\nData file with neuron instances:\n\n```yaml\nneurons:\n  - id: \"1\"\n    cell_type: CL:0000540  # neuron - valid\n  - id: \"2\"\n    cell_type: CL:0000100  
# motor neuron - valid (descendant)\n  - id: \"3\"\n    cell_type: GO:0008150  # biological process - INVALID\n```\n\nValidate:\n\n```bash\nlinkml-term-validator validate-data neurons.yaml --schema schema.yaml\n```\n\nOutput:\n```\n❌ Validation failed with 1 issue(s):\n\n❌ ERROR: Value 'GO:0008150' not in dynamic enum NeuronTypeEnum\n    Expected one of the descendants of CL:0000540\n```\n\n#### Example 2: Binding Constraints\n\nSchema with binding constraints:\n\n```yaml\nclasses:\n  GeneAnnotation:\n    slots:\n      - gene\n      - go_term\n    slot_usage:\n      go_term:\n        range: GOTerm\n        bindings:\n          - binds_value_of: id\n            range: BiologicalProcessEnum\n\n  GOTerm:\n    slots:\n      - id\n      - label\n```\n\nData file:\n\n```yaml\nannotations:\n  - gene: BRCA1\n    go_term:\n      id: GO:0008150  # biological_process\n      label: biological process\n```\n\nValidate with label checking:\n\n```bash\nlinkml-term-validator validate-data annotations.yaml --schema schema.yaml --labels\n```\n\n## Caching\n\nThe validator uses multi-level caching to speed up repeated validations:\n\n### In-Memory Cache\nDuring a single validation run, ontology labels are cached in memory. This means if multiple permissible values use the same ontology term, it's only looked up once.\n\n### File-Based Cache\nLabels are persisted to CSV files in the cache directory (default: `cache/`). 
The cache is organized by ontology prefix:\n\n```\ncache/\n├── go/\n│   └── terms.csv      # GO term labels\n├── chebi/\n│   └── terms.csv      # CHEBI term labels\n└── uberon/\n    └── terms.csv      # UBERON term labels\n```\n\nEach CSV contains:\n```csv\ncurie,label,retrieved_at\nGO:0008150,biological_process,2025-11-15T10:30:00\nGO:0007049,cell cycle,2025-11-15T10:30:01\n```\n\n### Cache Behavior\n\n- **First run**: Queries ontology databases, saves to cache\n- **Subsequent runs**: Loads from cache files (very fast!)\n- **Cache location**: Configurable via `--cache-dir` flag\n- **Disable caching**: Use `--no-cache` flag\n\n### When to Clear Cache\n\nYou might want to clear the cache if:\n- Ontology databases have been updated\n- You suspect stale or incorrect labels\n\n```bash\n# Clear the cache for a specific ontology\nrm -rf cache/go/\n\n# Clear the entire cache\nrm -rf cache/\n```\n\n## Advanced Configuration\n\n### Per-Prefix Adapter Configuration\n\nCreate an `oak_config.yaml` to control which ontologies are validated:\n\n```yaml\nontology_adapters:\n  GO: sqlite:obo:go           # Use local GO database\n  CHEBI: sqlite:obo:chebi     # Use local CHEBI database\n  UBERON: sqlite:obo:uberon   # Use local UBERON database\n  CUSTOM: \"\"                   # Skip validation for CUSTOM prefix\n```\n\nThen validate with this config:\n\n```bash\nlinkml-term-validator validate-schema --config oak_config.yaml schema.yaml\n```\n\n**Important**: When using `oak_config.yaml`, ONLY the prefixes listed in the config will be validated. Any prefix not in the config will be tracked as \"unknown\" and reported at the end of validation.\n\n### Default Behavior (No Config File)\n\nWithout an `oak_config.yaml`, the validator uses `sqlite:obo:` as the default adapter. 
This automatically creates per-prefix adapters:\n\n- `GO:0008150` → uses `sqlite:obo:go`\n- `CHEBI:15377` → uses `sqlite:obo:chebi`\n- `UBERON:0000468` → uses `sqlite:obo:uberon`\n\nThis works for any OBO ontology that has been downloaded via OAK.\n\n## Usage\n\n**linkml-term-validator** supports two main validation use cases:\n\n#### 1. Schema Validation\n\nValidates `meaning` fields in enum permissible values.\n\n**CLI:**\n```bash\n# Validate schema permissible values\nlinkml-term-validator validate-schema schema.yaml\n\n# With strict mode (warnings become errors)\nlinkml-term-validator validate-schema --strict schema.yaml\n\n# With custom config\nlinkml-term-validator validate-schema --config oak_config.yaml schema.yaml\n```\n\n**Python API:**\n```python\nfrom linkml.validator import Validator\nfrom linkml_term_validator.plugins import PermissibleValueMeaningPlugin\n\nplugin = PermissibleValueMeaningPlugin(\n    oak_adapter_string=\"sqlite:obo:\",\n    strict_mode=False\n)\n\nvalidator = Validator(schema=\"schema.yaml\", validation_plugins=[plugin])\nreport = validator.validate(\"schema.yaml\")\n\nif len(report.results) == 0:\n    print(\"Valid!\")\nelse:\n    for result in report.results:\n        print(f\"{result.severity}: {result.message}\")\n```\n\n#### 2. 
Data Validation\n\nValidates data instances against dynamic enums and binding constraints.\n\n**CLI:**\n```bash\n# Validate data (checks both dynamic enums and bindings)\nlinkml-term-validator validate-data data.yaml --schema schema.yaml\n\n# With specific target class\nlinkml-term-validator validate-data data.yaml -s schema.yaml -t Person\n\n# Also validate labels match ontology\nlinkml-term-validator validate-data data.yaml -s schema.yaml --labels\n\n# Only check bindings, skip dynamic enums\nlinkml-term-validator validate-data data.yaml -s schema.yaml --no-dynamic-enums\n\n# Only check dynamic enums, skip bindings\nlinkml-term-validator validate-data data.yaml -s schema.yaml --no-bindings\n```\n\nData validation includes two aspects:\n\n##### Dynamic Enums\n\nValidates against enums defined via `reachable_from`, `matches`, `concepts`.\n\nExample schema:\n```yaml\nenums:\n  NeuronTypeEnum:\n    reachable_from:\n      source_ontology: obo:cl\n      source_nodes: [CL:0000540]  # neuron\n      relationship_types: [rdfs:subClassOf]\n```\n\n**Python API:**\n```python\nfrom linkml.validator import Validator\nfrom linkml_term_validator.plugins import DynamicEnumPlugin\n\nplugin = DynamicEnumPlugin(oak_adapter_string=\"sqlite:obo:\")\nvalidator = Validator(schema=\"schema.yaml\", validation_plugins=[plugin])\nreport = validator.validate(\"data.yaml\")\n```\n\n##### Binding Constraints\n\nValidates nested object fields against binding constraints.\n\nExample schema:\n```yaml\nclasses:\n  Annotation:\n    slots:\n      - term\n    slot_usage:\n      term:\n        range: Term\n        bindings:\n          - binds_value_of: id\n            range: GOTermEnum\n```\n\n**Python API:**\n```python\nfrom linkml.validator import Validator\nfrom linkml_term_validator.plugins import BindingValidationPlugin\n\nplugin = BindingValidationPlugin(\n    validate_labels=True  # Also check labels match ontology\n)\nvalidator = Validator(schema=\"schema.yaml\", 
validation_plugins=[plugin])\nreport = validator.validate(\"data.yaml\")\n```\n\n### Combining Multiple Validations\n\n**CLI:**\n```bash\n# Validate data with both dynamic enums and bindings (default)\nlinkml-term-validator validate-data data.yaml --schema schema.yaml\n\n# With label validation enabled\nlinkml-term-validator validate-data data.yaml -s schema.yaml --labels\n```\n\n**Python API:**\n```python\nfrom linkml.validator import Validator\nfrom linkml.validator.plugins import JsonschemaValidationPlugin\nfrom linkml_term_validator.plugins import (\n    DynamicEnumPlugin,\n    BindingValidationPlugin,\n)\n\n# Comprehensive validation pipeline\nplugins = [\n    JsonschemaValidationPlugin(closed=True),  # Structural validation\n    DynamicEnumPlugin(),                       # Dynamic enum validation\n    BindingValidationPlugin(validate_labels=True),  # Binding validation\n]\n\nvalidator = Validator(schema=\"schema.yaml\", validation_plugins=plugins)\nreport = validator.validate(\"data.yaml\")\n```\n\n## Integration with linkml-validate\n\nThe **linkml-term-validator** plugins can be used directly with the standard `linkml-validate` command via configuration files.\n\n### Using Config Files\n\nCreate a validation config file (e.g., `validation_config.yaml`):\n\n```yaml\n# Validation configuration for linkml-validate\nschema: schema.yaml\ntarget_class: Person\n\ndata_sources:\n  - data.yaml\n\nplugins:\n  # Standard JSON Schema validation\n  JsonschemaValidationPlugin:\n    closed: true\n\n  # Ontology term validation for dynamic enums\n  \"linkml_term_validator.plugins.DynamicEnumPlugin\":\n    oak_adapter_string: \"sqlite:obo:\"\n    cache_labels: true\n    cache_dir: cache\n\n  # Binding constraint validation\n  \"linkml_term_validator.plugins.BindingValidationPlugin\":\n    oak_adapter_string: \"sqlite:obo:\"\n    validate_labels: true\n    cache_labels: true\n    cache_dir: cache\n```\n\nThen run validation:\n\n```bash\nlinkml-validate --config 
validation_config.yaml\n```\n\n### Example Files\n\nSee the [examples/](examples/) directory for complete examples:\n- [simple_config.yaml](examples/simple_config.yaml) - Basic validation config\n- [linkml_validate_config.yaml](examples/linkml_validate_config.yaml) - Full config with ontology plugins\n- [simple_schema.yaml](examples/simple_schema.yaml) - Example schema\n- [simple_data.yaml](examples/simple_data.yaml) - Example data\n\n### Plugin Configuration Options\n\n#### DynamicEnumPlugin\n\n```yaml\n\"linkml_term_validator.plugins.DynamicEnumPlugin\":\n  oak_adapter_string: \"sqlite:obo:\"  # OAK adapter (default: sqlite:obo:)\n  cache_labels: true                  # Enable label caching (default: true)\n  cache_dir: cache                    # Cache directory (default: cache)\n  oak_config_path: oak_config.yaml    # Optional: custom OAK config\n```\n\n#### BindingValidationPlugin\n\n```yaml\n\"linkml_term_validator.plugins.BindingValidationPlugin\":\n  oak_adapter_string: \"sqlite:obo:\"  # OAK adapter (default: sqlite:obo:)\n  validate_labels: true               # Check labels match ontology (default: true)\n  cache_labels: true                  # Enable label caching (default: true)\n  cache_dir: cache                    # Cache directory (default: cache)\n  oak_config_path: oak_config.yaml    # Optional: custom OAK config\n```\n\n### Programmatic Usage\n\nYou can also use the plugins programmatically:\n\n```python\nfrom linkml.validator import Validator\nfrom linkml.validator.plugins import JsonschemaValidationPlugin\nfrom linkml_term_validator.plugins import (\n    DynamicEnumPlugin,\n    BindingValidationPlugin,\n)\n\n# Build validation pipeline\nplugins = [\n    JsonschemaValidationPlugin(closed=True),\n    DynamicEnumPlugin(oak_adapter_string=\"sqlite:obo:\"),\n    BindingValidationPlugin(validate_labels=True),\n]\n\n# Create validator\nvalidator = Validator(\n    schema=\"schema.yaml\",\n    validation_plugins=plugins,\n)\n\n# Validate\nreport = 
validator.validate(\"data.yaml\")\n\n# Check results\nif len(report.results) == 0:\n    print(\"✅ Validation passed\")\nelse:\n    for result in report.results:\n        print(f\"{result.severity.name}: {result.message}\")\n```\n\n## Repository Structure\n\n* [docs/](docs/) - mkdocs-managed documentation\n* [src/](src/) - source files (edit these)\n  * [linkml_term_validator](src/linkml_term_validator)\n* [tests/](tests/) - Python tests\n  * [data/](tests/data) - Example data\n\n## Developer Tools\n\nThere are several pre-defined command-recipes available.\nThey are written for the command runner [just](https://github.com/casey/just/). To list all pre-defined commands, run `just` or `just --list`.\n\n## Anti-Hallucination Guardrails for Agentic AI\n\nWhile **linkml-term-validator** is designed for standard data validation, it serves a crucial role as an **anti-hallucination guardrail** for agentic AI pipelines that generate ontology term references.\n\n### The Problem: LLMs Hallucinate Identifiers\n\nLanguage models frequently hallucinate identifiers like gene IDs, ontology terms, and other structured references. 
These fake identifiers often appear structurally correct (e.g., `GO:9999999`, `CHEBI:88888`) but don't actually exist in the source ontologies.\n\n### The Solution: Dual Validation Pattern\n\nA robust guardrail requires **dual validation**—forcing the AI to provide both the identifier and its canonical label, then validating that they match:\n\n**Instead of accepting:**\n```yaml\nterm: GO:0005515  # Single piece of information - easy to hallucinate\n```\n\n**Require and validate:**\n```yaml\nterm:\n  id: GO:0005515\n  label: protein binding  # Must match canonical label in ontology\n```\n\nThis dramatically reduces hallucinations because the AI must get **two interdependent facts correct simultaneously**, which is significantly harder to fake convincingly than inventing a single plausible-looking identifier.\n\n### Implementation in AI Pipelines\n\nUse **linkml-term-validator** to embed validation directly into your agentic workflow:\n\n**1. Define schemas with binding constraints:**\n\n```yaml\nclasses:\n  GeneAnnotation:\n    slots:\n      - gene\n      - go_term\n    slot_usage:\n      go_term:\n        range: GOTerm\n        bindings:\n          - binds_value_of: id\n            range: BiologicalProcessEnum\n\n  GOTerm:\n    slots:\n      - id        # AI must provide both\n      - label     # fields correctly\n```\n\n**2. Validate AI-generated outputs before committing:**\n\n```python\nfrom linkml.validator import Validator\nfrom linkml_term_validator.plugins import BindingValidationPlugin\n\n# Create validator with label checking enabled\nplugin = BindingValidationPlugin(validate_labels=True)\nvalidator = Validator(schema=\"schema.yaml\", validation_plugins=[plugin])\n\n# Validate AI-generated data\nreport = validator.validate(ai_generated_data)\n\nif len(report.results) \u003e 0:\n    # Reject hallucinated terms, prompt AI to regenerate\n    raise ValueError(\"Invalid ontology terms detected\")\n```\n\n**3. 
Use validation during generation (not just post-hoc):**\n\nThe most effective approach embeds validation **during AI generation** rather than treating it as a filtering step afterward. This transforms hallucination resistance from a detection problem into a generation constraint.\n\n### Real-World Benefits\n\n- **Prevents fake identifiers** from entering curated datasets\n- **Catches label mismatches** where AI uses real IDs but wrong labels\n- **Validates dynamic constraints** (e.g., only disease terms, only neuron types)\n- **Enables reliable automation** of curation tasks traditionally requiring human experts\n\n### Learn More\n\nFor detailed patterns and best practices on making ontology IDs hallucination-resistant in AI workflows, see:\n\n- [Make IDs Hallucination Resistant](https://ai4curation.io/aidocs/how-tos/make-ids-hallucination-resistant/) - Comprehensive guide from the AI for Curation project\n- [Jupyter Notebooks](notebooks/) - Interactive tutorials demonstrating validation workflows\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinkml%2Flinkml-term-validator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flinkml%2Flinkml-term-validator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinkml%2Flinkml-term-validator/lists"}