{"id":15657411,"url":"https://github.com/cmungall/semantic-llama","last_synced_at":"2025-05-05T15:45:59.789Z","repository":{"id":66914524,"uuid":"577992290","full_name":"cmungall/semantic-llama","owner":"cmungall","description":"A knowledge extraction tool that uses a large language model to extract semantic information from text","archived":false,"fork":false,"pushed_at":"2023-01-03T16:56:13.000Z","size":1593,"stargazers_count":28,"open_issues_count":2,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-30T22:06:12.653Z","etag":null,"topics":["ai","knowledge-extraction","language-models","linkml","oaklib","obofoundry"],"latest_commit_sha":null,"homepage":"https://cmungall.github.io/semantic-llama/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cmungall.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-12-14T02:00:41.000Z","updated_at":"2024-12-02T12:35:11.000Z","dependencies_parsed_at":"2023-05-20T11:00:22.740Z","dependency_job_id":null,"html_url":"https://github.com/cmungall/semantic-llama","commit_stats":{"total_commits":16,"total_committers":1,"mean_commits":16.0,"dds":0.0,"last_synced_commit":"783084253bb81e9d1ef09a6913336b6b40ea8d1a"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmungall%2Fsemantic-llama","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmungall%2Fsemantic-llama/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmungall%2Fsemantic-llama/releases","manife
sts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmungall%2Fsemantic-llama/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cmungall","download_url":"https://codeload.github.com/cmungall/semantic-llama/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252525604,"owners_count":21762331,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","knowledge-extraction","language-models","linkml","oaklib","obofoundry"],"created_at":"2024-10-03T13:06:44.388Z","updated_at":"2025-05-05T15:45:59.766Z","avatar_url":"https://github.com/cmungall.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NEW NAME/REPO: https://github.com/monarch-initiative/ontogpt\n\nsemantic-llama is now ontogpt\n\nThis repo is ARCHIVED, use the repo above!!!\n\n# Semantic llama\n\nSemantic Large LAnguage Model Annotation\n\nA knowledge extraction tool that uses a large language model to extract semantic information from text.\n\nThis exploits the ability of ultra-LLMs such as GPT-3 to return user-defined data structures\nas a response.\n\n## Usage\n\nGiven a short text `abstract.txt` with content such as:\n\n   \u003e The cGAS/STING-mediated DNA-sensing signaling pathway is crucial\n   for interferon (IFN) production and host antiviral\n   responses\n   \u003e \n   \u003e ...\n   \u003e ...\n   \u003e \n   \u003e The underlying mechanism was the\n   interaction of US3 with β-catenin and its hyperphosphorylation of\n   β-catenin at Thr556 to block its nuclear 
translocation\n   \u003e ...\n   \u003e ...\n\n(see [full input](tests/input/cases/gocam-betacat.txt))\n\nWe can extract this into the [GO pathway datamodel](src/semantic_llama/templates/gocam.yaml):\n\n```bash\nsemllama extract -t gocam.GoCamAnnotations abstract.txt\n```\n\nThis gives schema-compliant YAML such as:\n\n```yaml\ngenes:\n- HGNC:2514\n- HGNC:21367\n- HGNC:27962\n- US3\n- FPLX:Interferon\n- ISG\ngene_gene_interactions:\n- gene1: US3\n  gene2: HGNC:2514\ngene_localizations:\n- gene: HGNC:2514\n  location: Nuclear\ngene_functions:\n- gene: HGNC:2514\n  molecular_activity: Transcription\n- gene: HGNC:21367\n  molecular_activity: Production\n...\n```\n\nSee [full output](tests/output/gocam-betacat.yaml)\n\nNote that the grounding in the output above is very preliminary and can be improved. Ungrounded NamedEntities appear as text.\n\n## How it works\n\n1. You provide an arbitrary data model, describing the structure you want to extract text into\n    - this can be nested (but see limitations below)\n2. You provide your preferred annotations for grounding NamedEntity fields\n3. 
semantic-llama will:\n    - generate a prompt\n    - feed the prompt to a language model (currently OpenAI)\n    - parse the results into a dictionary structure\n    - ground the results using a preferred annotator\n\n## Pre-requisites\n\n- python 3.9+\n- an OpenAI account\n- a BioPortal account (optional)\n\nYou will need to set both API keys using OAK\n\n```\npoetry run runoak set-apikey openai \u003cyour openai api key\u003e\npoetry run runoak set-apikey bioportal \u003cyour bioportal api key\u003e\n```\n\n## How to define your own extraction data model\n\n### Step 1: Define a schema\n\nSee [src/semantic_llama/templates/](src/semantic_llama/templates/) for examples.\n\nDefine a schema (using a subset of LinkML) that describes the structure you want to extract from your text.\n\n```yaml\nclasses:\n  MendelianDisease:\n    attributes:\n      name:\n        description: the name of the disease\n        examples:\n          - value: peroxisome biogenesis disorder\n        identifier: true  ## needed for inlining\n      description:\n        description: a description of the disease\n        examples:\n          - value: \u003e-\n             Peroxisome biogenesis disorders, Zellweger syndrome spectrum (PBD-ZSS) is a group of autosomal recessive disorders affecting the formation of functional peroxisomes, characterized by sensorineural hearing loss, pigmentary retinal degeneration, multiple organ dysfunction and psychomotor impairment\n      synonyms:\n        multivalued: true\n        examples:\n          - value: Zellweger syndrome spectrum\n          - value: PBD-ZSS\n      subclass_of:\n        multivalued: true\n        range: MendelianDisease\n        examples:\n          - value: lysosomal disease\n          - value: autosomal recessive disorder\n      symptoms:\n        range: Symptom\n        multivalued: true\n        examples:\n          - value: sensorineural hearing loss\n          - value: pigmentary retinal degeneration\n      inheritance:\n        
range: Inheritance\n        examples:\n          - value: autosomal recessive\n      genes:\n        range: Gene\n        multivalued: true\n        examples:\n          - value: PEX1\n          - value: PEX2\n          - value: PEX3\n\n  Gene:\n    is_a: NamedThing\n    id_prefixes:\n      - HGNC\n    annotations:\n      annotators: gilda:, bioportal:hgnc-nr\n\n  Symptom:\n    is_a: NamedThing\n    id_prefixes:\n      - HP\n    annotations:\n      annotators: sqlite:obo:hp\n\n  Inheritance:\n    is_a: NamedThing\n    annotations:\n      annotators: sqlite:obo:hp\n```\n\n- the schema is defined in LinkML\n- prompt hints can be specified using the `prompt` annotation (otherwise the description is used)\n- multivalued fields are supported\n- the default range is string; string fields are not grounded (e.g. disease name, synonyms)\n- define a class for each NamedEntity\n- for any NamedEntity, you can specify a preferred annotator using the `annotators` annotation\n\nWe recommend following an established schema like Biolink, but you can define your own.\n\n### Step 2: Compile the schema\n\nRun the `make` command at the top level. This will compile the schema to Pydantic.\n\n### Step 3: Run the command line\n\nFor example:\n\n```\nsemllama extract -t mendelian_disease.MendelianDisease marfan-wikipedia.txt\n```\n\n## Web Application\n\nThere is a bare-bones web application:\n\n```\npoetry run webllama\n```\n\nNote that the agent running uvicorn must have the API key set, so for obvious reasons\ndon't host this publicly without authentication unless you want your credits drained.\n\n## Features\n\n### Multiple Levels of nesting\n\nCurrently only two levels of nesting are supported.\n\nIf a field has a range which is itself a class and not a primitive, it will attempt to nest.\n\nE.g. 
the gocam schema has an attribute:\n\n```yaml\n  attributes:\n      ...\n      gene_functions:\n        description: semicolon-separated list of gene to molecular activity relationships\n        multivalued: true\n        range: GeneMolecularActivityRelationship\n```\n\nBecause GeneMolecularActivityRelationship is *inlined*, it will nest.\n\nThe generated prompt is:\n\n`gene_functions : \u003csemicolon-separated list of gene to molecular activities relationships\u003e`\n\nThe output of this is then passed through further llama iterations.\n\n## Limitations\n\n### Non-deterministic\n\nThis relies on an existing LLM, and LLMs can be fickle in their responses.\n\n### Coupled to OpenAI\n\nYou will need an OpenAI account. In theory any LLM can be used, but in practice the parser is tuned for OpenAI.\n\n\n\n# Acknowledgements\n\nThis [cookiecutter](https://cookiecutter.readthedocs.io/en/stable/README.html) project was developed from the [sphintoxetry-cookiecutter](https://github.com/hrshdhgd/sphintoxetry-cookiecutter) template and will be kept up-to-date using [cruft](https://cruft.github.io/cruft/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcmungall%2Fsemantic-llama","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcmungall%2Fsemantic-llama","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcmungall%2Fsemantic-llama/lists"}