Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/cmungall/semantic-llama
A knowledge extraction tool that uses a large language model to extract semantic information from text
ai knowledge-extraction language-models linkml oaklib obofoundry
- Host: GitHub
- URL: https://github.com/cmungall/semantic-llama
- Owner: cmungall
- License: bsd-3-clause
- Created: 2022-12-14T02:00:41.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-01-03T16:56:13.000Z (about 2 years ago)
- Last Synced: 2024-10-04T13:06:36.576Z (4 months ago)
- Topics: ai, knowledge-extraction, language-models, linkml, oaklib, obofoundry
- Language: Python
- Homepage: https://cmungall.github.io/semantic-llama/
- Size: 1.52 MB
- Stars: 26
- Watchers: 2
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# NEW NAME/REPO: https://github.com/monarch-initiative/ontogpt
semantic-llama is now ontogpt
This repo is ARCHIVED, use the repo above!!!
# Semantic llama
Semantic Large LAnguage Model Annotation
A knowledge extraction tool that uses a large language model to extract semantic information from text.
This exploits the ability of ultra-LLMs such as GPT-3 to return user-defined data structures
as a response.

## Usage
Given a short text `abstract.txt` with content such as:
> The cGAS/STING-mediated DNA-sensing signaling pathway is crucial
for interferon (IFN) production and host antiviral
responses
>
> ...
> ...
>
> The underlying mechanism was the
interaction of US3 with β-catenin and its hyperphosphorylation of
β-catenin at Thr556 to block its nuclear translocation
> ...
> ...

(see [full input](tests/input/cases/gocam-betacat.txt))
We can extract this into the [GO pathway datamodel](src/semantic_llama/templates/gocam.yaml):
```bash
semllama extract -t gocam.GoCamAnnotations abstract.txt
```

Giving schema-compliant YAML such as:

```yaml
genes:
- HGNC:2514
- HGNC:21367
- HGNC:27962
- US3
- FPLX:Interferon
- ISG
gene_gene_interactions:
- gene1: US3
  gene2: HGNC:2514
gene_localizations:
- gene: HGNC:2514
  location: Nuclear
gene_functions:
- gene: HGNC:2514
  molecular_activity: Transcription
- gene: HGNC:21367
  molecular_activity: Production
...
```

See [full output](tests/output/gocam-betacat.yaml).

Note that in the above, the grounding is very preliminary and can be improved. Ungrounded NamedEntities appear as text.
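
As a rough illustration of the grounded/ungrounded distinction, the sketch below splits the `genes` list from the output above into values that look like identifiers and values left as plain text. It assumes that a grounded value has the form `PREFIX:LOCAL_ID`; the names `is_grounded` and `partition_entities` are hypothetical and not part of semantic-llama.

```python
import re

# Assumption: a grounded value looks like a CURIE, i.e. PREFIX:LOCAL_ID.
CURIE_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_.]*:\S+$")


def is_grounded(value: str) -> bool:
    """Return True if the value looks like PREFIX:LOCAL_ID."""
    return bool(CURIE_PATTERN.match(value))


def partition_entities(values):
    """Split values into (grounded, ungrounded), preserving order."""
    grounded = [v for v in values if is_grounded(v)]
    ungrounded = [v for v in values if not is_grounded(v)]
    return grounded, ungrounded


# The `genes` list from the example output above.
genes = ["HGNC:2514", "HGNC:21367", "HGNC:27962", "US3", "FPLX:Interferon", "ISG"]
grounded, ungrounded = partition_entities(genes)
print("grounded:", grounded)
print("ungrounded:", ungrounded)
```

A check like this could be used to flag entities that may need a different annotator or manual curation.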
## How it works
1. You provide an arbitrary data model, describing the structure you want to extract text into
   - this can be nested (but see limitations below)
2. You provide your preferred annotations for grounding NamedEntity fields
3. semantic-llama will:
   - generate a prompt
   - feed the prompt to a language model (currently OpenAI)
   - parse the results into a dictionary structure
   - ground the results using a preferred annotator

## Pre-requisites
- Python 3.9+
- an OpenAI account
- a BioPortal account (optional)

You will need to set both API keys using OAK:
```
poetry run runoak set-apikey openai
poetry run runoak set-apikey bioportal
```

## How to define your own extraction data model
### Step 1: Define a schema
See [src/semantic_llama/templates/](src/semantic_llama/templates/) for examples.
Define a schema (using a subset of LinkML) that describes the structure you want to extract from your text.
```yaml
classes:
  MendelianDisease:
    attributes:
      name:
        description: the name of the disease
        examples:
          - value: peroxisome biogenesis disorder
        identifier: true  ## needed for inlining
      description:
        description: a description of the disease
        examples:
          - value: >-
              Peroxisome biogenesis disorders, Zellweger syndrome spectrum (PBD-ZSS) is a group of autosomal recessive disorders affecting the formation of functional peroxisomes, characterized by sensorineural hearing loss, pigmentary retinal degeneration, multiple organ dysfunction and psychomotor impairment
      synonyms:
        multivalued: true
        examples:
          - value: Zellweger syndrome spectrum
          - value: PBD-ZSS
      subclass_of:
        multivalued: true
        range: MendelianDisease
        examples:
          - value: lysosomal disease
          - value: autosomal recessive disorder
      symptoms:
        range: Symptom
        multivalued: true
        examples:
          - value: sensorineural hearing loss
          - value: pigmentary retinal degeneration
      inheritance:
        range: Inheritance
        examples:
          - value: autosomal recessive
      genes:
        range: Gene
        multivalued: true
        examples:
          - value: PEX1
          - value: PEX2
          - value: PEX3

  Gene:
    is_a: NamedThing
    id_prefixes:
      - HGNC
    annotations:
      annotators: gilda:, bioportal:hgnc-nr

  Symptom:
    is_a: NamedThing
    id_prefixes:
      - HP
    annotations:
      annotators: sqlite:obo:hp

  Inheritance:
    is_a: NamedThing
    annotations:
      annotators: sqlite:obo:hp
```

- the schema is defined in LinkML
- prompt hints can be specified using the `prompt` annotation (otherwise description is used)
- multivalued fields are supported
- the default range is string - these are not grounded. E.g. disease name, synonyms
- define a class for each NamedEntity
- for any NamedEntity, you can specify a preferred annotator using the `annotators` annotation

We recommend following an established schema like biolink, but you can define your own.
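
To make the prompt-hint rule concrete, here is a minimal sketch of how a hint could be chosen per attribute: use the `prompt` annotation if present, otherwise the description. It works on plain dictionaries rather than the LinkML API, and several things are assumptions for illustration only: the function names, the `name: <hint>` line format, the fallback to the field name, and the example `prompt` value. The actual prompt generated by semantic-llama may differ.

```python
def prompt_hint(name, attribute):
    """Pick the hint for one attribute: `prompt` annotation, else description.

    Falling back to the field name when neither is set is an assumption
    made for this sketch.
    """
    annotations = attribute.get("annotations") or {}
    return annotations.get("prompt") or attribute.get("description") or name


def build_prompt_lines(attributes):
    """Build one hypothetical `name: <hint>` line per schema attribute."""
    return [
        f"{name}: <{prompt_hint(name, attribute or {})}>"
        for name, attribute in attributes.items()
    ]


# A cut-down version of the MendelianDisease attributes as a plain dict.
attributes = {
    "name": {"description": "the name of the disease"},
    "synonyms": {
        "multivalued": True,
        # Hypothetical prompt annotation, not taken from the real template.
        "annotations": {"prompt": "alternative names for the disease"},
    },
    "genes": {"range": "Gene", "multivalued": True},
}

prompt_lines = build_prompt_lines(attributes)
print("\n".join(prompt_lines))
```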
### Step 2: Compile the schema
Run the `make` command at the top level. This will compile the schema to Pydantic.
### Step 3: Run the command line
e.g.
```
semllama extract -t mendelian_disease.MendelianDisease marfan-wikipedia.txt
```

## Web Application
There is a bare-bones web application:
```
poetry run webllama
```

Note that the agent running uvicorn must have the API key set, so do not host this publicly without authentication unless you want your credits drained.

## Features
### Multiple Levels of nesting
Currently only two levels of nesting are supported.

If a field has a range which is itself a class and not a primitive, semantic-llama will attempt to nest.

E.g. the gocam schema has an attribute:
```yaml
attributes:
  ...
  gene_functions:
    description: semicolon-separated list of gene to molecular activity relationships
    multivalued: true
    range: GeneMolecularActivityRelationship
```

Because GeneMolecularActivityRelationship is *inlined*, it will nest.

The generated prompt is:

`gene_functions : `
The output of this is then passed through further llama iterations.
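
The nesting step can be pictured roughly as follows: the first-pass answer for a multivalued field is split on semicolons, and each item is handed to a second extraction pass that fills in the nested class. This is only a sketch under assumptions. The function names are hypothetical, the first-pass response uses made-up placeholder values, and the second pass is a simple string split standing in for another language model call.

```python
from typing import Callable, Dict, List


def split_multivalued(response: str) -> List[str]:
    """Split a semicolon-separated answer into trimmed, non-empty items."""
    return [item.strip() for item in response.split(";") if item.strip()]


def extract_nested(
    response: str, extract_item: Callable[[str], Dict[str, str]]
) -> List[Dict[str, str]]:
    """Run a second-level extraction on each item of a first-level answer."""
    return [extract_item(item) for item in split_multivalued(response)]


def fake_extract_item(text: str) -> Dict[str, str]:
    """Stand-in for the second pass; a real run would query the model again."""
    gene, _, activity = text.partition(" - ")
    return {"gene": gene.strip(), "molecular_activity": activity.strip()}


# Placeholder first-pass answer for `gene_functions` (not real model output).
first_pass = "geneA - activityA; geneB - activityB"
gene_functions = extract_nested(first_pass, fake_extract_item)
print(gene_functions)
```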
## Limitations
### Non-deterministic
This relies on an existing LLM, and LLMs can be fickle in their responses.
### Coupled to OpenAI
You will need an OpenAI account. In theory any LLM can be used, but in practice the parser is tuned for OpenAI.
# Acknowledgements
This [cookiecutter](https://cookiecutter.readthedocs.io/en/stable/README.html) project was developed from the [sphintoxetry-cookiecutter](https://github.com/hrshdhgd/sphintoxetry-cookiecutter) template and will be kept up-to-date using [cruft](https://cruft.github.io/cruft/).