https://github.com/dnlbauer/bh24de_ontogpt
- Host: GitHub
- URL: https://github.com/dnlbauer/bh24de_ontogpt
- Owner: dnlbauer
- License: cc0-1.0
- Created: 2024-12-09T16:35:47.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-12T13:51:08.000Z (over 1 year ago)
- Last Synced: 2025-12-07T23:59:28.338Z (6 months ago)
- Size: 20.5 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Biohackathon Germany 2024
This project investigated the use of [OntoGPT](https://github.com/monarch-initiative/ontogpt) for extracting terms and information from descriptions in [Senckenberg's collection](https://search.senckenberg.de/).
## Templates
### Simple Schema: example to extract genus and species from text
```bash
ontogpt extract -i examples/lathyrus.txt -t templates/simple_schema.yaml -m ollama/llama3
input_text: |
  Lathyrus is a pea and a pea is a plant. Lathyrus vestitus is a Lathyrus.
raw_completion_output: |-
  Here are the extracted entities in the desired format:
  genus: Lathyrus
  species: vestitus
prompt: |+
  From the text below, extract the following entities in the following format:
  Text:
  Lathyrus is a pea and a pea is a plant. Lathyrus vestitus is a Lathyrus.
  ===
extracted_object:
  genus: NCBITaxon:3853
  species: AUTO:vestitus
named_entities:
  - id: NCBITaxon:3853
    label: Lathyrus
    original_spans:
      - 0:7
      - 40:47
      - 63:70
  - id: AUTO:vestitus
    label: vestitus
    original_spans:
      - 49:56
```
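In the output above, the `original_spans` values appear to be `start:end` character offsets into `input_text` with an inclusive end, and the `AUTO:` prefix marks an entity that was not grounded to an ontology term. The following Python sketch checks both against the example output. The helper names `span_text` and `is_grounded` are hypothetical, and the inclusive-offset reading is an assumption inferred from this one example.

```python
# Values copied from the example output above.
INPUT_TEXT = "Lathyrus is a pea and a pea is a plant. Lathyrus vestitus is a Lathyrus."

NAMED_ENTITIES = [
    {"id": "NCBITaxon:3853", "label": "Lathyrus", "original_spans": ["0:7", "40:47", "63:70"]},
    {"id": "AUTO:vestitus", "label": "vestitus", "original_spans": ["49:56"]},
]


def span_text(text: str, span: str) -> str:
    """Return the substring for a 'start:end' span, treating end as inclusive (assumption)."""
    start, end = (int(part) for part in span.split(":"))
    return text[start : end + 1]


def is_grounded(entity_id: str) -> bool:
    """An ID carrying the AUTO prefix was not matched to an ontology term."""
    return not entity_id.startswith("AUTO:")


for entity in NAMED_ENTITIES:
    mentions = [span_text(INPUT_TEXT, span) for span in entity["original_spans"]]
    status = "grounded" if is_grounded(entity["id"]) else "ungrounded"
    print(f"{entity['id']} ({status}): {mentions}")
```

For this example, every span resolves to the entity's label: the genus is grounded to an NCBITaxon ID, while the species falls back to an `AUTO:` identifier.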
### Habitat schema
This schema tries to extract genus, species, and environmental terms regardless of the input language.
There are three versions:
- [templates/habitat.yaml](templates/habitat.yaml) was an initial attempt.
- [templates/habitat_v2.yaml](templates/habitat_v2.yaml) aims to improve extraction accuracy by giving more examples
and more details about how to extract ENVO terms.
- [templates/habitat_noprompt.yaml](templates/habitat_noprompt.yaml) has no custom prompts and serves as a baseline.
## Results
All templates were applied to three examples: "habitat", "frischwiese", and "pine forest". See the [results](results) folder.
All results were generated with ollama/mistral:latest (f974a74358d6):
```bash
ontogpt extract -i <input> -t <template> --show-prompt -m ollama/mistral -o <output>
```
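Applying every template to every example amounts to a small matrix of runs. Below is a minimal Python sketch that builds those commands using only the flags shown above. The example input paths and the output naming under `results/` are assumptions for illustration and may not match the repository layout.

```python
import shlex
from itertools import product
from pathlib import Path

MODEL = "ollama/mistral"
TEMPLATES = [
    "templates/habitat.yaml",
    "templates/habitat_v2.yaml",
    "templates/habitat_noprompt.yaml",
]
# Hypothetical input file names for the three examples.
EXAMPLES = [
    "examples/habitat.txt",
    "examples/frischwiese.txt",
    "examples/pine_forest.txt",
]


def build_command(example: str, template: str, model: str = MODEL) -> list[str]:
    """Build one `ontogpt extract` call; the output file name is a made-up convention."""
    output = f"results/{Path(example).stem}__{Path(template).stem}.yaml"
    return [
        "ontogpt", "extract",
        "-i", example,
        "-t", template,
        "--show-prompt",
        "-m", model,
        "-o", output,
    ]


# One command per (example, template) pair.
commands = [build_command(e, t) for e, t in product(EXAMPLES, TEMPLATES)]

for command in commands:
    # Dry run: only print the commands. To execute them, pass each list to
    # subprocess.run(command, check=True) with OntoGPT and Ollama installed.
    print(shlex.join(command))
```

Printing the commands first makes it easy to confirm the paths before spending time on model inference.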
### Lessons learned
- Prompt engineering is hard: tiny changes to the prompt or even the order of extracted attributes can change the output.
- The model choice is crucial, and the template needs to be tailored to the model. Generally, Mistral outperformed Llama **for me**.
- Output quality depends on the input language. Extraction works better when the input is translated to English first.
- Most LLMs are trained on common terms. It would be interesting to see whether a model trained on more scientific datasets from a related domain would perform better.