{"id":23933910,"url":"https://github.com/monarch-initiative/curategpt","last_synced_at":"2025-04-05T00:08:05.672Z","repository":{"id":169908133,"uuid":"645996391","full_name":"monarch-initiative/curategpt","owner":"monarch-initiative","description":"LLM-driven curation assist tool","archived":false,"fork":false,"pushed_at":"2025-03-26T20:39:39.000Z","size":12247,"stargazers_count":81,"open_issues_count":36,"forks_count":13,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-28T23:08:42.588Z","etag":null,"topics":["ai","biocuration","curation","gpt","llm","monarchinitiative","obofoundry","ontogpt","ontologies","ontology-tools"],"latest_commit_sha":null,"homepage":"https://monarch-initiative.github.io/curategpt/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/monarch-initiative.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-27T00:56:51.000Z","updated_at":"2025-03-26T20:36:01.000Z","dependencies_parsed_at":null,"dependency_job_id":"b9b15d32-99e5-4131-8d6b-ca101245a098","html_url":"https://github.com/monarch-initiative/curategpt","commit_stats":{"total_commits":44,"total_committers":4,"mean_commits":11.0,"dds":0.2272727272727273,"last_synced_commit":"03066a89dd12d7754082e378176a61c217e434f1"},"previous_names":["cmungall/curate-gpt","monarch-initiative/curate-gpt","monarch-initiative/curategpt"],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monarch-initiative
%2Fcurategpt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monarch-initiative%2Fcurategpt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monarch-initiative%2Fcurategpt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monarch-initiative%2Fcurategpt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/monarch-initiative","download_url":"https://codeload.github.com/monarch-initiative/curategpt/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247266564,"owners_count":20910836,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","biocuration","curation","gpt","llm","monarchinitiative","obofoundry","ontogpt","ontologies","ontology-tools"],"created_at":"2025-01-06T00:29:58.877Z","updated_at":"2025-04-05T00:08:05.648Z","avatar_url":"https://github.com/monarch-initiative.png","language":"Jupyter Notebook","funding_links":[],"categories":["Building"],"sub_categories":["Tools"],"readme":"# CurateGPT\n\n[![DOI](https://zenodo.org/badge/645996391.svg)](https://zenodo.org/doi/10.5281/zenodo.8293691)\n\n\nCurateGPT is a prototype web application and framework for performing general purpose AI-guided curation\nand curation-related operations over *collections* of objects.\n\n\nSee also the app on [curategpt.io](https://curategpt.io) (note: this is sometimes down, and may only have a\nsubset of the functionality of the local app)\n\n\n## Getting started\n\n### User installation\n\nCurateGPT is available on 
PyPI and may be installed with `pip`:\n\n`pip install curategpt`\n\n### Developer installation\n\nYou will first need to [install Poetry](https://python-poetry.org/docs/#installation).\n\nThen clone this repo.\n\n```\ngit clone https://github.com/monarch-initiative/curategpt.git\ncd curategpt\n```\n\nand install the dependencies:\n\n\n```\npoetry install\n```\n\n### API keys\n\nTo get the best performance from CurateGPT, we recommend getting an OpenAI API key and setting it:\n\n```\nexport OPENAI_API_KEY=\u003cyour key\u003e\n```\n\n(for members of Monarch: ask on Slack if you would like to use the group key)\n\nCurateGPT will also work with other large language models - see \"Selecting models\" below.\n\n## Loading example data and running the app\n\nYou initially start with an empty database. You can load whatever you like into this\ndatabase! Any JSON, YAML, or CSV is accepted.\nCurateGPT comes with *wrappers* for some existing local and remote sources, including\nontologies. The [Makefile](Makefile) contains some examples of how to load these. You can\nload any ontology using the `ont-\u003cname\u003e` target, e.g.:\n\n```\nmake ont-cl\n```\n\nThis loads CL (via OAK) into a collection called `ont_cl`.\n\nNote that by default this loads into a collection set stored at `stagedb`, whereas the app works off\nof `db`. 
You can copy the collection set to the db with:\n\n```\ncp -r stagedb/* db/\n```\n\n\nYou can then run the Streamlit app with:\n\n```\nmake app\n```\n\n## Building Indexes\n\nCurateGPT depends on vector database indexes of the databases/ontologies you want to curate.\n\nThe flagship application is ontology curation, so to build an index for an OBO ontology like CL:\n\n```\nmake ont-cl\n```\n\nThis requires an OpenAI key.\n\n(You can build indexes using an open embedding model by modifying the command to leave off\nthe `-m` option, but this is not recommended, as OpenAI embeddings currently seem to work best.)\n\n\nTo load the default ontologies:\n\n```\nmake all\n```\n\n(this may take some time)\n\nTo load different databases:\n\n```\nmake load-db-hpoa\nmake load-db-reactome\n```\n\n\n\nYou can load an arbitrary JSON, YAML, or CSV file:\n\n```\ncurategpt view index -c my_foo foo.json\n```\n\n(you will need to do this in the poetry shell)\n\nTo load a GitHub repo of issues:\n\n```\ncurategpt -v view index -c gh_uberon -m openai:  --view github --init-with \"{repo: obophenotype/uberon}\"\n```\n\nThe following are also supported:\n\n- Google Drives\n- Google Sheets\n- Markdown files\n- LinkML Schemas\n- HPOA files\n- GOCAMs\n- MAXOA files\n- Many more\n\n## Notebooks\n\n- See [notebooks](notebooks) for examples.\n\n## Selecting models\n\nCurrently this tool works best with the OpenAI gpt-4 model (for instruction tasks) and OpenAI `text-embedding-ada-002` for embedding.\n\nCurateGPT is layered on top of [simonw/llm](https://github.com/simonw/llm), which has a plugin\narchitecture for using alternative models. In theory you can use any of these plugins.\n\nAdditionally, you can set up an OpenAI-emulating proxy using [litellm](https://github.com/BerriAI/litellm/).\n\nThe `litellm` proxy may be installed with `pip` as `pip install litellm[proxy]`.\n\nLet's say you want to run mixtral locally using ollama. 
You start up ollama (you may have to run `ollama serve` first):\n\n```\nollama run mixtral\n```\n\nThen start up litellm:\n\n```\nlitellm -m ollama/mixtral\n```\n\nNext edit your `extra-openai-models.yaml` as detailed in [the llm docs](https://llm.datasette.io/en/stable/other-models.html):\n\n```yaml\n- model_name: ollama/mixtral\n  model_id: litellm-mixtral\n  api_base: \"http://0.0.0.0:8000\"\n```\n\nYou can now use this:\n\n```bash\ncurategpt ask -m litellm-mixtral -c ont_cl \"What neurotransmitter is released by the hippocampus?\"\n```\n\nBut be warned that many of the prompts in CurateGPT were engineered\nagainst OpenAI models, and they may give suboptimal results or fail\nentirely on other models. As an example, `ask` seems to work quite\nwell with mixtral, but `complete` works horribly. We haven't yet\ninvestigated whether the issue is the model, our prompts, or the overall\napproach.\n\nWelcome to the world of AI engineering!\n\n## Using the command line\n\n```bash\ncurategpt --help\n```\n\nYou will see various commands for working with indexes, searching, extracting, generating, etc.\n\nThese functions are generally available through the UI, and the current priority is documenting these.\n\n### Chatting with a knowledge base\n\n```\ncurategpt ask -c ont_cl \"What neurotransmitter is released by the hippocampus?\"\n```\n\nmay yield something like:\n\n```\nThe hippocampus releases gamma-aminobutyric acid (GABA) as a neurotransmitter [1](#ref-1).\n\n...\n\n## 1\n\nid: GammaAminobutyricAcidSecretion_neurotransmission\nlabel: gamma-aminobutyric acid secretion, neurotransmission\ndefinition: The regulated release of gamma-aminobutyric acid by a cell, in which the\n  gamma-aminobutyric acid acts as a neurotransmitter.\n...\n```\n\n### Chatting with PubMed\n\n```\ncurategpt view ask -V pubmed \"what neurons express VIP?\"\n```\n\n### Chatting with a GitHub issue tracker\n\n```\ncurategpt ask -c gh_obi \"what are some new term requests for electrophysiology 
terms?\"\n```\n\n### Term Autocompletion (DRAGON-AI)\n\n```\ncurategpt complete -c ont_cl  \"mesenchymal stem cell of the apical papilla\"\n```\n\nyields\n\n```yaml\nid: MesenchymalStemCellOfTheApicalPapilla\ndefinition: A mesenchymal cell that is part of the apical papilla of a tooth and has\n  the ability to self-renew and differentiate into various cell types such as odontoblasts,\n  fibroblasts, and osteoblasts.\nrelationships:\n- predicate: PartOf\n  target: ApicalPapilla\n- predicate: subClassOf\n  target: MesenchymalCell\n- predicate: subClassOf\n  target: StemCell\noriginal_id: CL:0007045\nlabel: mesenchymal stem cell of the apical papilla\n```\n\n### All-by-all comparisons\n\nYou can compare all objects in one collection against all objects in another:\n\n`curategpt all-by-all --threshold 0.80 -c ont_hp -X ont_mp --ids-only -t csv \u003e ~/tmp/allxall.mp.hp.csv`\n\nThis takes 1-2 seconds, as it involves comparison over pre-computed vectors. It reports top hits above a threshold.\n\nResults may vary. You may want to try different texts for embeddings\n(the default is the entire JSON object; for ontologies it is a\nconcatenation of the label, definition, and aliases).\n\nSample output:\n\n```\nHP:5200068,Socially innappropriate questioning,MP:0001361,social withdrawal,0.844015132437909\nHP:5200069,Spinning,MP:0001411,spinning,0.9077306606290237\nHP:5200071,Delayed Echolalia,MP:0013140,excessive vocalization,0.8153252835818089\nHP:5200072,Immediate Echolalia,MP:0001410,head bobbing,0.8348177036912526\nHP:5200073,Excessive cleaning,MP:0001412,excessive scratching,0.8699103725005582\nHP:5200104,Abnormal play,MP:0020437,abnormal social play behavior,0.8984862078522344\nHP:5200105,Reduced imaginative play skills,MP:0001402,decreased locomotor activity,0.85571629684631\nHP:5200108,Nonfunctional or atypical use of objects in play,MP:0003908,decreased stereotypic behavior,0.8586700411012859\nHP:5200129,Abnormal rituals,MP:0010698,abnormal impulsive behavior 
control,0.8727804272023427\nHP:5200134,Jumping,MP:0001401,jumpy,0.9011393233129765\n```\n\nNote that CurateGPT has a separate component for using an LLM to evaluate candidate matches (see also https://arxiv.org/abs/2310.03666); this is\nnot enabled by default, as it would be expensive to run for a whole ontology.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmonarch-initiative%2Fcurategpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmonarch-initiative%2Fcurategpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmonarch-initiative%2Fcurategpt/lists"}