Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sr5434/CodegebraGPT
Finetuning multimodal LLMs on STEM datasets
Last synced: 3 days ago
- Host: GitHub
- URL: https://github.com/sr5434/CodegebraGPT
- Owner: sr5434
- License: MIT
- Created: 2023-12-17T19:50:58.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-29T18:41:08.000Z (about 1 year ago)
- Last Synced: 2023-12-30T15:33:43.560Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 189 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome_ai_agents - CodegebraGPT - Finetuning multimodal LLMs on STEM datasets (Building / Datasets)
README
# CodegebraGPT
Finetuning multimodal LLMs on STEM datasets.
## Planned Procedure
- [x] Compile and preprocess multiple datasets
- [ ] Finetune [SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0) on this data using [QLoRA](https://arxiv.org/abs/2305.14314) (see the sketch after this list)
- [ ] Release on Hugging Face so that anybody can use it!
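
A minimal sketch of what the planned QLoRA setup could look like, assuming the standard Hugging Face `transformers`/`peft`/`bitsandbytes` stack; the LoRA hyperparameters, target modules, and Hub repo name below are illustrative, not the project's actual configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "upstage/SOLAR-10.7B-Instruct-v1.0"

# QLoRA: load the frozen base model in 4-bit NF4, compute in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable low-rank adapters to the attention projections
lora_config = LoraConfig(
    r=16,                      # illustrative rank, not the project's setting
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trained

# After training, the adapters could be shared on the Hub, e.g.:
# model.push_to_hub("your-username/CodegebraGPT-adapters")  # hypothetical repo
```

The point of QLoRA is that the 10.7B base weights stay frozen and quantized to 4 bits while only the low-rank adapters receive gradients, which is what makes finetuning a model of this size affordable on a single GPU.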
## Datasets
The combined datasets total roughly 1 million samples, but to keep training costs down I use only about 100k of them. Those samples are drawn from the datasets listed below:
- [MetaMath](https://huggingface.co/datasets/meta-math/MetaMathQA)
- [Camel AI Math](https://huggingface.co/datasets/camel-ai/math)
- [ArXiv Math](https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k)
- [Camel AI Chemistry](https://huggingface.co/datasets/camel-ai/chemistry)
- [Camel AI Physics](https://huggingface.co/datasets/camel-ai/physics)
- [Camel AI Biology](https://huggingface.co/datasets/camel-ai/biology)
- [ArXiv Physics](https://huggingface.co/datasets/ArtifactAI/arxiv-physics-instruct-tune-30k)
- [GSM8K](https://huggingface.co/datasets/gsm8k)
- [MMLU](https://huggingface.co/datasets/cais/mmlu)
- [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1)
- [GlaiveAI Code Assistant v2](https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v2)
- [ArXiv Computer Science and ML](https://huggingface.co/datasets/ArtifactAI/arxiv-cs-ml-instruct-tune-50k)
- [ScienceQA](https://huggingface.co/datasets/cnut1648/ScienceQA-LLAVA)
The compiled dataset can be found [here](https://huggingface.co/datasets/sr5434/CodegebraGPT_data). During training, I used the `100k-text` subset.
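
Loading the compiled data could look like the following, assuming `100k-text` is exposed as a dataset configuration on the Hub and a `train` split exists:

```python
from datasets import load_dataset

# Load the 100k-text subset used for training (config and split names
# are assumptions based on the README, not verified against the Hub).
ds = load_dataset("sr5434/CodegebraGPT_data", "100k-text", split="train")

print(ds)      # inspect columns and sample count
print(ds[0])   # peek at one formatted sample
```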
## Name
This LLM is named after [Codegebra](https://github.com/sr5434/codegebra), a program I wrote to solve equations, perform Fourier transforms, and more. CodegebraGPT is intended as its successor, with a more natural interface and expanded abilities.