https://github.com/sr5434/CodegebraGPT

Finetuning multimodal LLMs on STEM datasets

Host: GitHub
URL: https://github.com/sr5434/CodegebraGPT
Owner: sr5434
License: mit
Created: 2023-12-17T19:50:58.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2023-12-29T18:41:08.000Z (over 1 year ago)
Last Synced: 2023-12-30T15:33:43.560Z (over 1 year ago)
Language: Jupyter Notebook
Size: 189 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

Awesome Lists containing this project

awesome_ai_agents - Codegebragpt - Finetuning multimodal LLMs on STEM datasets (Building / Datasets)

README

# CodegebraGPT

Finetuning multimodal LLMs on STEM datasets.

## Planned Procedure

 - [x] Compile and preprocess multiple datasets

 - [ ] Finetune [SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0) on this data using [QLoRA](https://arxiv.org/abs/2305.14314)

 - [ ] Release on Hugging Face so that anyone can use it!
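
QLoRA keeps the base model's weights frozen in 4-bit precision and trains only small low-rank adapter matrices on top of them. As a rough illustration of why that keeps finetuning cheap, the sketch below counts the trainable parameters of a single adapter. The layer shape and rank are hypothetical values chosen for the example, not settings taken from this project.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix that the adapter sits on."""
    return d_in * d_out


# Hypothetical layer shape and adapter rank, for illustration only.
d_in, d_out, rank = 4096, 4096, 16

adapter = lora_trainable_params(d_in, d_out, rank)
frozen = full_params(d_in, d_out)
print(f"adapter: {adapter:,} trainable vs {frozen:,} frozen ({adapter / frozen:.2%})")
```

The adapter size grows linearly with the rank, so the rank is the main knob for trading adapter capacity against memory and compute.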

## Datasets

The model is trained on about 100k samples. Combined, the source datasets contain about 1 million samples, but I use only about 100k of them to keep training costs down. The samples are drawn from the datasets listed below:

 - [MetaMath](https://huggingface.co/datasets/meta-math/MetaMathQA)

 - [Camel AI Math](https://huggingface.co/datasets/camel-ai/math)

 - [ArXiv Math](https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k)

 - [Camel AI Chemistry](https://huggingface.co/datasets/camel-ai/chemistry)

 - [Camel AI Physics](https://huggingface.co/datasets/camel-ai/physics)

 - [Camel AI Biology](https://huggingface.co/datasets/camel-ai/biology)

 - [ArXiv Physics](https://huggingface.co/datasets/ArtifactAI/arxiv-physics-instruct-tune-30k)

 - [GSM8K](https://huggingface.co/datasets/gsm8k)

 - [MMLU](https://huggingface.co/datasets/cais/mmlu)

 - [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1)

 - [GlaiveAI Code Assistant v2](https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v2)

 - [ArXiv Computer Science and ML](https://huggingface.co/datasets/ArtifactAI/arxiv-cs-ml-instruct-tune-50k)

 - [ScienceQA](https://huggingface.co/datasets/cnut1648/ScienceQA-LLAVA)

The combined dataset is available [here](https://huggingface.co/datasets/sr5434/CodegebraGPT_data). Training uses the `100k-text` subset.
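
This README does not say how the 100k subset was drawn from the full pool. One plausible approach is to give each source a share of the budget proportional to its size and then sample at random within each source. The sketch below shows that idea on toy data; the function names, source names, and the proportional rule are all assumptions made for illustration, not the project's actual preprocessing.

```python
import random


def allocate_budget(sizes, budget):
    """Split `budget` samples across sources in proportion to their sizes."""
    total = sum(sizes.values())
    if budget >= total:
        return dict(sizes)  # budget covers everything, so keep all samples
    # Start from the floor of each source's proportional share.
    quotas = {name: n * budget // total for name, n in sizes.items()}
    # Give the leftover samples to the sources with the largest remainders.
    leftover = budget - sum(quotas.values())
    by_remainder = sorted(
        sizes, key=lambda name: (sizes[name] * budget) % total, reverse=True
    )
    for name in by_remainder[:leftover]:
        quotas[name] += 1
    return quotas


def subsample(sources, budget, seed=0):
    """Draw a shuffled mix of `budget` rows from a dict of name -> list of rows."""
    rng = random.Random(seed)
    quotas = allocate_budget({name: len(rows) for name, rows in sources.items()}, budget)
    mixed = []
    for name, rows in sources.items():
        mixed.extend(rng.sample(rows, quotas[name]))
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for the real datasets (hypothetical names and sizes).
toy_sources = {
    "math": [f"math-{i}" for i in range(600)],
    "code": [f"code-{i}" for i in range(300)],
    "science": [f"science-{i}" for i in range(100)],
}
subset = subsample(toy_sources, budget=100)
print(len(subset), "rows selected")
```

Fixing the seed makes the subset reproducible, which helps when comparing training runs on the same data.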

## Name

This LLM is named after [Codegebra](https://github.com/sr5434/codegebra), which is a program I made to solve equations, perform Fourier transforms, etc. It is intended to be Codegebra's successor, with a more natural interface and expanded abilities.
