LL3M: Large Language and Multi-Modal Model in Jax
https://github.com/jiasenlu/ll3m
- Host: GitHub
- URL: https://github.com/jiasenlu/ll3m
- Owner: jiasenlu
- Created: 2024-04-02T07:50:18.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-23T05:54:42.000Z (over 1 year ago)
- Last Synced: 2025-03-26T23:04:42.865Z (7 months ago)
- Language: Python
- Size: 2.53 MB
- Stars: 71
- Watchers: 1
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# LL3M: Large Language and Multi-Modal Model in Jax / Flax
The goal of this repo is to build Large Language / Multi-Modal Models and MoE models that are easy to train and finetune in Jax / Flax.
### Installing on GPU Host
The GPU environment can be installed via [Anaconda](https://www.anaconda.com/products/distribution).

```shell
conda env create -f scripts/gpu_environment.yml
conda activate LL3M
```

### Installing on Cloud TPU Host
The TPU host VM comes with Python and PIP pre-installed. Run the following script to set up the TPU host.

```shell
bash ./tpu_startup_script_local.sh
```

Activate the environment:

```shell
. $HOME/.LL3M/bin/activate
```

## Model
### Large Language Model (LLM)
Currently, the codebase supports LLaMA, Mistral, Phi, OpenLLaMA, and TinyLLaMA models for training and inference.
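As a rough illustration only (not this repo's actual module code), the models above share LLaMA-style building blocks such as RMSNorm and rotary position embeddings, which look roughly like the following in Flax:

```python
# Illustrative sketch: minimal RMSNorm and rotary-embedding helpers in Flax.
# These are assumed to resemble, but are not copied from, this repo's models.
import jax
import jax.numpy as jnp
import flax.linen as nn


class RMSNorm(nn.Module):
    """Root-mean-square layer norm used by LLaMA-family models."""
    dim: int
    eps: float = 1e-6

    @nn.compact
    def __call__(self, x):
        scale = self.param("scale", nn.initializers.ones, (self.dim,))
        var = jnp.mean(jnp.square(x), axis=-1, keepdims=True)
        return x * jax.lax.rsqrt(var + self.eps) * scale


def apply_rotary_embedding(x, base=10000.0):
    """Rotary position embedding for x of shape [batch, seq, heads, head_dim]."""
    seq_len, head_dim = x.shape[1], x.shape[-1]
    inv_freq = 1.0 / (base ** (jnp.arange(0, head_dim, 2) / head_dim))
    angles = jnp.arange(seq_len)[:, None] * inv_freq[None, :]  # [seq, head_dim/2]
    cos = jnp.cos(angles)[None, :, None, :]
    sin = jnp.sin(angles)[None, :, None, :]
    x1, x2 = x[..., ::2], x[..., 1::2]
    rotated = jnp.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return rotated.reshape(x.shape)


# Example usage with dummy inputs:
norm = RMSNorm(dim=128)
x = jnp.ones((1, 16, 128))
params = norm.init(jax.random.PRNGKey(0), x)
y = norm.apply(params, x)
q = apply_rotary_embedding(jnp.ones((1, 16, 8, 64)))
```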
## Dataset
### LLM Dataset

The Dolma dataset contains high-quality data from different sources. The OLMo model simply concatenates all tokens without any source-level sampling. Here, we instead use seqio to resample each source according to the heuristic factors in the table below (a seqio sketch follows the table).

| Source | Doc Type | Bytes (GB) | Percentage | Sampling Factor | Resampled Bytes (GB) | Sample Ratio |
| ------------------| ------- | ------- | -------- | -------| -------- | ------ |
| Common Crawl | web pages | 9,022 | 78.46% | 0.5x | 4,511 | 46.23% |
| The Stack | code | 1,043 | 9.07% | 2x| 2,086 | 21.37% |
| C4 | web pages | 790 | 6.87% | 2x | 1,580 | 16.19% |
| Reddit | social media | 339 | 2.94% | 2x | 678 | 6.94% |
| peS2o | STEM papers | 268 | 2.33% | 2x | 536 | 5.49% |
| Project Gutenberg | books | 20.4 | 0.17% | 5x | 204 | 2.10% |
| Wikipedia, Wikibooks | encyclopedic | 16.2 | 0.14% | 5x | 162 | 1.66% |

For more information, please refer to the doc.
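As a hedged sketch of how such ratios can be expressed (the task names below are hypothetical placeholders, not the names this repo actually registers), seqio lets each source be registered as a Task and combined into a Mixture whose relative rates follow the resampled byte column:

```python
# Hedged sketch: driving a seqio mixture with the resampled byte counts above.
# The task names are hypothetical; the real repo registers its own Dolma tasks
# via seqio.TaskRegistry before building the mixture.
import seqio

# Relative mixing rates taken from the "Resampled Bytes (GB)" column above.
dolma_rates = {
    "dolma_common_crawl": 4511,
    "dolma_stack": 2086,
    "dolma_c4": 1580,
    "dolma_reddit": 678,
    "dolma_pes2o": 536,
    "dolma_gutenberg": 204,
    "dolma_wikipedia": 162,
}

# Each (task_name, rate) pair is sampled in proportion to its rate, so reading
# the mixture reproduces the "Sample Ratio" column.
seqio.MixtureRegistry.add(
    "dolma_mixture",
    [(name, float(rate)) for name, rate in dolma_rates.items()],
)
```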
## Release Plan
- [x] Language Model and Seqio Dataloader for Dolma dataset.
- [x] Multimodal Model that supports LLaVA, captioning, and other tasks.
- [x] A shaped model that combines different model variants, which can serve as an initial MoE model.
- [ ] A Mixtral-style MoE model that can be trained from scratch or initialized from existing dense models.
- [ ] DPO and RLHF on LLM, LMM, and MoE.

## Credits
A large portion of the code is borrowed from [EasyLM](https://github.com/young-geng/EasyLM).