# Tinyllamas 🦙: Extensible Language Model inspired by the original Llama model

https://github.com/user-attachments/assets/d998b84d-9835-4555-93f0-af7460297e9f

Tinyllamas is an advanced language model framework, inspired by the original Llama model but enhanced with additional features such as Grouped Query Attention (GQA), Multi-Head Attention (MHA), and more. This project aims to provide a flexible and extensible platform for experimenting with various attention mechanisms and building state-of-the-art natural language processing models.

**_Project structure:_**
The [model](https://github.com/Esmail-ibraheem/Tinyllamas/blob/main/model.py) is implemented in roughly 500 lines of code, with its hyperparameters kept in a separate [configuration](https://github.com/Esmail-ibraheem/Tinyllamas/blob/main/config.py) file.
```
Tinyllamas/
│
├── images/
│
├── models/
│   ├── attentions/
│   ├── rotary_embeddings/
│   └── transformer/
│
├── model.py
├── config.py
└── inference.py
```

---

## Features:
- **`Rotary Embeddings`:**
  - Rotary Embeddings
  - Linear Scaling Rotary Embeddings
  - Dynamic NTK Scaling Rotary Embeddings



```python
LLAMA_ROTARY_EMBEDDINGS_CLASSES = {
    "rotary": LlamaRotaryEmbeddings,
    "linear": LlamaLinearScalingRotaryEmbeddings,
    "dynamic": LlamaDynamicNTKScalingRotaryEmbeddings,
}
```
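
For reference, here is a minimal sketch of how the plain rotary variant injects position information into the queries and keys (the function names are illustrative, not the exact ones in `model.py`); the linear and dynamic NTK variants only change how the positions or the base frequency are rescaled:

```python
import torch

def rotate_half(x):
    # split the last dimension in half and swap the halves with a sign flip
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_embeddings(q, k, base=10000.0):
    # q, k: (batch, n_heads, seq_len, head_dim)
    head_dim, seq_len = q.size(-1), q.size(-2)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, head_dim/2)
    cos = torch.cat((angles, angles), dim=-1).cos()                # (seq_len, head_dim)
    sin = torch.cat((angles, angles), dim=-1).sin()
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```

Linear scaling divides the positions by a fixed factor, while dynamic NTK scaling adjusts `base` as the sequence grows beyond the training context.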

- **`LlamaChat` interfaces:**
  - **_Using Streamlit_: run the adult llama in a Streamlit interface**
```
streamlit run app.py
```
![__-Llama-2-Chatbot-by-Esmail-Gumaan-and-2-more-pages-Personal-Microsoft_-Edge-2024-05-15-17-19-03](https://github.com/Esmail-ibraheem/Tinyllamas-Pytorch/assets/113830751/b52b5b68-3f5e-4cfb-9719-b0fae5fa4678)


  - **_Using Gradio_: run the baby llama in a Gradio interface**

```
python llama_interface.py
```
![image](https://github.com/user-attachments/assets/a84ef9da-bed4-4a28-bb85-baefac593034)

  - **_Using FastAPI_: serve the baby llama in the browser (this feature is aimed at JavaScript devs); a minimal sketch is shown below**


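A minimal sketch of what such an endpoint could look like, assuming a `generate(prompt)` helper that wraps the repository's inference code (all names here are illustrative, not the project's actual API):

```python
# Hypothetical FastAPI wrapper around the Tinyllamas inference code.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    # placeholder: swap in the actual Tinyllamas inference call here
    return prompt + " ..."

class GenerationRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate_text(request: GenerationRequest):
    completion = generate(request.prompt, request.max_new_tokens)
    return {"completion": completion}
```

A JavaScript front end can then `fetch("/generate", {method: "POST", ...})` against the server started with `uvicorn app:app`.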

- **`Attentions`:**
The standard practice for autoregressive decoding is to cache the keys and values of the previous tokens in the sequence to speed up attention computation. However, as the context window or batch size increases, the memory cost of the key-value (KV) cache in the multi-head attention (MHA) model grows significantly.
- **`Multi-Head Attention (MHA)`:**\
[Self-attention](https://github.com/Esmail-ibraheem/Tinyllamas-Pytorch/blob/main/models/transformer.py) is computed by taking the dot product of the query and key, scaling it by a factor, and applying a softmax to obtain attention weights. These [attention](https://github.com/Esmail-ibraheem/Tinyllamas-Pytorch/blob/main/models/attentions.py) weights determine how much each word's value contributes to the current word.
$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
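
A minimal PyTorch sketch of this computation, including the KV cache mentioned above (illustrative only, not the exact code in `models/attentions.py`):

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, kv_cache=None, mask=None):
    # q, k, v: (batch, n_heads, seq_len, head_dim)
    if kv_cache is not None:
        # append the new keys/values to those cached from previous decoding steps
        past_k, past_v = kv_cache
        k = torch.cat([past_k, k], dim=2)
        v = torch.cat([past_v, v], dim=2)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, (k, v)  # return the updated cache for the next step
```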



---

- **`Grouped-Query Attention (GQA)` and `Multi-Query Attention (MQA)`:**\
Grouped-query attention divides the query heads into G groups, each of which shares a single key head and value head. GQA-G refers to grouped-query attention with G groups: GQA-1, with a single group and therefore a single key and value head, is equivalent to MQA, while GQA-H, with as many groups as heads, is equivalent to MHA. The figure below compares grouped-query attention with multi-head and multi-query attention. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head is constructed by mean-pooling all the original heads within that group. An intermediate number of groups yields an interpolated model that is higher quality than MQA but faster than MHA and, as the GQA paper shows, represents a favorable trade-off. Going from MHA to MQA reduces H key and value heads to a single key and value head, shrinking the KV cache and the amount of data that must be loaded by a factor of H. However, larger models generally scale up the number of heads, so multi-query attention is a more aggressive cut in both memory bandwidth and capacity; GQA keeps the same proportional decrease in bandwidth and capacity as model size increases.
  - `MQA`: Multi-query attention uses only a single key-value head for all query heads, which saves memory and greatly speeds up decoder inference.
  - `Fixed GQA`: MQA, however, can degrade quality. Since we want fast inference while staying on par with MHA, grouped-query attention comes into play: it is an interpolation of multi-query and multi-head attention that achieves quality close to MHA at a speed comparable to MQA.
  - `Scalable GQA`: the same as fixed GQA but combined with the multiple rotary-embedding variants above.


_[Figure: comparison of grouped-query attention with multi-head and multi-query attention.]_

**_MHA vs GQA vs MQA:_**

| MHA | GQA | MQA |
|:--------------------:|:-------------------------------------------:|:--------------------:|
| High quality, but computationally slow | A good compromise between quality and speed | Computationally fast, but with a loss in quality |


_[Figure: time per sample for GQA-XXL as a function of the number of GQA groups, with input length 2048 and output length 512.]_

Going from 1 group (MQA) to 8 groups adds modest inference overhead, with increasing cost as more groups are added. This demonstrates the effect of the number of GQA groups on inference speed: for larger models, the memory-bandwidth overhead from the KV cache is less constraining (Shazeer, 2019), while the reduction in key-value size is sharper due to the increased number of heads. As a result, increasing the number of groups from MQA causes only modest slowdowns initially, with increasing cost as we move closer to MHA. The GQA authors selected 8 groups as a favorable middle ground.


_[Figure: performance as a function of uptraining proportion for T5-XXL with MQA and GQA.]_

This shows how performance varies with the uptraining proportion for T5-XXL with MQA and GQA. GQA already achieves reasonable performance immediately after conversion, while MQA requires uptraining to be useful. Both MQA and GQA gain from 5% uptraining, with diminishing returns from 10%.

> MHA enables a nuanced understanding of the relationships between different parts of the input. Nevertheless, this complexity comes at a cost: a significant demand on memory bandwidth, especially during decoder inference. In multi-query attention, we average the heads for keys and values so that all query heads share the same key and value head. This is achieved by replicating the mean-pooled "head" H times, where H is the number of query heads. However, MQA is not without its drawbacks. The reduced complexity can lead to quality degradation and training instability. Grouped-query attention (GQA) is a simple approach that blends elements of multi-head attention (MHA) and multi-query attention (MQA) to create a more efficient attention mechanism.

```python
LLAMA_ATTENTIONS_CLASSES = {
    "GQA": LlamaScalableGroupedQueryAttention,
    "MHA": MultiHeadAttention,
    "MQA": MultiQueryAttention,
}
```
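
For intuition, here is a minimal sketch of grouped-query attention built on top of the scaled dot-product attention shown earlier, together with the mean-pooling used when converting MHA-style key/value heads into groups (illustrative only, using simple helper names rather than the repository's actual classes):

```python
import math
import torch
import torch.nn.functional as F

def mean_pool_kv_heads(x, n_groups):
    # MHA -> GQA checkpoint conversion: mean-pool the K/V heads within each group.
    # x: (batch, n_heads, seq_len, head_dim) -> (batch, n_groups, seq_len, head_dim)
    b, h, s, d = x.shape
    return x.view(b, n_groups, h // n_groups, s, d).mean(dim=2)

def grouped_query_attention(q, k, v):
    # q: (batch, n_q_heads, seq_len, head_dim)
    # k, v: (batch, n_kv_heads, seq_len, head_dim) with n_q_heads % n_kv_heads == 0
    group_size = q.size(1) // k.size(1)
    # replicate each K/V head so every query head in its group attends to the same K/V
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = F.softmax(scores, dim=-1)
    return weights @ v  # (batch, n_q_heads, seq_len, head_dim)
```

With a single key-value head this reduces to MQA, and with as many key-value heads as query heads it reduces to MHA.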

---

## Usage:
### `Using the adult llama:`
Install the required libraries:
```
pip install -r requirements.txt
```
or
```
pip install torch transformers
```
Clone the repo:
```
git clone https://github.com/Esmail-ibraheem/Tinyllamas.git
```
Run the download shell script to fetch the Llama 2 weights:
```
./download.sh
```
After downloading the weights, run the inference code:
```
python inference.py
```

Now you can test the model by changing the prompts to whatever you want; here are some physics prompts:
```python
prompts = [
    "Simulate the motion of a projectile launched at a certain angle and velocity, including factors like gravity and air resistance.",
    "Create a program that calculates the gravitational force between two objects based on their masses and distances.",
    "Develop a program to simulate the behavior of ideal gases using the laws of thermodynamics.",
]
```

### `Or using the single-file Tinyllama, which is a baby llama:`
First download a checkpoint from Karpathy's TinyStories-trained tinyllamas:
```
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
```

Then run this command in your terminal:
```
python tinyllama.py stories15M.bin 0.8 256 "dostoevsky crime and punishment"
```

---

## Citation:
```bibtex
@misc{Gumaan2024-Tinyllamas,
  title        = {Tinyllamas},
  author       = {Gumaan, Esmail},
  howpublished = {\url{https://github.com/Esmail-ibraheem/Tinyllamas}},
  year         = {2024},
  month        = {May},
  note         = {[Online; accessed 2024-05-15]},
}

```
---
## Notes and Acknowledgments:
I developed this project to enhance my skills with large language models and transformers. I built the Llama model from scratch and implemented various features, including multiple attention variants. Feel free to suggest any additional features you'd like to see, such as flash attention or related techniques. This project draws on several research papers.

**Papers**:
- [Llama 2 research paper](https://arxiv.org/abs/2307.09288)
- [Attention Is All You Need research paper](https://arxiv.org/abs/1706.03762)
- [Grouped-Query Attention (GQA) research paper](https://arxiv.org/abs/2305.13245)
- [RoFormer: Enhanced Transformer with Rotary Position Embedding research paper](https://arxiv.org/abs/2104.09864)

**Other**:
- [Llama from scratch (video)](https://youtu.be/oM4VmoabDAI?si=rDegyrnSghByUEnK)
- [Hugging Face Transformers library](https://github.com/huggingface/transformers)