https://github.com/ridgerchu/matmulfreellm

Implementation for MatMul-free LM.
https://github.com/ridgerchu/matmulfreellm

large-language-model linear-transformer llm

Last synced: 11 days ago
JSON representation

Implementation for MatMul-free LM.

Host: GitHub
URL: https://github.com/ridgerchu/matmulfreellm
Owner: ridgerchu
License: apache-2.0
Created: 2024-04-23T20:44:31.000Z (about 1 year ago)
Default Branch: master
Last Pushed: 2024-11-05T23:38:38.000Z (7 months ago)
Last Synced: 2025-05-07T01:09:11.049Z (18 days ago)
Topics: large-language-model, linear-transformer, llm
Language: Python
Homepage:
Size: 1.52 MB
Stars: 2,996
Watchers: 46
Forks: 186
Open Issues: 24
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome - ridgerchu/matmulfreellm - Implementation for MatMul-free LM. (Python)
StarryDivineSky - ridgerchu/matmulfreellm - Free LM 是一种语言模型架构，无需矩阵乘法（MatMul）运算。此存储库提供了与 🤗 Transformers 库兼容的 MatMul-Free LM 实现。我们评估了缩放定律如何拟合 Transformer++ 和我们的模型中的 370M、1.3B 和 2.7B 参数模型。为了公平比较，每个操作的处理方式相同，尽管我们的模型在某些层中使用了更有效的三元权重。有趣的是，与 Transformer++ 相比，我们模型的缩放投影表现出更陡峭的下降，这表明我们的架构在利用额外计算来提高性能方面更有效。 (A01_文本生成_文本对话 / 大语言对话模型及数据)

README

        






MatMul-Free LM

 If you like our project, please give us a star ⭐ on GitHub for the latest updates.  

 This repo is adapted from flash-linear-attention. 



[![hf_model](https://img.shields.io/badge/🤗-Models-blue.svg)](https://huggingface.co/collections/ridger/matmulfree-lm-665f4d2b4e4648756e0dd13c) [![arXiv](https://img.shields.io/badge/Arxiv-2406.02528-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2406.02528) 

# Introduction







MatMul-Free LM is a language model architecture that eliminates the need for Matrix Multiplication (MatMul) operations. This repository provides an implementation of MatMul-Free LM that is compatible with the 🤗 Transformers library.

# Scaling Law







We evaluate how the scaling law fits to the 370M, 1.3B and 2.7B parameter models in both Transformer++ and our model. For a fair comparison, each operation is treated identically, though our model uses more efficient ternary weights in some layers. Interestingly, the scaling projection for our model exhibits a steeper descent compared to Transformer++, suggesting our architecture is more efficient in leveraging additional compute to improve performance.

# Installation

The following requirements should be satisfied 

- [PyTorch](https://pytorch.org/) >= 2.0

- [Triton](https://github.com/openai/triton) >=2.2

- [einops](https://einops.rocks/)

```sh

pip install -U git+https://github.com/ridgerchu/matmulfreellm

```

# Usage

## Pre-trained Model Zoo

| Model Size     | Layer | Hidden dimension  | Trained tokens |

|:----------------|:------------:|:----------------:|:------------------:|

| [370M](https://huggingface.co/ridger/MMfreeLM-370M)  | 24  | 1024 | 15B  |

| [1.3B](https://huggingface.co/ridger/MMfreeLM-1.3B)  | 24 | 2048 | 100B  |

| [2.7B](https://huggingface.co/ridger/MMfreeLM-2.7B)  | 32  | 2560 | 100B  |

## Model

We provide the implementations of models that are compatible with 🤗 Transformers library. 

Here's an example of how to initialize a model from the default configs in `matmulfreelm`:

This is a huggingface-compatible library that you can use such command to initialize the model with huggingface `AutoModel`:

```py

>>> from mmfreelm.models import HGRNBitConfig

>>> from transformers import AutoModel

>>> config = HGRNBitConfig()

>>> AutoModel.from_config(config)

HGRNBitModel(

  (embeddings): Embedding(32000, 2048)

  (layers): ModuleList(

    (0): HGRNBitBlock(

      (attn_norm): RMSNorm(2048, eps=1e-06)

      (attn): HGRNBitAttention(

        (i_proj): FusedBitLinear(

          in_features=2048, out_features=2048, bias=False

          (norm): RMSNorm(2048, eps=1e-08)

        )

        (f_proj): FusedBitLinear(

          in_features=2048, out_features=2048, bias=False

          (norm): RMSNorm(2048, eps=1e-08)

        )

        (g_proj): FusedBitLinear(

          in_features=2048, out_features=2048, bias=False

          (norm): RMSNorm(2048, eps=1e-08)

        )

        (g_norm): FusedRMSNormSwishGate()

        (o_proj): FusedBitLinear(

          in_features=2048, out_features=2048, bias=False

          (norm): RMSNorm(2048, eps=1e-08)

        )

      )

      (mlp_norm): RMSNorm(2048, eps=1e-06)

      (mlp): HGRNBitMLP(

        (gate_proj): FusedBitLinear(

          in_features=2048, out_features=11264, bias=False

          (norm): RMSNorm(2048, eps=1e-08)

        )

        (down_proj): FusedBitLinear(

          in_features=5632, out_features=2048, bias=False

          (norm): RMSNorm(5632, eps=1e-08)

        )

        (act_fn): SiLU()

      )

    )

    

)

>>> 

```

## Generation

Upon successfully pretraining a model, it becomes accessible for generating text using the 🤗 text generation APIs.

In the following, we give a generation example in `generate.py`:

```py

import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

import mmfreelm

from transformers import AutoModelForCausalLM, AutoTokenizer

#Change here to our open-sourced model

name = ''

tokenizer = AutoTokenizer.from_pretrained(name)

model = AutoModelForCausalLM.from_pretrained(name).cuda().half()

input_prompt = "In a shocking finding, scientist discovered a herd of unicorns living in a remote, "

input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.cuda()

outputs = model.generate(input_ids, max_length=32,  do_sample=True, top_p=0.4, temperature=0.6)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

```

# Citation

If you use this repo in your work, please cite our preprint:

```bib

@article{zhu2024scalable,

title={Scalable MatMul-free Language Modeling},

author={Zhu, Rui-Jie and Zhang, Yu and Sifferman, Ethan and Sheaves, Tyler and Wang, Yiqiao and Richmond, Dustin and Zhou, Peng and Eshraghian, Jason K},

journal={arXiv preprint arXiv:2406.02528},

year={2024}

}

```