https://github.com/changwoolee/blast

[NeurIPS 2024] BLAST: Block Level Adaptive Structured Matrix for Efficient Deep Neural Network Inference

Host: GitHub
URL: https://github.com/changwoolee/blast
Owner: changwoolee
License: mit
Created: 2024-09-27T17:24:51.000Z (7 months ago)
Default Branch: main
Last Pushed: 2024-11-06T02:08:33.000Z (6 months ago)
Last Synced: 2025-03-25T14:04:46.165Z (about 1 month ago)
Topics: efficient-inference, large-language-models, llama, matrix-factorization, matrix-multiplication, model-compression
Language: Python
Homepage:
Size: 1.43 MB
Stars: 10
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# BLAST: Block-Level Adaptive Structured Matrices for Efficient Deep Neural Network Inference

**[Changwoo Lee](http://changwoolee.github.io), [Soo Min Kwon](https://soominkwon.github.io), [Qing Qu](https://qingqu.engin.umich.edu), and [Hun-Seok Kim](https://kim.engin.umich.edu)**

University of Michigan



**[[Paper](https://arxiv.org/abs/2410.21262)]**



## Notice

This repo is being actively updated.

* [Blast-Llama-4B](https://huggingface.co/cwoolee/blast-llama-4B) is now available on Hugging Face! 🤗 

* The [arXiv](https://arxiv.org/abs/2410.21262) version is available!

* The paper has been accepted to NeurIPS 2024.

## Dependencies

The packages can be installed via `conda env create --file environment.yml`.

Additionally, install `lm-evaluation-harness` with the BLAST implementation via

```bash
cd lm-evaluation-harness
pip install -e .
```

## Blast-Llama-4B Model

Blast-Llama-4B is a Llama-7B model compressed by 50% via the procedure described below.

The model can be loaded using the `transformers` library.

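```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# The tokenizer is loaded from the original Llama-7B checkpoint.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

# trust_remote_code=True lets transformers run the model code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained("cwoolee/blast-llama-4B", trust_remote_code=True)
```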
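Once the model and tokenizer are loaded, text can be generated through the standard `transformers` generation API. The helper below is a minimal sketch: the name `generate_text` and its defaults are illustrative and not part of this repo, and it assumes the remote-code BLAST model supports `generate` like other causal language models.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=32):
    """Greedy-decode a continuation of `prompt` and return it as a string.

    Hypothetical helper for illustration; not part of the BLAST repo.
    """
    # Tokenize the prompt into PyTorch tensors (input_ids, attention_mask).
    inputs = tokenizer(prompt, return_tensors="pt")
    # do_sample=False selects greedy decoding; the result is a batch of id sequences.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode the first (and only) sequence back to text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the `model` and `tokenizer` objects from the loading snippet, a call such as `generate_text(model, tokenizer, "The capital of France is")` returns the prompt followed by up to 32 newly generated tokens.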

## Llama Decomposition

Run `bash ./scripts/decompose_llama.sh 0-31`.
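For intuition about the structure the decomposition targets: the paper defines a BLAST matrix as a `b x b` grid of `p x p` blocks, where block `(i, j)` equals `U_i diag(s_ij) V_j^T`, with the left factor `U_i` shared along block-row `i` and the right factor `V_j` shared along block-column `j`. The NumPy sketch below builds such a matrix and multiplies it by a vector without forming the dense matrix. It is an illustration under stated assumptions, not the repo's implementation: the function names, array layout, and toy sizes are hypothetical.

```python
import numpy as np

def blast_to_dense(U, V, S):
    """Materialize the dense matrix. U, V: (b, p, r); S: (b, b, r)."""
    b, p, _ = U.shape
    # A[i, a, j, c] = sum_k U[i, a, k] * S[i, j, k] * V[j, c, k],
    # i.e. block (i, j) is U_i diag(s_ij) V_j^T.
    A = np.einsum("iak,ijk,jck->iajc", U, S, V)
    return A.reshape(b * p, b * p)

def blast_matvec(U, V, S, x):
    """Compute A @ x from the factors only."""
    b, p, _ = U.shape
    x_blocks = x.reshape(b, p)
    z = np.einsum("jpk,jp->jk", V, x_blocks)  # z_j = V_j^T x_j
    w = np.einsum("ijk,jk->ik", S, z)         # w_i = sum_j s_ij * z_j (elementwise)
    y = np.einsum("ipk,ik->ip", U, w)         # y_i = U_i w_i
    return y.reshape(b * p)

def blast_param_count(b, p, r):
    """Parameters in U, V, and S: 2*n*r + r*b^2 with n = b*p."""
    n = b * p
    return 2 * n * r + r * b * b

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
b, p, r = 4, 8, 4
U = rng.standard_normal((b, p, r))
V = rng.standard_normal((b, p, r))
S = rng.standard_normal((b, b, r))
x = rng.standard_normal(b * p)

# The factored product should match the dense product.
assert np.allclose(blast_to_dense(U, V, S) @ x, blast_matvec(U, V, S, x))
```

With these toy sizes the factors hold `blast_param_count(4, 8, 4) = 320` values, against `32 * 32 = 1024` for the dense matrix. The savings come from sharing `U_i` and `V_j` across blocks, so only the short vectors `s_ij` are specific to each block.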

## Blast-Llama Retraining

Run `bash ./scripts/train_blast.sh`. The script assumes that 4 GPUs are available.

We retrained the compressed Llama model for 400 steps on a subset of the SlimPajama dataset, available [here](https://huggingface.co/datasets/DKYoon/SlimPajama-6B).

## Evaluation using `lm-evaluation-harness`

Run `bash scripts/lm-eval-blast.sh`.

## Acknowledgment

This repo is highly inspired by [huggingface/transformers](https://github.com/huggingface/transformers/tree/main) and [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

## Citation

Please cite our paper if you find this repo or our paper useful:

```
@inproceedings{lee2024blast,
    title={{BLAST}: Block-Level Adaptive Structured Matrices for Efficient Deep Neural Network Inference},
    author={Lee, Changwoo and Kwon, Soo Min and Qu, Qing and Kim, Hun-Seok},
    booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
    year={2024},
}
```
