Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Jaykef/mlx-rag-gguf
Minimal, clean code implementation of RAG with mlx using gguf model weights
- Host: GitHub
- URL: https://github.com/Jaykef/mlx-rag-gguf
- Owner: Jaykef
- Created: 2024-03-26T03:31:54.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-04-27T04:18:20.000Z (10 months ago)
- Last Synced: 2024-10-29T18:22:56.764Z (4 months ago)
- Language: Python
- Size: 6.31 MB
- Stars: 43
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# MLX RAG With GGUF Model Weights
Minimal, clean code implementation of RAG with MLX using gguf model weights. The code here builds on https://github.com/vegaluisjose/mlx-rag and has been optimized to support RAG-based inference with .gguf models. I use BAAI/bge-small-en as the embedding model, TinyLlama-1.1B-Chat-v1.0-GGUF (you can choose from the supported models below) as the base model, and a custom vector database script for indexing the text in a PDF file. Inference speeds can reach ~413 tokens/sec for prompt processing and ~36 tokens/sec for generation on my 8GB M2 Air.
## Update
- Added support for [phi-3-mini-4k-instruct.gguf](https://huggingface.co/Jaward/phi-3-mini-4k-instruct.Q4_0.gguf) and other `Q4_0`, `Q4_1` & `Q8_0` quantized models. Download the model and save it in the models/phi-3-mini-instruct folder (see the sketch below).
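As a hedged sketch, the model can be fetched with `huggingface_hub`; the repo ID comes from the link above, while the filename inside that repo is my assumption (it matches the path used in the query example further down), so verify it on the model page:

```python
# Hedged sketch: download the phi-3 gguf and place it where the query example
# below expects it. The filename inside the repo is assumed; verify it first.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="Jaward/phi-3-mini-4k-instruct.Q4_0.gguf",
    filename="phi-3-mini-4k-instruct.Q4_0.gguf",
    local_dir="models/phi-3-mini-instruct",
)
```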
## Demo

https://github.com/Jaykef/mlx-rag-gguf/assets/11355002/e97907ed-1142-4f3e-b2fd-95690c4b50f3
## Usage
Download the models (you can use Hugging Face's `snapshot_download`, but I recommend downloading them individually to save time) and save them in the models folder.
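As one hedged sketch of the download step with `huggingface_hub`: the TinyLlama repo ID below is my assumption about where those weights are hosted (not something this README pins down), so point it at whichever source you actually use.

```python
# Hedged sketch: one way to fetch the gguf base model and the embedding model.
# The TinyLlama repo ID is an assumed source, not pinned by this README.
from huggingface_hub import hf_hub_download, snapshot_download

# Base model: a single .gguf file, saved under models/
hf_hub_download(
    repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    filename="tinyllama-1.1b-chat-v1.0.Q4_0.gguf",
    local_dir="models",
)

# Embedding model: per the list below, only model.safetensors is needed,
# saved in the bge-small-en folder.
snapshot_download(
    repo_id="BAAI/bge-small-en",
    local_dir="bge-small-en",
    allow_patterns=["model.safetensors"],
)
```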
> [!NOTE]
> MLX currently only supports a few quantizations: `Q4_0`, `Q4_1`, and `Q8_0`.
> Unsupported quantizations will be cast to `float16`.

### Tested/Supported models
**Tinyllama Q4_0 and Q8_0**
- tinyllama-1.1b-chat-v1.0.Q4_0.gguf
- tinyllama-1.1b-chat-v1.0.Q8_0.gguf

**Phi-3-mini Q4_0**
- phi-3-mini-4k-instruct.Q4_0.gguf

**Mistral Q4_0 and Q8_0**
- mistral-7b-v0.1.Q4_0.gguf
- mistral-7b-v0.1.Q8_0.gguf

**Embedding models**
- mlx-bge-small-en: the MLX-converted version of BAAI/bge-small-en; save it in the mlx-bge-small-en folder.
- bge-small-en: only the model.safetensors file is needed; save it in the bge-small-en folder.

Install requirements
```
python3 -m pip install -r requirements.txt
```

Convert a PDF into an MLX-compatible vector database
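Conceptually, this step extracts the text from the PDF, splits it into chunks, embeds each chunk, and saves the vectors (plus the raw text) into an .npz file. The snippet below is only an illustrative sketch of that idea using my own stand-ins (pypdf, fixed-size character chunks, and a dummy embedder); it is not the repo's actual create_vdb.py:

```python
# Illustrative sketch of a PDF -> .npz vector DB step (stand-in code, not
# the repo's create_vdb.py). Assumes pypdf is installed.
import numpy as np
from pypdf import PdfReader

def chunk_pdf(path: str, chunk_chars: int = 800) -> list[str]:
    """Extract text from every page and split it into fixed-size chunks."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def embed(chunks: list[str]) -> np.ndarray:
    """Dummy embedder; the repo uses BAAI/bge-small-en (384-dim) via MLX."""
    return np.random.default_rng(0).normal(size=(len(chunks), 384)).astype(np.float32)

chunks = chunk_pdf("mlx_docs.pdf")
np.savez("vdb.npz", embeddings=embed(chunks), texts=np.array(chunks))
```

The repo wraps all of this in a single command: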
```
python3 create_vdb.py --pdf mlx_docs.pdf --vdb vdb.npz
```

Query the model
```
python3 rag_vdb.py \
--question "Teach me the basics of mlx" \
--vdb "vdb.npz" \
--gguf "models/phi-3-mini-instruct/phi-3-mini-4k-instruct.Q4_0.gguf"
```

The files in the repo work as follows:
- gguf.py: Has all the stubs for loading and running inference on .gguf models.
- vdb.py: Holds the logic for creating a vector database from a PDF file and saving it in MLX format (.npz).
- create_vdb.py: Inherits from vdb.py and has all the arguments used to create a vector DB from a PDF file in MLX format (.npz).
- rag_vdb.py: Retrieves data from the vector DB, which is then used to query the base model.
- model.py: Houses the logic for the base model (with configs), the embedding model, and the transformer encoder.
- utils.py: Utility functions for accessing GGUF tokens.

Queries use both the .gguf (base model) and the .npz (retrieval model) simultaneously, resulting in much higher inference speeds.
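To make the retrieval half concrete, here is a minimal, self-contained sketch of matching a query against an .npz store with cosine similarity. It reuses the toy key names (`embeddings`, `texts`) from the PDF-conversion sketch above, which may not match the repo's real .npz layout, and it is not the actual rag_vdb.py:

```python
# Hedged sketch of vector-DB retrieval: cosine similarity between the query
# embedding and every stored chunk, then take the top-k chunks as context.
import numpy as np

def top_k_chunks(query_vec: np.ndarray, vdb_path: str = "vdb.npz", k: int = 4) -> list[str]:
    vdb = np.load(vdb_path)
    emb, texts = vdb["embeddings"], vdb["texts"]   # assumed key names
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = emb @ q                               # cosine similarity per chunk
    return [str(texts[i]) for i in np.argsort(scores)[::-1][:k]]

# The real pipeline embeds the question with bge-small-en; a random vector
# stands in here only to exercise the function.
query_vec = np.random.default_rng(1).normal(size=384).astype(np.float32)
context = top_k_chunks(query_vec)
prompt = ("Use the context to answer.\n\n" + "\n\n".join(context)
          + "\n\nQuestion: Teach me the basics of mlx\nAnswer:")
# `prompt` is then passed to the .gguf base model for generation (gguf.py).
```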
Check out other cool MLX projects here: https://github.com/ml-explore/mlx/discussions/654#discussioncomment
## License
MIT