Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Jaykef/mlx-rag-gguf
Minimal, clean code implementation of RAG with mlx using gguf model weights
- Host: GitHub
- URL: https://github.com/Jaykef/mlx-rag-gguf
- Owner: Jaykef
- Created: 2024-03-26T03:31:54.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-04-27T04:18:20.000Z (10 months ago)
- Last Synced: 2024-10-29T18:22:56.764Z (4 months ago)
- Language: Python
- Size: 6.31 MB
- Stars: 43
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# MLX RAG With GGUF Model Weights
Minimal, clean code implementation of RAG with MLX using gguf model weights. The code here builds on https://github.com/vegaluisjose/mlx-rag and has been optimized to support RAG-based inference with .gguf models. I use BAAI/bge-small-en as the embedding model, TinyLlama-1.1B-Chat-v1.0-GGUF (you can choose from the supported models below) as the base model, and a custom vector database script for indexing the text in a PDF file. Inference speeds can reach ~413 tokens/sec for prompt processing and ~36 tokens/sec for generation on my 8GB M2 Air.
## Update
- Added support for [phi-3-mini-4k-instruct.gguf](https://huggingface.co/Jaward/phi-3-mini-4k-instruct.Q4_0.gguf) and other `Q4_0`, `Q4_1` & `Q8_0` quantized models. Download the model and save it in the models/phi-3-mini-instruct folder (see the sketch below).
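As a hedged sketch, the model can be fetched with `huggingface_hub`; the repo ID comes from the link above, while the filename inside that repo is my assumption (it matches the path used in the query example further down), so verify it on the model page:

```python
# Hedged sketch: download the phi-3 gguf and place it where the query example
# below expects it. The filename inside the repo is assumed; verify it first.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="Jaward/phi-3-mini-4k-instruct.Q4_0.gguf",
    filename="phi-3-mini-4k-instruct.Q4_0.gguf",
    local_dir="models/phi-3-mini-instruct",
)
```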
## Demo

https://github.com/Jaykef/mlx-rag-gguf/assets/11355002/e97907ed-1142-4f3e-b2fd-95690c4b50f3
## Usage
Download the models (you can use Hugging Face's `snapshot_download`, but I recommend downloading them individually to save time) and save them in the models folder.
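As one hedged sketch of the download step with `huggingface_hub`: the TinyLlama repo ID below is my assumption about where those weights are hosted (not something this README pins down), so point it at whichever source you actually use.

```python
# Hedged sketch: one way to fetch the gguf base model and the embedding model.
# The TinyLlama repo ID is an assumed source, not pinned by this README.
from huggingface_hub import hf_hub_download, snapshot_download

# Base model: a single .gguf file, saved under models/
hf_hub_download(
    repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    filename="tinyllama-1.1b-chat-v1.0.Q4_0.gguf",
    local_dir="models",
)

# Embedding model: per the list below, only model.safetensors is needed,
# saved in the bge-small-en folder.
snapshot_download(
    repo_id="BAAI/bge-small-en",
    local_dir="bge-small-en",
    allow_patterns=["model.safetensors"],
)
```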
> [!NOTE]
> MLX currently only supports a few quantizations: `Q4_0`, `Q4_1`, and `Q8_0`.
> Unsupported quantizations will be cast to `float16`.

### Tested/Supported models
**Tinyllama Q4_0 and Q8_0**
- tinyllama-1.1b-chat-v1.0.Q4_0.gguf
- tinyllama-1.1b-chat-v1.0.Q8_0.gguf

**Phi-3-mini Q4_0**
- phi-3-mini-4k-instruct.Q4_0.gguf

**Mistral Q4_0 and Q8_0**
- mistral-7b-v0.1.Q4_0.gguf
- mistral-7b-v0.1.Q8_0.gguf

**Embedding models**
- mlx-bge-small-en: the MLX-converted version of BAAI/bge-small-en; save it in the mlx-bge-small-en folder.
- bge-small-en: only the model.safetensors file is needed; save it in the bge-small-en folder.

Install requirements
```
python3 -m pip install -r requirements.txt
```

Convert a PDF into an MLX-compatible vector database
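Conceptually, this step extracts the text from the PDF, splits it into chunks, embeds each chunk, and saves the vectors (plus the raw text) into an .npz file. The snippet below is only an illustrative sketch of that idea using my own stand-ins (pypdf, fixed-size character chunks, and a dummy embedder); it is not the repo's actual create_vdb.py:

```python
# Illustrative sketch of a PDF -> .npz vector DB step (stand-in code, not
# the repo's create_vdb.py). Assumes pypdf is installed.
import numpy as np
from pypdf import PdfReader

def chunk_pdf(path: str, chunk_chars: int = 800) -> list[str]:
    """Extract text from every page and split it into fixed-size chunks."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def embed(chunks: list[str]) -> np.ndarray:
    """Dummy embedder; the repo uses BAAI/bge-small-en (384-dim) via MLX."""
    return np.random.default_rng(0).normal(size=(len(chunks), 384)).astype(np.float32)

chunks = chunk_pdf("mlx_docs.pdf")
np.savez("vdb.npz", embeddings=embed(chunks), texts=np.array(chunks))
```

The repo wraps all of this in a single command: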
```
python3 create_vdb.py --pdf mlx_docs.pdf --vdb vdb.npz
```

Query the model
```
python3 rag_vdb.py \
--question "Teach me the basics of mlx" \
--vdb "vdb.npz" \
--gguf "models/phi-3-mini-instruct/phi-3-mini-4k-instruct.Q4_0.gguf"
```

The files in the repo work as follows:
- gguf.py: Has all the stubs for loading and running inference on .gguf models.
- vdb.py: Holds the logic for creating a vector database from a PDF file and saving it in MLX format (.npz).
- create_vdb.py: Inherits from vdb.py and has all the arguments used to create a vector DB from a PDF file in MLX format (.npz).
- rag_vdb.py: Retrieves data from the vector DB, which is then used to query the base model.
- model.py: Houses the logic for the base model (with configs), the embedding model, and the transformer encoder.
- utils.py: Utility functions for accessing GGUF tokens.

Queries use both the .gguf (base model) and the .npz (retrieval model) simultaneously, resulting in much higher inference speeds.
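To make the retrieval half concrete, here is a minimal, self-contained sketch of matching a query against an .npz store with cosine similarity. It reuses the toy key names (`embeddings`, `texts`) from the PDF-conversion sketch above, which may not match the repo's real .npz layout, and it is not the actual rag_vdb.py:

```python
# Hedged sketch of vector-DB retrieval: cosine similarity between the query
# embedding and every stored chunk, then take the top-k chunks as context.
import numpy as np

def top_k_chunks(query_vec: np.ndarray, vdb_path: str = "vdb.npz", k: int = 4) -> list[str]:
    vdb = np.load(vdb_path)
    emb, texts = vdb["embeddings"], vdb["texts"]   # assumed key names
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = emb @ q                               # cosine similarity per chunk
    return [str(texts[i]) for i in np.argsort(scores)[::-1][:k]]

# The real pipeline embeds the question with bge-small-en; a random vector
# stands in here only to exercise the function.
query_vec = np.random.default_rng(1).normal(size=384).astype(np.float32)
context = top_k_chunks(query_vec)
prompt = ("Use the context to answer.\n\n" + "\n\n".join(context)
          + "\n\nQuestion: Teach me the basics of mlx\nAnswer:")
# `prompt` is then passed to the .gguf base model for generation (gguf.py).
```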
Check out other cool MLX projects here: https://github.com/ml-explore/mlx/discussions/654#discussioncomment
## License
MIT