https://github.com/ericlbuehler/edge-u-cation
Edge(u)cation: Cutting-edge multimodal LLMs on the edge with mistral.rs, using F8Q8
https://github.com/ericlbuehler/edge-u-cation
Last synced: 6 months ago
JSON representation
Edge(u)cation: Cutting-edge multimodal LLMs on the edge with mistral.rs, using F8Q8
- Host: GitHub
- URL: https://github.com/ericlbuehler/edge-u-cation
- Owner: EricLBuehler
- License: mit
- Created: 2025-02-08T18:06:11.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-02-08T18:37:40.000Z (over 1 year ago)
- Last Synced: 2025-02-08T19:32:08.502Z (over 1 year ago)
- Language: Java
- Size: 11.1 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# LLMs on Edge: Novel Post-Training Quantization for Education Applications
Edge(u)cation: Cutting-edge multimodal LLMs on the edge with mistral.rs, using F8Q8.
Model used: https://huggingface.co/EricB/Phi-3.5-vision-instruct-UQFF.
**Abstract**:
Could a personalized, portable AI tutor transform education and improve outcomes for students in disadvantaged communities? Advancements in open-source large language models (LLMs), particularly multimodal models that can understand images and text, enable AI-driven personalized learning by giving students rapid and personalized feedback while addressing privacy concerns. However, running these models on consumer devices like a cell phone remains cost-prohibitive. I hypothesize that LLMs can be made more efficient for use on a phone through an improved algorithm that decomposes the parameters of a neural network into two parts, one of which has a certain range that we can exploit to reduce memory footprint. This would allow me to fit a powerful LLM onto a phone while retaining high accuracy. I found that my novel post-training quantization method reduces memory footprint for a cutting-edge 8 billion parameter model from 16 GB RAM to 8.16 GB RAM, a 49\% reduction in model size. Integrated into my custom inference engine written in Rust called mistral.rs, this approach powers Edge(u)cation, an AI tutor app I created for mobile devices. To validate its impact, I then deployed Edge(u)cation in several example settings including math and engineering education experiments through real-time, AI-driven feedback. In conclusion, this work demonstrates a scalable, cost-effective solution for personalized learning, fostering STEM engagement in under-resourced communities. All codes and models are published for anyone to access, use, and build on.

Source: https://commons.wikimedia.org/wiki/File:Classroom_Picture_1.JPG
## F8Q8: 8-bit RTN-based blockwise nested quantization
F8E4M3 diagram:

F8Q8:
- Uses a block size of 32
- Is a form of 8-bit RTN quantization without any zero point/bias
- Takes advantage of the observed range of the RTN scale $d$, to compress it into [F8E4M3](https://github.com/EricLBuehler/float8).
## Examples
Quadratic equation example:

Bridge design analysis example:
