Projects in Awesome Lists tagged with quantization
A curated list of projects in awesome lists tagged with quantization.
https://github.com/hiyouga/llama-factory
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
agent ai chatglm fine-tuning gpt instruction-tuning language-model large-language-models llama llama3 llm lora mistral moe peft qlora quantization qwen rlhf transformers
Last synced: 09 Sep 2025
https://github.com/ymcui/chinese-llama-alpaca
Chinese LLaMA & Alpaca large language models, with local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
alpaca alpaca-2 large-language-models llama llama-2 llm lora nlp plm pre-trained-language-models quantization
Last synced: 13 May 2025
https://github.com/systran/faster-whisper
Faster Whisper transcription with CTranslate2
deep-learning inference openai quantization speech-recognition speech-to-text transformer whisper
Last synced: 09 Sep 2025
https://github.com/ufund-me/qbot
[🔥 updating ...] AI-powered automated quantitative trading bot, fully locally deployed. AI-powered Quantitative Investment Research Platform. Online docs: https://ufund-me.github.io/Qbot; qbot-mini: https://github.com/Charmve/iQuant
bitcoin blockchain deep-learning fintech funds machine-learning pytrade qlib quant-trade quant-trader quantitative-finance quantitative-trading quantization strategies trade-bot trademarks
Last synced: 12 May 2025
https://github.com/bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
llm machine-learning pytorch qlora quantization
Last synced: 09 Sep 2025
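bitsandbytes stores weights in k-bit integer formats with per-block scaling. As a rough, library-independent sketch (illustrative only, not the bitsandbytes API), symmetric "absmax" quantization to INT8 maps the largest magnitude to 127 and rounds everything else to the nearest integer:

```python
def quantize_absmax_int8(values):
    """Symmetric (absmax) INT8 quantization: scale so the largest
    magnitude maps to 127, then round to integers."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [x * scale for x in q]

weights = [0.1, -0.5, 0.25, 1.27]
q, s = quantize_absmax_int8(weights)   # q = [10, -50, 25, 127]
approx = dequantize(q, s)              # close to the original weights
```

The real library adds per-block scales, outlier handling (LLM.int8()), and 4-bit NF4 codes; this shows only the basic round-trip.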
https://github.com/kornelski/pngquant
Lossy PNG compressor — pngquant command based on libimagequant library
c conversion image-optimization palette png png-compression pngquant quality quantization smaller stdin
Last synced: 18 Dec 2025
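pngquant's core idea is palette (indexed-color) quantization: each pixel is replaced by the nearest entry of a small palette. The real tool builds the palette with median-cut in libimagequant and dithers the result; the toy sketch below (hypothetical helper names, not libimagequant's API) shows only the nearest-color mapping step:

```python
def nearest_palette_color(pixel, palette):
    """Map an RGB pixel to the closest palette entry by squared distance."""
    return min(palette, key=lambda c: sum((a - b) ** 2 for a, b in zip(pixel, c)))

# A tiny 5-color palette and a 3-pixel "image".
palette = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
image = [(250, 10, 5), (12, 240, 8), (200, 200, 210)]
quantized = [nearest_palette_color(p, palette) for p in image]
# quantized = [(255, 0, 0), (0, 255, 0), (255, 255, 255)]
```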
https://github.com/autogptq/autogptq
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
deep-learning inference large-language-models llms nlp pytorch quantization transformer transformers
Last synced: 08 Apr 2025
https://nervanasystems.github.io/distiller/
Neural Network Distiller by Intel AI Lab: a Python package for neural network compression research. https://intellabs.github.io/distiller
automl-for-compression deep-neural-networks distillation early-exit group-lasso jupyter-notebook network-compression onnx pruning pruning-structures pytorch quantization regularization truncated-svd
Last synced: 09 Jul 2025
https://github.com/opennmt/ctranslate2
Fast inference engine for Transformer models
avx avx2 cpp cuda deep-learning deep-neural-networks gemm inference intrinsics machine-translation mkl neon neural-machine-translation onednn openmp opennmt parallel-computing quantization thrust transformer-models
Last synced: 08 Oct 2025
https://github.com/neuralmagic/deepsparse
Sparsity-aware deep learning inference runtime for CPUs
computer-vision cpus deepsparse inference llm-inference machinelearning nlp object-detection onnx performance pretrained-models pruning quantization sparsification
Last synced: 14 May 2025
https://github.com/huawei-noah/pretrained-language-model
Pretrained language models and related optimization techniques developed by Huawei Noah's Ark Lab.
knowledge-distillation large-scale-distributed model-compression pretrained-models quantization
Last synced: 14 May 2025
https://github.com/intellabs/nlp-architect
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks
bert deep-learning deeplearning dynet nlp nlu pytorch quantization tensorflow transformers
Last synced: 28 Sep 2025
https://github.com/huggingface/optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy-to-use hardware optimization tools
graphcore habana inference intel onnx onnxruntime optimization pytorch quantization tflite training transformers
Last synced: 14 May 2025
https://github.com/aaron-xichen/pytorch-playground
Base pretrained models and datasets in pytorch (MNIST, SVHN, CIFAR10, CIFAR100, STL10, AlexNet, VGG16, VGG19, ResNet, Inception, SqueezeNet)
pytorch pytorch-tutorial pytorch-tutorials quantization
Last synced: 15 May 2025
https://github.com/stochasticai/xturing
Build, customize and control your own LLMs. From data pre-processing to fine-tuning, xTuring provides an easy way to personalize open-source LLMs. Join our Discord community: https://discord.gg/TgHXuSJEk6
adapter alpaca deep-learning fine-tuning finetuning gen-ai generative-ai gpt-2 gpt-j language-model llama llm lora mistral mixed-precision peft quantization
Last synced: 15 May 2025
https://intel.github.io/neural-compressor/
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime
auto-tuning awq fp4 gptq int4 int8 knowledge-distillation large-language-models low-precision mxformat post-training-quantization pruning quantization quantization-aware-training smoothquant sparsegpt sparsity
Last synced: 09 Dec 2025
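Post-training quantization tools like this one typically start from an affine (asymmetric) mapping of a float range onto an unsigned integer grid, using a scale and a zero point. A minimal pure-Python sketch of that mapping (illustrative only, not Neural Compressor's API, which also provides SmoothQuant, GPTQ, AWQ, and auto-tuning on top):

```python
def quantize_asymmetric(values, n_bits=8):
    """Asymmetric uniform quantization: affine map [min, max] -> [0, 2^n - 1]."""
    qmax = 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0          # guard against a constant input
    zero_point = round(-lo / scale)          # integer that represents 0.0
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(x - zero_point) * scale for x in q]

acts = [-1.0, 0.0, 2.0, 4.1]
q, scale, zp = quantize_asymmetric(acts)   # q = [0, 50, 150, 255], zp = 50
recon = dequantize(q, scale, zp)           # within scale/2 of the originals
```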
https://github.com/dvmazur/mixtral-offloading
Run Mixtral-8x7B models in Colab or consumer desktops
colab-notebook deep-learning google-colab language-model llm mixture-of-experts offloading pytorch quantization
Last synced: 15 May 2025
https://github.com/quic/aimet
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
auto-ml compression deep-learning deep-neural-networks machine-learning network-compression network-quantization open-source opensource pruning quantization
Last synced: 13 May 2025
https://github.com/666DZY666/micronet
micronet, a model compression and deployment library. Compression: (1) quantization: quantization-aware training (QAT), high-bit (>2b) (DoReFa; Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference) and low-bit (≤2b)/ternary and binary (TWN/BNN/XNOR-Net); post-training quantization (PTQ), 8-bit (TensorRT); (2) pruning: normal, regular, and group-convolution channel pruning; (3) group-convolution structure; (4) batch-normalization fusion for quantization. Deployment: TensorRT, FP32/FP16/INT8 (PTQ calibration), op adaptation (upsample), dynamic shape.
batch-normalization-fuse bnn convolutional-networks dorefa group-convolution integer-arithmetic-only model-compression network-in-network network-slimming neuromorphic-computing onnx post-training-quantization pruning pytorch quantization quantization-aware-training tensorrt tensorrt-int8-python twn xnor-net
Last synced: 20 Mar 2025
https://github.com/pytorch/ao
PyTorch native quantization and sparsity for training and inference
brrr cuda dtypes float8 inference llama mx offloading optimizer pytorch quantization sparsity training transformer
Last synced: 12 May 2025
https://github.com/nunchaku-tech/ComfyUI-nunchaku
ComfyUI Plugin of Nunchaku
comfyui diffusion flux genai mlsys quantization
Last synced: 02 Sep 2025
https://github.com/intel/intel-extension-for-pytorch
A Python package that extends official PyTorch for improved performance on Intel platforms
deep-learning intel machine-learning neural-network pytorch quantization
Last synced: 12 May 2025
https://github.com/openppl/ppq
PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool.
caffe cuda deep-learning neural-network onnx open-source pytorch quantization
Last synced: 15 May 2025
https://github.com/mit-han-lab/nunchaku
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
comfyui diffusion-models flux genai iclr iclr2025 lora mlsys quantization
Last synced: 13 May 2025
https://github.com/paddlepaddle/paddleslim
PaddleSlim is an open-source library for deep model compression and architecture search.
bert compression detection distillation ernie nas pruning quantization segmentation sparsity tensorrt transformer yolov5 yolov6 yolov7
Last synced: 14 May 2025
https://github.com/open-mmlab/mmrazor
OpenMMLab Model Compression Toolbox and Benchmark.
autoslim classification darts detection knowledge-distillation nas pruning pytorch quantization segmentation spos
Last synced: 14 May 2025
https://github.com/tensorflow/model-optimization
A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
compression deep-learning keras machine-learning ml model-compression optimization pruning quantization quantized-networks quantized-neural-networks quantized-training sparsity tensorflow
Last synced: 12 May 2025
https://github.com/rwkv/rwkv.cpp
INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
deep-learning ggml language-model llm machine-learning quantization rwkv
Last synced: 14 May 2025
https://github.com/thu-ml/sageattention
Quantized attention achieving 2-3x and 3-5x speedups over FlashAttention and xformers, respectively, without losing end-to-end accuracy across language, image, and video models.
attention cuda efficient-attention inference-acceleration llm llm-infra mlsys quantization triton video-generate video-generation vit
Last synced: 14 May 2025
https://github.com/xilinx/brevitas
Brevitas: neural network quantization in PyTorch
brevitas deep-learning fpga hardware-acceleration neural-networks ptq pytorch qat quantization xilinx
Last synced: 11 Oct 2025
https://github.com/vllm-project/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
compression quantization sparsity
Last synced: 14 May 2025
https://github.com/rahulschand/gpu_poor
Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization
ggml gpu huggingface language-model llama llama2 llamacpp llm pytorch quantization
Last synced: 14 May 2025
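The back-of-the-envelope arithmetic behind such calculators: weight memory is roughly parameter count times bits per parameter divided by 8. A minimal sketch of that weights-only term (hypothetical helper, not gpu_poor's code; real calculators also add KV cache and activation overhead):

```python
def estimate_weight_memory_gb(n_params_billion, bits_per_param):
    """Rough GPU memory needed just for model weights:
    params x bits / 8 bytes, reported in GiB. Excludes KV cache,
    activations, and framework overhead."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1024 ** 3

# A 7B model: ~13 GiB at FP16, ~3.3 GiB at 4-bit.
fp16 = estimate_weight_memory_gb(7, 16)
int4 = estimate_weight_memory_gb(7, 4)
```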
https://github.com/huawei-noah/efficient-computing
Efficient computing methods developed by Huawei Noah's Ark Lab
binary-neural-networks knowledge-distillation model-compression pruning quantization self-supervised
Last synced: 14 May 2025
https://github.com/open-edge-platform/training_extensions
Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
action-recognition anomaly-detection automl computer-vision datumaro deep-learning hyper-parameter-optimization image-classification image-segmentation incremental-learning machine-learning neural-networks-compression object-detection openvino pytorch quantization self-supervised-learning semi-supervised-learning transfer-learning
Last synced: 14 May 2025
https://github.com/openvinotoolkit/nncf
Neural Network Compression Framework for enhanced OpenVINO™ inference
bert classification compression deep-learning genai llm mixed-precision-training nlp object-detection onnx openvino pruning pytorch quantization quantization-aware-training semantic-segmentation sparsity tensorflow transformers
Last synced: 13 May 2025
https://github.com/huggingface/optimum-quanto
A pytorch quantization backend for optimum
Last synced: 14 May 2025
https://github.com/xilinx/finn
Dataflow compiler for QNN inference on FPGAs
compiler dataflow fpga neural-network quantization
Last synced: 11 Oct 2025
https://github.com/mit-han-lab/tinychatengine
TinyChatEngine: On-Device LLM Inference Library
arm c cpp cuda-programming deep-learning edge-computing large-language-models on-device-ai quantization x86-64
Last synced: 13 May 2025
https://github.com/mit-han-lab/tinyengine
[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning; [NeurIPS 2022] MCUNetV3: On-Device Training Under 256KB Memory
c codegenerator cpp deep-learning edge-computing microcontroller neural-architecture-search pytorch quantization tinyml
Last synced: 13 May 2025
https://github.com/imageoptim/libimagequant
Palette quantization library that powers pngquant and other PNG optimizers
callback conversion image-optimization image-pixels minification palette palette-generation pixel-array pngquant quality quantization rgba-pixels visual-studio
Last synced: 13 May 2025
https://github.com/pinto0309/onnx2tf
Self-Created Tools to convert ONNX files (NCHW) to TensorFlow/TFLite/Keras format (NHWC). The purpose of this tool is to solve the massive Transpose extrapolation problem in onnx-tensorflow (onnx-tf). I don't need a Star, but give me a pull request.
android coreml deep-learning docker keras lstm machine-learning model-converter models onnx onnx-tensorflow quantization tensorflow tensorflow-lite tfjs tflite transformer yolov9
Last synced: 14 May 2025
https://github.com/squeezeailab/squeezellm
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
efficient-inference large-language-models llama llm localllm model-compression natural-language-processing post-training-quantization quantization small-models text-generation transformer
Last synced: 13 Apr 2025
https://github.com/deepvac/deepvac
PyTorch Project Specification.
amp coreml ddp deepvac ncnn onnx python pytorch quantization tensorboard tensorrt torchscript
Last synced: 16 May 2025
https://github.com/IST-DASLab/marlin
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Last synced: 30 Aug 2025
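The storage trick behind FP16xINT4 kernels is packing two 4-bit weights into each byte, so weights occupy a quarter of their FP16 footprint. A pure-Python sketch of the packing (Marlin's actual CUDA layout is interleaved and far more elaborate; this shows only the idea):

```python
def pack_int4(values):
    """Pack pairs of unsigned 4-bit values (0..15) into single bytes."""
    assert all(0 <= v < 16 for v in values) and len(values) % 2 == 0
    return bytes((hi << 4) | lo for lo, hi in zip(values[::2], values[1::2]))

def unpack_int4(packed):
    """Recover the 4-bit values from the packed bytes."""
    out = []
    for b in packed:
        out.append(b & 0x0F)   # low nibble first
        out.append(b >> 4)     # then high nibble
    return out

codes = [1, 15, 0, 7, 9, 3]
packed = pack_int4(codes)      # 6 values -> 3 bytes
```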
https://github.com/OpenGVLab/OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
large-language-models llm quantization
Last synced: 07 May 2025
https://github.com/sforaidl/kd_lib
A Pytorch Knowledge Distillation library for benchmarking and extending works in the domains of Knowledge Distillation, Pruning, and Quantization.
algorithm-implementations benchmarking data-science deep-learning-library knowledge-distillation machine-learning model-compression pruning pytorch quantization
Last synced: 16 May 2025
https://github.com/google/qkeras
QKeras: a quantization deep learning library for Tensorflow Keras
accelerator asic-design deep-learning fpga fpga-accelerator hardware-acceleration keras machine-learning quantization quantized-networks quantized-neural-networks tensorflow
Last synced: 20 Mar 2025
https://github.com/Maknee/minigpt4.cpp
Port of MiniGPT4 in C++ (4bit, 5bit, 6bit, 8bit, 16bit CPU inference with GGML)
c cpp deep-learning ggml machine-learning minigpt4 multimodal quantization
Last synced: 15 Apr 2025
https://github.com/huggingface/optimum-intel
🤗 Optimum Intel: Accelerate inference with Intel optimization tools
diffusers distillation inference intel onnx openvino optimization pruning quantization transformers
Last synced: 14 Oct 2025
https://github.com/ModelTC/llmc
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
awq benchmark deployment evaluation internlm2 large-language-models lightllm llama3 llm lvlm mixtral omniquant post-training-quantization pruning quantization quarot smoothquant spinquant tool vllm
Last synced: 23 Apr 2025
https://github.com/intel/auto-round
Advanced Quantization Algorithm for LLMs/VLMs.
awq gptq int4 neural-compressor quantization rounding
Last synced: 25 Dec 2025
https://github.com/DerryHub/BEVFormer_tensorrt
BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins (float/half/half2/int8).
bevformer cuda int8-inference pytorch quantization tensorrt-plugins
Last synced: 20 Mar 2025
https://github.com/sony/model_optimization
Model Compression Toolkit (MCT) is an open-source project for neural network model optimization under efficient, constrained hardware. It provides researchers, developers, and engineers with advanced quantization and compression tools for deploying state-of-the-art neural networks.
deep-learning deep-neural-networks edge-ai machine-learning network-compression network-quantization neural-network optimizer ptq pytorch qat quantization tensorflow
Last synced: 14 May 2025
https://github.com/mit-han-lab/haq
[CVPR 2019, Oral] HAQ: Hardware-Aware Automated Quantization with Mixed Precision
automl efficient-model mixed-precision quantization
Last synced: 13 May 2025
https://github.com/neuralmagic/sparsezoo
Neural network model repository for highly sparse and sparse-quantized models with matching sparsification recipes
computer-vision deep-learning-algorithms deep-learning-models mobilenet models-optimized nlp object-detection-model pretrained-models pruning quantization resnet smaller-models sparse-quantized-models sparsification-recipe transfer-learning yolo
Last synced: 16 May 2025
https://github.com/tpoisonooo/llama.onnx
LLaMA/RWKV ONNX models, quantization, and test cases
alpaca llama llm onnx onnxruntime quantization rwkv transformer
Last synced: 07 Apr 2025
https://github.com/xiuyu-li/q-diffusion
[ICCV 2023] Q-Diffusion: Quantizing Diffusion Models.
ddim diffusion-models model-compression post-training-quantization pytorch quantization stable-diffusion
Last synced: 06 Apr 2025
https://github.com/SqueezeAILab/KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
compression efficient-inference efficient-model large-language-models llama llm localllama localllm mistral model-compression natural-language-processing quantization small-models text-generation transformer
Last synced: 08 May 2025
https://github.com/inisis/brocolli
Everything in Torch Fx
caffe onnx pytorch quantization
Last synced: 13 Apr 2025
https://github.com/megvii-research/Sparsebit
A model compression and acceleration toolbox based on pytorch.
deep-learning post-training-quantization pruning quantization quantization-aware-training sparse tensorrt
Last synced: 12 May 2025
https://github.com/neuralmagic/sparsify
ML model optimization product to accelerate inference.
automl computer-vision deep-learning-accelerator image-classification inference-performance keras object-detection onnx pruning pytorch quantization smaller-models sparsification-recipe sparsify tensorflow
Last synced: 12 Apr 2025
https://github.com/beomi/bitnet-transformers
0️⃣1️⃣🤗 BitNet-Transformers: a Hugging Face Transformers implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch with the LLaMA(2) architecture
llm quantization quantization-aware-training transformers
Last synced: 07 May 2025
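BitNet-style 1-bit quantization keeps only the sign of each weight plus a single real-valued scale, commonly the mean absolute value of the tensor. A sketch of the forward binarization step (illustrative only; the paper trains the network with quantization-aware training rather than binarizing after the fact):

```python
def binarize(weights):
    """1-bit (sign) quantization with a per-tensor scale equal to the
    mean absolute value, as in BNN/BitNet-style schemes."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale

signs, scale = binarize([0.4, -0.2, 0.9, -0.5])
# Each weight is then approximated as sign * scale.
approx = [s * scale for s in signs]
```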
https://github.com/megvii-research/FQ-ViT
[IJCAI 2022] FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer
imagenet post-training-quantization pytorch quantization vision-transformer
Last synced: 20 Mar 2025
https://github.com/jy-yuan/KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
inference large-language-models llama llm natural-language-processing quantization transformer
Last synced: 08 May 2025
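KV-cache quantizers like KIVI keep accuracy at 2 bits by quantizing in small groups, each with its own scale and offset (KIVI additionally quantizes keys per-channel and values per-token). A group-wise asymmetric sketch in pure Python (illustrative, not the paper's implementation):

```python
def quantize_groups(values, group_size=4, n_bits=2):
    """Group-wise asymmetric quantization: each group gets its own
    scale and minimum, which preserves accuracy at very low bit-widths."""
    qmax = 2 ** n_bits - 1
    out = []
    for i in range(0, len(values), group_size):
        g = values[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / qmax or 1.0   # guard against a constant group
        codes = [round((v - lo) / scale) for v in g]
        out.append((codes, scale, lo))    # codes plus per-group metadata
    return out

# Two groups with very different ranges each use the full 2-bit grid.
groups = quantize_groups([0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0])
```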
https://github.com/sinanuozdemir/quick-start-guide-to-llms
The Official Repo for "Quick Start Guide to Large Language Models"
ai bert deepseek distillation generative-ai gpt llama-4 llm machine-learning multimodal nlp quantization rag
Last synced: 16 May 2025
https://github.com/datawhalechina/llm-deploy
Theory and practice of LLM inference and deployment (in Chinese)
knowledge-distillation llm llm-deploy lora pruning quantization
Last synced: 13 Jun 2025
https://github.com/microsoft/LQ-Nets
LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks
cnn compression dnn quantization
Last synced: 20 Mar 2025
https://github.com/kssteven418/i-bert
[ICML'21 Oral] I-BERT: Integer-only BERT Quantization
bert efficient-model efficient-neural-networks model-compression natural-language-processing quantization transformer
Last synced: 06 Apr 2025
https://github.com/picovoice/picollm
On-device LLM Inference Powered by X-Bit Quantization
compression efficient-inference gemma generative-ai language-model language-models large-language-model llama llama2 llama3 llm llm-inference llms mistral mixtral model-compression natural-language-processing quantization self-hosted
Last synced: 23 Oct 2025
https://github.com/j-marple-dev/model_compression
PyTorch Model Compression
lottey-ticket-hypothesis pruning pytorch quantization
Last synced: 03 May 2025
https://github.com/zcemycl/tf2deepfloorplan
TF2 Deep FloorPlan Recognition using a Multi-task Network with Room-boundary-Guided Attention. Enable tensorboard, quantization, flask, tflite, docker, github actions and google colab.
attention-network curl deep-learning deep-neural-networks docker flask github-actions github-release google-colab image-processing image-recognition jupyter-notebook keras-tensorflow pygame pypi-package python3 quantization tensorboard tensorflow2 tflite
Last synced: 09 Apr 2025
https://github.com/ikergarcia1996/easy-translate
Easy-Translate is a script for translating large text files with a SINGLE COMMAND. Easy-Translate is designed to be as easy as possible for beginners and as seamless and customizable as possible for advanced users.
4-bit 8-bit begginers cpu easy easy-to-use gpu hugginface hugginface-hub huggingface-transformers llm m2m100 machine-translation nllb200 prompt pytorch quantization transformers translation
Last synced: 15 May 2025
https://github.com/Aaronhuang-778/BiLLM
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
Last synced: 09 May 2025
https://github.com/dbohdan/hicolor
🎨 Convert images to 15/16-bit RGB color with dithering
color-quantization color-reduction dithering high-color image-conversion image-format image-library image-processing quantization retro-graphics
Last synced: 29 Aug 2025
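Reducing 24-bit RGB to 16-bit "high color" keeps 5/6/5 bits per channel. A sketch of the RGB565 pack/unpack round trip (hicolor additionally dithers to hide the resulting banding; helper names here are hypothetical):

```python
def to_rgb565(r, g, b):
    """Reduce 8-bit RGB channels to 5/6/5 bits and pack into 16 bits."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def from_rgb565(c):
    """Expand back to approximate 8-bit channels via bit replication."""
    r = (c >> 11) & 0x1F
    g = (c >> 5) & 0x3F
    b = c & 0x1F
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))

white = to_rgb565(255, 255, 255)   # 0xFFFF: all channel bits set
red = from_rgb565(to_rgb565(255, 0, 0))   # pure red survives exactly
```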