Projects in Awesome Lists tagged with quantization
A curated list of projects in awesome lists tagged with quantization.
https://github.com/hiyouga/llama-factory
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
agent ai chatglm fine-tuning gpt instruction-tuning language-model large-language-models llama llama3 llm lora mistral moe peft qlora quantization qwen rlhf transformers
Last synced: 09 Sep 2025
https://github.com/ymcui/chinese-llama-alpaca
Chinese LLaMA & Alpaca large language models, with local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
alpaca alpaca-2 large-language-models llama llama-2 llm lora nlp plm pre-trained-language-models quantization
Last synced: 13 May 2025
https://github.com/systran/faster-whisper
Faster Whisper transcription with CTranslate2
deep-learning inference openai quantization speech-recognition speech-to-text transformer whisper
Last synced: 09 Sep 2025
https://github.com/ufund-me/qbot
[🔥 updating ...] AI-powered automated quantitative trading bot, fully locally deployed. AI-powered Quantitative Investment Research Platform. Online docs: https://ufund-me.github.io/Qbot; qbot-mini: https://github.com/Charmve/iQuant
bitcoin blockchain deep-learning fintech funds machine-learning pytrade qlib quant-trade quant-trader quantitative-finance quantitative-trading quantization strategies trade-bot trademarks
Last synced: 12 May 2025
https://github.com/bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
llm machine-learning pytorch qlora quantization
Last synced: 09 Sep 2025
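bitsandbytes stores weights in k-bit integer formats with per-block scaling. As a rough, library-independent sketch (illustrative only, not the bitsandbytes API), symmetric "absmax" quantization to INT8 maps the largest magnitude to 127 and rounds everything else to the nearest integer:

```python
def quantize_absmax_int8(values):
    """Symmetric (absmax) INT8 quantization: scale so the largest
    magnitude maps to 127, then round to integers."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [x * scale for x in q]

weights = [0.1, -0.5, 0.25, 1.27]
q, s = quantize_absmax_int8(weights)   # q = [10, -50, 25, 127]
approx = dequantize(q, s)              # close to the original weights
```

The real library adds per-block scales, outlier handling (LLM.int8()), and 4-bit NF4 codes; this shows only the basic round-trip.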
https://github.com/kornelski/pngquant
Lossy PNG compressor — pngquant command based on libimagequant library
c conversion image-optimization palette png png-compression pngquant quality quantization smaller stdin
Last synced: 18 Dec 2025
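pngquant's core idea is palette (indexed-color) quantization: each pixel is replaced by the nearest entry of a small palette. The real tool builds the palette with median-cut in libimagequant and dithers the result; the toy sketch below (hypothetical helper names, not libimagequant's API) shows only the nearest-color mapping step:

```python
def nearest_palette_color(pixel, palette):
    """Map an RGB pixel to the closest palette entry by squared distance."""
    return min(palette, key=lambda c: sum((a - b) ** 2 for a, b in zip(pixel, c)))

# A tiny 5-color palette and a 3-pixel "image".
palette = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
image = [(250, 10, 5), (12, 240, 8), (200, 200, 210)]
quantized = [nearest_palette_color(p, palette) for p in image]
# quantized = [(255, 0, 0), (0, 255, 0), (255, 255, 255)]
```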
https://github.com/autogptq/autogptq
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
deep-learning inference large-language-models llms nlp pytorch quantization transformer transformers
Last synced: 08 Apr 2025
https://nervanasystems.github.io/distiller/
Neural Network Distiller by Intel AI Lab: a Python package for neural network compression research. https://intellabs.github.io/distiller
automl-for-compression deep-neural-networks distillation early-exit group-lasso jupyter-notebook network-compression onnx pruning pruning-structures pytorch quantization regularization truncated-svd
Last synced: 09 Jul 2025
https://github.com/opennmt/ctranslate2
Fast inference engine for Transformer models
avx avx2 cpp cuda deep-learning deep-neural-networks gemm inference intrinsics machine-translation mkl neon neural-machine-translation onednn openmp opennmt parallel-computing quantization thrust transformer-models
Last synced: 08 Oct 2025
https://github.com/neuralmagic/deepsparse
Sparsity-aware deep learning inference runtime for CPUs
computer-vision cpus deepsparse inference llm-inference machinelearning nlp object-detection onnx performance pretrained-models pruning quantization sparsification
Last synced: 14 May 2025
https://github.com/huawei-noah/pretrained-language-model
Pretrained language models and related optimization techniques developed by Huawei Noah's Ark Lab.
knowledge-distillation large-scale-distributed model-compression pretrained-models quantization
Last synced: 14 May 2025
https://github.com/intellabs/nlp-architect
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks
bert deep-learning deeplearning dynet nlp nlu pytorch quantization tensorflow transformers
Last synced: 28 Sep 2025
https://github.com/huggingface/optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy-to-use hardware optimization tools
graphcore habana inference intel onnx onnxruntime optimization pytorch quantization tflite training transformers
Last synced: 14 May 2025
https://github.com/aaron-xichen/pytorch-playground
Base pretrained models and datasets in pytorch (MNIST, SVHN, CIFAR10, CIFAR100, STL10, AlexNet, VGG16, VGG19, ResNet, Inception, SqueezeNet)
pytorch pytorch-tutorial pytorch-tutorials quantization
Last synced: 15 May 2025
https://github.com/stochasticai/xturing
Build, customize and control your own LLMs. From data pre-processing to fine-tuning, xTuring provides an easy way to personalize open-source LLMs. Join our Discord community: https://discord.gg/TgHXuSJEk6
adapter alpaca deep-learning fine-tuning finetuning gen-ai generative-ai gpt-2 gpt-j language-model llama llm lora mistral mixed-precision peft quantization
Last synced: 15 May 2025
https://intel.github.io/neural-compressor/
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime
auto-tuning awq fp4 gptq int4 int8 knowledge-distillation large-language-models low-precision mxformat post-training-quantization pruning quantization quantization-aware-training smoothquant sparsegpt sparsity
Last synced: 09 Dec 2025
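Post-training quantization tools like this one typically start from an affine (asymmetric) mapping of a float range onto an unsigned integer grid, using a scale and a zero point. A minimal pure-Python sketch of that mapping (illustrative only, not Neural Compressor's API, which also provides SmoothQuant, GPTQ, AWQ, and auto-tuning on top):

```python
def quantize_asymmetric(values, n_bits=8):
    """Asymmetric uniform quantization: affine map [min, max] -> [0, 2^n - 1]."""
    qmax = 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0          # guard against a constant input
    zero_point = round(-lo / scale)          # integer that represents 0.0
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(x - zero_point) * scale for x in q]

acts = [-1.0, 0.0, 2.0, 4.1]
q, scale, zp = quantize_asymmetric(acts)   # q = [0, 50, 150, 255], zp = 50
recon = dequantize(q, scale, zp)           # within scale/2 of the originals
```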
https://github.com/dvmazur/mixtral-offloading
Run Mixtral-8x7B models in Colab or consumer desktops
colab-notebook deep-learning google-colab language-model llm mixture-of-experts offloading pytorch quantization
Last synced: 15 May 2025
https://github.com/quic/aimet
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
auto-ml compression deep-learning deep-neural-networks machine-learning network-compression network-quantization open-source opensource pruning quantization
Last synced: 13 May 2025
https://github.com/666DZY666/micronet
micronet, a model compression and deployment library. Compression: (1) quantization: quantization-aware training (QAT), high-bit (>2b) (DoReFa; Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference) and low-bit (≤2b)/ternary and binary (TWN/BNN/XNOR-Net); post-training quantization (PTQ), 8-bit (TensorRT); (2) pruning: normal, regular, and group-convolution channel pruning; (3) group-convolution structure; (4) batch-normalization fusion for quantization. Deployment: TensorRT, FP32/FP16/INT8 (PTQ calibration), op adaptation (upsample), dynamic shape.
batch-normalization-fuse bnn convolutional-networks dorefa group-convolution integer-arithmetic-only model-compression network-in-network network-slimming neuromorphic-computing onnx post-training-quantization pruning pytorch quantization quantization-aware-training tensorrt tensorrt-int8-python twn xnor-net
Last synced: 20 Mar 2025
https://github.com/pytorch/ao
PyTorch native quantization and sparsity for training and inference
brrr cuda dtypes float8 inference llama mx offloading optimizer pytorch quantization sparsity training transformer
Last synced: 12 May 2025
https://github.com/nunchaku-tech/ComfyUI-nunchaku
ComfyUI Plugin of Nunchaku
comfyui diffusion flux genai mlsys quantization
Last synced: 02 Sep 2025
https://github.com/intel/intel-extension-for-pytorch
A Python package that extends official PyTorch for improved performance on Intel platforms
deep-learning intel machine-learning neural-network pytorch quantization
Last synced: 12 May 2025
https://github.com/openppl/ppq
PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool.
caffe cuda deep-learning neural-network onnx open-source pytorch quantization
Last synced: 15 May 2025
https://github.com/mit-han-lab/nunchaku
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
comfyui diffusion-models flux genai iclr iclr2025 lora mlsys quantization
Last synced: 13 May 2025
https://github.com/paddlepaddle/paddleslim
PaddleSlim is an open-source library for deep model compression and architecture search.
bert compression detection distillation ernie nas pruning quantization segmentation sparsity tensorrt transformer yolov5 yolov6 yolov7
Last synced: 14 May 2025
https://github.com/open-mmlab/mmrazor
OpenMMLab Model Compression Toolbox and Benchmark.
autoslim classification darts detection knowledge-distillation nas pruning pytorch quantization segmentation spos
Last synced: 14 May 2025
https://github.com/tensorflow/model-optimization
A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
compression deep-learning keras machine-learning ml model-compression optimization pruning quantization quantized-networks quantized-neural-networks quantized-training sparsity tensorflow
Last synced: 12 May 2025
https://github.com/rwkv/rwkv.cpp
INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
deep-learning ggml language-model llm machine-learning quantization rwkv
Last synced: 14 May 2025
https://github.com/thu-ml/sageattention
Quantized attention achieving 2-3x and 3-5x speedups over FlashAttention and xformers, respectively, without losing end-to-end accuracy across language, image, and video models.
attention cuda efficient-attention inference-acceleration llm llm-infra mlsys quantization triton video-generate video-generation vit
Last synced: 14 May 2025
https://github.com/xilinx/brevitas
Brevitas: neural network quantization in PyTorch
brevitas deep-learning fpga hardware-acceleration neural-networks ptq pytorch qat quantization xilinx
Last synced: 11 Oct 2025
https://github.com/vllm-project/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
compression quantization sparsity
Last synced: 14 May 2025
https://github.com/rahulschand/gpu_poor
Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization
ggml gpu huggingface language-model llama llama2 llamacpp llm pytorch quantization
Last synced: 14 May 2025
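The back-of-the-envelope arithmetic behind such calculators: weight memory is roughly parameter count times bits per parameter divided by 8. A minimal sketch of that weights-only term (hypothetical helper, not gpu_poor's code; real calculators also add KV cache and activation overhead):

```python
def estimate_weight_memory_gb(n_params_billion, bits_per_param):
    """Rough GPU memory needed just for model weights:
    params x bits / 8 bytes, reported in GiB. Excludes KV cache,
    activations, and framework overhead."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1024 ** 3

# A 7B model: ~13 GiB at FP16, ~3.3 GiB at 4-bit.
fp16 = estimate_weight_memory_gb(7, 16)
int4 = estimate_weight_memory_gb(7, 4)
```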
https://github.com/huawei-noah/efficient-computing
Efficient computing methods developed by Huawei Noah's Ark Lab
binary-neural-networks knowledge-distillation model-compression pruning quantization self-supervised
Last synced: 14 May 2025
https://github.com/open-edge-platform/training_extensions
Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
action-recognition anomaly-detection automl computer-vision datumaro deep-learning hyper-parameter-optimization image-classification image-segmentation incremental-learning machine-learning neural-networks-compression object-detection openvino pytorch quantization self-supervised-learning semi-supervised-learning transfer-learning
Last synced: 14 May 2025
https://github.com/openvinotoolkit/nncf
Neural Network Compression Framework for enhanced OpenVINO™ inference
bert classification compression deep-learning genai llm mixed-precision-training nlp object-detection onnx openvino pruning pytorch quantization quantization-aware-training semantic-segmentation sparsity tensorflow transformers
Last synced: 13 May 2025
https://github.com/huggingface/optimum-quanto
A pytorch quantization backend for optimum
Last synced: 14 May 2025
https://github.com/xilinx/finn
Dataflow compiler for QNN inference on FPGAs
compiler dataflow fpga neural-network quantization
Last synced: 11 Oct 2025
https://github.com/mit-han-lab/tinychatengine
TinyChatEngine: On-Device LLM Inference Library
arm c cpp cuda-programming deep-learning edge-computing large-language-models on-device-ai quantization x86-64
Last synced: 13 May 2025
https://github.com/mit-han-lab/tinyengine
[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning; [NeurIPS 2022] MCUNetV3: On-Device Training Under 256KB Memory
c codegenerator cpp deep-learning edge-computing microcontroller neural-architecture-search pytorch quantization tinyml
Last synced: 13 May 2025
https://github.com/imageoptim/libimagequant
Palette quantization library that powers pngquant and other PNG optimizers
callback conversion image-optimization image-pixels minification palette palette-generation pixel-array pngquant quality quantization rgba-pixels visual-studio
Last synced: 13 May 2025
https://github.com/pinto0309/onnx2tf
Self-Created Tools to convert ONNX files (NCHW) to TensorFlow/TFLite/Keras format (NHWC). The purpose of this tool is to solve the massive Transpose extrapolation problem in onnx-tensorflow (onnx-tf). I don't need a Star, but give me a pull request.
android coreml deep-learning docker keras lstm machine-learning model-converter models onnx onnx-tensorflow quantization tensorflow tensorflow-lite tfjs tflite transformer yolov9
Last synced: 14 May 2025
https://github.com/squeezeailab/squeezellm
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
efficient-inference large-language-models llama llm localllm model-compression natural-language-processing post-training-quantization quantization small-models text-generation transformer
Last synced: 13 Apr 2025
https://github.com/deepvac/deepvac
PyTorch Project Specification.
amp coreml ddp deepvac ncnn onnx python pytorch quantization tensorboard tensorrt torchscript
Last synced: 16 May 2025
https://github.com/IST-DASLab/marlin
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Last synced: 30 Aug 2025
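The storage trick behind FP16xINT4 kernels is packing two 4-bit weights into each byte, so weights occupy a quarter of their FP16 footprint. A pure-Python sketch of the packing (Marlin's actual CUDA layout is interleaved and far more elaborate; this shows only the idea):

```python
def pack_int4(values):
    """Pack pairs of unsigned 4-bit values (0..15) into single bytes."""
    assert all(0 <= v < 16 for v in values) and len(values) % 2 == 0
    return bytes((hi << 4) | lo for lo, hi in zip(values[::2], values[1::2]))

def unpack_int4(packed):
    """Recover the 4-bit values from the packed bytes."""
    out = []
    for b in packed:
        out.append(b & 0x0F)   # low nibble first
        out.append(b >> 4)     # then high nibble
    return out

codes = [1, 15, 0, 7, 9, 3]
packed = pack_int4(codes)      # 6 values -> 3 bytes
```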
https://github.com/OpenGVLab/OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
large-language-models llm quantization
Last synced: 07 May 2025
https://github.com/sforaidl/kd_lib
A Pytorch Knowledge Distillation library for benchmarking and extending works in the domains of Knowledge Distillation, Pruning, and Quantization.
algorithm-implementations benchmarking data-science deep-learning-library knowledge-distillation machine-learning model-compression pruning pytorch quantization
Last synced: 16 May 2025
https://github.com/google/qkeras
QKeras: a quantization deep learning library for Tensorflow Keras
accelerator asic-design deep-learning fpga fpga-accelerator hardware-acceleration keras machine-learning quantization quantized-networks quantized-neural-networks tensorflow
Last synced: 20 Mar 2025
https://github.com/Maknee/minigpt4.cpp
Port of MiniGPT4 in C++ (4bit, 5bit, 6bit, 8bit, 16bit CPU inference with GGML)
c cpp deep-learning ggml machine-learning minigpt4 multimodal quantization
Last synced: 15 Apr 2025
https://github.com/huggingface/optimum-intel
🤗 Optimum Intel: Accelerate inference with Intel optimization tools
diffusers distillation inference intel onnx openvino optimization pruning quantization transformers
Last synced: 14 Oct 2025
https://github.com/ModelTC/llmc
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
awq benchmark deployment evaluation internlm2 large-language-models lightllm llama3 llm lvlm mixtral omniquant post-training-quantization pruning quantization quarot smoothquant spinquant tool vllm
Last synced: 23 Apr 2025
https://github.com/intel/auto-round
Advanced Quantization Algorithm for LLMs/VLMs.
awq gptq int4 neural-compressor quantization rounding
Last synced: 25 Dec 2025
https://github.com/DerryHub/BEVFormer_tensorrt
BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins (float/half/half2/int8).
bevformer cuda int8-inference pytorch quantization tensorrt-plugins
Last synced: 20 Mar 2025
https://github.com/sony/model_optimization
Model Compression Toolkit (MCT) is an open-source project for neural network model optimization under efficient, constrained hardware. It provides researchers, developers, and engineers with advanced quantization and compression tools for deploying state-of-the-art neural networks.
deep-learning deep-neural-networks edge-ai machine-learning network-compression network-quantization neural-network optimizer ptq pytorch qat quantization tensorflow
Last synced: 14 May 2025
https://github.com/mit-han-lab/haq
[CVPR 2019, Oral] HAQ: Hardware-Aware Automated Quantization with Mixed Precision
automl efficient-model mixed-precision quantization
Last synced: 13 May 2025
https://github.com/neuralmagic/sparsezoo
Neural network model repository for highly sparse and sparse-quantized models with matching sparsification recipes
computer-vision deep-learning-algorithms deep-learning-models mobilenet models-optimized nlp object-detection-model pretrained-models pruning quantization resnet smaller-models sparse-quantized-models sparsification-recipe transfer-learning yolo
Last synced: 16 May 2025
https://github.com/tpoisonooo/llama.onnx
LLaMA/RWKV ONNX models, quantization, and test cases
alpaca llama llm onnx onnxruntime quantization rwkv transformer
Last synced: 07 Apr 2025
https://github.com/xiuyu-li/q-diffusion
[ICCV 2023] Q-Diffusion: Quantizing Diffusion Models.
ddim diffusion-models model-compression post-training-quantization pytorch quantization stable-diffusion
Last synced: 06 Apr 2025
https://github.com/SqueezeAILab/KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
compression efficient-inference efficient-model large-language-models llama llm localllama localllm mistral model-compression natural-language-processing quantization small-models text-generation transformer
Last synced: 08 May 2025
https://github.com/inisis/brocolli
Everything in Torch Fx
caffe onnx pytorch quantization
Last synced: 13 Apr 2025
https://github.com/megvii-research/Sparsebit
A model compression and acceleration toolbox based on pytorch.
deep-learning post-training-quantization pruning quantization quantization-aware-training sparse tensorrt
Last synced: 12 May 2025
https://github.com/neuralmagic/sparsify
ML model optimization product to accelerate inference.
automl computer-vision deep-learning-accelerator image-classification inference-performance keras object-detection onnx pruning pytorch quantization smaller-models sparsification-recipe sparsify tensorflow
Last synced: 12 Apr 2025
https://github.com/beomi/bitnet-transformers
0️⃣1️⃣🤗 BitNet-Transformers: a Hugging Face Transformers implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch with the LLaMA(2) architecture
llm quantization quantization-aware-training transformers
Last synced: 07 May 2025
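BitNet-style 1-bit quantization keeps only the sign of each weight plus a single real-valued scale, commonly the mean absolute value of the tensor. A sketch of the forward binarization step (illustrative only; the paper trains the network with quantization-aware training rather than binarizing after the fact):

```python
def binarize(weights):
    """1-bit (sign) quantization with a per-tensor scale equal to the
    mean absolute value, as in BNN/BitNet-style schemes."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale

signs, scale = binarize([0.4, -0.2, 0.9, -0.5])
# Each weight is then approximated as sign * scale.
approx = [s * scale for s in signs]
```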
https://github.com/megvii-research/FQ-ViT
[IJCAI 2022] FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer
imagenet post-training-quantization pytorch quantization vision-transformer
Last synced: 20 Mar 2025
https://github.com/jy-yuan/KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
inference large-language-models llama llm natural-language-processing quantization transformer
Last synced: 08 May 2025
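KV-cache quantizers like KIVI keep accuracy at 2 bits by quantizing in small groups, each with its own scale and offset (KIVI additionally quantizes keys per-channel and values per-token). A group-wise asymmetric sketch in pure Python (illustrative, not the paper's implementation):

```python
def quantize_groups(values, group_size=4, n_bits=2):
    """Group-wise asymmetric quantization: each group gets its own
    scale and minimum, which preserves accuracy at very low bit-widths."""
    qmax = 2 ** n_bits - 1
    out = []
    for i in range(0, len(values), group_size):
        g = values[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / qmax or 1.0   # guard against a constant group
        codes = [round((v - lo) / scale) for v in g]
        out.append((codes, scale, lo))    # codes plus per-group metadata
    return out

# Two groups with very different ranges each use the full 2-bit grid.
groups = quantize_groups([0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0])
```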
https://github.com/sinanuozdemir/quick-start-guide-to-llms
The Official Repo for "Quick Start Guide to Large Language Models"
ai bert deepseek distillation generative-ai gpt llama-4 llm machine-learning multimodal nlp quantization rag
Last synced: 16 May 2025
https://github.com/datawhalechina/llm-deploy
Theory and practice of LLM inference and deployment (in Chinese)
knowledge-distillation llm llm-deploy lora pruning quantization
Last synced: 13 Jun 2025
https://github.com/microsoft/LQ-Nets
LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks
cnn compression dnn quantization
Last synced: 20 Mar 2025
https://github.com/kssteven418/i-bert
[ICML'21 Oral] I-BERT: Integer-only BERT Quantization
bert efficient-model efficient-neural-networks model-compression natural-language-processing quantization transformer
Last synced: 06 Apr 2025
https://github.com/picovoice/picollm
On-device LLM Inference Powered by X-Bit Quantization
compression efficient-inference gemma generative-ai language-model language-models large-language-model llama llama2 llama3 llm llm-inference llms mistral mixtral model-compression natural-language-processing quantization self-hosted
Last synced: 23 Oct 2025
https://github.com/j-marple-dev/model_compression
PyTorch Model Compression
lottey-ticket-hypothesis pruning pytorch quantization
Last synced: 03 May 2025
https://github.com/zcemycl/tf2deepfloorplan
TF2 Deep FloorPlan Recognition using a Multi-task Network with Room-boundary-Guided Attention. Enable tensorboard, quantization, flask, tflite, docker, github actions and google colab.
attention-network curl deep-learning deep-neural-networks docker flask github-actions github-release google-colab image-processing image-recognition jupyter-notebook keras-tensorflow pygame pypi-package python3 quantization tensorboard tensorflow2 tflite
Last synced: 09 Apr 2025
https://github.com/ikergarcia1996/easy-translate
Easy-Translate is a script for translating large text files with a SINGLE COMMAND. Easy-Translate is designed to be as easy as possible for beginners and as seamless and customizable as possible for advanced users.
4-bit 8-bit begginers cpu easy easy-to-use gpu hugginface hugginface-hub huggingface-transformers llm m2m100 machine-translation nllb200 prompt pytorch quantization transformers translation
Last synced: 15 May 2025
https://github.com/Aaronhuang-778/BiLLM
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
Last synced: 09 May 2025
https://github.com/dbohdan/hicolor
🎨 Convert images to 15/16-bit RGB color with dithering
color-quantization color-reduction dithering high-color image-conversion image-format image-library image-processing quantization retro-graphics
Last synced: 29 Aug 2025
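Reducing 24-bit RGB to 16-bit "high color" keeps 5/6/5 bits per channel. A sketch of the RGB565 pack/unpack round trip (hicolor additionally dithers to hide the resulting banding; helper names here are hypothetical):

```python
def to_rgb565(r, g, b):
    """Reduce 8-bit RGB channels to 5/6/5 bits and pack into 16 bits."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def from_rgb565(c):
    """Expand back to approximate 8-bit channels via bit replication."""
    r = (c >> 11) & 0x1F
    g = (c >> 5) & 0x3F
    b = c & 0x1F
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))

white = to_rgb565(255, 255, 255)   # 0xFFFF: all channel bits set
red = from_rgb565(to_rgb565(255, 0, 0))   # pure red survives exactly
```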