Projects in Awesome Lists tagged with fp8

https://github.com/NVIDIA/TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.

cuda deep-learning fp4 fp8 gpu jax machine-learning python pytorch

Last synced: 16 Nov 2025

https://github.com/nvidia/transformerengine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.

cuda deep-learning fp8 gpu jax machine-learning python pytorch

Last synced: 24 Feb 2026

https://github.com/azure/ms-amp

Microsoft Automatic Mixed Precision Library

amp deep-learning fp8 gpu mixed-precision pytorch transformer

Last synced: 07 Apr 2025

https://github.com/intel/neural-speed

An innovative library for efficient LLM inference via low-bit quantization

cpu fp4 fp8 gaudi2 gpu int1 int2 int3 int4 int5 int6 int7 int8 llamacpp llm-fine-tuning llm-inference low-bit mxformat nf4 sparsity

Last synced: 25 Oct 2025

https://github.com/aredden/flux-fp8-api

Flux diffusion model implementation using quantized fp8 matmul; the remaining layers use faster half-precision accumulate, which is ~2x faster on consumer devices.

diffusion fast-inference flux fp8 pytorch quantization

Last synced: 19 Sep 2025

https://github.com/maeddesg/vulkanforge

LLM inference engine for AMD RDNA4 — Rust + Vulkan compute shaders, gguf & native FP8.

amd fp8 gemma4 gfx1201 gguf inference llm machine-learning mesa rdna4 rust vulkan

Last synced: 14 Jun 2026

https://github.com/graphcore-research/jax-scalify

JAX Scalify: end-to-end scaled arithmetics

fp8 jax llm low-precision

Last synced: 17 Jan 2026

https://github.com/zerfoo/zerfoo

Pure Go machine learning framework. Train, run, and serve ML models with go build. Zero CGo.

autodiff deep-learning distributed-training float16 float8 fp16 fp8 go golang graph-ml machine-learning ml-framework neural-network onnx transformer

Last synced: 13 Jun 2026

https://github.com/jcartu/qwen36-27b-blackwell-inference-study

Systematic 24-hour benchmark study of Qwen3.6-27B inference on dual NVIDIA RTX PRO 6000 Blackwell SM120 (TP=2). 8 experiments comparing repne/vllm fork vs upstream vLLM across FP8/BF16/NVFP4/Q8_0 quants and MTP/DFlash speculative decoding. Peak: 2,083 tok/s at c=32. Quality: KLD vs BF16 = 0.0018 (noise floor).

benchmark bf16 blackwell fp8 inference nvfp4 qwen qwen3 rtx-pro-6000 speculative-decoding vllm

Last synced: 12 Jun 2026

https://github.com/murrellgroup/microfloats.jl

Slow, low-precision floating point types

floating-point fp4 fp6 fp8 microfloat microscaling minifloat

Last synced: 12 Feb 2026

https://github.com/umangyadav/py_fp8

FP8 dtypes enumeration in python

fp8 fp8e4m3 fp8e4m3fnuz fp8e5m2 fp8e5m2fnuz

Last synced: 17 Jun 2025

https://github.com/pathcosmos/frankenstallm

Korean 3B LLM (pure Transformer) pretrained from scratch on 8× NVIDIA B200 GPUs with SFT + ORPO alignment

flash-attention fp8 gguf gqa korean-llm nvidia-b200 orpo pretraining sft transformer

Last synced: 29 May 2026

https://github.com/theogravity/dual-rtx-6000-blackwell-qwen3.6-27b-fp8

Optimized vLLM setup for Qwen3.6-27B-FP8 on dual RTX PRO 6000 Blackwell (192 GB GDDR7, no NVLink): config, benchmark sweep results, and a custom chat template with thinking mode off by default.

benchmark blackwell fp8 llm-inference local-llm multi-token-prediction qwen3 rtx-pro-6000 speculative-decoding vllm

Last synced: 11 May 2026

https://github.com/jcartu/qwen36-27b-fp8-repne-vs-upstream

Same FP8+MTP=3 config on Repne fork vs upstream vLLM v0.20.1, dual RTX PRO 6000 Blackwell. Repne fork wins at short-context multi-stream.

benchmark blackwell fp8 mtp qwen speculative-decoding vllm

Last synced: 12 Jun 2026
