https://github.com/afondiel/ai-efficiency-metrics

Quick-reference guide covering memory, computation, performance, energy, and cost metrics for AI Systems.
Host: GitHub
URL: https://github.com/afondiel/ai-efficiency-metrics
Owner: afondiel
License: mit
Created: 2026-03-14T15:17:44.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-03-14T20:01:03.000Z (3 months ago)
Last Synced: 2026-03-15T02:11:53.417Z (3 months ago)
Topics: ai-efficiency, edge-ai, green-ai, greener-ai, mlperf, tinyml
Homepage:
Size: 8.79 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

[![](https://img.shields.io/badge/Contribute-Welcome-green)](./CONTRIBUTING.md) ![GitHub](https://img.shields.io/github/license/afondiel/ai-efficiency-metrics)

# AI Efficiency Metrics Cheatsheet

> Quick-reference guide covering memory, computation, performance, energy, and cost metrics for **AI Systems**, from CNNs and Transformers to TinyML and diffusion-model-based systems.

---

## Table of Contents

1. [Notation Reference](#notation-reference)

2. [Family I — Memory Metrics](#family-i--memory-metrics)

3. [Family II — Computation Metrics](#family-ii--computation-metrics)

4. [Family III — Performance & Latency Metrics](#family-iii--performance--latency-metrics)

5. [Family IV — Energy, Carbon & Cost Metrics](#family-iv--energy-carbon--cost-metrics)

6. [Layer-by-Layer Formulas](#layer-by-layer-formulas)

7. [Activation Formulas](#activation-formulas)

8. [Architecture-Specific Metrics](#architecture-specific-metrics)

9. [Hardware Constants Cheat Table](#hardware-constants-cheat-table)

10. [Quick Conversion Rules](#quick-conversion-rules)

---

## Notation Reference

| Symbol | Meaning |
|--------|---------|
| $c_i$ / $c_o$ | Input / output channels |
| $k_h$ / $k_w$ | Kernel height / width |
| $h_i$ / $w_i$ | Input feature map height / width |
| $h_o$ / $w_o$ | Output feature map height / width |
| $g$ | Number of groups (grouped convolution) |
| $n$ / $BS$ | Batch size |
| $L$ | Number of layers |
| $d$ / $d_{model}$ | Hidden / embedding dimension |
| $d_{head}$ | Per-head dimension ($d_{model} / \text{Heads}$) |
| $V$ | Vocabulary size |
| $N$ | Sequence length (tokens) |
| $b$ | Bit width (e.g., 32 for FP32, 16 for FP16, 8 for INT8) |
| $T$ | Number of diffusion sampling steps |

---

## Family I — Memory Metrics

| Metric | Formula | Unit | Notes |
|--------|---------|------|-------|
| **\#Parameters ($W$)** | Sum of all weight tensor elements (see [layer table](#layer-by-layer-formulas)) | count | Bias typically ignored in estimates |
| **Model Size** | $W \times b$ | bits (convert to MB/GB) | Storage cost of weights on disk/flash |
| **\#Activations (Total)** | $\sum_{\text{all layers}} \text{output activation elements}$ | count | Sum across all feature maps |
| **Peak \#Activations** | $\approx \text{Input Activation} + \text{Output Activation}$ (bottleneck layer) | count | **Often the memory bottleneck** in inference |
| **Activation Memory** | $\text{Total Activation} \times b$ | bits → bytes | SRAM/GPU memory consumed |
| **KV Cache (LLMs)** | $BS \times L \times \text{Heads} \times d_{head} \times N \times 2 \times b$ | bits → bytes | Stores K & V for autoregressive decoding |

### Key Insight

> In CNN inference (especially TinyML), **Peak Activations** — not parameters — are the memory bottleneck. MobileNets reduce parameters but often **not** peak activation size.
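As a rough illustration of the Model Size and Activation Memory rows, the sketch below converts element counts to bytes. The parameter count and the bottleneck-layer shapes are hypothetical values chosen only for the example.

```python
def tensor_bytes(num_elements: int, bits: int) -> float:
    """Memory of a tensor: element count x bit width, converted to bytes."""
    return num_elements * bits / 8

# Hypothetical small CNN: 3.5 M parameters and a bottleneck layer with
# 150,528 input and 401,408 output activation elements.
params = 3_500_000
peak_activations = 150_528 + 401_408

for name, bits in [("FP32", 32), ("INT8", 8)]:
    size_mb = tensor_bytes(params, bits) / 1e6
    peak_kb = tensor_bytes(peak_activations, bits) / 1e3
    print(f"{name}: model {size_mb:.1f} MB, peak activations {peak_kb:.1f} KB")
```

Weights usually live in flash or on disk, while peak activations must fit in SRAM or GPU memory, so the two numbers are constrained by different budgets.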

---

## Family II — Computation Metrics

| Metric | Formula | Unit | Notes |
|--------|---------|------|-------|
| **MAC** | $a \leftarrow a + w \cdot x$ | count | One multiply + one accumulate |
| **FLOP** | $1 \text{ MAC} = 2 \text{ FLOPs}$ | count | Floating point operations |
| **Total FLOPs** | $\text{Total MACs} \times 2$ | count | |
| **OP** | Same as FLOP but for non-float (e.g., INT8) | count | Generalized operation count |
| **FLOPS / OPS** | Operations per second (hardware capability) | ops/s | FLOPS = floating point; OPS = any data type |

### Transformer-Specific Computation

| Attention Type | Complexity (vs. sequence length $N$) |
|----------------|--------------------------------------|
| **Softmax Attention** | $O(N^2)$ |
| **Linear Attention** | $O(N)$ |

### Transformer FLOPs Heuristic (Decoder-Only)

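$$
\text{Estimated params} \approx V \cdot d + 12 \times L \times d^2
$$

$$
\text{FLOPs per token (forward)} \approx 2 \times 12 \times L \times d^2 = 24 \times L \times d^2
$$

The forward pass costs about 2 FLOPs per non-embedding parameter per token. Training (forward + backward) costs about 6 FLOPs per parameter per token, i.e. $\approx 72 \times L \times d^2$. Attention-score FLOPs, which grow with $N$, are not included.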
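A minimal sketch of the parameter estimate, evaluated for a hypothetical GPT-2-small-like configuration ($V = 50{,}257$, $d = 768$, $L = 12$). The configuration values are assumptions for illustration, and the heuristic ignores biases, layer norms, and positional embeddings.

```python
def estimated_params(vocab_size: int, d_model: int, num_layers: int) -> int:
    """Decoder-only estimate: V*d (embeddings) + 12*L*d^2 (attention + MLP blocks)."""
    embedding = vocab_size * d_model
    blocks = 12 * num_layers * d_model ** 2
    return embedding + blocks

# Hypothetical GPT-2-small-like configuration.
total = estimated_params(vocab_size=50_257, d_model=768, num_layers=12)
print(f"~{total / 1e6:.1f} M parameters")
```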

### Winograd Convolution (3×3)

Reduces cost from $9 \times C \times 4$ MACs → $16 \times C$ MACs for 4 outputs → **2.25× reduction**.
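The 2.25× figure can be checked by counting multiplications for one 2×2 output tile (4 outputs) of a 3×3 convolution. This sketch counts only the multiply stage and ignores the additions in the Winograd input/output transforms.

```python
def direct_macs(channels: int, outputs: int = 4, kernel_elems: int = 9) -> int:
    """Direct 3x3 conv: 9 MACs per output per input channel."""
    return kernel_elems * channels * outputs

def winograd_macs(channels: int, tile_elems: int = 16) -> int:
    """Winograd F(2x2, 3x3): one multiply per element of the 4x4 transformed tile."""
    return tile_elems * channels

C = 64  # hypothetical channel count; it cancels in the ratio
print(direct_macs(C) / winograd_macs(C))
```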

---

## Family III — Performance & Latency Metrics

### Latency

$$
\boxed{\text{Latency} \approx \max\left(T_{\text{compute}},\; T_{\text{memory}}\right)}
$$

| Component | Formula |
|-----------|---------|
| $T_{\text{compute}}$ | $\dfrac{\text{Total OPs in model}}{\text{OPS}_{\text{hardware}}}$ |
| $T_{\text{memory}}$ | $T_{\text{activations}} + T_{\text{weights}}$ |
| $T_{\text{activations}}$ | $\dfrac{\text{Input Act. Size} + \text{Output Act. Size}}{\text{Memory Bandwidth}}$ |
| $T_{\text{weights}}$ | $\dfrac{\text{Model Size}}{\text{Memory Bandwidth}}$ |
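The latency estimate above can be sketched as follows. The workload numbers (8 GOPs, 50 MB of weights, 10 MB of activations) are hypothetical; the hardware figures reuse the example values from the Hardware Constants table (100 TOPS, 900 GB/s).

```python
def estimate_latency(total_ops, ops_per_s, act_bytes, weight_bytes, bw_bytes_per_s):
    """Return (latency, t_compute, t_memory) in seconds using max(compute, memory)."""
    t_compute = total_ops / ops_per_s
    t_memory = (act_bytes + weight_bytes) / bw_bytes_per_s
    return max(t_compute, t_memory), t_compute, t_memory

latency, t_c, t_m = estimate_latency(
    total_ops=8e9,          # hypothetical model: 8 GOPs per inference
    ops_per_s=100e12,       # 100 TOPS peak
    act_bytes=10e6,         # hypothetical input + output activations
    weight_bytes=50e6,      # hypothetical model size
    bw_bytes_per_s=900e9,   # 900 GB/s
)
bound = "compute" if t_c >= t_m else "memory"
print(f"latency ~ {latency * 1e6:.1f} us ({bound}-bound)")
```

This is a lower bound: it assumes peak hardware throughput and perfect overlap of compute with data movement.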

### Throughput

$$
\text{Throughput} = \frac{\text{Total Processed Units}}{\text{Total Time (s)}}
$$

Or simply: $\text{Throughput} \approx 1 / \text{Latency}$ (for single-stream).

### Tokens per Second (LLMs)

$$
\text{Tokens/s} = \frac{1}{T_{\text{per-token}}} = \frac{\text{OPS}_{\text{hardware}} \times \text{Utilization}}{\text{FLOPs per token}}
$$
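A small sketch of the tokens-per-second formula. The FLOPs-per-token value is a hypothetical input, and the estimate is compute-only: it ignores memory bandwidth, which often dominates single-stream decoding.

```python
def tokens_per_second(peak_ops_per_s: float, utilization: float,
                      flops_per_token: float) -> float:
    """Compute-limited decoding rate: usable FLOPS divided by FLOPs per token."""
    return peak_ops_per_s * utilization / flops_per_token

# 100 TFLOPS peak at 70% utilization, hypothetical 14 GFLOPs per token.
rate = tokens_per_second(100e12, 0.7, 14e9)
print(f"{rate:.0f} tokens/s")
```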

---

## Family IV — Energy, Carbon & Cost Metrics

| Metric | Formula | Unit |
|--------|---------|------|
| **Device Power Draw** | $\text{TDP} \times \text{Utilization}$ | W |
| **Total Power (with PUE)** | $\text{Power Draw} \times \text{PUE}$ | W |
| **Energy per Inference** | $\dfrac{\text{Total Power} \times \text{Inference Time (s)}}{3600}$ | Wh |
| **OPS/W** | $\dfrac{\text{OPS (ops/s)}}{\text{Total Power (W)}}$ | ops/s/W (= ops/J) |
| **OPS/Wh** | $\dfrac{\text{OPs per inference}}{\text{Energy per inference (Wh)}}$ | ops/Wh |
| **IPS/W** | $\dfrac{\text{Inferences/s}}{\text{Total Power (W)}}$ | inf/s/W (= inf/J) |
| **TPS/W** | $\dfrac{\text{Tokens/s per sequence} \times BS}{\text{Total Power (W)}}$ | tok/s/W (= tok/J) |
| **Carbon/Inference** | $\dfrac{\text{Energy (kWh)} \times \text{Grid Intensity (gCO₂/kWh)}}{1000}$ | kgCO₂ |
| **Cost/Inference** | $\text{Energy (kWh)} \times \text{Electricity Price (USD/kWh)}$ | USD |
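The rows above chain together; a minimal sketch follows, using the example values from the Hardware Constants table (TDP 300 W, utilization 0.7, PUE 1.2, 400 gCO₂/kWh, 0.20 USD/kWh). The 50 ms inference time is a hypothetical input.

```python
def energy_per_inference_wh(tdp_w, utilization, pue, inference_time_s):
    """Energy (Wh) = TDP * utilization * PUE * time (s) / 3600."""
    power_draw_w = tdp_w * utilization      # device power draw
    total_power_w = power_draw_w * pue      # including datacenter overhead
    return total_power_w * inference_time_s / 3600

energy_wh = energy_per_inference_wh(tdp_w=300, utilization=0.7, pue=1.2,
                                    inference_time_s=0.05)
energy_kwh = energy_wh / 1000
carbon_kg = energy_kwh * 400 / 1000   # grid intensity in gCO2/kWh -> kgCO2
cost_usd = energy_kwh * 0.20          # electricity price in USD/kWh

print(f"{energy_wh:.4f} Wh, {carbon_kg:.2e} kgCO2, {cost_usd:.2e} USD per inference")
```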

### Key Insight

> **A DRAM access costs ≈ 200× more energy** than a 32-bit multiply. Minimizing data movement is often more impactful than reducing FLOPs.

$$
\text{Energy} \propto \text{Data Movement} \rightarrow \text{More Memory References} \rightarrow \text{More Energy}
$$

---

## Layer-by-Layer Formulas

*Bias ignored. Batch size $n = 1$.*

| Layer Type | \#Parameters | MACs |
|:-----------|:------------|:-----|
| **Fully-Connected (Linear)** | $c_o \cdot c_i$ | $c_o \cdot c_i$ |
| **Standard Convolution** | $c_o \cdot c_i \cdot k_h \cdot k_w$ | $c_o \cdot c_i \cdot k_h \cdot k_w \cdot h_o \cdot w_o$ |
| **Grouped Convolution** | $\dfrac{c_o \cdot c_i \cdot k_h \cdot k_w}{g}$ | $\dfrac{c_o \cdot c_i \cdot k_h \cdot k_w \cdot h_o \cdot w_o}{g}$ |
| **Depthwise Convolution** | $c_o \cdot k_h \cdot k_w$ | $c_o \cdot k_h \cdot k_w \cdot h_o \cdot w_o$ |
| **1×1 Convolution** | $c_o \cdot c_i$ | $c_o \cdot c_i \cdot h_o \cdot w_o$ |

> **Depthwise** = Grouped Conv where $g = c_i = c_o$. **1×1 Conv** = Standard Conv where $k_h = k_w = 1$.
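The table collapses into a single grouped-convolution formula, sketched below with bias ignored and batch size 1. The layer shapes are hypothetical.

```python
def conv_params(c_in, c_out, k_h, k_w, groups=1):
    """Weights of a (grouped) convolution, bias ignored."""
    return c_out * c_in * k_h * k_w // groups

def conv_macs(c_in, c_out, k_h, k_w, h_out, w_out, groups=1):
    """MACs of a (grouped) convolution for batch size 1."""
    return conv_params(c_in, c_out, k_h, k_w, groups) * h_out * w_out

# Standard 3x3 conv, 64 -> 128 channels, 56x56 output (hypothetical layer).
print(conv_params(64, 128, 3, 3), conv_macs(64, 128, 3, 3, 56, 56))

# Depthwise 3x3 conv on 64 channels is the grouped case with g = c_i = c_o.
print(conv_params(64, 64, 3, 3, groups=64), conv_macs(64, 64, 3, 3, 56, 56, groups=64))

# A fully-connected layer is a 1x1 conv on a 1x1 feature map.
print(conv_macs(512, 1000, 1, 1, 1, 1))
```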

---

## Activation Formulas

### Per-Layer Activation Sizes

| Layer Type | Input Activation | Output Activation |
|:-----------|:-----------------|:------------------|
| **CNN Layer** | $n \cdot c_i \cdot h_i \cdot w_i$ | $n \cdot c_o \cdot h_o \cdot w_o$ |
| **Linear Layer** | $n \cdot c_i$ | $n \cdot c_o$ |
| **Transformer Layer** | $BS \cdot N \cdot d_{model}$ | $BS \cdot N \cdot d_{model}$ |

### Peak vs. Total

| Metric | What It Measures | Formula |
|--------|-----------------|---------|
| **Peak \#Activations** | Max memory at any single point (HW constraint) | $\max_{\text{layer } l}\left(\text{Input}_l + \text{Output}_l\right)$ |
| **Total \#Activations** | Sum of all feature maps across all layers | $\sum_{\text{all layers}} \text{Output}_l$ |
| **Activation Memory** | Byte cost of activations | $\text{Total Activation} \times b / 8$ bytes |
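To make the peak/total distinction concrete, the sketch below evaluates both for a hypothetical three-layer CNN stem (batch size 1), with each layer given as a pair of input and output element counts.

```python
# (input elements, output elements) per layer; shapes are hypothetical.
layers = [
    (3 * 224 * 224, 32 * 112 * 112),
    (32 * 112 * 112, 64 * 56 * 56),
    (64 * 56 * 56, 128 * 28 * 28),
]

total_activations = sum(out for _, out in layers)
peak_activations = max(inp + out for inp, out in layers)

bits = 8  # INT8 activations
print("total:", total_activations, "elements")
print("peak :", peak_activations, "elements =", peak_activations * bits // 8, "bytes")
```

Here the peak occurs at the second layer, where a large input map and its output must be resident at the same time.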

### Training Memory Note

> During **on-device training**, **all** intermediate activations from the forward pass must be stored for backpropagation, so training memory is $\gg$ inference memory. Sparse backpropagation can cut the stored activations to roughly 1/4.

---

## Architecture-Specific Metrics

### CNNs (Vision)

| Metric | Details |
|--------|---------|
| **Peak Activations** | Primary memory bottleneck; often larger than weights |
| **In-place Depthwise Conv** | Overwrites input buffer → memory = $\max(\text{In}, \text{Out})$ instead of $\text{In} + \text{Out}$ |
| **Winograd (3×3)** | 2.25× MAC reduction |

### Vision Transformers (ViT)

| Metric | Formula / Details |
|--------|-------------------|
| **Initial Token Count** | $\dfrac{H \times W}{\text{PatchSize}^2}$ |
| **Attention Cost** | $O(N^2)$ for softmax; $O(N)$ for linear attention |
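A short sketch of the token-count formula and its effect on softmax attention. The image and patch sizes are example values, and the count excludes any class token.

```python
def vit_tokens(height: int, width: int, patch: int) -> int:
    """Number of patch tokens: (H * W) / patch^2, for sizes divisible by the patch."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)

n_224 = vit_tokens(224, 224, 16)
n_448 = vit_tokens(448, 448, 16)

# Softmax attention scales with N^2 token pairs per head per layer.
print(n_224, n_224 ** 2)
print(n_448, n_448 ** 2)
```

Doubling the resolution quadruples the token count and multiplies the attention pair count by 16.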

### LLMs / Transformers

| Metric | Formula / Details |
|--------|-------------------|
| **KV Cache (MHA)** | $BS \times L \times \text{Heads} \times d_{head} \times N \times 2 \times b$ |
| **GQA Cache** | $\text{Heads} / \text{KV groups}$ times smaller than MHA (e.g., ~$8 \times$ with 64 heads and 8 KV groups) |
| **MQA Cache** | $\text{Heads}$ times smaller than MHA (e.g., ~$64 \times$ with 64 heads) |
| **FLOPs/token** | $\approx 24 \times L \times d^2$ forward ($\approx 2 \times$ non-embedding params); $\approx 72 \times L \times d^2$ training |
| **Estimated Params** | $\approx V \cdot d + 12 \cdot L \cdot d^2$ |
| **Effective Context (StreamingLLM)** | Uses "Attention Sinks" to handle context beyond window |
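The KV cache formula can be sketched with the number of KV heads as a parameter, which covers MHA, GQA, and MQA in one function. The model configuration (80 layers, 64 query heads, $d_{head} = 128$, 4096 tokens, FP16) is hypothetical.

```python
def kv_cache_bytes(batch, layers, kv_heads, d_head, seq_len, bits):
    """KV cache = BS * L * KV heads * d_head * N * 2 (K and V) * b bits, in bytes."""
    return batch * layers * kv_heads * d_head * seq_len * 2 * bits / 8

cfg = dict(batch=1, layers=80, d_head=128, seq_len=4096, bits=16)

mha = kv_cache_bytes(kv_heads=64, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(kv_heads=8, **cfg)   # 8 KV groups shared by 64 query heads
mqa = kv_cache_bytes(kv_heads=1, **cfg)   # a single KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB ({mha / size:.0f}x smaller than MHA)")
```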

### Diffusion Models

| Metric | Formula / Details |
|--------|-------------------|
| **Total Latency** | $\text{Latency}_{\text{step}} \times T$ (sampling steps) |
| **FLOPs Scaling** | Linear with number of denoising steps $T$ |

### TinyML / Edge (MCU)

| Metric | Constraint |
|--------|------------|
| **Peak SRAM** | Must fit within on-chip SRAM (device-dependent, e.g., ~320 KB) |
| **Flash Usage** | = Model Size (weights only) |
| **SRAM Reuse (In-place DWConv)** | Output overwrites input buffer |

### Video Models

| Metric | Details |
|--------|---------|
| **TSM (Temporal Shift Module)** | 0 extra params, 0 extra MACs for temporal modeling; increases data movement cost |
| **Throughput** | Measured in Videos/s |

### Multimodal (VLM)

| Metric | Details |
|--------|---------|
| **Perceiver Resampler** | Maps variable visual features → fixed visual tokens (e.g., 64 tokens in Flamingo) |
| **Cross-Attention Overhead** | Cost between visual encoder and LLM backbone |

### 3D Point Clouds

| Metric | Details |
|--------|---------|
| **Sparsity** | Voxelized point clouds typically $<0.1\%$ dense |
| **PVCNN (Point-Voxel)** | Balances random memory access (point) vs. regular compute (voxel) |

---

## Hardware Constants Cheat Table

Define these before computing Family III & IV metrics:

| Parameter | Symbol | Example Value | Unit |
|-----------|--------|---------------|------|
| Processor Peak Performance | $\text{OPS}_{\text{hw}}$ | 100 | TOPS or TFLOPS |
| Memory Bandwidth | $\text{BW}_{\text{mem}}$ | 900 | GB/s |
| TDP (Thermal Design Power) | $\text{TDP}$ | 300 | W |
| Bit Width | $b$ | 16 | bits |
| Device Utilization | $U$ | 0.7 | fraction |
| PUE (Power Usage Effectiveness) | $\text{PUE}$ | 1.2 | ratio |
| Grid Carbon Intensity | n/a | 400 | gCO₂/kWh |
| Electricity Cost | n/a | 0.20 | USD/kWh |

### Common Energy Costs (Relative)

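| Operation | Relative Energy |
|-----------|----------------|
| 32-bit INT Add | 1× (baseline) |
| 32-bit FP Multiply | ~37× |
| 32-bit SRAM Read | ~50× |
| **32-bit DRAM Read** | **~6400×** (≈ 200× a 32-bit multiply) |

Ratios are approximate, based on commonly cited 45 nm estimates (about 0.1 pJ for an INT add, 3.7 pJ for an FP multiply, 5 pJ for an SRAM read, and 640 pJ for a DRAM read).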

---

## Quick Conversion Rules

| From | To | Rule |
|------|----|------|
| MACs | FLOPs | $\times 2$ |
| MACs | OPs | $\times 2$ (non-float) |
| Parameters | Model Size (bytes) | $\times b / 8$ |
| Activations | Memory (bytes) | $\times b / 8$ |
| TFLOPS | FLOPS | $\times 10^{12}$ |
| Wh | kWh | $\div 1000$ |
| Latency (s) | Throughput (units/s) | $1 / \text{Latency}$ |
| Energy (kWh) | Carbon (gCO₂) | $\times$ grid intensity (gCO₂/kWh) |
| Energy (kWh) | Cost (USD) | $\times$ electricity price (USD/kWh) |

---

## Compute-Bound vs. Memory-Bound Decision

$$
\text{Arithmetic Intensity} = \frac{\text{Total OPs}}{\text{Total Bytes Moved}}
$$

| Condition | Regime | Bottleneck |
|-----------|--------|------------|
| $\text{Arithmetic Intensity} > \dfrac{\text{OPS}_{\text{hw}}}{\text{BW}_{\text{mem}}}$ | **Compute-bound** | Arithmetic |
| $\text{Arithmetic Intensity} < \dfrac{\text{OPS}_{\text{hw}}}{\text{BW}_{\text{mem}}}$ | **Memory-bound** | Data movement |

> Use the **Roofline Model** to visualize where your workload sits.
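A minimal roofline check, following the usual convention that a workload whose arithmetic intensity exceeds the hardware ridge point ($\text{OPS}_{\text{hw}} / \text{BW}_{\text{mem}}$) is compute-bound. The hardware values come from the Hardware Constants table; the two workloads are hypothetical.

```python
def ridge_point(peak_ops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (ops/byte) at which compute and memory time are equal."""
    return peak_ops_per_s / bandwidth_bytes_per_s

def regime(total_ops: float, total_bytes: float,
           peak_ops_per_s: float, bandwidth_bytes_per_s: float) -> str:
    """Classify a workload by comparing its arithmetic intensity to the ridge point."""
    intensity = total_ops / total_bytes
    if intensity > ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
        return "compute-bound"
    return "memory-bound"

PEAK, BW = 100e12, 900e9  # 100 TOPS, 900 GB/s -> ridge ~ 111 ops/byte

# Hypothetical batch-1 LLM decode step: ~1 op per byte of weights read.
print(regime(total_ops=14e9, total_bytes=14e9, peak_ops_per_s=PEAK, bandwidth_bytes_per_s=BW))
# Hypothetical large convolution: ~500 ops per byte moved.
print(regime(total_ops=5e11, total_bytes=1e9, peak_ops_per_s=PEAK, bandwidth_bytes_per_s=BW))
```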

---

## Distributed / Scaling Metrics

| Metric | Formula |
|--------|---------|
| **Scalability Ratio** | $\dfrac{\text{Throughput with } G \text{ GPUs}}{\text{Throughput with 1 GPU}}$ (ideal value: $G$) |
| **Communication Overhead** | $\dfrac{T_{\text{data transfer}}}{T_{\text{compute}} + T_{\text{data transfer}}}$ |
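Both metrics are simple ratios; the sketch below evaluates them for hypothetical measurements. The derived "scaling efficiency" (ratio divided by GPU count) is an added convenience, not a metric defined in the table.

```python
def scalability_ratio(throughput_multi: float, throughput_single: float) -> float:
    """Speedup of the multi-GPU run over the single-GPU run."""
    return throughput_multi / throughput_single

def communication_overhead(t_transfer: float, t_compute: float) -> float:
    """Fraction of step time spent moving data between devices."""
    return t_transfer / (t_compute + t_transfer)

# Hypothetical measurements: 1,000 samples/s on 1 GPU, 6,400 samples/s on 8 GPUs,
# with 0.2 s of data transfer and 0.8 s of compute per step.
ratio = scalability_ratio(6400, 1000)
efficiency = ratio / 8  # assumption: efficiency = ratio / number of GPUs
overhead = communication_overhead(t_transfer=0.2, t_compute=0.8)
print(ratio, efficiency, overhead)
```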

---

## Quantization Impact

| Precision | Bit Width | Model Size Reduction (vs FP32) | Typical Accuracy Impact |
|-----------|-----------|-------------------------------|------------------------|
| FP32 | 32 | 1× (baseline) | None (baseline) |
| FP16 / BF16 | 16 | 2× | Minimal |
| INT8 | 8 | 4× | Small (< 1%) |
| INT4 | 4 | 8× | Moderate |

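> Quantization error is commonly measured as the **MSE** between the original and quantized tensors. The clipping threshold that minimizes it can be found iteratively (e.g., with Newton-Raphson).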
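To illustrate MSE-based clipping, here is a minimal pure-Python sketch of symmetric uniform quantization. It uses a plain grid search over clipping thresholds rather than Newton-Raphson, and the weight values (including one outlier) are made up for the example.

```python
def quantize(xs, clip, bits=8):
    """Symmetric uniform quantization to `bits` bits over the range [-clip, clip]."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    return [round(max(-clip, min(clip, x)) / scale) * scale for x in xs]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def best_clip(xs, bits=8, steps=50):
    """Grid search (not Newton-Raphson) for the clip value with the lowest MSE."""
    max_abs = max(abs(x) for x in xs)
    candidates = [max_abs * (i + 1) / steps for i in range(steps)]
    return min(candidates, key=lambda c: mse(xs, quantize(xs, c, bits)))

# Made-up weights: values in [-1, 1] plus a single large outlier.
weights = [i / 100 for i in range(-100, 101)] + [8.0]

for bits in (8, 4):
    clip = best_clip(weights, bits)
    err = mse(weights, quantize(weights, clip, bits))
    print(f"INT{bits}: best clip ~ {clip:.2f}, MSE = {err:.6f}")
```

At low bit widths, clipping the outlier can lower the overall MSE because the remaining values get a finer quantization step.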

---

## Efficiency Optimization Levers (Summary)

| Lever | Reduces | Metric Impact |
|-------|---------|---------------|
| **Pruning** | \#Parameters, MACs | ↓ Model Size, ↓ FLOPs, ↓ Latency |
| **Quantization** | Bit Width | ↓ Model Size, ↓ Memory BW, ↑ OPS/W |
| **Knowledge Distillation** | Model complexity | ↓ Params while preserving accuracy |
| **Depthwise Separable Conv** | MACs, Params | ↓ FLOPs by ~8-9× vs standard 3×3 conv |
| **GQA / MQA** | KV Cache size | ↓ Memory for LLM serving |
| **Linear Attention** | Attention cost | $O(N)$ vs $O(N^2)$ |
| **Winograd Transform** | Conv MACs | 2.25× reduction for 3×3 |
| **In-place DWConv** | Peak SRAM | Fits MCU memory constraints |
| **LoRA (PEFT)** | Training memory | Trains only low-rank (rank-$r$) adapter matrices; base weights stay frozen |
| **Sparse Backprop** | Training activations | Stores ~1/4 of forward activations |
| **TSM (Video)** | Extra params/FLOPs | 0 overhead temporal modeling |
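Two of the levers above reduce to short closed-form ratios, sketched here under simplifying assumptions: the separable convolution is a depthwise $k \times k$ conv followed by a 1×1 pointwise conv, and the LoRA adapter is a pair of rank-$r$ matrices on a single weight matrix. Layer sizes are hypothetical.

```python
def separable_reduction(c_out: int, k: int = 3) -> float:
    """Standard conv MACs / (depthwise + pointwise) MACs.

    Standard: c_i*c_o*k^2*h*w; separable: c_i*k^2*h*w + c_i*c_o*h*w,
    so c_i, h and w cancel, leaving c_o*k^2 / (k^2 + c_o).
    """
    return (c_out * k * k) / (k * k + c_out)

def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable adapter weights r*(d_in + d_out) relative to the full d_in*d_out matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)

print(f"{separable_reduction(128):.2f}x, {separable_reduction(256):.2f}x")
print(f"{lora_trainable_fraction(4096, 4096, 8):.4%} of the weights are trainable")
```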

---

## References

- [MIT HAN Lab](https://hanlab.mit.edu/): [TinyML & Efficient AI lectures (Lec02-Basics)](https://hanlab.mit.edu/courses/2024-fall-65940)