AI Efficiency Metrics Cheatsheet
Quick-reference guide covering memory, computation, performance, energy, and cost metrics for AI Systems — from CNNs and Transformers to TinyML and Diffusion models-based systems.
Notation Reference
Family I — Memory Metrics
Family II — Computation Metrics
Family III — Performance & Latency Metrics
Family IV — Energy, Carbon & Cost Metrics
Layer-by-Layer Formulas
Activation Formulas
Architecture-Specific Metrics
Hardware Constants Cheat Table
Quick Conversion Rules
Symbol
Meaning
$c_i$ / $c_o$
Input / output channels
$k_h$ / $k_w$
Kernel height / width
$h_i$ / $w_i$
Input feature map height / width
$h_o$ / $w_o$
Output feature map height / width
$g$
Number of groups (grouped convolution)
$n$ / $BS$
Batch size
$L$
Number of layers
$d$ / $d_{model}$
Hidden / embedding dimension
$d_{head}$
Per-head dimension ($d_{model} / \text{Heads}$ )
$V$
Vocabulary size
$N$
Sequence length (tokens)
$b$
Bit width (e.g., 32 for FP32, 16 for FP16, 8 for INT8)
$T$
Number of diffusion sampling steps
Family I — Memory Metrics
Metric
Formula
Unit
Notes
#Parameters ($W$ )
Sum of all weight tensor elements (see layer table )
count
Bias typically ignored in estimates
Model Size
$W \times b$
bits (convert to MB/GB)
Storage cost of weights on disk/flash
#Activations (Total)
$\sum_{\text{all layers}} \text{output activation elements}$
count
Sum across all feature maps
Peak #Activations
$\approx \text{Input Activation} + \text{Output Activation}$ (bottleneck layer)
count
Often the memory bottleneck in inference
Activation Memory
$\text{Total Activation} \times b$
bits → bytes
SRAM/GPU memory consumed
KV Cache (LLMs)
$BS \times L \times \text{Heads} \times d_{head} \times N \times 2 \times b$
bits → bytes
Stores K & V for autoregressive decoding
In CNN inference (especially TinyML), Peak Activations — not parameters — are the memory bottleneck. MobileNets reduce parameters but often not peak activation size.
Family II — Computation Metrics
Metric
Formula
Unit
Notes
MAC
$a \leftarrow a + b \cdot c$
count
One multiply + one accumulate
FLOP
$1 \text{ MAC} = 2 \text{ FLOPs}$
count
Floating point operations
Total FLOPs
$\text{Total MACs} \times 2$
count
OP
Same as FLOP but for non-float (e.g., INT8)
count
Generalized operation count
FLOPS / OPS
Operations per second (hardware capability)
ops/s
FLOPS = FP; OPS = general
Transformer-Specific Computation
Attention Type
Complexity (vs. sequence length $N$ )
Softmax Attention
$O(N^2)$
Linear Attention
$O(N)$
Transformer FLOPs Heuristic (Decoder-Only)
$$
\text{FLOPs per token} \approx 6 \times L \times d^2
$$
$$
\text{Estimated params} \approx V \cdot d + L \times 12 \times d^2
$$
Winograd Convolution (3×3)
Reduces cost from $9 \times C \times 4$ MACs → $16 \times C$ MACs for 4 outputs → 2.25× reduction .
Family III — Performance & Latency Metrics
$$
\boxed{\text{Latency} \approx \max\left(T_{\text{compute}},; T_{\text{memory}}\right)}
$$
Component
Formula
$T_{\text{compute}}$
$\dfrac{\text{Total OPs in model}}{\text{OPS}_{\text{hardware}}}$
$T_{\text{memory}}$
$T_{\text{activations}} + T_{\text{weights}}$
$T_{\text{activations}}$
$\dfrac{\text{Input Act. Size} + \text{Output Act. Size}}{\text{Memory Bandwidth}}$
$T_{\text{weights}}$
$\dfrac{\text{Model Size}}{\text{Memory Bandwidth}}$
$$
\text{Throughput} = \frac{\text{Total Processed Units}}{\text{Total Time (s)}}
$$
Or simply: $\text{Throughput} \approx 1 / \text{Latency}$ (for single-stream).
$$
\text{Tokens/s} = \frac{1}{T_{\text{per-token}}} = \frac{\text{OPS}_{\text{hardware}} \times \text{Utilization}}{\text{FLOPs per token}}
$$
Family IV — Energy, Carbon & Cost Metrics
Metric
Formula
Unit
Device Power Draw
$\text{TDP} \times \text{Utilization}$
W
Total Power (with PUE)
$\text{Power Draw} \times \text{PUE}$
W
Energy per Inference
$\dfrac{\text{Total Power} \times \text{Inference Time (s)}}{3600}$
Wh
OPS/W
$\dfrac{\text{OPS/s}}{\text{Total Power (W)}}$
ops/W
OPS/Wh
$\dfrac{\text{OPS per inference}}{\text{Energy per inference (Wh)}}$
ops/Wh
IPS/W
$\dfrac{\text{Inferences/s}}{\text{Total Power (W)}}$
inf/W
TPS/W
$\dfrac{\text{Tokens/s} \times BS}{\text{Total Power (W)}}$
tok/W
Carbon/Inference
$\dfrac{\text{Energy (kWh)} \times \text{Grid Intensity (gCO₂/kWh)}}{1000}$
kgCO₂
Cost/Inference
$\text{Energy (kWh)} \times \text{Electricity Price (USD/kWh)}$
USD
DRAM access ≈ 200× more energy than a 32-bit arithmetic operation. Minimizing data movement is often more impactful than reducing FLOPs.
$$
\text{Energy} \propto \text{Data Movement} \rightarrow \text{More Memory References} \rightarrow \text{More Energy}
$$
Bias ignored. Batch size $n = 1$.
Layer Type
#Parameters
MACs
Fully-Connected (Linear)
$c_o \cdot c_i$
$c_o \cdot c_i$
Standard Convolution
$c_o \cdot c_i \cdot k_h \cdot k_w$
$c_o \cdot c_i \cdot k_h \cdot k_w \cdot h_o \cdot w_o$
Grouped Convolution
$\dfrac{c_o \cdot c_i \cdot k_h \cdot k_w}{g}$
$\dfrac{c_o \cdot c_i \cdot k_h \cdot k_w \cdot h_o \cdot w_o}{g}$
Depthwise Convolution
$c_o \cdot k_h \cdot k_w$
$c_o \cdot k_h \cdot k_w \cdot h_o \cdot w_o$
1×1 Convolution
$c_o \cdot c_i$
$c_o \cdot c_i \cdot h_o \cdot w_o$
Depthwise = Grouped Conv where $g = c_i = c_o$ . 1×1 Conv = Standard Conv where $k_h = k_w = 1$ .
Per-Layer Activation Sizes
Layer Type
Input Activation
Output Activation
CNN Layer
$n \cdot c_i \cdot h_i \cdot w_i$
$n \cdot c_o \cdot h_o \cdot w_o$
Linear Layer
$n \cdot c_i$
$n \cdot c_o$
Transformer Layer
$BS \cdot N \cdot d_{model}$
$BS \cdot N \cdot d_{model}$
Metric
What It Measures
Formula
Peak #Activations
Max memory at any single point (HW constraint)
$\max_{\text{layer } l}\left(\text{Input}_l + \text{Output}_l\right)$
Total #Activations
Sum of all feature maps across all layers
$\sum_{\text{all layers}} \text{Output}_l$
Activation Memory
Byte cost of activations
$\text{Total Activation} \times b / 8$ bytes
During on-device training , all intermediate activations from the forward pass must be stored for backpropagation — making memory $\gg$ inference-only. Sparse backprop can store only ~1/4 of activations.
Architecture-Specific Metrics
Metric
Details
Peak Activations
Primary memory bottleneck; often larger than weights
In-place Depthwise Conv
Overwrites input buffer → memory = $\max(\text{In}, \text{Out})$ instead of $\text{In} + \text{Out}$
Winograd (3×3)
2.25× MAC reduction
Vision Transformers (ViT)
Metric
Formula / Details
Initial Token Count
$\dfrac{H \times W}{\text{PatchSize}^2}$
Attention Cost
$O(N^2)$ for softmax; $O(N)$ for linear attention
Metric
Formula / Details
KV Cache (MHA)
$BS \times L \times \text{Heads} \times d_{head} \times N \times 2 \times b$
GQA Cache
~$8 \times $ smaller than MHA
MQA Cache
~$64 \times $ smaller than MHA
FLOPs/token
$\approx 6 \times L \times d^2$
Estimated Params
$\approx V \cdot d + 12 \cdot L \cdot d^2$
Effective Context (StreamingLLM)
Uses "Attention Sinks" to handle context beyond window
Metric
Formula / Details
Total Latency
$\text{Latency}_{\text{step}} \times T$ (sampling steps)
FLOPs Scaling
Linear with number of denoising steps $T$
Metric
Constraint
Peak SRAM
Must fit within ~320 KB
Flash Usage
= Model Size (weights only)
SRAM Reuse (In-place DWConv)
Output overwrites input buffer
Metric
Details
TSM (Temporal Shift Module)
0 extra params, 0 extra MACs for temporal modeling; increases data movement cost
Throughput
Measured in Videos/s
Metric
Details
Perceiver Resampler
Maps variable visual features → fixed visual tokens (e.g., 5 tokens in Flamingo)
Cross-Attention Overhead
Cost between visual encoder and LLM backbone
Metric
Details
Sparsity
Voxelized point clouds typically $<0.1%$ dense
PVCNN (Point-Voxel)
Balances random memory access (point) vs. regular compute (voxel)
Hardware Constants Cheat Table
Define these before computing Family III & IV metrics:
Parameter
Symbol
Example Value
Unit
Processor Peak Performance
$\text{OPS}_{\text{hw}}$
100
TOPS or TFLOPS
Memory Bandwidth
$\text{BW}_{\text{mem}}$
900
GB/s
TDP (Thermal Design Power)
$\text{TDP}$
300
W
Bit Width
$b$
16
bits
Device Utilization
$U$
0.7
fraction
PUE (Power Usage Effectiveness)
$\text{PUE}$
1.2
ratio
Grid Carbon Intensity
—
400
gCO₂/kWh
Electricity Cost
—
0.20
USD/kWh
Common Energy Costs (Relative)
Operation
Relative Energy
32-bit INT Add
1× (baseline)
32-bit FP Multiply
~4×
SRAM Read
~6×
DRAM Read
~200×
From
To
Rule
MACs
FLOPs
$\times 2$
MACs
OPs
$\times 2$ (non-float)
Parameters
Model Size (bytes)
$\times b / 8$
Activations
Memory (bytes)
$\times b / 8$
TFLOPS
FLOPS
$\times 10^{12}$
Wh
kWh
$\div 1000$
Latency (s)
Throughput (units/s)
$1 / \text{Latency}$
Energy (kWh) → Carbon
$\text{kWh} \times \text{gCO}_2\text{/kWh}$
gCO₂
Energy (kWh) → Cost
$\text{kWh} \times \text{price}$
USD
Compute-Bound vs. Memory-Bound Decision
$$
\text{Arithmetic Intensity} = \frac{\text{Total OPs}}{\text{Total Bytes Moved}}
$$
Condition
Regime
Bottleneck
$\text{Arithmetic Intensity} > \frac{\text{OPS/hw}}{\text{BW/mem}}$
Memory-bound
Data movement
$\text{Arithmetic Intensity} < \frac{\text{OPS/hw}}{\text{BW/mem}}$
Compute-bound
Arithmetic
Use the Roofline Model to visualize where your workload sits.
Distributed / Scaling Metrics
Metric
Formula
Scalability Ratio
$\dfrac{\text{Throughput with } N \text{ GPUs}}{\text{Throughput with 1 GPU}}$
Communication Overhead
$\dfrac{T_{\text{data transfer}}}{T_{\text{compute}} + T_{\text{data transfer}}}$
Precision
Bit Width
Model Size Reduction (vs FP32)
Typical Accuracy Impact
FP32
32
1× (baseline)
—
FP16 / BF16
16
2×
Minimal
INT8
8
4×
Small (< 1%)
INT4
4
8×
Moderate
Quantization error measured via MSE (Newton-Raphson clipping optimization).
Efficiency Optimization Levers (Summary)
Lever
Reduces
Metric Impact
Pruning
#Parameters, MACs
↓ Model Size, ↓ FLOPs, ↓ Latency
Quantization
Bit Width
↓ Model Size, ↓ Memory BW, ↑ OPS/W
Knowledge Distillation
Model complexity
↓ Params while preserving accuracy
Depthwise Separable Conv
MACs, Params
↓ FLOPs by ~8-9× vs standard conv
GQA / MQA
KV Cache size
↓ Memory for LLM serving
Linear Attention
Attention cost
$O(N)$ vs $O(N^2)$
Winograd Transform
Conv MACs
2.25× reduction for 3×3
In-place DWConv
Peak SRAM
Fits MCU memory constraints
LoRA (PEFT)
Training memory
Updates only rank-$r$ of weights
Sparse Backprop
Training activations
Stores ~1/4 of forward activations
TSM (Video)
Extra params/FLOPs
0 overhead temporal modeling