The Challenge: Models Are Massive

Modern LLMs require gigabytes of memory just to store their parameters. A 7B parameter model at FP16 needs 14GB, while a 175B model needs 350GB. This creates barriers for deployment, especially on edge devices and in memory-constrained environments.

Quantization reduces memory usage by representing weights and activations with fewer bits, trading some precision for dramatic memory savings.

(Chart: model memory requirements by precision.)

Key Takeaway

Quantization converts floating-point numbers to lower-precision integers, cutting memory requirements by 2-4x while maintaining model quality with specialized techniques.

By quantizing from FP16 (16 bits) to INT8 (8 bits) or even INT4 (4 bits), we can fit larger models on the same hardware and often achieve faster inference through specialized integer arithmetic.
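
As a quick sanity check, these figures follow directly from parameter count times bytes per parameter. The sketch below (helper name ours; it counts only the weights, not activations or the KV cache) reproduces them:

# Weight-only memory footprint; ignores KV cache, activations, and runtime overhead.
def weight_memory_gb(num_params, bits_per_param):
    return num_params * bits_per_param / 8 / 1e9

for label, params in [("7B", 7e9), ("175B", 175e9)]:
    for fmt, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{label} @ {fmt}: {weight_memory_gb(params, bits):.1f} GB")
# 7B: 28.0 / 14.0 / 7.0 / 3.5 GB    175B: 700.0 / 350.0 / 175.0 / 87.5 GB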

Transformer Layers

Transformer models stack identical layers, each containing:

  • Multi-Head Attention: Computes relationships between tokens
  • Feed-Forward Network (MLP): Processes each token independently
  • Layer Normalization: Stabilizes training and inference
  • Residual Connections: Preserve information flow
(Diagram: a transformer block. Input flows through multi-head attention (weights Q, K, V, O; activations: attention scores), Add & LayerNorm, the feed-forward network (weights W1, W2; activations: hidden states), another Add & LayerNorm, and out to the next layer, with residual connections throughout and a KV cache storing K and V for all tokens.)

KV Cache Growth

During inference, the attention mechanism stores Key and Value vectors for all previously generated tokens in a KV cache. For sequence length L and hidden dimension H, the cache requires 2 × L × H values per layer.

At INT8, a 2048-token sequence with 4096-dim embeddings needs 16MB of KV cache per layer—significant memory that benefits from quantization.

Definitions

Weights

Fixed parameters learned during training. In a Linear layer y = Wx + b, W is the weight matrix. Weights typically range from about -2 to 2 after normalization.

Activations

Dynamic values flowing through the network during inference. These are intermediate outputs between layers. Range can vary widely, with outliers requiring special handling.

KV Cache

Stores Key and Value vectors from attention layers for efficient autoregressive generation. Grows linearly with sequence length. Typically bounded values but can have outliers.

Biases

Additive parameters in Linear layers. Usually small in magnitude (often < 1) and sometimes omitted in modern architectures for simplicity.

Comparison Table

Component | Role | Typical Range | Quantization Challenge
Weights | Fixed transformations | [-2, 2] (normalized) | Per-channel variation
Activations | Dynamic data flow | [-6, 6] with outliers | Outliers and per-token variation
KV Cache | Attention memory | [-3, 3] | Sequence-dependent scaling

Linear Layer Example

Consider a simple Linear layer: y = Wx + b

W = [[ 0.5, -0.3],
     [ 1.2,  0.8]]        // 2x2 weight matrix
x = [0.7, -1.5]           // Input vector
b = [0.1, -0.2]           // Bias vector

y = Wx + b
  = [0.5*0.7 + (-0.3)*(-1.5) + 0.1,  1.2*0.7 + 0.8*(-1.5) + (-0.2)]
  = [0.35 + 0.45 + 0.1,  0.84 - 1.2 - 0.2]
  = [0.90, -0.56]         // Output vector

This tiny example shows how weights (W) transform inputs (x) through matrix multiplication, with biases (b) adding an offset. At scale, LLMs perform billions of such operations.
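
The same arithmetic can be checked in a couple of lines of NumPy:

import numpy as np

W = np.array([[0.5, -0.3],
              [1.2,  0.8]])
x = np.array([0.7, -1.5])
b = np.array([0.1, -0.2])

print(W @ x + b)   # -> approximately [0.9, -0.56]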

Core Concepts

Bit-width

Number of bits used to represent each number. Fewer bits = less precision but more memory savings.

  • FP32: 32 bits (full precision)
  • FP16: 16 bits (half precision)
  • INT8: 8 bits (integer)
  • INT4: 4 bits (extreme quantization)

Scale & Zero-point

Map between floating-point and integer ranges:

Scale (s) = (max_fp - min_fp) / (max_int - min_int)
Zero-point (zp) = min_int - round(min_fp / s)

These two parameters define the mapping between the floating-point and integer ranges; the conversion is approximately reversible, since rounding discards some precision.

Quantization Types

Symmetric

Zero-point is 0. Range is [-127, 127] for INT8. Simpler but may not use full range optimally.

Asymmetric

Non-zero zero-point allows better range utilization. Range is [0, 255] for UINT8.

Granularity

Per-tensor: One scale for all values

Per-channel: Separate scale per output channel

Per-group: Groups of channels share scales
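
To see why granularity matters, the following minimal sketch (toy values, illustrative names) contrasts per-tensor and per-channel symmetric INT8 scales for a weight matrix whose output channels have very different magnitudes; the per-channel variant reconstructs the small channel far more accurately:

import numpy as np

W = np.array([[0.02, -0.03,  0.01],    # output channel with small weights
              [1.50, -2.10,  0.90]])   # output channel with large weights
qmax = 127                             # symmetric INT8 range

# Per-tensor: one scale for everything; the small channel collapses onto a few levels.
s_tensor = np.abs(W).max() / qmax
W_hat_tensor = np.round(W / s_tensor) * s_tensor

# Per-channel: one scale per output row; each row uses the full INT8 range.
s_channel = np.abs(W).max(axis=1, keepdims=True) / qmax
W_hat_channel = np.round(W / s_channel) * s_channel

print("per-tensor  max error:", np.abs(W - W_hat_tensor).max())
print("per-channel max error:", np.abs(W - W_hat_channel).max())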

Quantization Formula

Quantization:    q = round(y / scale) + zero_point
Dequantization:  y ≈ (q - zero_point) × scale

The round operation introduces quantization error. Better scale selection minimizes this error.
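
A minimal sketch of this round trip for asymmetric INT8, using the scale and zero-point definitions above (function names are ours, not any particular library's API):

import numpy as np

def quantize(y, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1                 # unsigned, asymmetric range
    scale = (y.max() - y.min()) / (qmax - qmin)
    zero_point = int(round(qmin - y.min() / scale))   # zp = min_int - round(min_fp / s)
    q = np.clip(np.round(y / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

y = np.array([0.15, -0.32, 0.87, -1.23, 2.45])
q, s, zp = quantize(y)
print("max abs error:", np.abs(y - dequantize(q, s, zp)).max())  # roughly scale / 2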

Interactive Demo

Adjust the bit-width to see how quantization affects a sample vector:

(Interactive widget: choose a bit-width (default 4 bits, symmetric) and the demo quantizes and dequantizes the sample vector [0.15, -0.32, 0.87, -1.23, 2.45], reporting the resulting mean squared error.)
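
For readers viewing a static copy of this page, the following sketch reproduces roughly what the demo computes, assuming symmetric quantization at the chosen bit-width:

import numpy as np

def symmetric_roundtrip(y, num_bits):
    qmax = 2 ** (num_bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    scale = np.abs(y).max() / qmax
    return np.clip(np.round(y / scale), -qmax, qmax) * scale

y = np.array([0.15, -0.32, 0.87, -1.23, 2.45])
for bits in (8, 6, 4, 3):
    y_hat = symmetric_roundtrip(y, bits)
    print(f"{bits}-bit  MSE={np.mean((y - y_hat) ** 2):.5f}  ->", np.round(y_hat, 3))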

When Applied

Post-Training Quantization (PTQ)

Quantize already-trained models without retraining. Fast but may lose accuracy at very low bit-widths.

Quantization-Aware Training (QAT)

Simulate quantization during training. Better accuracy but requires access to training pipeline.

Hybrid Small-Update

PTQ with brief fine-tuning. Balances speed and accuracy for most practical scenarios.

What Is Quantized

Target | Description | Benefits | Challenges
Weight-only | Only model weights quantized; activations remain FP16 | Easy to implement, good memory savings | Limited speedup, mixed-precision overhead
Weights + Activations | Both weights and activations quantized | Maximum speedup, significant memory savings | Activation outliers, accuracy drops
KV Cache | Only the attention cache quantized | Reduces memory for long sequences | Generation quality degrades gradually
Low-bit Finetuning | Special formats like NF4 for efficient updates | Enables large-model adaptation | Requires custom kernels

How It Works

Uniform Methods

  • Standard INT quantization: Linear mapping between FP and INT
  • Activation-aware scaling: Chooses weight scales based on activation statistics
  • Rotations: Pre-process weights to reduce outliers

Non-uniform Methods

  • Codebooks: Vector quantization with learned centroids (see the sketch after this list)
  • Per-vector schemes: Individual scales per vector
  • Sparse + Quant: Combine sparsity with quantization
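
As a concrete illustration of the codebook idea referenced above, this toy sketch builds a 16-entry codebook from the empirical quantiles of a weight tensor and stores each weight as a 4-bit index; real methods learn the centroids (for example with k-means) or use analytically derived levels such as NF4:

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=1024)                   # toy weight tensor

# 16 levels (4 bits/weight) placed at empirical quantiles, so they are
# denser where the weights are concentrated (near zero for Gaussian weights).
codebook = np.quantile(w, np.linspace(0.5 / 16, 1 - 0.5 / 16, 16))

codes = np.abs(w[:, None] - codebook).argmin(axis=1)   # 4-bit index per weight
w_hat = codebook[codes]                                # dequantized weights

print("bits per weight:", int(np.log2(codebook.size)))
print("max abs error  :", np.abs(w - w_hat).max())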

Technique Taxonomy

(Diagram: a taxonomy tree grouping the techniques below. PTQ: GPTQ (INT4/INT8), AWQ (activation-aware), SmoothQuant (smoothing), ZeroQuant (zero-shot). QAT. Hybrid fine-tuning: QLoRA (NF4 format), LoRA-Q (efficient).)

GPTQ

What: Post-training quantization based on the Optimal Brain Surgeon approach

Targets: Weights (INT4/INT8)

Why helps: Minimizes quantization error layer by layer

Tradeoffs: Slow quantization process; requires calibration data

AWQ

What: Activation-aware weight quantization

Targets: Weights (INT4) based on activation importance

Why helps: Protects salient weights with higher precision

Tradeoffs: Requires activation statistics, mixed precision overhead

SmoothQuant

What: Smoothes activation distributions to reduce outliers

Targets: Weights + Activations (INT8)

Why helps: Balances quantization difficulty between weights and activations

Tradeoffs: Requires calibration pass, slight accuracy variance
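
A minimal sketch of the smoothing idea (alpha is the migration-strength hyperparameter; shapes and names here are illustrative, not the reference implementation). The per-channel factor shrinks activation outliers and lets the weights absorb them, while the layer output X·W is mathematically unchanged:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                 # activations: [tokens, channels]
X[:, 3] *= 30.0                             # one channel with large outliers
W = rng.normal(scale=0.1, size=(8, 16))     # weights: [channels, out_features]

alpha = 0.5                                 # migration strength
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_s, W_s = X / s, W * s[:, None]            # smoothed activations and weights

assert np.allclose(X @ W, X_s @ W_s)        # the product is unchanged
print("outlier channel max |x|:", np.abs(X[:, 3]).max(), "->", np.abs(X_s[:, 3]).max())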

ZeroQuant

What: Zero-shot quantization without calibration

Targets: Weights + Activations (INT8)

Why helps: Works out-of-the-box without data

Tradeoffs: May underperform data-dependent methods

LLM.int8()

What: Mixed precision with outlier handling

Targets: Weights and activations in INT8, with outlier feature dimensions computed in FP16

Why helps: Preserves accuracy while quantizing most weights

Tradeoffs: Requires runtime outlier detection, partial speedup
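
A minimal sketch of the decomposition idea in plain NumPy (float64 stands in for FP16, and simple per-tensor scales stand in for the paper's vector-wise scheme; names are ours):

import numpy as np

def mixed_precision_matmul(X, W, threshold=6.0):
    # Feature dimensions whose activations exceed the threshold stay in float.
    outliers = np.abs(X).max(axis=0) > threshold
    y_fp = X[:, outliers] @ W[outliers, :]

    # Everything else goes through symmetric INT8 integer matmul.
    Xr, Wr = X[:, ~outliers], W[~outliers, :]
    sx, sw = np.abs(Xr).max() / 127, np.abs(Wr).max() / 127
    Xq = np.round(Xr / sx).astype(np.int32)
    Wq = np.round(Wr / sw).astype(np.int32)
    return y_fp + (Xq @ Wq) * (sx * sw)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64)); X[:, 7] *= 40.0        # one outlier feature dimension
W = rng.normal(scale=0.1, size=(64, 32))
print("max abs error:", np.abs(X @ W - mixed_precision_matmul(X, W)).max())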

QuaRot

What: Quantization with rotation transformations

Targets: Weights + Activations (INT4/INT8)

Why helps: Reduces outlier impact through Hadamard rotations

Tradeoffs: Additional compute for rotations, kernel complexity

QLoRA (NF4)

What: Low-rank adaptation with 4-bit normal float

Targets: Base-model weights frozen and stored in NF4; small LoRA adapters trained in higher precision (BF16/FP16)

Why helps: Enables fine-tuning large models on single GPU

Tradeoffs: Only for fine-tuning, not inference acceleration

KV-Quant Schemes

What: Quantize attention cache separately

Targets: KV Cache (INT8/INT4)

Why helps: Reduces memory for long context generation

Tradeoffs: Quality degrades with very low bits, per-channel overhead

Linear Layer Quantization Lab

Configure quantization settings and observe the impact on memory and accuracy:

(Interactive widget: configure weight bits (default 8), activation bits (default 8), per-channel scaling, and LLM.int8()-style outlier protection for a sample Linear layer, and observe the estimated memory footprint and MSE.)

Memory vs Speed vs Quality

Bit-width | Memory Reduction | Speed Improvement | Quality Impact | Best For
FP16 (16 bits) | 2x vs FP32 | 1.5-2x | Minimal | General use, GPUs with tensor cores
INT8 (8 bits) | 4x vs FP32 | 2-4x | Small to moderate | Edge devices, CPUs, inference servers
INT4 (4 bits) | 8x vs FP32 | 3-6x (with optimized kernels) | Moderate to significant | Memory-constrained deployment
Mixed Precision | 2-6x | 1.5-3x | Minimal to small | Balance of quality and efficiency

Speedup Considerations

Where Speedups Appear

  • Matrix multiplication with SIMD instructions
  • Reduced memory bandwidth requirements
  • Better cache utilization
  • Specialized hardware (NPUs, TPUs)

Where Speedups Don't Appear

  • Memory-bound operations (embedding lookups)
  • Small batch sizes on GPUs
  • Operations without optimized kernels
  • Dequantization overhead in mixed precision

KV Cache Savings

For long sequences, KV cache quantization provides significant memory savings:

Sequence Length: 4096 tokens
Hidden Dimension: 4096
Layers: 32

FP16 KV Cache: 4096 × 4096 × 32 × 2 × 2 bytes   = 2 GB
INT8 KV Cache: 4096 × 4096 × 32 × 2 × 1 byte    = 1 GB
INT4 KV Cache: 4096 × 4096 × 32 × 2 × 0.5 byte  = 0.5 GB

Savings: INT4 uses 4x less memory than FP16
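
The same arithmetic as a small helper (assumes batch size 1 and standard multi-head attention with no grouped-query sharing; sizes are printed in binary gigabytes):

def kv_cache_bytes(seq_len, hidden_dim, num_layers, bytes_per_value):
    # 2x for Keys and Values, stored for every token, in every layer.
    return 2 * seq_len * hidden_dim * num_layers * bytes_per_value

for fmt, nbytes in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    size = kv_cache_bytes(seq_len=4096, hidden_dim=4096, num_layers=32, bytes_per_value=nbytes)
    print(f"{fmt}: {size / 2**30:.1f} GB")   # 2.0 / 1.0 / 0.5 GB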

Activation-aware Quantization

Quantization technique that considers activation statistics when determining weight quantization parameters.

Calibration

Process of collecting statistics from a small dataset to determine optimal quantization parameters.

Dequantization

Conversion from quantized integer values back to floating-point for computation.

INT8/INT4

8-bit and 4-bit integer representations used for quantized values.

Mixed Precision

Using different precisions for different parts of the model (e.g., INT4 weights, FP16 activations).

Outliers

Values far from the typical range that are difficult to quantize accurately.

Per-channel Quantization

Using separate quantization parameters for each output channel of a layer.

Per-tensor Quantization

Using a single set of quantization parameters for all values in a tensor.

Post-training Quantization (PTQ)

Quantizing a model after training without additional fine-tuning.

Quantization-aware Training (QAT)

Training process that simulates quantization effects to produce quantization-ready models.

Scale

Multiplier used to map floating-point values to integer range during quantization.

Zero-point

Offset added during quantization to handle asymmetric value distributions.

Essential Papers

GPTQ Playground

Post-training quantization with error-compensated rounding

(Interactive widget. Tags: PTQ, weight-only, INT4/INT8. At the default 4-bit setting it reports 75% memory savings, a 4× compression ratio, and an estimated error of 2.1%.)

AWQ Playground

Activation-aware weight quantization protecting salient channels

(Interactive widget. Tags: PTQ, weight-only, activation-aware. At the default 4-bit setting it reports 75% memory savings, 64 protected channels, and 0.8% accuracy loss.)

SmoothQuant Playground

Shifts activation outliers into weights via cross-layer scaling

(Interactive widget. Tags: PTQ, weights + activations, INT8. At the default settings it reports 50% memory savings, a 2.1× throughput gain, and 85% outlier reduction.)

LLM.int8() Playground

Mixed precision routing outlier channels to FP16

(Interactive widget. Tags: runtime, mixed precision, INT8 + FP16. For a 70B model it reports 70GB memory usage, 0.1% of channels kept in FP16, and 95% performance.)

QLoRA (NF4) Playground

Adapter training with 4-bit NF4 quantization

(Interactive widget. Tags: fine-tuning, 4-bit, NF4. At the default settings it reports 0.5% trainable parameters, 80% less memory than full fine-tuning, and 0.7× training speed.)

KV Cache Quantization Playground

Quantize attention KV cache to shrink memory for long context

(Interactive widget. Tags: KV cache, runtime, INT8/INT4/FP8. At the default settings it reports a 2.1GB KV cache, 50% memory savings, and a maximum batch size of 64.)