Making large language models smaller, faster, and more accessible
Modern LLMs require gigabytes of memory just to store their parameters. A 7B parameter model at FP16 needs 14GB, while a 175B model needs 350GB. This creates barriers for deployment, especially on edge devices and in memory-constrained environments.
Quantization reduces memory usage by representing weights and activations with fewer bits, trading some precision for dramatic memory savings.
Quantization converts floating-point numbers to lower-precision integers, cutting memory requirements by 2-4x relative to FP16 while preserving model quality through specialized techniques.
By quantizing from FP16 (16 bits) to INT8 (8 bits) or even INT4 (4 bits), we can fit larger models on the same hardware and often achieve faster inference through specialized integer arithmetic.
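To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python (weights only; it ignores activations, the KV cache, and runtime overhead):

```python
# Approximate parameter memory at different precisions (weights only).
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed just to store the parameters, in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for num_params, name in [(7e9, "7B"), (175e9, "175B")]:
    row = ", ".join(f"{d}: {weight_memory_gb(num_params, d):.1f} GB"
                    for d in BYTES_PER_PARAM)
    print(f"{name}: {row}")
# 7B:   FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
# 175B: FP32: 700.0 GB, FP16: 350.0 GB, INT8: 175.0 GB, INT4: 87.5 GB
```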
The architecture behind modern language models
Transformer models stack identical layers, each containing a multi-head self-attention block, a feed-forward network, and normalization layers.
During inference, the attention mechanism stores Key and Value vectors for all previously generated tokens in a KV cache. For sequence length L and hidden dimension H, the cache requires 2 × L × H values per layer.
At INT8, a 2048-token sequence with a 4096-dim hidden size needs 16MB of KV cache per layer (32MB at FP16), making the cache a prime target for quantization.
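The sketch below reproduces that arithmetic; the 32-layer total is an illustrative assumption, not a property of any particular model:

```python
def kv_cache_bytes(seq_len: int, hidden_dim: int, bytes_per_value: float) -> int:
    """Per-layer KV cache size: 2 (K and V) x seq_len x hidden_dim values."""
    return int(2 * seq_len * hidden_dim * bytes_per_value)

seq_len, hidden_dim = 2048, 4096
for name, nbytes in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    per_layer = kv_cache_bytes(seq_len, hidden_dim, nbytes)
    print(f"{name}: {per_layer / 2**20:.0f} MB per layer, "
          f"{32 * per_layer / 2**30:.2f} GB for an assumed 32-layer model")
# INT8 gives the 16 MB per layer quoted above; FP16 would need 32 MB.
```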
The core components of neural networks
Fixed parameters learned during training. In a Linear layer y = Wx + b, W is the weight matrix. Weights typically range from about -2 to 2 after normalization.
Dynamic values flowing through the network during inference. These are intermediate outputs between layers. Range can vary widely, with outliers requiring special handling.
Stores Key and Value vectors from attention layers for efficient autoregressive generation. Grows linearly with sequence length. Typically bounded values but can have outliers.
Additive parameters in Linear layers. Usually small in magnitude (often < 1) and sometimes omitted in modern architectures for simplicity.
| Component | Role | Typical Range | Quantization Challenge |
|---|---|---|---|
| Weights | Fixed transformations | [-2, 2] (normalized) | Per-channel variation |
| Activations | Dynamic data flow | [-6, 6] with outliers | Outliers and per-token variation |
| KV Cache | Attention memory | [-3, 3] | Sequence-dependent scaling |
Consider a simple Linear layer: y = Wx + b
This tiny example shows how weights (W) transform inputs (x) through matrix multiplication, with biases (b) adding an offset. At scale, LLMs perform billions of such operations.
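A toy NumPy version of that layer (the shapes and values here are arbitrary, chosen only for illustration):

```python
import numpy as np

# A tiny Linear layer: 4 input features -> 3 output features.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 4))   # weights: fixed after training
b = rng.normal(scale=0.1, size=3)        # bias: small additive offset
x = rng.normal(size=4)                   # activation: changes with every input

y = W @ x + b                            # y = Wx + b
print(y)                                 # three output features
```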
Understanding the fundamentals of numerical precision reduction
Number of bits used to represent each number. Fewer bits = less precision but more memory savings.
Scale and zero-point map between the floating-point and integer ranges:
These parameters enable reversible conversion between precisions.
Symmetric: the zero-point is 0 and the INT8 range is [-127, 127]. Simpler, but may not use the full integer range optimally.
Asymmetric: a non-zero zero-point allows better range utilization; the UINT8 range is [0, 255].
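The two schemes can be contrasted on a toy tensor; this NumPy sketch is purely illustrative:

```python
import numpy as np

x = np.array([-0.8, -0.1, 0.0, 0.4, 1.2], dtype=np.float32)

# Symmetric INT8: zero-point fixed at 0, integer range [-127, 127].
scale_sym = np.abs(x).max() / 127
q_sym = np.clip(np.round(x / scale_sym), -127, 127).astype(np.int8)
deq_sym = q_sym.astype(np.float32) * scale_sym

# Asymmetric UINT8: a zero-point shifts the range onto [0, 255].
scale_asym = (x.max() - x.min()) / 255
zero_point = np.round(-x.min() / scale_asym)
q_asym = np.clip(np.round(x / scale_asym) + zero_point, 0, 255).astype(np.uint8)
deq_asym = (q_asym.astype(np.float32) - zero_point) * scale_asym

print(q_sym, deq_sym)    # integers and the values reconstructed from them
print(q_asym, deq_asym)
```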
Per-tensor: One scale for all values
Per-channel: Separate scale per output channel
Per-group: Groups of channels share scales
The round operation introduces quantization error. Better scale selection minimizes this error.
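Granularity is where that error shows up most clearly. The sketch below compares per-tensor and per-channel scales on a toy weight matrix whose channels span very different ranges (values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Rows (output channels) deliberately span very different magnitudes.
W = rng.normal(size=(4, 64)) * np.array([[0.05], [0.2], [1.0], [4.0]])

def quantize_dequantize_int8(w, scale):
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale  # dequantize straight back to measure the rounding error

per_tensor = quantize_dequantize_int8(W, np.abs(W).max() / 127)
per_channel = quantize_dequantize_int8(W, np.abs(W).max(axis=1, keepdims=True) / 127)

print("per-tensor MSE :", np.mean((W - per_tensor) ** 2))
print("per-channel MSE:", np.mean((W - per_channel) ** 2))
# Per-channel scales track each row's range, so the error drops sharply.
```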
Adjust the bit-width to see how quantization affects a sample vector:
Understanding the landscape of quantization approaches
Post-training quantization (PTQ): quantize already-trained models without retraining. Fast, but may lose accuracy at very low bit-widths.
Quantization-aware training (QAT): simulate quantization during training. Better accuracy, but requires access to the training pipeline.
PTQ with fine-tuning: PTQ followed by brief fine-tuning. Balances speed and accuracy for most practical scenarios.
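QAT is usually implemented with "fake quantization" plus a straight-through estimator so gradients can flow through the rounding step. A minimal PyTorch sketch of that idea (symmetric per-tensor quantization; illustrative, not a full training loop):

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize in the forward pass, identity in the backward pass."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max() / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Straight-through estimator: gradients behave as if quantization were identity.
    return x + (x_q - x).detach()

w = torch.randn(16, 16, requires_grad=True)
loss = fake_quantize(w).pow(2).sum()
loss.backward()            # gradients reach w despite the non-differentiable round
print(w.grad is not None)  # True
```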
| Target | Description | Benefits | Challenges |
|---|---|---|---|
| Weight-only | Only model weights quantized, activations remain FP16 | Easy to implement, good memory savings | Limited speedup, mixed precision overhead |
| Weights + Activations | Both weights and activations quantized | Maximum speedup, significant memory savings | Activation outliers, accuracy drops |
| KV Cache | Only attention cache quantized | Reduces memory for long sequences | Affects generation quality gradually |
| Low-bit Finetuning | Special formats like NF4 for efficient updates | Enables large model adaptation | Requires custom kernels |
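To make the weight-only row concrete, here is a minimal PyTorch sketch of an INT8 weight-only Linear layer with per-output-channel scales; it dequantizes on the fly, whereas a real deployment would use a fused integer kernel (class and variable names are my own):

```python
import torch

class Int8WeightOnlyLinear(torch.nn.Module):
    """Stores weights in INT8; activations stay in floating point."""

    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        w = linear.weight.data                                  # (out, in)
        self.scale = w.abs().amax(dim=1, keepdim=True) / 127    # per-channel scale
        self.weight_q = torch.round(w / self.scale).to(torch.int8)
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight_q.float() * self.scale                  # dequantize for matmul
        return torch.nn.functional.linear(x, w, self.bias)

fp_layer = torch.nn.Linear(4096, 4096)
q_layer = Int8WeightOnlyLinear(fp_layer)
x = torch.randn(1, 4096)
print((fp_layer(x) - q_layer(x)).abs().max())   # small quantization error
```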
Practical techniques and their tradeoffs
What: GPTQ, post-training quantization using an optimal brain surgeon-style approach
Targets: Weights (INT4/INT8)
Why it helps: Minimizes quantization error layer by layer
Tradeoffs: Slow quantization process; needs calibration data
What: AWQ, activation-aware weight quantization
Targets: Weights (INT4) based on activation importance
Why it helps: Protects salient weights with higher precision
Tradeoffs: Requires activation statistics, mixed-precision overhead
What: SmoothQuant, which smooths activation distributions to reduce outliers
Targets: Weights + Activations (INT8)
Why it helps: Balances quantization difficulty between weights and activations
Tradeoffs: Requires a calibration pass, slight accuracy variance
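The cross-layer scaling idea can be sketched in a few lines: migrate part of the activation range into the next layer's weights with a per-channel factor s_j = max|X_j|^α / max|W_j|^(1-α). The NumPy code below is a simplified illustration with α = 0.5 and made-up shapes, not the full method:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))        # activations (tokens x input channels)
X[:, 7] *= 30.0                       # one channel with large outliers
W = rng.normal(size=(64, 64))         # next layer's weight (input x output)

alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s                      # activations become easier to quantize
W_smooth = W * s[:, None]             # weights absorb the scale

print(np.allclose(X @ W, X_smooth @ W_smooth))        # True: the output is unchanged
print(np.abs(X).max(), "->", np.abs(X_smooth).max())  # activation outliers shrink
```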
What: Zero-shot quantization without calibration
Targets: Weights + Activations (INT8)
Why it helps: Works out of the box without calibration data
Tradeoffs: May underperform data-dependent methods
What: LLM.int8(), mixed precision with outlier handling
Targets: Weights (FP16 for outliers, INT8 for the rest)
Why it helps: Preserves accuracy while quantizing most weights
Tradeoffs: Requires runtime outlier detection, only partial speedup
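A rough NumPy sketch of the decomposition idea: activation dimensions containing outliers are routed through a floating-point matmul, the rest through INT8. The threshold, shapes, and per-tensor scaling here are simplifying assumptions (the actual method uses vector-wise scales):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64)).astype(np.float32)
X[:, 3] *= 25.0                                   # an outlier feature dimension
W = rng.normal(size=(64, 32)).astype(np.float32)

threshold = 6.0
outlier_cols = np.abs(X).max(axis=0) > threshold  # input dims holding outliers

def int8_matmul(a, b):
    """Symmetric per-tensor INT8 quantization of both operands, then matmul."""
    sa, sb = np.abs(a).max() / 127, np.abs(b).max() / 127
    qa = np.round(a / sa).astype(np.int32)
    qb = np.round(b / sb).astype(np.int32)
    return (qa @ qb).astype(np.float32) * sa * sb

# Outlier columns stay in floating point; everything else runs through INT8.
y = (int8_matmul(X[:, ~outlier_cols], W[~outlier_cols]) +
     X[:, outlier_cols] @ W[outlier_cols])

print(np.abs(y - X @ W).max())   # close to the full-precision result
```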
What: QuaRot, quantization with rotation transformations
Targets: Weights + Activations (INT4/INT8)
Why it helps: Reduces outlier impact through Hadamard rotations
Tradeoffs: Additional compute for rotations, kernel complexity
What: QLoRA, low-rank adaptation with 4-bit NormalFloat (NF4) weights
Targets: Adapter weights trained on top of a frozen NF4 base model
Why it helps: Enables fine-tuning large models on a single GPU
Tradeoffs: Only for fine-tuning, not inference acceleration
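In practice this is typically done through the Hugging Face stack (transformers + peft + bitsandbytes). The sketch below assumes recent versions of those libraries; the model id and target module names are placeholders that depend on the architecture you load:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model with 4-bit NF4 weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "your-base-model",                 # placeholder model id
    quantization_config=bnb_config,
)

# Attach small trainable LoRA adapters; only these are updated during fine-tuning.
lora_config = LoraConfig(r=16, lora_alpha=32,
                         target_modules=["q_proj", "v_proj"])  # model-dependent
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```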
What: KV cache quantization, quantizing the attention cache separately from the weights
Targets: KV Cache (INT8/INT4)
Why it helps: Reduces memory for long-context generation
Tradeoffs: Quality degrades at very low bit-widths, per-channel scaling overhead
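A minimal sketch of per-token symmetric INT8 quantization of cached Keys and Values (PyTorch; real implementations often use per-channel or per-head scales, and the tensor shape below is an assumption):

```python
import torch

def quantize_kv(kv: torch.Tensor, num_bits: int = 8):
    """Symmetric quantization with one scale per token and head (last axis)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = kv.abs().amax(dim=-1, keepdim=True) / qmax
    q = torch.clamp(torch.round(kv / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Cache entries shaped (batch, heads, seq_len, head_dim).
k = torch.randn(1, 32, 2048, 128)
q_k, scale_k = quantize_kv(k)
print((dequantize_kv(q_k, scale_k) - k).abs().max())      # small reconstruction error
print(q_k.element_size(), "byte/value vs", k.element_size(), "bytes at FP32")
```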
Experiment with quantization parameters in real-time
Configure quantization settings and observe the impact on memory and accuracy:
Understanding the practical implications of quantization
| Bit-width | Memory Reduction | Speed Improvement | Quality Impact | Best For |
|---|---|---|---|---|
| FP16 (16 bits) | 2x vs FP32 | 1.5-2x | Minimal | General use, GPUs with tensor cores |
| INT8 (8 bits) | 4x vs FP32 | 2-4x | Small to moderate | Edge devices, CPUs, inference servers |
| INT4 (4 bits) | 8x vs FP32 | 3-6x (with optimization) | Moderate to significant | Memory-constrained deployment |
| Mixed Precision | 2-6x | 1.5-3x | Minimal to small | Balance of quality and efficiency |
For long sequences, KV cache quantization provides significant memory savings:
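The scaling behind that claim can be sketched as follows; the 32-layer, 4096-dim configuration is an assumption used only to show the trend:

```python
def kv_cache_gb(seq_len, hidden_dim=4096, num_layers=32, bytes_per_value=2):
    """Full-model KV cache: 2 (K and V) x layers x seq_len x hidden_dim values."""
    return 2 * num_layers * seq_len * hidden_dim * bytes_per_value / 2**30

for seq_len in (4_096, 32_768, 131_072):
    fp16, int8, int4 = (kv_cache_gb(seq_len, bytes_per_value=b) for b in (2, 1, 0.5))
    print(f"{seq_len:>7} tokens: FP16 {fp16:5.1f} GB | "
          f"INT8 {int8:5.1f} GB | INT4 {int4:5.1f} GB")
# At 128K tokens the cache shrinks from ~64 GB (FP16) to ~16 GB (INT4)
# for this assumed configuration.
```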
Key terms in model quantization
Activation-aware quantization: Quantization technique that considers activation statistics when determining weight quantization parameters.
Calibration: Process of collecting statistics from a small dataset to determine optimal quantization parameters.
Dequantization: Conversion from quantized integer values back to floating-point for computation.
INT8 / INT4: 8-bit and 4-bit integer representations used for quantized values.
Mixed precision: Using different precisions for different parts of the model (e.g., INT4 weights, FP16 activations).
Outliers: Values far from the typical range that are difficult to quantize accurately.
Per-channel quantization: Using separate quantization parameters for each output channel of a layer.
Per-tensor quantization: Using a single set of quantization parameters for all values in a tensor.
Post-training quantization (PTQ): Quantizing a model after training without additional fine-tuning.
Quantization-aware training (QAT): Training process that simulates quantization effects to produce quantization-ready models.
Scale: Multiplier used to map floating-point values to the integer range during quantization.
Zero-point: Offset added during quantization to handle asymmetric value distributions.
Further reading and resources
GPTQ: Introduces layer-wise weight quantization using an optimal brain surgeon-style approach
AWQ: Weight quantization based on activation magnitude distribution
LLM.int8(): Mixed-precision approach handling outliers in LLMs
SmoothQuant: Mathematical framework to balance quantization difficulty between weights and activations
QLoRA: 4-bit NormalFloat (NF4) for efficient fine-tuning
Zero-shot quantization without calibration data
QuaRot: Uses rotations to handle activation outliers
Library with quantization support for various methods
Experiment with different quantization methods in real-time
GPTQ: Post-training quantization with error-compensated rounding
AWQ: Activation-aware weight quantization protecting salient channels
SmoothQuant: Shifts activation outliers into weights via cross-layer scaling
LLM.int8(): Mixed precision routing outlier channels to FP16
QLoRA: Adapter training with 4-bit NF4 quantization
KV cache quantization: Quantize the attention KV cache to shrink memory for long contexts