Making large language models smaller, faster, and more accessible
Modern LLMs require gigabytes of memory just to store their parameters. A 7B parameter model at FP16 needs 14GB, while a 175B model needs 350GB. This creates barriers for deployment, especially on edge devices and in memory-constrained environments.
Quantization reduces memory usage by representing weights and activations with fewer bits, trading some precision for dramatic memory savings.
Quantization converts floating-point numbers to lower-precision integers, cutting memory requirements by 2-4x relative to FP16 while preserving model quality through specialized techniques.
By quantizing from FP16 (16 bits) to INT8 (8 bits) or even INT4 (4 bits), we can fit larger models on the same hardware and often achieve faster inference through specialized integer arithmetic.
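To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python (weights only; it ignores activations, the KV cache, and runtime overhead):

```python
# Approximate parameter memory at different precisions (weights only).
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed just to store the parameters, in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for num_params, name in [(7e9, "7B"), (175e9, "175B")]:
    row = ", ".join(f"{d}: {weight_memory_gb(num_params, d):.1f} GB"
                    for d in BYTES_PER_PARAM)
    print(f"{name}: {row}")
# 7B:   FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
# 175B: FP32: 700.0 GB, FP16: 350.0 GB, INT8: 175.0 GB, INT4: 87.5 GB
```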
The architecture behind modern language models
Transformer models stack identical layers, each containing a multi-head self-attention block, a feed-forward network, and normalization layers.
During inference, the attention mechanism stores Key and Value vectors for all previously generated tokens in a KV cache. For sequence length L and hidden dimension H, the cache requires 2 × L × H values per layer.
At INT8, a 2048-token sequence with a 4096-dim hidden size needs 16MB of KV cache per layer (32MB at FP16), making the cache a prime target for quantization.
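The sketch below reproduces that arithmetic; the 32-layer total is an illustrative assumption, not a property of any particular model:

```python
def kv_cache_bytes(seq_len: int, hidden_dim: int, bytes_per_value: float) -> int:
    """Per-layer KV cache size: 2 (K and V) x seq_len x hidden_dim values."""
    return int(2 * seq_len * hidden_dim * bytes_per_value)

seq_len, hidden_dim = 2048, 4096
for name, nbytes in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    per_layer = kv_cache_bytes(seq_len, hidden_dim, nbytes)
    print(f"{name}: {per_layer / 2**20:.0f} MB per layer, "
          f"{32 * per_layer / 2**30:.2f} GB for an assumed 32-layer model")
# INT8 gives the 16 MB per layer quoted above; FP16 would need 32 MB.
```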
The core components of neural networks
Fixed parameters learned during training. In a Linear layer y = Wx + b, W is the weight matrix. Weights typically range from about -2 to 2 after normalization.
Dynamic values flowing through the network during inference. These are intermediate outputs between layers. Range can vary widely, with outliers requiring special handling.
Stores Key and Value vectors from attention layers for efficient autoregressive generation. Grows linearly with sequence length. Typically bounded values but can have outliers.
Additive parameters in Linear layers. Usually small in magnitude (often < 1) and sometimes omitted in modern architectures for simplicity.
| Component | Role | Typical Range | Quantization Challenge |
|---|---|---|---|
| Weights | Fixed transformations | [-2, 2] (normalized) | Per-channel variation |
| Activations | Dynamic data flow | [-6, 6] with outliers | Outliers and per-token variation |
| KV Cache | Attention memory | [-3, 3] | Sequence-dependent scaling |
Consider a simple Linear layer: y = Wx + b
This tiny example shows how weights (W) transform inputs (x) through matrix multiplication, with biases (b) adding an offset. At scale, LLMs perform billions of such operations.
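A toy NumPy version of that layer (the shapes and values here are arbitrary, chosen only for illustration):

```python
import numpy as np

# A tiny Linear layer: 4 input features -> 3 output features.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 4))   # weights: fixed after training
b = rng.normal(scale=0.1, size=3)        # bias: small additive offset
x = rng.normal(size=4)                   # activation: changes with every input

y = W @ x + b                            # y = Wx + b
print(y)                                 # three output features
```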
Understanding the fundamentals of numerical precision reduction
Number of bits used to represent each number. Fewer bits = less precision but more memory savings.
Scale and zero-point map between the floating-point and integer ranges:
These parameters enable reversible conversion between precisions.
Symmetric: the zero-point is 0 and the INT8 range is [-127, 127]. Simpler, but may not use the full integer range optimally.
Asymmetric: a non-zero zero-point allows better range utilization; the UINT8 range is [0, 255].
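The two schemes can be contrasted on a toy tensor; this NumPy sketch is purely illustrative:

```python
import numpy as np

x = np.array([-0.8, -0.1, 0.0, 0.4, 1.2], dtype=np.float32)

# Symmetric INT8: zero-point fixed at 0, integer range [-127, 127].
scale_sym = np.abs(x).max() / 127
q_sym = np.clip(np.round(x / scale_sym), -127, 127).astype(np.int8)
deq_sym = q_sym.astype(np.float32) * scale_sym

# Asymmetric UINT8: a zero-point shifts the range onto [0, 255].
scale_asym = (x.max() - x.min()) / 255
zero_point = np.round(-x.min() / scale_asym)
q_asym = np.clip(np.round(x / scale_asym) + zero_point, 0, 255).astype(np.uint8)
deq_asym = (q_asym.astype(np.float32) - zero_point) * scale_asym

print(q_sym, deq_sym)    # integers and the values reconstructed from them
print(q_asym, deq_asym)
```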
Per-tensor: One scale for all values
Per-channel: Separate scale per output channel
Per-group: Groups of channels share scales
The round operation introduces quantization error. Better scale selection minimizes this error.
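Granularity is where that error shows up most clearly. The sketch below compares per-tensor and per-channel scales on a toy weight matrix whose channels span very different ranges (values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Rows (output channels) deliberately span very different magnitudes.
W = rng.normal(size=(4, 64)) * np.array([[0.05], [0.2], [1.0], [4.0]])

def quantize_dequantize_int8(w, scale):
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale  # dequantize straight back to measure the rounding error

per_tensor = quantize_dequantize_int8(W, np.abs(W).max() / 127)
per_channel = quantize_dequantize_int8(W, np.abs(W).max(axis=1, keepdims=True) / 127)

print("per-tensor MSE :", np.mean((W - per_tensor) ** 2))
print("per-channel MSE:", np.mean((W - per_channel) ** 2))
# Per-channel scales track each row's range, so the error drops sharply.
```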
Adjust the bit-width to see how quantization affects a sample vector:
Understanding the landscape of quantization approaches
Post-training quantization (PTQ): quantize already-trained models without retraining. Fast, but may lose accuracy at very low bit-widths.
Quantization-aware training (QAT): simulate quantization during training. Better accuracy, but requires access to the training pipeline.
PTQ with fine-tuning: PTQ followed by brief fine-tuning. Balances speed and accuracy for most practical scenarios.
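QAT is usually implemented with "fake quantization" plus a straight-through estimator so gradients can flow through the rounding step. A minimal PyTorch sketch of that idea (symmetric per-tensor quantization; illustrative, not a full training loop):

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize in the forward pass, identity in the backward pass."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max() / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Straight-through estimator: gradients behave as if quantization were identity.
    return x + (x_q - x).detach()

w = torch.randn(16, 16, requires_grad=True)
loss = fake_quantize(w).pow(2).sum()
loss.backward()            # gradients reach w despite the non-differentiable round
print(w.grad is not None)  # True
```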
| Target | Description | Benefits | Challenges |
|---|---|---|---|
| Weight-only | Only model weights quantized, activations remain FP16 | Easy to implement, good memory savings | Limited speedup, mixed precision overhead |
| Weights + Activations | Both weights and activations quantized | Maximum speedup, significant memory savings | Activation outliers, accuracy drops |
| KV Cache | Only attention cache quantized | Reduces memory for long sequences | Affects generation quality gradually |
| Low-bit Finetuning | Special formats like NF4 for efficient updates | Enables large model adaptation | Requires custom kernels |
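To make the weight-only row concrete, here is a minimal PyTorch sketch of an INT8 weight-only Linear layer with per-output-channel scales; it dequantizes on the fly, whereas a real deployment would use a fused integer kernel (class and variable names are my own):

```python
import torch

class Int8WeightOnlyLinear(torch.nn.Module):
    """Stores weights in INT8; activations stay in floating point."""

    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        w = linear.weight.data                                  # (out, in)
        self.scale = w.abs().amax(dim=1, keepdim=True) / 127    # per-channel scale
        self.weight_q = torch.round(w / self.scale).to(torch.int8)
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight_q.float() * self.scale                  # dequantize for matmul
        return torch.nn.functional.linear(x, w, self.bias)

fp_layer = torch.nn.Linear(4096, 4096)
q_layer = Int8WeightOnlyLinear(fp_layer)
x = torch.randn(1, 4096)
print((fp_layer(x) - q_layer(x)).abs().max())   # small quantization error
```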
Practical techniques and their tradeoffs
What: GPTQ, post-training quantization using an optimal brain surgeon-style approach
Targets: Weights (INT4/INT8)
Why it helps: Minimizes quantization error layer by layer
Tradeoffs: Slow quantization process; needs calibration data
What: AWQ, activation-aware weight quantization
Targets: Weights (INT4) based on activation importance
Why it helps: Protects salient weights with higher precision
Tradeoffs: Requires activation statistics, mixed-precision overhead
What: SmoothQuant, which smooths activation distributions to reduce outliers
Targets: Weights + Activations (INT8)
Why it helps: Balances quantization difficulty between weights and activations
Tradeoffs: Requires a calibration pass, slight accuracy variance
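The cross-layer scaling idea can be sketched in a few lines: migrate part of the activation range into the next layer's weights with a per-channel factor s_j = max|X_j|^α / max|W_j|^(1-α). The NumPy code below is a simplified illustration with α = 0.5 and made-up shapes, not the full method:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))        # activations (tokens x input channels)
X[:, 7] *= 30.0                       # one channel with large outliers
W = rng.normal(size=(64, 64))         # next layer's weight (input x output)

alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s                      # activations become easier to quantize
W_smooth = W * s[:, None]             # weights absorb the scale

print(np.allclose(X @ W, X_smooth @ W_smooth))        # True: the output is unchanged
print(np.abs(X).max(), "->", np.abs(X_smooth).max())  # activation outliers shrink
```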
What: Zero-shot quantization without calibration
Targets: Weights + Activations (INT8)
Why it helps: Works out of the box without calibration data
Tradeoffs: May underperform data-dependent methods
What: LLM.int8(), mixed precision with outlier handling
Targets: Weights (FP16 for outliers, INT8 for the rest)
Why it helps: Preserves accuracy while quantizing most weights
Tradeoffs: Requires runtime outlier detection, only partial speedup
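A rough NumPy sketch of the decomposition idea: activation dimensions containing outliers are routed through a floating-point matmul, the rest through INT8. The threshold, shapes, and per-tensor scaling here are simplifying assumptions (the actual method uses vector-wise scales):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64)).astype(np.float32)
X[:, 3] *= 25.0                                   # an outlier feature dimension
W = rng.normal(size=(64, 32)).astype(np.float32)

threshold = 6.0
outlier_cols = np.abs(X).max(axis=0) > threshold  # input dims holding outliers

def int8_matmul(a, b):
    """Symmetric per-tensor INT8 quantization of both operands, then matmul."""
    sa, sb = np.abs(a).max() / 127, np.abs(b).max() / 127
    qa = np.round(a / sa).astype(np.int32)
    qb = np.round(b / sb).astype(np.int32)
    return (qa @ qb).astype(np.float32) * sa * sb

# Outlier columns stay in floating point; everything else runs through INT8.
y = (int8_matmul(X[:, ~outlier_cols], W[~outlier_cols]) +
     X[:, outlier_cols] @ W[outlier_cols])

print(np.abs(y - X @ W).max())   # close to the full-precision result
```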
What: QuaRot, quantization with rotation transformations
Targets: Weights + Activations (INT4/INT8)
Why it helps: Reduces outlier impact through Hadamard rotations
Tradeoffs: Additional compute for rotations, kernel complexity
What: QLoRA, low-rank adaptation with 4-bit NormalFloat (NF4) weights
Targets: Adapter weights trained on top of a frozen NF4 base model
Why it helps: Enables fine-tuning large models on a single GPU
Tradeoffs: Only for fine-tuning, not inference acceleration
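In practice this is typically done through the Hugging Face stack (transformers + peft + bitsandbytes). The sketch below assumes recent versions of those libraries; the model id and target module names are placeholders that depend on the architecture you load:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model with 4-bit NF4 weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "your-base-model",                 # placeholder model id
    quantization_config=bnb_config,
)

# Attach small trainable LoRA adapters; only these are updated during fine-tuning.
lora_config = LoraConfig(r=16, lora_alpha=32,
                         target_modules=["q_proj", "v_proj"])  # model-dependent
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```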
What: KV cache quantization, quantizing the attention cache separately from the weights
Targets: KV Cache (INT8/INT4)
Why it helps: Reduces memory for long-context generation
Tradeoffs: Quality degrades at very low bit-widths, per-channel scaling overhead
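A minimal sketch of per-token symmetric INT8 quantization of cached Keys and Values (PyTorch; real implementations often use per-channel or per-head scales, and the tensor shape below is an assumption):

```python
import torch

def quantize_kv(kv: torch.Tensor, num_bits: int = 8):
    """Symmetric quantization with one scale per token and head (last axis)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = kv.abs().amax(dim=-1, keepdim=True) / qmax
    q = torch.clamp(torch.round(kv / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Cache entries shaped (batch, heads, seq_len, head_dim).
k = torch.randn(1, 32, 2048, 128)
q_k, scale_k = quantize_kv(k)
print((dequantize_kv(q_k, scale_k) - k).abs().max())      # small reconstruction error
print(q_k.element_size(), "byte/value vs", k.element_size(), "bytes at FP32")
```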
Experiment with quantization parameters in real-time
Configure quantization settings and observe the impact on memory and accuracy:
Understanding the practical implications of quantization
| Bit-width | Memory Reduction | Speed Improvement | Quality Impact | Best For |
|---|---|---|---|---|
| FP16 (16 bits) | 2x vs FP32 | 1.5-2x | Minimal | General use, GPUs with tensor cores |
| INT8 (8 bits) | 4x vs FP32 | 2-4x | Small to moderate | Edge devices, CPUs, inference servers |
| INT4 (4 bits) | 8x vs FP32 | 3-6x (with optimization) | Moderate to significant | Memory-constrained deployment |
| Mixed Precision | 2-6x | 1.5-3x | Minimal to small | Balance of quality and efficiency |
For long sequences, KV cache quantization provides significant memory savings:
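The scaling behind that claim can be sketched as follows; the 32-layer, 4096-dim configuration is an assumption used only to show the trend:

```python
def kv_cache_gb(seq_len, hidden_dim=4096, num_layers=32, bytes_per_value=2):
    """Full-model KV cache: 2 (K and V) x layers x seq_len x hidden_dim values."""
    return 2 * num_layers * seq_len * hidden_dim * bytes_per_value / 2**30

for seq_len in (4_096, 32_768, 131_072):
    fp16, int8, int4 = (kv_cache_gb(seq_len, bytes_per_value=b) for b in (2, 1, 0.5))
    print(f"{seq_len:>7} tokens: FP16 {fp16:5.1f} GB | "
          f"INT8 {int8:5.1f} GB | INT4 {int4:5.1f} GB")
# At 128K tokens the cache shrinks from ~64 GB (FP16) to ~16 GB (INT4)
# for this assumed configuration.
```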
Key terms in model quantization
Activation-aware quantization: Quantization technique that considers activation statistics when determining weight quantization parameters.
Calibration: Process of collecting statistics from a small dataset to determine optimal quantization parameters.
Dequantization: Conversion from quantized integer values back to floating-point for computation.
INT8 / INT4: 8-bit and 4-bit integer representations used for quantized values.
Mixed precision: Using different precisions for different parts of the model (e.g., INT4 weights, FP16 activations).
Outliers: Values far from the typical range that are difficult to quantize accurately.
Per-channel quantization: Using separate quantization parameters for each output channel of a layer.
Per-tensor quantization: Using a single set of quantization parameters for all values in a tensor.
Post-training quantization (PTQ): Quantizing a model after training without additional fine-tuning.
Quantization-aware training (QAT): Training process that simulates quantization effects to produce quantization-ready models.
Scale: Multiplier used to map floating-point values to the integer range during quantization.
Zero-point: Offset added during quantization to handle asymmetric value distributions.
Further reading and resources
GPTQ: Introduces layer-wise weight quantization using an optimal brain surgeon-style approach
AWQ: Weight quantization based on activation magnitude distribution
LLM.int8(): Mixed-precision approach handling outliers in LLMs
SmoothQuant: Mathematical framework to balance quantization difficulty between weights and activations
QLoRA: 4-bit NormalFloat (NF4) for efficient fine-tuning
Zero-shot quantization without calibration data
QuaRot: Uses rotations to handle activation outliers
Library with quantization support for various methods
Experiment with different quantization methods in real-time
GPTQ: Post-training quantization with error-compensated rounding
AWQ: Activation-aware weight quantization protecting salient channels
SmoothQuant: Shifts activation outliers into weights via cross-layer scaling
LLM.int8(): Mixed precision routing outlier channels to FP16
QLoRA: Adapter training with 4-bit NF4 quantization
KV cache quantization: Quantize the attention KV cache to shrink memory for long contexts