TL;DR: GPU memory is often the bottleneck for AI workloads. HBM3e in the H200 provides 4.8 TB/s bandwidth and 141GB capacity. Match memory to your model size—don't overspend if your models fit on H100's 80GB.
Why Memory Matters
GPU memory architecture is often overlooked in AI infrastructure planning, but it's frequently the bottleneck that determines which models you can run and how efficiently they run.
Modern large language models have billions of parameters. GPT-4 is estimated to have over 1 trillion parameters. Each parameter requires memory, and during inference, you need additional memory for activations, KV cache, and other operational data.
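As a quick back-of-the-envelope check, weight memory is simply parameter count times bytes per parameter. The sketch below is illustrative only (model sizes and precisions are examples, and real deployments also need headroom for activations and KV cache), but it shows why a 70B model in FP16 spills past a single 80GB GPU:

```python
# Back-of-envelope weight-memory estimate: parameter count x bytes per parameter.
# Illustrative only; real deployments also need headroom for activations and KV cache.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Approximate memory needed just to hold the model weights, in GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for params, name in [(7e9, "7B"), (70e9, "70B")]:
    for prec in ("fp16", "int8", "int4"):
        gb = weight_memory_gb(params, prec)
        print(f"{name} {prec}: ~{gb:.0f} GB "
              f"(fits H100 80GB: {gb <= 80}, fits H200 141GB: {gb <= 141})")
```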
HBM3e: The Memory Evolution
High Bandwidth Memory (HBM) has evolved through several generations:
| Generation | Bandwidth | Typical Capacity | Example GPU |
|---|---|---|---|
| HBM2 | 900 GB/s | 16-32 GB | V100 |
| HBM2e | ~2 TB/s | 80 GB | A100 (80GB) |
| HBM3 | 3.35 TB/s | 80 GB | H100 |
| HBM3e | 4.8 TB/s | 141 GB | H200 |
Key insight: The H200's HBM3e provides more than 5x the bandwidth of the V100-era HBM2, enabling larger batch sizes and faster inference.
Memory Bandwidth vs. Capacity
Two distinct metrics matter for AI workloads:
| Metric | What It Determines | Example |
|---|---|---|
| Memory Capacity | What models you can load | A 70B model in FP16 requires ~140GB—exceeds H100's 80GB |
| Memory Bandwidth | How fast you can move data | Higher bandwidth = more tokens per second |
Practical Implications
For Training
| Factor | Impact |
|---|---|
| Larger memory | Enables bigger batch sizes |
| Higher bandwidth | Accelerates gradient synchronization |
| Multi-GPU setups | Scale more efficiently when each GPU carries more capacity and bandwidth |
For Inference
| Factor | Impact |
|---|---|
| Memory capacity | Determines maximum model size per GPU |
| Bandwidth | Affects tokens-per-second throughput |
| KV cache size | Impacts context window length |
Right-Sizing Your Memory
Not every workload needs the largest GPU. Consider these guidelines:
| Model Size | Recommended GPU | Notes |
|---|---|---|
| 7B models | H100 or A100 | Single GPU sufficient |
| 13B-30B models | H100 | Comfortable headroom |
| 70B models | H200 | Single GPU with FP8/INT8 quantization, or multi-GPU H100 |
| Models above 70B | Multi-GPU | Required regardless of GPU type |
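As a rough aid, the guidelines in the table above can be folded into a small helper. This is a sketch of the article's rules of thumb, not a definitive sizing tool; quantization, context length, and batch size can all shift the answer:

```python
# Minimal sketch of the right-sizing guidelines in the table above.
# Thresholds mirror the article's rules of thumb, not hard limits; quantization,
# context length, and batch size can shift the answer in either direction.

def recommend_gpu(model_params_billion: float) -> str:
    """Rough GPU recommendation for inference, given model size in billions of parameters."""
    if model_params_billion <= 7:
        return "H100 or A100 (single GPU sufficient)"
    if model_params_billion <= 30:
        return "H100 (comfortable headroom)"
    if model_params_billion <= 70:
        return "H200 (with FP8/INT8 quantization) or multi-GPU H100"
    return "Multi-GPU required regardless of GPU type"

print(recommend_gpu(13))   # H100 (comfortable headroom)
print(recommend_gpu(70))   # H200 (with FP8/INT8 quantization) or multi-GPU H100
```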
Memory Architecture Deep Dive
HBM3 and HBM3e (High Bandwidth Memory)
HBM uses a stacked memory architecture with through-silicon vias (TSVs) connecting multiple DRAM dies.
| Characteristic | HBM3/HBM3e | GDDR6X |
|---|---|---|
| Architecture | 8-12 DRAM dies stacked vertically | Traditional side-by-side |
| Bus Width | 1024-bit per stack | 32-bit per chip (384-bit total on an RTX 4090) |
| Bandwidth | 3.35-4.8 TB/s | ~1 TB/s |
| Power Efficiency | ~3.9 pJ/bit | ~8 pJ/bit |
Technical note: HBM3e in the H200 delivers 43% more bandwidth than the H100's HBM3, through a combination of faster per-pin signaling and a wider memory interface (the HBM3e spec supports roughly 9.2 Gbps per pin vs. 6.4 Gbps for HBM3).
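As a sanity check on these numbers, peak bandwidth is roughly bus width times per-pin data rate. Assuming the commonly cited 5120-bit (H100 SXM) and 6144-bit (H200) memory interfaces, solving for the effective pin rate shows that shipped parts run their HBM below the spec ceilings quoted above:

```python
# Back-of-envelope check: peak bandwidth = bus width (bits) x pin rate (Gbps) / 8.
# Bus widths below are the commonly cited figures for the H100 SXM and H200;
# solving for the effective pin rate shows shipped parts run below the spec ceilings.

def effective_pin_rate_gbps(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Effective per-pin data rate implied by a GPU's delivered memory bandwidth."""
    return bandwidth_gb_s * 8 / bus_width_bits

print(effective_pin_rate_gbps(3350, 5120))  # H100: ~5.2 Gbps vs. 6.4 Gbps HBM3 spec
print(effective_pin_rate_gbps(4800, 6144))  # H200: ~6.3 Gbps vs. ~9.2 Gbps HBM3e spec
```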
GDDR6X (Graphics DDR)
Consumer and professional GPUs (RTX 4090, RTX 6000 Ada) use GDDR6X:
- Lower Capacity: 24-48GB typical vs. 80-141GB HBM
- Lower Bandwidth: ~1 TB/s vs. 3-5 TB/s for HBM
- Lower Cost: Significantly cheaper per GB
- Use Case: Suitable for inference on smaller models, not large-scale training
Memory Bandwidth Requirements by Workload
| Model Size | Training | Inference | Recommended |
|---|---|---|---|
| 7B | 500+ GB/s | 200+ GB/s | GDDR6X or HBM |
| 13B | 1+ TB/s | 500+ GB/s | HBM2e or HBM3 |
| 70B | 2+ TB/s | 1+ TB/s | HBM3 or HBM3e |
| 175B+ | 3+ TB/s | 2+ TB/s | HBM3e required |
Choosing the Right Memory Configuration
Training Considerations
For training workloads, memory requirements scale with:
| Factor | Memory Impact |
|---|---|
| Model Parameters | FP16 weights require 2 bytes per parameter. A 70B model needs 140GB just for weights. |
| Optimizer States | Adam stores momentum and variance for every parameter, roughly tripling parameter-related memory (more if FP32 master copies are kept). |
| Activations | Checkpointing reduces memory but increases compute. Plan for 2-4x parameter size. |
| Batch Size | Larger batches improve throughput but require proportionally more memory. |
Rule of thumb: For training, plan for roughly 4-6x the model size in total GPU memory, distributed across your devices.
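Translating the factors in the table into numbers, here is a minimal sketch. The assumptions are mine: FP16/BF16 weights and gradients, a standard Adam-style optimizer, and activation checkpointing near the low end of the 2-4x range; real footprints vary with framework, sharding strategy, and sequence length. For a 70B model it lands around 840GB total, about 6x the 140GB FP16 weight footprint:

```python
# Training-memory estimate built from the factors in the table above.
# Assumptions: FP16/BF16 weights and gradients, an Adam-style optimizer holding
# momentum and variance, and activation checkpointing keeping activations near
# the low end of the 2-4x range. Real footprints vary with framework and sharding.

def training_memory_gb(params_billion: float, activation_multiple: float = 2.0) -> dict:
    weights = params_billion * 2        # FP16: 2 bytes/param -> GB per billion params
    gradients = weights                 # one gradient per weight, same precision
    optimizer = 2 * weights             # Adam momentum + variance
    activations = activation_multiple * weights
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "activations": activations,
        "total": weights + gradients + optimizer + activations,
    }

print(training_memory_gb(70)["total"])  # ~840 GB for a 70B model -> a multi-GPU, multi-node job
```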
Inference Considerations
Inference memory requirements are more predictable:
| Factor | Memory Impact |
|---|---|
| Model Weights | FP16 requires 2 bytes/parameter; INT8 quantization halves this |
| KV Cache | Grows with context length and batch size. For a 70B-class model at 8K context, expect anywhere from a few GB (grouped-query attention) to ~20GB (full multi-head attention) per concurrent request |
| Activation Memory | Minimal compared to training, typically <5% of total |
Rule of thumb: For inference, model weights + (KV cache × max concurrent users) determines memory needs.
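A minimal sketch of that rule of thumb, with the KV cache computed from model dimensions. The dimensions below (80 layers, 64 attention heads of size 128) are assumptions for a 70B-class model, not values from this article; models that use grouped-query attention keep far fewer KV heads, which is why per-request figures vary so widely in practice:

```python
# Inference sizing per the rule of thumb: weights + (KV cache x concurrent requests).
# Model dimensions below (80 layers, 64 heads of size 128) are assumptions for a
# 70B-class model, not values from this article; grouped-query attention (GQA)
# keeps far fewer KV heads and shrinks the cache accordingly.

def kv_cache_gb(context_len: int, n_layers: int = 80, n_kv_heads: int = 64,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Per-request KV cache: 2 (K and V) x layers x KV heads x head dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

weights_gb = 70 * 2                                 # 70B parameters at FP16
per_request_mha = kv_cache_gb(8192)                 # ~21 GB with full multi-head attention
per_request_gqa = kv_cache_gb(8192, n_kv_heads=8)   # ~2.7 GB with 8 KV heads (GQA)

concurrent_users = 4
print(weights_gb + concurrent_users * per_request_gqa)  # total GB needed for this scenario
```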
Frequently Asked Questions
How much GPU memory do I need for fine-tuning?
Fine-tuning memory requirements depend on the technique:
| Technique | Memory Requirement | Example (Llama 70B) |
|---|---|---|
| Full Fine-tuning | 4-6x model size | 280-420GB |
| LoRA/QLoRA | 1.5-2x model size | ~100GB |
| Gradient Checkpointing | 2-3x model size | 140-210GB |
Practical example: Fine-tuning Llama 70B with LoRA requires ~100GB—possible on H200 (141GB) but not H100 (80GB) without quantization.
What's the difference between capacity and bandwidth for inference?
| Metric | What It Does | Impact |
|---|---|---|
| Capacity | Determines max model size | A model that needs 90GB won't fit in an H100's 80GB |
| Bandwidth | Determines inference speed | LLM inference is memory-bound—higher bandwidth = more tokens/sec |
Example: For a 70B model, H100's 3.35 TB/s yields ~70 tokens/sec; H200's 4.8 TB/s yields ~100 tokens/sec.
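Those tokens-per-second figures follow from a simple roofline argument: at small batch sizes, every generated token has to stream the model weights (plus KV cache) from memory, so bandwidth divided by bytes read per token gives an upper bound. The sketch below lands in the same ballpark under the assumption of a 70B model quantized down to roughly 45GB at batch size 1; that assumption is mine, not the article's:

```python
# Roofline-style upper bound for memory-bound decoding: each generated token streams
# the weights (plus KV cache) from HBM, so tokens/sec <= bandwidth / bytes per token.
# The ~45 GB weight footprint below assumes a heavily quantized 70B model at batch
# size 1; real throughput also depends on kernels, batching, and scheduling.

def max_tokens_per_sec(bandwidth_tb_s: float, weights_gb: float, kv_cache_gb: float = 0.0) -> float:
    """Bandwidth-bound ceiling on decode throughput, in tokens per second."""
    gb_read_per_token = weights_gb + kv_cache_gb
    return bandwidth_tb_s * 1000 / gb_read_per_token

print(max_tokens_per_sec(3.35, 45))  # H100: ~74 tokens/sec ceiling
print(max_tokens_per_sec(4.80, 45))  # H200: ~107 tokens/sec ceiling
```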
Can I use model parallelism to work around memory limits?
Yes, but with trade-offs:
| Technique | Pros | Cons |
|---|---|---|
| Tensor Parallelism | Splits each layer's weight matrices across GPUs | Adds communication overhead (NVLink essential) |
| Pipeline Parallelism | Different layers on different GPUs | Adds latency, reduces throughput per GPU |
| ZeRO/FSDP | Shards optimizer states and parameters | Works for training, complex for inference |
Recommendation: Single-GPU inference is always simpler. If your model fits on one H200, that's preferable to two H100s with tensor parallelism.
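To make the trade-off concrete, here is a rough sketch of how tensor parallelism divides memory for an INT8/FP8-quantized 70B model (~70GB of weights plus a ~16GB KV-cache budget; both figures are assumptions). The ideal shard ignores activation and communication buffers, which is part of why the single-H200 path stays the simpler option:

```python
# Ideal per-GPU memory under tensor parallelism: weights and KV cache are split
# roughly evenly across the TP group. Real deployments add activation and
# communication buffers on top of this, so treat these as lower bounds.

def per_gpu_weights_and_cache_gb(weights_gb: float, kv_cache_gb: float, tp_degree: int) -> float:
    """Evenly sharded weights + KV cache per GPU, ignoring per-GPU overheads."""
    return (weights_gb + kv_cache_gb) / tp_degree

# 70B model quantized to INT8/FP8 (~70 GB of weights) with a ~16 GB KV-cache budget:
for tp in (1, 2, 4):
    print(f"TP={tp}: ~{per_gpu_weights_and_cache_gb(70, 16, tp):.0f} GB per GPU")
# TP=1: ~86 GB -> exceeds an 80 GB H100, fits a 141 GB H200 with room to spare
# TP=2: ~43 GB -> fits two H100s, at the cost of NVLink communication on every layer
# TP=4: ~22 GB -> plenty of headroom, but even more communication overhead
```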
Key Takeaways
| Takeaway | Why It Matters |
|---|---|
| Match memory to model size | Don't overspend on H200 if your models fit comfortably on H100 |
| Bandwidth matters for inference | Memory-bound operations scale directly with bandwidth |
| Plan for KV cache | Long-context applications can consume 50%+ of GPU memory for KV cache alone |
| Consider total cost | Two H100s may cost more than one H200 while providing similar effective capacity |
Conclusion
Understanding GPU memory architecture isn't just academic—it directly impacts your infrastructure decisions and costs. The right GPU for your workload balances capacity, bandwidth, and budget.