TL;DR: GPU memory is often the bottleneck for AI workloads. HBM3e in the H200 provides 4.8 TB/s bandwidth and 141GB capacity. Match memory to your model size—don't overspend if your models fit on H100's 80GB.
Why Memory Matters
GPU memory architecture is often overlooked in AI infrastructure planning, but it's frequently the bottleneck that determines which models you can run and how efficiently they run.
Modern large language models have billions of parameters. GPT-4 is estimated to have over 1 trillion parameters. Each parameter requires memory, and during inference, you need additional memory for activations, KV cache, and other operational data.
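As a quick back-of-the-envelope check, weight memory is simply parameter count times bytes per parameter. The sketch below is illustrative only (model sizes and precisions are examples, and real deployments also need headroom for activations and KV cache), but it shows why a 70B model in FP16 spills past a single 80GB GPU:

```python
# Back-of-envelope weight-memory estimate: parameter count x bytes per parameter.
# Illustrative only; real deployments also need headroom for activations and KV cache.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Approximate memory needed just to hold the model weights, in GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for params, name in [(7e9, "7B"), (70e9, "70B")]:
    for prec in ("fp16", "int8", "int4"):
        gb = weight_memory_gb(params, prec)
        print(f"{name} {prec}: ~{gb:.0f} GB "
              f"(fits H100 80GB: {gb <= 80}, fits H200 141GB: {gb <= 141})")
```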
HBM3e: The Memory Evolution
High Bandwidth Memory (HBM) has evolved through several generations:
| Generation | Bandwidth | Typical Capacity | Example GPU |
|---|---|---|---|
| HBM2 | 900 GB/s | 16-32 GB | V100 |
| HBM2e | ~2 TB/s | 80 GB | A100 (80GB) |
| HBM3 | 3.35 TB/s | 80 GB | H100 |
| HBM3e | 4.8 TB/s | 141 GB | H200 |
Key insight: The H200's HBM3e provides more than 5x the bandwidth of the V100-era HBM2, enabling larger batch sizes and faster inference.
Memory Bandwidth vs. Capacity
Two distinct metrics matter for AI workloads:
| Metric | What It Determines | Example |
|---|---|---|
| Memory Capacity | What models you can load | A 70B model in FP16 requires ~140GB—exceeds H100's 80GB |
| Memory Bandwidth | How fast you can move data | Higher bandwidth = more tokens per second |
Practical Implications
For Training
| Factor | Impact |
|---|---|
| Larger memory | Enables bigger batch sizes |
| Higher bandwidth | Accelerates gradient synchronization |
| Multi-GPU setups | Scale more efficiently when each GPU carries more capacity and bandwidth |
For Inference
| Factor | Impact |
|---|---|
| Memory capacity | Determines maximum model size per GPU |
| Bandwidth | Affects tokens-per-second throughput |
| KV cache size | Impacts context window length |
Right-Sizing Your Memory
Not every workload needs the largest GPU. Consider these guidelines:
| Model Size | Recommended GPU | Notes |
|---|---|---|
| 7B models | H100 or A100 | Single GPU sufficient |
| 13B-30B models | H100 | Comfortable headroom |
| 70B models | H200 | Single GPU with FP8/INT8 quantization, or multi-GPU H100 |
| Models above 70B | Multi-GPU | Required regardless of GPU type |
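As a rough aid, the guidelines in the table above can be folded into a small helper. This is a sketch of the article's rules of thumb, not a definitive sizing tool; quantization, context length, and batch size can all shift the answer:

```python
# Minimal sketch of the right-sizing guidelines in the table above.
# Thresholds mirror the article's rules of thumb, not hard limits; quantization,
# context length, and batch size can shift the answer in either direction.

def recommend_gpu(model_params_billion: float) -> str:
    """Rough GPU recommendation for inference, given model size in billions of parameters."""
    if model_params_billion <= 7:
        return "H100 or A100 (single GPU sufficient)"
    if model_params_billion <= 30:
        return "H100 (comfortable headroom)"
    if model_params_billion <= 70:
        return "H200 (with FP8/INT8 quantization) or multi-GPU H100"
    return "Multi-GPU required regardless of GPU type"

print(recommend_gpu(13))   # H100 (comfortable headroom)
print(recommend_gpu(70))   # H200 (with FP8/INT8 quantization) or multi-GPU H100
```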
Memory Architecture Deep Dive
HBM3 and HBM3e (High Bandwidth Memory)
HBM uses a stacked memory architecture with through-silicon vias (TSVs) connecting multiple DRAM dies.
| Characteristic | HBM3/HBM3e | GDDR6X |
|---|---|---|
| Architecture | 8-12 DRAM dies stacked vertically | Traditional side-by-side |
| Bus Width | 1024-bit per stack | 32-bit per chip (384-bit total on an RTX 4090) |
| Bandwidth | 3.35-4.8 TB/s | ~1 TB/s |
| Power Efficiency | ~3.9 pJ/bit | ~8 pJ/bit |
Technical note: HBM3e in the H200 delivers 43% more bandwidth than the H100's HBM3, through a combination of faster per-pin signaling and a wider memory interface (the HBM3e spec supports roughly 9.2 Gbps per pin vs. 6.4 Gbps for HBM3).
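As a sanity check on these numbers, peak bandwidth is roughly bus width times per-pin data rate. Assuming the commonly cited 5120-bit (H100 SXM) and 6144-bit (H200) memory interfaces, solving for the effective pin rate shows that shipped parts run their HBM below the spec ceilings quoted above:

```python
# Back-of-envelope check: peak bandwidth = bus width (bits) x pin rate (Gbps) / 8.
# Bus widths below are the commonly cited figures for the H100 SXM and H200;
# solving for the effective pin rate shows shipped parts run below the spec ceilings.

def effective_pin_rate_gbps(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Effective per-pin data rate implied by a GPU's delivered memory bandwidth."""
    return bandwidth_gb_s * 8 / bus_width_bits

print(effective_pin_rate_gbps(3350, 5120))  # H100: ~5.2 Gbps vs. 6.4 Gbps HBM3 spec
print(effective_pin_rate_gbps(4800, 6144))  # H200: ~6.3 Gbps vs. ~9.2 Gbps HBM3e spec
```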
GDDR6X (Graphics DDR)
Consumer and professional GPUs (RTX 4090, RTX 6000 Ada) use GDDR6X:
- Lower Capacity: 24-48GB typical vs. 80-141GB HBM
- Lower Bandwidth: ~1 TB/s vs. 3-5 TB/s for HBM
- Lower Cost: Significantly cheaper per GB
- Use Case: Suitable for inference on smaller models, not large-scale training
Memory Bandwidth Requirements by Workload
| Model Size | Training | Inference | Recommended |
|---|---|---|---|
| 7B | 500+ GB/s | 200+ GB/s | GDDR6X or HBM |
| 13B | 1+ TB/s | 500+ GB/s | HBM2e or HBM3 |
| 70B | 2+ TB/s | 1+ TB/s | HBM3 or HBM3e |
| 175B+ | 3+ TB/s | 2+ TB/s | HBM3e required |
Choosing the Right Memory Configuration
Training Considerations
For training workloads, memory requirements scale with:
| Factor | Memory Impact |
|---|---|
| Model Parameters | FP16 weights require 2 bytes per parameter. A 70B model needs 140GB just for weights. |
| Optimizer States | Adam stores momentum and variance for every parameter, roughly tripling parameter-related memory (more if FP32 master copies are kept). |
| Activations | Checkpointing reduces memory but increases compute. Plan for 2-4x parameter size. |
| Batch Size | Larger batches improve throughput but require proportionally more memory. |
Rule of thumb: For training, plan for roughly 4-6x the model size in total GPU memory, distributed across your devices.
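Translating the factors in the table into numbers, here is a minimal sketch. The assumptions are mine: FP16/BF16 weights and gradients, a standard Adam-style optimizer, and activation checkpointing near the low end of the 2-4x range; real footprints vary with framework, sharding strategy, and sequence length. For a 70B model it lands around 840GB total, about 6x the 140GB FP16 weight footprint:

```python
# Training-memory estimate built from the factors in the table above.
# Assumptions: FP16/BF16 weights and gradients, an Adam-style optimizer holding
# momentum and variance, and activation checkpointing keeping activations near
# the low end of the 2-4x range. Real footprints vary with framework and sharding.

def training_memory_gb(params_billion: float, activation_multiple: float = 2.0) -> dict:
    weights = params_billion * 2        # FP16: 2 bytes/param -> GB per billion params
    gradients = weights                 # one gradient per weight, same precision
    optimizer = 2 * weights             # Adam momentum + variance
    activations = activation_multiple * weights
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "activations": activations,
        "total": weights + gradients + optimizer + activations,
    }

print(training_memory_gb(70)["total"])  # ~840 GB for a 70B model -> a multi-GPU, multi-node job
```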
Inference Considerations
Inference memory requirements are more predictable:
| Factor | Memory Impact |
|---|---|
| Model Weights | FP16 requires 2 bytes/parameter; INT8 quantization halves this |
| KV Cache | Grows with context length and batch size. For a 70B-class model at 8K context, expect anywhere from a few GB (grouped-query attention) to ~20GB (full multi-head attention) per concurrent request |
| Activation Memory | Minimal compared to training, typically <5% of total |
Rule of thumb: For inference, model weights + (KV cache × max concurrent users) determines memory needs.
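A minimal sketch of that rule of thumb, with the KV cache computed from model dimensions. The dimensions below (80 layers, 64 attention heads of size 128) are assumptions for a 70B-class model, not values from this article; models that use grouped-query attention keep far fewer KV heads, which is why per-request figures vary so widely in practice:

```python
# Inference sizing per the rule of thumb: weights + (KV cache x concurrent requests).
# Model dimensions below (80 layers, 64 heads of size 128) are assumptions for a
# 70B-class model, not values from this article; grouped-query attention (GQA)
# keeps far fewer KV heads and shrinks the cache accordingly.

def kv_cache_gb(context_len: int, n_layers: int = 80, n_kv_heads: int = 64,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Per-request KV cache: 2 (K and V) x layers x KV heads x head dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

weights_gb = 70 * 2                                 # 70B parameters at FP16
per_request_mha = kv_cache_gb(8192)                 # ~21 GB with full multi-head attention
per_request_gqa = kv_cache_gb(8192, n_kv_heads=8)   # ~2.7 GB with 8 KV heads (GQA)

concurrent_users = 4
print(weights_gb + concurrent_users * per_request_gqa)  # total GB needed for this scenario
```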
Frequently Asked Questions
How much GPU memory do I need for fine-tuning?
Fine-tuning memory requirements depend on the technique:
| Technique | Memory Requirement | Example (Llama 70B) |
|---|---|---|
| Full Fine-tuning | 4-6x model size | 280-420GB |
| LoRA/QLoRA | 1.5-2x model size | ~100GB |
| Gradient Checkpointing | 2-3x model size | 140-210GB |
Practical example: Fine-tuning Llama 70B with LoRA requires ~100GB—possible on H200 (141GB) but not H100 (80GB) without quantization.
What's the difference between capacity and bandwidth for inference?
| Metric | What It Does | Impact |
|---|---|---|
| Capacity | Determines max model size | A model that needs 90GB won't fit in an H100's 80GB |
| Bandwidth | Determines inference speed | LLM inference is memory-bound—higher bandwidth = more tokens/sec |
Example: For a 70B model, H100's 3.35 TB/s yields ~70 tokens/sec; H200's 4.8 TB/s yields ~100 tokens/sec.
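Those tokens-per-second figures follow from a simple roofline argument: at small batch sizes, every generated token has to stream the model weights (plus KV cache) from memory, so bandwidth divided by bytes read per token gives an upper bound. The sketch below lands in the same ballpark under the assumption of a 70B model quantized down to roughly 45GB at batch size 1; that assumption is mine, not the article's:

```python
# Roofline-style upper bound for memory-bound decoding: each generated token streams
# the weights (plus KV cache) from HBM, so tokens/sec <= bandwidth / bytes per token.
# The ~45 GB weight footprint below assumes a heavily quantized 70B model at batch
# size 1; real throughput also depends on kernels, batching, and scheduling.

def max_tokens_per_sec(bandwidth_tb_s: float, weights_gb: float, kv_cache_gb: float = 0.0) -> float:
    """Bandwidth-bound ceiling on decode throughput, in tokens per second."""
    gb_read_per_token = weights_gb + kv_cache_gb
    return bandwidth_tb_s * 1000 / gb_read_per_token

print(max_tokens_per_sec(3.35, 45))  # H100: ~74 tokens/sec ceiling
print(max_tokens_per_sec(4.80, 45))  # H200: ~107 tokens/sec ceiling
```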
Can I use model parallelism to work around memory limits?
Yes, but with trade-offs:
| Technique | Pros | Cons |
|---|---|---|
| Tensor Parallelism | Splits each layer's weight matrices across GPUs | Adds communication overhead (NVLink essential) |
| Pipeline Parallelism | Different layers on different GPUs | Adds latency, reduces throughput per GPU |
| ZeRO/FSDP | Shards optimizer states and parameters | Works for training, complex for inference |
Recommendation: Single-GPU inference is always simpler. If your model fits on one H200, that's preferable to two H100s with tensor parallelism.
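To make the trade-off concrete, here is a rough sketch of how tensor parallelism divides memory for an INT8/FP8-quantized 70B model (~70GB of weights plus a ~16GB KV-cache budget; both figures are assumptions). The ideal shard ignores activation and communication buffers, which is part of why the single-H200 path stays the simpler option:

```python
# Ideal per-GPU memory under tensor parallelism: weights and KV cache are split
# roughly evenly across the TP group. Real deployments add activation and
# communication buffers on top of this, so treat these as lower bounds.

def per_gpu_weights_and_cache_gb(weights_gb: float, kv_cache_gb: float, tp_degree: int) -> float:
    """Evenly sharded weights + KV cache per GPU, ignoring per-GPU overheads."""
    return (weights_gb + kv_cache_gb) / tp_degree

# 70B model quantized to INT8/FP8 (~70 GB of weights) with a ~16 GB KV-cache budget:
for tp in (1, 2, 4):
    print(f"TP={tp}: ~{per_gpu_weights_and_cache_gb(70, 16, tp):.0f} GB per GPU")
# TP=1: ~86 GB -> exceeds an 80 GB H100, fits a 141 GB H200 with room to spare
# TP=2: ~43 GB -> fits two H100s, at the cost of NVLink communication on every layer
# TP=4: ~22 GB -> plenty of headroom, but even more communication overhead
```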
Key Takeaways
| Takeaway | Why It Matters |
|---|---|
| Match memory to model size | Don't overspend on H200 if your models fit comfortably on H100 |
| Bandwidth matters for inference | Memory-bound operations scale directly with bandwidth |
| Plan for KV cache | Long-context applications can consume 50%+ of GPU memory for KV cache alone |
| Consider total cost | Two H100s may cost more than one H200 while providing similar effective capacity |
Conclusion
Understanding GPU memory architecture isn't just academic—it directly impacts your infrastructure decisions and costs. The right GPU for your workload balances capacity, bandwidth, and budget.