Technical · January 12, 2026 · 6 min read

Understanding GPU Memory Architecture for AI Workloads

Kyle Sidles, CTO, SLYD

TL;DR: GPU memory is often the bottleneck for AI workloads. HBM3e in the H200 provides 4.8 TB/s bandwidth and 141GB capacity. Match memory to your model size—don't overspend if your models fit on H100's 80GB.


Why Memory Matters

GPU memory architecture is often overlooked in AI infrastructure planning, but it's frequently the bottleneck that determines what models you can run and how efficiently.

Modern large language models have billions of parameters. GPT-4 is estimated to have over 1 trillion parameters. Each parameter requires memory, and during inference, you need additional memory for activations, KV cache, and other operational data.


HBM3e: The Memory Evolution

High Bandwidth Memory (HBM) has evolved through several generations:

Generation | Bandwidth | Typical Capacity | Example GPU
HBM2 | 900 GB/s | 16-32 GB | V100
HBM2e | ~2 TB/s | 64-80 GB | A100 (80GB)
HBM3 | 3.35 TB/s | 80 GB | H100
HBM3e | 4.8 TB/s | 141 GB | H200

Key insight: The H200's HBM3e provides more than 5x the bandwidth of the V100's HBM2, enabling larger batch sizes and faster inference.


Memory Bandwidth vs. Capacity

Two distinct metrics matter for AI workloads:

Metric | What It Determines | Example
Memory capacity | What models you can load | A 70B model in FP16 requires ~140GB, exceeding the H100's 80GB
Memory bandwidth | How fast you can move data | Higher bandwidth = more tokens per second
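
As a quick sanity check on the capacity column, the weights-only footprint is just parameter count times bytes per parameter. Here's a minimal sketch in Python (illustrative only; the dtype sizes are standard, but real deployments also need headroom for KV cache and activations):

```python
# Approximate GPU memory needed just to hold model weights.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params_billions: float, dtype: str = "fp16") -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return num_params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

print(weight_memory_gb(70, "fp16"))  # 140.0 GB -> does not fit on an 80GB H100
print(weight_memory_gb(70, "int4"))  # 35.0 GB  -> fits, with room left for KV cache
print(weight_memory_gb(7, "fp16"))   # 14.0 GB  -> comfortable on a single GPU
```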

Practical Implications

For Training

Factor | Impact
Larger memory | Enables bigger batch sizes
Higher bandwidth | Accelerates gradient synchronization
Multi-GPU | Scaling becomes more efficient

For Inference

Factor | Impact
Memory capacity | Determines maximum model size per GPU
Bandwidth | Affects tokens-per-second throughput
KV cache size | Impacts context window length

Right-Sizing Your Memory

Not every workload needs the largest GPU. Consider these guidelines:

Model Size | Recommended GPU | Notes
7B models | H100 or A100 | Single GPU sufficient
13B-30B models | H100 | Comfortable headroom
70B models | H200 | Or multi-GPU H100
70B+ models | Multi-GPU | Mandatory regardless of GPU type
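
If you want to encode these guidelines in your own capacity-planning tooling, a minimal helper might look like the sketch below. The thresholds simply mirror the table above and assume FP16 weights; your own cutoffs should account for quantization, context length, and concurrency.

```python
def recommend_gpu(model_params_billions: float) -> str:
    """Very rough single-node recommendation, mirroring the guidelines table (FP16 weights assumed)."""
    if model_params_billions <= 7:
        return "H100 or A100 (single GPU sufficient)"
    if model_params_billions <= 30:
        return "H100 (comfortable headroom)"
    if model_params_billions <= 70:
        return "H200, or multi-GPU H100"
    return "Multi-GPU, regardless of GPU type"

print(recommend_gpu(13))   # H100 (comfortable headroom)
print(recommend_gpu(70))   # H200, or multi-GPU H100
print(recommend_gpu(180))  # Multi-GPU, regardless of GPU type
```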

Memory Architecture Deep Dive

HBM3 and HBM3e (High Bandwidth Memory)

HBM uses a stacked memory architecture with through-silicon vias (TSVs) connecting multiple DRAM dies.

Characteristic | HBM3/HBM3e | GDDR6X
Architecture | 8-12 DRAM dies stacked vertically | Traditional side-by-side
Bus Width | 1024-bit per stack | 384-bit total
Bandwidth | 3.35-4.8 TB/s | ~1 TB/s
Power Efficiency | ~3.9 pJ/bit | ~8 pJ/bit

Technical note: HBM3e in the H200 delivers about 43% more bandwidth than the H100's HBM3, driven largely by higher per-pin data rates (up to 9.2 Gbps vs. 6.4 Gbps).
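
To see how those pin rates turn into bandwidth, peak per-stack bandwidth is simply bus width times data rate per pin. The sketch below walks through that arithmetic; a GPU's headline number is this per-stack figure summed over however many stacks it uses, at whatever data rate the product actually ships with.

```python
# Peak per-stack bandwidth = bus width (bits) x data rate per pin (Gbps) / 8 bits per byte.
def peak_stack_bandwidth_gbs(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Theoretical peak bandwidth of one HBM stack, in GB/s."""
    return bus_width_bits * pin_rate_gbps / 8

print(peak_stack_bandwidth_gbs(1024, 6.4))  # 819.2 GB/s per stack at the HBM3 spec rate
print(peak_stack_bandwidth_gbs(1024, 9.2))  # 1177.6 GB/s per stack at a 9.2 Gbps HBM3e rate
```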

GDDR6X (Graphics DDR)

Consumer and professional GPUs (RTX 4090, RTX 6000 Ada) use GDDR6X:

  • Lower Capacity: 24-48GB typical vs. 80-141GB HBM
  • Lower Bandwidth: ~1 TB/s vs. 3-5 TB/s for HBM
  • Lower Cost: Significantly cheaper per GB
  • Use Case: Suitable for inference on smaller models, not large-scale training

Memory Bandwidth Requirements by Workload

Model Size | Training | Inference | Recommended
7B | 500+ GB/s | 200+ GB/s | GDDR6X or HBM
13B | 1+ TB/s | 500+ GB/s | HBM2e or HBM3
70B | 2+ TB/s | 1+ TB/s | HBM3 or HBM3e
175B+ | 3+ TB/s | 2+ TB/s | HBM3e required

Choosing the Right Memory Configuration

Training Considerations

For training workloads, memory requirements scale with:

Factor | Memory Impact
Model Parameters | FP16 weights require 2 bytes per parameter; a 70B model needs 140GB just for weights.
Optimizer States | Adam stores momentum and variance for every parameter, roughly tripling parameter-related memory (more once FP32 master weights are kept in mixed precision).
Activations | Checkpointing reduces memory but increases compute; plan for 2-4x parameter size.
Batch Size | Larger batches improve throughput but require proportionally more activation memory.

Rule of thumb: For training, plan for 4-6x the model size in GPU memory per device, then distribute across nodes.
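
To make that rule of thumb concrete, the sketch below itemizes the usual per-parameter costs of mixed-precision training with Adam (FP16 weights and gradients plus FP32 master weights and optimizer state, a commonly cited ~16 bytes per parameter before activations). It's an estimate, not a measurement; optimizer-state sharding (ZeRO/FSDP), 8-bit optimizers, and activation checkpointing are what bring the per-device number down toward the 4-6x guideline.

```python
# Rough per-parameter memory for mixed-precision Adam training,
# before activations and before any optimizer-state sharding (ZeRO/FSDP).
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_momentum_fp32": 4,
    "adam_variance_fp32": 4,
}

def training_memory_gb(num_params_billions: float) -> float:
    """Weights + gradients + Adam state in GB; activations come on top of this."""
    bytes_per_param = sum(BYTES_PER_PARAM.values())  # 16 bytes per parameter
    return num_params_billions * bytes_per_param

print(training_memory_gb(7))   # ~112 GB across the cluster, before activations
print(training_memory_gb(70))  # ~1120 GB, which is why optimizer sharding is essential
```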

Inference Considerations

Inference memory requirements are more predictable:

Factor | Memory Impact
Model Weights | FP16 requires 2 bytes/parameter; INT8 quantization halves this
KV Cache | Grows with context length and batch size. For Llama 70B with 8K context: ~16GB per concurrent request
Activation Memory | Minimal compared to training, typically <5% of total

Rule of thumb: For inference, model weights + (KV cache × max concurrent users) determines memory needs.
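
A rough KV-cache estimate follows directly from the model's attention configuration: two tensors (K and V) per layer, each of size KV heads x head dimension per token. The sketch below is a generic estimator with hypothetical example numbers rather than measurements of any specific model; models that use grouped-query attention keep far fewer KV heads and need correspondingly less cache per token.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB (FP16 by default): K and V stored per layer, per token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * context_len * batch_size / 1e9

# Hypothetical 80-layer model with full multi-head attention (64 KV heads of dim 128), 8K context:
print(kv_cache_gb(80, 64, 128, 8192))  # ~21 GB per sequence
# The same shape with grouped-query attention (8 KV heads) needs about an eighth of that:
print(kv_cache_gb(80, 8, 128, 8192))   # ~2.7 GB per sequence
```

Multiply the per-sequence figure by the number of concurrent requests, add the weight footprint, and you have a first-order memory budget.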


Frequently Asked Questions

How much GPU memory do I need for fine-tuning?

Fine-tuning memory requirements depend on the technique:

Technique | Memory Requirement | Example (Llama 70B)
Full Fine-tuning | 4-6x model size | 280-420GB
LoRA/QLoRA | 1.5-2x model size | ~100GB
Gradient Checkpointing | 2-3x model size | 140-210GB

Practical example: Fine-tuning Llama 70B with LoRA requires ~100GB—possible on H200 (141GB) but not H100 (80GB) without quantization.

What's the difference between capacity and bandwidth for inference?

Metric | What It Does | Impact
Capacity | Determines max model size | If a model requires 90GB, it won't fit on an 80GB H100
Bandwidth | Determines inference speed | LLM inference is memory-bound; higher bandwidth = more tokens/sec

Example: For a 70B model, H100's 3.35 TB/s yields ~70 tokens/sec; H200's 4.8 TB/s yields ~100 tokens/sec.
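
The intuition behind figures like these is a simple roofline: during single-stream decoding, every generated token has to stream essentially all of the weights through the memory system, so an upper bound on tokens per second is bandwidth divided by the weight footprint. The sketch below uses a hypothetical 13B FP16 model to keep the numbers simple; batching amortizes each weight read across many requests, and quantization and tensor parallelism shift the bound further, so treat this as intuition rather than a benchmark.

```python
def decode_tokens_per_sec_upper_bound(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Memory-bandwidth roofline for batch-size-1 decoding: one full weight read per token."""
    return bandwidth_tb_s * 1000 / weight_gb  # TB/s -> GB/s, divided by GB moved per token

# Hypothetical 13B model in FP16 (~26 GB of weights):
print(decode_tokens_per_sec_upper_bound(3.35, 26))  # ~129 tokens/sec on an H100-class part
print(decode_tokens_per_sec_upper_bound(4.8, 26))   # ~185 tokens/sec on an H200-class part
```

Note how the ratio between the two results tracks the bandwidth ratio, which is exactly what "memory-bound" means in practice.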

Can I use model parallelism to work around memory limits?

Yes, but with trade-offs:

Technique | Pros | Cons
Tensor Parallelism | Splits each layer's weight matrices across GPUs | Adds communication overhead (NVLink essential)
Pipeline Parallelism | Puts different layers on different GPUs | Adds latency, reduces throughput per GPU
ZeRO/FSDP | Shards optimizer states and parameters | Works well for training, complex for inference

Recommendation: Single-GPU inference is always simpler. If your model fits on one H200, that's preferable to two H100s with tensor parallelism.
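
If a model genuinely doesn't fit on one GPU, a low-friction starting point is layer-wise placement across devices rather than a full tensor-parallel serving stack. The sketch below assumes the Hugging Face transformers and accelerate libraries and uses a placeholder model ID; device_map="auto" spreads whole layers across the available GPUs (closer to the pipeline-style row in the table above), so expect simpler setup but lower per-GPU utilization than true tensor parallelism.

```python
# Minimal sketch: shard a model that exceeds a single GPU's memory across several GPUs.
# Assumes `pip install torch transformers accelerate`; the model ID is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-70b-model"  # hypothetical; substitute a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # 2 bytes per parameter
    device_map="auto",          # place whole layers on GPUs until each fills up
)

inputs = tokenizer("GPU memory is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```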


Key Takeaways

Takeaway | Why It Matters
Match memory to model size | Don't overspend on an H200 if your models fit comfortably on an H100
Bandwidth matters for inference | Memory-bound operations scale directly with bandwidth
Plan for KV cache | Long-context applications can consume 50%+ of GPU memory for KV cache alone
Consider total cost | Two H100s may cost more than one H200 while providing similar effective capacity

Conclusion

Understanding GPU memory architecture isn't just academic—it directly impacts your infrastructure decisions and costs. The right GPU for your workload balances capacity, bandwidth, and budget.


Ready to Build Your AI Infrastructure?

Talk to our team about sovereign AI deployment for your enterprise.
