Building GPU Infrastructure for AI Inference
Deploying large language models and vision-language models in production requires careful infrastructure planning. Here is how we set up GPU infrastructure using NVIDIA H100s and KAITO on Azure Kubernetes Service.
Why GPU Infrastructure Matters
Running AI models at inference time is fundamentally different from training. Inference workloads need low latency, consistent throughput, and cost-efficient GPU utilization. The infrastructure choices you make directly impact your per-request costs and response times.
NVIDIA MIG (Multi-Instance GPU)
MIG allows partitioning a single GPU into multiple isolated instances. For an H100 with 80GB of memory, you can create up to 7 independent GPU instances, each with dedicated compute, memory, and cache resources.
This is transformative for inference workloads where a full GPU would be wasteful for a single model. MIG lets you run multiple models on a single GPU with hardware-level isolation.
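To see what that partitioning looks like from inside a process, here is a minimal sketch using the nvidia-ml-py (pynvml) bindings to enumerate the MIG instances on the first GPU. It assumes MIG mode is already enabled on the device and that the process is allowed to query NVML.

```python
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

# MIG mode is reported as (current, pending); both are 1 when enabled.
current, pending = pynvml.nvmlDeviceGetMigMode(gpu)
print(f"MIG enabled: {current == pynvml.NVML_DEVICE_MIG_ENABLE}")

# Walk the MIG instance slots; unoccupied slots raise an NVML error.
max_migs = pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)
for i in range(max_migs):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
    except pynvml.NVMLError:
        continue  # slot not populated
    mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
    name = pynvml.nvmlDeviceGetName(mig)
    print(f"MIG {i}: {name}, {mem.total / 2**30:.1f} GiB total")

pynvml.nvmlShutdown()
```

On an H100 split into 7 instances, each entry reports its own dedicated slice of memory, which is exactly the hardware-level isolation inference schedulers can rely on.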
KAITO on AKS
KAITO (Kubernetes AI Toolchain Operator) simplifies AI model deployment on Kubernetes. Instead of manually configuring GPU resources, container images, and model serving frameworks, KAITO provides a declarative approach.
You define a workspace resource specifying the model and GPU requirements, and KAITO handles node provisioning, model downloading, and serving setup.
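As a sketch of that declarative flow, the snippet below submits a KAITO Workspace through the official Kubernetes Python client. The field names follow the kaito.sh/v1alpha1 CRD as published in KAITO's examples, and the falcon-7b preset and H100 instance type are illustrative assumptions; check them against the KAITO version your cluster runs.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

# Illustrative values: adjust the preset name, instance type, and
# CRD version to match your KAITO installation.
workspace = {
    "apiVersion": "kaito.sh/v1alpha1",
    "kind": "Workspace",
    "metadata": {"name": "workspace-falcon-7b", "namespace": "default"},
    "resource": {
        "instanceType": "Standard_NC40ads_H100_v5",
        "labelSelector": {"matchLabels": {"apps": "falcon-7b"}},
    },
    "inference": {"preset": {"name": "falcon-7b"}},
}

api.create_namespaced_custom_object(
    group="kaito.sh",
    version="v1alpha1",
    namespace="default",
    plural="workspaces",
    body=workspace,
)
```

Once the Workspace is accepted, KAITO provisions the GPU node pool, pulls the model, and exposes an inference endpoint; no hand-written Deployment or device-plugin configuration is needed.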
DGX Spark with Blackwell
For edge or on-premises scenarios, the NVIDIA DGX Spark with its GB10 Grace Blackwell Superchip provides 128GB of coherent unified memory in a compact form factor. Key considerations (a sanity-check sketch follows this list):
- The ARM64 architecture (Grace CPU) requires ARM64-native container builds; x86_64 images will not run
- Only NGC containers are validated for Blackwell (sm_121)
- K3s provides a lightweight Kubernetes distribution for single-node deployments
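The kind of early compatibility check we mean is sketched below, assuming a CUDA-enabled PyTorch build is present in the container: it verifies the ARM64 host architecture and the Blackwell compute capability before any model weights are loaded.

```python
import platform
import torch

# The Grace CPU reports as aarch64; x86_64 images will not run here.
arch = platform.machine()
assert arch == "aarch64", f"expected ARM64 (aarch64), got {arch}"

# sm_121 corresponds to compute capability (12, 1) on this Blackwell part.
assert torch.cuda.is_available(), "no CUDA device visible in the container"
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability sm_{major}{minor}")
assert (major, minor) >= (12, 0), "container's CUDA build predates Blackwell"
```

Running this as a startup probe fails fast on a mismatched image instead of surfacing as a cryptic kernel-launch error minutes later.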
Lessons Learned
- Right-size GPU allocation with MIG to reduce costs
- Use KAITO for declarative model deployment on AKS
- Test container compatibility with GPU architecture early
- Monitor GPU utilization metrics to optimize placement (see the sketch after this list)
- Plan for model updates and A/B serving strategies
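For the utilization point above, here is a minimal polling sketch with nvidia-ml-py; in production you would export these metrics through DCGM or a Prometheus exporter rather than print them.

```python
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(10):  # sample roughly once a second for ten seconds
    for i, h in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(h)  # .gpu / .memory in %
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)
        print(f"GPU {i}: sm={util.gpu}% mem_bw={util.memory}% "
              f"vram={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
    time.sleep(1)

pynvml.nvmlShutdown()
```

Persistently low SM utilization on a full GPU is the signal that a workload belongs on a smaller MIG slice instead.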
GPU infrastructure for AI is still a rapidly evolving space. The tools and best practices are changing fast, but the fundamental principles of resource efficiency, isolation, and automation remain constant.