The Setup
Where the work started
The NHL needed production-ready GenAI infrastructure. The existing ML platform was not built for LLM-scale training or inference, and cost attribution was opaque — leadership had no way to evaluate whether a given LLM-powered feature was carrying its weight.
The environment was Azure-first. Any platform built here had to respect existing enterprise agreements, security postures, and identity boundaries, and had to move the work from shared notebooks to a defensible, observable service.
What had to be true
- Stand up an LLMOps platform capable of distributed training, production inference with bounded latency, and end-to-end observability.
- Expose per-request cost attribution so feature teams could be charged back accurately and leadership could evaluate ROI at the individual feature level.
- Keep the platform defensible from a security standpoint: private networking, scoped identity, auditable model storage, and reproducible training runs.
What I Did
The architecture
Architected an LLMOps platform on Azure with a clean separation between the training tier, the inference tier, and the observability and cost-attribution layer that spans both. Every design choice was tested against a single question: does this make per-request cost visible?
01
Distributed training on Azure ML
PyTorch DDP for smaller models and FSDP for models above the single-GPU memory ceiling, running across 8× NVIDIA A100s on Azure ML compute. Checkpoint strategy tuned so failure-restart economics did not dominate wall-clock cost.
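The DDP-below, FSDP-above split can be captured as a simple memory heuristic. A minimal sketch, assuming mixed-precision AdamW (fp16 params and grads, fp32 master weights plus two moments, roughly 16 bytes per parameter) and an illustrative 80 GB GPU with 60% usable headroom; the exact numbers are placeholders, not the production sizing model:

```python
def training_footprint_gb(num_params: float, dtype_bytes: int = 2) -> float:
    """Rough per-GPU memory for full-replica (DDP) training:
    params + grads in half precision, fp32 master copy, two Adam moments."""
    per_param = 2 * dtype_bytes + 3 * 4  # 2*2 + 12 = 16 bytes/param
    return num_params * per_param / 1e9

def choose_strategy(num_params: float, gpu_mem_gb: float = 80.0,
                    headroom: float = 0.6) -> str:
    """Stay on DDP while full optimizer state fits one GPU; shard with FSDP above that."""
    if training_footprint_gb(num_params) <= gpu_mem_gb * headroom:
        return "DDP"
    return "FSDP"
```

By this estimate a 1B-parameter model (~16 GB of training state) stays on DDP, while a 7B model (~112 GB) crosses the ceiling and shards.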
02
vLLM inference on AKS
vLLM with paged attention deployed on Azure Kubernetes Service, with separate request classes for latency-sensitive traffic and batch workloads. Autoscaling driven by request-class-aware metrics, not raw CPU.
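"Request-class-aware metrics, not raw CPU" can be sketched as a scaling signal in which each class contributes demand normalized by its own per-replica concurrency budget. The class names and budgets below are illustrative, not the production configuration:

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class ClassLoad:
    in_flight: int           # requests currently executing for this class
    queued: int              # requests waiting for this class
    target_concurrency: int  # healthy per-replica concurrency for this class

def desired_replicas(loads: dict[str, ClassLoad],
                     min_replicas: int = 1, max_replicas: int = 16) -> int:
    """Scale on per-class demand: latency-sensitive classes with small
    concurrency budgets pull replicas up long before raw CPU would."""
    demand = sum((c.in_flight + c.queued) / c.target_concurrency
                 for c in loads.values())
    return max(min_replicas, min(max_replicas, ceil(demand)))
```

A latency-sensitive class with a budget of 4 and 12 outstanding requests alone asks for 3 replicas, where a CPU-based signal might still read idle between token-generation bursts.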
03
Per-request cost attribution
Custom middleware tagged every inference request with tenant, feature, model, and a cost envelope. A daily pipeline aggregated usage into feature-level chargeback reports — the first time leadership could ask ROI questions at that granularity.
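The shape of that tagging step can be sketched as below. The throughput and GPU-second price are illustrative placeholders (real rates come from billing data), and `attribute_cost` is a hypothetical helper, not the production middleware:

```python
import uuid
from dataclasses import dataclass

GPU_SECOND_USD = 0.0011  # illustrative rate; substitute actual billing data

@dataclass
class CostRecord:
    request_id: str
    tenant: str
    feature: str
    model: str
    gpu_seconds: float
    cost_usd: float

def attribute_cost(tenant: str, feature: str, model: str,
                   prompt_tokens: int, completion_tokens: int,
                   tokens_per_gpu_second: float = 900.0) -> CostRecord:
    """Convert a request's token counts into a GPU-second envelope, then
    dollars, tagged with everything the chargeback pipeline needs."""
    gpu_s = (prompt_tokens + completion_tokens) / tokens_per_gpu_second
    return CostRecord(str(uuid.uuid4()), tenant, feature, model,
                      gpu_s, round(gpu_s * GPU_SECOND_USD, 6))
```

Records like these roll up daily by `(tenant, feature)` into the chargeback reports; the point is that the cost envelope is attached at request time, not reconstructed later.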
04
End-to-end observability
MLflow for experiment and model-lineage tracking, Azure Monitor and Prometheus for infra and request metrics, and structured logs traceable from request ID to GPU-second. Drift and quality gates ran on a held-out evaluation set on a fixed cadence.
05
Security posture by default
Private Endpoints on model and artifact storage, Key Vault-backed secrets, RBAC-scoped compute, and reproducible training runs anchored to specific commits, datasets, and compute SKUs.
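Anchoring a run to commit, dataset, and compute SKU amounts to writing a manifest alongside the model artifact and hashing it into an identity. A sketch under assumptions (the field set and the Azure SKU string are illustrative; `run_manifest` is a hypothetical helper):

```python
import hashlib
import json

def run_manifest(commit: str, dataset_uri: str, dataset_sha256: str,
                 compute_sku: str, hyperparams: dict) -> dict:
    """Capture everything needed to reproduce a training run; the hash of
    the canonical manifest becomes the run's stable identity."""
    manifest = {
        "commit": commit,
        "dataset_uri": dataset_uri,
        "dataset_sha256": dataset_sha256,
        "compute_sku": compute_sku,
        "hyperparams": hyperparams,
    }
    canonical = json.dumps(manifest, sort_keys=True).encode()
    manifest["manifest_sha256"] = hashlib.sha256(canonical).hexdigest()
    return manifest
```

Stored next to the checkpoint (and mirrored into MLflow tags), the manifest is what lets any production model be traced back to an exact commit, dataset version, and compute configuration.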
Outcome
What actually happened
Production GenAI on Azure with full cost observability, including per-request cost attribution that enabled ROI evaluation at the individual feature level.
- Training hardware: 8× A100
- Inference runtime: vLLM on AKS
- Cost granularity: per request
- Lineage: MLflow-tracked
- Monthly chargeback reports became a standard input to feature prioritization — features were evaluated on ROI, not enthusiasm.
- Inference latency SLAs held at target p95 under production load across request classes.
- Training wall-clock time dropped meaningfully after FSDP + activation-checkpointing tuning.
- Full reproducibility — every model in production traced to a specific commit, dataset version, and compute configuration.
Why it matters
The parts another team can take
- Bake per-request cost into the inference path from day one. Retrofitting cost attribution onto an LLM system that was not designed for it is an entire second project.
- FSDP pays off once models cross the single-GPU memory ceiling. Below that line, DDP is simpler and usually faster.
- Treat LLM evaluation as a CI gate, not a quarterly exercise. Quality regressions compound silently otherwise.
Stack
- Azure ML
- Azure Kubernetes Service
- PyTorch DDP / FSDP
- vLLM
- NVIDIA A100
- MLflow
- Azure Monitor
- Private Endpoints
