The Setup
Where the work started
The NHL needed production-ready GenAI infrastructure. The existing ML platform was not built for LLM-scale training or inference, and cost attribution was opaque — leadership had no way to evaluate whether a given LLM-powered feature was carrying its weight.
The environment was Azure-first. Any platform built here had to respect existing enterprise agreements, security postures, and identity boundaries, and had to move the work from shared notebooks to a defensible, observable service.
What had to be true
- Stand up an LLMOps platform capable of distributed training, production inference with bounded latency, and end-to-end observability.
- Expose per-request cost attribution so feature teams could be charged back accurately and leadership could evaluate ROI at the individual feature level.
- Keep the platform defensible from a security standpoint: private networking, scoped identity, auditable model storage, and reproducible training runs.
What I Did
The architecture
Architected an LLMOps platform on Azure with a clean separation between the training tier, the inference tier, and the observability and cost-attribution layer that spans both. Every design choice was tested against a single question: does this make per-request cost visible?
01
Distributed training on Azure ML
PyTorch DDP for smaller models and FSDP for models above the single-GPU memory ceiling, running across 8× NVIDIA A100s on Azure ML compute. Checkpoint strategy tuned so failure-restart economics did not dominate wall-clock cost.
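The DDP-below, FSDP-above split can be captured as a simple memory heuristic. A minimal sketch, assuming mixed-precision AdamW (fp16 params and grads, fp32 master weights plus two moments, roughly 16 bytes per parameter) and an illustrative 80 GB GPU with 60% usable headroom; the exact numbers are placeholders, not the production sizing model:

```python
def training_footprint_gb(num_params: float, dtype_bytes: int = 2) -> float:
    """Rough per-GPU memory for full-replica (DDP) training:
    params + grads in half precision, fp32 master copy, two Adam moments."""
    per_param = 2 * dtype_bytes + 3 * 4  # 2*2 + 12 = 16 bytes/param
    return num_params * per_param / 1e9

def choose_strategy(num_params: float, gpu_mem_gb: float = 80.0,
                    headroom: float = 0.6) -> str:
    """Stay on DDP while full optimizer state fits one GPU; shard with FSDP above that."""
    if training_footprint_gb(num_params) <= gpu_mem_gb * headroom:
        return "DDP"
    return "FSDP"
```

By this estimate a 1B-parameter model (~16 GB of training state) stays on DDP, while a 7B model (~112 GB) crosses the ceiling and shards.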
02
vLLM inference on AKS
vLLM with paged attention deployed on Azure Kubernetes Service, with separate request classes for latency-sensitive traffic and batch workloads. Autoscaling driven by request-class-aware metrics, not raw CPU.
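"Request-class-aware metrics, not raw CPU" can be sketched as a scaling signal in which each class contributes demand normalized by its own per-replica concurrency budget. The class names and budgets below are illustrative, not the production configuration:

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class ClassLoad:
    in_flight: int           # requests currently executing for this class
    queued: int              # requests waiting for this class
    target_concurrency: int  # healthy per-replica concurrency for this class

def desired_replicas(loads: dict[str, ClassLoad],
                     min_replicas: int = 1, max_replicas: int = 16) -> int:
    """Scale on per-class demand: latency-sensitive classes with small
    concurrency budgets pull replicas up long before raw CPU would."""
    demand = sum((c.in_flight + c.queued) / c.target_concurrency
                 for c in loads.values())
    return max(min_replicas, min(max_replicas, ceil(demand)))
```

A latency-sensitive class with a budget of 4 and 12 outstanding requests alone asks for 3 replicas, where a CPU-based signal might still read idle between token-generation bursts.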
03
Per-request cost attribution
Custom middleware tagged every inference request with tenant, feature, model, and a cost envelope. A daily pipeline aggregated usage into feature-level chargeback reports — the first time leadership could ask ROI questions at that granularity.
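The shape of that tagging step can be sketched as below. The throughput and GPU-second price are illustrative placeholders (real rates come from billing data), and `attribute_cost` is a hypothetical helper, not the production middleware:

```python
import uuid
from dataclasses import dataclass

GPU_SECOND_USD = 0.0011  # illustrative rate; substitute actual billing data

@dataclass
class CostRecord:
    request_id: str
    tenant: str
    feature: str
    model: str
    gpu_seconds: float
    cost_usd: float

def attribute_cost(tenant: str, feature: str, model: str,
                   prompt_tokens: int, completion_tokens: int,
                   tokens_per_gpu_second: float = 900.0) -> CostRecord:
    """Convert a request's token counts into a GPU-second envelope, then
    dollars, tagged with everything the chargeback pipeline needs."""
    gpu_s = (prompt_tokens + completion_tokens) / tokens_per_gpu_second
    return CostRecord(str(uuid.uuid4()), tenant, feature, model,
                      gpu_s, round(gpu_s * GPU_SECOND_USD, 6))
```

Records like these roll up daily by `(tenant, feature)` into the chargeback reports; the point is that the cost envelope is attached at request time, not reconstructed later.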
04
End-to-end observability
MLflow for experiment and model-lineage tracking, Azure Monitor and Prometheus for infra and request metrics, and structured logs traceable from request ID to GPU-second. Drift and quality gates ran on a held-out evaluation set on a fixed cadence.
05
Security posture by default
Private Endpoints on model and artifact storage, Key Vault-backed secrets, RBAC-scoped compute, and reproducible training runs anchored to specific commits, datasets, and compute SKUs.
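Anchoring a run to commit, dataset, and compute SKU amounts to writing a manifest alongside the model artifact and hashing it into an identity. A sketch under assumptions (the field set and the Azure SKU string are illustrative; `run_manifest` is a hypothetical helper):

```python
import hashlib
import json

def run_manifest(commit: str, dataset_uri: str, dataset_sha256: str,
                 compute_sku: str, hyperparams: dict) -> dict:
    """Capture everything needed to reproduce a training run; the hash of
    the canonical manifest becomes the run's stable identity."""
    manifest = {
        "commit": commit,
        "dataset_uri": dataset_uri,
        "dataset_sha256": dataset_sha256,
        "compute_sku": compute_sku,
        "hyperparams": hyperparams,
    }
    canonical = json.dumps(manifest, sort_keys=True).encode()
    manifest["manifest_sha256"] = hashlib.sha256(canonical).hexdigest()
    return manifest
```

Stored next to the checkpoint (and mirrored into MLflow tags), the manifest is what lets any production model be traced back to an exact commit, dataset version, and compute configuration.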
Outcome
What actually happened
Production GenAI on Azure with full cost observability, including per-request cost attribution that enabled ROI evaluation at the individual feature level.
- Training hardware: 8× A100
- Inference runtime: vLLM on AKS
- Cost granularity: per request
- Lineage: MLflow-tracked
- Monthly chargeback reports became a standard input to feature prioritization — features were evaluated on ROI, not enthusiasm.
- Inference latency SLAs held at target p95 under production load across request classes.
- Training wall-clock time dropped meaningfully after FSDP + activation-checkpointing tuning.
- Full reproducibility — every model in production traced to a specific commit, dataset version, and compute configuration.
Why it matters
The parts another team can take
- Bake per-request cost into the inference path from day one. Retrofitting cost attribution onto an LLM system that was not designed for it is an entire second project.
- FSDP pays off once models cross the single-GPU memory ceiling. Below that line, DDP is simpler and usually faster.
- Treat LLM evaluation as a CI gate, not a quarterly exercise. Quality regressions compound silently otherwise.
Stack
- Azure ML
- Azure Kubernetes Service
- PyTorch DDP / FSDP
- vLLM
- NVIDIA A100
- MLflow
- Azure Monitor
- Private Endpoints
