Microsoft Azure AI Solutions Engineer

Production LLMOps on Azure: the NHL goes from notebooks to a service

Production GenAI on Azure with 8× A100 distributed training, vLLM inference on AKS, and per-request cost attribution that made ROI visible at the feature level.

Duration
In-flight engagement
Training hardware
8× A100
Inference runtime
vLLM on AKS
Cost granularity
Per request
  • Azure
  • LLMOps
  • vLLM
  • Distributed Training
  • Sports & Media

The Setup

Where the work started

The NHL needed production-ready GenAI infrastructure. The existing ML platform was not built for LLM-scale training or inference, and cost attribution was opaque — leadership had no way to evaluate whether a given LLM-powered feature was carrying its weight.

The environment was Azure-first. Any platform built here had to respect existing enterprise agreements, security postures, and identity boundaries, and had to move the work from shared notebooks to a defensible, observable service.

What had to be true

  • Stand up an LLMOps platform capable of distributed training, production inference with bounded latency, and end-to-end observability.
  • Expose per-request cost attribution so feature teams could be charged back accurately and leadership could evaluate ROI at the individual feature level.
  • Keep the platform defensible from a security standpoint: private networking, scoped identity, auditable model storage, and reproducible training runs.

What I Did

The architecture

Architected an LLMOps platform on Azure with a clean separation between the training tier, the inference tier, and the observability and cost-attribution layer that spans both. Every design choice was tested against a single question: does this make per-request cost visible?

  1. Distributed training on Azure ML

    PyTorch DDP for smaller models and FSDP for models above the single-GPU memory ceiling, running across 8× NVIDIA A100s on Azure ML compute. Checkpoint strategy tuned so failure-restart economics did not dominate wall-clock cost.

  2. vLLM inference on AKS

    vLLM with paged attention deployed on Azure Kubernetes Service, with separate request classes for latency-sensitive traffic and batch workloads. Autoscaling driven by request-class-aware metrics, not raw CPU.

  3. Per-request cost attribution

    Custom middleware tagged every inference with tenant, feature, model, and a cost envelope. A daily pipeline aggregated usage into feature-level chargeback reports — the first time leadership could ask ROI questions at that granularity.

  4. End-to-end observability

    MLflow for experiment and model-lineage tracking, Azure Monitor and Prometheus for infra and request metrics, and structured logs traceable from request ID to GPU-second. Drift and quality gates ran on a held-out evaluation set on a fixed cadence.

  5. Security posture by default

    Private Endpoints on model and artifact storage, Key Vault-backed secrets, RBAC-scoped compute, and reproducible training runs anchored to specific commits, datasets, and compute SKUs.
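The DDP-below-the-ceiling, FSDP-above-it split from the training tier can be sketched as a sizing heuristic. The byte counts and headroom factor below are illustrative assumptions (roughly fp16 weights and gradients plus fp32 master weights and Adam moments), not the engagement's actual sizing rule:

```python
A100_MEM_GIB = 80.0  # assumes the A100 80 GB SKU; the 40 GB SKU halves this

def train_state_gib(n_params: float, bytes_per_param: float = 16.0) -> float:
    """Approximate per-replica training state: fp16 weights (2 B) and
    grads (2 B), fp32 master weights (4 B), and Adam moments (8 B)
    come to ~16 bytes/param -- activations deliberately excluded."""
    return n_params * bytes_per_param / 2**30

def pick_strategy(n_params: float, gpu_mem_gib: float = A100_MEM_GIB,
                  headroom: float = 0.7) -> str:
    """DDP while the whole training state fits one GPU (leaving
    headroom for activations); shard with FSDP once it does not."""
    fits = train_state_gib(n_params) <= headroom * gpu_mem_gib
    return "DDP" if fits else "FSDP"
```

Under these assumptions a ~1.3B-parameter model stays on DDP, while a ~7B model crosses the single-GPU ceiling and moves to FSDP.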
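The request-class-aware autoscaling signal for the inference tier can be sketched as follows. The class names and per-replica queue targets are hypothetical; in practice a computation like this would feed a custom-metrics autoscaler rather than scaling on raw CPU:

```python
import math
from dataclasses import dataclass

@dataclass
class ClassLoad:
    name: str          # e.g. "interactive" vs "batch" (hypothetical classes)
    queued: int        # requests currently waiting for this class
    per_replica: int   # queue depth one replica can absorb within its SLA

def desired_replicas(loads: list[ClassLoad],
                     min_replicas: int = 1, max_replicas: int = 16) -> int:
    """Size the deployment for the most pressured request class:
    replicas needed = ceil(queued / per-replica capacity), per class."""
    need = max((l.queued / l.per_replica for l in loads), default=0.0)
    return max(min_replicas, min(max_replicas, math.ceil(need)))
```

With 45 interactive requests queued against a per-replica target of 8, and 200 batch requests against 64, the interactive class dominates and the deployment scales to 6 replicas even though the batch class alone would only ask for 4.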
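The tagging-plus-aggregation flow behind the chargeback reports can be sketched in a few lines. The model name and per-1K-token rates below are illustrative placeholders; in the real system, rates would be derived from measured GPU-hour cost and throughput:

```python
from collections import defaultdict

# Illustrative $ per 1K tokens -- placeholder rates, not real pricing.
RATES = {"example-13b": {"prompt": 0.0004, "completion": 0.0008}}

def cost_envelope(model: str, prompt_tokens: int,
                  completion_tokens: int) -> float:
    """Cost estimate attached to a single inference."""
    r = RATES[model]
    return (prompt_tokens * r["prompt"]
            + completion_tokens * r["completion"]) / 1000

def tag_request(ledger: list, tenant: str, feature: str, model: str,
                prompt_tokens: int, completion_tokens: int) -> dict:
    """Middleware hook: every inference lands in the ledger with
    tenant, feature, model, and its cost envelope."""
    row = {"tenant": tenant, "feature": feature, "model": model,
           "cost_usd": cost_envelope(model, prompt_tokens, completion_tokens)}
    ledger.append(row)
    return row

def chargeback(ledger: list) -> dict:
    """Daily aggregation: ledger rows -> (tenant, feature) totals."""
    totals = defaultdict(float)
    for row in ledger:
        totals[(row["tenant"], row["feature"])] += row["cost_usd"]
    return dict(totals)
```

Because every row already carries tenant and feature, the daily report is a single group-by rather than a forensic reconstruction.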
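The request-ID-to-GPU-second traceability in the observability layer reduces to emitting one structured record per inference, keyed on the same ID used by metrics and the cost ledger. A minimal sketch with illustrative field names:

```python
import json
import time

def inference_log_record(request_id: str, model: str,
                         gpu_seconds: float, ts: float | None = None) -> str:
    """One JSON line per inference. The request_id is the join key
    across logs, metrics, and cost rows; gpu_seconds ties the request
    to the hardware it actually consumed."""
    record = {
        "ts": ts if ts is not None else time.time(),
        "request_id": request_id,
        "model": model,
        "gpu_seconds": round(gpu_seconds, 4),
    }
    return json.dumps(record, sort_keys=True)
```

Keeping the record flat and machine-parseable is what makes "trace this request ID to its GPU-seconds" a query rather than a grep-and-guess exercise.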

Outcome

What actually happened

Production GenAI on Azure with full cost observability, including per-request cost attribution that enabled ROI evaluation at the individual feature level.

8× A100
Training hardware
vLLM on AKS
Inference runtime
Per request
Cost granularity
MLflow-tracked
Lineage
  • Monthly chargeback reports became a standard input to feature prioritization — features were evaluated on ROI, not enthusiasm.
  • Inference latency SLAs held at target p95 under production load across request classes.
  • Training wall-clock time dropped meaningfully after FSDP + activation-checkpointing tuning.
  • Full reproducibility — every model in production traced to a specific commit, dataset version, and compute configuration.

Why it matters

The parts another team can take

  • Bake per-request cost into the inference path from day one. Retrofitting cost attribution onto an LLM system that was not designed for it is an entire second project.
  • FSDP pays off once models cross the single-GPU memory ceiling. Below that line, DDP is simpler and usually faster.
  • Treat LLM evaluation as a CI gate, not a quarterly exercise. Quality regressions compound silently otherwise.
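The CI-gate point above can be made concrete with a small sketch. Metric names, baseline values, and the regression tolerance are all illustrative:

```python
def quality_gate(scores: dict[str, float], baseline: dict[str, float],
                 max_regression: float = 0.02) -> dict:
    """Compare held-out eval metrics against the recorded baseline.
    Returns the failing metrics (empty dict == gate passes); a CI job
    would fail the pipeline whenever this is non-empty."""
    return {
        metric: {"score": score, "baseline": baseline[metric]}
        for metric, score in scores.items()
        if metric in baseline and score < baseline[metric] - max_regression
    }
```

A faithfulness score of 0.91 against a 0.92 baseline sits inside the 0.02 tolerance and passes; a drop to 0.85 blocks the deploy instead of surfacing in a quarterly review.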

Stack

  • Azure ML
  • Azure Kubernetes Service
  • PyTorch DDP / FSDP
  • vLLM
  • NVIDIA A100
  • MLflow
  • Azure Monitor
  • Private Endpoints

Next step

Want a similar read on your stack?

Start with a $249 Architecture Review, or book a 30-min discovery call for larger scope.

Public summary. Client-confidential specifics are not published. Figures reflect the engagement outcome as delivered.