Inside NVIDIA DSX: the AI factory playbook, and how PaletteAI helps you run it

Over the past decade, I’ve worked at NVIDIA, AWS and most recently DDN, helping enterprises and governments figure out how to stand up AI infrastructure that holds up in production.

So when NVIDIA introduced the DSX platform, my first reaction wasn't "ugh, another reference architecture." It was: this is the first time someone has put the entire AI factory problem on a single page, and meant it.

NVIDIA DSX is the playbook for how the rest of the decade's AI infrastructure gets built: silicon, software, power, cooling, networking, facilities, and operations, co-designed as one platform.

It’s also the framework Spectro Cloud has been building toward with PaletteAI for the past two years. PaletteAI is our Kubernetes-based platform for managing AI infrastructure, from bare metal to model to tokens, across data center and edge environments. Where DSX defines how the factory is designed and powered, PaletteAI is how it’s run.

Let me walk you through how DSX works, and where we fit in.

The AI factory needs its own playbook

The first wave of enterprise AI was built the way enterprise IT has always been built: order hardware, find rack space, stitch together a software stack, and hope networking and cooling keep up. That approach doesn’t work when you are trying to bring a hundred-megawatt site online on a deadline, or when token cost is the metric your CFO is grading you on.

AI factories have constraints that traditional data centers don't. Power is the hard ceiling, not floor space. Cooling has to be designed around 45°C liquid loops, not 22°C air. Networking has to handle east-west traffic patterns that look nothing like the north-south traffic of a tiered web application. The software stack changes every quarter as new models, new inference servers, and new schedulers ship. And the whole thing has to be operable by humans, at carrier-grade reliability, and across multiple sites and tenants from day one.

What's been missing is a common spine that ties all of that together. Each team optimizes its own layer (compute, facilities, or software), and the integration tax lands on the schedule as slipped go-live dates, stranded GPUs, and token costs that don’t pencil out. DSX closes that gap.

The six components of NVIDIA DSX

DSX brings together open source, modular software libraries, APIs, reference designs, NVIDIA-accelerated computing platforms, and partner technologies into a single co-designed platform for AI factory design, deployment, and operations. NVIDIA's framing is that they are perfectly positioned to align every layer of the stack at once, because they're already involved in every layer.

The platform now has six named elements, and it's useful to look at them as a stack rather than a feature list:

DSX Reference Design — generation-specific, validated designs covering compute, networking, storage, and the facilities side (power, cooling, controls, structural).
DSX Sim — a digital twin framework for AI factories that helps teams model designs, simulate and validate before deployment, plan facilities, design the network, validate partner integrations, and test every change.
DSX Air — enables AI factory builders to map out, plan, and test the compute, networking, storage and security infrastructure in a cloud-based environment, accelerating operations and streamlining deployment before any hardware is unboxed. (See our technical deep dive on DSX Air for multi-tenant AI infrastructure for how this plays out with PaletteAI.)
DSX MaxLPS — a suite of technologies that maximize token performance per megawatt inside a fixed power budget, combining liquid cooling with rack-to-rack efficiency tuning. When power is your binding constraint, this turns the same substation into more usable intelligence.
DSX Flex — connects the factory to the power grid. Workloads adapt to load shedding, demand response, and pricing signals; the site becomes a flexible load strengthening the grid and facilitating faster connections to power.
DSX Exchange — the integration fabric between IT, operational technology, and the agents increasingly running both.

Tying it all together is DSX OS — open source, modular software purpose-built for AI factory operations. It handles intelligent scheduling, runtime consistency, health automation, resiliency, and multi-tenant platform services across the full lifecycle.
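To make the DSX Flex idea concrete, here is a minimal sketch of how a scheduler might react when the grid asks the site to cap its draw. This is not the DSX Flex or DSX OS API; the `Job` type, the `shed_load` function, and the power figures are all hypothetical, chosen only to illustrate the "flexible load" behavior described above.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    power_kw: float   # estimated draw while running (hypothetical figure)
    deferrable: bool  # True for work that can pause, e.g. batch training


def shed_load(jobs: list[Job], cap_kw: float) -> tuple[list[Job], list[Job]]:
    """Split jobs into (running, paused) under a site power cap.

    Non-deferrable jobs always keep running, even if they alone exceed the cap.
    Deferrable jobs are then admitted smallest-first while headroom remains.
    """
    running = [j for j in jobs if not j.deferrable]
    paused: list[Job] = []
    used = sum(j.power_kw for j in running)
    deferrable = sorted((j for j in jobs if j.deferrable), key=lambda j: j.power_kw)
    for job in deferrable:
        if used + job.power_kw <= cap_kw:
            running.append(job)
            used += job.power_kw
        else:
            paused.append(job)
    return running, paused


# Hypothetical demand-response event: the grid asks the site to stay under 600 kW.
jobs = [
    Job("inference-prod", 400.0, deferrable=False),
    Job("finetune-a", 300.0, deferrable=True),
    Job("batch-eval", 150.0, deferrable=True),
]
running, paused = shed_load(jobs, cap_kw=600.0)
print("running:", [j.name for j in running])
print("paused: ", [j.name for j in paused])
```

A real implementation would also weigh pricing signals, job priority, and checkpoint cost; the greedy smallest-first rule here is just the simplest policy that keeps production inference up while deferrable work absorbs the cut.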

DSX hits three vital issues for every enterprise AI project

Almost every AI infrastructure conversation I’ve been a part of this year circles back to the same three topics: cost, speed and ecosystem openness. DSX speaks to all of them.

‘Lowest token cost’ is the new benchmark. Last year, the main question was whether you could even get GPUs. Now it is about how much intelligence you can produce per dollar and per watt. DSX is explicitly engineered around that, turning every megawatt into more tokens, and shrinking the time between first power and first production.
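The arithmetic behind "tokens per watt" is simple enough to sketch. The snippet below is a back-of-the-envelope model, not an NVIDIA formula: the function names are mine, and the throughput, power, and electricity price are hypothetical inputs picked only to show how a throughput gain inside a fixed power budget flows through to energy cost per token.

```python
def tokens_per_mwh(tokens_per_second: float, power_mw: float) -> float:
    """Tokens produced for each megawatt-hour of energy the site draws."""
    return tokens_per_second * 3600 / power_mw


def energy_cost_per_million_tokens(
    tokens_per_second: float, power_mw: float, usd_per_mwh: float
) -> float:
    """Electricity cost (USD) to produce one million tokens."""
    return usd_per_mwh / tokens_per_mwh(tokens_per_second, power_mw) * 1_000_000


# Hypothetical site: fixed 10 MW budget, electricity at 80 USD/MWh.
POWER_MW = 10.0
USD_PER_MWH = 80.0

before = energy_cost_per_million_tokens(2_000_000, POWER_MW, USD_PER_MWH)
after = energy_cost_per_million_tokens(2_500_000, POWER_MW, USD_PER_MWH)

print(f"before tuning: {before:.4f} USD per million tokens")
print(f"after tuning:  {after:.4f} USD per million tokens")
print(f"reduction:     {(1 - after / before):.0%}")
```

Energy is only one line item next to capital cost, cooling overhead, and utilization, but the shape of the result holds: when power is fixed, every gain in throughput lowers the cost of each token by the same proportion.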

Co-design is the only way to keep up. The hardware, the software, the cooling, and the power architecture are changing in lockstep, generation by generation. Operators who treat them as independent procurement tracks will lose to those who buy them as a single co-designed system. DSX is just such a system.

Open and modular wins. What’s most underappreciated about DSX is how much of it is open source. DSX OS, the reference designs themselves: NVIDIA isn’t playing gatekeeper here. That is a bet that the ecosystem grows faster when everyone is building against the same blueprint. From a partnerships seat, I can tell you the ecosystem is responding accordingly. Spectro Cloud, the system manufacturers, the cloud providers, the software vendors: we’re all leaning in.

Where PaletteAI fits: the operational control plane

A reference design tells you what to build, and DSX OS gives you a software substrate to build on, but you still need an operational control plane that can run the thing at scale, across sites and tenants, and keep pace with the rapid innovation in the ecosystem.

That's the job Spectro Cloud PaletteAI is built for.

PaletteAI is our platform for managing AI infrastructure from bare metal to model to token, across data center and edge environments. We are listed by NVIDIA as one of the NCP and software vendor ecosystem partners adopting DSX OS components.

A few specifics on how PaletteAI lights up a DSX deployment:

Immutable infrastructure as a security baseline. PaletteAI enforces declarative state across all managed clusters so runtime configuration cannot drift from a known-good baseline. For edge deployments, OS-level immutability can be achieved through Kairos to prevent filesystem-level tampering. For multi-tenant AI factories (and especially for sovereign and regulated deployments), that is the foundation that audit and certification rest on.
Permission-driven self-service for AI teams. Infrastructure teams define the roles, resource limits, and approved profiles. Data scientists and ML engineers deploy inside those boundaries without filing tickets. This is the practical bridge between the speed AI practitioners need and the control operators are accountable for.
A validated application catalog through PaletteAI Studio. PaletteAI Studio includes NVIDIA Dynamo for distributed inference, NIM microservices, Run:ai for GPU scheduling, and NVIDIA AI Enterprise alongside open source and partner tools like ClearML and LiteLLM. The catalog turns capabilities that normally take weeks of integration into a solution stack that an operator can deploy in a few clicks against a DSX-conformant cluster.
Multi-cluster, multi-site orchestration as a first-class concept. DSX is inherently a distributed architecture: gigascale sites, regional factories, edge nodes, and increasingly the AI Grid topology that federates them. PaletteAI has been a multi-cluster platform from day one. We didn't have to retrofit that, which matters when the operational scope expands from one site to twenty.
Designed to plug into DSX OS. PaletteAI adopts DSX OS components as the foundation and adds the operational control plane above them. The goal is an enterprise that picks up the DSX reference design and gets a running, manageable platform, one they can hand to a team and operate across sites, rather than a set of repos to finish assembling.
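The "declarative state" idea in the first item above comes down to comparing what a cluster should look like with what it actually looks like, and flagging every difference. Here is a minimal sketch of that comparison. It is not PaletteAI's implementation or API; `find_drift` and the configuration keys are hypothetical stand-ins for illustration.

```python
def find_drift(desired: dict, actual: dict, path: str = "") -> list[str]:
    """Recursively compare two nested configs and describe every difference."""
    drift: list[str] = []
    for key in sorted(set(desired) | set(actual)):
        here = f"{path}.{key}" if path else str(key)
        if key not in actual:
            drift.append(f"missing: {here}")
        elif key not in desired:
            drift.append(f"unexpected: {here}")
        elif isinstance(desired[key], dict) and isinstance(actual[key], dict):
            drift.extend(find_drift(desired[key], actual[key], here))
        elif desired[key] != actual[key]:
            drift.append(f"changed: {here} ({desired[key]!r} -> {actual[key]!r})")
    return drift


# Hypothetical known-good baseline and the state observed on a node.
baseline = {
    "driver": "1.2.3",
    "sysctl": {"vm.swappiness": 0},
    "ssh": {"password_auth": False},
}
observed = {
    "driver": "1.2.3",
    "sysctl": {"vm.swappiness": 10},
    "ssh": {"password_auth": False},
    "debug_shell": True,
}

for finding in find_drift(baseline, observed):
    print(finding)
```

In a real control plane the interesting part is what happens next: a reconciler reverts the drift automatically or raises it for audit, so the list of findings stays empty by default.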

The net effect: a customer following the DSX reference design, including the NVIDIA Vera Rubin platform, NVIDIA Spectrum-X Ethernet networking, validated storage, DSX MaxLPS thermals, and DSX Flex for grid-aware power orchestration, gets PaletteAI as the control plane that turns that blueprint into a running, multi-tenant, audit-ready AI factory. Time to first production token is compressed from weeks to days, day-two operations get boring in the way operations are supposed to be boring, and the infrastructure team gets its weekends back.

What I would tell a CTO

If you are planning your next build, start with the reference design and walk it end to end with your facilities team in the room. Much of the schedule risk on these builds is in power and cooling, not compute. DSX puts those on the same page as the GPUs for a reason.

Simulate before you build. DSX Sim and DSX Air exist so you stop discovering integration problems in deployment and production. The cost of a digital twin is a rounding error compared to the cost of a stranded megawatt.

Pick your operational control plane early, and pick one that is already aligned with where DSX is going. The reference design will keep evolving: Vera Rubin today, whatever comes next tomorrow. You want an operations layer that moves with NVIDIA's roadmap rather than against it.

If you have already made the hardware investment and are struggling to get to inference because token costs are too high, GPU utilization is lower than it should be, your AI teams are still waiting on environments, or you are not confident the platform will hold up across multiple sites and tenants, recognize that most of the time the problem is not the hardware. It is the operational layer above it.

PaletteAI is what we have built to close that gap by bringing lifecycle management, self-service access controls, and multi-cluster orchestration to infrastructure that is already in place, so you can realize the ROI on the GPUs you have already bought.

Either way, let's talk. The combination of NVIDIA and Spectro Cloud is more capable than either piece alone.

Reach out at spectrocloud.com/get-started, or find me on LinkedIn.

Jun 4, 2026