What the AI labs already know about bare metal Kubernetes

The companies training the world's most-watched AI models have already settled the bare metal Kubernetes debate, and they didn't pick the cloud-native default.

Meta's two 24,576-GPU clusters that trained Llama 3 run on its in-house Grand Teton bare metal hardware platform, with container orchestration sitting directly on the silicon.

CoreWeave, the neocloud powering OpenAI, Anthropic, Mistral and IBM Granite, has built its entire business on a Kubernetes service that runs directly on bare metal, with the hypervisor explicitly removed from the stack.

Google's documented reference path for serving frontier-scale models like Llama 3.1 405B and DeepSeek-R1 671B uses Kubernetes on bare metal nodes.

When the companies that train and serve the frontier models from OpenAI, Anthropic and Meta all converge on the same architectural choice, it's worth paying attention to what they know. The technical reason behind it is specific and well-documented, and you should understand it before you commission your own AI infrastructure.

The simplest reason: GPUs hate being abstracted

A modern AI accelerator is the most expensive computer in the building, and it spends most of its life waiting. CoreWeave estimates that up to 65% of the effective compute capacity in an enterprise's GPU fleet is lost to system inefficiencies. When you're paying eight figures for a cluster of H100s or B200s, every percentage point of lost utilization is real money walking out the door.

The hypervisor was a brilliant abstraction for the previous era of computing, when servers ran web apps and databases that didn't care exactly which CPU cycles they got. AI workloads care. They care about PCIe bandwidth between the GPU and the CPU, about the scheduling jitter introduced by a hypervisor's resource arbitration, about the cache locality of a long training run that mustn't be interrupted. Independent benchmarks consistently put the hypervisor tax at 5–15% of CPU and memory cycles, and Spectro Cloud's own historical research on bare metal Kubernetes put the figure at 7–10%. If that overhead translates directly into lost capacity, a 7–10% tax on a $40,000 GPU node is $2,800 to $4,000 of hardware you paid for and can't use. Across a fleet of hundreds of nodes running for three years, that's a sum that pays a small engineering team.
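
To make that arithmetic concrete, here is a minimal Python sketch. It assumes, as a simplification, that the overhead fraction translates one-for-one into lost value on the purchase price. The function name and the fleet size of 500 nodes are hypothetical, chosen only for illustration; the $40,000 node cost and the overhead percentages are the figures quoted above.

```python
def hypervisor_tax_cost(node_cost_usd: float, overhead: float, nodes: int = 1) -> float:
    """Capital value of capacity lost to virtualization overhead.

    Simplifying assumption: an overhead fraction of x wastes x of the
    purchase price. Real workloads will not map this cleanly.
    """
    if not 0.0 <= overhead <= 1.0:
        raise ValueError("overhead must be a fraction between 0 and 1")
    return node_cost_usd * overhead * nodes


NODE_COST = 40_000  # per-node figure quoted in the article
FLEET_SIZE = 500    # hypothetical fleet size, for illustration only

if __name__ == "__main__":
    for overhead in (0.05, 0.07, 0.10, 0.15):
        per_node = hypervisor_tax_cost(NODE_COST, overhead)
        fleet = hypervisor_tax_cost(NODE_COST, overhead, FLEET_SIZE)
        print(
            f"{overhead:.0%} overhead: ${per_node:,.0f} per node, "
            f"${fleet:,.0f} across {FLEET_SIZE} nodes"
        )
```

Swap in your own node cost and fleet size. The point is only that a single-digit percentage becomes a large absolute number once it is multiplied across a fleet.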

There's a counter-argument worth taking seriously here: VMware's recent MLPerf submissions claim that VCF with vGPU retains roughly 99% of bare metal performance for certain inference workloads. The operative word is "certain." Steady-state inference of a small model on a single GPU is a fundamentally different beast from a multi-week, multi-thousand-GPU training run where one mis-scheduled tensor parallel boundary can write off a day's compute. The serious AI labs avoid virtualization for a reason, and that reason isn't ignorance.

NVIDIA's reference architectures take the same position

If you want a tiebreaker on this argument, NVIDIA has already cast its vote. The company's AI Enterprise documentation includes a full Kubernetes deployment guide that's specifically labeled as the bare metal path. NVIDIA's Enterprise Reference Architectures, which are the blueprints partners use to build certified AI factories, specify open-source Kubernetes with NVIDIA AI Enterprise and Run:ai as the software stack — running directly on bare metal compute. And NVIDIA's published architecture for confidential AI is described, in NVIDIA's own words, as a blueprint for building zero-trust AI factories on bare-metal infrastructure.

If you're building anything that touches NVIDIA's full AI stack (Spectrum-X networking, BlueField DPUs, Run:ai, NIMs, NeMo), the documented happy path goes through Kubernetes on bare metal. You can layer virtualization on top later if your specific use case needs it. Going the other way around is harder and slower.

Where bare metal shines for AI

The performance argument is the headline, but it isn't the whole story. Three other reasons keep coming up in real deployments.

GPU access is the obvious one. Containers running on bare metal Kubernetes get direct PCIe access to the accelerator, with no virtualization layer mediating calls to CUDA, NVLink or RDMA. For training jobs that span tens or hundreds of nodes, this matters more than any single performance number, because the slowest GPU in the job sets the pace for all the others.
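
The "slowest GPU sets the pace" effect can be illustrated with a toy simulation. This sketch assumes synchronous training, where every worker must finish a step before the next one starts, and it models per-GPU delay as uniform random jitter. The function names, the one-second base step time and the jitter range are all hypothetical; this is not a model of any real cluster.

```python
import random


def synchronous_step_time(per_gpu_times: list[float]) -> float:
    """A synchronous step finishes only when the slowest worker does."""
    return max(per_gpu_times)


def simulate(num_gpus: int, base_s: float, jitter_s: float,
             steps: int, seed: int = 0) -> float:
    """Mean step time when each GPU takes base_s plus uniform jitter
    drawn from [0, jitter_s] on every step."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        times = [base_s + rng.uniform(0.0, jitter_s) for _ in range(num_gpus)]
        total += synchronous_step_time(times)
    return total / steps


if __name__ == "__main__":
    # Hypothetical numbers: 1.0 s of useful work, up to 0.1 s of jitter per GPU.
    for n in (1, 8, 64, 512):
        mean = simulate(num_gpus=n, base_s=1.0, jitter_s=0.1, steps=200)
        print(f"{n:4d} GPUs: mean step time {mean:.3f} s")
```

With a single GPU, the average step pays roughly half the jitter range. As the job grows, the average step time drifts toward the worst case, because some GPU is almost always near the slow end. That is why small, per-node sources of jitter matter more as jobs get larger.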

Cost predictability is the second. Public cloud GPU pricing is volatile, capacity is constrained, and reserved instances lock you into specific hardware generations that may not be the right fit by the time the contract ends. Owning the hardware means you know what your AI workload costs to run for the next three years. For a workload that's reliably busy (a recommendation engine, a large fine-tuning pipeline, an enterprise inference platform serving thousands of users) that's a much better deal than spot pricing on someone else's GPU.

Sovereignty is the third, and it's getting harder to ignore. If your training data is regulated (patient records, financial transactions, classified intelligence, intellectual property), there is a growing number of cases where it cannot legally leave your data center. Bare metal Kubernetes gives you the cloud-native operating model on infrastructure you actually own, in a building you actually control. That's the whole pitch behind the NVIDIA AI Factory for Government reference design, which includes Spectro Cloud.

What bare metal isn't good at

Honesty is more useful than evangelism here. Bare metal Kubernetes is a genuinely poor fit for some AI workloads, and the discussion goes off the rails when people pretend otherwise.

Bursty experimentation is the obvious example. If your data scientists need 64 H100s for three days, then nothing for two weeks, then 200 H100s for a frantic experiment, the cloud is the right answer. The economics of buying hardware that's only used 30% of the time don't work, no matter how good your bare metal stack is. The same goes for short-lived proof-of-concept projects, multi-region inference for a globally distributed app, and most early-stage research before you know what you're going to need at scale. Even if you have clear visibility of your requirements, the hardware lead times on self-operated infrastructure can be brutal.
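
A rough way to reason about that utilization threshold is a break-even calculation, sketched below. It assumes straight-line amortization of the purchase price over the hardware's lifetime plus a flat hourly operating cost, and compares that against an hourly cloud rate. The function names, the $0.50 operating cost and the $4.00 cloud rate are hypothetical placeholders, not quoted prices.

```python
HOURS_PER_YEAR = 8760


def owned_cost_per_hour(capex_usd: float, lifetime_years: float,
                        opex_usd_per_hour: float) -> float:
    """Hourly cost of owned hardware, paid whether or not it is busy."""
    return capex_usd / (lifetime_years * HOURS_PER_YEAR) + opex_usd_per_hour


def break_even_utilization(owned_hourly: float, cloud_hourly: float) -> float:
    """Fraction of hours the hardware must be busy before owning
    becomes cheaper than renting only the hours you use."""
    return owned_hourly / cloud_hourly


def cheaper_to_own(utilization: float, owned_hourly: float,
                   cloud_hourly: float) -> bool:
    """True when renting the busy hours would cost more than owning."""
    return utilization * cloud_hourly > owned_hourly


if __name__ == "__main__":
    # All inputs below are hypothetical and only illustrate the method.
    owned = owned_cost_per_hour(capex_usd=40_000, lifetime_years=3,
                                opex_usd_per_hour=0.50)
    cloud = 4.00
    print(f"owned: ${owned:.2f}/h, break-even at "
          f"{break_even_utilization(owned, cloud):.0%} utilization")
    for u in (0.30, 0.70):
        print(f"{u:.0%} busy -> cheaper to own: {cheaper_to_own(u, owned, cloud)}")
```

The sketch ignores lead times, staffing and resale value, so treat the result as a first filter for which workloads deserve a closer look.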

The pragmatic position most enterprises are landing on is hybrid. Bare metal Kubernetes for the workloads that are steady, predictable, performance-critical or compliance-bound. Cloud GPUs for the bursty, the experimental, and anything that needs to scale into a region you don't operate in. The useful question is which workloads belong where, decided deliberately, instead of defaulting everything into whichever model your platform team is most comfortable with.
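
One way to make that decision deliberate is to write the rule down. The sketch below encodes the criteria from the paragraph above as a small Python function. The class, the field names and the order of precedence (compliance first, then region, then steadiness or performance) are assumptions made for illustration; your own policy may weigh them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload; all fields are illustrative."""
    name: str
    steady: bool = False                 # predictable, reliably busy
    performance_critical: bool = False
    compliance_bound: bool = False       # data must stay on owned infrastructure
    needs_unserved_region: bool = False  # must run where you have no data center


def place(w: Workload) -> str:
    """Return "bare-metal" or "cloud" for a workload.

    Assumed precedence: compliance overrides everything, an unserved
    region forces cloud, then steady or performance-critical work goes
    to bare metal. Everything else (bursty, experimental) goes to cloud.
    """
    if w.compliance_bound:
        return "bare-metal"
    if w.needs_unserved_region:
        return "cloud"
    if w.steady or w.performance_critical:
        return "bare-metal"
    return "cloud"


if __name__ == "__main__":
    fleet = [
        Workload("fraud-model-training", steady=True, compliance_bound=True),
        Workload("weekend-hackathon"),
        Workload("apac-inference", steady=True, needs_unserved_region=True),
        Workload("recsys-serving", steady=True, performance_critical=True),
    ]
    for w in fleet:
        print(f"{w.name}: {place(w)}")
```

The value is less in the code than in forcing the criteria and their precedence into the open, where the platform team and the workload owners can argue about them.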

Operating bare metal at scale is the real engineering problem

Running bare metal in production is significantly harder than running clusters on EKS or GKE. Hardware provisioning, firmware management, OS lifecycle, GPU operator deployment, networking fabric configuration, day-2 patching across heterogeneous fleets: all of this is your problem now. That can expose skills gaps in your teams.

This is the operational reality that drove enterprises into virtualization 15 years ago. It hasn't gone away. What's changed is that the tooling has caught up: Cluster API, MAAS, the NVIDIA GPU and Network Operators, Spectro Cloud's own contributions to the Cluster API ecosystem, and full-stack platforms like Palette that treat the OS, Kubernetes, networking and AI software as a single declarative artifact.

The win condition for bare metal Kubernetes at AI scale is treating physical servers like cloud resources: declaratively provisioned, version-controlled in cluster profiles, automated through their full lifecycle, and managed exactly the same way as your virtualized clusters and your hyperscaler EKS estate. Anything less and you're rebuilding the operational pain that the cloud was supposed to solve.
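
The core of "declaratively provisioned" is a reconciliation step: compare the state you declared with the state you observe, and derive the actions that close the gap. The Python sketch below shows that idea in miniature. It is not the Cluster API or Palette interface; the `NodeSpec` fields, the action names and the version strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical declared configuration for one physical server."""
    os_image: str
    kubernetes_version: str
    gpu_driver: str


def plan_actions(desired: dict[str, NodeSpec],
                 observed: dict[str, NodeSpec]) -> list[tuple[str, str]]:
    """Diff desired state against observed state and return the
    (action, node_name) pairs needed to converge them."""
    actions: list[tuple[str, str]] = []
    for name in sorted(desired):
        if name not in observed:
            actions.append(("provision", name))
        elif observed[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(observed):
        if name not in desired:
            actions.append(("decommission", name))
    return actions


if __name__ == "__main__":
    # Placeholder values; in practice the desired state lives in version control.
    target = NodeSpec("os-image-v2", "k8s-v2", "driver-v2")
    desired = {"gpu-01": target, "gpu-02": target, "gpu-03": target}
    observed = {
        "gpu-01": target,                                        # already converged
        "gpu-02": NodeSpec("os-image-v1", "k8s-v1", "driver-v1"),  # drifted
        "gpu-99": target,                                        # no longer declared
    }
    for action, node in plan_actions(desired, observed):
        print(f"{action}: {node}")
```

A real controller runs this comparison continuously and executes the actions through provisioning tooling. The sketch only shows why a version-controlled declaration makes physical servers manageable in the same way as cloud resources.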

What to do next

If you're building or scaling AI infrastructure right now, here are three practical recommendations.

First, decide which AI workloads are steady enough to justify owning the hardware. If your team can name a specific training pipeline or inference service that's running 70%+ of the time on rented GPUs, that's the candidate.

Second, plan the operating model before you plan the hardware. The platform team you'll need to run bare metal Kubernetes well is the same team that runs your cloud Kubernetes — they shouldn't have to learn a different tooling stack just because the servers are in your building.

Third, don't try to build the full stack from scratch. The pieces — operating systems, Kubernetes distributions, GPU operators, schedulers, networking, storage, monitoring, AI runtimes — are all open source and all integrate with each other in theory. In practice, gluing them together is a year of platform engineering most teams don't have to spare. Use a platform that handles the integration so your engineers can focus on the workloads.

If you want to dig deeper, our introduction to bare metal Kubernetes covers the foundations, our five best practices for bare metal management covers the operational side, and the PaletteAI page walks through how the full-stack approach works in practice. The AI labs have already decided. The interesting question now is what your version looks like.

Jun 2, 2026