Sworn to secrecy
This organization is a leading federal technology provider supporting government agencies with infrastructure, systems integration, and managed services.
In this engagement, the company is functioning as a managed service provider (MSP), delivering and running a fully managed AI infrastructure environment inside a highly secure, air-gapped data center for a classified government customer.
While we have (obviously!) left out all the identifying details, this is a true story. All the challenges faced, decisions made, and results achieved are real.
At a glance
- Industry: Defense
- Footprint: Federal technology provider operating secure, classified infrastructure environments in support of government programs
- Objective: Deliver a production-ready, large-scale AI infrastructure platform within a 30-day deadline
- Key use cases: Distributed AI training on NVIDIA HGX systems, full-stack AI platform delivery in an air-gapped environment
- Solution: Spectro Cloud PaletteAI, air-gapped
- Outcome: Production deployment within 30 days, no-sweat full cluster rebuild the day before go-live, GPU capacity doubled in days with immediate full utilization
When infrastructure became the critical path
After winning the contract for a major classified federal program, the technology provider committed to delivering and operating a production-ready AI infrastructure environment under tight timelines. Acting as an MSP, the team needed to deploy a high-performance stack built on NVIDIA HGX systems, Spectrum-X networking, and Run:AI to support distributed training workloads.
This was not a typical Kubernetes deployment. It was the world’s first production deployment of the new NVIDIA Spectrum-X for Vanilla Kubernetes. Each host required slightly different networking configuration to support NVLink communication across GPUs, and distributed training depended on everything being perfectly aligned. If even one node was configured incorrectly, the system would not behave as expected.
For ten months, the team had been working on building out the infrastructure with Red Hat OpenShift, during which time they had been hand-configuring each individual host. The team lacked a consistent way to provision and operate the environment end to end, from bare metal through distributed training workloads, without rebuilding nodes individually.
Scaling the environment or rebuilding nodes meant revisiting complex setup steps, often with direct vendor involvement. NVIDIA engineers remained deeply engaged just to keep the project moving, underscoring how high-touch and fragile the deployment had become.
They brought up a cluster in pre-production and could run workloads, but put simply, they didn’t trust it. Configuration varied from node to node and the team was not comfortable declaring the environment production-ready.
Now, with 30 days remaining before the planned go-live, the team leader faced mounting pressure and had to make a tough call: stick with the hand-wired environment and risk missing the deadline or delivering sub-par outcomes, or make a dramatic last-minute change of direction.
A shift from hand-built infrastructure to a full-stack blueprint
At this point, Spectro Cloud entered the conversation. Spectro Cloud had an existing technical relationship with NVIDIA as a Preferred Partner and integrated closely with NVIDIA’s AI stack, making it a credible alternative given the program’s choice of NVIDIA HGX systems. After initial discussions and a technical review, the federal technology provider chose to evaluate Spectro Cloud’s platform live.
From the outset, the difference was structural. Rather than layering Kubernetes components one by one onto an already complex setup, Spectro Cloud’s PaletteAI platform captured the customer’s complete environment in a declarative full-stack profile, covering bare metal provisioning, networking, storage integration, and the Run:AI layer on top. The team received a small, low-side environment to validate the approach, and managed to stand up a functioning cluster within a week.
What had previously required literally months of manual coordination across vendors now followed a clear and repeatable deployment model. That clarity proved even more important when work moved to the classified high-side environment where vendors couldn’t follow. Even though they were brand new to the Spectro Cloud platform, the federal provider’s technical team could replicate the PaletteAI build themselves, promoting artifacts and applying the same configuration pattern, without requiring on-site Spectro Cloud delivery resources.
This shift did more than accelerate deployment of the project’s initial 32-node cluster. The team could finally see a path not just to stabilization, but to an operating model they could own and manage directly. It gave them confidence that they could rebuild, expand, and update the environment in a controlled way. In a program where timeline and reliability carried contractual weight, that confidence became the deciding factor.
Delivering under pressure and redefining what was possible
The success in the first week gave the team confidence that choosing Spectro Cloud was the right move under intense time pressure. But the real impact became clear in the final stretch before go-live.
The day before go-live, Run:AI released a critical patch that required touching all 32 nodes. A traditional rolling upgrade process would have taken days and introduced risk at the worst possible time. Instead, the team used PaletteAI to create a new version of the cluster profile definition, wiped the entire environment, and rebuilt the full stack from bare metal. The next day, the customer went live as planned.
Under the previous build process, it would have been simply unthinkable to scrub and redeploy a GPU cluster of that size the day before launch; in fact, any configuration change or rebuild was risky and unpredictable. But with PaletteAI’s declarative cluster profile in place, the team could treat the environment as cattle, not pets or snowflakes.
The gains did not stop at go-live. As demand for compute increased, the team doubled the number of HGX nodes in a matter of days. Additional GPU capacity went from loading dock to live in production faster than anyone could have hoped. Instead of months of incremental progress, expansion became a predictable operational motion.
When a later software issue required another disruptive cluster-wide update, the team created a new cluster profile version and executed rolling node upgrades without destabilizing workloads. Rather than revisiting complex setup steps or coordinating across multiple vendors, all they had to do was apply changes through the same declarative model that brought the system into production in the first place.
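The core pattern the team relied on, defining the desired stack once as a versioned profile and letting the platform reconcile every node to it rather than editing hosts by hand, can be loosely sketched as follows. This is an illustrative sketch only: the profile fields, version strings, and `reconcile` function here are hypothetical, not PaletteAI's actual schema or API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusterProfile:
    """Illustrative declarative cluster profile (hypothetical fields, not PaletteAI's real schema)."""
    version: int
    os_image: str
    kubernetes: str
    runai: str

def reconcile(nodes: dict, profile: ClusterProfile) -> dict:
    """Drive every node to the desired profile version.

    In a declarative model, nodes that already match are left alone and
    drifted nodes are rebuilt to spec, so the operation is the same whether
    one node changed or all of them did.
    """
    return {name: profile.version for name in nodes}

# Version 1 of the profile brings the 32-node cluster into production.
v1 = ClusterProfile(version=1, os_image="example-os", kubernetes="example-k8s", runai="example-runai")
nodes = {f"hgx-{i:02d}": v1.version for i in range(32)}

# A patch release becomes a new immutable profile version,
# not a sequence of per-node manual edits.
v2 = ClusterProfile(version=2, os_image="example-os", kubernetes="example-k8s", runai="example-runai-patched")
nodes = reconcile(nodes, v2)

assert all(version == 2 for version in nodes.values())
```

The key design property is that the new version is a complete description of the desired state, so a rolling upgrade and a full scrub-and-rebuild are just two execution strategies for the same declared target.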
Vendor dynamics shifted as well, in a good way. NVIDIA’s experts could step back from hands-on troubleshooting to light validation and guidance. The provider’s engineers owned the deployment model directly, reducing operational friction and dependence on external intervention.
Most importantly, the team met its contractual commitment. They delivered a production-ready AI environment from a standing start within 30 days, and turned what had been a fragile, months-long effort into a repeatable service capability that the organization can now carry into future classified initiatives.
What they’re building toward next
With a production-proven model in place, the provider now plans to extend this architecture to additional federal programs. What began as a hail-mary rescue mission has become a blueprint for delivering large-scale AI infrastructure in secure, classified environments.
By standardizing how GPU-intensive environments are built and operated, the team can shift its focus from stabilizing infrastructure to accelerating mission-critical outcomes. The same operating model that protected a high-visibility contract now positions the organization to pursue future classified federal programs and AI initiatives with greater confidence and speed.






