Nvidia’s AI Factory Vision Comes Into Focus With Rubin CPX

Representation of a modern AI factory
Nvidia
At the InfraAI Global Summit’25, Nvidia announced a new member of its upcoming Vera Rubin data center AI product family. The Rubin CPX will complement the standard Rubin AI Graphics Processing Unit (GPU), delivering high-value inference content generation more cost-efficiently. More importantly, it fits into the infrastructure Nvidia has designed for data centers built around multiple types of AI GPUs.
Tirias Research has consulted for Nvidia and other AI companies mentioned in this article.
Tirias Research has long forecast the need for a variety of AI inference accelerators from companies like AMD, Intel, Nvidia, and anyone else developing AI semiconductor solutions. As with any other data center workload, no two AI models are the same. As consumers and enterprises adopt AI and AI models continue to evolve, there will be an opportunity to optimize the hardware around an AI model or groups of models. However, GPUs will remain one of the best solutions for both AI training and AI inference processing for two key reasons, which Nvidia is building upon with the Rubin CPX announcement.
The Value Of The AI GPU
The first reason is the nature of the semiconductor industry. The tech industry swings like a pendulum. When new technology is introduced, there is a period of rapid innovation, or in the case of AI, daily innovation. When the pace of innovation slows, standards emerge. At this point, it makes sense to consider optimizing a functional task into a dedicated chip known as an application-specific integrated circuit (ASIC). In many cases, that function may eventually be integrated into a host processor like a Central Processing Unit (CPU) or GPU. However, developing a custom chip or functional block can take three or more years. With new models and ways to process these models changing rapidly, the GPU is a more practical solution than an ASIC for most AI applications.
The technology pendulum that swings with each new technology
Tirias Research
The second reason is the ability of GPUs to be partitioned to handle multiple AI models concurrently. There is a myth that a transition from AI training to AI inference is coming in the near future. With the deployment of models like OpenAI’s ChatGPT models, Google’s Gemini, Microsoft’s Copilot, DeepSeek’s R and V series models, Anthropic’s Claude, Perplexity AI and countless others, the vast majority of AI processing across the industry is already inference processing. If such a transition point existed, it would have been crossed several years ago. With the programmable efficiency of AI GPUs and the buildout of GPU-enabled data centers, most AI workloads, especially generative AI and agentic AI, are running on GPUs because they are the most efficient option.
Nvidia’s AI GPU Buildout
At GTC 2025, Nvidia introduced several key technologies for building AI-centric data centers. These included the NVL144 rack design, KV Cache, Dynamo, data center blueprints and enhancements to the company’s NVLink, Spectrum-X, and Quantum-X networking technologies. KV Cache stores computed key and value tensors so they can be reused in subsequent token generation and shared between GPUs. Dynamo is an open-source inference framework for planning and routing AI workloads in the data center, essentially a data center workload orchestrator. The NVL144 rack design and Nvidia networking technologies form the infrastructure of the data center. And the data center blueprints running on Omniverse provide a digital twin for the design, construction, and operation of an AI data center, or AI factory as Nvidia refers to them. Now, Nvidia has introduced the Rubin CPX, an AI GPU inference accelerator designed to do a narrower set of functions exceptionally well. With Rubin CPX, Nvidia takes another step toward an AI factory that can be tuned for specific AI functions.
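To make the KV Cache idea concrete, here is a minimal, illustrative Python sketch of a single-head attention cache. The class and variable names are hypothetical and this is not Nvidia's implementation; it simply shows why storing previously computed key and value tensors lets each new token attend over the cached history instead of recomputing it.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Toy single-head KV cache: keeps the key/value tensors of past tokens
    so each new token only computes attention against the stored history."""
    def __init__(self):
        self.keys = []    # one (d,) key vector per processed token
        self.values = []  # one (d,) value vector per processed token

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        K = np.stack(self.keys)                  # (t, d) cached keys
        V = np.stack(self.values)                # (t, d) cached values
        scores = K @ q / np.sqrt(q.shape[-1])    # (t,) attention scores
        return softmax(scores) @ V               # (d,) weighted sum of values

# Decode-loop sketch: earlier keys/values are reused, never recomputed.
d = 8
cache = KVCache()
rng = np.random.default_rng(0)
for step in range(4):
    k, v, q = rng.normal(size=(3, d))  # stand-ins for the new token's projections
    cache.append(k, v)
    out = cache.attend(q)
print(out.shape)  # (8,)
```

In a full serving stack, this cached state is the kind of data the article describes being reused across generation steps and shared between GPUs.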
Using disaggregated processing for inference workloads
Nvidia
Nvidia refers to Rubin CPX as a context inference accelerator designed for very complex AI tasks, such as working across millions of lines of software code, generating hours of video, and conducting deep research. The Rubin CPX works in conjunction with the Vera CPU and Rubin AI GPU. In this disaggregated approach, the Rubin CPX handles the context, or prefill, phase, ingesting the large volumes of input data, which requires high compute performance. The Vera CPU and Rubin AI GPU then take the resulting context and handle the generation phase, producing the output or content token by token, a phase that is more reliant on memory and networking bandwidth. As a result, the Rubin CPX, while built on the same Rubin AI GPU architecture, is designed differently than the standard Rubin AI GPU, with 128GB of GDDR7 memory plus hardware encode and decode engines to support video workloads. The Rubin CPX is capable of 30 petaFLOPS of performance using the NVFP4 data format, offers a 3x increase in attention acceleration compared to the GB300 NVL72, and can process a one-million-token context window. The memory and architecture changes result in a reduction of approximately 20 petaFLOPS of overall performance relative to the standard Rubin GPU but an increase in efficiency for context processing.
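As a rough illustration of this disaggregated split, the Python sketch below separates a compute-bound context (prefill) step from a bandwidth-bound, token-by-token generation (decode) step. The function names, the KVState container, and the stand-in "model" are invented for illustration and do not reflect Nvidia software.

```python
from dataclasses import dataclass

# Hypothetical sketch of disaggregated inference: one pool of accelerators
# (the context workers, analogous to Rubin CPX) processes the full prompt and
# produces cached state; a second pool (the generation workers, analogous to
# standard Rubin GPUs) streams output tokens from that cached state.

@dataclass
class KVState:
    tokens: list  # prompt token ids; a real system would hold per-layer key/value tensors

def context_phase(prompt_tokens):
    """Compute-bound prefill: process the entire (possibly million-token)
    prompt once and hand off the resulting state."""
    return KVState(tokens=list(prompt_tokens))

def generation_phase(kv_state, max_new_tokens=4):
    """Bandwidth-bound decode: produce one token at a time, repeatedly
    reading the cached state."""
    out = []
    for _ in range(max_new_tokens):
        next_token = sum(kv_state.tokens) % 50_000  # stand-in for the model
        kv_state.tokens.append(next_token)
        out.append(next_token)
    return out

kv = context_phase(range(1_000))  # would run on the context accelerator
print(generation_phase(kv))       # would run on the standard AI GPU
```

The design point is that the two phases stress different resources, which is why the context accelerator can trade HBM for less expensive GDDR7 without hurting its target workload.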
Nvidia plans to offer the Rubin CPX integrated into a single rack with the Vera CPU and Rubin AI GPU, called the Vera Rubin NVL144 CPX, and as a separate accelerator rack that pairs with the standard Vera Rubin NVL144 rack. The Vera Rubin NVL144 CPX rack will be configured with 36 Vera CPUs, 144 Rubin AI GPUs, and 144 Rubin CPX GPUs, with 100 TB of high-speed memory and 1.7 PB/s of memory bandwidth. The result is eight exaFLOPS of NVFP4 performance, a 7.5x increase over the GB300 NVL72 rack. According to Nvidia, a $100 million CAPEX investment could generate up to $5 billion in token revenue, a 30x to 50x return on investment (ROI). The dual-rack solution will offer the same performance with an additional 50 TB of memory.
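As a quick back-of-envelope check on that claim, the snippet below simply divides the quoted revenue by the quoted investment; the figures are Nvidia's projections, not independent measurements.

```python
# Nvidia's quoted figures, not independent measurements.
capex = 100e6          # $100 million CAPEX investment
token_revenue = 5e9    # up to $5 billion in token revenue
print(f"{token_revenue / capex:.0f}x")  # 50x, the top end of the quoted 30x-50x ROI range
```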
The Vera Rubin NVL144 CPX and Vera Rubin CPX + Vera Rubin NVL144 configurations
Nvidia
Expect More
The Rubin CPX is an AI GPU inference accelerator platform focused on high-end, long-context generative applications. In the future, we will likely see other versions of the Nvidia AI GPU architectures that concentrate on different segments of AI processing, such as smaller AI models. We could even see versions of the CPX solution optimized for even more specific applications. AI is not a single uniform workload, and optimizing the accelerator is just one step in the process. More importantly, Nvidia continues to treat the entire data center as a single system to ensure that all potential performance bottlenecks are addressed, resulting in the highest possible performance efficiency and ROI.
A common question is whether the industry needs an annual cadence of new AI GPUs. At this point, the answer is yes: the industry needs new AI GPUs every year just to keep pace with the innovation in AI, and it needs GPUs optimized for the various types of AI workloads.