NVIDIA just built a 550-billion-parameter model that activates 55 billion parameters at a time and runs faster than anything in its class. The weights are free. The speed is not.

[Illustration: A lone scribe writes on a continuous scroll in a focused beam of light, while dozens of other scribes sit motionless and dusty in the dark behind him. The active scribe’s scroll feeds multiple ledgers through a system of pulleys, symbolizing efficient distribution from a single focused effort.]

Jensen Huang announced Nemotron 3 Ultra at Computex 2026. It is a Mixture-of-Experts model with 550 billion total parameters, 55 billion active parameters, a hybrid Mamba-Transformer architecture, and a 1 million token context window. The weights are on Hugging Face. The technical report is public. That is the surface story.

The real story is the business model. Nemotron 3 Ultra achieves up to six times higher inference throughput than state-of-the-art public LLMs on a benchmark with 8K input tokens and 64K output tokens: 5.9 times faster than GLM-5.1, 4.8 times faster than Kimi-K2.6, and 1.6 times faster than Qwen-3.5, according to NVIDIA Research’s technical report. That throughput advantage is not just a product of clever architecture. It depends on NVFP4 quantization, a precision format native to NVIDIA’s Blackwell GPU generation. You can download the weights today. You cannot replicate the 300 tokens per second that Artificial Analysis measured on a pre-release DeepInfra endpoint without NVIDIA’s latest silicon. This is not a research paper. This is a hardware demo dressed as an open-source release.

[Illustration: A medieval merchant weighs small, bright embers on a delicate scale as a crowd watches; a rival merchant struggles with a scale overflowing with cold, grey stones. The contrast between the valuable embers and the rival’s useless stones illustrates the advantage of selective, high-value computation.]

The family and the field

Nemotron 3 Ultra is the largest model in the Nemotron 3 family, which also includes Nano and Super sizes. NVIDIA released base, post-trained, and quantized checkpoints, along with training data and the recipe. The model is available through OpenRouter and as a NIM microservice on build.nvidia.com.

The competitive targets are explicit: DeepSeek, Qwen, Kimi, and GLM. These model providers built developer mindshare on dense or moderately sparse architectures served through high-throughput APIs. Nemotron 3 Ultra changes the terms of engagement. It shifts the axis of competition from total parameter count to active-parameter efficiency, and from benchmark accuracy to throughput-per-dollar on long-context agent workloads. The model scores 48 on the Artificial Analysis Intelligence Index, leading all U.S. open-weights models. That is table stakes. The throughput is the weapon.

The three-part efficiency engine

NVIDIA’s technical report details three mechanisms that produce the model’s speed.

First, the hybrid Mamba-Transformer design. Standard Transformer attention scales quadratically with sequence length, which kills performance on the 1 million token contexts that agent workflows require. Mamba layers replace attention in portions of the network, reducing that computational burden while preserving accuracy on long-range dependencies.
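
To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. It is not taken from NVIDIA's report: the FLOP formulas are standard approximations, and `d_model`, `d_state`, and the constant in `ssm_scan_flops` are hypothetical values chosen only to show how the gap widens with context length.

```python
# Rough scaling comparison. All sizes and constants are illustrative assumptions,
# not figures from the Nemotron 3 Ultra technical report.


def attention_flops(seq_len: int, d_model: int) -> int:
    """Approximate FLOPs for one self-attention layer's score and mixing steps.

    Computing QK^T costs about 2 * n^2 * d, and the attention-weighted sum over
    V costs about another 2 * n^2 * d, so the total grows with n squared.
    """
    return 4 * seq_len * seq_len * d_model


def ssm_scan_flops(seq_len: int, d_model: int, d_state: int) -> int:
    """Approximate FLOPs for one linear-time state-space scan.

    Each token updates a d_model x d_state recurrent state once, so the cost
    grows linearly with n. The factor 6 is a placeholder constant.
    """
    return 6 * seq_len * d_model * d_state


def cost_ratio(seq_len: int, d_model: int = 8192, d_state: int = 128) -> float:
    """How many times more expensive attention is than the scan at this length."""
    return attention_flops(seq_len, d_model) / ssm_scan_flops(seq_len, d_model, d_state)


for n in (8_192, 65_536, 1_048_576):
    print(f"{n:>9,} tokens: attention / scan cost ratio ~ {cost_ratio(n):,.0f}x")
```

The absolute numbers mean little. The point is that the ratio grows in proportion to context length, which is why swapping some attention layers for linear-time layers matters most at the long end.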

Second, LatentMoE routing. Mixture-of-Experts models split the network into specialized sub-networks and route each token to a subset of them. Nemotron 3 Ultra routes tokens such that only 55 billion of its 550 billion parameters are active for any given token. That 90 percent sparsity is the entire game. It means the model thinks with the compute footprint of a much smaller dense model while retaining the knowledge capacity of a giant one.
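
The routing idea can be sketched with generic top-k gating. This is a toy illustration of Mixture-of-Experts routing in general, not NVIDIA's LatentMoE, whose internals are not described here. The expert functions, the router logits, and the choice of 10 experts with `k=1` are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs, with weights renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts on x and mix their outputs.

    In a real model the router logits are computed from x by a learned layer;
    they are passed in directly here to keep the sketch small.
    """
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Toy setup: 10 scalar "experts", 1 active per token, so 10 percent of expert
# parameters do work for any given token.
experts = [lambda x, s=s: s * x for s in range(1, 11)]
logits = [0.2, -1.0, 3.1, 0.0, 0.7, -0.3, 1.9, 0.4, -2.2, 0.1]
print(route_top_k(logits, k=1))
print(moe_forward(2.0, experts, logits, k=1))
```

The experts that are not selected are never evaluated, which is where the compute savings come from. The full set of weights still has to sit in memory.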

Third, NVFP4 pre-training combined with Multi-Token Prediction. NVFP4 is a 4-bit floating-point format designed for Blackwell GPUs. Training the model in that format means inference runs natively on Blackwell hardware without conversion overhead. Multi-Token Prediction generates multiple output tokens in parallel, improving generative speed on multi-turn agent tasks.
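
For a feel of what a 4-bit floating-point format does to weights, here is a simplified simulation. It assumes the commonly described layout of a 4-bit E2M1 value grid with one shared scale per small block of elements. The block size of 16 and the plain Python float used for the scale are assumptions for illustration; real NVFP4 kernels store scales in a compact format and run on hardware, which this sketch does not attempt to model.

```python
import math
import random

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed block size for this sketch


def quantize_block(block):
    """Map one block of floats onto the E2M1 grid with a shared scale.

    The scale is chosen so the largest magnitude in the block lands on 6.0,
    the largest representable value. Returns (scale, quantized_values).
    """
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in block:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(math.copysign(nearest, v))
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the scale and the grid values."""
    return [scale * c for c in codes]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}  max abs error={max_err:.4f}")
```

Each weight shrinks to 4 bits plus a share of the block scale, at the cost of rounding error that grows toward the top of the range, where the grid is coarsest.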

A fourth element, Multi-Teacher On-Policy Distillation, functions as a quality flywheel. NVIDIA trains Nemotron 3 Ultra using feedback from over ten domain-specific teacher models, creating a continuous improvement loop that specializes the model for agentic reasoning.

The metric that will eat the industry

The consensus reaction to Nemotron 3 Ultra celebrates it as a triumph of open science and efficiency. That framing is naive. The model’s 90 percent sparsity and NVFP4 quantization are so deeply tied to Blackwell architecture that the word “open” is a polite fiction. You can inspect the weights. You can fine-tune them. You cannot achieve the headline throughput numbers without NVIDIA hardware. This is vertical integration executed under the banner of transparency.

Here is the prediction: within 12 to 24 months, the active-parameter ratio (active parameters divided by total parameters), not total parameter count, will become the primary marketing metric for enterprise LLMs. Nemotron 3 Ultra’s 55 billion active parameters producing 300-plus tokens per second establishes the new yardstick. Enterprise buyers purchasing inference for long-running agent pipelines will optimize for throughput-per-dollar on their specific workloads. Dense models, which activate every parameter for every token, will look economically irrational.
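
The two metrics in that prediction are simple to compute. The sketch below is a minimal calculator, not a pricing model: the GPU hourly price, GPU count, and aggregate throughput figures are hypothetical placeholders, and `tokens_per_second` here means the total output of a deployment across all concurrent requests, not the per-request speed quoted in benchmarks.

```python
def active_ratio(active_params: float, total_params: float) -> float:
    """Fraction of the model's parameters used for any single token."""
    return active_params / total_params


def cost_per_million_tokens(tokens_per_second: float,
                            gpu_hour_usd: float,
                            num_gpus: int = 1) -> float:
    """Serving cost in USD per one million output tokens.

    tokens_per_second is the aggregate throughput of the whole deployment.
    """
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd * num_gpus / tokens_per_hour * 1_000_000


# Figures from the article: 55B active out of 550B total.
print(f"active ratio: {active_ratio(55e9, 550e9):.0%}")
# A dense model activates everything, so its ratio is 1.0.
print(f"dense ratio:  {active_ratio(70e9, 70e9):.0%}")

# Hypothetical deployment: 8 GPUs at $4.00 per GPU-hour.
baseline = cost_per_million_tokens(2_000, gpu_hour_usd=4.0, num_gpus=8)
six_times = cost_per_million_tokens(12_000, gpu_hour_usd=4.0, num_gpus=8)
print(f"baseline: ${baseline:.2f}/M tokens, at 6x throughput: ${six_times:.2f}/M tokens")
```

On fixed hardware cost, a sixfold throughput gain cuts the cost per token to one sixth. Whether a buyer sees that saving depends on who owns the hardware.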

This dynamic forces a response from DeepSeek and Qwen. Both companies compete on API pricing and developer accessibility. If NVIDIA’s integrated stack delivers six times the throughput at comparable accuracy on agent benchmarks, the unit economics of serving dense or moderately sparse models collapse. DeepSeek and Qwen must ship their own Mamba-hybrid, ultra-sparse architectures optimized for long-context agent orchestration, or they cede the high-throughput, low-cost API market that sustains their developer ecosystems. The technical reports from both organizations already show movement toward MoE architectures. Nemotron 3 Ultra accelerates that trajectory.

The winners in this shift will be GPU-cloud providers offering NVFP4-native inference. They capture the margin that efficiency gains create. The losers will be pure-play model providers who lack a hardware moat. They can build a faster model, but they cannot monetize the efficiency if the inference cost savings flow to the cloud operator rather than to them. NVIDIA’s position is unique: it sells the GPUs, the networking, the inference microservice, and now the model itself as a loss leader that drives demand for the rest of the stack.

What operators should do now

Stop benchmarking on accuracy alone. Start measuring tokens per second and cost per token on your own agent workloads. Nemotron 3 Ultra’s 48 on the Artificial Analysis Intelligence Index is a useful signal. The more actionable signal is the throughput data: over 300 tokens per second on a pre-release DeepInfra endpoint, and up to six times faster than competing models on long-context inference. Those numbers are a preview of what Blackwell will deliver at scale.

Test NVFP4-native inference endpoints immediately. Nemotron 3 Ultra is available on build.nvidia.com as a NIM microservice and on OpenRouter. Run your agent pipelines against it. The open-source release means you can benchmark without a procurement cycle. The throughput claims either hold for your use case or they don’t. You need to know which before your competitors do.
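
A throughput test does not need much machinery. The harness below is a minimal sketch that times any function you hand it. The `generate` callable is a hypothetical stand-in: in practice you would wrap whatever client you use for the endpoint and return the number of output tokens it reports. No endpoint URL or model identifier is assumed here.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ThroughputResult:
    total_tokens: int
    total_seconds: float

    @property
    def tokens_per_second(self) -> float:
        if self.total_seconds <= 0:
            raise ValueError("no elapsed time recorded")
        return self.total_tokens / self.total_seconds


def measure_throughput(generate: Callable[[str], int],
                       prompts: Iterable[str],
                       clock: Callable[[], float] = time.perf_counter) -> ThroughputResult:
    """Time generate() on each prompt and total the output tokens.

    generate(prompt) must return the number of output tokens produced.
    Requests run one after another, so this measures per-request speed,
    not the throughput of a server under concurrent load.
    """
    total_tokens = 0
    total_seconds = 0.0
    for prompt in prompts:
        start = clock()
        total_tokens += generate(prompt)
        total_seconds += clock() - start
    return ThroughputResult(total_tokens, total_seconds)


def fake_generate(prompt: str) -> int:
    """Placeholder for a real endpoint call; pretends to emit 64 tokens."""
    time.sleep(0.01)
    return 64


result = measure_throughput(fake_generate, ["plan the task", "summarize the log"])
print(f"{result.total_tokens} tokens in {result.total_seconds:.3f}s "
      f"= {result.tokens_per_second:.0f} tok/s")
```

Use prompts and output lengths drawn from your real agent traces. A short-prompt test says little about the 8K-in, 64K-out regime the headline numbers come from.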

The 550-billion-parameter giant that uses only 10 percent of its brain is not a compromise. It is the new orthodoxy. The era of dense, compute-hogging models for enterprise agents is ending. NVIDIA just set the clock.