Analysis

The execution gap.

Peak compute per chip has risen roughly twentyfold over the past several hardware generations. The share we actually use has not moved with it, and the space between the two is now the most expensive thing in the data center.

Peak figures are dense FP16 from primary NVIDIA datasheets. *Rubin is a projection, not a published specification. Bands show the fraction of peak actually delivered: best-case training (38-47% MFU, e.g. Llama 3), production inference (20-40% before optimization), and a typical 18% line matching the homepage. Fleet-wide utilization runs lower still (~5%, Cast AI 2026). The distance from any band up to peak is the stranded compute.

What we built, and what we use

The supply side of this story is clear. Across successive NVIDIA generations, dense FP16 throughput climbed from 125 TFLOPS on a V100 to 312 on an A100, 989 on an H100, 2,250 on a B200, and 2,500 on a GB300: a twentyfold gain in peak compute per chip.¹,²,³,⁴
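The twentyfold figure follows directly from the datasheet numbers quoted above. The short Python sketch below only restates that arithmetic; the names `PEAK_TFLOPS` and `gain_over_baseline` are illustrative and do not come from any cited source.

```python
# Peak dense FP16 throughput per chip, in TFLOPS, as quoted in the text.
PEAK_TFLOPS = {
    "V100": 125,
    "A100": 312,
    "H100": 989,
    "B200": 2250,
    "GB300": 2500,
}


def gain_over_baseline(peaks, baseline="V100"):
    """Express each chip's peak as a multiple of the baseline chip's peak."""
    base = peaks[baseline]
    return {chip: tflops / base for chip, tflops in peaks.items()}


gains = gain_over_baseline(PEAK_TFLOPS)
for chip, multiple in gains.items():
    print(f"{chip:>6}: {PEAK_TFLOPS[chip]:>5} TFLOPS  ({multiple:.1f}x V100)")
```

Dividing by the V100 figure puts the B200 at 18 times the baseline and the GB300 at 20 times.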

Delivered performance tells the opposite story. Even a well-tuned training run reaches only 35 to 45 percent of theoretical peak, and that is the ceiling, not the floor. Meta's Llama 3 run, executed across sixteen thousand H100s with careful optimization, landed at 38 to 43 percent.⁵,⁶

Mixed production workloads fall far below even that. Direct measurement across tens of thousands of clusters in 2025 and 2026 puts average GPU use near 5 percent, which means organizations were paying for roughly twenty times the capacity they actually used.⁷,⁸ Inference does better but is still low, usually 20 to 40 percent before optimization, and the pattern is not limited to accelerators: ordinary server use has sat in the low double digits for years, with a meaningful share of installed machines doing no useful work at all.⁹,¹⁰,¹¹,¹⁷,¹⁸,¹⁹

The conclusion holds at every layer of the stack. Peak rose by a factor of twenty; the share we use barely moved.

The gap is the point

A low rate of use is easy to forgive when peak is small, because the waste it represents is small too. That stops being true once peak explodes. Hold the share you use flat and multiply it against a peak that has grown twentyfold, and the raw amount of wasted compute per chip grows almost in step with the peak itself.
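To make that concrete, here is a minimal Python sketch of the stranded-compute arithmetic. It assumes a flat 18 percent share of peak on every chip, borrowed from the "typical" line in the chart caption purely for illustration; `ASSUMED_SHARE` and `stranded_tflops` are hypothetical names, and the results are arithmetic on the quoted peaks, not measurements.

```python
# Peak dense FP16 throughput per chip, in TFLOPS, as quoted in the text.
PEAK_TFLOPS = {
    "V100": 125,
    "A100": 312,
    "H100": 989,
    "B200": 2250,
    "GB300": 2500,
}

# Assumption: the share of peak actually used stays flat across generations.
ASSUMED_SHARE = 0.18


def stranded_tflops(peak, share):
    """Peak capacity that is paid for but not turned into useful work."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    return peak * (1.0 - share)


stranded = {
    chip: stranded_tflops(peak, ASSUMED_SHARE)
    for chip, peak in PEAK_TFLOPS.items()
}
for chip, unused in stranded.items():
    print(f"{chip:>6}: {unused:7.1f} of {PEAK_TFLOPS[chip]} TFLOPS stranded")
```

Under this assumption the unused capacity per chip goes from about 102.5 TFLOPS on a V100 to about 2,050 TFLOPS on a GB300. With the share held exactly flat, the waste scales by the same factor of twenty as the peak.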

This is what we call the execution gap. It is not a percentage that looks reassuringly stable on a chart; it is a growing pool of capacity that was bought, powered, and cooled, and never turned into useful work. And it is widest exactly where the work is most valuable.

Why the gap exists

The gap exists because the workload changed shape while the execution model stayed still.

Infrastructure was designed for predictable work: web applications, APIs, databases, services, all built on the assumption that a job could be placed on a machine and left there. Modern workloads break that assumption at every turn. They are larger and more dynamic, they move between CPU-heavy and GPU-heavy stages, and they span clouds, regions, clusters, networks, and edge environments, placing demands on the system underneath them that it was never designed to handle.¹²

When work does not fit the machine, teams compensate. They layer on orchestration, placement logic, and manual coordination; they reserve peak headroom and let it sit idle to protect a latency target; they hold whole accelerators against bursts that rarely arrive at the assumed scale.¹³,²⁰ Each of these is a reasonable local decision, and together they add up to the industry's standing workaround for a gap it has not closed.

Low use, high cost, and rising complexity are not three problems. They are one mismatch seen from three angles.

Orchestration is compensation, not a fix

Orchestration manages the symptom, not the cause. It decides where work is placed and when it moves, but it changes neither the unit of work itself nor the path that work takes to reach the resource best able to run it.

The evidence is hard to argue with: use has not improved as orchestration has matured. In the very environments where orchestration has become the default foundation for AI, measured use has gone backwards.⁷,¹⁴ More scheduling on top of the same execution model does not close the gap; it manages it.

What is worth noting is where the rest of the field has landed. Vendors and analysts arriving at the same diagnosis now describe the fix the same way: break execution apart and separate the stages so each one runs on the resource it actually needs.⁹,¹²,²¹,²² That is a description of the problem TAHO was built to solve.

What TAHO changes

TAHO changes the unit of execution. It breaks a workload into smaller units, sends each unit to the resource best suited to run it, and runs it there.

This is not a replacement for orchestration but a layer beneath it and above the hardware. We do not compete with the scheduler; we change what the scheduler is scheduling.
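The general idea of splitting a workload into units and sending each to a suitable resource can be illustrated with a toy Python sketch. This is not TAHO's implementation or API: `Unit`, `Resource`, `route`, the stage names, and the round-robin placement rule are all hypothetical, chosen only to show what routing per unit rather than per workload looks like.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Unit:
    """One small piece of a workload and the kind of resource it needs."""
    name: str
    needs: str  # hypothetical labels: "cpu" or "gpu"


@dataclass(frozen=True)
class Resource:
    """One machine or device that can run units of a given kind."""
    name: str
    kind: str


def route(units, resources):
    """Place each unit on a resource of the kind it needs, round-robin per kind."""
    pools = {}
    for resource in resources:
        pools.setdefault(resource.kind, []).append(resource)
    pickers = {kind: cycle(pool) for kind, pool in pools.items()}

    placement = {}
    for unit in units:
        if unit.needs not in pickers:
            raise LookupError(f"no {unit.needs!r} resource for unit {unit.name!r}")
        placement[unit.name] = next(pickers[unit.needs]).name
    return placement


# Illustrative inference request split into CPU-heavy and GPU-heavy stages.
units = [
    Unit("tokenize", "cpu"),
    Unit("prefill", "gpu"),
    Unit("decode", "gpu"),
    Unit("postprocess", "cpu"),
]
resources = [Resource("cpu-0", "cpu"), Resource("gpu-0", "gpu"), Resource("gpu-1", "gpu")]

placement = route(units, resources)
for unit_name, resource_name in placement.items():
    print(f"{unit_name} -> {resource_name}")
```

In this toy version the CPU stages never hold a GPU, and the two GPU stages land on different devices. A real system would also have to weigh data movement, latency, and load, none of which is modeled here.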

The thesis is simple to say and hard to build: capacity is no longer the constraint, fit is. Teams that fit work to machines better will get more out of the hardware they already own, spend less to deliver the same result, and start to measure how well they execute rather than how much they spend. That efficiency is the value the execution gap has been hiding, and closing it is the opportunity.

A note on the numbers

Every peak figure on this page is dense FP16, taken from primary NVIDIA datasheets and held to a consistent precision so the generations compare cleanly. Sparse and lower-precision figures run higher; using them would make the gap look larger, not smaller, so we chose the conservative number on purpose. The Vera Rubin figure is a projection, not a published spec, and is labeled as such wherever it appears.¹⁵,¹⁶

Use figures are reported with their source and what they measure, because fleet utilization, model FLOPs utilization, and server capacity utilization are three different things and should not be mixed. The argument does not rest on any single number. It rests on the shape of all of them together.

References

1. NVIDIA A100 Tensor Core GPU Datasheet (80GB). Lists Peak FP16 Tensor Core at 312 TF dense / 624 TF with sparsity, and confirms the asterisk-equals-sparsity convention. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/a100-80gb-datasheet-update-a4-nvidia-1485612-r12-web.pdf
2. NVIDIA H100 Tensor Core GPU Datasheet. FP16 dense 989 TFLOPS; the headline "2,000 TFLOPS*" figure is the sparse number. https://resources.nvidia.com/en-us-gpu-resources/h100-datasheet-24306
3. NVIDIA HGX B200 Product Carbon Footprint Summary. Lists FP16/BF16 Tensor Core at 36 PFLOPS sparse across 8 GPUs (4,500 sparse / 2,250 dense per GPU). https://images.nvidia.com/aem-dam/Solutions/documents/HGX-B200-PCF-Summary.pdf
4. GB300 NVL72 architecture and per-precision dense figures (B200 2,250 vs GB300 2,500 dense FP16/BF16, +11.1%). Verda. https://verda.com/blog/gb300-nvl72-architecture V100 dense FP16 of 125 TFLOPS corroborated at: https://www.spheron.network/blog/nvidia-a100-vs-v100/
5. Llama 3 Herd of Models (Meta). Reports overall BF16 Model FLOPs Utilization of 38-43% during 405B pre-training. arXiv 2407.21783. https://ar5iv.labs.arxiv.org/html/2407.21783
6. CoreWeave, NVIDIA H100 Benchmarks for Large-Scale Training. States MFU of 35-45% is common and documents optimized runs at 42-51%. https://www.coreweave.com/blog/nvidia-h100-gpu-benchmark-results-what-we-learned-from-large-scale-gpu-testing
7. Cast AI, 2026 State of Kubernetes Optimization Report. Average GPU utilization of 5% measured across tens of thousands of production clusters (AWS, Azure, GCP); CPU 8%, memory 20%; both down year over year. https://cast.ai/reports/state-of-kubernetes-optimization/ Press release: https://cast.ai/press-release/2026-state-of-kubernetes-optimization-report/
8. Independent reporting on the Cast AI findings: ~5% average across ~23,000 clusters, roughly 20x over-allocation. ITBrief. https://itbrief.co.uk/story/cast-ai-report-finds-5-gpu-use-in-kubernetes-clusters
9. Production LLM inference commonly observed at 20-40% GPU utilization. Yotta Labs. https://www.yottalabs.ai/post/why-gpu-utilization-is-low-in-llm-inference-and-how-to-fix-it VentureBeat coverage of the cross-vendor convergence (Cast AI, Anyscale, Gartner) toward disaggregated inference: https://venturebeat.com/infrastructure/fomo-is-why-enterprises-pay-for-gpus-they-dont-use-and-why-prices-keep-climbing
10. Fortune (opinion), "Data centers are eating the economy." States average server utilization hovers between 12-18%, even active servers rarely exceed 50%. https://www.fortune.com/2025/08/11/data-centers-are-eating-the-economy-and-were-not-even-using-them
11. Koomey / Anthesis Group (2015). 30% of physical servers "comatose" (no useful work in 6+ months); ~10 million servers / ~$30B stranded capital; enterprise IT utilization "rarely exceeds six percent." https://www.koomey.com/koomey_blog/our-latest-research-on-comatose-servers/ Anthesis: https://www.anthesisgroup.com/insights/zombie-servers-hunting-down-the-lost-capital/
12. Anyscale, GPU (In)efficiency in AI Workloads. Single-container packaging forces CPU and GPU stages to scale as one unit, guaranteeing low utilization; argues for disaggregated, multi-stage execution. https://www.anyscale.com/blog/gpu-in-efficiency-in-ai-workloads
13. Defensive over-provisioning under scarcity drives reserved-but-idle capacity; "no-effort" baseline ~30% vs measured ~5%. Rack2Cloud analysis of the Cast AI data. https://www.rack2cloud.com/gpu-utilization-cloud-waste/
14. SDxCentral, "Kubernetes efficiency is going backwards as AI drives GPU waste." Documents the year-over-year decline (CPU 10% to 8%, memory 23% to 20%, GPU 5%). https://www.sdxcentral.com/news/kubernetes-efficiency-is-going-backwards-as-ai-drives-gpu-waste/
15. SemiAnalysis, "Vera Rubin - Extreme Co-Design." Rubin Tensor Core width doubling applies only to FP4/FP8; BF16/TF32 unchanged from Blackwell; dense FP16 scales ~1.6x via transistor/SM count, not datapath. https://newsletter.semianalysis.com/p/vera-rubin-extreme-co-design-an-evolution
16. Spheron, "NVIDIA Rubin vs Blackwell vs Hopper." Notes Rubin FP16 (~8,000 TFLOPS) is projected at ~half the FP8 figure and is not NVIDIA-confirmed. https://www.spheron.network/blog/nvidia-rubin-vs-blackwell-vs-hopper/
17. Sarathi-Serve (Microsoft Research / academic). Characterizes why production LLM inference under-utilizes GPUs: the decode phase is memory-bound and pipeline parallelism creates bubbles. arXiv 2403.02310. https://arxiv.org/pdf/2403.02310 Related production analysis of inference burstiness forcing peak-headroom reservation: https://arxiv.org/pdf/2403.02310
18. Microsoft, "The LLM Inference Optimization Stack." Notes typical pre-optimization utilization of 30-40% on GPU node pools, with continuous batching able to push it to 80%+; single-request serving wastes the majority of capacity. https://techcommunity.microsoft.com/blog/appsonazureblog/the-llm-inference-optimization-stack-a-prioritized-playbook-for-enterprise-teams/4498818
19. McKinsey, cited for an industry-average server utilization near 15% with reservation rates around 80%, and that internal IT virtualization typically reaches only ~35% versus Google's ~38%. NRDC independently reports a 12-18% range. McKinsey via Sardina Systems: https://sardinasystemsblog.medium.com/how-can-an-enterprise-achieve-over-50-server-utilization-with-the-industrys-average-of-15-5685bef65779 McKinsey "Sharpening data center due diligence": https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/McKinsey%20Digital/Our%20Insights/Sharpening%20data%20center%20due%20diligence/Sharpening%20data%20center%20due%20diligence.pdf
20. FinOps Foundation, FinOps for AI Working Group. Names GPU underutilization (commonly 15-30% of capacity), overprovisioning, and static provisioning as endemic causes of AI infrastructure waste; recommends reserving baseline capacity and bursting on-demand rather than holding idle accelerators. https://www.finops.org/wg/optimizing-genai-usage/ Tools and services guidance: https://www.finops.org/wg/finops-for-ai-tools-services-considerations/
21. NVIDIA, introducing Dynamo. NVIDIA's own position that co-locating prefill and decode on one GPU "leads to inefficient resource use," and that disaggregating the phases onto different GPUs lets each be optimized and assigned independently. https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models Product page: https://www.nvidia.com/en-us/ai/dynamo/
22. Gartner, "Build Strategic Differentiations for On-Premises AI Infrastructure Offerings" (doc 7211630, November 21, 2025). States AI infrastructure often suffers from low GPU utilization and poor cost-efficiency during on-prem LLM inference, and recommends shared GPU usage across siloed projects plus prefill-decode disaggregation with heterogeneous processors to cut inference cost and improve flexibility. https://www.gartner.com/en/documents/7211630 Related Gartner analyst framing of deliberate AI workload placement across hyperscalers, neoclouds, on-prem, and edge: https://www.computerweekly.com/opinion/Gartner-Why-neoclouds-are-the-future-of-GPU-as-a-Service