substack.com

How Much Do GPU Clusters Really Cost?

Brief

SemiAnalysis proposes a practical, bottom‑up Total Cost of Ownership (TCO) framework for GPU clusters that goes beyond headline $/GPU‑hr. Their monthly TCO formula sums GPU rental, storage (hot/warm/cold), networking, control plane, support uplift, plus two categories of implicit costs: Goodput Expense (lost useful work from failures and inefficiencies) and amortized engineering expenses for setup and debugging. The firm published two interactive tools — a GPU Cluster TCO Calculator and a Goodput Calculator — and populated defaults from an August 2025 pricing snapshot, hands‑on tests of 80+ neoclouds, and interviews with >150 users.

They evaluate three representative provider tiers (Gold‑tier, Hyperscaler, Silver‑tier) across three scenarios: Large LLM Pretrain (5,184 NVL72 GPUs, ~80% allocated; 500 TiB hot + 10 PiB cold storage; $4/GPU‑hr input), Multimodal RL Research (2,048 B200 cluster with ~12 TB/GPU storage; neocloud $2.40 vs hyperscaler $3.10/GPU‑hr), and Inference Endpoints (512 GPUs, 1 TB/GPU). Key results: holding GPU price equal, gold‑tier TCO was roughly 5–15% lower than silver in large training; in the pretrain scenario multi‑year cost ratios were Gold 1.00x, Hyperscaler 1.10x, Silver 1.15x. Goodput modeling—with formulas for checkpoint‑hot, checkpoint‑cold, and fault‑tolerant cases—shows large jobs suffer disproportionately from MTBF, detection latency, repair time, checkpoint frequency and blast radius: in the Large Pretrain example SemiAnalysis reports goodput losses of ~6.14% (gold), 10.53% (hyperscaler), and 20.91% (silver).

On fault tolerance, they compare TorchFT (Meta open source; easier recovery but >10% comms overhead due to GLOO vs NCCL), AWS SageMaker HyperPod checkpointless (announced Dec 2025; AWS claims ≈1m45s recovery and uses model redundancy over EFA), and TorchPass (commercial; no perf hit but requires idle spare nodes — their test used 32 spare GPUs, ~0.62% of cluster). The analysis concludes that hidden line items (support tiers, orchestration premiums, setup/debug engineering, and goodput losses) often make ostensibly cheaper GPU‑hour offers more expensive in practice. ClusterMAX 2.1 (Apr 2026) adds several providers (Core42, BitDeer, FPT, Radiant/Ori, etc.), and SemiAnalysis plans broader ClusterMAX 3.0 testing and MTBF data collection this summer.

Why it matters

SemiAnalysis published a Cluster TCO methodology and two free calculators (GPU Cluster TCO and Goodput Calculator) based on August 2025 pricing data and interviews with 150+ customers; the TCO formula explicitly includes GPUs, storage, networking, control plane, support, goodput, setup, and debugging.

Key details

  • When holding GPU price constant, SemiAnalysis finds gold‑tier providers have 5–15% lower TCO than silver‑tier providers across large training workloads; in their Large LLM Pretrain example (5,184 NVL72 GPUs, 80% of cluster used), relative multi‑year costs were Gold = 1.00x, Hyperscaler = 1.10x, Silver = 1.15x at $4/GPU‑hr.
  • Goodput (useful work) is modeled with three cases—G_chkpt‑hot, G_chkpt‑cold, and G_tolerant—using inputs like MTBF, time‑to‑identify failures, time‑to‑repair, checkpoint frequency, job size, and blast radius; in the Large Pretrain scenario reported goodput losses were ~6.14% (gold), 10.53% (hyperscaler), and 20.91% (silver).
  • SemiAnalysis details three fault‑tolerance approaches: TorchFT (open source, uses GLOO vs NCCL, observed >10% perf overhead), AWS SageMaker HyperPod checkpointless (launched Dec 2025, AWS claims ~1m45s recovery vs ~15m for checkpoint restart), and TorchPass (commercial, maintains baseline performance but requires idle spare nodes; their example used 32 idle GPUs = 0.62% of cluster).
  • In the Multimodal RL Research scenario (2,048 B200 cluster, high storage ratio ~12 TB/GPU), price differentials shift: Gold = 1.00x, Hyperscaler = 1.61x (GPU and orchestration premium), Silver = 1.15x; example B200 pricing used: neoclouds $2.40/GPU‑hr (25th pct), hyperscaler $3.10/GPU‑hr (50th pct).
  • For single‑node inference endpoints (512 GPUs, avg job 8 GPUs), provider reliability mattered little to TCO; hyperscalers can still be much more expensive due to GPU list prices—SemiAnalysis reported ~59% higher cost for hyperscaler in that inference example when using prevailing GPU pricing assumptions.
Cleaned source text

Calculating Cluster TCO, The Real Impact of Downtime, The Grand Unifying Theory Of Goodput, and a ClusterMAX 2.1 Update



How Much Do GPU Clusters Really Cost?

Jordan Nanos, Bryan Shan, Cheang Kang Wen, Daniel Nishball, and Dylan Patel

Apr 20


Introduction: Rethinking the Total Cost of a GPU Cluster

Modern GPUs are unbelievably expensive. A single Blackwell GPU costs more than the average car, and uses more energy than a single family home. It is now common for unicorn startups to have thousands of these GPUs working for them, day and night. Many foundation model companies now spend an order of magnitude more money on GPUs than they do on employees. We know multiple companies spending over 80% of their initial funding on GPUs. Startup founders now have four important categories of spending to consider when building a financial plan for their company:

1. GPU clusters

2. Tokens

3. Employees

4. Everything else

Traditionally, when deciding where to get a cluster to solve that first category, companies evaluate neoclouds on a cost-per-hour basis, focusing on the most expensive line item: the GPUs. However, focusing solely on the price per GPU-hour a provider offers can be misleading. In practice, two cloud offerings with identical pricing per GPU-hour can have very different TCO, once you account for everything that goes into training a model or building inference endpoints. Factors such as downtime, setup time, debugging time, and required performance tuning of networking and storage can dramatically impact how much useful work users can do per dollar spent. Additional costs for non-GPU expenses such as CPU compute, networking, storage, orchestration software, and support can also be hidden and not considered. In other words, what appears to be a cheaper cluster can in many cases end up being more expensive.

Source: SemiAnalysis Cluster TCO Calculator

The central premise of SemiAnalysis ClusterMAX™ research is that cluster quality varies significantly across GPU cloud providers, and that these differences have a meaningful impact on end user experience, productivity and as a result, TCO. Many of these factors are not captured in hardware specs, reference architectures, or one-time performance benchmarks. Differences in reliability, networking behavior, storage performance, and support affect the only metric that matters: time-to-research-objective.

In this article, we introduce a methodology for calculating the TCO of GPU clusters that goes beyond raw price per GPU-hour. We define a framework that incorporates direct costs such as compute, storage, networking, and support, as well as indirect costs related to reliability, debugging, and setup. Using this framework, we compare three classes of ClusterMAX-rated providers: a gold-tier neocloud provider, a silver-tier hyperscaler, and a silver-tier neocloud. We apply this methodology to three representative cluster configurations, covering Large LLM Pretrain, Multimodal RL Research, and Inference Endpoints.

In order to conduct this comparison we use our GPU Cluster TCO Calculator and our Goodput Calculator, which we release for free on our ClusterMAX website. Anyone reading this can plug in their own values for custom scenarios and see the results. We explain the formulae behind this calculator later in this article and introduce our Grand Unifying Theory of Goodput.

These calculators are supported by input data from our GPU Rental Pricing data series, hands-on experience testing 80+ neoclouds, and interviews with over 150 end-user customers of neoclouds which were conducted during the research effort for ClusterMAX 1.0, ClusterMAX 2.0, and continue to this day for ClusterMAX 3.0.

Our findings demonstrate why providers in the ClusterMAX gold-tier command a pricing premium (or win deals at equal price). Specifically, we find that when we hold GPU pricing constant, the TCO of a gold-tier provider is lower than that of a silver-tier provider by roughly 5-15% across a representative set of large training workloads, but the difference shrinks to near zero for fault-tolerant workloads like single-node inference clusters. In other words, we put real dollar values behind the intuition that users have built about the benefits of fault tolerance.

Definitions and Key Terms

To evaluate GPU cloud offerings on equal footing, we break down the TCO of a GPU cluster as follows.

Source: SemiAnalysis GPU Rental Price Dashboard

2. Storage [$/GB-mo]: The cost of storing data. This includes high-performance “hot” storage (e.g. NVMe-based parallel file systems), lower-tier “warm” or object storage for less frequently accessed data, and “cold” archival storage. We also include any data access costs: for instance, API call costs on object storage or data egress charges if data leaves the cloud. These can be substantial during training when moving around large datasets and model checkpoints, and during inference when storing logs and metrics (now including image, video, and audio data). Based on customer surveys, we adjust our assumptions across different cluster scenarios from a low point of 2TB/GPU to a high point of 25TB/GPU. We also track the public pricing (standardized to per GB, per month) across various providers and release this data for free as a dropdown menu in the Cluster TCO Calculator. Notably, storage performance can vary massively between different offerings, even from the same provider. For example, AWS FSx for Lustre has 4 different throughput tiers (ranging from 125 MB/s/TiB to 1,000 MB/s/TiB) and charges about 3x more for 4x more throughput at list price. We allow this difference to be reflected in the inputs (e.g. job init time) to the goodput calculations discussed later.

3. Networking [$/hr or $/GB-mo]: The cost of frontend/N-S networking features. Networking services include public IPs, firewalls/security groups, load balancers, data egress, and data transfer. For example, transferring training data or model weights out of AWS or between AWS regions can incur significant fees. For the backend/E-W network, we make a simplifying assumption that all clusters eventually perform at a similar level with a high bandwidth interconnect (i.e. InfiniBand, RoCE, EFA, etc.) after setup. As a result, the cost differences are considered later in Setup Expense and Debugging Expense.

4. Control Plane [$/hr]: The cost of managing the cluster. This covers the orchestration software control plane and the nodes for login, code development, and job submission. Extra CPU-based nodes for data processing and environments for RL rollouts can be considered here too.

5. Support [% uplift]: The cost of support. For example, on AWS, this is an extra charge on the entire cloud bill, with three different options that range anywhere from an initial 10% to a final 3% of the bill as the monthly spend graduates to higher tiers. Of course, different tiers of support mean better response in the event of an outage or performance issue.

6. Goodput Expense [% uplift]: The first item that does not show up on a monthly bill. It is an implicit cost associated with using lower-tier providers. We use this percentage to build in an additional cost of downtime on the cluster in the form of more rental time required, or less useful work being completed. In practice, the actual amount of downtime, or number of job interruptions, depends on the provider, the individual datacenter, hardware, and workload. Inputs used to calculate this expense include the total number of interruptions/failures, time to identify the failure, and the time to repair/replace a node. The impact of a single failure/interruption also depends on the cluster design, e.g. the blast radius of the failures, training initialization time, average job size, checkpoint frequency and/or use of fault tolerant software frameworks. The inputs to this piece of the calculator are also an opportunity for users to price in the risk of a bad SLA from a risky provider, on a total % basis. For example, a 95% cluster uptime SLA commitment from the provider allows for 5% downtime with no response and no credits. Since this input is so complicated, we have an entire second tab with multiple scenarios covered. More on this later.

Source: SemiAnalysis Goodput Expense Calculator

7. Setup Expense [$/hr]: The cost of having engineers set up the cluster and tune performance. For example, on AWS, POCs are not free, and users report that tuning NCCL + EFA parameters in order to reach the same level of performance as InfiniBand or RoCE networks can take weeks to months of effort by multiple engineers. Since in many cases this requires an entire cluster to be dedicated to this work, the additional line items of expense include both engineering hours and the cluster time spent on performance tuning.

8. Debugging Expense [$/hr]: The cost of having engineers debug the cluster over time, i.e. the cost of engineering headaches. For example, on AWS, users report that debugging NCCL + EFA issues involves 4 or 5 layers of indirection from their PyTorch code, through the driver stack and into the NIC/switch firmware/hardware recipe. In other words, these line items of expense include the engineering time spent on an ongoing basis, and the cluster time spent on failed jobs.
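The SLA example under Goodput Expense can be turned into dollars with a few lines of arithmetic. The sketch below is only an illustration: the function name and the 730-hour month are assumptions, and the 5,184 GPUs at $4/GPU-hr are borrowed from the Large LLM Pretrain scenario rather than from any SLA contract.

```python
HOURS_PER_MONTH = 730.0  # assumption: average month length in hours


def sla_downtime_exposure(uptime_sla: float, gpus: int, gpu_price_hr: float) -> dict:
    """Downtime that an uptime SLA permits per month, and the rental paid for it.

    uptime_sla is a fraction (0.95 for a 95% uptime commitment).
    """
    if not 0.0 <= uptime_sla <= 1.0:
        raise ValueError("uptime_sla must be a fraction between 0 and 1")
    downtime_fraction = 1.0 - uptime_sla
    downtime_hours = downtime_fraction * HOURS_PER_MONTH
    # Rental dollars paid for hours in which the cluster may be unavailable
    # without the provider breaching the SLA.
    exposed_dollars = downtime_hours * gpus * gpu_price_hr
    return {
        "downtime_fraction": downtime_fraction,
        "downtime_hours": downtime_hours,
        "exposed_dollars": exposed_dollars,
    }


# Illustrative inputs: 95% uptime SLA on a 5,184-GPU cluster at $4/GPU-hr.
exposure = sla_downtime_exposure(0.95, 5184, 4.0)
print(exposure)
```

Entering the resulting fraction as a goodput uplift is one way to price in a weak SLA; the real loss can be higher once restarts and lost work since the last checkpoint are counted.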

Next, we describe how both calculators work.

Our Proposed TCO Formula for GPU Clusters

The following formula is used to calculate the Total Cost of a GPU Cluster on a monthly basis:

Where…

\begin{align*}
\text{Debugging} &= \$_{\text{engineering-hr}} \cdot t_{\text{debugging}} \\[6pt]
\text{Goodput}_{\$/\text{mo}} &= \left[\, G_{\text{chkpt-hot}} \mid G_{\text{chkpt-cold}} \mid G_{\text{tolerant}} \,\right]
\end{align*}

Note: setup is amortized over the contract term (3mo to 3yr). In other words, spending time setting up a cluster you will use for 3 years is not a big deal. Spending weeks setting up a cluster you will use for 3 months is.
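To make the structure of the monthly calculation concrete, here is a minimal Python sketch that adds up the eight line items defined earlier. It is not the calculator's implementation: the class and field names, the 730-hour month, applying the support uplift to the billed subtotal, and counting only engineering hours (not cluster time) in setup are all assumptions made for illustration.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730.0  # assumption: average month length in hours


@dataclass
class ClusterCosts:
    gpus: int
    gpu_price_hr: float         # $/GPU-hr
    storage_gb: float
    storage_price_gb_mo: float  # $/GB-mo
    networking_mo: float        # $/mo, frontend networking and egress
    control_plane_hr: float     # $/hr
    support_uplift: float       # fraction of the billed subtotal, e.g. 0.03
    goodput_mo: float           # $/mo, from one of the goodput cases
    eng_rate_hr: float          # $/engineering-hr
    setup_eng_hours: float      # one-off setup effort
    debug_eng_hours_mo: float   # ongoing debugging effort per month
    contract_months: int


def monthly_tco(c: ClusterCosts) -> float:
    """Sum explicit and implicit monthly costs under the assumptions above."""
    gpu = c.gpus * c.gpu_price_hr * HOURS_PER_MONTH
    storage = c.storage_gb * c.storage_price_gb_mo
    control_plane = c.control_plane_hr * HOURS_PER_MONTH
    billed = gpu + storage + c.networking_mo + control_plane
    support = billed * c.support_uplift
    # Setup is a one-off cost spread over the contract term.
    setup = c.eng_rate_hr * c.setup_eng_hours / c.contract_months
    debugging = c.eng_rate_hr * c.debug_eng_hours_mo
    return billed + support + c.goodput_mo + setup + debugging


# Hypothetical inputs: same cluster, 3-month vs 36-month contract.
base = dict(gpus=10, gpu_price_hr=2.0, storage_gb=1000.0, storage_price_gb_mo=0.1,
            networking_mo=50.0, control_plane_hr=1.0, support_uplift=0.1,
            goodput_mo=500.0, eng_rate_hr=100.0, setup_eng_hours=120.0,
            debug_eng_hours_mo=10.0)
short = monthly_tco(ClusterCosts(contract_months=3, **base))
long_term = monthly_tco(ClusterCosts(contract_months=36, **base))
print(f"3-month contract:  ${short:,.2f}/mo")
print(f"36-month contract: ${long_term:,.2f}/mo")
```

The only difference between the two printed totals is the amortized setup term, which is the point of the note on contract length.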

Next, we define G_chkpt-hot, G_chkpt-cold, and G_tolerant, i.e. the different ways to calculate goodput expense.

The Grand Unifying Theory Of Goodput

First, what is goodput?

In the context of training, goodput is defined as the amount of useful work users can perform on their cluster. Goodput plays on the term throughput to mean that not all throughput is “good”. Lots of training throughput can be “bad” if a GPU fell off the bus, NCCL is stalling, or there is an OOM hiding around the corner during the next checkpoint save.

These issues are much more pronounced at scale. As we demonstrate below, larger jobs on larger clusters are much more impacted by individual failures or interruptions. If 80% of your cluster is running one job, and that job has to restart (a process that can take 10-15 minutes depending on storage, networking, CPUs, caching setup, etc.), you pay for all of those 10-15 minutes of cluster time during job initialization, plus all the compute wasted between the last checkpoint and the time of the failure/interruption/crash.

As we explained in ClusterMAX 2.0, cluster-level MTBF also plays a role here. Since all GPUs eventually fail, the bigger your job, the less time you have to do useful work (goodput) in between failures.

Here we use a convenient table to illustrate the concept. As node failures get more common (moving down the y-axis of the chart) and cluster size gets bigger (moving to the right across the x-axis of the chart), the time between failures (MTBF) gets smaller and smaller.

Source: AWS
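The scaling behind the table can be sketched in a few lines. This assumes node failures are independent and occur at a constant rate, so the cluster-level MTBF is the per-node MTBF divided by the node count. The function names and the example figures are hypothetical, not measurements from any provider.

```python
def cluster_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """Cluster-level MTBF assuming independent, constant-rate node failures."""
    if nodes <= 0 or node_mtbf_hours <= 0:
        raise ValueError("node_mtbf_hours and nodes must be positive")
    return node_mtbf_hours / nodes


def expected_failures_per_month(node_mtbf_hours: float, nodes: int,
                                hours_per_month: float = 730.0) -> float:
    """Expected number of failures hitting the cluster in one month."""
    return hours_per_month / cluster_mtbf_hours(node_mtbf_hours, nodes)


# Hypothetical per-node MTBF of 50,000 hours across growing clusters.
for nodes in (16, 128, 648):
    mtbf = cluster_mtbf_hours(50_000.0, nodes)
    per_month = expected_failures_per_month(50_000.0, nodes)
    print(f"{nodes:>4} nodes: MTBF {mtbf:8.1f} h, {per_month:5.2f} failures/month")
```

Under this assumption, doubling the number of nodes a job spans halves the time between interruptions, which is why the expected failure count is an input to the goodput calculation.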

As a result, we really need to know which providers are:

1. Running clean datacenters with talented ops teams

2. Capable of identifying failures quickly (or even predicting them before they occur)

3. Able to recover from failures quickly (e.g. running hot spare pools of nodes with capacity guarantees)

We summarize all of this in our TCO Calculator as “Goodput Expense”, where the following formulae are used to calculate Goodput Expense under three scenarios:

\begin{align*}
\cdots &= \left[\,\cdots + t_{\text{init}} + t_{\text{repair}}\right] j_{\text{size}} \cdot \#_{\text{failures}} \cdot \$_{\text{GPU-hr}} \\[14pt]
G_{\text{chkpt-cold}} &= \left\{\left[\max\left(t_{\text{id}},\, \frac{t_{\text{chkpt}}}{2}\right) + t_{\text{init}}\right] j_{\text{size}} + t_{\text{repair}} \cdot b_{\text{radius}}\right\} \#_{\text{failures}} \cdot \$_{\text{GPU-hr}} \\[14pt]
G_{\text{tolerant}} &= \left[(t_{\text{id}} + t_{\text{failover}})\, j_{\text{size}} + t_{\text{repair}} \cdot b_{\text{radius}}\right] \#_{\text{failures}} \cdot \$_{\text{GPU-hr}}
\end{align*}

G_chkpt-cold = goodput expense when jobs restart from a checkpoint via a spare node that is “cold” (typically, provider managed). In other words, the jobs wait until a repair/replace happens. This is the worst case scenario, since these kinds of repairs typically take hours or days.

G_chkpt-hot = goodput expense when jobs restart from a checkpoint via a spare node that is “hot” (typically, customer managed but can also be from top-tier providers). In other words, the jobs (depending on defined priorities) can restart immediately on idle nodes (customer managed), pre-empt lower-priority jobs (also customer managed), or restart on a node that gets brought into the cluster from a spare pool (provider managed). Of course, a provider-managed spare pool also depends on some capacity guarantee from the provider (i.e. if one of your machines fails and you report it for repair/replacement, there need to be spares available). Top-tier providers that are experienced running multi-tenant clusters at 4k+ GPU scale tell us that they will leave anywhere from 2-6% of their nodes in this spare pool to be used for hot-swaps.

G_tolerant = goodput expense when jobs are “fault tolerant”, i.e. they can keep running in the event of a hardware issue. This scenario is well understood for single-node inference, where a framework such as llm-d, OME, or KServe will just have the load balancer stop sending traffic to the failed node and resend any failed requests to the healthy nodes. The scenario is less well understood in training.

Individual terms are…

\begin{align*}
t_{\text{chkpt}} &= \text{frequency of checkpoints (customer configured)} \\[4pt]
t_{\text{init}} &= \text{time to initialize training job} \\[4pt]
t_{\text{repair}} &= \text{time to repair or replace a failed node, i.e.\ MTTR} \\[4pt]
t_{\text{failover}} &= \text{time to failover to a hot spare node} \\[4pt]
b_{\text{radius}} &= \text{blast radius, e.g.\ 8-way HGX or 64-way in NVL72} \\[4pt]
j_{\text{size}} &= \text{average job size} \\[4pt]
\#_{\text{failures}} &= \text{number of failures, i.e.\ MTBF} \\[4pt]
\$_{\text{GPU-hr}} &= \text{price per GPU hour}
\end{align*}
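As a rough illustration of how these terms combine, the sketch below evaluates two of the goodput expressions, using the labels G_chkpt-cold and G_tolerant as they appear in the formula block. The units are assumptions (times in hours, job size and blast radius in GPUs, failures counted per month), and the function names and sample inputs are hypothetical.

```python
def goodput_chkpt_cold(t_id: float, t_chkpt: float, t_init: float, t_repair: float,
                       j_size: int, b_radius: int, n_failures: float,
                       gpu_price_hr: float) -> float:
    """G_chkpt-cold = {[max(t_id, t_chkpt/2) + t_init] j_size
                       + t_repair * b_radius} * #failures * $GPU-hr."""
    # Whole-job stall per failure: the larger of detection time and half a
    # checkpoint interval, plus job initialization.
    job_stall_hours = max(t_id, t_chkpt / 2.0) + t_init
    lost_gpu_hours = job_stall_hours * j_size + t_repair * b_radius
    return lost_gpu_hours * n_failures * gpu_price_hr


def goodput_tolerant(t_id: float, t_failover: float, t_repair: float,
                     j_size: int, b_radius: int, n_failures: float,
                     gpu_price_hr: float) -> float:
    """G_tolerant = [(t_id + t_failover) j_size + t_repair * b_radius]
                    * #failures * $GPU-hr."""
    lost_gpu_hours = (t_id + t_failover) * j_size + t_repair * b_radius
    return lost_gpu_hours * n_failures * gpu_price_hr


def share_of_gpu_spend(goodput_dollars: float, gpus: int, gpu_price_hr: float,
                       hours_per_month: float = 730.0) -> float:
    """Express a monthly goodput expense as a fraction of monthly GPU rental."""
    return goodput_dollars / (gpus * gpu_price_hr * hours_per_month)


# Hypothetical inputs: 1,000-GPU job, 8-GPU blast radius, 10 failures/month, $4/GPU-hr.
cold = goodput_chkpt_cold(t_id=0.1, t_chkpt=1.0, t_init=0.25, t_repair=4.0,
                          j_size=1000, b_radius=8, n_failures=10, gpu_price_hr=4.0)
tolerant = goodput_tolerant(t_id=0.1, t_failover=0.03, t_repair=4.0,
                            j_size=1000, b_radius=8, n_failures=10, gpu_price_hr=4.0)
print(f"checkpoint restart: ${cold:,.0f}/mo ({share_of_gpu_spend(cold, 1000, 4.0):.2%})")
print(f"fault tolerant:     ${tolerant:,.0f}/mo ({share_of_gpu_spend(tolerant, 1000, 4.0):.2%})")
```

In both expressions the term multiplied by the job size dominates for large jobs, so detection latency, checkpoint interval, and initialization time matter more than repair time once only the blast radius waits for the repair.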

Notably, from the user’s perspective, there are two very different approaches at the software level that we have observed on training clusters. The first is checkpoint restart (still the most common option for small and medium-scale clusters), and the second is fault tolerant training frameworks. In both cases, the inputs to the calculations depend on the approach of recovering from idle nodes vs pre-emption vs relying on the provider, and how long repair/replace flows actually take.

Source: Meta, https://arxiv.org/abs/2410.21680v2

In the scenario of a fault-tolerant training framework, we consider three options, which we describe in more detail below:

TorchFT (open-source from meta-pytorch)

AWS SageMaker HyperPod Checkpointless training (restricted to AWS only)

TorchPass (licensed product from clockwork.io)

TorchFT

TorchFT is the open source standard for fault tolerant training frameworks. The framework easily integrates with existing torchtitan code, and allows for training jobs on large clusters to continue running in the event of a hardware failure. No need for checkpoints (or really, you can checkpoint less frequently). However, the blast radius is the entire replica group.

Source: PyTorch blog on TorchFT

Since TorchFT’s blast radius is the entire replica group (i.e. an FSDP shard within an HSDP job), when any GPU or node within a group fails, the whole group’s torchrun process crashes. This means that all GPUs in that group are idle until recovery completes. As a result, with FSDP shard=16 a single GPU failure takes out all 16 GPUs. With shard=32, it takes out 32 GPUs, etc.

Specific to FSDP, the relevant failure domain is the communication group, not just the raw cluster size. Because parameters are all-gathered before computation and gradients are reduce-scattered in backward, a single failed or hung rank can stall the entire participating group. In practice, HSDP makes this more explicit: blast radius becomes a topology decision at the replica-group level rather than a property of the whole cluster.
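The relationship between shard size and blast radius is simple enough to compute directly. The helper below is a hypothetical sketch, not part of TorchFT: it assumes an HSDP job whose GPUs divide evenly into replica groups of `shard_size`, and that exactly one group is lost per failure.

```python
def replica_group_blast_radius(job_gpus: int, shard_size: int) -> dict:
    """GPUs idled by one failure when the failure domain is a replica group."""
    if shard_size <= 0 or job_gpus <= 0:
        raise ValueError("job_gpus and shard_size must be positive")
    if job_gpus % shard_size != 0:
        raise ValueError("job_gpus must be a multiple of shard_size")
    return {
        "replica_groups": job_gpus // shard_size,
        "idle_gpus_per_failure": shard_size,
        "idle_fraction": shard_size / job_gpus,
    }


# Hypothetical 1,024-GPU job at the two shard sizes mentioned above.
for shard in (16, 32):
    info = replica_group_blast_radius(1024, shard)
    print(f"shard={shard}: {info['replica_groups']} groups, "
          f"{info['idle_fraction']:.2%} of the job idle per failure")
```

Smaller shards shrink the blast radius but increase the number of replica groups that must stay in sync, which is the topology decision described above.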

This has a tradeoff. When a replica group dies, you lose that whole group’s GPUs until the node is replaced. To recover, a surviving group serializes its full model + optimizer state via `state_dict()` and serves it over HTTP to the recovering group, which calls `load_state_dict()`, syncs its step counter, and rejoins the quorum. This whole process is orchestrated by the TorchFT lighthouse server, which you must install on the cluster.

Source: Meta, https://arxiv.org/abs/2410.21680v2

Not every large-scale failure looks like a dead GPU or dead node. A meaningful share of incidents first appear as stuck collectives or watchdog timeouts, which are just symptoms. From a TCO perspective, that means goodput loss includes not only repair or replacement time, but also the time required to detect, attribute, and unwind a hung collective across the participating ranks.

Checkpointing itself can be part of the failure tax. On FSDP2, converting a DTensor state dict back to a full tensor for saving issues an all-gather across ranks. Checkpoint frequency is therefore both a reliability parameter and a communication and failure-surface parameter.

However, this fault tolerance comes at a performance cost. Since TorchFT requires the use of GLOO rather than NCCL for comms across replica groups, there is a per-iteration overhead for an allreduce through the CPU via frontend TCP instead of the backend RDMA network. In initial testing we saw a performance difference of over 10% on comparable HSDP jobs. As a result, when considering goodput expense, we allow for this performance difference to be considered in a “Network overhead (%)” line item if the user chooses to run TorchFT.

Fault tolerance can affect training semantics, not just recovery latency. The number of healthy participants, and therefore the effective batch size, can change from step to step as replica groups drop out and rejoin. When comparing TorchFT to checkpoint-restart or live-migration approaches, note that some methods preserve forward progress by accepting temporarily degraded participation, which may affect optimizer dynamics and throughput accounting.

Notably, TorchFT is scheduler agnostic, so it supports Kubernetes or Slurm.

AWS SageMaker HyperPod Checkpointless Training

AWS introduced checkpointless training for their SageMaker HyperPod EKS clusters in December 2025. This is a Kubernetes-only and NeMo Megatron-only solution to the same fault tolerance problem described earlier. Amazon developed this technology internally for training their Nova models and has proven it at 1k+ GPU scale.

The core of checkpointless training is the concept of model redundancy. In other words, the model and optimizer states are contained within the replica group, and then synced across replica groups (though AWS calls them node groups). Similar to TorchFT, the presence of this cross-group sync allows for recovery of failed nodes and groups without interrupting the running job. Blast radius is proportional to the size of the group relative to the full job size. At runtime, each GPU maintains redundant copies of its model shards on peer GPUs, meaning when a failure occurs the recovering process loads state via RDMA over EFA. This process is managed by CheckpointManager and is a relatively simple code change as long as you’re scheduling your jobs via the SageMaker HyperPod Training Operator.

There is a clear tradeoff in memory overhead here. To quote AWS docs: “The high-precision master model weights/gradients and optimizer states will be affected. Adding one redundant model replica increases device memory usage by roughly the equivalent of one DCP checkpoint size.” In other words, running with this approach to fault tolerance introduces GPU memory pressure (proportional to the size of your replica groups relative to total job size) and a risk of OOMs. The result is running at a reduced batch size or with different parallelism strategies, which relative to a checkpoint restart job generally means a performance impact. As a result, when considering goodput expense, we allow for this performance difference to be considered in a “Memory overhead (%)” line item if the user chooses to run with checkpointless training.

Source: AWS Checkpointless Training Docs

Notably, checkpointless training is integrated with AWS node lifecycle management and deep health checks, which means it is quick to swap in pre-warmed hot spares (i.e. idle nodes in the cluster) for replacement. AWS claims recovery times of 1 min 45 seconds for checkpointless training, vs 15 mins for checkpoint restart. Our hands-on testing confirms this recovery time for a simple Megatron training job on a 4-node H200 cluster. We also tested deep health checks and saw simulated hardware failures identified in under 2 minutes, and nodes replaced in the cluster in under 20 mins.
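To put the two recovery times in dollar terms, the sketch below prices a single recovery under the simplifying assumption that the whole job sits idle for the recovery window. It ignores work lost since the last checkpoint and any memory-overhead slowdown, and the 1,024-GPU job size and $4/GPU-hr price are hypothetical inputs.

```python
def recovery_cost_per_failure(recovery_minutes: float, job_gpus: int,
                              gpu_price_hr: float) -> float:
    """Rental dollars spent while the whole job is idle during one recovery."""
    return recovery_minutes / 60.0 * job_gpus * gpu_price_hr


CHECKPOINTLESS_MIN = 1.75      # 1 min 45 s, as claimed by AWS
CHECKPOINT_RESTART_MIN = 15.0  # checkpoint restart figure quoted by AWS

restart = recovery_cost_per_failure(CHECKPOINT_RESTART_MIN, 1024, 4.0)
checkpointless = recovery_cost_per_failure(CHECKPOINTLESS_MIN, 1024, 4.0)
print(f"checkpoint restart: ${restart:,.2f} per failure")
print(f"checkpointless:     ${checkpointless:,.2f} per failure")
print(f"difference:         ${restart - checkpointless:,.2f} per failure")
```

Multiplying the per-failure difference by the expected number of failures per month gives a rough ceiling on what the faster recovery is worth before accounting for the memory overhead discussed above.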