No Priors: Artificial Intelligence | Technology | Startups

Baseten CEO Tuhin Srivastava on the AI Inference Crunch, Custom Models, and Building the Inference Cloud

2026-05-01 · 19:34 UTC · No Priors: Artificial Intelligence | Technology | Startups · 34 min read

Brief

Tuhin Srivastava, CEO of Baseten, framed the conversation around the current “inference crunch”: rapid customer adoption of custom models, strained global GPU capacity, and the operational work required to run inference at scale. He began by noting Baseten’s recent hypergrowth (~30x year‑over‑year, with a forecast of more than $1B in revenue) and argued that most real‑world production workloads today run on custom or post‑trained models, which he estimated at >95% of tokens. Srivastava used companies such as Abridge and OpenEvidence to illustrate why an independent application layer remains viable: firms with unique user signals and workflow integrations will continually fine‑tune models and create differentiated value that labs alone cannot replicate.

The middle of the interview focused on infrastructure realities. Baseten runs 90 clusters across 18 clouds and operates at mid‑90s percent utilization, which Srivastava says exposes a hard supply shortage and a thin set of reliable suppliers. He described current commercial terms for large capacity purchases (3–5 year contracts with ~20–30% of TCV prepaid) and said this dynamic changes financing strategy, for example through earlier IPOs, debt or other structures. Srivastava also explained why Baseten bought a post‑training research team: post‑training, quantization and inference are tightly coupled, and customers demand both performance and continual retraining loops. On hardware, he argued that Nvidia (H100) still dominates because of its supply chain and CUDA ecosystem advantages, but he expects specialized inference and decoder chips to emerge over time. Operational issues at scale, such as kernel panics, logging overloads and immature LLM runtimes, underscore his view that winning requires excellent software, an ops culture of 24/7 readiness, and partnerships around evals, sandboxes and training APIs. He ended on an optimistic note: lower inference costs will drive more agentic, long‑horizon workflows and a proliferation of personalized “concierge” agents, increasing overall demand rather than saturating it.

Why it matters

Baseten grew ~30x over the prior year, and Tuhin Srivastava said the company expects to exceed $1 billion in revenue in 2026.

Key details

Tuhin Srivastava reported that >95% of tokens served on Baseten are for custom/post‑trained models (dedicated inference), rather than vanilla open‑source weights.
Baseten runs 90 clusters across 18 cloud providers, operating at mid‑90s percent utilization; Srivastava warned of a severe supply crunch and said large capacity deals today commonly require 3–5 year contracts with ~20–30% TCV prepay.
Baseten acquired a post‑training research team (a former customer) to add expertise in post‑training, quantization and the tight coupling between training and inference.
Srivastava emphasized that the inference software layer is highly sticky: none of Baseten’s top 30 customers has churned, and the company sees ~400% annual net dollar retention (NDR). He argued that software plus operations wins over raw GPU‑as‑a‑service.
On chips, Srivastava said the H100, now about 4–4.5 years old, remains dominant and that Nvidia’s supply chain and CUDA ecosystem give it a big short‑term edge, but he expects an eventual multi‑chip world with inference‑specific decoders and diversified hardware.
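
To make the prepay terms above concrete, here is a minimal arithmetic sketch in Python. The 20–30% range comes from Srivastava's description of current deals. The function name `prepay_range` and the $100M contract value are hypothetical, chosen only for illustration; they are not figures from the interview.

```python
def prepay_range(tcv, low_pct=20, high_pct=30):
    """Return the (low, high) upfront payment for a contract.

    tcv      -- total contract value in whole dollars
    low_pct  -- lower bound of the prepay share, in percent
    high_pct -- upper bound of the prepay share, in percent
    """
    # Integer arithmetic avoids floating-point rounding on dollar amounts.
    return tcv * low_pct // 100, tcv * high_pct // 100


# Hypothetical example: a $100M capacity contract (illustrative only).
tcv = 100_000_000
low, high = prepay_range(tcv)
print(f"Cash due upfront: ${low:,} to ${high:,}")
```

Under these assumptions, a buyer would need to commit between a fifth and just under a third of the total contract value in cash before using any capacity, which is the pressure on financing strategy that Srivastava describes.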
