The AI is 5% of the work.
The other 95% is what breaks:
→ Observability (Langfuse, Braintrust, Helicone) - you can't debug what you can't see
→ Evals - regression suites for non-deterministic software. The new CI.
→ Durable runtime (Temporal, Inngest) - so a 10-minute agent run survives a server restart
→ Guardrails - prompt injection detection, PII redaction, output filtering
→ Memory layer - vector DBs (Pinecone, pgvector, Turbopuffer), retrieval, session state
→ Tools layer - MCP servers, sandboxed code execution (E2B, Modal), browser automation (Browserbase)
→ Auth + multi-tenancy - your agent calling Salesforce for customer A must NEVER see customer B's anything
→ Cost controls - agents in runaway loops burn $$ in minutes
→ Human-in-the-loop - approval gates for "spend more than $X" or "send external email"
→ Prompt versioning - prompts are code, treat them like code
→ Orchestration - plan-act-observe-repeat. Most serious teams are moving toward minimal orchestration + explicit state machines over heavy frameworks.
→ Model routing - LiteLLM, Portkey, OpenRouter for fallback, prompt caching, and version pinning so a vendor update doesn't silently change your product
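To make three of those bullets concrete (cost controls, approval gates, explicit state machines), here is a rough sketch, not a reference implementation. Everything named in it is an assumption for illustration: the `State` enum, `run_agent`, the `NEEDS_APPROVAL` set, the `est_cost_usd` field, and the `plan_next` / `execute` / `approve` callbacks you would wire to your own planner, tools, and reviewer.

```python
from enum import Enum, auto


class State(Enum):
    PLAN = auto()
    AWAIT_APPROVAL = auto()
    ACT = auto()
    DONE = auto()
    HALTED = auto()


# Hypothetical action names that need a human sign-off before running.
NEEDS_APPROVAL = {"send_external_email"}


def run_agent(plan_next, execute, approve, budget_usd, max_steps=50):
    """Drive plan -> (approve) -> act until done, denied, or out of budget.

    plan_next(log) returns the next action dict, or None when finished.
    execute(action) runs the action and returns an observation.
    approve(action) returns True if a human allows the action.
    """
    state, spent, log, action = State.PLAN, 0.0, [], None
    steps = 0
    while state not in (State.DONE, State.HALTED):
        steps += 1
        if steps > max_steps:
            # Hard stop on runaway loops, independent of cost.
            state = State.HALTED
        elif state is State.PLAN:
            action = plan_next(log)
            if action is None:
                state = State.DONE
            elif action["name"] in NEEDS_APPROVAL:
                state = State.AWAIT_APPROVAL
            else:
                state = State.ACT
        elif state is State.AWAIT_APPROVAL:
            state = State.ACT if approve(action) else State.HALTED
        elif state is State.ACT:
            if spent + action["est_cost_usd"] > budget_usd:
                # Refuse the action instead of overspending.
                state = State.HALTED
            else:
                spent += action["est_cost_usd"]
                log.append(execute(action))
                state = State.PLAN
    return state, spent, log
```

The point of writing the states out: every transition is a place to log, checkpoint, or stop. A planner that never returns None ends up HALTED at the budget or step cap instead of burning money until someone notices.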
A CTO with 10+ years shipping production systems gave me the honest version: "Observability + evals + durable runtime + guardrails is the minimum viable production stack.
Skip those four and you get the works-in-demo → on-fire-in-prod gap killing agent startups right now."
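What "regression suites for non-deterministic software" can look like in its smallest form: run each case several times and gate on a pass rate, not a single pass/fail. This is a sketch under assumptions. `pass_rate`, `run_suite`, the `(name, prompt, check)` case shape, and the 0.8 threshold are all made up for illustration; `agent` stands in for whatever callable wraps your model.

```python
def pass_rate(agent, prompt, check, trials=5):
    """Fraction of trials where the agent's output satisfies check."""
    passed = sum(1 for _ in range(trials) if check(agent(prompt)))
    return passed / trials


def run_suite(agent, cases, min_rate=0.8, trials=5):
    """Return {case_name: rate} for every case below min_rate.

    cases is a list of (name, prompt, check) tuples, where check(output)
    returns True if the output is acceptable.
    """
    failures = {}
    for name, prompt, check in cases:
        rate = pass_rate(agent, prompt, check, trials)
        if rate < min_rate:
            failures[name] = rate
    return failures
```

Run it in CI on every prompt or model change. An empty dict means nothing regressed below the threshold; anything else blocks the merge.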
The LLM is the easy part.
Everything around it is the actual company.