Essays - Benedict Evans

The Deep Research problem

Brief

In an essay published 2025-02-18, Benedict Evans critiques OpenAI's Deep Research by auditing the smartphone-market sample report on its product page: the tool cited Statista and Statcounter and reported Japan as 69% iOS, but the underlying sources (Kantar, via Statista, and a JFTC survey) show Android majorities of ~63% and ~53%. He concludes that LLMs are good at inferring what a user probably means but struggle with exact data retrieval and source judgement, making them fast 'infinite interns' that remain unreliable without expert verification.

Why it matters

OpenAI's Deep Research sample table cited Statista and Statcounter and claimed Japan's smartphone share was 69% iOS / 31% Android. Evans checked the sources and found that Kantar (via Statista) actually shows ~63% Android / 36% iOS, and that a JFTC regulator survey reports ~53% Android / 47% iOS, so the headline figure is contradicted by the very sources it cites.

Key details

  • Evans argues LLMs are probabilistic, not deterministic: they can infer what you probably mean but perform poorly at precise information retrieval and source-selection, so generated data tables can be ‘mostly right’ but contain critical, hard-to-detect errors.
  • Two structural concerns: we don't know whether model error rates will ever go away (if they do, that would be a binary change rather than an incremental one), and foundation-model providers lack durable moats beyond access to capital, making products like Deep Research likely to be thin wrappers around commoditized APIs amid competition (e.g., Perplexity).

Cleaned source text

title: The Deep Research problem

author: Benedict Evans

content_type: article

publication: Essays - Benedict Evans

published: 2025-02-18T14:51:22

source_url: https://www.ben-evans.com/benedictevans/2025/2/17/the-deep-research-problem

word_count: 1666

Most of what I do for a living is research and analysis. I think of data I’d like to see and go looking for it; I compile and collate it, make charts, decide they’re boring and try again, find new ways and new data to understand and explain the issue, and produce text and charts that try to express what I’m thinking. Then I go and talk to people about it. This often involves a huge amount of manual labour - there’s an iceberg beneath each chart - and OpenAI’s Deep Research looks like it should be tailor-made for me. So, does it fit? I could test it myself with a new problem, but before I burn time and credits, as luck would have it OpenAI’s own product page has a sample report on something I know quite a lot about - smartphones. Let’s have a look.

This table looks great - hours of work compiling this data all done for me by a machine. Before we give it to a client, though, let’s just check a few things. First, what’s the source? Ah. We have two sources: Statista and Statcounter. Statcounter is a problematic measure of ‘adoption’ - it’s a measure of traffic, and as we all know, different devices are used differently, higher-end devices are used more, and the iPhone skews to the high-end and also skews to more use. You can’t really use that for this, as I’d explain to an intern (I often compare AI to interns). Statista, meanwhile, aggregates other people’s data, makes sure it ranks highly in SEO, and then tries to get you to register or pay to see the result. I think Google should ban this company from the index, but even if you disagree, saying this is the source is like saying the source is ‘a Google search result’. Again, this is an intern-level issue.

Setting that aside, though, let’s dig some more, and look at one number - Japan. Deep Research says that the Japanese smartphone market is split 69% iOS and 31% Android. That prompts two questions: is that what those sources say, and are they right? These are very different kinds of question. First, Statcounter, despite over-weighting iPhones as noted above, doesn’t actually say 69%, or at any rate hasn’t in over a year. Hmm.

If we check Statista, we have to jump through a bunch of hoops, but eventually find that the actual source is the research firm Kantar Worldpanel, and the numbers it gives are pretty much the exact opposite of what Deep Research claims - 63% Android and 36% iOS. Oh.