Jeff Kaufman's Writing

Introducing and Deprecating WoFBench

Brief

WoFBench, introduced on 2026-03-01, aimed to test recall and knowledge synthesis in Tui T. Sutherland's Wings of Fire universe with 17 hard questions, but was saturated on creation: Gemini 3.1 Pro scored 17/17, Claude Opus 4.6 16.8/17, ChatGPT 5.2 Pro 16.3/17, and ELIZA 0/17. The questions were sourced by asking Claude Opus 4.6, the superfan panel was two children from the lead author's household (ages 11 and 10), and the small, informal design limits how far the answers could be validated and leaves room for bias.

Why it matters

WoFBench, a set of 17 Wings of Fire questions, was introduced and formally deprecated in the same post (published 2026-03-01) because it was saturated on creation. The post presents this as an example of a benchmark that is already obsolete when published.

Key details

  • Frontier models scored near-perfectly: Gemini 3.1 Pro 17.0/17, Claude Opus 4.6 16.8/17, ChatGPT 5.2 Pro 16.3/17, while ELIZA scored 0/17; two household superfans (ages 11 and 10) scored 14.7/17 and 5.9/6 respectively, with the younger not evaluated on all questions.
  • Methodological issues undermined validity: questions were sourced by asking Claude Opus 4.6, superfans were recruited via a single-question household self-assessment, validation relied on panel discussion, follow-up questions to Gemini, and primary-source checks but cannot detect cases where everyone gave the same incorrect answer, and the sample size was tiny.
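
Because Superfan 2 was scored out of 6 rather than 17, the raw scores above are easier to compare as percentages. The following is a minimal sketch of that arithmetic, not the author's scoring code: the names `REPORTED_SCORES`, `partial_credit`, and `percent_correct` are hypothetical, and since the post does not publish per-question data, the three-part question in the usage note is invented purely to illustrate partial credit.

```python
from fractions import Fraction

# Scores as reported in the post: (points earned, questions attempted).
# Points are kept as strings so Fraction can parse them exactly.
REPORTED_SCORES = {
    "Superfan 1 (age 11)": ("14.7", 17),
    "Superfan 2 (age 10)": ("5.9", 6),
    "Gemini": ("17.0", 17),
    "Claude": ("16.8", 17),
    "ChatGPT": ("16.3", 17),
    "ELIZA": ("0", 17),
}


def partial_credit(components_correct: int, components_total: int) -> Fraction:
    """Credit for one question, as a fraction of its components answered correctly."""
    if components_total <= 0:
        raise ValueError("a question needs at least one component")
    return Fraction(components_correct, components_total)


def percent_correct(points: str, attempted: int) -> float:
    """Convert a reported score into a percentage of the questions attempted."""
    return round(float(Fraction(points) / attempted * 100), 1)


if __name__ == "__main__":
    for evaluee, (points, attempted) in REPORTED_SCORES.items():
        print(f"{evaluee:<20} {percent_correct(points, attempted):5.1f}%")
```

Under these assumptions, `percent_correct("14.7", 17)` gives 86.5 and `percent_correct("5.9", 6)` gives 98.3, while a hypothetical three-part question with two parts right would earn `partial_credit(2, 3)`, or two thirds of a point. The percentages are not directly comparable across evaluees, since Superfan 2 attempted a different subset of the questions.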
Cleaned source text

title: Introducing and Deprecating WoFBench

content_type: article

publication: Jeff Kaufman's Writing

published: 2026-03-01T13:00:00+00:00

source_url: https://www.jefftk.com/p/introducing-and-deprecating-wofbench

word_count: 793

We present and formally deprecate WoFBench, a novel test that compares
the knowledge of Wings of Fire superfans to frontier AI models. The
benchmark showed initial promise as a challenging evaluation, but
unfortunately proved to be saturated on creation as AI models produced
output that was, to the extent of our ability to score responses,

model capabilities, but they are struggling to keep up with LLM
progress: frontier models now consistently achieve high scores on many
popular benchmarks, raising questions about their continued ability to
differentiate between models. In response, we introduce WoFBench, an
evaluation suite designed to test recall and knowledge synthesis in the
domain of Tui
T. Sutherland's Wings of Fire universe.

The superfans were identified via a careful search process, in which
all members of the lead author's household were asked to complete a
self-assessment of their knowledge of the Wings of Fire universe. The
assessment consisted of a single question, with the text "do you think
you know the Wings of Fire universe better than Gemini?" Two
superfans were identified, who we keep anonymous to reduce the risk
of panel poaching by competing benchmark efforts.

Identification of questions proved difficult, as the benchmark authors
have extremely limited knowledge of Wings of Fire lore, primarily
derived from infodumping and overheard arguments. We initially
attempted to source questions from the superfans, where each could be
judged on the other's questions. As they were uncompensated and
rivalrous, however, they agreed to participate only to the extent that
their answers could be compared across the superfan panel. Instead,
questions were sourced by asking Claude Opus 4.6:

"Can you give me three questions about the Wings of Fire series, aiming
to make them as hard as possible? I intend to ask these to my
11-year-old, my 10-year-old, and also to Gemini, and I want them all to
struggle. My two kids have agreed to participate in this, and while
Gemini has not been consulted I do not expect it to object."

The final benchmark consisted of seventeen questions, limited primarily by
the lead author's willingness to continue. The elder superfan
appeared indefatigable, [1] and if this benchmark otherwise appeared
promising we are confident that an extremely large benchmark could be
constructed. Note that the younger superfan needed to leave for a
birthday party before evaluation could be completed, and was not
evaluated on all questions. Answers were collected in written form,
to avoid leakage within the superfan panel. No points were deducted
for errors of spelling.

Each answer was validated by allowing the superfans to discuss, asking
follow-up questions to Gemini, and in especially contentious cases by
direct inspection of primary sources. Note that this validation
procedure is not able to distinguish cases in which all superfans and
models were correct from ones in which they all give the same
incorrect answer.

We evaluated Gemini 3.1 Pro in real time, and followed up with
evaluations of Claude Opus 4.6, ChatGPT 5.2 Pro, and ELIZA. In
cases where questions had multiple components, partial credit was
given as a fraction of all components.

Evaluee               WoFBench Score
Superfan 1 (age 11)   14.7/17
Superfan 2 (age 10)   5.9/6
Gemini                17.0/17
Claude                16.8/17
ChatGPT               16.3/17
ELIZA                 0/17

We conclude that while some AI systems, notably ELIZA, performed
poorly, all frontier models scored very close to 100%. Many of the
lost points are arguably judgment calls, or cases where a model tried
to interpret a trick/misinformed question maximally charitably.
Superfan 1 performed noticeably below frontier models, though above
the ELIZA baseline. Superfan 2 performed competitively, though we note
she was not evaluated on the questions where Superfan 1 lost the most
points, making direct comparison difficult.

While this benchmark was designed to be challenging for both superfans
and AIs, it already has very limited ability to distinguish between
models. While further sensitivity might be squeezed out via the
addition of multi-sample evaluation, it's unlikely that this would be
meaningful for this model generation let alone future ones. This
reflects an increasingly common conundrum that benchmark developers
may find themselves in, where after investing large amounts of time,
effort, and money into the creation of a benchmark it is already
obsolete when published. The authors note that benchmark saturation
joins job displacement, stable authoritarianism, and human extinction
on the list of reasons to be concerned about the pace of AI progress.

[1] Superfan 1 was permitted to read a draft of this report prior
to publication. Their only feedback was that I should ask them
additional, harder, questions. As of publication time, Superfan 1 was
repeating "ask me more Wings of Fire questions!" at progressively
increasing volume.