Twitter/X

An ex-OpenAI safety researcher said OpenAI spent 18 months building internal eval…

Brief

An ex-OpenAI safety researcher told @xmayeth at an Austin afterparty that Polymarket's open-source RAG repo effectively replicates the 18 months of internal eval work OpenAI did on calibration, verification, and structured reasoning. The researcher runs an autonomous agent that scores markets every 20 minutes across 9 sources, enters only when its edge exceeds 12%, and sizes positions at quarter-Kelly; he reported +$61,000 in five months. The author tried the setup with $750 and saw +$340 in week one.

Why it matters

An ex-OpenAI safety researcher said OpenAI spent 18 months building internal eval pipelines (confidence calibration, multi-source verification, structured reasoning). He claimed 'GPT is optimized to be agreeable' while 'Claude is optimized to be uncertain', the argument being that uncertainty is the edge in prediction markets.

Key details

  • Polymarket open-sourced a RAG pipeline repo that the researcher described as the equivalent of a $200M quant desk: 9 live data sources, an LLM backbone, and autonomous execution. His agent rescores every market every 20 minutes (inputs include Reuters, AP, odds feeds, on-chain flow, and social sentiment), enters when its edge exceeds 12%, sizes positions at quarter-Kelly, and exits instantly. Reported results over 5 months: +$61,000 total, with a worst month of -$800 and a best month (February) of +$19,000.
  • The author replicated the setup with $750 across four agents and reported +$340 and a 74% hit rate in the first week. The researcher, who had left OpenAI four months earlier, warned: 'Don't post about this.'
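
The entry and sizing rule described above can be illustrated with a short sketch. This is not the researcher's or Polymarket's code. It assumes that "edge" means the model's estimated probability minus the market price of a YES share (the post does not define it), and it uses the standard Kelly fraction for a binary contract that pays 1 on a win, f = (q - p) / (1 - p), scaled by one quarter. The function and parameter names are hypothetical.

```python
def kelly_fraction(model_prob: float, market_price: float) -> float:
    """Full-Kelly fraction of bankroll for buying a YES share.

    A share bought at market_price pays 1 if the event occurs, so the
    Kelly fraction is (q - p) / (1 - p), floored at 0 (no bet without edge).
    """
    if not 0.0 < market_price < 1.0:
        raise ValueError("market_price must be strictly between 0 and 1")
    if not 0.0 <= model_prob <= 1.0:
        raise ValueError("model_prob must be between 0 and 1")
    return max(0.0, (model_prob - market_price) / (1.0 - market_price))


def position_size(
    bankroll: float,
    model_prob: float,
    market_price: float,
    min_edge: float = 0.12,    # assumed reading of the ">12% edge" entry rule
    kelly_scale: float = 0.25,  # quarter-Kelly
) -> float:
    """Dollar amount to stake, or 0.0 if the edge does not clear the threshold."""
    edge = model_prob - market_price  # assumption: edge in probability points
    if edge <= min_edge:
        return 0.0
    return bankroll * kelly_scale * kelly_fraction(model_prob, market_price)


if __name__ == "__main__":
    # Hypothetical numbers: $750 bankroll, model says 65%, market trades at 50 cents.
    print(position_size(750.0, 0.65, 0.50))
    # A 10-point edge is below the 12% threshold, so no position is taken.
    print(position_size(750.0, 0.60, 0.50))
```

With the hypothetical inputs in the first call, the edge is 15 points, the full-Kelly fraction is 0.30, and the quarter-Kelly stake works out to roughly $56 of the $750 bankroll. Scaling Kelly down trades expected growth for smaller drawdowns when the probability estimate is itself noisy.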