title: @rahulgs: we’re launching a robust set of benchmarks evaluating frontier models on real world, non-swe tasks
...
author: @rahulgs
contenttype: tweet
publication: Twitter/X
published: 2026-03-11T17:01:06+00:00
sourceurl: https://x.com/rahulgs/status/2031777447792738393
word_count: 213
we’re launching a robust set of benchmarks evaluating frontier models on real world, non-swe tasks
my favorite task is contextual OCR, which evaluates models on their ability to identify and reapply patterns in long-context tasks
our goal is to save every click and key press with accounts payable automation - our agent needs to do everything it can so you can just hit “pay” on a pdf bill
for this, we feed models large prompts containing a set of documents + how they were processed
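a minimal sketch of what such a prompt could look like, assuming each past bill is available as extracted text plus a dict of the fields it was finally processed with. the function name, field names and prompt wording are hypothetical, not the benchmark's actual format:

```python
import json


def build_prompt(examples, new_document):
    """Assemble one long prompt from past bills and a new bill.

    examples: list of (document_text, processed_fields) pairs, where
    processed_fields is a dict of how the bill was finally entered.
    All names and wording here are illustrative assumptions.
    """
    parts = ["Past bills and how they were processed:"]
    for i, (doc_text, processed) in enumerate(examples, start=1):
        parts.append(f"--- example {i} ---")
        parts.append(doc_text)
        # Serialize the final fields so the model can see the customer's habits.
        parts.append("processed as: " + json.dumps(processed, sort_keys=True))
    parts.append("--- new bill ---")
    parts.append(new_document)
    parts.append("Fill in the fields for the new bill the way this customer would.")
    return "\n".join(parts)
```

in practice the documents would be pdfs or images rather than plain text, which is where the multimodal part comes in.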
in one particular case, models correctly identified and reapplied the pattern of adding "-" to invoice numbers, as per a customer's revealed preference, e.g. INV-ABC-1234
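to make the hyphen pattern concrete, here is a toy, deterministic version of it. the benchmark asks a model to pick this up from context; this sketch only illustrates what "identify and reapply" means, and it assumes the history is a list of (extracted value, customer's final value) pairs. the function names and sample values are made up:

```python
import re
from collections import Counter


def learn_template(history):
    """Return the most common tuple of hyphen-separated group lengths, or None."""
    counts = Counter()
    for raw, corrected in history:
        raw_chars = re.sub(r"[^A-Za-z0-9]", "", raw)
        # Only count pairs where the customer's edit was purely adding hyphens.
        if "-" in corrected and corrected.replace("-", "") == raw_chars:
            counts[tuple(len(g) for g in corrected.split("-"))] += 1
    return counts.most_common(1)[0][0] if counts else None


def apply_template(raw, template):
    """Insert hyphens into raw according to the learned group lengths."""
    chars = re.sub(r"[^A-Za-z0-9]", "", raw)
    if template is None or sum(template) != len(chars):
        return raw  # pattern does not fit, so leave the value untouched
    groups, pos = [], 0
    for n in template:
        groups.append(chars[pos:pos + n])
        pos += n
    return "-".join(groups)


history = [("INVABC1234", "INV-ABC-1234"), ("INVXYZ0042", "INV-XYZ-0042")]
template = learn_template(history)
print(apply_template("INVDEF5678", template))  # INV-DEF-5678
```

a fixed rule like this breaks as soon as the customer's habit depends on the vendor or document layout, which is the case for letting a model infer it from examples.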
in other cases, models had to deduce what payment date a customer would want based on past preferences and the current invoice's due date and terms. this combines multimodal performance (pdf parsing), long horizon performance (large context size), reasoning, gsm-style math and instruction following
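the date arithmetic involved can be sketched like this, under loud assumptions: terms are written as "Net N", the customer's preference is summarized as the median number of days between due date and chosen payment date, and every name and date below is hypothetical. the real task expects a model to work this out from the documents rather than from a hand-written rule:

```python
import re
from datetime import date, timedelta
from statistics import median


def due_date_from_terms(invoice_date, terms):
    """Compute the due date for terms of the form 'Net 30' (assumed format)."""
    m = re.fullmatch(r"net\s*(\d+)", terms.strip(), flags=re.IGNORECASE)
    if m is None:
        raise ValueError(f"unsupported terms: {terms!r}")
    return invoice_date + timedelta(days=int(m.group(1)))


def preferred_offset_days(history):
    """Median of (chosen payment date - due date) in days over past bills."""
    # int() truncates toward zero if the median falls on a half day.
    return int(median((paid - due).days for due, paid in history))


def suggest_payment_date(invoice_date, terms, history):
    """Apply the customer's usual offset to the new invoice's due date."""
    due = due_date_from_terms(invoice_date, terms)
    return due + timedelta(days=preferred_offset_days(history))


# Hypothetical history of (due date, date the customer chose to pay).
history = [
    (date(2026, 1, 31), date(2026, 1, 28)),
    (date(2026, 2, 28), date(2026, 2, 25)),
    (date(2026, 3, 15), date(2026, 3, 13)),
]
print(suggest_payment_date(date(2026, 3, 1), "Net 30", history))  # 2026-03-28
```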
learn more about this task and others in the blog post
Ramp (@tryramp)
Word on the timeline is that agents will go from automating coding to knowledge work in 2026. So we benchmarked frontier LLMs on doing financial tasks to see what's good.
The result: American models are outperforming their Chinese counterparts in both reliability + performance.
— https://nitter.net/tryramp/status/2031767848452821233#m