Ramp launched benchmarks on 2026-03-11 to evaluate frontier LLMs on real-world…

title: @rahulgs: we’re launching a robust set of benchmarks evaluating frontier models on real world, non-swe tasks
author: @rahulgs
contenttype: tweet
publication: Twitter/X
published: 2026-03-11T17:01:06+00:00
sourceurl: https://x.com/rahulgs/status/2031777447792738393

word_count: 213

we’re launching a robust set of benchmarks evaluating frontier models on real-world, non-swe tasks

my favorite task is contextual OCR, which evaluates models on their ability to identify and reapply patterns in long-context tasks

our goal is to save every click and key press with accounts payable automation - our agent needs to do everything it can so you can just hit “pay” on a pdf bill

for this, we feed models large prompts containing a set of documents + how they were processed

in one particular case, models correctly identified and reapplied the pattern of adding “-” to invoice numbers (e.g. INV-ABC-1234), as per a customer's revealed preference
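
A minimal sketch of how such a prompt could be assembled and scored, purely for illustration. The `ProcessedDoc` type, the `build_prompt` and `score` helpers, the prompt wording, and the unhyphenated example values are all assumptions and are not taken from Ramp's benchmark.

```python
from dataclasses import dataclass


@dataclass
class ProcessedDoc:
    """One past bill: what was printed on it and what the customer saved."""
    text_on_bill: str    # hypothetical raw value, as it appears on the PDF
    recorded_value: str  # hypothetical value the customer actually recorded


def build_prompt(history: list[ProcessedDoc], new_text: str) -> str:
    """Lay out past documents and their outcomes, then ask about a new bill."""
    lines = ["Past invoices and how this customer recorded them:"]
    for i, doc in enumerate(history, start=1):
        lines.append(
            f"{i}. on bill: {doc.text_on_bill!r} -> recorded: {doc.recorded_value!r}"
        )
    lines.append(f"New bill shows: {new_text!r}")
    lines.append("Return the invoice number as this customer would record it.")
    return "\n".join(lines)


def score(predicted: str, expected: str) -> bool:
    """Exact-match check: the hyphenation has to follow the customer's pattern."""
    return predicted.strip() == expected


# Hypothetical history in which the customer always adds "-" separators.
history = [
    ProcessedDoc("INV ABC 1001", "INV-ABC-1001"),
    ProcessedDoc("INV ABC 1002", "INV-ABC-1002"),
]
prompt = build_prompt(history, "INV ABC 1234")
```

Under these assumptions, a model answer of `INV-ABC-1234` would pass the exact-match check, while echoing the bill's text unchanged would not.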

in other cases, models had to deduce what payment date a customer would want to set, based on past preferences and the current invoice's due date and terms. this combines multimodal performance (pdf parsing), long-horizon performance (large context size), reasoning, gsm and instruction following
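
To make the payment-date reasoning concrete, here is a rough sketch of the arithmetic a model would need to reproduce. It assumes "net N days" terms and assumes the customer's preference is a typical number of days paid ahead of the due date; the function names and the median heuristic are illustrative and do not describe how Ramp's agent or benchmark works.

```python
from datetime import date, timedelta
from statistics import median


def due_date(invoice_date: date, net_days: int) -> date:
    """Due date under assumed 'net N' terms: invoice date plus N days."""
    return invoice_date + timedelta(days=net_days)


def preferred_lead_days(history: list[tuple[date, date]]) -> int:
    """Median days between the chosen payment date and the due date.

    history holds (due_date, chosen_payment_date) pairs from past bills.
    """
    return int(median((due - paid).days for due, paid in history))


def suggest_payment_date(
    invoice_date: date, net_days: int, history: list[tuple[date, date]]
) -> date:
    """Apply the customer's usual lead time to the new invoice's due date."""
    lead = preferred_lead_days(history)
    return due_date(invoice_date, net_days) - timedelta(days=lead)


# Hypothetical past behaviour: the customer pays 2 to 3 days before the due date.
past = [
    (date(2026, 1, 31), date(2026, 1, 28)),
    (date(2026, 2, 28), date(2026, 2, 25)),
    (date(2026, 3, 31), date(2026, 3, 29)),
]
suggested = suggest_payment_date(date(2026, 3, 1), 30, past)
```

In the benchmark setting described above, none of these inputs arrive as structured data: the model has to read the dates and terms from the pdf and infer the preference from the documents in its context.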

learn more about this task and more in the blog post

Ramp (@tryramp)

Word on the timeline is that agents will go from automating coding to knowledge work in 2026. So we benchmarked frontier LLMs on doing financial tasks to see what's good.

The result: American models are outperforming their Chinese counterparts in both reliability + performance.

— https://nitter.net/tryramp/status/2031767848452821233#m
