Twitter/X

Ramp launched benchmarks on 2026-03-11 to evaluate frontier LLMs on real-world…

Brief

Ramp is introducing a benchmark suite that evaluates frontier models on knowledge-work automation in finance rather than on coding tasks. The examples center on accounts payable workflows: models receive large prompts containing documents plus examples of how those documents were processed, then must infer formatting and payment preferences across long contexts. Ramp says the evaluations show American models outperforming Chinese models in both reliability and performance.

Why it matters

Ramp launched benchmarks on 2026-03-11 to evaluate frontier LLMs on real-world non-software-engineering financial tasks, especially accounts payable automation.

Key details

  • One featured benchmark, "contextual OCR," tests whether models can infer and reapply customer-specific patterns from long document contexts, such as formatting invoice numbers as "INV-ABC-1234" by adding hyphens based on prior processed documents.
  • The benchmarks also test multimodal PDF parsing, long-context reasoning, arithmetic, instruction following, and preference inference, such as choosing a payment date from past customer behavior, the invoice due date, and payment terms.
  • Ramp claims American models outperform Chinese models on both reliability and performance.
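
To make the two tasks above concrete, here is a minimal Python sketch of rule-based baselines for them. It is an illustration only, not Ramp's benchmark or scoring code, which the source does not describe. The function names, the unhyphenated raw form "INVABC1234", and the sample payment history are all hypothetical; the source gives only the processed form "INV-ABC-1234".

```python
from collections import Counter
from datetime import date, timedelta
from statistics import median_low


def infer_hyphen_positions(examples):
    """Return the most common hyphen layout seen in (raw, processed) pairs.

    Each position is an index into the raw string before which a hyphen goes.
    """
    votes = Counter()
    for raw, processed in examples:
        if processed.replace("-", "") != raw:
            continue  # ignore pairs that differ by more than added hyphens
        positions, raw_index = [], 0
        for ch in processed:
            if ch == "-":
                positions.append(raw_index)
            else:
                raw_index += 1
        votes[tuple(positions)] += 1
    return votes.most_common(1)[0][0] if votes else ()


def apply_hyphens(raw, positions):
    """Insert hyphens at the given positions; leave raw alone if it is too short."""
    if positions and positions[-1] >= len(raw):
        return raw
    parts, prev = [], 0
    for pos in positions:
        parts.append(raw[prev:pos])
        prev = pos
    parts.append(raw[prev:])
    return "-".join(parts)


def infer_payment_date(history, due_date):
    """Pay the same number of days before the due date as the customer usually did.

    history is a list of (due, paid) date pairs (hypothetical data shape).
    """
    if not history:
        return due_date
    lead_days = median_low([(due - paid).days for due, paid in history])
    return due_date - timedelta(days=lead_days)


if __name__ == "__main__":
    # Hypothetical prior documents: raw OCR text and how the customer stored it.
    prior = [("INVABC1234", "INV-ABC-1234"), ("INVXYZ0042", "INV-XYZ-0042")]
    layout = infer_hyphen_positions(prior)
    print(apply_hyphens("INVQRS9876", layout))  # INV-QRS-9876

    # Hypothetical payment history: (invoice due date, date actually paid).
    past = [
        (date(2026, 1, 31), date(2026, 1, 28)),
        (date(2026, 2, 28), date(2026, 2, 25)),
        (date(2026, 3, 31), date(2026, 3, 29)),
    ]
    print(infer_payment_date(past, date(2026, 4, 30)))  # 2026-04-27
```

Baselines like these only work when the preference is a fixed rule. The benchmark as described asks models to recover such preferences from long, multimodal prompts, where the pattern is not isolated in advance.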

Source evidence

title: @rahulgs: we’re launching a robust set of benchmarks evaluating frontier models on real world, non-swe tasks

...
author: @rahulgs
contenttype: tweet
publication: Twitter/X
published: 2026-03-11T17:01:06+00:00
source url: https://x.com/rahulgs/status/2031777447792738393

word_count: 213

we’re launching a robust set of benchmarks evaluating frontier models on real world, non-swe tasks

my favorite task is contextual OCR, which evaluates models on their ability to identify and reapply patterns in long context tasks

our goal is to save every click and key press with accounts payable automation - our agent needs to do everything it can so you can just hit “pay” on a pdf bill

for this, we feed models with large prompts with a set of documents + how they were processed

in one particular case, models correctly identified and reapplied the pattern of adding “-” to invoice numbers, as per a customer’s revealed preference. INV-ABC-1234

in other cases models had to deduce what payment date a customer would want to set based on past preferences and current invoice due date and terms, combining multimodal performance (pdf parsing), long horizon performance (large context size), reasoning, gsm and instruction following

learn more about this task and more in the blog post

Ramp (@tryramp)

Word on the timeline is that agents will go from automating coding to knowledge work in 2026. So we benchmarked frontier LLMs on doing financial tasks to see what's good.

The result: American models are outperforming their Chinese counterparts in both reliability + performance.

— https://nitter.net/tryramp/status/2031767848452821233#m