ArXiv

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

2026-05-11 · 17:25 UTC · Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme...

Authors: Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme...
Categories: cs.CV, cs.AI
arXiv: https://arxiv.org/abs/2605.10873v1
PDF: https://arxiv.org/pdf/2605.10873v1

Brief

CADBench is a unified benchmark for multimodal CAD program generation, that is, recovering editable CAD programs from images or 3D observations. It comprises 18,000 evaluation samples across six benchmark families and five input modalities, scored with six metrics. The authors evaluated 11 systems (more than 1.4 million generated programs) and found that specialized mesh-to-CAD models substantially outperform code-generating VLMs under idealized inputs. They also report that reconstruction quality degrades with geometric complexity, that CAD-specialized models can be brittle under modality shift, and that model rankings change across metrics. (Summary based on abstract; full text not available.)

Why it matters

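Progress in recovering editable CAD programs has been hard to measure because existing evaluations are fragmented across datasets, modalities, and metrics. CADBench consolidates them into 18,000 evaluation samples spanning six benchmark families (derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse), five input modalities (clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders), and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count, and all families are diversity-sampled, which supports controlled analysis across complexity and object variation.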
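To make the idea of stratifying by B-rep face count concrete, here is a minimal sketch in Python. It is not the authors' procedure: the bin edges, the per-bin quota, and the `part_*` ids are hypothetical, and plain random sampling within each bin stands in for the diversity sampling, which the abstract does not specify.

```python
import bisect
import random
from collections import defaultdict


def stratified_sample(face_counts, bin_edges, per_bin, seed=0):
    """Pick up to `per_bin` sample ids from each face-count bin.

    face_counts: dict mapping sample id -> B-rep face count.
    bin_edges:   sorted inclusive upper bounds of the bins (hypothetical
                 values); counts above the last edge fall into an extra bin.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for sample_id, n_faces in sorted(face_counts.items()):
        # bisect_left gives the index of the first edge >= n_faces.
        bins[bisect.bisect_left(bin_edges, n_faces)].append(sample_id)

    chosen = {}
    for bin_index, ids in sorted(bins.items()):
        # Assumption: uniform random choice replaces diversity sampling.
        chosen[bin_index] = sorted(rng.sample(ids, min(per_bin, len(ids))))
    return chosen


# Toy face counts, invented for illustration only.
face_counts = {
    f"part_{i}": n for i, n in enumerate([3, 6, 8, 12, 15, 22, 40, 75, 130])
}
picked = stratified_sample(face_counts, bin_edges=[10, 50], per_bin=2)
```

Drawing the same number of samples from every bin keeps simple and complex parts equally represented, so a score can be read per complexity level rather than being dominated by whichever range is most common in the source dataset.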

Key details

The authors benchmarked 11 CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench also exposes three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics.
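The point that model rankings depend on the metric can be illustrated with a small Python sketch. The system names, metric names, and every number below are invented for illustration and are not results from the paper; the three metrics loosely mirror the categories named in the abstract (geometric fidelity, executability, program compactness).

```python
def rank_by_metric(scores, higher_is_better):
    """Return, for each metric, the system names ordered best to worst."""
    rankings = {}
    for metric, better_high in higher_is_better.items():
        rankings[metric] = sorted(
            scores,
            key=lambda system: scores[system][metric],
            reverse=better_high,
        )
    return rankings


# Hypothetical scores; NOT taken from CADBench.
scores = {
    "system_a": {"fidelity": 0.82, "executable_rate": 0.90, "program_length": 140},
    "system_b": {"fidelity": 0.75, "executable_rate": 0.99, "program_length": 60},
    "system_c": {"fidelity": 0.60, "executable_rate": 0.95, "program_length": 35},
}
# Shorter programs count as more compact, so lower is better there.
higher_is_better = {
    "fidelity": True,
    "executable_rate": True,
    "program_length": False,
}

rankings = rank_by_metric(scores, higher_is_better)
for metric, order in rankings.items():
    print(f"{metric}: {' > '.join(order)}")
```

In this toy data each system comes first on a different metric, which is why a single leaderboard number can hide trade-offs between fidelity, executability, and compactness.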

Source evidence

Abstract

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://huggingface.co/datasets/DeCoDELab/CADBench.