ArXiv

Counterfactual Stress Testing for Image Classification Models

Authors
Moritz Stammel, Fabio De Sousa Ribeiro, Raghav Mehta...
Categories
cs.CV
arXiv
https://arxiv.org/abs/2605.10894v1
PDF
https://arxiv.org/pdf/2605.10894v1

Brief

Counterfactual stress testing for medical image classifiers uses causal generative models to create realistic, anatomy-preserving “what-if” images by intervening on attributes such as scanner type and patient sex. Evaluated on chest X-ray and mammography across three architectures and multiple shift scenarios (Stammel et al., 2026), the method predicts real out-of-distribution (OOD) performance substantially more accurately than simple brightness or contrast perturbations, giving a more reliable basis for robustness assessment before deployment.

Why it matters

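Medical imaging models often fail in new clinical environments because of distribution shifts in demographics, scanner hardware, or acquisition protocols, and stress tests built on simple perturbations can overestimate robustness. Moritz Stammel et al. (arXiv, 2026-05-11) propose a counterfactual stress testing framework that uses causal generative models to intervene on attributes (e.g., scanner type, patient sex) and synthesize anatomy-preserving “what-if” images for targeted distribution shifts.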

Key details

  • Evaluated on chest X-ray and mammography across three model architectures and multiple shift scenarios, the counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations. They capture the direction and relative magnitude of performance changes as well as the ranking of models.

Source evidence

Abstract

Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in demographics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes. Although stress testing has emerged as a tool to assess this, current methods typically rely on simple, uninformed perturbations (e.g., brightness or contrast changes), which fail to capture clinically realistic variation and can overestimate robustness. In this work, we introduce a counterfactual stress testing framework based on causal generative models that create realistic "what if" images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity, enabling controlled and semantically meaningful evaluation under targeted distribution shifts. Across two imaging modalities (chest X-ray and mammography), three model architectures, and multiple shift scenarios, we show that counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, capturing the direction and relative magnitude of performance changes as well as model ranking. These results suggest that causal generative models can serve as practical simulators for robustness assessment, offering a more reliable basis for evaluating medical AI systems prior to deployment.
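
The abstract names three criteria for judging a stress test as a proxy for real out-of-distribution performance: direction of the performance change, its relative magnitude, and model ranking. The sketch below shows one way such a comparison could be scored. It is an illustration under stated assumptions, not the authors' evaluation protocol: the function names, the choice of sign agreement, mean absolute gap, and Spearman rank correlation, and all numbers are hypothetical and do not come from the paper.

```python
from math import sqrt
from typing import Dict, List


def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def deltas(in_dist: Dict[str, float], shifted: Dict[str, float]) -> Dict[str, float]:
    """Per-model change in a metric (assumed here: AUC) under a shift."""
    return {m: shifted[m] - in_dist[m] for m in in_dist}


def direction_agreement(proxy: Dict[str, float], real: Dict[str, float]) -> float:
    """Fraction of models whose proxy delta has the same sign as the real delta."""
    hits = sum(sign(proxy[m]) == sign(real[m]) for m in real)
    return hits / len(real)


def mean_abs_gap(proxy: Dict[str, float], real: Dict[str, float]) -> float:
    """Mean absolute difference between proxy and real deltas (magnitude check)."""
    return sum(abs(proxy[m] - real[m]) for m in real) / len(real)


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined when all values are tied")
    return cov / sqrt(vx * vy)


# Made-up AUC values for three hypothetical models (NOT results from the paper).
models = ["model_a", "model_b", "model_c"]
in_dist = {"model_a": 0.90, "model_b": 0.88, "model_c": 0.91}
stress = {"model_a": 0.85, "model_b": 0.86, "model_c": 0.80}   # stress-test proxy
real_ood = {"model_a": 0.83, "model_b": 0.85, "model_c": 0.78}  # real shifted data

proxy_delta = deltas(in_dist, stress)
real_delta = deltas(in_dist, real_ood)

print("direction agreement:", direction_agreement(proxy_delta, real_delta))
print("mean absolute gap:", mean_abs_gap(proxy_delta, real_delta))
print("rank correlation:", spearman([stress[m] for m in models],
                                    [real_ood[m] for m in models]))
```

Under this sketch, a stress test that tracks real shifts well would show high direction agreement, a small mean absolute gap, and a rank correlation close to 1. The same three scores could be computed for a brightness or contrast perturbation baseline to compare the two kinds of proxy on equal terms.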