ArXiv

It Just Takes Two: Scaling Amortized Inference to Large Sets

Authors
Antoine Wehenkel, Michael Kagan, Lukas Heinrich...
Categories
cs.LG, cs.AI, stat.ML
arXiv
https://arxiv.org/abs/2605.07972v1
PDF
https://arxiv.org/pdf/2605.07972v1

Brief

It Just Takes Two introduces a simple, theoretically grounded strategy for amortized neural posterior estimation that decouples representation learning from posterior modeling. The authors train a mean-pool Deep Set on sets of size ≤2 to produce an encoder that generalizes to arbitrary set sizes, then finetune an inference head on aggregated embeddings; this makes training cost essentially independent of deployment set size N. Across diverse benchmarks (scalar, image, multi-view 3D, molecular, high-dimensional generation) with N in the thousands, the method matches or outperforms baselines while using much less compute.
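
The size-agnostic encoder described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the per-element network `phi`, its random weights, and the dimensions (3 input features, 8 embedding features) are hypothetical choices. They only show that a masked mean over element embeddings yields a fixed-size vector whether a set holds 1, 2, or 5000 elements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-element encoder phi: a fixed random one-layer network.
# (Illustrative stand-in; the paper's architecture is not given in this summary.)
W = rng.normal(size=(3, 8))   # element dimension 3 -> embedding dimension 8
b = rng.normal(size=8)

def phi(x):
    """Embed each set element independently; x has shape (..., 3)."""
    return np.tanh(x @ W + b)

def mean_pool_encode(sets, mask):
    """Masked mean pooling over the set axis.

    sets: (batch, max_size, 3) padded set elements
    mask: (batch, max_size), 1.0 for real elements and 0.0 for padding
    returns: (batch, 8), one embedding per set
    """
    h = phi(sets) * mask[..., None]                     # zero out padded slots
    return h.sum(axis=1) / mask.sum(axis=1, keepdims=True)

# Training-style batch: four sets of size 1 or 2, padded to length 2.
sizes = rng.integers(1, 3, size=4)                      # each entry is 1 or 2
small_sets = rng.normal(size=(4, 2, 3))
small_mask = (np.arange(2)[None, :] < sizes[:, None]).astype(float)
z_small = mean_pool_encode(small_sets, small_mask)      # shape (4, 8)

# Deployment-style set: the same encoder applied to N = 5000 elements.
big_set = rng.normal(size=(1, 5000, 3))
z_big = mean_pool_encode(big_set, np.ones((1, 5000)))   # shape (1, 8)
```

Because the pooling is a mean over elements, the output ignores element order and padded slots, so the encoder's interface stays the same at any set size.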

Why it matters

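Treating a set of observations jointly normally requires training the estimator at the deployment set size N, a regime where memory and compute quickly become prohibitive. Method: Train a mean-pool Deep Set encoder on sets of size at most two to learn representations that generalize to arbitrary N; then finetune the inference head on pre-aggregated embeddings so that training compute and memory are essentially independent of N.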
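
The pre-aggregation step can be sketched as follows. This assumes the encoder is held fixed while the inference head is finetuned, which this summary does not spell out; `phi`, its random weights, `preaggregate`, and `chunk_size` are all hypothetical names and values used for illustration. The sketch streams a large set through the encoder in chunks and caches a single mean embedding per set, so whatever is trained afterwards sees one fixed-size vector per set.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))  # hypothetical frozen encoder weights (3 -> 8)

def phi(x):
    """Per-element embedding; x has shape (n, 3), output has shape (n, 8)."""
    return np.tanh(x @ W)

def preaggregate(elements, chunk_size=1024):
    """Streaming mean of phi over a set of shape (N, 3); returns shape (8,).

    Only one chunk of element embeddings is held in memory at a time.
    """
    total = np.zeros(W.shape[1])
    for start in range(0, len(elements), chunk_size):
        total += phi(elements[start:start + chunk_size]).sum(axis=0)
    return total / len(elements)

# Cache one fixed-size embedding per simulated set, whatever its size.
set_sizes = [10, 1000, 5000]
cache = np.stack([preaggregate(rng.normal(size=(n, 3))) for n in set_sizes])
# cache has shape (3, 8): an inference head would be trained on these rows.
```

Under these assumptions, the cost of each head update depends on the embedding dimension (8 here) and not on N; the one-time pass over the elements is linear in N and needs no gradients.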

Key details

  • Results & provenance: Authors Wehenkel, Kagan, Heinrich, and Pollard (arXiv 2026-05-08) evaluate on scalar, image, multi-view 3D, molecular, and high-dimensional conditional-generation benchmarks with N in the thousands, matching or outperforming standard baselines at a fraction of the compute.

Source evidence

Abstract

Neural posterior estimation has emerged as a powerful tool for amortized inference, with growing adoption across scientific and applied domains. In many of these applications, the conditioning variable is a set of observations whose elements depend not only on the target but also on unknown factors shared across the set. Optimal inference therefore requires treating the set jointly, which in turn requires training the estimator at the deployment set size -- a regime where memory and compute quickly become prohibitive. We introduce a simple, theoretically grounded strategy that decouples representation learning from posterior modeling. Our method trains a mean-pool Deep Set on sets of size at most two, producing an encoder that generalizes to arbitrary set sizes. The inference head is then finetuned on pre-aggregated embeddings, making training cost essentially independent of the deployment set size N. Across scalar, image, multi-view 3D, molecular, and high-dimensional conditional generation benchmarks with N in the thousands, our approach matches or outperforms standard baselines at a fraction of the compute.