Twitter/X

Yann LeCun argues that JEPA should predict in representation space rather than…

Brief

Yann LeCun lays out his view of JEPA training: prediction should happen in latent representation space, not raw input space, because unpredictable detail makes input-level reconstruction a bad objective. He frames collapse prevention as the key challenge and contrasts EMA-based and Infomax-based solutions. He criticizes EMA for minimizing no explicit loss function and for requiring weight sharing, says sample-contrastive approaches do not scale to high dimensions, and explicitly favors dimension-contrastive methods like SIGReg/LeJEPA.

Why it matters

The argument bears on how self-supervised models should be trained on high-dimensional, continuous, noisy data. If many details of such data are inherently unpredictable, as LeCun claims, then reconstructing or predicting raw inputs is counterproductive, and prediction has to move to a representation space where those details have been eliminated.
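
A toy sketch may make the claim concrete. It is not taken from the tweet: the data, the hand-picked `target_encoder`, and the fixed predictor `2.0 * x` are all illustrative assumptions. Each target has one coordinate that is predictable from the context and one that is pure noise. Even the best possible input-space predictor keeps paying for the noise, while a predictor working on a representation that drops the noisy coordinate does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical toy data: the target y has a part that is predictable from the
# context x, plus detail noise that no predictor can recover from x.
x = rng.normal(size=(n, 1))
signal = 2.0 * x                          # predictable from x
noise = rng.normal(size=(n, 1))           # unpredictable detail
y = np.concatenate([signal, noise], axis=1)   # shape (n, 2)

# Input-space objective: predict every coordinate of y.
# The best predictor outputs (signal, 0), yet its error never drops below
# the contribution of the noise coordinate.
best_input_pred = np.concatenate([signal, np.zeros_like(noise)], axis=1)
input_loss = np.mean((y - best_input_pred) ** 2)

# Representation-space objective: an assumed (hand-picked, not learned)
# target encoder keeps only the predictable coordinate.
def target_encoder(targets):
    return targets[:, :1]

latent_pred = 2.0 * x                     # assumed predictor output
latent_loss = np.mean((target_encoder(y) - latent_pred) ** 2)

print("input-space loss: ", input_loss)
print("latent-space loss:", latent_loss)
```

In a real JEPA the encoder is learned rather than chosen by hand, which is exactly where the collapse problem comes from: an encoder that maps every input to the same vector would also drive the latent loss to zero.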

Key details

  • He says the main problem in JEPA is preventing collapse in the absence of a reconstruction loss, and groups existing solutions into two families: EMA target-encoder methods such as I-JEPA, V-JEPA, DINO, and BYOL; and Infomax regularization methods that try to maximize the information content of the representation.
  • LeCun claims sample-contrastive Infomax methods like Siamese nets, DrLIM, and SimCLR scale poorly in high dimensions and need large batches and hard negative mining, while dimension-contrastive methods such as Barlow Twins, VICReg, SIGReg/LeJEPA, MMCR, and MCR2 are the more promising direction.
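
The two families above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not code from any of the named papers: `ema_update` and `variance_covariance_penalty` are hypothetical helper names, the decay value is arbitrary, and the regularizer is a simplified VICReg-style variance and covariance penalty standing in for the dimension-contrastive family. SIGReg/LeJEPA use a different regularizer that is not reproduced here.

```python
import numpy as np

def ema_update(target_w, online_w, decay=0.99):
    """EMA family: target-encoder weights track the online encoder's weights."""
    return decay * target_w + (1.0 - decay) * online_w

def variance_covariance_penalty(z, eps=1e-4):
    """Dimension-contrastive family: a simplified VICReg-style regularizer.

    z has shape (batch, dim). The variance term is a hinge that penalizes
    any dimension whose standard deviation falls below 1; the covariance
    term penalizes correlation between different dimensions.
    """
    z = z - z.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    var_term = np.mean(np.maximum(0.0, 1.0 - std))
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_term = np.sum(off_diag ** 2) / z.shape[1]
    return var_term + cov_term

rng = np.random.default_rng(0)
spread = rng.normal(size=(512, 8))   # varied, roughly decorrelated embeddings
collapsed = np.ones((512, 8))        # collapse: every sample maps to one vector

penalty_spread = variance_covariance_penalty(spread)
penalty_collapsed = variance_covariance_penalty(collapsed)

print("penalty, spread embeddings:   ", penalty_spread)
print("penalty, collapsed embeddings:", penalty_collapsed)
```

The contrast matches the bullets: the regularizer is an explicit loss term that is large for collapsed embeddings, whereas the EMA rule is only a weight update with no objective attached to it.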

Source evidence

title: @ylecun: I think you missed the main ideas.
- The basic premise of JEPA is that training by reconstructio/pre...
author: @ylecun
contenttype: tweet
publication: Twitter/X
published: 2026-01-04T20:11:15+00:00
source
url: https://x.com/ylecun/status/2007907701989232684

word_count: 215

I think you missed the main ideas.
- The basic premise of JEPA is that training by reconstruction/prediction in input space is evil (or counterproductive). The details are almost always unpredictable. Hence prediction must take place in representation space, where unpredictable details are eliminated.
- The main issue with JEPA is how to prevent collapse (in the absence of reconstruction loss). There are two classes of methods:
(1) EMA: Using weights in target encoder that are an exponential moving average (EMA) of the weights in other encoder (I-JEPA, V-JEPA, DINO, BYOL).
(2) Infomax: Using a regularizer that attempts to maximize the information content of the representation (e.g. over a batch). There are two sets of methods for that:
(2a) sample-contrastive methods: that want to make each representation vector different from the others (Siamese nets, DrLIM, SimCLR, etc). They tend to not work well in high dimension, to require large batches, and hard negative mining
(2b) dimension-contrastive methods: that want to make each variable independent from the others (Barlow Twins, VICReg, SIGReg/ LeJEPA, MMCR, MCR2....)
Bottom line:
A. SSL by reconstruction/prediction doesn't work for high-dim, continuous, noisy data
B. EMA sucks: no loss function being minimized, requirement for weight sharing....
C. Sample-contrastive Infomax doesn't scale to high dimension
D. My money is on dimension-contrastive methods like SIGReg/LeJEPA