Yann LeCun argues that JEPA should predict in representation space rather than…

Brief

Yann LeCun lays out his view of JEPA training: prediction should happen in latent representation space, not raw input space, because unpredictable detail makes input-level reconstruction a bad objective. He frames collapse prevention as the key challenge, contrasts EMA-based and Infomax-based solutions, criticizes EMA for minimizing no explicit loss function and sample-contrastive methods for scaling poorly to high dimensions, and explicitly favors dimension-contrastive methods like SIGReg/LeJEPA.
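A toy illustration of this premise, using made-up numbers: each sample is a (content, detail) pair, and a context and its target share content but carry unrelated detail. The data, `encode`, and `collapsed_encode` are hypothetical and stand in for no real JEPA component. The sketch only shows that input-space error keeps a floor set by the detail, that an encoder discarding the detail removes it, and that a constant encoder removes it too, which is the collapse problem.

```python
# Toy sketch under illustrative assumptions; not code from the tweet.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

# (context, target) pairs; each vector is (content, detail).
# Content matches within a pair, detail does not (hypothetical values).
PAIRS = [
    ((1.0, 0.3), (1.0, -0.7)),
    ((2.0, -0.5), (2.0, 0.9)),
    ((-1.0, 0.8), (-1.0, -0.2)),
]

def encode(x):
    """Hypothetical encoder that keeps content and drops detail."""
    return (x[0],)

def collapsed_encode(x):
    """Degenerate encoder that maps every input to the same point."""
    return (0.0,)

# Input space: use the context itself as the prediction of the target.
input_space_loss = mean(mse(ctx, tgt) for ctx, tgt in PAIRS)

# Representation space: compare encodings instead of raw inputs.
latent_loss = mean(mse(encode(ctx), encode(tgt)) for ctx, tgt in PAIRS)

# A collapsed encoder also reaches zero error while carrying no information.
collapsed_loss = mean(
    mse(collapsed_encode(ctx), collapsed_encode(tgt)) for ctx, tgt in PAIRS
)

print(input_space_loss, latent_loss, collapsed_loss)
```

The prediction error alone cannot tell the useful encoder from the collapsed one, so some extra mechanism has to rule the collapsed solution out.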

Key details

He says the central technical problem in JEPA is avoiding collapse without reconstruction loss, and groups existing solutions into two families: EMA target-encoder methods such as I-JEPA, V-JEPA, DINO, and BYOL; and Infomax regularization methods that maximize representation information content.
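A minimal sketch of the EMA rule that this family relies on, with weights flattened into plain Python lists. The function name `ema_update` and the `decay` parameter are assumptions made for illustration; real implementations apply the same formula per parameter tensor, and the decay values they use are not stated in the source.

```python
# Sketch of an exponential moving average (EMA) target-encoder update.
# Names and values are illustrative assumptions.

def ema_update(target, online, decay=0.99):
    """Return new target weights: decay * target + (1 - decay) * online."""
    if len(target) != len(online):
        raise ValueError("weight vectors must have the same length")
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]

# The online encoder is trained by gradient descent (held fixed here);
# the target encoder only ever tracks it through the moving average.
online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]
for _ in range(3):
    target = ema_update(target, online, decay=0.5)

print(target)
```

Note that the target weights are set by this update rule rather than by descending a gradient.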

LeCun claims sample-contrastive Infomax methods like Siamese nets, DrLIM, and SimCLR scale poorly in high dimensions and need large batches and hard negative mining, while dimension-contrastive methods such as Barlow Twins, VICReg, SIGReg/LeJEPA, MMCR, and MCR2 are the more promising direction.
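To make the dimension-contrastive idea concrete, here is a small sketch modeled on the variance and covariance terms of VICReg, one of the methods named above. It is not SIGReg/LeJEPA, and it penalizes correlation between dimensions, which is only a proxy for independence. The function names, `target_std`, `eps`, and the example batches are assumptions chosen for illustration.

```python
# Sketch of a VICReg-style dimension-contrastive regularizer (pure Python).
# A batch is a list of n representation vectors, each of dimension d.
import math

def covariance(batch):
    """Unbiased d x d covariance matrix of the batch."""
    n, d = len(batch), len(batch[0])
    mu = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - mu[j] for j in range(d)] for row in batch]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

def variance_term(cov, target_std=1.0, eps=1e-4):
    """Hinge penalty on dimensions whose std falls below target_std."""
    d = len(cov)
    return sum(
        max(0.0, target_std - math.sqrt(cov[j][j] + eps)) for j in range(d)
    ) / d

def covariance_term(cov):
    """Sum of squared off-diagonal covariances, divided by d."""
    d = len(cov)
    return sum(
        cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j
    ) / d

# Hypothetical batches: collapsed, redundant, and decorrelated.
collapsed = [[1.0, 2.0]] * 4
redundant = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
spread = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

for name, batch in [("collapsed", collapsed), ("redundant", redundant), ("spread", spread)]:
    cov = covariance(batch)
    print(name, variance_term(cov), covariance_term(cov))
```

The collapsed batch is penalized by the variance term, the redundant batch by the covariance term, and the spread batch by neither. Both terms are computed from a d x d matrix and need no explicit negative pairs.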

title: @ylecun: I think you missed the main ideas.
- The basic premise of JEPA is that training by reconstructio/pre...
author: @ylecun
contenttype: tweet
publication: Twitter/X
published: 2026-01-04T20:11:15+00:00
sourceurl: https://x.com/ylecun/status/2007907701989232684

word_count: 215

I think you missed the main ideas.
- The basic premise of JEPA is that training by reconstruction/prediction in input space is evil (or counterproductive). The details are almost always unpredictable. Hence prediction must take place in representation space, where unpredictable details are eliminated.
- The main issue with JEPA is how to prevent collapse (in the absence of reconstruction loss). There are two classes of methods:
(1) EMA: Using weights in the target encoder that are an exponential moving average (EMA) of the weights in the other encoder (I-JEPA, V-JEPA, DINO, BYOL).
(2) Infomax: Using a regularizer that attempts to maximize the information content of the representation (e.g. over a batch). There are two sets of methods for that:
(2a) Sample-contrastive methods, which try to make each representation vector different from the others (Siamese nets, DrLIM, SimCLR, etc.). They tend not to work well in high dimension, and they require large batches and hard negative mining.
(2b) Dimension-contrastive methods, which try to make each variable independent from the others (Barlow Twins, VICReg, SIGReg/LeJEPA, MMCR, MCR2, ...).
Bottom line:
A. SSL by reconstruction/prediction doesn't work for high-dim, continuous, noisy data
B. EMA sucks: no loss function being minimized, requirement for weight sharing...
C. Sample-contrastive Infomax doesn't scale to high dimension
D. My money is on dimension-contrastive methods like SIGReg/LeJEPA
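
For comparison with point C, a rough sketch of a sample-contrastive objective in the InfoNCE style. This is a simplified form that assumes nonzero vectors and uses only cross-view negatives, so it is not the exact SimCLR loss. The function names, the `temperature` value, and the example vectors are illustrative assumptions.

```python
# Simplified InfoNCE-style loss over two views of the same batch.
import math

def cosine(a, b):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(view_a, view_b, temperature=0.1):
    """Average over i of -log softmax_k(sim(a_i, b_k) / temperature)[i]."""
    n = len(view_a)
    total = 0.0
    for i in range(n):
        # One similarity per other sample: the full table is n x n.
        logits = [cosine(view_a[i], view_b[k]) / temperature for k in range(n)]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / n

# Hypothetical batches of two samples each.
separated = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 1.0], [1.0, 1.0]]

separated_loss = info_nce(separated, separated)
collapsed_loss = info_nce(collapsed, collapsed)
print(separated_loss, collapsed_loss)
```

Every other sample in the batch acts as a negative, so the signal against collapse comes only from the pairs that happen to be in the batch.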
