title: @ylecun: I think you missed the main ideas.
- The basic premise of JEPA is that training by reconstructio/pre...
author: @ylecun
contenttype: tweet
publication: Twitter/X
published: 2026-01-04T20:11:15+00:00
sourceurl: https://x.com/ylecun/status/2007907701989232684
word_count: 215
I think you missed the main ideas.
- The basic premise of JEPA is that training by reconstruction/prediction in input space is evil (or counterproductive). The details are almost always unpredictable. Hence prediction must take place in representation space, where unpredictable details are eliminated.
- The main issue with JEPA is how to prevent collapse (in the absence of reconstruction loss). There are two classes of methods:
(1) EMA: Using weights in the target encoder that are an exponential moving average (EMA) of the weights in the other encoder (I-JEPA, V-JEPA, DINO, BYOL).
(2) Infomax: Using a regularizer that attempts to maximize the information content of the representation (e.g. over a batch). There are two sets of methods for that:
(2a) sample-contrastive methods: that want to make each representation vector different from the others (Siamese nets, DrLIM, SimCLR, etc). They tend not to work well in high dimension, and to require large batches and hard negative mining.
(2b) dimension-contrastive methods: that want to make each variable independent from the others (Barlow Twins, VICReg, SIGReg/LeJEPA, MMCR, MCR2, ...)
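To make class (1) concrete, here is a minimal sketch of the EMA target update in plain Python. The function name `ema_update`, the flat weight lists, and the decay value are illustrative assumptions, not code from I-JEPA, V-JEPA, DINO, or BYOL.

```python
def ema_update(target, online, decay=0.99):
    """Move each target weight a small step toward the online weight.

    new_target = decay * target + (1 - decay) * online
    No gradient flows through this step: the target encoder is only
    ever updated by this averaging rule.
    """
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]


# Hypothetical flattened weights for the two encoders.
online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]

# A deliberately small decay so the drift toward `online` is visible.
for _ in range(3):
    target = ema_update(target, online, decay=0.5)

print(target)
```

Note that the update is a rule applied to the weights, not the gradient of any objective, and the target simply trails the online encoder.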
Bottom line:
A. SSL by reconstruction/prediction doesn't work for high-dim, continuous, noisy data
B. EMA sucks: no loss function being minimized, requirement for weight sharing...
C. Sample-contrastive infomax doesn't scale to high dimension
D. My money is on dimension-contrastive methods like SIGReg/LeJEPA
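For a rough idea of what a dimension-contrastive regularizer computes, here is a sketch in the style of VICReg's variance and covariance terms. It is a stand-in, not SIGReg/LeJEPA: the function name `vc_regularizer`, the `gamma` and `eps` values, and the toy batches are all assumptions made for illustration.

```python
import math


def vc_regularizer(batch, gamma=1.0, eps=1e-4):
    """Return (variance_term, covariance_term) for a batch of embeddings.

    variance_term penalizes dimensions whose std over the batch is below gamma
    (this is what fights collapse); covariance_term penalizes off-diagonal
    covariance (this pushes dimensions to be decorrelated).
    """
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in batch]
    # Unbiased covariance matrix of the embedding dimensions.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    var_term = sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps))
                   for j in range(d)) / d
    cov_term = sum(cov[i][j] ** 2
                   for i in range(d) for j in range(d) if i != j) / d
    return var_term, cov_term


# Toy batches of 2-D embeddings (hypothetical values).
collapsed = [[1.0, 1.0]] * 4                                   # all identical
redundant = [[-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, 1.0]]  # dim 0 == dim 1
decorrelated = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]

for name, b in [("collapsed", collapsed), ("redundant", redundant),
                ("decorrelated", decorrelated)]:
    print(name, vc_regularizer(b))
```

The collapsed batch is penalized by the variance term, the redundant batch by the covariance term, and only the decorrelated batch drives both to zero. No negative pairs are compared, so the cost depends on the embedding dimension rather than on finding hard negatives.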