people on here are dumb. the latest subquadratic attention trick might produce a model that processes 1M tokens (or 12M..) without going insane, but that doesn't make it good

the real problem isn't the architecture, it's the data. humans haven't generated many contiguous linear spans of 1M tokens. so of course we can't learn this distribution. it doesn't exist

dr. jack morris (@jxmnop)

"1M context" models after 100k tokens

— https://nitter.net/jxmnop/status/2051708683521040476#m