people on here are dumb. the latest subquadratic attention trick might produce a model that processes 1M tokens (or 12M...) without going insane, but that doesn't make it good
the real problem isn't the architecture, it's the data. humans haven't generated many contiguous linear spans of 1M tokens. so of course we can't learn this distribution. it doesn't exist
dr. jack morris (@jxmnop)
"1M context" models after 100k tokens
— https://nitter.net/jxmnop/status/2051708683521040476#m