title: @alexocheema: Also, it’s e2e inference time with Q4KM and bs=1, ISL=4096, OSL=128.
RTX 5090 has native support ...
author: @alexocheema
contenttype: tweet
publication: Twitter/X
published: 2026-04-03T18:31:25+00:00
sourceurl: https://x.com/alexocheema/status/2040135094891905085
word_count: 42
Also, it’s end-to-end (e2e) inference time with Q4_K_M and bs=1, ISL=4096, OSL=128.
The RTX 5090 has native hardware support for 4-bit, so of course it will be much faster in prefill with Q4. The M3 Ultra supports only fp16 and fp32 at the hardware level.
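A rough back-of-the-envelope sketch of why prefill speed matters so much for this setup. It assumes e2e time is simply prefill time (ISL tokens at some prefill throughput) plus decode time (OSL tokens generated one at a time at bs=1). It ignores tokenization, scheduling, and other overhead. The function names and throughput numbers are hypothetical placeholders for illustrating the arithmetic, not measurements from this benchmark.

```python
def e2e_seconds(isl: int, osl: int, prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
    """Approximate end-to-end latency at batch size 1 (simplified model).

    Assumes the whole prompt is processed in one prefill pass at a fixed
    throughput, then output tokens are generated sequentially.
    """
    prefill = isl / prefill_tok_per_s   # time to process the ISL input tokens
    decode = osl / decode_tok_per_s     # time to generate the OSL output tokens
    return prefill + decode


def prefill_share(isl: int, osl: int, prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
    """Fraction of e2e time spent in prefill under the same simplified model."""
    prefill = isl / prefill_tok_per_s
    return prefill / e2e_seconds(isl, osl, prefill_tok_per_s, decode_tok_per_s)


if __name__ == "__main__":
    ISL, OSL = 4096, 128
    # Hypothetical throughputs, chosen only to show how the terms combine.
    base = e2e_seconds(ISL, OSL, prefill_tok_per_s=1000.0, decode_tok_per_s=50.0)
    fast = e2e_seconds(ISL, OSL, prefill_tok_per_s=2000.0, decode_tok_per_s=50.0)
    print(f"baseline e2e: {base:.3f} s")               # 4.096 s prefill + 2.560 s decode
    print(f"2x prefill throughput e2e: {fast:.3f} s")  # 2.048 s prefill + 2.560 s decode
    print(f"prefill share (baseline): {prefill_share(ISL, OSL, 1000.0, 50.0):.1%}")
```

With a long prompt (ISL=4096) and a short output (OSL=128), the prefill term can be a large part of the total, so hardware that speeds up prefill moves the e2e number a lot even if decode speed is unchanged.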