Twitter/X

The benchmark cited is end-to-end inference time using Q4_K_M quantization with…

2026-04-03 · 18:31 UTC · @alexocheema · 0 min read

Brief

@alexocheema frames a hardware comparison around a specific end-to-end LLM inference setup: Q4_K_M quantization, batch size 1 (bs=1), an input sequence length (ISL) of 4096 tokens, and an output sequence length (OSL) of 128 tokens. The core claim is that the RTX 5090 supports 4-bit natively at the hardware level, which makes it much faster in prefill on Q4 workloads, while Apple’s M3 Ultra supports only fp16 and fp32 at the hardware level and so gains less from 4-bit quantization.

Why it matters

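The benchmark cited measures end-to-end inference time with Q4_K_M quantization, batch size 1, an input sequence length of 4096, and an output sequence length of 128. With 32 times as many input tokens as output tokens, prefill speed can account for a large part of the measured time, and prefill is where the post says native 4-bit support pays off.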
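To make the setup concrete, here is a minimal sketch that models end-to-end time as prefill time plus decode time at batch size 1. Only the token counts (4096 and 128) come from the post. The names `RunConfig`, `e2e_seconds`, and `prefill_share` are hypothetical, and the throughput values are placeholders chosen for easy arithmetic, not measurements of any device.

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    input_tokens: int = 4096   # ISL from the post
    output_tokens: int = 128   # OSL from the post


def e2e_seconds(cfg: RunConfig, prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
    """End-to-end time modeled as prefill time plus decode time (batch size 1)."""
    prefill_s = cfg.input_tokens / prefill_tok_per_s
    decode_s = cfg.output_tokens / decode_tok_per_s
    return prefill_s + decode_s


def prefill_share(cfg: RunConfig, prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
    """Fraction of the end-to-end time spent in prefill."""
    prefill_s = cfg.input_tokens / prefill_tok_per_s
    return prefill_s / e2e_seconds(cfg, prefill_tok_per_s, decode_tok_per_s)


# Placeholder throughputs (tokens/second); NOT benchmark results for any hardware.
cfg = RunConfig()
slow = e2e_seconds(cfg, prefill_tok_per_s=1024.0, decode_tok_per_s=64.0)
fast = e2e_seconds(cfg, prefill_tok_per_s=4096.0, decode_tok_per_s=64.0)
print(f"slower prefill: {slow:.1f} s, faster prefill: {fast:.1f} s")
```

In this toy model, changing only the prefill throughput moves the end-to-end number substantially while decode time stays fixed, which is why the ISL/OSL ratio matters when reading a figure like the one in the post.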

Key details

@alexocheema claims the RTX 5090 has native hardware-level 4-bit support, which makes it much faster in prefill on Q4 workloads.
The post contrasts this with Apple’s M3 Ultra, which it describes as supporting only fp16 and fp32 at the hardware level, implying weaker acceleration for 4-bit quantized inference.
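As a rough illustration of the distinction drawn above, the sketch below stores a block of weights as 4-bit integers with one shared scale, then expands them back to floats before multiplying, which is the general shape of the work required when arithmetic is only available at a wider precision. This is a simplified symmetric scheme chosen for clarity. It is not the Q4_K_M format, and it does not describe how any particular runtime, the RTX 5090, or the M3 Ultra executes quantized models. All function names are hypothetical.

```python
def quantize_block_4bit(weights):
    """Symmetric 4-bit quantization of one block: ints in [-8, 7] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_block(q, scale):
    """Expand 4-bit integers back to floats using the block's scale."""
    return [qi * scale for qi in q]


def dot_via_dequant(q, scale, activations):
    """Dot product that first expands the weights, then multiplies in floats."""
    w = dequantize_block(q, scale)
    return sum(wi * ai for wi, ai in zip(w, activations))


# Illustrative values only.
weights = [7.0, -3.0, 0.0, 1.0]
activations = [1.0, 2.0, 3.0, 4.0]
q, scale = quantize_block_4bit(weights)
result = dot_via_dequant(q, scale, activations)
print(q, scale, result)
```

The expansion step in `dot_via_dequant` is the extra work that the post's argument implies can be avoided when 4-bit values are handled directly in hardware.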

Source evidence

title: @alexocheema: Also, it’s e2e inference time with Q4KM and bs=1, ISL=4096, OSL=128.

RTX 5090 has native support ...
author: @alexocheema
contenttype: tweet
publication: Twitter/X
published: 2026-04-03T18:31:25+00:00
sourceurl: https://x.com/alexocheema/status/2040135094891905085

word_count: 42

Also, it’s e2e inference time with Q4KM and bs=1, ISL=4096, OSL=128.

RTX 5090 has native support for 4-bit at hardware level so of course will be much faster in prefill with Q4. M3 Ultra at hardware level is fp16 and fp32 only.