Twitter/X

@alexocheema claims small language models should consistently run faster on an…

Brief

Responding to an NVIDIA AI PC post, Alex Cheema compares NVIDIA’s RTX 5090 with Apple’s M3 Ultra. He argues the 5090 should always be faster on small models, while the best performance on large models comes from using both systems together. He also cautions that llama.cpp understates Apple Silicon performance and points to MLX as the better inference stack on Macs.

Why it matters

@alexocheema claims small language models should always run faster on an NVIDIA RTX 5090 than on Apple’s M3 Ultra, and that llama.cpp-based comparisons understate what Apple Silicon can do with MLX.

Key details

  • The post argues llama.cpp is not a fair benchmark for Apple Silicon because MLX delivers better performance on Macs.
  • A cited NVIDIA AI PC post says Google Gemma 4 31B runs up to 2.7× faster on RTX with llama.cpp, crediting @ggerganov for optimization work.
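
A figure such as "up to 2.7× faster" is a ratio of two throughputs, so it depends on which baseline and which inference stack were measured. A minimal Python sketch of that arithmetic, assuming throughput is measured as generated tokens per second; the function names and the numbers are hypothetical placeholders, not measurements from either post:

```python
def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


def speedup(candidate_tps: float, baseline_tps: float) -> float:
    """How many times faster the candidate is than the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return candidate_tps / baseline_tps


# Hypothetical placeholder timings, not taken from either post.
baseline = tokens_per_second(num_tokens=512, elapsed_seconds=16.0)
candidate = tokens_per_second(num_tokens=512, elapsed_seconds=8.0)
print(f"{speedup(candidate, baseline):.1f}x")
```

Changing the baseline, for example by measuring the Mac with MLX instead of llama.cpp, changes the ratio even when the other system's throughput stays the same, which is the fairness concern the post raises.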

Source evidence

title: @alexocheema: Direct comparison of NVIDIA RTX 5090 to M3 Ultra 👀

Small models should always be faster on the 5090...
author: @alexocheema
contenttype: tweet
publication: Twitter/X
published: 2026-04-03T18:12:24+00:00
source
url: https://x.com/alexocheema/status/2040130311065886772

word_count: 81

Direct comparison of NVIDIA RTX 5090 to M3 Ultra 👀

Small models should always be faster on the 5090. The best perf for large models is to use both together (more on that soon).

Using llama.cpp isn’t super fair given the performance is not great on Apple Silicon. MLX is better

NVIDIA AI PC (@NVIDIAAIPC)

.@GoogleGemma 4 31B is up to 2.7X faster on RTX using llama.cpp.

Thanks to @ggerganov for working with us to make this model fast.

— https://nitter.net/NVIDIAAIPC/status/2039787452643131696#m