Twitter/X

Karpathy (created 2026-03-11) recommends ending LLM prompts with the exact phrase…

Brief

Karpathy recommends appending 'structure your response as HTML' to LLM prompts and viewing the generated file in a browser. He argues that audio is the human-preferred input to AIs while vision is the preferred output from them, and outlines a progression from raw text to markdown to HTML to interactive neural videos, eventually generated directly by a diffusion neural net. He also urges better multimodal inputs well before any move to BCIs.

Why it matters

Karpathy (created 2026-03-11) recommends ending LLM prompts with the exact phrase 'structure your response as HTML' and viewing the generated file in a browser. He also reports some success asking for slideshow-style output.
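
The tip can be sketched in a few lines of Python. This is a minimal illustration, not anything from the post itself: the helper names (`with_html_request`, `save_html`, `view_in_browser`) and the `response.html` filename are hypothetical, and the call to the LLM is left out because the post does not name a model or API. The sketch only covers the two steps the post describes: ending the query with the phrase, and opening the generated file in a browser.

```python
import tempfile
import webbrowser
from pathlib import Path
from typing import Optional

# The phrase quoted in the post, placed at the end of the query.
HTML_REQUEST = "structure your response as HTML"


def with_html_request(query: str) -> str:
    """Return the query with the HTML request appended as its last line."""
    return f"{query.rstrip()}\n\n{HTML_REQUEST}"


def save_html(html: str, directory: Optional[str] = None) -> Path:
    """Write the model's HTML reply to response.html and return the path.

    The filename is an arbitrary choice for this sketch. If no directory
    is given, a fresh temporary directory is created.
    """
    folder = Path(directory) if directory else Path(tempfile.mkdtemp())
    path = folder / "response.html"
    path.write_text(html, encoding="utf-8")
    return path


def view_in_browser(path: Path) -> bool:
    """Open the saved file in the default browser (standard-library call)."""
    return webbrowser.open(path.resolve().as_uri())
```

In use, the string returned by `with_html_request("Summarize this report")` would be sent to whichever model is at hand, and the reply passed to `save_html` and then `view_in_browser`.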

Key details

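  • He asserts that audio is the human-preferred input to AIs and vision is the preferred output from them, noting that roughly one-third of the brain is dedicated to vision. He predicts a progression: raw text → markdown (the current default) → HTML (an early but emerging default) → interactive neural videos/simulations, ultimately generated directly by a diffusion neural net, a technology he notes does not exist yet.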
  • He urges better multimodal inputs (pointing/gesturing on-screen) before moving to Neuralink‑style BCIs, calling the current phase an ongoing 'input/output mind‑meld' with substantial work remaining.

Source evidence

This works really well btw, at the end of your query ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc.

More generally, imo audio is the human-preferred input to AIs but vision (images/animations/video) is the preferred output from them. Around a ~third of our brains are a massively parallel processor dedicated to vision, it is the 10-lane superhighway of information into brain. As AI improves, I think we'll see a progression that takes advantage:

1) raw text (hard/effortful to read)
2) markdown (bold, italic, headings, tables, a bit easier on the eyes) <-- current default
3) HTML (still procedural with underlying code, but a lot more flexibility on the graphics, layout, even interactivity) <-- early but forming new good default
...4,5,6,...
n) interactive neural videos/simulations

Imo the extrapolation (though the technology doesn't exist just yet) ends in some kind of interactive videos generated directly by a diffusion neural net. Many open questions as to how exact/procedural "Software 1.0" artifacts (e.g. interactive simulations) may be woven together with neural artifacts (diffusion grids), but generally something in the direction of the recently viral nitter.net/zan2434/status/2046982…

There are also improvements necessary and pending at the input. Audio nor text nor video alone are not enough, e.g. I feel a need to point/gesture to things on the screen, similar to all the things you would do with a person physically next to you and your computer screen.

TLDR The input/output mind meld between humans and AIs is ongoing and there is a lot of work to do and significant progress to be made, way before jumping all the way into neuralink-esque BCIs and all that. For what's worth exploring at the current stage, hot tip try ask for HTML.

Thariq (@trq212)

x.com/i/article/205279610060…

— https://nitter.net/trq212/status/2052809885763747935#m