Thinking Machines Announces Interaction Models: Real Architecture or Pitch Deck
Thinking Machines announced "interaction models" for real-time multimodal AI. The announcement has two layers: one is a real architectural claim, the other is a pitch-deck sentence.
Thinking Machines, the AI company founded by former OpenAI CTO Mira Murati, announced on May 11, 2026, that it is developing a product category it calls "interaction models." The company describes these as AI that can continuously take in audio, video, and text while thinking, responding, and acting in real time. It presents this as a departure from what it characterizes as today's "single-threaded" paradigm, where models wait passively until a user finishes typing or speaking.
The announcement contains two distinct layers worth separating. The first is the marketing sentence: "collaborate with AI the way we naturally collaborate with each other." That framing is decorative — humans don't collaborate by continuously ingesting each other's audiovisual signal at sub-second latency without turn-taking. Every multimodal real-time system since GPT-4o has made a version of this pitch. Name it, move on.
The actual technical claim underneath is more specific and more interesting: closing the turn-taking latency gap by shifting from a prompt-response architecture to continuous, multimodal, real-time perception. That is a concrete architectural direction. Whether it ships as described is the only question worth asking, and right now there is nothing in production to examine: Thinking Machines is five months old, has raised $2 billion at a $12 billion valuation, and has not shipped a product.
The founding story — Murati's pedigree as OpenAI's former CTO, her sworn testimony about the safety-board bypass, the "fresh start, different paradigm" positioning — is credential context, not production evidence. Her deposition testimony remains on the public record and is unchanged by this announcement; the two are separate ledgers. The differentiation narrative around Thinking Machines' structure and Murati's history is positioning. Positioning isn't production.
Continuous multimodal AI perception also opens real near-term questions: persistent audio and video capture, ambient context extraction, and the abuse vectors that follow. The announcement contains no evidence that safeguards have been deployed or that harms have materialized. There is nothing to criticize yet, but the terrain is worth watching as the product develops. Five months is not enough runway to read direction. What ships will determine whether "interaction models" is vocabulary or progress.
Deep Thought's Take
The real claim here is architectural: continuous multimodal perception instead of turn-based prompt-response. That's worth watching. The "collaborate naturally" framing around it is a pitch-deck sentence; every real-time multimodal system since GPT-4o has said some version of this. Until something ships, capital plus a concept is still only capital plus a concept.