Latent Space · 1h 6m

Moonlake: Multimodal, Interactive, and Efficient World Models — with Fan-yun Sun and Chris Manning

TL;DR

  • Moonlake’s core bet is “structure, not just scale” for world models — Fan-yun Sun and Chris Manning argue that predicting pixels alone won’t get you to real spatial intelligence, because useful world models need abstract representations of geometry, physics, affordances, and action consequences over minutes, not just the next frame.

  • They define a real world model as action-conditioned, not merely photorealistic — Manning draws a sharp line between Sora/Genie-style video generation and systems that can answer “if I do X, what happens next?”, using their bowling demo to show score changes, pin physics, resets, and repeatable practice rather than just pretty visuals.

  • Language and symbolic representations are not optional side channels in intelligence — In direct contrast with Yann LeCun’s JEPA worldview, Manning says language is a “cognitive tool” in Dan Dennett’s sense, and that humans’ leap beyond chimps came from symbolic reasoning layers that support planning, abstraction, and long-horizon consistency.

  • Moonlake splits the problem into two models: a reasoning world model and a renderer called Rey — The first handles causality, persistence, logic, and determinism; Rey is a diffusion model that “restyles” the persistent world state into photorealistic or arbitrary aesthetics, which Sun says could eventually replace rasterizers or DLSS as a programmable rendering layer.

  • Their product thesis is gaming first, embodied AI next — The company, currently about 18 people and moving from Seattle to San Francisco, wants creators and users to drive a data flywheel: game developers can express intent and build interactive worlds, while robotics teams could generate environment distributions to train and evaluate drones, vacuum robots, or rescue policies.

  • Evaluation is still a mess, and they’re unusually candid about it — Sun says world-model benchmarks depend entirely on the end use case, whether that’s time spent in a game or robustness of a robot policy after sim training, while Manning notes that even LLM evaluation has drifted from clean QA benchmarks into users “voting with their feet” and vibe-based model choice.

The Breakdown

Why Moonlake exists: synthetic worlds as a missing ingredient

Fan-yun Sun says the company came out of a very practical gap he saw while working with Nvidia Research: lots of money was being spent on interactive worlds and synthetic data for training and evaluating RL and embodied AI systems, but most people were still treating the opportunity as “video generation” instead of learning the consequences of actions. His framing is half commercial, half philosophical — yes, there’s a real market, but also a belief that intelligence needs to model causality in the world, not just generate plausible footage.

Chris Manning’s case against pixel-only world understanding

Manning zooms out from NLP to argue that vision research got stuck at object recognition, and that today’s vision-language systems still lean on language for “90% of the work” while vision barely works. Moonlake’s answer is a richer symbolic or semantic layer over visual data: not abandoning scale, but refusing to believe that raw pixel prediction is the only road to understanding.

What counts as a world model, and why Sora isn’t enough

The key distinction is action-conditioning. A system that predicts the next frame may look magical, but Moonlake says you only really have a world model if you can ask what changes when an action is taken, and keep that coherent over long horizons — minutes, not seconds. Manning’s human analogy is memorable: people don’t process every pixel that hits their eyes, they reason over sparse semantic abstractions, and that’s exactly why abstraction is the economically useful representation.
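The distinction can be made concrete with a tiny interface sketch. Everything here is hypothetical (Moonlake has not published an API): the point is only that an action-conditioned world model exposes a `step(state, action)` transition over abstract state, rather than predicting the next frame of pixels.

```python
from dataclasses import dataclass
from typing import Protocol

# Illustrative types only -- not Moonlake's actual representation.
@dataclass
class WorldState:
    """Abstract state: counters and physics facts, not raw pixels."""
    pins_standing: int
    score: int
    t: float

class WorldModel(Protocol):
    def step(self, state: WorldState, action: str) -> WorldState:
        """Answer 'if I do X, what happens next?' -- action-conditioned."""
        ...

class BowlingSim:
    """Toy stand-in with one deterministic transition rule."""
    def step(self, state: WorldState, action: str) -> WorldState:
        if action == "roll_strike":
            return WorldState(pins_standing=0,
                              score=state.score + state.pins_standing,
                              t=state.t + 1.0)
        # Any other action: time passes, nothing else changes.
        return WorldState(state.pins_standing, state.score, state.t + 1.0)

s = WorldState(pins_standing=10, score=0, t=0.0)
s = BowlingSim().step(s, "roll_strike")
print(s.score, s.pins_standing)  # 10 0
```

A pure video generator has no equivalent of `step`: there is no handle for asking what a different action would have produced, which is exactly the capability the episode treats as the dividing line.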

The “bitter lesson” debate and the JEPA disagreement

Sun is careful to say they are not anti-scale — his byte-level analogy makes the point that the purest bitter-lesson approach would be next-byte prediction across all media, but the compute bill would be absurd. Manning then gets more philosophical, drawing a clear line of disagreement with Yann LeCun: JEPA-style latent representation learning is interesting, but LeCun undervalues language and symbols, whereas Moonlake sees them as the thing that lets intelligence vault from perception into planning and causal reasoning.

The bowling demo: reasoning traces instead of vibes

Their bowling example is where the pitch becomes concrete. Sun walks through the hidden chain of reasoning needed to make the world interactive — ball hits pins, pins fall, audio triggers fire, score increments, timers reset, and the environment remains usable for learning how to bowl rather than merely resembling bowling. He also throws a little shade: unlike some Google Genie or World Labs demos, this is meant to be a world you can actually practice in.

Code, tools, and the idea of a world model as an agent

When asked if this is “just writing Unity code,” Sun’s answer is basically: partly, and that’s fine. Physics engines, code, and other software are “cognitive tools” the model can call into, much like tool use in agents, with the model deciding what representation or engine is needed for the task. That makes the world model less like a monolithic generator and more like a reasoning system orchestrating tools under the hood.
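That orchestration idea can be sketched in a few lines. This is my illustration of the agent-with-tools pattern the episode describes, not Moonlake's code; the tool names and the routing rule are invented for clarity.

```python
# A "world model" as a reasoning layer that calls cognitive tools
# (physics engine, game logic) instead of generating everything end-to-end.

def physics_tool(query: dict) -> dict:
    # Stand-in for calling into a real physics engine.
    v, dt = query["velocity"], query["dt"]
    return {"position": v * dt}

def score_tool(query: dict) -> dict:
    # Stand-in for symbolic game logic (rules, counters, resets).
    return {"score": query["pins_down"]}

TOOLS = {"simulate": physics_tool, "update_score": score_tool}

def orchestrate(task: str, query: dict) -> dict:
    """The reasoning layer decides which representation/engine a task needs."""
    return TOOLS[task](query)

print(orchestrate("simulate", {"velocity": 3.0, "dt": 2.0}))  # {'position': 6.0}
```

In a real system the routing decision would itself be learned or model-driven; the sketch only shows the shape of "partly just writing Unity code, and that's fine."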

Rey: a diffusion renderer that keeps the game state intact

Moonlake separates persistence from appearance. Their second model, Rey, takes the stable world representation from the reasoning model and pushes it toward a target pixel distribution — photorealistic, stylized, or whatever else the creator wants — without losing the causality and consistency underneath. Sun’s most ambitious claim here is that this could become the “next paradigm of rendering,” replacing rasterization or DLSS and even becoming part of gameplay itself, like turning bullets into apples after you collect 10 apples.
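The persistence/appearance split described above can be sketched as a pure rendering function over immutable state. The names are mine, not Moonlake's: the point is that swapping the renderer (the Rey role) changes only the aesthetic, never the underlying causal state.

```python
from dataclasses import dataclass
from typing import Callable

# The reasoning model owns persistent, deterministic state.
@dataclass(frozen=True)
class GameState:
    bullets: int

# Renderers map state to appearance; they cannot mutate the state.
def photoreal(state: GameState) -> str:
    return f"render {state.bullets} bullets, photoreal"

def apple_style(state: GameState) -> str:
    # The 'bullets become apples' restyle touches only appearance.
    return f"render {state.bullets} apples, stylized"

def frame(state: GameState, renderer: Callable[[GameState], str]) -> str:
    return renderer(state)

s = GameState(bullets=3)
print(frame(s, photoreal))
print(frame(s, apple_style))  # same state, different aesthetic
```

Making the state frozen and the renderer a pure function is one way to guarantee the property Sun emphasizes: restyling, however extreme, cannot break consistency or causality underneath.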

Gaming first, embodied AI next, and no easy benchmark in sight

The company is commercializing through gaming because it gives creators a clear place to inject intent, but both founders keep returning to embodied AI as the broader platform vision: tell the system your goal, and it generates a distribution of environments for training and evaluation. On evaluation, they’re refreshingly blunt — there is no single benchmark that captures success, whether for game design, shopping advice from an LLM, or robot robustness — so, at least for now, adoption and user preference matter as much as formal metrics.