The publication of a Chinese-developed multimodal model in Nature has reframed a global debate about how artificial intelligence should scale beyond text. With the appearance of “悟界·Emu3” in the journal’s main edition, researchers are now openly questioning whether diffusion-based systems remain the inevitable path for images and video — or whether the same “next-token prediction” logic that transformed language models can also underpin world models and embodied intelligence.
In its editorial assessment, Nature highlighted a striking point: Emu3 achieves large-scale unified learning across text, images and video using only next-token prediction, reaching performance comparable to specialised systems on both generation and perception tasks. That, the editors noted, gives the work particular relevance for building scalable multimodal assistants, world models and embodied AI.
The result matters now because it arrives in the wake of DeepSeek’s rise and amid intense global competition over foundational AI architectures. Rather than proposing a hybrid or task-specific workaround, Emu3 makes a more radical claim: that a single, autoregressive decoder-only architecture can unify multimodal intelligence at scale.
Betting on a single route
Emu3 was released by the Beijing Academy of Artificial Intelligence (BAAI, also known as the Zhiyuan Research Institute) in October 2024. From the outset, the model was positioned as a technical wager against prevailing trends. While diffusion transformers had become dominant in multimodal generation, and perception systems relied heavily on composite pipelines, Emu3 pursued a pure autoregressive route: predicting the next token, regardless of modality.
That decision was formalised earlier, in February 2024, when BAAI assembled a dedicated team of about 50 researchers to test whether discrete tokenisation and a single Transformer could unify text, images and video from scratch. The approach required all modalities to be discretised into a shared representational space and jointly trained on mixed multimodal sequences.
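The core idea can be sketched in a few lines: every modality is mapped into one shared token ID space, after which standard next-token prediction applies uniformly to text and visual tokens alike. All vocabulary sizes and special tokens below are illustrative placeholders, not Emu3's actual configuration.

```python
# Illustrative sketch of unified next-token training over mixed multimodal
# sequences. Vocabulary sizes and delimiter tokens are hypothetical.

TEXT_VOCAB = 1000    # hypothetical text vocabulary size
VISION_VOCAB = 512   # hypothetical visual codebook size (e.g. from a VQ tokenizer)
BOI = TEXT_VOCAB + VISION_VOCAB      # begin-of-image delimiter (made up)
EOI = TEXT_VOCAB + VISION_VOCAB + 1  # end-of-image delimiter (made up)

def to_shared_space(text_ids, image_codes):
    """Map text tokens and visual codes into one shared ID space,
    then interleave them into a single training sequence."""
    shifted_image = [TEXT_VOCAB + c for c in image_codes]  # offset visual codes
    return text_ids + [BOI] + shifted_image + [EOI]

def next_token_pairs(sequence):
    """Standard next-token prediction targets: the model sees sequence[:i]
    and predicts sequence[i] -- identical treatment for every modality."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

seq = to_shared_space([5, 42, 7], [3, 90, 11])
print(seq)                   # one mixed sequence in the shared ID space
print(next_token_pairs(seq)[0])
```

Once everything lives in one discrete space, the training objective needs no modality-specific branches: the same cross-entropy loss over next-token targets covers text, images and video.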
The gamble carried risks. Discretising visual data meant compressing information-dense images into tokens aligned with language-like representations, a process that proved technically difficult and at times demoralising for the team. Externally, the strategy also ran against industry sentiment in China, where many large-model teams prioritised GPT-4-style language replication and abandoned costly multimodal efforts.
Yet the bet held. Former OpenAI policy head and current Anthropic co-founder Jack Clark later described Emu3’s design as powerful precisely because of its simplicity, noting that avoiding architectural “tricks” gave it unusual scaling potential. BAAI president Wang Zhongyuan has argued that this minimalism reduces development complexity and cost, making the approach more productive and industrially viable over time.
Performance, scale and the road to world models
Technically, Emu3 demonstrated that a single autoregressive model could compete with task-specific leaders. On image generation benchmarks such as MSCOCO-30K, it surpassed diffusion models like SDXL. In video generation, it achieved a VBench score of 81, exceeding Open-Sora 1.2. Its visual-language understanding score of 62.1 slightly outperformed LLaVA-1.6. While such numbers may appear routine today, they were notable when the work began two years earlier.
Crucially, Emu3 did not rely on external encoders such as CLIP or pre-trained language backbones. It supported text-to-image, text-to-video, future prediction, visual-language understanding, interleaved image-text generation and embodied operation within one framework. It could generate five-second videos at 24 frames per second and extend sequences by predicting future frames token by token, producing causally consistent video rather than relying on diffusion-style noise refinement.
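Token-by-token frame extension can be illustrated with a toy sketch. The model forward pass is stubbed out with a deterministic placeholder, and the tokens-per-frame count is an arbitrary small value; real visual tokenizers produce thousands of tokens per frame.

```python
# Toy sketch of autoregressive video extension: each new frame is predicted
# one token at a time from all preceding tokens, so every prediction depends
# only on the past -- unlike diffusion's iterative whole-clip noise refinement.

TOKENS_PER_FRAME = 4  # toy value for illustration only

def sample_next_token(context):
    # Placeholder for a real model call model(context) -> next token.
    return (sum(context) + len(context)) % 100

def extend_video(frame_tokens, n_new_frames):
    """Append n_new_frames to an existing token sequence, token by token."""
    seq = list(frame_tokens)
    for _ in range(n_new_frames * TOKENS_PER_FRAME):
        seq.append(sample_next_token(seq))  # causally consistent: past-only
    return seq

clip = [1, 2, 3, 4]               # tokens of one existing frame
extended = extend_video(clip, 2)  # predict two future frames
print(len(extended))              # 4 + 2*4 = 12 tokens
```

The causal structure is the point: because each token conditions only on earlier tokens, a clip can be extended indefinitely without regenerating what came before.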
This autoregressive video extension enabled early simulation of physical dynamics, human and animal behaviour — a step toward what researchers increasingly describe as “world models”. By October 2025, BAAI released Emu3.5, explicitly framing it as a native multimodal world model capable of long-horizon, spatially consistent reasoning. Emu3.5 surpassed models such as Google’s Nano Banana and introduced what the team called a “multimodal scaling paradigm”, showing that physical understanding improved predictably with more data and parameters.
The architecture itself remained austere: a decoder-only Transformer derived from large language model designs such as Llama-2, an expanded embedding layer for visual tokens, a unified tokenizer, two-stage training (large-scale pretraining followed by preference-aligned post-training), and an efficient inference backend supporting classifier-free guidance. Large ablation studies supported claims about scaling laws, unified discretisation efficiency and the applicability of direct preference optimisation to visual generation.
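The classifier-free guidance mentioned above follows a standard formula that can be shown in isolation. The two logit vectors here are stub values standing in for a conditional and an unconditional forward pass; nothing below is specific to Emu3's actual implementation.

```python
# Minimal sketch of classifier-free guidance applied to autoregressive
# decoding logits. Two forward passes are assumed: one conditioned on the
# prompt, one with the prompt dropped; guidance interpolates past the
# conditional logits to strengthen prompt adherence.

def guided_logits(cond_logits, uncond_logits, scale):
    """scale=0 ignores the condition, scale=1 is plain conditional
    decoding, scale>1 amplifies the condition's influence."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

cond = [2.0, 0.5, -1.0]   # stub logits given the text prompt
uncond = [1.0, 1.0, 0.0]  # stub logits with the prompt dropped
print(guided_logits(cond, uncond, 3.0))  # [4.0, -0.5, -3.0]
```

The guided logits are then sampled from as usual; the extra cost is one additional forward pass per decoding step.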
From academic recognition to industrial impact
The Nature paper — titled Multimodal learning with next-token prediction for large multimodal models — formalised Emu3’s standing in the international research community. But its influence has extended beyond academia. Industry practitioners say the model has already shaped how companies think about unifying perception, generation and action, particularly in robotics and embodied AI.
BAAI’s Emu line did not emerge overnight. Development began in 2022, with a sequence of releases: the first Emu model in July 2023 unified multimodal input and output; Emu2 in December 2023 scaled autoregressive multimodal pretraining; Emu3 in October 2024 eliminated diffusion and composite pipelines entirely; and Emu3.5 in October 2025 moved from predicting tokens to predicting states.
This trajectory sits within a broader institutional strategy. Founded in 2018, BAAI previously released China’s first large language model, WuDao 1.0, and the MoE-based WuDao 2.0, and has been described as a training ground for AI talent. Since 2020, it has emphasised open research: more than 200 models and 180 datasets have been released, with hundreds of millions of global downloads.
Wang Zhongyuan has said BAAI will continue to open not only model weights but also training code and industrial case studies, arguing that open ecosystems are essential for long-term progress. In June 2025, the institute launched its next-generation "悟界" (Wujie) model series, aimed at bridging digital and physical intelligence, supported by the open-source FlagOS software stack.
The timing has amplified the impact. Beijing has positioned itself as an “open-source capital” for AI, supported by policy frameworks released since 2023 to accelerate foundational research. Recent months have seen a wave of milestones, from the public listing of Zhipu AI to new multimodal releases from Baidu and Moonshot AI.
Against that backdrop, Emu3’s appearance in Nature marks more than a single paper. It signals international recognition of a Chinese-led, original technical route — one that challenges assumptions about how multimodal intelligence must be built. Whether diffusion models fade or converge with autoregressive approaches remains unresolved. But Emu3 has made one thing harder to ignore: the possibility that a unified, next-token worldview could underpin the next generation of world models and embodied machines.
