gen‑ai.news

Microsoft Research's Mirage gives video generation a persistent spatial memory that doesn't forget what's around the corner

Most video generation models struggle with spatial continuity: pan the camera away from a scene and back again, and details have often shifted or disappeared entirely. Mirage, a collaboration between Microsoft Research and several universities, addresses this by giving the model a persistent memory of the space it has already generated, so previously seen areas stay coherent when revisited.

Mirage differs from earlier approaches mainly in where it stores scene information. Those approaches often rely on pixel-based point clouds, explicit 3D representations derived from rendered frames. Mirage instead encodes and retains scene data directly in latent space, the compressed internal representation that diffusion-based models already operate in. As a result, the system does not need to reconstruct geometry from pixels each time it refers back to what it has already generated.
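To make the idea concrete, here is a minimal Python sketch of a memory that stores latents keyed by camera pose and returns the nearest stored latent when the camera comes back. It is an illustration only, not Mirage's actual mechanism: the `LatentSpatialMemory` class, the `(x, y, yaw)` pose format, the `max_distance` threshold, and the tiny nested-list "latents" are all assumptions made for this example.

```python
import math
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

# Hypothetical camera pose: (x, y, yaw in radians). Not taken from Mirage.
Pose = Tuple[float, float, float]


@dataclass
class LatentSpatialMemory:
    """Toy memory that keeps latents alongside the pose they were generated from."""

    entries: List[Tuple[Pose, Any]] = field(default_factory=list)

    def write(self, pose: Pose, latent: Any) -> None:
        # Store the latent as-is; no geometry is reconstructed from pixels.
        self.entries.append((pose, latent))

    def read(self, pose: Pose, max_distance: float = 0.5) -> Optional[Any]:
        # Return the latent stored nearest to `pose`, or None if the camera
        # is looking at a region that has not been generated yet.
        if not self.entries:
            return None
        nearest_pose, nearest_latent = min(
            self.entries, key=lambda entry: math.dist(entry[0], pose)
        )
        if math.dist(nearest_pose, pose) > max_distance:
            return None
        return nearest_latent


memory = LatentSpatialMemory()
corridor_latent = [[0.12, -0.40], [0.88, 0.05]]  # stand-in for a real latent tensor
room_latent = [[-0.31, 0.77], [0.02, -0.64]]

memory.write((0.0, 0.0, 0.0), corridor_latent)
memory.write((5.0, 0.0, 1.57), room_latent)

revisited = memory.read((0.1, 0.0, 0.0))   # camera returns near the corridor pose
unseen = memory.read((20.0, 20.0, 0.0))    # camera points at a never-generated area
```

In this sketch a revisit retrieves the stored corridor latent, while an unseen viewpoint returns `None`, which a generator would treat as a cue to synthesize new content. Mixing position and yaw in a single Euclidean distance is a deliberate simplification; a real system would need a more careful notion of which stored regions are visible from a given pose.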

That design choice has practical consequences beyond consistency. Working in latent space instead of maintaining dense point clouds meaningfully reduces both processing time and GPU memory consumption, which matters for research scalability and for any downstream deployment. The result is a model that can follow extended camera trajectories, such as moving through a corridor or circling a room, without the scene fragmenting or contradicting itself across segments.

Mirage still has limits. It handles static environments well but has not yet solved the harder problem of reliably tracking moving objects across video segments. A person or vehicle that leaves the frame and re-enters may not be rendered consistently, which limits the model's usefulness for dynamic scene simulation. That gap points to the natural next step for world models of this kind: combining persistent spatial memory with robust object-level tracking, so they can handle scenes where not everything stays still.
