gen‑ai.news

Meet Qwen-RobotSuite: Three Embodied AI Models for VLA Manipulation, Video World Modeling, and Navigation

The Qwen team has released Qwen-RobotSuite, a set of three models that each target a distinct problem area in embodied AI. Rather than shipping a single general-purpose system, the suite takes a modular approach: each component is built and evaluated for one robotics challenge, whether that is controlling a robot arm, predicting how a scene will evolve over time, or navigating through an environment.

The first model, RobotManip, is a Vision-Language-Action (VLA) model built on top of the Qwen3.5-4B language backbone. VLA models aim to connect visual perception and language understanding directly to physical actions, and RobotManip applies this framing to manipulation tasks, the kind of precise, contact-rich interactions that remain difficult for robotic systems. Building on a capable base language model is intended to help the system generalize better from language instructions.

RobotWorld takes a different angle, functioning as a language-conditioned video world model. Its architecture centers on a 60-layer Multimodal Diffusion Transformer (MMDiT), the same class of architecture that has driven recent progress in video generation. The idea is that a model able to predict plausible future video frames, given a language instruction and a current observation, can serve as a planning or data-generation tool for downstream robotics systems. World models of this type are increasingly being explored as a way to simulate robot behavior without requiring physical rollouts.
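To make the planning idea concrete, here is a minimal sketch of how a predictive model can be used to choose actions without touching a real robot. It is not RobotWorld's interface: `toy_world_model`, `rollout`, and `plan` are hypothetical names, the "observation" is a 2D point rather than video frames, and the goal coordinate stands in for a language instruction. The planner is plain random shooting, which samples candidate action sequences, imagines each one through the model, and keeps the cheapest.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]   # toy stand-in for an observation (a 2D position)
Action = Tuple[float, float]  # toy stand-in for a robot action (a 2D displacement)
Model = Callable[[State, Action], State]


def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical stand-in for a learned predictor: returns the next state."""
    return (state[0] + action[0], state[1] + action[1])


def rollout(model: Model, state: State, actions: Sequence[Action]) -> State:
    """Imagine the outcome of an action sequence using only the model."""
    for action in actions:
        state = model(state, action)
    return state


def plan(model: Model, start: State, goal: State,
         horizon: int = 5, candidates: int = 200,
         seed: int = 0) -> Tuple[List[Action], float]:
    """Random-shooting planner: sample sequences, keep the lowest-cost one."""
    rng = random.Random(seed)

    def cost(state: State) -> float:
        # Squared distance between the imagined final state and the goal.
        return (state[0] - goal[0]) ** 2 + (state[1] - goal[1]) ** 2

    # Baseline candidate: do nothing for the whole horizon.
    best_plan: List[Action] = [(0.0, 0.0)] * horizon
    best_cost = cost(rollout(model, start, best_plan))

    for _ in range(candidates):
        seq = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
               for _ in range(horizon)]
        c = cost(rollout(model, start, seq))
        if c < best_cost:
            best_plan, best_cost = seq, c
    return best_plan, best_cost


if __name__ == "__main__":
    actions, final_cost = plan(toy_world_model, start=(0.0, 0.0), goal=(2.0, 1.0))
    print(f"best imagined cost: {final_cost:.3f}")
```

In a real system the additive toy dynamics would be replaced by the learned model's predictions, and the cost would be computed on predicted frames or features rather than coordinates.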

RobotNav addresses spatial navigation. It is built on Qwen3-VL and comes in three sizes (2B, 4B, and 8B parameters), giving users a range of compute trade-offs. Navigation requires reasoning about spatial relationships, following instructions over longer horizons, and adapting to new environments, all areas where vision-language models have shown potential. The Qwen team has published architecture details, data pipeline descriptions, and benchmark comparisons for all three models, offering a relatively transparent look at how each system was constructed and where it stands relative to prior work.
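As a rough illustration of what those three sizes mean in practice, the sketch below estimates the memory needed just to hold the weights. It assumes the nominal parameter counts from the model names (the exact counts may differ) and ignores activations, caches, and any vision-encoder overhead, so the figures are lower bounds. `weight_gb` is a hypothetical helper, not part of any Qwen tooling.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_gb(params_billions: float, dtype: str = "bf16") -> float:
    """Approximate weight memory in GB (1 GB = 10**9 bytes).

    One billion parameters at one byte each is one GB, so the result is
    simply the parameter count in billions times the bytes per parameter.
    """
    return params_billions * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for size in (2, 4, 8):  # nominal sizes named in the article
        print(f"{size}B: ~{weight_gb(size):.0f} GB in bf16, "
              f"~{weight_gb(size, 'int8'):.0f} GB in int8")
```

Under these assumptions the weights alone take roughly 4, 8, and 16 GB in bf16, which is the kind of gap that decides whether a model fits on an onboard accelerator or needs a workstation GPU.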
