gen‑ai.news

Google Launches Gemini Omni for Video Generation and Editing

Google introduced Gemini Omni Flash at its I/O 2026 conference, positioning it as a model that can accept any combination of text, image, audio, or video input and produce edited or generated video as output. The conversational editing interface lets users refine results through back-and-forth prompts rather than rewriting full generation instructions from scratch, a workflow shift that brings video creation closer to how people interact with text-based LLMs.

Google's pitch for the model's realism rests partly on its training foundation. The company argues that grounding the model in broad factual knowledge (physics, history, cultural context) helps it handle things like fluid dynamics, lighting interaction, and gravity more convincingly than models trained on visual data alone. Whether that claim holds up at scale remains to be tested by users and benchmarks outside Google's own demos.

The model also supports user-defined visual language, meaning creators can specify a style, motion character, or effects palette and have it applied consistently across a generation. Digital avatars that speak in the user's own voice are also included, framed partly as a safety mechanism governing how a person's likeness is used in AI-generated content.

All output from Gemini Omni carries an embedded SynthID watermark, connecting it to the broader provenance infrastructure Google has been building. The model is rolling out initially to AI Ultra subscribers, with broader access expected to follow.

