Back to stories
Models

Alibaba's Qwen 3.5-Omni Displays Emergent Ability to Write Code From Voice and Video

Michael Ouroumis2 min read
Alibaba's Qwen 3.5-Omni Displays Emergent Ability to Write Code From Voice and Video

Alibaba's Qwen team has released Qwen 3.5-Omni, a native multimodal model that processes text, images, audio, and video within a single unified architecture. The model has attracted attention not just for its benchmark performance but for displaying an unexpected emergent capability: the ability to write functional code from spoken voice instructions and video input without being specifically trained to do so.

Architecture and Capabilities

Qwen 3.5-Omni uses a novel Thinker-Talker architecture with Hybrid-Attention Mixture of Experts across all modalities. The model comes in three sizes — Plus, Flash, and Light — with the flagship Plus variant supporting a 256,000-token context window. That is enough to process over ten hours of audio or more than 400 seconds of 720p video at one frame per second.

The model was pre-trained from the ground up on more than 100 million hours of audiovisual data, making it a truly native multimodal system rather than a text model with audio capabilities added afterward. It supports speech recognition across 113 languages and dialects, including 74 languages and 39 Chinese dialects.

The Emergent Surprise

The most striking finding is what the Qwen team calls "audio-visual vibe coding." In tests, the model demonstrated the ability to take a hand-drawn sketch held up to a camera, listen to spoken instructions describing the desired behavior, and generate a working React webpage. The team reports that this capability was never explicitly trained — it emerged as a byproduct of scaling native multimodal processing.

As reported by The Decoder, this represents a meaningful step toward AI systems that can interact with developers the way a human collaborator would: by watching, listening, and understanding context simultaneously rather than relying solely on text prompts.

Benchmark Results

Qwen 3.5-Omni achieved state-of-the-art results on 215 audio and audio-visual understanding subtasks. According to Alibaba's technical reports, the Plus variant surpasses Google's Gemini 3.1 Pro in general audio understanding, reasoning, recognition, and translation, while achieving parity in audio-visual comprehension.

A Strategic Shift on Openness

Notably, Alibaba has broken from its established pattern of open-sourcing Qwen models. The most capable Qwen 3.5-Omni variants are closed-source and available only through API access, as reported by WinBuzzer. This marks a significant departure for a company that had built developer goodwill through consistent open releases.

The decision likely reflects the competitive and commercial pressures facing Chinese AI labs as their models approach frontier performance levels. With DeepSeek V4 expected to launch under Apache 2.0 in the coming weeks, Alibaba may be calculating that its most advanced capabilities are too valuable to give away.

What It Means

The emergence of untrained capabilities in multimodal models adds weight to the argument that scaling native multimodal training produces qualitatively different results from bolting modalities onto text-first systems. For developers, the practical implications are significant: if future models can reliably interpret spoken instructions paired with visual context, the interface for building software could shift dramatically from text-based prompting toward more natural, conversational interaction.

Learn AI for Free — FreeAcademy.ai

Take "AI Essentials: Understanding AI in 2026" — a free course with certificate to master the skills behind this story.

More in Models

Google's Gemini 3.5 Flash Beats the Pro Tier on Agent Benchmarks — and Ships a Managed Agents API
Models

Google's Gemini 3.5 Flash Beats the Pro Tier on Agent Benchmarks — and Ships a Managed Agents API

At I/O 2026 Google shipped Gemini 3.5 Flash, a Flash-tier model that outscores Gemini 3.1 Pro on coding and agentic benchmarks at less than half the cost of comparable frontier models, alongside a Managed Agents API that spins up tool-using, code-executing agents in a single call.

4 hours ago2 min read
Google ships Gemini 3.2 Flash at I/O 2026, undercuts GPT-5.5 by 15-20x on inference cost
Models

Google ships Gemini 3.2 Flash at I/O 2026, undercuts GPT-5.5 by 15-20x on inference cost

Gemini 3.2 Flash debuts at $0.25/M input and $2.00/M output tokens, hitting ~92% of GPT-5.5 on coding and reasoning while rolling out across Search, Maps, Gmail, and Chrome simultaneously.

1 day ago2 min read
Thinking Machines Lab Debuts 'Interaction Models' — Mira Murati's First Step Into Frontier AI
Models

Thinking Machines Lab Debuts 'Interaction Models' — Mira Murati's First Step Into Frontier AI

Mira Murati's Thinking Machines Lab released a research preview of 'interaction models,' a new class of full-duplex multimodal AI that listens, sees and speaks at the same time, with turn-taking latency reported at about 0.4 seconds.

1 week ago2 min read