When AI Pretends to Listen
Modern multimodal models can describe videos, answer questions about them, and reason over visual and audio streams. But do they really listen? Or do they merely infer what the sound should be from what they see? This article discusses the paper "When Vision Speaks for Sound", which investigates a hidden