Yann LeCun: Today's AI Still Below Dog Level – Inside Meta’s Voicebox, MusicGen & I‑JEPA

Meta’s chief AI scientist Yann LeCun warned that current large language models still fall short of human and even dog intelligence, citing their lack of real‑world understanding, while Meta unveiled three new generative AI models—Voicebox for speech, MusicGen for music, and I‑JEPA for image reasoning—showcasing both progress and remaining limitations.

Programmer DD

Jun 20, 2023

Meta’s chief AI scientist Yann LeCun said at the Viva Tech conference in Paris that today’s AI systems, such as ChatGPT, have not reached human‑level intelligence and are even less capable than a dog.

He argued that large language models (LLMs) are not truly intelligent because they cannot understand or interact with reality; they merely generate text based on massive language training and miss the bulk of human experience that is not language‑based.

LeCun illustrated the gap with two examples. An AI can pass the U.S. bar exam but cannot install a dishwasher, a skill a ten‑year‑old can learn in ten minutes. He also pointed to infant perception: a five‑month‑old sees a floating object without questioning it, while a nine‑month‑old is surprised because it already knows that objects should not float. Current AI systems cannot replicate that kind of intuitive understanding.

LeCun said Meta is now training AI on video, and he envisions future machines that act as helpful assistants: more intelligent than their users, yet not a threat to them.

LeCun also dismissed the notion that robots will dominate the world, emphasizing that intelligence does not imply a desire to take over.

Meta Releases Voicebox: A Generative Speech Model

Voicebox is a non‑autoregressive flow‑matching model trained on more than 50,000 hours of raw, unfiltered speech. It can perform speech generation tasks such as editing, sampling, and style transfer directly from context, without task‑specific training.

The model supports zero‑shot text‑to‑speech synthesis within one language or across languages, noise removal, content editing, style conversion, and diverse sample generation. Compared with VALL‑E, the previous state‑of‑the‑art English zero‑shot model, Voicebox achieves a lower word error rate (1.9% vs. 5.9%) and higher audio similarity (0.681 vs. 0.580), while running up to 20 times faster.
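For readers unfamiliar with the metric, word error rate (WER) is commonly defined as the word-level edit distance between a reference transcript and the recognized transcript, divided by the number of reference words. The Python sketch below illustrates that general definition only. It is not Meta's evaluation code, and the function name and sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer:.1%}")  # prints 16.7%
```

By this definition, going from a 5.9% to a 1.9% word error rate is roughly a 68% relative reduction in errors.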

Meta Open‑Sources MusicGen: Text‑to‑Music Generation

MusicGen converts text prompts and optional melody inputs into full musical pieces. Built on the Transformer architecture introduced in 2017, it was trained on 20,000 hours of licensed music. It relies on Meta’s EnCodec audio tokenizer, which splits audio into small units that can be processed in parallel, improving efficiency and speed.

The model can combine textual descriptions with existing melodies, for example generating “a light‑hearted track” blended with Beethoven’s “Ode to Joy.”

Meta Introduces I‑JEPA: A “Human‑Like” Vision Model

I‑JEPA (Image Joint Embedding Predictive Architecture) learns abstract representations of the world through self‑supervised learning on images. It analyzes and completes unfinished images more accurately than existing models.

The model follows LeCun’s proposal for more human‑like machine reasoning. Meta says it avoids common generative‑image errors such as extra fingers and is more computationally efficient than widely used vision models.
