Harsh Words from Prominent AI Scientist
Pixel Imperfect
Sora, the video generation model developed by OpenAI, has garnered significant attention since its recent release. However, Yann LeCun, the chief AI scientist at Meta, is skeptical of the model’s potential.
Responding to OpenAI’s claim that Sora will pave the way for building “general purpose simulators of the physical world,” LeCun argues that OpenAI’s approach to creating a “world simulator” is fundamentally flawed.
In a recent post on social media platform X, LeCun stated, “Modeling the world for action by generating pixels is as wasteful and doomed to failure as the largely-abandoned idea of ‘analysis by synthesis’.”
Generation Complication
One of the most prominent figures in AI, LeCun has long been more outspoken than his peers. While other influential researchers have voiced concerns about the technology and reflected on their own contributions to it, LeCun has continued his work at Meta and is unafraid to criticize his competitors.
In his latest comments, LeCun touches upon the ongoing debate between generative models and discriminative models in machine learning. He argues that the generative approach, specifically generating pixels “from explanatory latent variables,” is inefficient and struggles to handle the uncertainty associated with complex predictions in a three-dimensional space.
In simpler terms, LeCun’s point is that these models try to infer far more detail than is relevant, much like calculating a soccer ball’s trajectory by analyzing the materials it is made of rather than focusing on factors such as its mass and velocity.
LeCun clarified, “This is not problematic if your goal is to generate videos. However, if your objective is to understand how the world functions, it is ultimately a futile pursuit.”
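To make the soccer ball analogy concrete, here is a minimal sketch of predicting a ball's path from a handful of state variables. It assumes idealized projectile motion with no air drag or spin, which is our simplification and not anything LeCun or Meta has published; the function names and the example numbers are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def ball_position(speed, angle_deg, t):
    """Return (x, y) in metres after t seconds.

    Assumes idealized projectile motion: no air drag, no spin.
    """
    angle = math.radians(angle_deg)
    x = speed * math.cos(angle) * t
    y = speed * math.sin(angle) * t - 0.5 * G * t * t
    return x, y


def flight_time(speed, angle_deg):
    """Time in seconds until the ball returns to its launch height."""
    return 2 * speed * math.sin(math.radians(angle_deg)) / G


if __name__ == "__main__":
    # Hypothetical kick: 20 m/s at a 45 degree angle.
    t_land = flight_time(20.0, 45.0)
    print(ball_position(20.0, 45.0, t_land))
```

Two numbers describing the kick are enough to predict where the ball lands under these assumptions. Nothing about the ball's stitching or surface texture enters the calculation, which is the kind of detail a pixel-level generator would still have to produce.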
The Alternative
LeCun acknowledges that the generative approach has achieved notable success with large language models like ChatGPT, because text is discrete and drawn from a finite set of symbols. Simulating the physical world, as Sora is intended to do, requires modeling far more than a finite set of characters.
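A rough back-of-the-envelope comparison shows the gap in scale between the two prediction problems. The vocabulary size and frame resolution below are illustrative assumptions, not figures from OpenAI or Meta, and the helper names are hypothetical.

```python
def token_choices(vocab_size):
    """A language model's next step: pick one symbol from a finite vocabulary."""
    return vocab_size


def frame_values(width, height, channels=3):
    """A pixel generator's next step: produce one value per channel per pixel."""
    return width * height * channels


if __name__ == "__main__":
    # Assumed numbers for illustration only.
    vocab = token_choices(50_000)    # hypothetical 50,000-token vocabulary
    frame = frame_values(1920, 1080)  # one 1080p RGB frame
    print(f"next token: 1 choice among {vocab:,} symbols")
    print(f"next frame: {frame:,} values to fill in")
```

Under these assumptions, predicting the next token means scoring a fixed list of 50,000 candidates, while predicting a single video frame means filling in more than six million values, most of which carry detail that is hard to predict and often irrelevant to what happens next.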
Offering a competing approach, LeCun has been actively developing his own model at Meta, known as the Video Joint Embedding Predictive Architecture (V-JEPA), which was unveiled recently.
“Unlike generative approaches that attempt to fill in every missing pixel,” explained Meta in a blog post, “V-JEPA possesses the flexibility to discard unpredictable information, resulting in superior training and sample efficiency, with gains ranging from 1.5x to 6x.”
Although LeCun’s work may not generate the same level of buzz as OpenAI’s cutting-edge image and text generation, it is a notable departure from the conventional approaches currently pursued by OpenAI and its emulators.