Nvidia
According to a damning report by 404 Media, backed by internal Slack chats, emails and documents obtained by the outlet, Nvidia helped itself to “a human lifetime viewing experience worthy of training data per day,” as Ming-Yu Liu, vice president of research at Nvidia and leader of the Cosmos project, admitted in a May email.
Anonymous former Nvidia employees told 404 Media that they had been asked to scrape video content from Netflix, YouTube and other online sources to obtain training data for the company’s various AI products. These include Nvidia’s Omniverse 3D world generator, self-driving car systems and “digital human” products.
When those employees asked about the legality of the project, called Cosmos internally, management assured them that the use of that content had been cleared at the highest levels of the company.
The project sought to build a foundation model, similar to Gemini 1.5, GPT-4 or Llama 3.1, “that encapsulates light transport simulation, physics and intelligence in one place to unlock several critical downstream applications for Nvidia.”
To do this, the Cosmos project reportedly used an open-source video downloader and employed machine learning to rotate IP addresses, thereby bypassing YouTube’s attempts to block it. According to emails viewed by 404 Media, project managers discussed using up to 30 virtual machines running on Amazon Web Services to download 80 years’ worth of full-length videos and clips every day.
For its part, Nvidia denies any wrongdoing. “We respect the rights of all content creators and are confident that our models and research efforts fully comply with the letter and spirit of copyright law,” an Nvidia spokesperson told 404 Media via email. “Copyright law protects particular expressions, but not facts, ideas, data, or information. Anyone is free to learn facts, ideas, data, or information from another source and use it to make their own expressions. Fair use also protects the ability to use a work for a transformative purpose, such as training models.”
This is far from the first time that Nvidia (not to mention the vast majority of the rest of the AI field) has taken a “scrape first and maybe apologize later” approach to its AI training efforts. In July, Nvidia was named in another report about illegal scraping of copyrighted videos alongside Anthropic and Salesforce.
At CES 2024, the company sparked an internet storm with its ambiguous responses about how its new generative AI engine for games was trained. In response, Nvidia reiterated that its tools were “commercially safe.”