OpenAI has transcribed over a million hours of YouTube videos to train ChatGPT

OpenAI has transcribed more than a million hours of YouTube video to train its artificial intelligence.

The New York Times reports this, revealing how the company that created ChatGPT, increasingly hungry for data with which to train AI, drew on the popular video platform without informing Alphabet, the company that controls both YouTube and Google.

OpenAI, and especially the former president Greg Brockman, who personally selected some of the videos, knew there was a possibility of breaking the rules. But the company went ahead with the data collection on the basis of fair use, a US legal doctrine that allows limited use of copyrighted material without having to ask permission, but only when the use is for educational, critical or journalistic purposes and does not damage the market value of the original work.

In the specific case of the data collected by OpenAI, it is doubtful that fair use holds. The company led by Sam Altman, in fact, used copyrighted content to create a product that not only chases a profit, but actually represents a threat to the companies whose data has been stolen.

All this, the NYT writes, was clear to OpenAI, but it did not deter the company from its intent to get its hands on the most valuable content: that produced by real people.

In fact, “human” data is the most valuable for companies developing artificial intelligence.

We already knew that in recent years OpenAI had plundered public and private platforms, in most cases without asking for permission.

OpenAI also collected data from the New York Times, which recently responded to this practice, which it deems unfair, with a lawsuit. What we didn’t know, however, is that the San Francisco company had exhausted the datasets available to feed ChatGPT as early as 2021, a year before its popular chatbot was opened to the public.

At that point, according to the NYT’s sources, OpenAI decided to transcribe everything, from videos to podcasts.

And for this it created an AI model, Whisper, which is one of the most powerful and accurate around when it comes to transcription.

In recent days the CEO of YouTube, Neal Mohan, confirmed to Bloomberg the possibility that OpenAI used the platform’s videos to train Sora, its AI model that generates videos.

A Google spokesperson, Matt Bryant, said the Mountain View company takes “technical and legal measures” to prevent such unauthorized use “when there is a clear legal basis for doing so.”

According to NYT sources, Google also transcribed YouTube videos to train Gemini, its artificial intelligence. Bryant said the company has trained its AI models “on some YouTube content, in accordance with creator agreements.”

In short, it looks like an epic clash between two AI giants.

But it is also interesting to note that both Google and OpenAI are in the same boat when it comes to the data needed by their AI, which is being consumed more and more rapidly.

And that’s a problem, since these artificial intelligences progress only if they can absorb ever-increasing amounts of content produced by humans and not, for example, by other AI.

The Wall Street Journal wrote this week that the quality content on which to train AI could run out by 2028.

It is worth remembering that the WSJ itself, a few weeks ago, asked the CTO of OpenAI, Mira Murati: “Was Sora trained on YouTube or Facebook data?”

And she replied: “I’m not sure”.

– 2024-04-09 11:59:59
