Researchers from the University of Washington and the Allen Institute for AI trained an artificial intelligence system to “interpret” what happens in the videos.
To achieve this goal, the researchers used millions of YouTube videos to train the AI.
Although interpreting what happens in an image or video takes little effort for us, since we can understand all the elements involved in its context, it remains a great challenge even for the most advanced artificial intelligence systems.
Considering that interpreting the scene in a single photograph already requires an AI to analyze hundreds of data points, the amount of data and patterns needed to interpret what happens in a video is hard to imagine.
However, researchers continue to develop models that move closer to this goal. One of the latest studies comes from a group of researchers at the University of Washington and the Allen Institute for AI:
Introducing MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech, in an entirely label-free, self-supervised manner. By pretraining with a mix of frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time.
As the researchers mention, the AI was trained on millions of YouTube videos covering different topics. The goal was for the system to contextualize the representations in the videos and understand events and situations, matching the frames with their corresponding transcripts in the right order.
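To give a rough feel for the frame–transcript matching idea, a contrastive objective rewards the model when each frame embedding is most similar to the embedding of its own transcript segment. This is only a minimal sketch of such an objective, not the authors' actual code; the function name, embedding sizes, and toy data below are invented for illustration:

```python
import numpy as np

def contrastive_matching_loss(frame_emb, text_emb, temperature=0.1):
    """InfoNCE-style loss: the i-th frame should match the i-th
    transcript segment more than any other segment in the batch."""
    # Normalize so dot products become cosine similarities.
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = f @ t.T / temperature  # (N, N) similarity matrix
    # Softmax cross-entropy where the diagonal is the correct pairing.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()

# Toy example: 4 frames and 4 transcript segments, 8-dim embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned_loss = contrastive_matching_loss(emb, emb)         # correct pairing
shuffled_loss = contrastive_matching_loss(emb, emb[::-1])  # wrong pairing
```

Correctly aligned frame/transcript pairs yield a lower loss than shuffled ones, which is what pushes the model toward matching images with the temporally corresponding words.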
And according to the results they have shared (accuracy of up to 80.6%), MERLOT has managed to overcome some of the challenges of this task. However, this learning setup has limitations. For example, the AI could pick up “undesirable patterns” given the limited slice of videos used in training, whether because of their language or their themes.
Although there is still a long way to go, the results are more than promising for continuing to advance these artificial intelligence models.