LLM Colosseum: 14 Large Language Models Battle in "Street Fighter" to Determine the Strongest

There are many large language models (LLMs) available today. For AI chatbots, more training data generally means a more capable model, but that does not hold when the models are applied to fighting games. Recently, developers overseas hooked LLMs up to the game "Street Fighter" and pitted 14 large language models against each other, and the top performers were all small models.

Developers overseas combined LLMs with the game "Street Fighter" to compare 14 large language models and see which is the strongest.

This open source project is called LLM Colosseum and was developed by Stan Girard and Quivr Brain. According to its introduction, the game runs in an emulator, and the LLMs control the in-game characters and fight each other (the character is limited to Ken). Anyone can download and install the project to try it out.

Amazon employee Banjo Obayomi recently published an article on the results of using this project to test 14 LLMs. The article also explains in detail how an LLM controls a character in "Street Fighter". The LLM continuously reads the current state of the game, such as character positions, health, and score. This data is turned into a prompt, together with the actions that can be taken and a recommended strategy, so that the LLM can understand and act on it.
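The state-to-prompt step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `GameState` fields, the `MOVES` list, the `FAR_THRESHOLD` value, and the prompt wording are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical move list; the moves the project actually exposes may differ.
MOVES = ["Move Closer", "Move Away", "Hadouken", "Shoryuken"]

# Hypothetical distance (in pixels) beyond which the opponent counts as "far".
FAR_THRESHOLD = 100


@dataclass
class GameState:
    """Snapshot of the values read from the game on each turn (assumed fields)."""
    own_x: int
    opponent_x: int
    own_health: int
    opponent_health: int
    score: int


def build_prompt(state: GameState) -> str:
    """Turn a game-state snapshot into a text prompt for the LLM."""
    distance = abs(state.own_x - state.opponent_x)
    if distance > FAR_THRESHOLD:
        hint = "You are far from the opponent. Move closer or use a ranged attack."
    else:
        hint = "You are close to the opponent. Use a close-range attack."

    lines = [
        "You are playing Street Fighter as Ken.",
        f"Your health: {state.own_health}. Opponent health: {state.opponent_health}.",
        f"Your score: {state.score}.",
        hint,
        "Available moves:",
    ]
    lines += [f"- {move}" for move in MOVES]
    lines.append("Reply with a bullet list of the moves you want to perform.")
    return "\n".join(lines)


if __name__ == "__main__":
    state = GameState(own_x=100, opponent_x=300, own_health=160,
                      opponent_health=120, score=0)
    print(build_prompt(state))
```

Keeping the prompt short and listing the allowed moves explicitly is one way to make the reply easy to parse and to limit invented moves.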

After receiving this prompt, the LLM analyzes the current game state and decides on its next actions, which are converted into game inputs and executed in the emulator. Examples include moving closer, retreating, the Hadouken (wave fist), and the Shoryuken (rising dragon fist). For details, see the video below.
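The conversion from an LLM reply to game inputs might look something like the sketch below. The reply format (a bullet list of move names), the `MOVE_INPUTS` table, and the button names are assumptions for illustration only, and the directions assume the character is facing right. Unknown move names are skipped, which is one simple way to tolerate a model that names a move that does not exist.

```python
import re

# Hypothetical mapping from move names to button sequences.
# Directions assume the character is facing right.
MOVE_INPUTS = {
    "move closer": ["RIGHT"],
    "move away": ["LEFT"],
    "hadouken": ["DOWN", "DOWN+RIGHT", "RIGHT+PUNCH"],
    "shoryuken": ["RIGHT", "DOWN", "DOWN+RIGHT+PUNCH"],
}


def parse_moves(reply: str) -> list[str]:
    """Extract known move names from reply lines such as '- Move closer'."""
    moves = []
    for line in reply.splitlines():
        # Strip leading bullets, numbering, and whitespace.
        name = re.sub(r"^[\s\-\*\d\.]+", "", line).strip().lower()
        if name in MOVE_INPUTS:
            moves.append(name)
    return moves


def to_inputs(moves: list[str]) -> list[str]:
    """Flatten a list of move names into one sequence of button presses."""
    return [button for move in moves for button in MOVE_INPUTS[move]]


if __name__ == "__main__":
    reply = "- Move closer\n- Hadouken\n- Teleport\n"
    moves = parse_moves(reply)   # "Teleport" is not a known move and is dropped
    print(moves)
    print(to_inputs(moves))
```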

The video shared by Matthew Berman, who runs a well-known YouTube channel, shows a fairly complete duel, with the MISTRAL SMALL model on the left and the MISTRAL MEDIUM model on the right. The two models fight quite smoothly, but one detail stands out: neither model seems to use any defensive actions, only movement and attacks. Against a human opponent, the human would most likely win easily.

In any case, this is a battle between LLMs, and MISTRAL SMALL wins in the end: the smaller model beats the larger one. Unlike AI chat, fighting games reward speed and reaction time above all, and small LLMs usually have lower latency and respond faster.
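A quick back-of-the-envelope calculation shows why latency matters so much here. The round length and latency figures below are made-up illustrative values, not measurements from the tests, and the sketch assumes the model issues one action per reply.

```python
def actions_per_round(round_seconds: float, latency_seconds: float) -> int:
    """Upper bound on actions a model can issue if each reply yields one action."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return int(round_seconds // latency_seconds)


if __name__ == "__main__":
    # Illustrative numbers only: a 60-second round, 0.5 s vs 2.0 s per reply.
    print(actions_per_round(60, 0.5))  # 120
    print(actions_per_round(60, 2.0))  # 30
```

Under these assumptions, the faster model gets four times as many chances to act in the same round, regardless of how well each individual decision is reasoned.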

The second half of Matthew Berman's video walks through the steps for installing the LLM Colosseum project, which is useful for anyone who wants to try it themselves.

Among the 14 large language models tested by Banjo Obayomi, over a total of 314 matches, the final winner was claude_3_haiku. He also found that small models have lower latency, react faster, and make more moves per match, so it is not surprising that Anthropic's Claude took the top spot.

However, capable as LLMs are, they are not without shortcomings. Odd situations sometimes come up, such as "hallucination" and "refusing to play". Each LLM also has its own play style: some favor aggressive attacks, while others rely on more defensive counterattacks. Some even spam attacks, sending the same action over and over.

Source: Banjo Obayomi
