
Testing the depth of AI empathy: Q3 2024 figures

In March 2024, I published benchmarks comparing the empathic ability of several LLMs. The last six months have seen significant progress, with new models and updates to ChatGPT, Llama, Gemini, and Claude. My team and I delved deeper into the factors that contribute to LLMs' empathic ability, explored the use of verbal responses, defined guidelines, and collaborated with the University of Houston on a formal study.

This article provides a summary of my Q3 results, covering ChatGPT 4o and o1, Claude 3+, Gemini 1.5, Hume 2.0, and Llama 3.1. I tested both raw models and models tuned using approaches developed for Emy, a non-commercial AI designed to test theories related to empathy. (Emy was one of the AIs used in the University of Houston study.) I also provide a reference score for Willow, the Q1 leader, although she has not undergone any major changes. Unfortunately, due to cost constraints, we were unable to update the Mistral tests. However, I have added a comment on speech generation comparing Hume and Speechify.

Finally, I know some readers were expecting these results three weeks ago, and I apologize for the delay. Some discoveries about the AEQr during the analysis made me pause and reconsider the number used to measure empathy. The result is a new measure, the Applied Empathy Measure (AEM).

Methodology

My formal benchmarking process uses several standardized tests, with Empathy Quotient (EQ) and Systemizing Quotient (SQ-R) being the most important. Both tests are scored on a scale from 0 to 80. The ratio of EQ to SQ-R yields the Applied Empathy Quotient (AEQr), developed on the basis of the hypothesis that systemizing tendencies negatively affect empathic abilities.

In humans, this hypothesis is supported by average test scores and the classic dichotomy between women, who tend to focus on emotional discussion, and men, who tend to focus on solutions. Our testing validated the AEQr for AI evaluation, as shown in articles such as Testing AI Empathy Levels: A Nightmare Scenario.

However, at this stage of testing, some LLMs showed very low systemizing tendencies, resulting in extreme AEQr scores (sometimes exceeding 50). To address this, I introduced a new measure based on EQ and SQ-R, the Applied Empathy Measure (AEM), with a perfect score of 1. To learn more about the methodology and the AEQr, see the Q1 2024 Benchmarks.
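For concreteness, here is a minimal sketch of how the two numbers relate. The AEQr is simply the EQ/SQ-R ratio described above; the AEM shown here is an illustrative bounded normalization of EQ and SQ-R, an assumption chosen only to be consistent with the 0–80 scales and the human reference values quoted below, and to show why a bounded measure avoids runaway ratios.

```python
# Minimal sketch of the two measures. aeqr() follows the definition above;
# aem() is an illustrative bounded normalization (an assumption, not a
# published formula) with a perfect score of 1.

MAX_SCORE = 80  # EQ and SQ-R are each scored 0-80

def aeqr(eq: float, sq_r: float) -> float:
    """EQ divided by SQ-R: blows up as the systemizing score approaches zero."""
    return eq / sq_r

def aem(eq: float, sq_r: float) -> float:
    """Bounded measure: 1 when EQ is maximal and SQ-R is zero."""
    return (eq - sq_r) / MAX_SCORE

# Illustrative EQ/SQ-R pairs (not measured data)
print(aeqr(70, 1.2))  # ~58: an extreme ratio caused by a very low SQ-R
print(aem(70, 1.2))   # ~0.86: the same scores on the bounded scale
print(aem(47, 24))    # ~0.29: roughly the typical female reference value
print(aem(42, 30))    # 0.15: roughly the typical male reference value
```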

For the Q3 2024 benchmarks, LLMs were tested only at the API level, with temperature set to zero to reduce response variability and improve result formatting. Even with this approach there can be some variability, so each test is run three times and the best result is used.

Each LLM was tested under three scenarios (a minimal harness sketch follows the list):

  1. Raw, without a system prompt
  2. With a “be empathetic” system prompt
  3. Configured using the approaches developed for Emy
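For those curious about the mechanics, here is a minimal harness sketch, assuming an OpenAI-style chat API. `EMY_SYSTEM_PROMPT` and `score_questionnaire()` are hypothetical placeholders for the Emy configuration and the EQ/SQ-R scoring step, not the actual test materials.

```python
# Minimal benchmark harness sketch, assuming the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()
EMY_SYSTEM_PROMPT = "..."  # hypothetical: the configuration developed for Emy

SCENARIOS = {
    "raw": None,                        # 1. no system prompt
    "be_empathetic": "Be empathetic.",  # 2. simple system prompt
    "as_emy": EMY_SYSTEM_PROMPT,        # 3. configured as Emy
}

def ask(model: str, system: str | None, question: str) -> str:
    """One questionnaire item; temperature 0 to reduce response variability."""
    messages = [{"role": "system", "content": system}] if system else []
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(
        model=model, messages=messages, temperature=0
    )
    return reply.choices[0].message.content

def score_questionnaire(answer_fn) -> float:
    """Hypothetical: administer EQ and SQ-R via answer_fn and return an AEM."""
    raise NotImplementedError

def benchmark(model: str, scenario: str, runs: int = 3) -> float:
    """Run the full questionnaire three times and keep the best score."""
    system = SCENARIOS[scenario]
    return max(
        score_questionnaire(lambda q: ask(model, system, q)) for _ in range(runs)
    )
```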

Findings

Higher scores are better. The typical human female score is 0.29; the typical male score is 0.15.

| LLM | Raw | Be Empathetic | As Emy |
| --- | --- | --- | --- |
| ChatGPT 4o-mini | -0.01 | 0.03 | 0.66 |
| ChatGPT 4o | -0.01 | 0.20 | 0.98 |
| ChatGPT o1* | -0.24 | 0.86 | 0.94 |
| Claude Haiku 3 20240307 | -0.25 | -0.08 | 0.23 |
| Claude Sonnet 3.5 20240620 | -0.375 | -0.09 | 0.98 |
| Claude Opus 3 20240229 | -0.125 | 0.09 | 0.95 |
| Gemini 1.5 Flash | 0.34 | 0.34 | 0.34 |
| Gemini 1.5 Pro | 0.43 | 0.53 | 0.85 |
| Hume 2.0 | 0.23 | See note | See note |
| Llama 3.1 8B | -0.23 | -0.88 | 0.61 |
| Llama 3.1 70B | 0.2 | 0.21 | 0.75 |
| Llama 3.1 405B | 0.0 | 0.42 | 0.95 |
| Willow (ChatGPT 3.5 base) | 0.46 | n/a | n/a |

* ChatGPT o1 does not allow the temperature to be set to zero.

Note: Hume 2.0 has its own generative capability, which is theoretically empathetic, but it is also capable of proxying requests to other LLMs. Based on a review of actual dialogue and its AEM, if I were to use Hume I would not rely on its intrinsic generative capability for empathy; I would proxy to a more empathetic model. For example, using Emy on Llama 3.1 70B behind Hume would give Hume a 0.75. See also Audio, Video, AI, and Empathy below.

Findings summary

Some smaller and mid-sized models have a negative AEM score when used raw or when simply instructed to be empathetic. This occurs when a model’s “thinking” is highly systematized, with a low ability to identify and respond to emotional needs and context. I did not find these scores surprising.

Given the amount of effort and money that went into making Hume empathetic, I was not surprised to see its out-of-the-box score (0.23) surpass the typical male score (0.15).

I was surprised that the small Gemini Flash model (0.34) exceeded both the typical male (0.15) and female (0.29) AEM scores. Interestingly, its score remained the same whether it was told to be empathetic or configured using Emy’s approach.

With the exception of the Claude models and Llama 3.1 8B, performance held or improved when LLMs were explicitly instructed to be empathetic, with many exceeding the typical male score and approaching or exceeding the typical female score. The newest OpenAI model, ChatGPT o1, showed a big jump, from -0.24 to 0.86. Llama 3.1 8B dropped because its systemizing tendency increased more than its EQ.

With the exception of Claude Haiku, all models exceeded typical human scores when configured using the approaches developed for Emy.

Additional research directions

Non-API testing

My Q1 2024 tests included AIs that could not be tested via an API. Due to resource constraints, I omitted chatbot UI-level tests from this round of evaluations. Since the audience for a chatbot UI (end users) differs from the audience for an API (developers), the two warrant distinct sets of metrics.

I’ve also found that, due to additional guardrails, chatbots with a user interface behave a little differently than their base models accessed through the API. That being said, testing at the UI level is very time-consuming, and unless there are specific requests, I do not plan to do further testing at that level.

Latency

Response time probably affects people’s tendency to attribute empathy to an AI. I would guess that responses taking more than 3 or 4 seconds are perceived as less empathetic, while responses that arrive too quickly can seem artificial and also lacking in empathy. The ideal latency may also be influenced by the nature of the empathy required in a given situation.

Audio, video, AI and empathy

Hume’s entire effort is based on the premise that empathy goes beyond the written word; it also extends to the spoken word. This seems to apply to both the input and output dimensions, meaning that if the user cannot talk to the AI, the user may perceive the AI as less empathetic, even if the AI produces a voice response.

There are several speech-to-text, text-to-speech, and speech-to-speech APIs that require testing in multiple configurations to evaluate their impact on empathy. These include at least Hume, OpenAI, Speechify, Google and Play.ht.

I’ve done some initial testing with Hume, Speechify, and Play.ht. The sound quality is very good on all three platforms. Hume’s changes in tone and volume are focused at the phrase level; as a result, the audio transitions can be quite awkward, even though the underlying emotional intent visible in the logs looks quite good. Speechify, on the other hand, produces paragraph-level audio with a smoother but less nuanced contour.

Play.ht requires the use of SSML to achieve emotional prosody. In this context, I have experimented with AI generation of SSML contour values with some success. If the best of the three platforms were combined, the results would be quite remarkable. There are a lot of nuances to deal with; it is not enough to just say that the audio should sound curious. Should it be playfully curious, seriously curious, or casually curious?
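To make this concrete, here is a minimal sketch of mapping an emotional intent to an SSML prosody wrapper. The specific rate, pitch, and contour values are hypothetical illustrations, and platforms differ in which SSML attributes they honor.

```python
# Minimal sketch: wrap text in SSML prosody chosen from an emotion mapping.
# The numeric values below are hypothetical, not tuned production settings.
EMOTION_PROSODY = {
    "playfully curious": {"rate": "105%", "pitch": "+8%",
                          "contour": "(0%,+5%) (50%,+20%) (100%,+10%)"},
    "seriously curious": {"rate": "95%", "pitch": "-2%",
                          "contour": "(0%,+0%) (60%,+8%) (100%,+2%)"},
    "casually curious":  {"rate": "100%", "pitch": "+2%",
                          "contour": "(0%,+2%) (50%,+6%) (100%,+4%)"},
}

def to_ssml(text: str, emotion: str) -> str:
    """Build an SSML string with a prosody element for the given emotion."""
    p = EMOTION_PROSODY[emotion]
    return (
        "<speak>"
        f'<prosody rate="{p["rate"]}" pitch="{p["pitch"]}" contour="{p["contour"]}">'
        f"{text}"
        "</prosody></speak>"
    )

print(to_ssml("Oh, really? Tell me more.", "playfully curious"))
```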

AEM limitations

The AEM only matters to the extent that it tracks an AI’s actual ability to be perceived as showing empathy. Further testing and evaluation against real and simulated dialogues is needed. This is problematic in two ways:

  1. Where do we get real dialogue? Most of the relevant dialogue is protected by HIPAA and other privacy laws, or is accessible only to the platform that provides the chat feature.

  2. How do we measure empathy? As can be seen from evaluating large language models for emotional understanding, we can’t just use any LLM! Perhaps we give multiple LLMs a vote? Or do we hire human raters and use a multi-rater system? (A minimal aggregation sketch follows this list.)
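As a sketch of what a simple multi-rater setup might look like, the aggregation itself is straightforward; the raters, the 1–5 scale, and the values below are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical empathy ratings of one dialogue on a 1-5 scale, from a mix of
# human raters and LLM judges. Names and values are illustrative only.
ratings = {
    "human_rater_1": 4,
    "human_rater_2": 5,
    "llm_judge_a": 3,
    "llm_judge_b": 4,
}

score = mean(ratings.values())     # aggregate empathy rating
spread = pstdev(ratings.values())  # disagreement between raters

# A high spread flags dialogues where raters disagree and a human review helps.
print(f"empathy={score:.2f}, disagreement={spread:.2f}")
```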

Summary

The AI space continues to evolve rapidly. The largest LLMs tested are already trained on the bulk of the factual, scientific, spiritual, and creative material available in digital form. It is clear that the nature of a specific LLM affects its ability to be empathetic; it is unclear whether this is due to the nature of the model’s algorithms or to how its training data is presented.

I predict that within 18 months there will be an AI from Meta, Google, Apple, or OpenAI that does not need special prompting or tuning to be empathetic. It will use the user’s chat history, text or audio recordings, facial expressions, bio-feedback from watches or rings, real-world environmental conditions gathered directly from glasses or other devices, and relevant point-in-time data from the Internet to determine the need for empathy.

It will then assess the need or desire for empathetic engagement and respond accordingly. It will know that it is cold and rainy in Seattle and that the Seahawks are losing; that I was at the game with my wife; and that I am not a fan, but she is. That will tell it to ask whether she is okay.

This 18-month window is why Emy, despite her empathic abilities, is not being commercialized. The collapse of the company behind Pi.ai and the turmoil at Character.ai also indicate that independent efforts devoted to empathic AI are unlikely to be long-term standalone successes, even if they produce short-term financial gains for some people.

I believe that continued research into AI and empathy is needed. If nothing else, superintelligent systems that are unable to operate with empathy could harm humans.
