CoNLL'26 Presentation | Who Generates More Empathetic Responses—Humans or LLMs?

In this video, I present the paper “Who Generates More Empathetic Responses—Humans or LLMs? A Comparative Evaluation with Human and LLM Judges” by Anuradha Welivita, Fawzia Zeitoun, and Pearl Pu. The paper investigates an important question in modern NLP and conversational AI: who generates more empathetic responses, humans or large language models?

To answer this, the authors compare human-written responses with responses generated by four LLMs: GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8×7B-Instruct. The study uses 2,000 dialogue prompts from the EmpatheticDialogues dataset, covering 32 positive and negative emotions. A total of 1,000 human participants evaluate the empathetic quality of responses, and the same responses are also evaluated using an LLM-as-judge approach with GPT-4o-mini.

The results show that LLM-generated responses are rated as more empathetic than human-written responses by both human judges and the LLM judge. However, the paper also finds that both humans and LLMs tend to favor responses generated by their own group, highlighting the need to interpret empathy evaluations carefully.

In this presentation, I explain the motivation, dataset, experimental design, human evaluation setup, LLM-as-judge framework, key results, limitations, and ethical considerations of the study. This video is for educational and research discussion purposes.

#NLP #LLMs #EmpatheticAI #ConversationalAI #DialogueSystems #AIResearch #ComputationalLinguistics #HumanAIComparison