- The paper demonstrates that fine-tuned decoders, such as GPT-4o, can achieve humor classification performance on par with encoder models like RoBERTa-base, challenging the conventional assumption that encoders are better suited to classification tasks.
- The paper employs a multi-class classification framework across five humor types and a non-humorous category, using curated datasets and standardized metrics like F1-macro for evaluation.
- The paper's findings highlight the critical role of data quality, fine-tuning strategies, and evaluation protocols in advancing LLM performance for nuanced tasks such as humor comprehension.
Decoders Laugh as Loud as Encoders
Introduction
The exploration of humor comprehension by AI models has long posed a challenge due to the nuanced and culturally dependent nature of humor. As LLMs like GPT and Claude demonstrate sophisticated text generation capabilities, the question of whether these models genuinely comprehend humor becomes increasingly pertinent. This paper, "Decoders Laugh as Loud as Encoders" (2509.04779), investigates the ability of various LLM architectures, particularly fine-tuned decoders and encoders, to classify humorous text effectively.
Methodology
The study employs a methodological framework that evaluates LLMs on their capability to classify humor into predefined categories. English humor is segmented into five distinct types: absurdity, dark, irony, wordplay, and social commentary. A sixth category of regular, non-humorous sentences serves as negative examples. The models tested span three architectures: encoders, encoder-decoder models, and decoders.
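As a minimal illustration, the six-way scheme can be represented as an integer label map; the category order and field names below are assumptions for the sketch, not the paper's exact identifiers.

```python
# Illustrative only: one way to encode the six-way label scheme
# (five humor types plus a non-humorous class). The ordering is an assumption.
HUMOR_LABELS = {
    "absurdity": 0,
    "dark": 1,
    "irony": 2,
    "wordplay": 3,
    "social_commentary": 4,
    "not_humorous": 5,  # regular sentences act as negative examples
}

# A single training example would then pair raw text with one label id.
example = {"text": "A sentence drawn from the corpus.", "label": HUMOR_LABELS["wordplay"]}
```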
Data Collection and Preprocessing
Data was gathered from web sources such as Reddit and other humor-centric platforms, ensuring wide coverage of humor styles. The datasets were rigorously cleaned to reduce ambiguity: jokes blending multiple humor types were filtered out so that each example maps to a single class, and categories prone to overlapping wordplay elements received particular curation to preserve the multi-class classification structure.
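A hypothetical preprocessing step along these lines might look as follows; the record structure and field names are assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch: keep only jokes tagged with exactly one humor type,
# so every remaining example fits a single class. Field names are assumptions.
def filter_single_label(records):
    """Drop examples whose annotations span more than one humor type."""
    return [r for r in records if len(set(r["annotated_types"])) == 1]

cleaned = filter_single_label([
    {"text": "...", "annotated_types": ["wordplay"]},          # kept
    {"text": "...", "annotated_types": ["wordplay", "dark"]},  # dropped as ambiguous
])
```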
Model Training and Evaluation
The research leverages various LLM architectures. Encoders such as BERT and RoBERTa were fine-tuned for several epochs, with class weights adjusted to counter label imbalance and hyperparameters chosen to maximize the F1-macro score. Encoder-decoder and decoder models, such as BART and GPT-4o, were evaluated under zero-shot and few-shot learning paradigms. Notably, GPT-4o was also fine-tuned via the OpenAI API, albeit with acknowledged constraints due to data sparsity and potential overfitting, which were mitigated by repeating runs with different seeds.
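For concreteness, a minimal sketch of encoder fine-tuning with a class-weighted loss and F1-macro evaluation is shown below, assuming a Hugging Face Trainer setup; the hyperparameters, dataset objects, and class weights are placeholders rather than the paper's reported configuration.

```python
# Minimal sketch (not the paper's exact setup): fine-tune RoBERTa-base as a
# six-way humor classifier with a class-weighted loss and an F1-macro metric.
import numpy as np
import torch
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=6)

class WeightedTrainer(Trainer):
    """Trainer variant that applies per-class weights to counter label imbalance."""
    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = torch.tensor(class_weights, dtype=torch.float)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1_macro": f1_score(labels, preds, average="macro")}

args = TrainingArguments(output_dir="humor-roberta", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)

# trainer = WeightedTrainer(class_weights=[1.0] * 6, model=model, args=args,
#                           train_dataset=train_ds, eval_dataset=val_ds,
#                           compute_metrics=compute_metrics)
# trainer.train()
```

Macro-averaged F1 weights each of the six classes equally, which is why it is a sensible headline metric for a label set where non-humorous sentences could otherwise dominate.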
Results
The empirical results reveal intriguing insights into the humor comprehension capabilities of LLMs:
- Encoder Models: RoBERTa-base emerged as the best-performing model, achieving an F1-macro score of 0.8566 and marginally outperforming other encoder variants. Interestingly, performance was relatively consistent between RoBERTa's base and large variants, indicating that model size had minimal impact within this architecture.
- Decoder Model: Fine-tuned GPT-4o displayed remarkable proficiency, closely matching RoBERTa's performance with an F1-macro score of 0.8522. This outcome challenges the traditional view that encoders outperform decoders in classification tasks, demonstrating the potential of decoders when adequately fine-tuned.
- Zero-Shot and Few-Shot Learning: Encoder-decoder models like Flan-T5 exhibited underwhelming results in few-shot learning compared to zero-shot settings, underscoring the difficulties smaller models face with long prompts (see the prompting sketch after this list).
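To make the zero-shot versus few-shot comparison concrete, a small prompting sketch is given below; the model checkpoint, prompt wording, and example jokes are assumptions, not the paper's protocol.

```python
# Hypothetical sketch: zero-shot vs. few-shot prompting of Flan-T5 for
# humor-type classification. Each in-context example lengthens the prompt,
# which is the burden smaller encoder-decoder models appear to struggle with.
from transformers import pipeline

LABELS = ["absurdity", "dark", "irony", "wordplay", "social commentary", "not humorous"]
classifier = pipeline("text2text-generation", model="google/flan-t5-base")

def zero_shot_prompt(text):
    return ("Classify the sentence into one of these humor types: "
            + ", ".join(LABELS) + f".\n\nSentence: {text}\nLabel:")

def few_shot_prompt(text, examples):
    shots = "\n\n".join(f"Sentence: {s}\nLabel: {l}" for s, l in examples)
    return ("Classify each sentence into one of these humor types: "
            + ", ".join(LABELS) + f".\n\n{shots}\n\nSentence: {text}\nLabel:")

joke = "I told my therapist I have a fear of speed bumps. I'm slowly getting over it."
shots = [("The cemetery is so popular that people are dying to get in.", "wordplay")]

zero_shot_pred = classifier(zero_shot_prompt(joke), max_new_tokens=5)[0]["generated_text"]
few_shot_pred = classifier(few_shot_prompt(joke, shots), max_new_tokens=5)[0]["generated_text"]
print(zero_shot_pred, few_shot_pred)
```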
Discussion
These findings present significant implications for AI research, particularly in natural language understanding and humor. The equivalence of fine-tuned decoders and encoders in this context suggests a potential paradigm shift wherein decoders could be as viable as encoders for certain classification tasks when trained appropriately. The study also underscores the importance of dataset quality and label clarity in driving model performance.
Conclusion
In conclusion, this paper demonstrates that fine-tuned decoders, exemplified by GPT-4o, can rival encoders in classifying humor. This finding contributes to the ongoing discourse on LLM architecture capabilities and paves the way for future explorations into model training methodologies, particularly in areas demanding nuanced comprehension. Future research may explore larger datasets, address current model constraints, and investigate the implications of integrating GPT-5 once it becomes publicly available.