- The paper demonstrates that fine-tuned decoders, such as GPT-4o, can achieve humor classification performance on par with encoder models like RoBERTa-base, challenging the conventional assumption that encoders are better suited to classification tasks.
- The paper employs a multi-class classification framework across five humor types and a non-humorous category, using curated datasets and standardized metrics like F1-macro for evaluation.
- The paper's findings highlight the critical role of data quality, fine-tuning strategies, and evaluation protocols in advancing LLM performance for nuanced tasks such as humor comprehension.
Decoders Laugh as Loud as Encoders
Introduction
The exploration of humor comprehension by AI models has long posed a challenge due to the nuanced and culturally dependent nature of humor. As LLMs like GPT and Claude demonstrate sophisticated text generation capabilities, the question of whether these models genuinely comprehend humor becomes increasingly pertinent. This paper, "Decoders Laugh as Loud as Encoders" (2509.04779), investigates the ability of various LLM architectures, particularly fine-tuned decoders and encoders, to classify humorous text effectively.
Methodology
The study employs a methodological framework that evaluates LLMs on their capability to classify humor into predefined categories. English humor is segmented into five distinct types: absurdity, dark, irony, wordplay, and social commentary. A sixth category of regular, non-humorous sentences serves as negative examples. The models tested span three architectures: encoders, encoder-decoder models, and decoders.
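As a minimal illustration, the six-way scheme can be represented as an integer label map; the category order and field names below are assumptions for the sketch, not the paper's exact identifiers.

```python
# Illustrative only: one way to encode the six-way label scheme
# (five humor types plus a non-humorous class). The ordering is an assumption.
HUMOR_LABELS = {
    "absurdity": 0,
    "dark": 1,
    "irony": 2,
    "wordplay": 3,
    "social_commentary": 4,
    "not_humorous": 5,  # regular sentences act as negative examples
}

# A single training example would then pair raw text with one label id.
example = {"text": "A sentence drawn from the corpus.", "label": HUMOR_LABELS["wordplay"]}
```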
Data Collection and Preprocessing
Data was gathered from web sources such as Reddit and other humor-centric platforms, ensuring wide coverage of humor styles. The datasets were rigorously cleaned to reduce ambiguity: jokes blending multiple humor types were filtered out so that each example maps to a single class, and categories prone to overlapping wordplay elements received particular curation to preserve the multi-class classification structure.
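A hypothetical preprocessing step along these lines might look as follows; the record structure and field names are assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch: keep only jokes tagged with exactly one humor type,
# so every remaining example fits a single class. Field names are assumptions.
def filter_single_label(records):
    """Drop examples whose annotations span more than one humor type."""
    return [r for r in records if len(set(r["annotated_types"])) == 1]

cleaned = filter_single_label([
    {"text": "...", "annotated_types": ["wordplay"]},          # kept
    {"text": "...", "annotated_types": ["wordplay", "dark"]},  # dropped as ambiguous
])
```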
Model Training and Evaluation
The research leverages various LLM architectures. Encoders such as BERT and RoBERTa were fine-tuned for several epochs, with class weights adjusted to counter label imbalance and hyperparameters chosen to maximize the F1-macro score. Encoder-decoder and decoder models, such as BART and GPT-4o, were evaluated under zero-shot and few-shot learning paradigms. Notably, GPT-4o was also fine-tuned via the OpenAI API, albeit with acknowledged constraints due to data sparsity and potential overfitting, which were mitigated by repeating runs with different seeds.
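For concreteness, a minimal sketch of encoder fine-tuning with a class-weighted loss and F1-macro evaluation is shown below, assuming a Hugging Face Trainer setup; the hyperparameters, dataset objects, and class weights are placeholders rather than the paper's reported configuration.

```python
# Minimal sketch (not the paper's exact setup): fine-tune RoBERTa-base as a
# six-way humor classifier with a class-weighted loss and an F1-macro metric.
import numpy as np
import torch
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=6)

class WeightedTrainer(Trainer):
    """Trainer variant that applies per-class weights to counter label imbalance."""
    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = torch.tensor(class_weights, dtype=torch.float)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1_macro": f1_score(labels, preds, average="macro")}

args = TrainingArguments(output_dir="humor-roberta", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)

# trainer = WeightedTrainer(class_weights=[1.0] * 6, model=model, args=args,
#                           train_dataset=train_ds, eval_dataset=val_ds,
#                           compute_metrics=compute_metrics)
# trainer.train()
```

Macro-averaged F1 weights each of the six classes equally, which is why it is a sensible headline metric for a label set where non-humorous sentences could otherwise dominate.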
Results
The empirical results reveal intriguing insights into the humor comprehension capabilities of LLMs:
- Encoder Models: RoBERTa-base emerged as the best-performing model, achieving an F1-macro score of 0.8566 and marginally outperforming other encoder variants. Interestingly, performance was relatively consistent between RoBERTa's base and large variants, indicating that model size had minimal impact within this architecture.
- Decoder Model: Fine-tuned GPT-4o displayed remarkable proficiency, closely matching RoBERTa's performance with an F1-macro score of 0.8522. This outcome challenges the traditional view that encoders outperform decoders in classification tasks, demonstrating the potential of decoders when adequately fine-tuned.
- Zero-Shot and Few-Shot Learning: Encoder-decoder models like Flan-T5 exhibited underwhelming results in few-shot learning compared to zero-shot settings, underscoring the difficulties smaller models face with long prompts (see the prompting sketch after this list).
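To make the zero-shot versus few-shot comparison concrete, a small prompting sketch is given below; the model checkpoint, prompt wording, and example jokes are assumptions, not the paper's protocol.

```python
# Hypothetical sketch: zero-shot vs. few-shot prompting of Flan-T5 for
# humor-type classification. Each in-context example lengthens the prompt,
# which is the burden smaller encoder-decoder models appear to struggle with.
from transformers import pipeline

LABELS = ["absurdity", "dark", "irony", "wordplay", "social commentary", "not humorous"]
classifier = pipeline("text2text-generation", model="google/flan-t5-base")

def zero_shot_prompt(text):
    return ("Classify the sentence into one of these humor types: "
            + ", ".join(LABELS) + f".\n\nSentence: {text}\nLabel:")

def few_shot_prompt(text, examples):
    shots = "\n\n".join(f"Sentence: {s}\nLabel: {l}" for s, l in examples)
    return ("Classify each sentence into one of these humor types: "
            + ", ".join(LABELS) + f".\n\n{shots}\n\nSentence: {text}\nLabel:")

joke = "I told my therapist I have a fear of speed bumps. I'm slowly getting over it."
shots = [("The cemetery is so popular that people are dying to get in.", "wordplay")]

zero_shot_pred = classifier(zero_shot_prompt(joke), max_new_tokens=5)[0]["generated_text"]
few_shot_pred = classifier(few_shot_prompt(joke, shots), max_new_tokens=5)[0]["generated_text"]
print(zero_shot_pred, few_shot_pred)
```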
Discussion
These findings present significant implications for AI research, particularly in natural language understanding and humor. The equivalence of fine-tuned decoders and encoders in this context suggests a potential paradigm shift wherein decoders could be as viable as encoders for certain classification tasks when trained appropriately. The study also underscores the importance of dataset quality and label clarity in driving model performance.
Conclusion
In conclusion, this paper demonstrates that fine-tuned decoders, exemplified by GPT-4o, can rival encoders in classifying humor. This finding contributes to the ongoing discourse on LLM architecture capabilities and paves the way for future explorations into model training methodologies, particularly in areas demanding nuanced comprehension. Future research may explore larger datasets, address current model constraints, and investigate the implications of integrating GPT-5 once it becomes publicly available.