How a Multimodal ‘Joke‑King’ Model Beats GPT‑4 at Humor Generation

A research team from Sun Yat‑sen University, Sea AI Lab, and Harvard built a multimodal large model that learns to generate creative jokes and memes by training on the Oogiri‑GO dataset. Their Leap‑of‑Thought (LoT) paradigm and CLoT fine‑tuning pipeline outperform GPT‑4 and other state‑of‑the‑art models on humor tasks.

Background

The work investigates how to give multimodal large models genuine creative ability, moving beyond the sequential reasoning of traditional chain‑of‑thought (CoT). By using the Japanese improvisational comedy game “Oogiri” (大喜利) as a testbed, the authors train models to generate surprising, humorous responses to visual and textual prompts.

Dataset Construction – Oogiri‑GO

Oogiri‑GO is a trilingual (Chinese, English, Japanese) dataset collected from crowd‑sourced Oogiri sessions. It contains three interaction types that match the input‑output formats of multimodal models:

Image‑to‑Text: an image is shown and the model must produce a witty caption.

Image‑and‑Text‑to‑Text: an image with a partial caption is provided; the model fills in the missing part creatively.

Text‑to‑Text: a textual prompt is given and the model replies with a humorous response.

All examples are high‑quality, crowd‑curated humor data that can be directly used for instruction fine‑tuning.
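As a rough illustration, the three interaction types can all be flattened into a common instruction‑tuning format. The sketch below is an assumption for readability; the field names and prompt wording are not the paper's exact templates.

```python
# Hypothetical mapping of the three Oogiri-GO interaction types to
# instruction-tuning records; field names and wording are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class OogiriSample:
    image_path: Optional[str]   # None for the Text-to-Text type
    prompt_text: Optional[str]  # partial caption or textual prompt, if any
    response: str               # the crowd-curated humorous answer

def to_instruction_record(sample: OogiriSample) -> dict:
    """Convert one Oogiri-GO sample into a generic instruction-tuning record."""
    if sample.image_path and sample.prompt_text:
        # Image-and-Text-to-Text: creatively complete a partial caption.
        instruction = f'Look at the image and wittily complete: "{sample.prompt_text}"'
    elif sample.image_path:
        # Image-to-Text: caption the image alone.
        instruction = "Write a surprising, humorous caption for this image."
    else:
        # Text-to-Text: reply to a purely textual prompt.
        instruction = f'Give a humorous response to: "{sample.prompt_text}"'
    return {"image": sample.image_path, "instruction": instruction, "output": sample.response}
```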

Leap‑of‑Thought (LoT) Paradigm

Traditional CoT reasoning follows a linear chain of logical steps, which limits creativity. LoT is a non‑sequential paradigm that encourages distant associative jumps, allowing the model to explore remote connections between concepts rather than strictly logical progressions.

CLoT Training Pipeline

Correlation‑guided instruction fine‑tuning: Oogiri‑GO examples are transformed into generative and discriminative instruction templates. The model is first fine‑tuned on this data to acquire an initial ability to produce innovative, humor‑oriented responses.
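A minimal sketch of what the two template families might look like is given below; the wording, option format, and shuffling are assumptions for illustration rather than the authors' exact templates.

```python
# Hypothetical generative and discriminative instruction templates built from
# an Oogiri-GO record; wording and option format are assumptions.
import random

def make_generative_record(sample: dict) -> dict:
    """Generative template: the model must produce the humorous answer itself."""
    return {
        "image": sample.get("image"),
        "instruction": "Respond to this Oogiri prompt with an unexpected, funny answer.",
        "input": sample.get("prompt_text") or "",
        "output": sample["response"],
    }

def make_discriminative_record(sample: dict, distractors: list[str]) -> dict:
    """Discriminative template: the model must pick the genuinely funny option."""
    options = distractors + [sample["response"]]
    random.shuffle(options)
    letters = "ABCDE"[: len(options)]
    listed = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    return {
        "image": sample.get("image"),
        "instruction": f"Which option is the funniest response to the prompt?\n{listed}",
        "output": letters[options.index(sample["response"])],
    }
```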

Exploratory self‑adjustment: Weakly related condition words are introduced to prompt the model to generate diverse, far‑associated answers. A filtering pipeline (e.g., heuristic quality checks and model‑based scoring) selects high‑quality creative outputs, which are then added to the training set as new LoT data. A second round of instruction fine‑tuning on this expanded set further enhances creative capability.

The exploratory stage consists of two sub‑steps: (a) encouraging remote associative generation, and (b) self‑refinement through filtering and re‑training.
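Put together, the exploratory stage can be pictured roughly as the loop below. The `model.generate` and `scorer` interfaces and the condition‑word pool are placeholders for whatever generator and quality filter one plugs in, not the authors' implementation.

```python
# Minimal sketch of exploratory self-refinement: generate remote-association
# candidates with weakly related condition words, filter them, and keep the
# survivors as new LoT training data. All interfaces here are assumed.
import random

CONDITION_WORDS = ["umbrella", "tax return", "penguin", "Monday morning"]

def explore_and_refine(model, scorer, prompts, n_candidates=4, threshold=0.7):
    """model.generate(text) -> str and scorer(text) -> float are assumed interfaces."""
    new_lot_data = []
    for prompt in prompts:
        for _ in range(n_candidates):
            # (a) Encourage a remote associative leap via an unrelated condition word.
            word = random.choice(CONDITION_WORDS)
            candidate = model.generate(
                f"{prompt}\nWeave the unrelated concept '{word}' into a funny answer."
            )
            # (b) Self-refinement: keep only candidates that pass the quality filter.
            if scorer(candidate) >= threshold:
                new_lot_data.append({"instruction": prompt, "output": candidate})
    # The surviving records are merged into the training set for a second
    # round of instruction fine-tuning.
    return new_lot_data
```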

Evaluation

Multiple‑choice and ranking questions are built on the Oogiri‑GO test split to quantitatively assess humor quality. Experiments show that CLoT substantially improves the performance of multimodal models such as Qwen‑VL and CogVLM, surpassing GPT‑4V and other strong baselines. User studies confirm that CLoT‑enhanced models produce more amusing content.
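For the multiple‑choice side of the evaluation, scoring reduces to a simple accuracy computation, as sketched below; the question fields and the `predict_choice` interface are assumptions for illustration, not the paper's evaluation code.

```python
# Hypothetical scoring loop for a multiple-choice humor benchmark.
def choice_accuracy(model, questions):
    """Each question: {'prompt': str, 'options': list[str], 'answer_idx': int}."""
    correct = 0
    for q in questions:
        listed = "\n".join(f"{i}. {opt}" for i, opt in enumerate(q["options"]))
        # model.predict_choice is an assumed interface returning an option index.
        pred = model.predict_choice(f"{q['prompt']}\n{listed}\nAnswer with the option number.")
        correct += int(pred == q["answer_idx"])
    return correct / len(questions)
```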

Generalization is evaluated on two additional benchmarks:

Cloud Guessing Game (CGG): a visual task in which the model guesses which object a cloud‑shaped image resembles.

Divergent Association Task (DAT): a standard test of associative creativity.

On both tasks, CLoT achieves higher accuracy than baseline models, demonstrating strong transferability of the learned creative reasoning.

Resources

Paper: https://arxiv.org/abs/2312.02439

Project page: https://zhongshsh.github.io/CLoT/

Code repository: https://github.com/sail-sg/CLoT
