Can AI Truly Be Creative? Inside the CreativeBench Benchmark

This article examines the CreativeBench benchmark, which reframes machine creativity evaluation by measuring both the quality and the novelty of generated solutions. It explains the benchmark's combinatorial and exploratory task designs, details the self‑evolving task construction process, and discusses the key findings along with the EvoRePE enhancement method.


Why Reconsider Machine Creativity?

Large language models have shown impressive abilities in knowledge QA, math reasoning, and code generation, but their success raises a fundamental question: are they merely recombining existing knowledge or truly exhibiting creative behavior?

Can models reorganize dispersed knowledge into new solutions?

When familiar paths are blocked, can models find alternative approaches?

Are diverse outputs genuine innovations or just noisy deviations?

These questions highlight that machine creativity must be modeled and evaluated as a distinct capability.

What Is CreativeBench?

CreativeBench is a benchmark focused on code‑generation scenarios that goes beyond checking whether a model can produce a correct answer. It evaluates whether the model can generate solutions that are both effective and novel.

1. Combinatorial Creativity

Models must reorganize previously separate knowledge, structures, or methods into a new, effective solution, demonstrating the ability to integrate across concepts.

2. Exploratory Creativity

When standard pathways are unavailable, models must continue exploring within constrained solution spaces to discover new viable approaches.

Key Innovation: Self‑Evolving Tasks

CreativeBench introduces automatic task generation mechanisms for both combinatorial and exploratory creativity, eliminating the need for manually crafted “creative” questions.

Combo tasks are built via reverse construction: first generate a multi‑skill solution, then derive the problem.

Explore tasks use an interactive self‑play mechanism between a Constraint Generator and a Solver.

How Combo Tasks Are Built

The process, called CreativeBench‑Combo, follows four steps:

Step 1: Construct a Cross‑Domain Program

The system fuses code components from different domains into a single, functional program, ensuring deep integration rather than superficial stitching.
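To make this concrete, here is a toy illustration (not drawn from the paper) of the kind of artifact Step 1 targets: one short program that genuinely needs components from two domains, here regex‑based text parsing and heap‑based top‑k selection, rather than two snippets pasted side by side.

```python
from collections import defaultdict
import heapq
import re

def top_error_timestamps(log_text: str, k: int = 3) -> list[tuple[int, int]]:
    """Toy cross-domain fusion: regex log parsing (text processing)
    feeding a heap-based top-k selection (priority queues). Returns
    (timestamp, error_count) pairs for the k noisiest timestamps."""
    counts: dict[int, int] = defaultdict(int)
    for match in re.finditer(r"^(\d+) ERROR", log_text, re.MULTILINE):
        counts[int(match.group(1))] += 1
    # The heap keeps selection O(n log k) instead of sorting everything.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])
```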

Step 2: Verify Candidate Solution

The generated program is executed in a sandbox; only solutions that run correctly and meet the intended behavior are kept.
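A minimal sketch of what this check could look like, assuming the candidate is a self‑contained Python script; the function name and timeout are illustrative, since the paper does not specify its sandbox at this level of detail.

```python
import subprocess
import sys

def run_in_sandbox(program: str, timeout_s: float = 5.0) -> bool:
    """Execute a candidate program in a child process and report
    whether it exits cleanly within the time budget."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # hangs and infinite loops count as failures
    return result.returncode == 0
```

A production sandbox would additionally isolate filesystem and network access; the subprocess here only contains crashes and timeouts.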

Step 3: Auto‑Generate Tests from the Reference Solution

Based on the verified program, the system creates input‑output test pairs and assembles them into a standard test function.
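One plausible implementation of this step, assuming the verified program exposes a callable reference function: run it over sampled inputs, record the outputs, and wrap the pairs in a test function. All names here are placeholders.

```python
def make_test_function(reference_fn, sample_inputs):
    """Derive input-output pairs from the verified reference solution
    and assemble them into a standard test function."""
    expected = [(args, reference_fn(*args)) for args in sample_inputs]

    def test_solution(candidate_fn):
        for args, want in expected:
            got = candidate_fn(*args)
            assert got == want, f"{args}: expected {want!r}, got {got!r}"

    return test_solution

# Usage: tests come from the trusted reference, then judge any candidate.
ref = lambda a, b: sorted(set(a) | set(b))
check = make_test_function(ref, [([3, 1], [2]), ([1], [1, 1])])
check(ref)  # the reference trivially passes its own tests
```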

Step 4: Reverse‑Engineer the Problem Statement

The reference program and its behavior are translated back into a natural‑language description, completing the task.
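This step is naturally a language‑model call; a hedged sketch follows, where `llm` stands for any prompt‑to‑text callable (no particular API is implied by the paper).

```python
PROBLEM_PROMPT = """\
You are given a reference program and its observed behavior.
Write a clear, self-contained problem statement that this program
solves. Do not reveal the implementation.

Reference program:
{program}

Sample input-output behavior:
{examples}
"""

def reverse_engineer_problem(program: str, examples: str, llm) -> str:
    """Translate the verified reference solution and its behavior back
    into a natural-language task description via a language model."""
    return llm(PROBLEM_PROMPT.format(program=program, examples=examples))
```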

Why This Construction Matters

Because the solution exists before the problem, the task is guaranteed to be solvable and its creativity stems from the solution’s internal composition rather than superficial problem wording.

How Exploratory Tasks Are Built

CreativeBench‑Explore starts from an existing problem and its baseline solution, then iteratively adds constraints to force the model away from familiar strategies.

Core Idea: Adding Constraints Is Easier Than Re‑Solving

Introducing a new restriction is often simpler than finding a completely new solution under that restriction.

Constraint Generator: identifies the key tactics used by the current solution and proposes negative constraints that block them.

Solver: attempts to solve the problem under the accumulated constraints.

Step 1: Start from the Original Problem and Baseline Solution

The process begins with a standard, well‑known solution serving as the baseline.

Step 2: Impose New Negative Constraints

The generator adds targeted constraints (e.g., banning certain sorting operations or control‑flow patterns) that invalidate the current solution.
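The simplest class of such constraints, banned function calls, can even be checked statically. Below is a toy AST‑based stand‑in; the benchmark's compliance judge is model‑based (see Step 4), and the helper name here is an assumption.

```python
import ast

def violates_ban(program: str, banned_calls: set[str]) -> bool:
    """Return True if the program calls any banned function,
    e.g. {'sort', 'sorted'} for a no-built-in-sorting constraint."""
    for node in ast.walk(ast.parse(program)):
        if isinstance(node, ast.Call):
            fn = node.func
            name = fn.id if isinstance(fn, ast.Name) else getattr(fn, "attr", None)
            if name in banned_calls:
                return True
    return False

assert violates_ban("xs.sort()", {"sort", "sorted"})
assert not violates_ban("max(xs)", {"sort", "sorted"})
```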

Step 3: Accumulate Constraints

Constraints are layered rather than replaced, progressively shrinking the solution space and forcing the model to devise fundamentally different algorithms.

Step 4: Solver Refines Under Feedback

The solver receives two feedback signals: a sandbox signal (does the program still run correctly?) and a judge signal (does it obey the accumulated constraints?). Both correctness and compliance are required.

Step 5: Terminate When No Further Progress

If the solver cannot find a feasible solution after many attempts, the construction stops, yielding a high‑difficulty exploratory task that measures sustained creative reasoning.
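Putting Steps 1 through 5 together, the loop might look like the sketch below. The four callables stand in for model‑backed components (run_in_sandbox is the earlier sketch), and the round and attempt budgets are illustrative assumptions, not values from the paper.

```python
def build_explore_task(problem, baseline_solution, generate_constraint,
                       solve, run_in_sandbox, judge_compliance,
                       max_rounds=10, max_attempts=8):
    """Self-play between a Constraint Generator and a Solver."""
    constraints = []              # Step 3: constraints accumulate, never reset
    solution = baseline_solution  # Step 1: start from the standard approach
    for _ in range(max_rounds):
        # Step 2: block a key tactic of the current solution.
        constraints.append(generate_constraint(problem, solution))
        for _ in range(max_attempts):
            # Step 4: retry under sandbox + judge feedback.
            candidate = solve(problem, constraints)
            if run_in_sandbox(candidate) and judge_compliance(candidate, constraints):
                solution = candidate
                break
        else:
            # Step 5: no feasible solution found -> the accumulated
            # constraints define a high-difficulty exploratory task.
            break
    return problem, constraints
```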

Difference from Traditional Benchmarks

Traditional benchmarks focus on whether a model can solve a problem. CreativeBench asks whether the model can solve it in a novel, effective way, separating correctness from genuine creativity.

Definition of Creativity

Creativity = Quality × Novelty

Quality means the solution is correct, functional, and meets the problem requirements.

Novelty means the solution deviates noticeably from common paths, introducing new ideas, structures, or methods.

Both dimensions are required: novelty without quality is meaningless deviation, and quality without novelty is mere replication.
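Operationalized, the definition is just a product of two normalized scores. How quality and novelty are actually measured is the benchmark's substance; the sketch below only fixes the algebra, with the [0, 1] normalization as an assumption.

```python
def creativity_score(quality: float, novelty: float) -> float:
    """Creativity = Quality x Novelty, both assumed normalized to [0, 1].
    Either factor at zero collapses the product: pure noise scores 0
    (quality = 0) and pure replication scores 0 (novelty = 0)."""
    assert 0.0 <= quality <= 1.0 and 0.0 <= novelty <= 1.0
    return quality * novelty

# A correct but textbook solution: creativity_score(1.0, 0.05) -> 0.05
# A correct and unusual solution:  creativity_score(0.9, 0.6)  -> 0.54
```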

Key Findings

1. Scale Helps Combination More Than Exploration

Larger models possess richer knowledge representations, making them better at integrating multiple elements (combinatorial creativity). However, greater scale does not reliably translate into better exploratory creativity.

2. Stronger Models Tend to Be More Conservative

As performance improves, models become more accurate and stable, often preferring high‑probability, safe answers, which can shrink the novelty space.

3. Reasoning Ability Is Crucial for Exploratory Creativity

Exploratory tasks require continuous reasoning, constraint handling, and the ability to rebuild solutions under new restrictions, highlighting the importance of structured analytical skills.

Enhancing Creativity with EvoRePE

EvoRePE extracts “creative‑biased” representations from successful evolutionary search trajectories and injects them as a lightweight intervention at inference time, guiding models toward more novel solutions without full retraining (a generic sketch of this style of intervention follows the list below). Its practical advantages:

No need to retrain the entire model.

Can be integrated into existing inference pipelines.

Compatible with other evolutionary methods.

Relatively low additional computational cost.
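The article does not give EvoRePE's exact mechanics, but the description matches the general pattern of activation steering. A generic PyTorch sketch under that assumption: take the mean difference between hidden states from successful search trajectories and ordinary ones, and add a scaled copy of that direction through a forward hook.

```python
import torch

def make_steering_hook(creative_states: torch.Tensor,
                       baseline_states: torch.Tensor,
                       alpha: float = 0.1):
    """Build a forward hook that nudges a layer's hidden states along a
    'creative' direction: the normalized mean difference between
    activations from successful trajectories and a baseline.
    Generic activation steering, not EvoRePE's exact recipe."""
    direction = creative_states.mean(dim=0) - baseline_states.mean(dim=0)
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Attach to a chosen transformer block:
# layer.register_forward_hook(make_steering_hook(creative_acts, baseline_acts))
```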

Value of EvoRePE

EvoRePE demonstrates that creativity can be enhanced not only by costly search but also by engineering representations that bias models toward innovative thinking.

Overall Value of CreativeBench

CreativeBench reshapes AI evaluation from “can the model solve the task?” to “can the model solve it in a new, high‑quality way?” This shift encourages research to focus on measuring and fostering genuine innovation in AI systems.

Conclusion

CreativeBench reframes evaluation from correctness alone to both novelty and quality.

It provides a clear, quantitative definition of creativity (Quality × Novelty).

It introduces EvoRePE, showing that creative ability can be boosted via representation‑level interventions.

Ultimately, CreativeBench offers a systematic platform for assessing whether increasingly powerful models are truly learning to create.

Tags: large language models, AI benchmark, EvoRePE, machine creativity, self‑evolving tasks
Written by PaperAgent, which publishes daily analyses of cutting‑edge AI research papers.