How CharacterFlywheel Scales Engaging LLMs: 15 Iterations of Production Optimization

The article presents CharacterFlywheel, a 15‑generation flywheel methodology that iteratively improves social‑dialogue LLMs in production using data‑driven reward models, rejection sampling, and a mix of SFT, DPO, and RL, with detailed experiments and best‑practice insights.


Background and Motivation

Current large language model (LLM) development focuses on assistant‑type AI (e.g., ChatGPT, Claude) with clear evaluation metrics, while social AI (e.g., Character.ai, Replika) lacks systematic research. CharacterFlywheel was created to fill this gap by providing a scientific, measurable way to boost AI's social conversation abilities in production.

Core Contributions: 15‑Generation Flywheel Methodology

From January 2024 to April 2025 the team iterated through 15 versions of a LLaMA 3.1‑based model, ultimately deploying the AI characters to Instagram, WhatsApp, and Messenger. A/B tests showed positive gains in 7 of 8 experiments, confirming the effectiveness of the approach.

Methodology Details

3.1 Mountain‑Climbing Analogy

The optimization process is likened to climbing an “attractiveness terrain”.

“Mountain identified. Time to climb.” — Ilya Sutskever

The four steps are:

(a) Landscape Climbing – overall optimization trajectory, gradually ascending the attractiveness peak.

(b) Data Sampling – sample data points at the current location to estimate the local terrain.

(c) Pre‑Herding – train a reward model that interpolates contour lines of the terrain.

(d) Herding – update the model position based on the estimated terrain.

3.2 Full Development Pipeline

The workflow is divided into three stages:

Data Consolidation: traffic curation and data annotation.

Pre‑Herding: reward‑model training and rejection sampling.

Herding: supervised fine‑tuning (SFT), direct preference optimization (DPO), reinforcement learning (RL), evaluation, and deployment of new versions.

3.3 Data Pipeline

Data sources include massive online production traffic and internal UI feedback from content and UX teams. The data‑filtering pipeline proceeds in three phases:

Phase I – privacy and safety filtering to ensure clean data.

Phase II – diversity sampling using DRAMA‑1B embeddings, down‑sampling redundant data while preserving distributional representativeness (see the sketch after this list).

Phase III – constrained adjustment with stratified sampling to balance multiple dimensions and align with target distributions.
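A minimal sketch of the Phase II diversity step, assuming a cluster‑then‑pick strategy: embed each dialogue (the paper uses DRAMA‑1B embeddings; any sentence encoder stands in here), cluster the embeddings, and keep the dialogue nearest each centroid. The keep ratio and the k‑means choice are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def diversity_sample(embeddings: np.ndarray, keep_ratio: float = 0.2) -> list[int]:
    """Return indices of a diverse, representative subset of the corpus."""
    n_keep = max(1, int(len(embeddings) * keep_ratio))
    km = KMeans(n_clusters=n_keep, n_init="auto", random_state=0).fit(embeddings)
    kept = []
    for c in range(n_keep):
        members = np.where(km.labels_ == c)[0]
        # keep the dialogue closest to each cluster centroid
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        kept.append(int(members[np.argmin(dists)]))
    return kept
```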

Reward Model: Quantifying “Attractiveness”

4.1 Dual‑Track Preference Model

Because attractiveness is non‑differentiable, two surrogate models are trained:

Pointwise Model: scores each response independently; used to guide RL training.

Pairwise Model: jointly encodes two responses and classifies which is better; combined with pointwise scores to mitigate reward‑hacking.
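A minimal PyTorch sketch of the two surrogate heads, assuming a shared pooled transformer `encoder` standing in for the paper's backbone; the Bradley‑Terry‑style binary loss on the pairwise track is an assumption consistent with "classifies which is better", not the paper's confirmed objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointwiseRM(nn.Module):
    """Scores one (context, reply) sequence independently; used to guide RL."""
    def __init__(self, encoder: nn.Module, hidden: int):
        super().__init__()
        self.encoder, self.head = encoder, nn.Linear(hidden, 1)

    def forward(self, ctx_reply: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(ctx_reply)).squeeze(-1)  # scalar score

class PairwiseRM(nn.Module):
    """Jointly encodes (context, reply_A, reply_B); emits an 'A beats B' logit."""
    def __init__(self, encoder: nn.Module, hidden: int):
        super().__init__()
        self.encoder, self.head = encoder, nn.Linear(hidden, 1)

    def forward(self, ctx_a_b: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(ctx_a_b)).squeeze(-1)

def pairwise_loss(logits: torch.Tensor, a_better: torch.Tensor) -> torch.Tensor:
    # binary cross-entropy on the "A is better" preference label
    return F.binary_cross_entropy_with_logits(logits, a_better.float())
```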

4.2 User‑Signal Model

User behavior signals (e.g., continue probability, thumbs‑up) are highly correlated with the reward model but are vulnerable to reward‑hacking if used directly for RL. They are instead employed for rejection‑sampling ranking, and only while their win‑rate against the reward model stays below a safety threshold (see Section 7.4).
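A sketch of how that gating could look in code; `user_signal_score` and `reward_score` are hypothetical scoring callables, the win‑rate is assumed to be computed offline, and the 65 % threshold follows Section 7.4.

```python
def rank_candidates(candidates, user_signal_score, reward_score,
                    rm_win_rate, threshold=0.65):
    """Rank candidate replies for rejection sampling.

    Use the user-signal model only while its reward-model win-rate stays
    below the safety threshold; past that point it is likely being hacked,
    so fall back to the reward model.
    """
    key = user_signal_score if rm_win_rate < threshold else reward_score
    return sorted(candidates, key=key, reverse=True)
```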

Training Strategy: SFT + DPO + RL Combo

5.1 Rejection Sampling

1. Select the model best suited for the current prompt from a candidate pool.

2. Generate k candidate replies.

3. Score each with the reward model and retain those with score ≥ τ.

4. Construct a high‑quality SFT dataset from the retained samples.

A tight model‑iteration loop, using the latest user traffic to rebuild the dataset, approximates on‑policy behavior despite the off‑policy nature of rejection sampling.
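Putting the four steps together, a minimal sketch of the loop, where `pick_model`, `generate`, and `reward_model` are hypothetical callables and k, τ are the hyperparameters above:

```python
def rejection_sample(prompts, pick_model, generate, reward_model, k=8, tau=0.7):
    sft_dataset = []
    for prompt in prompts:
        model = pick_model(prompt)                             # step 1: best-suited model
        replies = [generate(model, prompt) for _ in range(k)]  # step 2: k candidates
        for reply in replies:
            if reward_model(prompt, reply) >= tau:             # step 3: keep score >= tau
                sft_dataset.append({"prompt": prompt, "reply": reply})  # step 4
    return sft_dataset
```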

5.2 Online RL: Online DPO vs. GRPO

The team compared standard online Direct Preference Optimization (Online DPO) with Group Relative Policy Optimization (GRPO), a variant that incorporates importance‑sampling correction. GRPO achieved a +1.52 % lift in engagement breadth over Online DPO by leveraging reward scores of all generated replies.
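The group‑relative part of GRPO is what lets it use the reward scores of all k replies for one prompt: each reply's advantage is its reward normalized against its own sampling group. A minimal sketch of that computation (the clipped importance‑sampling policy update itself is omitted):

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """group_rewards: shape (k,) — reward-model scores for the k replies
    sampled from one prompt. Advantage = reward standardized within the group."""
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + eps)
```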

5.3 Style‑Artifact Mitigation

To prevent over‑optimization of superficial style cues, the team monitors features such as reply length, presence of lists, emoji count, and specific phrases (e.g., “I feel like…”). By comparing high‑ and low‑score replies, they ensure style does not become a spurious reward signal.
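A sketch of that monitoring, assuming replies are plain strings; the feature set mirrors the cues listed above, and a persistent gap between high‑ and low‑scored replies on any feature flags it as a potential spurious reward signal.

```python
import re

EMOJI = re.compile(r"[\U0001F300-\U0001FAFF]")

def style_features(reply: str) -> dict:
    return {
        "length": len(reply),
        "has_list": bool(re.search(r"^\s*[-*\d]", reply, re.M)),
        "emoji_count": len(EMOJI.findall(reply)),
        "stock_phrase": "i feel like" in reply.lower(),
    }

def style_gap(high_scored: list[str], low_scored: list[str]) -> dict:
    """Per-feature mean difference between high- and low-scored replies."""
    def mean(replies, key):
        return sum(float(style_features(r)[key]) for r in replies) / len(replies)
    return {k: mean(high_scored, k) - mean(low_scored, k)
            for k in style_features("").keys()}
```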

Key Results

6.1 Pre‑Release Phase (V1‑V7)

Win‑rate against GPT‑4o rose from 37.4 % (V3) to 46.2 % (V7). Human evaluation scores improved from 50.2 % to 52.5 %, and reward‑model evaluations from 53.6 % to 57.6 %, both surpassing the 50 % neutral line.

6.2 Post‑Release Phase (V8‑V15)

Engagement breadth increased by +4.47 % (V11) and +8.8 % (V14). Version 12 showed a rare regression (‑2.9 % depth). Reward‑model user win‑rate peaked at 70.7 % while internal win‑rate dropped to 43.7 %, indicating overfitting to user‑biased signals.

Insights and Best Practices

7.1 Image Generation Impact

Implicit image generation (the model decides when to generate) adds more engagement (+2.1 % breadth) than explicit, user‑triggered generation (+1.7 % breadth) because it enriches the dialogue without requiring a user prompt.

7.2 Near‑Policy vs. Off‑Policy

Near‑policy data (traffic from the latest model) yields a +10.6 % depth improvement over the off‑policy baseline.

7.3 Variance‑Based Hard Sample Sampling

For each prompt, multiple replies are sampled and the variance of reward‑model scores is computed. High variance indicates a difficult prompt, providing a more robust difficulty signal than mean score alone.
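A minimal sketch of the mining step, with `generate` and `reward_model` as hypothetical callables:

```python
import statistics

def hardest_prompts(prompts, generate, reward_model, k=8, top_n=100):
    scored = []
    for prompt in prompts:
        rewards = [reward_model(prompt, generate(prompt)) for _ in range(k)]
        scored.append((statistics.variance(rewards), prompt))
    scored.sort(reverse=True)  # high score variance = hard prompt
    return [p for _, p in scored[:top_n]]
```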

7.4 Limitations of the User‑Signal Model

The model is suitable for rejection‑sampling ranking when the reward‑model win‑rate stays below 65 %, but it should not be used directly for RL optimization due to susceptibility to reward‑hacking.

7.5 Historical Bias Propagation

Even after removing emojis from the reward‑model input, RL training re‑introduces them (usage rises from 0.2 to 0.48) because the autoregressive policy mimics dialogue history rather than the reward signal. Mitigation involves preprocessing prompts, bias monitoring, and corrective training.
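A sketch of the bias‑monitoring piece, assuming replies are sampled from each RL checkpoint by some eval harness; the emoji regex and the drift tolerance are illustrative assumptions.

```python
import re

EMOJI = re.compile(r"[\U0001F300-\U0001FAFF]")

def emoji_rate(replies: list[str]) -> float:
    """Mean emoji count per reply; tracked across checkpoints, a climb
    like 0.2 -> 0.48 signals history-driven bias re-emerging."""
    return sum(len(EMOJI.findall(r)) for r in replies) / max(1, len(replies))

def drift_alert(latest_rate: float, baseline_rate: float, tol: float = 0.1) -> bool:
    # Flag a checkpoint whose emoji rate drifts past baseline + tolerance.
    return latest_rate > baseline_rate + tol
```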

Paper: CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production
https://arxiv.org/pdf/2603.01973