
Training a Positive Review Generator with RLHF and PPO

This article demonstrates how to use Reinforcement Learning from Human Feedback (RLHF) with a PPO algorithm and a sentiment‑analysis model to train a language model that generates positive product reviews, covering task definition, data sampling, reward evaluation, model optimization, and experimental results.

IT Architects Alliance

With the recent popularity of ChatGPT, many are interested in the core idea behind it: Reinforcement Learning from Human Feedback (RLHF). Compared with supervised learning, reinforcement learning lets the model explore parameter updates more freely, which can push it past the performance ceiling of purely supervised methods.

This article works through a concrete example: training a "positive review generator". The model receives a prompt such as "Just received the goods, feeling..." and must complete it. Initially the model produces neutral or negative completions; rewards then steer it toward positive ones.

Rewards are obtained by feeding each generated sentence into a sentiment‑analysis model (implemented with the transformers sentiment‑analysis pipeline) that outputs a probability of positive sentiment. This probability (0.0–1.0) serves as the reward for the PPO update.

The training pipeline consists of three stages:

2.1 Generation (Rollout)

Prompts are sampled from a predefined pool and fed to a GPT‑2 model to generate candidate sentences.

import random
import torch

prompts = [
    '刚收到货,感觉',          # "Just received the goods, feeling..."
    '这部电影很',              # "This movie is very..."
    '说实话,真的很',          # "To be honest, it's really very..."
    '这次购物总的来说体验很'   # "Overall, this shopping experience was very..."
]

batch = {'tokens': [], 'query': []}
for _ in range(config['batch_size']):
    random_prompt = random.choice(prompts)   # sample a prompt from the pool
    tokens = gpt2_tokenizer.encode(random_prompt)
    batch['tokens'].append(tokens)
    batch['query'].append(random_prompt)
query_tensors = [torch.tensor(t).long().to(device) for t in batch['tokens']]

response_tensors = []
for i in range(config['batch_size']):
    gen_len = config['gen_len']
    response = gpt2_model.generate(query_tensors[i].unsqueeze(dim=0),
                                   max_new_tokens=gen_len, **gen_kwargs)
    response_tensors.append(response.squeeze()[-gen_len:])   # keep only the new tokens
batch['response'] = [gpt2_tokenizer.decode(r) for r in response_tensors]

After generation, a list of model outputs is obtained.

[
    '刚收到货,感觉 很 一 般',                        # "Just received the goods, feeling pretty mediocre"
    '这部电影很 俗 而 且 很 无 趣',                  # "This movie is tacky and really boring"
    '这次购物总的来说体验很 烂 不 是 我 想 要 的',    # "Overall this shopping experience was awful, not what I wanted"
    ...
]

2.2 Reward Evaluation

The sentiment model is initialized and applied to the concatenated prompt‑response texts to produce reward scores.

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

senti_tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-jd-binary-chinese')
senti_model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-jd-binary-chinese')
sentiment_pipe = pipeline('sentiment-analysis', model=senti_model, tokenizer=senti_tokenizer, device=pipe_device)
texts = [q + r for q, r in zip(batch['query'], batch['response'])]   # prompt + completion
pipe_outputs = sentiment_pipe(texts)
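The pipeline returns a list of dicts like `{'label': ..., 'score': ...}` rather than raw probabilities, so a small helper is needed to turn each output into a scalar reward. A minimal sketch — the exact label strings of the jd-binary model are an assumption here, so inspect `pipe_outputs` on your machine before relying on the string match:

```python
def extract_rewards(pipe_outputs):
    """Turn sentiment-pipeline outputs into scalar rewards in [0, 1].

    Each element looks like {'label': ..., 'score': ...}. If the top label is
    the positive class, use its score directly; otherwise, for a binary
    classifier, the positive probability is 1 - score.
    """
    rewards = []
    for out in pipe_outputs:
        if 'positive' in out['label'].lower():
            rewards.append(out['score'])
        else:
            rewards.append(1.0 - out['score'])
    return rewards
```

With this helper, the rewards for the PPO step are simply `rewards = extract_rewards(sentiment_pipe(texts))`.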

The resulting rewards might look like:

[
    0.4,
    0.3,
    0.3,
    ...
]

2.3 Model Optimization (PPO)

The PPO trainer updates the language model using the computed rewards. The core update call is a single line:

ppo_trainer.step(query_tensors, response_tensors, rewards)  # PPO Update
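Putting the three stages together, the outer training loop looks roughly like the sketch below. The helpers `sample_batch`, `generate_responses`, and `score_with_sentiment` are hypothetical stand-ins for the rollout and reward code shown earlier, not library functions:

```python
def train_loop(ppo_trainer, num_epochs, sample_batch, generate_responses,
               score_with_sentiment):
    """One RLHF iteration = rollout -> reward -> PPO step, repeated."""
    stats = None
    for _ in range(num_epochs):
        queries = sample_batch()                               # 1. sample prompts
        responses = generate_responses(queries)                # 2. GPT-2 rollout
        rewards = score_with_sentiment(queries, responses)     # 3. sentiment rewards
        stats = ppo_trainer.step(queries, responses, rewards)  # 4. PPO update
    return stats
```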

During each PPO update, two losses are computed:

pg_loss : the actor (policy) loss, computed from the advantages and the clipped importance ratios.

value_loss : the critic loss, the mean‑squared error between the value head's prediction and the target return.

Key code snippets for these losses include:

loss_p, loss_v, train_stats = self.loss(logprobs, values, rewards, query, response, model_input)
loss = loss_p + loss_v
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
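For readers who want to see what goes on inside `self.loss`, here is a plain-Python sketch of the two losses for a single trajectory. The clipping threshold 0.2 is the common PPO default, assumed here; real implementations operate on batched tensors:

```python
import math

def ppo_losses(logprobs, old_logprobs, advantages, values, returns, clip_eps=0.2):
    """Clipped PPO policy loss plus critic MSE, per-token then averaged."""
    pg_terms, vf_terms = [], []
    for lp, old_lp, adv, v, ret in zip(logprobs, old_logprobs,
                                       advantages, values, returns):
        ratio = math.exp(lp - old_lp)                        # importance ratio
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        pg_terms.append(-min(ratio * adv, clipped * adv))    # clipped surrogate
        vf_terms.append((v - ret) ** 2)                      # critic MSE
    return sum(pg_terms) / len(pg_terms), sum(vf_terms) / len(vf_terms)
```

When the new policy equals the old one the ratio is 1 and the policy loss reduces to the negative mean advantage; the clip keeps large ratios from dominating the update.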

A custom GPT‑2 model with a scalar value head is defined as follows:

class GPT2HeadWithValueModel(GPT2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        config.num_labels = 1
        self.transformer = GPT2Model(config)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.v_head = ValueHead(config)  # adds a scalar value head for the critic
        self.init_weights()

class ValueHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.summary = nn.Linear(config.hidden_size, 1)  # hidden_size -> 1

    def forward(self, hidden_states):
        return self.summary(hidden_states)  # one scalar value per token

The value loss is computed as the mean‑squared error between the predicted values and the returns:

returns = advantages + values
logits, _, vpred = self.model(model_input)
vf_losses1 = (vpred - returns) ** 2  # MSE
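The advantages in `returns = advantages + values` typically come from Generalized Advantage Estimation (GAE). A plain-Python sketch over one response's per-token rewards and values (the `gamma` and `lam` defaults are illustrative assumptions):

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Backward GAE pass: advantage[t] = delta[t] + gamma*lam*advantage[t+1]."""
    advantages, last_adv = [], 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        last_adv = delta + gamma * lam * last_adv
        advantages.append(last_adv)
    advantages.reverse()
    returns = [a + v for a, v in zip(advantages, values)]    # critic targets
    return advantages, returns
```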

3. Experimental Results

The training curve shows the average reward rising from around 0.68 to 0.85 as training progresses.

Early in training the model generates random, often negative comments, while later it learns to produce predominantly positive sentiment outputs.

The full source code is available at github.com/HarderThenHarder/transformers_tasks/tree/main/RLHF .

Tags: sentiment analysis, reinforcement learning, RLHF, PPO, language model, GPT
Written by

IT Architects Alliance

Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.
