Session‑Level Sample Organization for Decoder‑Only LLM Fine‑Tuning

This article explains how to restructure multi‑turn dialogue data into single session‑level training samples for decoder‑only large language models, leveraging causal attention and simple position IDs, and provides a concrete implementation, performance results, and a gradient‑weight analysis.

Key Characteristics of Decoder‑Only Models

Decoder‑only architectures use causal (triangular) attention, so each token can attend only to previous tokens. Position IDs encode only token order, unlike in models such as GLM, which require special position handling.
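
A minimal PyTorch sketch of these two properties (illustrative only, not the model's actual implementation):

import torch

seq_len = 6
# Causal (lower-triangular) mask: token i may attend only to tokens 0..i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# Position ids carry nothing but token order
position_ids = torch.arange(seq_len)
print(causal_mask[2])  # tensor([ True,  True,  True, False, False, False])
print(position_ids)    # tensor([0, 1, 2, 3, 4, 5])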

Session‑Level Sample Construction

Concatenate the whole dialogue into a single sequence and insert an <eos> token after each answer. During loss computation, mask tokens so that only answer portions (A1, A2, …) contribute to the loss. This enables training on whole sessions rather than isolated turns.
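
A toy sketch of this packing and masking, with made‑up token ids standing in for real tokenizer output:

# Hypothetical token ids; a real tokenizer would produce these.
EOS = 2
turns = [([11, 12], [21, 22]),   # (Q1 ids, A1 ids)
         ([13, 14], [23, 24])]   # (Q2 ids, A2 ids)

input_ids, labels = [], []
for q_ids, a_ids in turns:
    input_ids += q_ids
    labels += [-100] * len(q_ids)   # question tokens: masked out of the loss
    input_ids += a_ids + [EOS]
    labels += a_ids + [EOS]         # answer tokens and <eos>: supervised

print(input_ids)  # [11, 12, 21, 22, 2, 13, 14, 23, 24, 2]
print(labels)     # [-100, -100, 21, 22, 2, -100, -100, 23, 24, 2]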

[Figure: causal attention diagram]

ChatGLM‑1 cannot use this format because its position IDs are not purely order‑based. ChatGLM‑2 adopts true causal attention and simple incremental position IDs, making session‑level training feasible.
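
For contrast, a rough sketch of the two schemes, based on GLM's published 2‑D positional encoding (the sequence length and mask position are illustrative, not taken from the library code):

import torch

seq_len, mask_pos = 8, 3   # assume the mask token sits at position 3

# ChatGLM-2 / standard decoder-only: positions are pure token order
position_ids_v2 = torch.arange(seq_len)                # 0, 1, 2, ..., 7

# ChatGLM-1 (GLM-style 2-D positions): generated tokens all reuse the
# mask position, and a second channel counts within the generated span
position_ids_v1 = torch.cat([
    torch.arange(mask_pos + 1),                        # 0, 1, 2, 3
    torch.full((seq_len - mask_pos - 1,), mask_pos),   # 3, 3, 3, 3
])
block_position_ids = torch.cat([
    torch.zeros(mask_pos + 1, dtype=torch.long),       # 0, 0, 0, 0
    torch.arange(1, seq_len - mask_pos),               # 1, 2, 3, 4
])

With two position channels tied to a single mask span, concatenating several question–answer rounds into one sequence has no natural position layout, which is why session‑level packing only becomes straightforward with ChatGLM‑2's incremental IDs.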

Implementation Example (ChatGLM‑2‑6B)

conversation = ''
input_ids = []
labels = []
eos_id = tokenizer.eos_token_id
turn_idx = 0
for sentence in examples[prompt_column][i]:
    sentence_from = sentence["from"].lower()
    if sentence_from == 'human':
        # Human turn: keep ChatGLM-2's original prompt format
        sentence_value = f'[Round {turn_idx}]\n\n问:{sentence["value"]}\n\n答:'
        sentence_ids = tokenizer.encode(sentence_value, add_special_tokens=False)
        label = [-100] * len(sentence_ids)   # mask the prompt tokens in the loss
    else:
        # Assistant turn: answer tokens are supervised
        sentence_value = sentence["value"] + '\n\n'
        sentence_ids = tokenizer.encode(sentence_value, add_special_tokens=False)
        label = sentence_ids
    conversation += sentence_value
    input_ids += sentence_ids
    labels += label
    if sentence_from != 'human':
        # Close each answer with <eos>, which is also supervised
        input_ids += [eos_id]
        labels += [eos_id]
        turn_idx += 1
# Encoding the empty string prepends the model's two special prefix tokens
# (gMASK and sop in ChatGLM-2); mask both of them in the labels
input_ids = tokenizer.encode('') + input_ids
labels = [-100] * 2 + labels
# Pad to max_seq_length and mask the padding so it does not affect the loss
pad_len = max_seq_length - len(input_ids)
input_ids = input_ids + [eos_id] * pad_len
labels = labels + [-100] * pad_len

The code inserts the special prefix tokens at the beginning, keeps the original prompt format, and masks both prompt and padding tokens with -100 so they do not affect the loss.
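
Those -100 positions drop out of the loss because PyTorch's cross‑entropy ignores that index by default. A minimal sketch with dummy logits (not the repository's actual training loop):

import torch
import torch.nn.functional as F

vocab_size = 100
logits = torch.randn(1, 8, vocab_size)   # dummy model output: (batch, seq, vocab)
labels = torch.tensor([[-100, -100, 21, 22, 2, -100, -100, -100]])

# Shift by one position for next-token prediction, as in causal LM training
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,   # prompt and padding positions contribute nothing
)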

Training Results

On the same dataset, ChatGLM‑2‑6B achieved a final training loss around 1.x, while ChatGLM‑1 plateaued at about 2.x. Evaluation metrics such as ROUGE showed noticeable improvements.

[Figure: training loss curve]

Gradient‑Weight Analysis

Session‑level training treats the whole dialogue as one sample, so the loss is averaged once over all answer tokens, much like enlarging the batch. Splitting the turns into separate samples averages each sample's loss independently and then sums the resulting gradients, which amplifies the effective learning rate and skews the token‑level weighting toward earlier turns.

For a simplified three‑turn example, the gradient weight distribution for tokens A, B, C is:

Session‑level: 2/3, 1/6, 1/6

Split‑turn: 17/24, 5/24, 1/12

Later turns receive relatively higher influence in session‑level training, matching the intuition that early turns are often repetitive.
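
The exact fractions depend on the per‑answer token counts and on how the split samples are built, details the article does not restate. The sketch below assumes answers A, B, C carry 4, 1, and 1 supervised tokens (which reproduces the session‑level fractions above) and one plausible split convention (sample k supervises turns 1 through k, loss averaged per sample); its split‑turn fractions therefore differ from the article's, but the skew toward early turns comes out the same:

from fractions import Fraction

# Assumed token counts for answers A, B, C; chosen to reproduce the
# article's session-level fractions, not taken from the article itself.
counts = [4, 1, 1]

# Session-level: one sample, one average over all answer tokens.
total = sum(counts)
session = [Fraction(n, total) for n in counts]

# Split-turn (assumed convention): sample k supervises answers 1..k with a
# per-sample mean, and the per-sample gradients are then combined.
split = [Fraction(0)] * len(counts)
for k in range(1, len(counts) + 1):
    seen = sum(counts[:k])                   # tokens supervised in sample k
    for j in range(k):
        split[j] += Fraction(counts[j], seen)
norm = sum(split)
split = [w / norm for w in split]

print([str(w) for w in session])  # ['2/3', '1/6', '1/6']
print([str(w) for w in split])    # ['37/45', '11/90', '1/18']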

Conclusion

Organizing multi‑turn dialogue as a single session sample leverages the properties of decoder‑only models, reduces padding overhead, and yields better fine‑tuning performance. The provided code and empirical results demonstrate the practicality of this approach for ChatGLM‑2‑6B and similar models.

Code repository: https://github.com/SpongebBob/Finetune-ChatGLM2-6B

Tags: prompt engineering · LLM fine-tuning · ChatGLM2 · decoder-only · session-level training
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.
