Session‑Level Sample Organization for Decoder‑Only LLM Fine‑Tuning
This article explains how to restructure multi‑turn dialogue data into single session‑level training samples for decoder‑only large language models, leveraging causal attention and simple position IDs, and provides a concrete implementation, performance results, and a gradient‑weight analysis.
Key Characteristics of Decoder‑Only Models
Decoder‑only architectures use causal (triangular) attention, so each token can attend only to previous tokens. Position IDs encode only token order, unlike models such as GLM that require special position handling.
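The sketch below illustrates these two properties for a generic decoder‑only model; the sequence length and the use of PyTorch are illustrative assumptions, not tied to any particular checkpoint.

import torch

seq_len = 6

# Causal (lower-triangular) attention mask: position i may attend only to positions j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Position IDs are simply the token order: 0, 1, 2, ...
position_ids = torch.arange(seq_len)

print(causal_mask.int())
print(position_ids)  # tensor([0, 1, 2, 3, 4, 5])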
Session‑Level Sample Construction
Concatenate the whole dialogue into a single sequence and insert an <eos> token after each answer. During loss computation, mask tokens so that only answer portions (A1, A2, …) contribute to the loss. This enables training on whole sessions rather than isolated turns.
ChatGLM‑1 cannot use this format because its position IDs are not pure order‑based. ChatGLM‑2 adopts true causal attention and simple incremental position IDs, making session‑level training feasible.
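Before the real implementation, here is a toy sketch of how one packed two‑turn session looks; the token IDs are made up for illustration, while the actual code below uses the ChatGLM‑2 tokenizer.

# Hypothetical two-turn session: Q1 A1 <eos> Q2 A2 <eos>
IGNORE = -100
eos = 2

q1, a1 = [11, 12, 13], [21, 22]
q2, a2 = [14, 15], [23, 24, 25]

# Only the answer tokens and the trailing <eos> are supervised; question tokens are masked.
input_ids = q1 + a1 + [eos] + q2 + a2 + [eos]
labels    = [IGNORE] * len(q1) + a1 + [eos] + [IGNORE] * len(q2) + a2 + [eos]

assert len(input_ids) == len(labels)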
Implementation Example (ChatGLM‑2‑6B)
conversation = ''
input_ids = []
labels = []
eos_id = tokenizer.eos_token_id
turn_idx = 0
for sentence in examples[prompt_column][i]:
    sentence_from = sentence["from"].lower()
    if sentence_from == 'human':
        # Human turn: keep the original ChatGLM prompt format and mask its loss
        sentence_value = f'[Round {turn_idx}]\n问:' + sentence["value"] + '\n答:'
        sentence_ids = tokenizer.encode(sentence_value, add_special_tokens=False)
        label = [-100] * len(sentence_ids)  # one -100 per prompt token, not per character
    else:
        # Assistant turn: the answer tokens are the training targets
        sentence_value = sentence["value"] + '\n'
        sentence_ids = tokenizer.encode(sentence_value, add_special_tokens=False)
        label = sentence_ids
    conversation += sentence_value
    input_ids += sentence_ids
    labels += label
    if sentence_from != 'human':
        # Close each answer with <eos>, which is itself a supervised target
        input_ids += [eos_id]
        labels += [eos_id]
        turn_idx += 1
# Add the BOS and gMASK tokens that tokenizer.encode('') produces
input_ids = tokenizer.encode('') + input_ids
labels = [-100] * 2 + labels
# Pad to max_seq_length and mask the padding out of the loss
pad_len = max_seq_length - len(input_ids)
input_ids = input_ids + [eos_id] * pad_len
labels = labels + [-100] * pad_len

The code inserts the BOS and gMASK tokens at the beginning, keeps the original prompt format, and masks prompt and padding positions with -100 so they do not affect the loss.
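For context, decoder‑only causal‑LM training in Hugging Face‑style models shifts the labels by one position and ignores -100 entries; the minimal sketch below shows that computation (the function name and tensor shapes are placeholders, not the ChatGLM‑2 API).

import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len) with -100 where masked
    shift_logits = logits[:, :-1, :]   # position t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,             # prompt and padding positions drop out of the loss
    )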
Training Results
On the same dataset, ChatGLM‑2‑6B converged to a final training loss in the 1.x range, while ChatGLM‑1 plateaued around 2.x. Evaluation metrics such as ROUGE also showed noticeable improvements.
Gradient‑Weight Analysis
Session‑level training packs the whole dialogue into one sample, so every answer token is supervised exactly once and its gradient is averaged with the other turns, which is effectively like enlarging the batch. Split‑turn training breaks the session into several samples, so gradients from the repeated turns are summed rather than averaged; this amplifies the effective learning rate and skews the per‑token weighting.
For a simplified three‑turn example, the gradient weight distribution over the three answers A, B, and C (earliest to latest turn) is:
Session‑level: 2/3, 1/6, 1/6
Split‑turn: 17/24, 5/24, 1/12
Later turns receive relatively higher influence in session‑level training, matching the intuition that early turns are often repetitive.
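The sketch below reproduces only the mechanism, not the article's exact numbers: it assumes three equal‑length answers and a split scheme in which the sample for turn k also supervises the earlier answers it repeats, so those answers are counted again in every later sample.

# Illustrative only: equal-length answers A, B, C; not the article's exact example.
answers = ["A", "B", "C"]

# Session-level: one sample, one mean over all supervised tokens -> equal weights here.
session = {a: 1 / len(answers) for a in answers}

# Split-turn (assumed scheme): sample k contains turns 1..k with their answers supervised;
# each sample's loss is a mean over its tokens, and the sample losses are summed.
split = {a: 0.0 for a in answers}
for k in range(1, len(answers) + 1):
    for a in answers[:k]:
        split[a] += 1 / k
total = sum(split.values())
split = {a: round(w / total, 3) for a, w in split.items()}

print(session)  # each answer weighted 1/3
print(split)    # {'A': 0.611, 'B': 0.278, 'C': 0.111} -> early answers over-weighted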
Conclusion
Organizing multi‑turn dialogue as a single session sample leverages the properties of decoder‑only models, reduces padding overhead, and yields better fine‑tuning performance. The provided code and empirical results demonstrate the practicality of this approach for ChatGLM‑2‑6B and similar models.
Code repository: https://github.com/SpongebBob/Finetune-ChatGLM2-6B
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.