Session‑Level Sample Organization for Decoder‑Only LLM Fine‑Tuning
This article explains how to restructure multi‑turn dialogue data into single session‑level training samples for decoder‑only large language models, leveraging causal attention and simple position IDs, and provides a concrete implementation, performance results, and a gradient‑weight analysis.
Key Characteristics of Decoder‑Only Models
Decoder‑only architectures use causal (triangular) attention, so each token can attend only to previous tokens. Position IDs encode only token order, unlike models such as GLM that require special position handling.
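The sketch below illustrates these two properties for a generic decoder‑only model; the sequence length and the use of PyTorch are illustrative assumptions, not tied to any particular checkpoint.

import torch

seq_len = 6

# Causal (lower-triangular) attention mask: position i may attend only to positions j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Position IDs are simply the token order: 0, 1, 2, ...
position_ids = torch.arange(seq_len)

print(causal_mask.int())
print(position_ids)  # tensor([0, 1, 2, 3, 4, 5])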
Session‑Level Sample Construction
Concatenate the whole dialogue into a single sequence and insert an <eos> token after each answer. During loss computation, mask tokens so that only answer portions (A1, A2, …) contribute to the loss. This enables training on whole sessions rather than isolated turns.
ChatGLM‑1 cannot use this format because its position IDs are not pure order‑based. ChatGLM‑2 adopts true causal attention and simple incremental position IDs, making session‑level training feasible.
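Before the real implementation, here is a toy sketch of how one packed two‑turn session looks; the token IDs are made up for illustration, while the actual code below uses the ChatGLM‑2 tokenizer.

# Hypothetical two-turn session: Q1 A1 <eos> Q2 A2 <eos>
IGNORE = -100
eos = 2

q1, a1 = [11, 12, 13], [21, 22]
q2, a2 = [14, 15], [23, 24, 25]

# Only the answer tokens and the trailing <eos> are supervised; question tokens are masked.
input_ids = q1 + a1 + [eos] + q2 + a2 + [eos]
labels    = [IGNORE] * len(q1) + a1 + [eos] + [IGNORE] * len(q2) + a2 + [eos]

assert len(input_ids) == len(labels)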
Implementation Example (ChatGLM‑2‑6B)
conversation = ''
input_ids = []
labels = []
eos_id = tokenizer.eos_token_id
turn_idx = 0
for sentence in examples[prompt_column][i]:
    sentence_from = sentence["from"].lower()
    if sentence_from == 'human':
        # Human turn: keep the original ChatGLM prompt format and mask its loss
        sentence_value = f'[Round {turn_idx}]\n问:' + sentence["value"] + '\n答:'
        sentence_ids = tokenizer.encode(sentence_value, add_special_tokens=False)
        label = [-100] * len(sentence_ids)  # one -100 per prompt token, not per character
    else:
        # Assistant turn: the answer tokens are the training targets
        sentence_value = sentence["value"] + '\n'
        sentence_ids = tokenizer.encode(sentence_value, add_special_tokens=False)
        label = sentence_ids
    conversation += sentence_value
    input_ids += sentence_ids
    labels += label
    if sentence_from != 'human':
        # Close each answer with <eos>, which is itself a supervised target
        input_ids += [eos_id]
        labels += [eos_id]
        turn_idx += 1
# Add the BOS and gMASK tokens that tokenizer.encode('') produces
input_ids = tokenizer.encode('') + input_ids
labels = [-100] * 2 + labels
# Pad to max_seq_length and mask the padding out of the loss
pad_len = max_seq_length - len(input_ids)
input_ids = input_ids + [eos_id] * pad_len
labels = labels + [-100] * pad_len

The code inserts the BOS and gMASK tokens at the beginning, keeps the original prompt format, and masks prompt and padding positions with -100 so they do not affect the loss.
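For context, decoder‑only causal‑LM training in Hugging Face‑style models shifts the labels by one position and ignores -100 entries; the minimal sketch below shows that computation (the function name and tensor shapes are placeholders, not the ChatGLM‑2 API).

import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len) with -100 where masked
    shift_logits = logits[:, :-1, :]   # position t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,             # prompt and padding positions drop out of the loss
    )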
Training Results
On the same dataset, ChatGLM‑2‑6B converged to a final training loss in the 1.x range, while ChatGLM‑1 plateaued around 2.x. Evaluation metrics such as ROUGE also showed noticeable improvements.
Gradient‑Weight Analysis
Session‑level training packs the whole dialogue into one sample, so every answer token is supervised exactly once and its gradient is averaged with the other turns, which is effectively like enlarging the batch. Split‑turn training breaks the session into several samples, so gradients from the repeated turns are summed rather than averaged; this amplifies the effective learning rate and skews the per‑token weighting.
For a simplified three‑turn example, the gradient weight distribution over the three answers A, B, and C (earliest to latest turn) is:
Session‑level: 2/3, 1/6, 1/6
Split‑turn: 17/24, 5/24, 1/12
Later turns receive relatively higher influence in session‑level training, matching the intuition that early turns are often repetitive.
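The sketch below reproduces only the mechanism, not the article's exact numbers: it assumes three equal‑length answers and a split scheme in which the sample for turn k also supervises the earlier answers it repeats, so those answers are counted again in every later sample.

# Illustrative only: equal-length answers A, B, C; not the article's exact example.
answers = ["A", "B", "C"]

# Session-level: one sample, one mean over all supervised tokens -> equal weights here.
session = {a: 1 / len(answers) for a in answers}

# Split-turn (assumed scheme): sample k contains turns 1..k with their answers supervised;
# each sample's loss is a mean over its tokens, and the sample losses are summed.
split = {a: 0.0 for a in answers}
for k in range(1, len(answers) + 1):
    for a in answers[:k]:
        split[a] += 1 / k
total = sum(split.values())
split = {a: round(w / total, 3) for a, w in split.items()}

print(session)  # each answer weighted 1/3
print(split)    # {'A': 0.611, 'B': 0.278, 'C': 0.111} -> early answers over-weighted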
Conclusion
Organizing multi‑turn dialogue as a single session sample leverages the properties of decoder‑only models, reduces padding overhead, and yields better fine‑tuning performance. The provided code and empirical results demonstrate the practicality of this approach for ChatGLM‑2‑6B and similar models.
Code repository: https://github.com/SpongebBob/Finetune-ChatGLM2-6B
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.