
InternLM Model Research and XTuner Practical Guide (Part 1): DataLoader, Model Conversion, Merging, and Inference

The guide walks through fine‑tuning InternLM‑Chat‑7B with XTuner, showing how to build a DataLoader from a HuggingFace Dataset, convert a LoRA .pth checkpoint to HuggingFace format, merge the adapter into the base model, run inference, and adapt the process for custom datasets and 4‑bit quantization experiments.

OPPO Kernel Craftsman

This article documents a step‑by‑step practical workflow for fine‑tuning the InternLM‑Chat‑7B model with XTuner, covering dataset handling, DataLoader creation, model conversion from PyTorch .pth checkpoints to HuggingFace format, LoRA adapter extraction, model merging, and inference.

Dataset to DataLoader

After obtaining a HuggingFace-style Dataset, the Trainer can ingest it directly; in this example, however, the DataLoader is built manually in /opt/conda/envs/transformers2/lib/python3.9/site-packages/mmengine/runner/runner.py. The key arguments are:

dataset: an object implementing __getitem__ and __len__.

sampler: defines the sampling strategy; omitted when batch_sampler is used.

batch_sampler: an iterator that yields batches of indices, overriding sampler.

collate_fn: a function that merges a list of samples into a batch.

worker_init_fn: an initializer for each worker process (e.g., setting random seeds).

**dataloader_cfg: additional keyword arguments such as batch_size, shuffle, num_workers, etc.
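The arguments above can be sketched with plain PyTorch. The toy dataset and collate function below are illustrative stand-ins, not XTuner's or mmengine's actual implementations:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Minimal dataset implementing __getitem__ and __len__."""
    def __init__(self, num_samples=8, seq_len=16):
        self.data = [torch.randint(0, 100, (seq_len,)) for _ in range(num_samples)]
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return {"input_ids": self.data[idx], "labels": self.data[idx].clone()}

def toy_collate_fn(samples):
    """Merge a list of per-sample dicts into one batched dict."""
    return {k: torch.stack([s[k] for s in samples]) for k in samples[0]}

loader = DataLoader(
    ToyDataset(),
    batch_size=4,            # would arrive via **dataloader_cfg
    shuffle=True,            # the default sampler handles shuffling
    num_workers=0,
    collate_fn=toy_collate_fn,
)
batch = next(iter(loader))
print(batch["input_ids"].shape)
```

Passing a custom batch_sampler instead of batch_size/shuffle would override the default sampling entirely, which is why the two are mutually exclusive.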

The collate_fn implementation resides at /opt/conda/envs/transformers2/lib/python3.9/site-packages/xtuner/dataset/collate_fns/defalut_collate_fn.py (the misspelled filename is XTuner's own). After loading, each sample looks like:

{'data': {'input_ids': tensor([[265, 539, 2632, ..., 36333, 328, 454]]), 'attention_mask': tensor([[True, True, ...]]), 'labels': tensor([[265, 539, 2632, ..., 36333, 328, 454]])}, 'data_samples': None}

All three tensors have shape torch.Size([1, 2048]) because the pack_to_max_length step concatenates samples into fixed 2048-token sequences.
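The packing idea can be illustrated in a few lines of pure Python. This is a simplified sketch of the concept, not XTuner's actual implementation:

```python
def pack_to_max_length(tokenized_samples, max_length=2048, pad_id=0):
    """Concatenate tokenized samples into one token stream, then cut it into
    fixed-length chunks; only the final chunk needs padding."""
    stream = [tok for sample in tokenized_samples for tok in sample]
    chunks = [stream[i:i + max_length] for i in range(0, len(stream), max_length)]
    if chunks and len(chunks[-1]) < max_length:
        chunks[-1] = chunks[-1] + [pad_id] * (max_length - len(chunks[-1]))
    return chunks

# Three samples of 1500 tokens each (4500 total) pack into three 2048-token
# chunks, with only the tail of the last chunk padded.
samples = [[1] * 1500, [2] * 1500, [3] * 1500]
packed = pack_to_max_length(samples)
print(len(packed), len(packed[0]))  # 3 2048
```

Packing keeps every batch at full length, so GPU utilization stays high compared with padding each sample individually.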

Step 4: Converting a .pth checkpoint to HuggingFace

The checkpoint saved in work_dirs is a folder containing a .pth file with LoRA parameters (stored as float32 ). Conversion is performed by the script /opt/conda/envs/transformers2/lib/python3.9/site-packages/xtuner/tools/model_converters/pth_to_hf.py and consists of the following key stages:

Load the base model (e.g., Shanghai_AI_Laboratory/internlm-chat-7b) with load_in_4bit=True. Although 4-bit quantization reduces weight memory, biases and LoRA matrices remain float32, leading to ~9 GB of GPU usage.

Load the .pth file using state_dict = guess_load_checkpoint(args.pth_model). The resulting OrderedDict holds ~159 M LoRA parameters (≈2 % of the total parameter count).

Overlay the LoRA weights onto the model and cast them to float16.

Save the adapter with model.save_pretrained(...), producing adapter_model.bin (≈320 MB, float16).

The conversion yields a HuggingFace-compatible directory containing four files (config, tokenizer, adapter, etc.).
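The overlay step can be sketched as follows. The toy nn.Linear stands in for the PEFT-wrapped base model; strict=False is what lets a checkpoint containing only LoRA keys be loaded without complaint:

```python
import os
import tempfile

import torch
import torch.nn as nn

def overlay_lora_weights(model, pth_path):
    """Load a raw .pth checkpoint, cast its tensors to float16, and overlay
    them onto the model. strict=False means only matching keys are touched,
    so a LoRA-only checkpoint loads cleanly onto the full model."""
    state_dict = torch.load(pth_path, map_location="cpu")
    state_dict = {k: v.half() for k, v in state_dict.items()}  # fp32 -> fp16
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    return missing, unexpected

# Demo with a toy module and a checkpoint containing only the weight tensor
model = nn.Linear(4, 4)
ckpt = os.path.join(tempfile.mkdtemp(), "toy_lora.pth")
torch.save({"weight": torch.ones(4, 4)}, ckpt)
missing, unexpected = overlay_lora_weights(model, ckpt)
# On a real PEFT model, model.save_pretrained(save_dir) would then write
# only the adapter files (adapter_model.bin + adapter_config.json).
```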

Step 5: Merging the LoRA adapter into the base model

Using /opt/conda/envs/transformers2/lib/python3.9/site-packages/xtuner/tools/model_converters/merge.py , the workflow is:

Load the base model in float16 (GPU memory ≈14.3 GB).

Load the adapter via PeftModel.from_pretrained(base_model, adapter_path). Internally this creates a LoraModel instance and calls model.load_adapter(...) to copy the LoRA weights.

Merge LoRA into the main weight matrix: self.weight.data += (self.lora_B[adapter].weight @ self.lora_A[adapter].weight) * self.scaling[adapter] (the corresponding unmerge path subtracts the same term).

Save the merged model and tokenizer with model.save_pretrained(merged_path). The final model is ~16 GB (float16), much smaller than the original 7B model stored in float32.
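The merge arithmetic is easy to verify numerically: folding the scaled B·A product into W gives exactly the same output as keeping the low-rank side path separate. A small sketch with random matrices (dimensions and scaling chosen arbitrarily for illustration):

```python
import torch

torch.manual_seed(0)
d, r = 8, 2          # hidden size and LoRA rank
scaling = 2.0        # corresponds to alpha / r in LoRA
W = torch.randn(d, d)            # frozen base weight
A = torch.randn(r, d) * 0.01     # lora_A: (r, d)
B = torch.randn(d, r) * 0.01     # lora_B: (d, r)
x = torch.randn(1, d)

# Unmerged forward pass: base path plus low-rank side path
y_lora = x @ W.T + (x @ A.T @ B.T) * scaling

# Merged: fold B @ A into the main weight matrix, as merge.py does
W_merged = W + (B @ A) * scaling
y_merged = x @ W_merged.T

print(torch.allclose(y_lora, y_merged, atol=1e-5))  # True
```

After merging, the adapter parameters disappear into W, which is why the merged checkpoint has the same parameter count as the original model.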

Step 6: Inference with the merged model

Run XTuner’s chat script:

python -m xtuner.tools.chat \
    --model_name_or_path ./merged \
    --torch_dtype fp16 \
    --prompt_template internlm_chat \
    --max_new_tokens 2048 \
    --temperature 0.1

The script builds a prompt template such as:

{'SYSTEM': '<|System|>:{system}\n', 'INSTRUCTION': '<|User|>:{input}\n<|Bot|>:', 'SUFFIX': ''}
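Filling that template is straightforward string formatting. A minimal sketch (the build_prompt helper is illustrative, not XTuner's API):

```python
# Template keys taken from the internlm_chat dict above
template = {
    'SYSTEM': '<|System|>:{system}\n',
    'INSTRUCTION': '<|User|>:{input}\n<|Bot|>:',
    'SUFFIX': '',
}

def build_prompt(user_input, system=None):
    """Assemble a single-turn prompt: optional system line, then the
    user turn, ending with the bot marker the model completes from."""
    prompt = template['SYSTEM'].format(system=system) if system else ''
    prompt += template['INSTRUCTION'].format(input=user_input)
    return prompt + template['SUFFIX']

print(repr(build_prompt('Hello')))
# '<|User|>:Hello\n<|Bot|>:'
```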

During generation, the model receives tokenized inputs like:

{'input_ids': tensor([[1, 333, 352, 1621, 352, 27232, 76379, 103027, 364, ...]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, ...]], device='cuda:0'), 'position_ids': tensor([[0, 1, 2, ...]], device='cuda:0'), 'use_cache': True}

Generation proceeds by repeatedly feeding the updated input_ids back into the model until an EOS token appears or the max length is reached.
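The feedback loop can be sketched as a bare greedy decoder. The real generate() additionally manages the KV cache, sampling strategies, and batching; the CountingLM toy model below only exists to make the loop testable:

```python
import torch

def greedy_generate(model, input_ids, eos_token_id, max_new_tokens=32):
    """Minimal greedy decoding loop: append the argmax token each step and
    feed the growing sequence back in, stopping at EOS or the length cap."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                 # (batch, seq, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == eos_token_id:
            break
    return input_ids

class CountingLM:
    """Toy model whose next token is always (last_token + 1) % 10."""
    def __call__(self, input_ids):
        logits = torch.zeros(input_ids.shape[0], input_ids.shape[1], 10)
        logits[0, -1, (input_ids[0, -1].item() + 1) % 10] = 1.0
        return logits

out = greedy_generate(CountingLM(), torch.tensor([[0]]), eos_token_id=3)
print(out.tolist())  # [[0, 1, 2, 3]] -- stops as soon as EOS (3) is emitted
```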

Custom Dataset Fine‑Tuning

The article also shows how to fine‑tune on a non‑standard dataset (Medication_QA_MedInfo2019). The steps are:

Convert the original CSV/Excel into a .jsonl file where each line follows {"conversation": [{"input": "...", "output": "...", "system": "..."}]}.

Copy an existing XTuner config (e.g., internlm_chat_7b_qlora_medqa2019_e3.py) and replace the dataset and dataset_map_fn entries to point to the new file.

Run the training command (≈30 epochs, which takes only a few minutes because the dataset is tiny):

python -m xtuner.tools.train \
    internlm_chat_7b_qlora_medqa2019_e3.py \
    --deepspeed ds_config.json

After training, repeat the .pth → hf conversion and merging steps, then use the same chat command to query the model. The logs show that the model learns the injected medical QA pairs, though occasional hallucinations remain.
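The CSV-to-jsonl conversion in the first step above can be sketched as follows. The 'Question'/'Answer' column names are assumptions about the Medication_QA export and should be adjusted to the actual headers:

```python
import csv
import json

def csv_to_xtuner_jsonl(csv_path, jsonl_path, system=""):
    """Convert a question/answer CSV into XTuner's single-turn
    conversation jsonl format, one JSON record per line."""
    with open(csv_path, newline="", encoding="utf-8") as fin, \
         open(jsonl_path, "w", encoding="utf-8") as fout:
        for row in csv.DictReader(fin):
            record = {"conversation": [{
                "system": system,
                "input": row["Question"],
                "output": row["Answer"],
            }]}
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```

ensure_ascii=False keeps any non-ASCII medical terms readable in the output file instead of escaping them.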

Additional Experiments

Loading the model in 4-bit mode reduces GPU memory to ~5.8 GB; the weights are stored as 4-bit NF4 values packed into uint8 tensors and de-quantized to fp16 for the actual computation.

Exploration of model.generate parameters (e.g., do_sample=True , top_k , temperature ) and their effect on deterministic vs. stochastic generation.

Investigation of LoRA merging mathematics (the W + BA formulation) and verification that after merging the total parameter count returns to the original size.
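The effect of the do_sample, top_k, and temperature knobs mentioned above can be demonstrated with a standalone next-token selector. This is a conceptual sketch, not transformers' actual sampling code:

```python
import torch

def sample_next_token(logits, do_sample=True, top_k=50, temperature=1.0):
    """Pick the next token from a 1-D logits vector.
    do_sample=False falls back to deterministic argmax (greedy)."""
    if not do_sample:
        return int(logits.argmax())
    logits = logits / temperature     # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        # Restrict sampling to the k highest-scoring tokens
        topk_vals, topk_idx = logits.topk(min(top_k, logits.numel()))
        probs = torch.softmax(topk_vals, dim=-1)
        choice = torch.multinomial(probs, 1)
        return int(topk_idx[choice])
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, 1))

logits = torch.tensor([1.0, 3.0, 0.5, 2.0])
print(sample_next_token(logits, do_sample=False))  # 1 (greedy argmax)
```

At very low temperatures (such as the 0.1 used in the chat command), sampling concentrates almost all probability on the top token, so the output becomes nearly as deterministic as greedy decoding.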

Overall, the document serves as a comprehensive tutorial for researchers and engineers working on large language model fine‑tuning, adapter conversion, and deployment using XTuner and PEFT.

Tags: LoRA, PyTorch, DataLoader, Fine-Tuning, InternLM, Model Conversion, XTuner
Written by

OPPO Kernel Craftsman

Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials
