Master Essential LLM Engineering Skills: Transform, Model, and Infer with Custom Scripts

This guide presents a hands‑on curriculum of core large‑model engineering tasks—including model conversion scripts, custom modeling wrappers, multi‑model inference utilities, and channel‑aware loss tracking—to help practitioners build practical, reusable tools without deep theoretical overhead.

Baobao Algorithm Notes

trans_XX_to_llama.py

In the open-source community, the Llama architecture has become the de facto standard, so modeling_llama.py can, in principle, load most open-source models once their weights are converted to Llama's layout. The article asks readers to implement conversion scripts such as trans_qwen_to_llama.py and trans_llama_to_qwen.py, so that diverse models can be loaded directly through the Llama loader.

Writing these scripts forces you to learn each model's quirks, e.g., Qwen2's biased Q/K/V linear layers and Baichuan's extra normalize() before lm_head.
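As an illustration of handling one such quirk, here is a minimal sketch of the bias step in a hypothetical trans_qwen_to_llama.py. It assumes the source checkpoint already uses Llama-style parameter names (model.layers.N.self_attn.q_proj.weight and so on), that it carries Q/K/V biases but no output-projection bias, and that the target Llama config enables attention biases on all four projections. The function name and the list-based "tensors" are illustrative only; with real checkpoints you would pass a tensor factory instead.

```python
from typing import Any, Callable, Dict, List


def zero_vector(n: int) -> List[float]:
    """Default bias factory (plain list); swap in a tensor factory for real checkpoints."""
    return [0.0] * n


def add_missing_o_proj_bias(
    state_dict: Dict[str, Any],
    make_zeros: Callable[[int], Any] = zero_vector,
) -> Dict[str, Any]:
    """Return a copy of state_dict in which every o_proj has a zero bias.

    Assumption: weights are stored as [out_features, in_features], so
    len(weight) gives the bias length. A zero bias leaves the outputs unchanged.
    """
    converted = dict(state_dict)  # shallow copy; the input dict is not modified
    for name, weight in state_dict.items():
        if name.endswith("self_attn.o_proj.weight"):
            bias_name = name[: -len("weight")] + "bias"
            if bias_name not in converted:
                converted[bias_name] = make_zeros(len(weight))
    return converted
```

Because the added bias is all zeros, the converted model should produce the same logits as the original, which gives a simple correctness check for the script.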

Advanced conversions include trans_llama_to_megatron.py (parameterized by tensor-parallel and pipeline-parallel size) and trans_megatron_to_llama.py. Splitting and merging Megatron checkpoints is simple in itself, but its GQA implementation needs extra care.
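The split/merge step can be sketched as follows. This is a simplified illustration using nested lists as [out_features, in_features] matrices: layers whose output dimension is sharded are split by rows, and layers whose input dimension is sharded are split by columns. The function names are hypothetical, and the sketch deliberately ignores the ordering of fused QKV weights under GQA, which is the part the article warns about.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def split_rows(weight: Matrix, tp_size: int) -> List[Matrix]:
    """Shard the output dimension (rows) evenly across tp_size ranks."""
    if len(weight) % tp_size != 0:
        raise ValueError("number of rows must be divisible by tp_size")
    chunk = len(weight) // tp_size
    return [weight[r * chunk:(r + 1) * chunk] for r in range(tp_size)]


def merge_rows(shards: Sequence[Matrix]) -> Matrix:
    """Inverse of split_rows: concatenate shards along the row dimension."""
    return [row for shard in shards for row in shard]


def split_cols(weight: Matrix, tp_size: int) -> List[Matrix]:
    """Shard the input dimension (columns) evenly across tp_size ranks."""
    if len(weight[0]) % tp_size != 0:
        raise ValueError("number of columns must be divisible by tp_size")
    chunk = len(weight[0]) // tp_size
    return [
        [row[r * chunk:(r + 1) * chunk] for row in weight]
        for r in range(tp_size)
    ]


def merge_cols(shards: Sequence[Matrix]) -> Matrix:
    """Inverse of split_cols: concatenate shards along the column dimension."""
    return [
        [x for shard in shards for x in shard[i]]
        for i in range(len(shards[0]))
    ]
```

A round trip (split, then merge) returning the original matrix is a cheap sanity test before running the conversion on a real checkpoint.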

modeling_XX.py

Because modeling_llama.py lacks features such as skip_build, streaming generation, sequence parallelism, flash attention by default, and a reward-model head, the author recommends building a unified modeling_XX.py by aggregating the best functions from existing implementations (modeling_qwen.py, modeling_baichuan.py, modeling_yi.py, modeling_deepseek.py, modeling_glm.py, etc.).

Example helper methods to add for debugging:
def show_cos_distance(self, layer): compute the cosine distance between a layer's input and output hidden states.
def show_topk_token(self, layer, K=10): display the top-K token predictions from a given layer.
def show_attention(self, layer, tokenA, tokenB): output the attention value between two tokens at a specific layer.
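The numeric core of the first two helpers can be sketched in plain Python. This is only an illustration under assumptions: hidden states and logits are given as flat lists of floats, and the function names cosine_distance and topk_tokens are hypothetical stand-ins for the model methods named above, which would first have to capture the relevant layer's tensors.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def topk_tokens(
    logits: Sequence[float], vocab: Sequence[str], k: int = 10
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token, logit) pairs, best first."""
    top_ids = heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)
    return [(vocab[i], logits[i]) for i in top_ids]
```

A distance near 0 between a layer's input and output suggests the layer changes the hidden state very little, which is the kind of signal these helpers are meant to surface.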

With this unified wrapper, any new open‑source model can be fine‑tuned via a generic “trans_newModel_to_myModel.py” without modifying training code.

multi_infer.py

While model.generate() is familiar, throughput scales better when several model replicas run in parallel across the available GPUs (e.g., eight replicas on eight GPUs) than when a single model processes one large batch. The article proposes a class Infer(model_path, data_path, output_path, num_workers) that can launch inference using torchrun, multiprocessing, or other Python libraries, so that a 1-machine-8-GPU setup can load 8, 4, 2, or 1 models concurrently.

Tip: add a def set_device(self, device_list) method to modeling_XX.py to avoid repeatedly setting os.environ["CUDA_VISIBLE_DEVICES"].

After achieving single‑machine parallel inference, try multi‑machine deployment.

Explore faster inference frameworks such as vLLM instead of the default model.generate().

Channel Loss

When performing domain post‑pretraining, the loss curve usually shows a slow decline or plateau, while SFT loss drops quickly in a stair‑step fashion. Observing only the overall loss provides little insight, so the article suggests splitting the dataset into channels and plotting each channel’s loss curve to uncover hidden patterns.

Task: modify the training code to assign a random channel to each SFT data point and record per‑channel loss during training, possibly using all_gather_object for aggregation.
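The bookkeeping for this task can be sketched independently of any training framework. In the sketch below, ChannelLossTracker, merge_states, and assign_random_channels are hypothetical names, and per-sample losses are assumed to be available as plain floats. In a distributed run, the list passed to merge_states would be the per-rank states collected with torch.distributed.all_gather_object.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

State = Dict[str, Tuple[float, int]]  # channel -> (loss sum, sample count)


def assign_random_channels(num_samples: int, channels: Sequence[str],
                           seed: int = 0) -> List[str]:
    """Give each sample a random channel label (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [rng.choice(channels) for _ in range(num_samples)]


class ChannelLossTracker:
    """Accumulates per-channel loss sums and counts on one rank."""

    def __init__(self) -> None:
        self.loss_sum: Dict[str, float] = defaultdict(float)
        self.count: Dict[str, int] = defaultdict(int)

    def update(self, channels: Iterable[str], losses: Iterable[float]) -> None:
        for channel, loss in zip(channels, losses):
            self.loss_sum[channel] += float(loss)
            self.count[channel] += 1

    def state(self) -> State:
        """Picklable snapshot, suitable for gathering across ranks."""
        return {ch: (self.loss_sum[ch], self.count[ch]) for ch in self.count}


def merge_states(states: Iterable[State]) -> Dict[str, float]:
    """Combine per-rank states into a global mean loss per channel."""
    total_sum: Dict[str, float] = defaultdict(float)
    total_count: Dict[str, int] = defaultdict(int)
    for state in states:
        for channel, (loss_sum, count) in state.items():
            total_sum[channel] += loss_sum
            total_count[channel] += count
    return {ch: total_sum[ch] / total_count[ch] for ch in total_count}
```

Gathering sums and counts rather than per-rank means keeps the global average correct when ranks see different numbers of samples from a channel.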

Megatron’s tensor‑parallel (TP) and pipeline‑parallel (PP) make integration more complex than DeepSpeed.

The trainer’s rigid model.trainer() wrapper complicates adding custom channel loss.

These practical “basic skills” may not help with interview questions, but they significantly improve development efficiency; all the listed programs can be generated by ChatGPT, except for some advanced parts that may still require manual debugging.

Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.