Understanding Large Language Model Files: Structure, Tokens, and Inference with Qwen3

This article walks through the complete workflow of loading and running the open‑source Qwen3‑8B model, explaining each core file (weights, config, generation config, tokenizer), how the model tokenizes input, applies chat templates, generates responses, and decodes output, all illustrated with code examples.


1. Where to find open‑source large models

Open‑source models are typically hosted on communities such as Hugging Face (the "GitHub of AI") or its Chinese alternative, ModelScope. This tutorial uses ModelScope to ensure smooth access for users in mainland China.
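In practice, a model snapshot can be pulled from ModelScope in a couple of lines. A minimal sketch (snapshot_download caches the repository locally and returns its path):

from modelscope import snapshot_download

# Download (or reuse the cached copy of) the Qwen3-8B repository.
local_dir = snapshot_download("Qwen/Qwen3-8B")
print(local_dir)  # path under the local ModelScope cache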

2. Typical file composition of a large model

Model weight files: model-*.safetensors (sharded) plus an index file, model.safetensors.index.json, that maps each tensor to its shard.

Model configuration: config.json or configuration.json describing the architecture, layer count, hidden size, etc.

Generation configuration: generation_config.json containing default sampling parameters such as temperature and top_p.

Tokenizer files: tokenizer.json, tokenizer_config.json, and vocab.json, which define the tokenization rules and the token‑to‑ID map. (A short inspection sketch follows this list.)
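You can verify this layout directly. A minimal sketch, assuming local_dir points at the snapshot downloaded above:

import json
import os

# List the files that make up the model repository.
print(sorted(os.listdir(local_dir)))

# The index file maps every weight tensor name to the shard that stores it.
with open(os.path.join(local_dir, "model.safetensors.index.json")) as f:
    index = json.load(f)
for name, shard in list(index["weight_map"].items())[:5]:
    print(f"{name} -> {shard}")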

3. How a large model is built

A trained model consists of two parts: the architecture definition (described in the config file, config.json or configuration.json) and the weight parameters stored in the .safetensors shards. For Qwen3‑8B the model has roughly 8 billion parameters, split across multiple shard files.
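The architecture half can be inspected programmatically. A minimal sketch (the field names follow the standard Hugging Face config schema; the values are read from the model's own config.json rather than hard-coded here, and if your modelscope version does not expose AutoConfig, transformers.AutoConfig behaves the same way):

from modelscope import AutoConfig

# Load config.json for Qwen3-8B and print a few architecture fields.
config = AutoConfig.from_pretrained("Qwen/Qwen3-8B")
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)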

4. From input text to model output

The model is an autoregressive generator: it predicts the next token given all previous tokens. The pipeline is as follows (a minimal end-to-end sketch appears after the list):

Tokenization – the input sentence is split into tokens using the tokenizer files. For example, tokenizer.json maps the token "Hello" to ID 9707.

Encoding – tokens are converted to IDs and fed to the model.

Generation – the model repeatedly predicts the next token (e.g., "你好呀" ("hello there"), then "我的" ("my"), …) until a stop condition is met.

Decoding – token IDs are mapped back to readable text.
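Putting the four stages together, here is a minimal greedy-decoding sketch, not the library's actual implementation, assuming model and tokenizer have been loaded as in Section 5 (real inference would use model.generate with sampling):

import torch

@torch.no_grad()
def generate_greedy(prompt: str, max_new_tokens: int = 32) -> str:
    # 1) Tokenization + 2) Encoding: text -> token IDs.
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    for _ in range(max_new_tokens):
        # 3) Generation: take the logits at the last position and pick
        # the most likely next token.
        logits = model(ids).logits[:, -1, :]
        next_id = torch.argmax(logits, dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        # Stop condition: the model emitted its end-of-sequence token.
        if next_id.item() == tokenizer.eos_token_id:
            break
    # 4) Decoding: token IDs -> readable text.
    return tokenizer.decode(ids[0], skip_special_tokens=True)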

4.1 Special symbols and chat template

To distinguish user queries from model replies, Qwen3 uses a chat template with special markers:

<|im_start|> – start of a segment.
user / assistant – the speaker role.
<|im_end|> – end of a segment.

Example of a formatted prompt:

<|im_start|>user
你好<|im_end|>
<|im_start|>assistant
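These markers are single special tokens in Qwen3's vocabulary. A quick check, assuming a Qwen3 tokenizer has been loaded (as in Section 5):

# Each template marker maps to exactly one token ID.
for tok in ["<|im_start|>", "<|im_end|>"]:
    print(tok, "->", tokenizer.convert_tokens_to_ids(tok))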

5. Practical inference guide (Qwen3‑8B on ModelScope)

Step 1 – Environment: Launch a JupyterLab instance on the Lab4AI platform (pre‑installed with LLaMA-Factory 0.9.4, PyTorch 2.8, etc.).

Step 2 – Import libraries:

from modelscope import AutoModelForCausalLM, AutoTokenizer

Step 3 – Load model and tokenizer:

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

Step 4 – Prepare input and apply the chat template:

user_input = "你好"
messages = [{'role': 'user', 'content': user_input}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

Step 5 – Tokenize and inspect IDs:

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
print(model_inputs)

The printed input_ids tensor shows the token sequence, including the special markers defined by the template.
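To see which IDs correspond to which tokens (including the template markers), you can map them back. A small optional check:

# Map each input ID back to its token string.
print(tokenizer.convert_ids_to_tokens(model_inputs.input_ids[0].tolist()))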

Step 6 – Run generation:

generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
# Slice off the prompt: keep only the newly generated token IDs.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
print(output_ids)

Step 7 – Decode the result:

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print(content)

The final printed content is the model's response, matching the theoretical generation process described earlier.

6. Conclusion

The article demystifies the file layout of open‑source large language models, explains how tokenization and chat templates enable dialogue generation, and provides a reproducible code example that loads Qwen3‑8B from ModelScope, runs inference, and decodes the output.

Tags: Python · Inference · tokenizer · ModelScope · Qwen3 · model file structure
Written by

Fun with Large Models

Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!
