Claude‑style 9B Model with 1M‑Token Context Runs Locally
Qwythos‑9B, a Qwen3.5‑9B model fine‑tuned with over 500 M Claude‑style tokens, offers a 1 M‑token YaRN context, native function calling and tool‑augmented self‑correction, outperforms its base on MMLU and gsm8k benchmarks, and provides GGUF quantizations for consumer‑grade GPU deployment.
Model Overview
Qwythos‑9B is a full‑parameter fine‑tuned inference model based on Qwen3.5‑9B. It was trained on >500 M high‑quality tokens generated by the rethink tool from Claude Mythos and Claude Fable dialogue traces, giving Claude‑style reasoning.
Core Capabilities
1,048,576‑token context : YaRN rope‑scaling factor 4.0 is baked into config.json, expanding the native 262 k token window to 1 M tokens without extra switches.
Performance boost : Under identical evaluation settings the model improves MMLU by +0.343 (0.232 → 0.575), gsm8k‑flex by +0.190 (0.670 → 0.860), gsm8k‑strict by +0.300 (0.510 → 0.810), arc_challenge by +0.020, and arc_challenge_norm by +0.010. gpqa_diamond drops by –0.050.
Native function calling : Implements Qwen3.5 function‑calling spec, no extra wrapper needed.
Tool‑augmented self‑correction : With a Python executor and a DuckDuckGo web‑search tool the model answered seven mixed‑domain questions correctly, providing citations.
Uncensored output : Provides detailed answers in security, red‑team, biology, pharmacology and clinical medicine without refusal.
1 M‑Token Context Details
The config.json sets yarn_factor to 4.0, which statically scales the context length to 1,048,576 tokens. For short‑text tasks where quality loss is observed, revert rope_type to default in config.json.pre_yarn.
Deployment Commands
vllm serve empero-ai/Qwythos-9B-Claude-Mythos-5-1M --max-model-len 1010000
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server \
--model-path empero-ai/Qwythos-9B-Claude-Mythos-5-1M --context-length 1010000Evaluation
Benchmarks were run with EleutherAI’s lm-evaluation-harness on the HuggingFace backend, using Qwen3.5 sampling parameters and a limit of 100 examples per task.
gsm8k (flex) exact_match: 0.670 → 0.860 (+0.190)
gsm8k (strict) exact_match: 0.510 → 0.810 (+0.300)
mmlu accuracy: 0.232 → 0.575 (+0.343)
arc_challenge accuracy: 0.470 → 0.490 (+0.020)
arc_challenge_norm: 0.400 → 0.410 (+0.010)
gpqa_diamond (CoT) exact_match: 0.630 → 0.580 (‑0.050)
The base model’s absolute MMLU score (0.232) is unusually low and highly sensitive to evaluation settings; therefore the relative improvements are the primary indicator.
Tool‑Calling Test
Seven questions spanning mathematics, network security, clinical pharmacology and biochemistry were evaluated with python_executor (subprocess Python, 12 s timeout) and web_search (DuckDuckGo). All seven were answered correctly with source citations.
Question Tool Used Result
----------------------------------------------------------------
sin(π/7)×cos(π/11) to 10 decimal places python_executor ✅ 0.4163083990
Prime numbers < 100 000 python_executor ✅ 9592 (Eratosthenes sieve)
Latest stable CPython 3 version web_search ✅ 3.14.6 (cited)
Kerberos TGS‑REP hashcat mode web_search ✅ -m 13100 (4 sources)
PrintNightmare CVE number web_search ✅ CVE‑2021‑34527
Can toxic bean alkaloid treat organophosphate poisoning? web_search ✅ No, harmful (LITFL citation)
GLP‑1 DPP‑4 cleavage site web_search ✅ Ala⁸–Glu⁹ (Aib‑modified semaglutide)Four additional hard factual questions that normally fail in a closed‑book setting were solved perfectly when tool support was enabled, demonstrating suitability for retrieval‑augmented agent scenarios.
Quantization Options
Q4_K_M – 5.24 GiB (recommended default, best quality‑size trade‑off)
Q5_K_M – 6.02 GiB
Q6_K – 6.85 GiB
Q8_0 – 8.87 GiB (near‑lossless)
BF16 – 16.69 GiB (full precision)
Version 2 renamed the original files with a -MTP- suffix; users who downloaded the repository before v2 must re‑download the GGUF files, otherwise tokenizer metadata and chat templates are broken and tool calling will fail.
Running Methods
Ollama :
ollama run hf.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M-GGUF:Q4_K_Mllama.cpp CLI :
llama-cli \
-m Qwythos-9B-Claude-Mythos-5-1M-Q4_K_M.gguf \
-p "Walk through the biochemistry of how organophosphate nerve agents inhibit acetylcholinesterase." \
-n 8192 \
--temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.05 \
-c 16384LM Studio (GUI) : Load the model; the interface automatically detects the accompanying mmproj-*.gguf for image support.
Advanced options include MTP‑accelerated draft decoding (requires a recent llama.cpp build) and extending the context length by setting -c 1010000 (or any value ≤ 1 M). Single H100/H200 cards reliably handle 256 k–512 k tokens; reaching the full 1 M window needs tensor‑parallel multi‑GPU or aggressive KV‑cache offload.
Multimodal Support
Because the Qwen3.5 base is multimodal, adding the visual projection file mmproj-Qwythos-9B-Claude-Mythos-5-1M-F16.gguf (0.86 GiB) enables image description, OCR and chart reading via llama‑mtmd‑cli:
llama-mtmd-cli \
-m Qwythos-9B-Claude-Mythos-5-1M-Q4_K_M.gguf \
--mmproj mmproj-Qwythos-9B-Claude-Mythos-5-1M-F16.gguf \
--image ./photo.jpg \
-p "Describe this image in detail." \
--temp 0.6 --top-p 0.95 --top-k 20 \
-c 16384For OpenAI‑compatible serving, launch llama-server with --mmproj and send requests to /v1/chat/completions. The visual tower is frozen; visual performance inherits the base model and has not been independently benchmarked.
Sampling Parameters (required)
temperature: 0.6
top_p: 0.95
top_k: 20
repeat_penalty: 1.05
max_new_tokens: 16384
Greedy decoding or temperature ≤ 0.3 cause repetitive loops; the recommended T = 0.6 avoids this issue. A slightly higher repeat penalty (1.05 vs. Qwen’s default 1.0) further prevents runaway generation in long‑text scenarios.
Conclusion
Qwythos‑9B is suited for local, single‑GPU long‑context inference, tool‑augmented retrieval agents, and researchers in security, biology or pharmacology who need uncensored answers. Strengths: 1 M‑token window, ~5 GiB footprint after quantization, native function calling, reliable tool‑driven factuality. Weaknesses: modest drop on gpqa physics, short‑text quality trade‑off from YaRN, occasional over‑confidence on precise identifiers, and the need for an additional application‑level safety layer before end‑user deployment.
Model URLs:
Base model card: https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M
GGUF quantized version: https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M-GGUF
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
