Claude‑style 9B Model with 1M‑Token Context Runs Locally

Qwythos‑9B, a Qwen3.5‑9B model fine‑tuned with over 500 M Claude‑style tokens, offers a 1 M‑token YaRN context, native function calling and tool‑augmented self‑correction, outperforms its base on MMLU and gsm8k benchmarks, and provides GGUF quantizations for consumer‑grade GPU deployment.

Old Zhang's AI Learning
Old Zhang's AI Learning
Old Zhang's AI Learning
Claude‑style 9B Model with 1M‑Token Context Runs Locally

Model Overview

Qwythos‑9B is a full‑parameter fine‑tuned inference model based on Qwen3.5‑9B. It was trained on >500 M high‑quality tokens generated by the rethink tool from Claude Mythos and Claude Fable dialogue traces, giving Claude‑style reasoning.

Core Capabilities

1,048,576‑token context : YaRN rope‑scaling factor 4.0 is baked into config.json, expanding the native 262 k token window to 1 M tokens without extra switches.

Performance boost : Under identical evaluation settings the model improves MMLU by +0.343 (0.232 → 0.575), gsm8k‑flex by +0.190 (0.670 → 0.860), gsm8k‑strict by +0.300 (0.510 → 0.810), arc_challenge by +0.020, and arc_challenge_norm by +0.010. gpqa_diamond drops by –0.050.

Native function calling : Implements Qwen3.5 function‑calling spec, no extra wrapper needed.

Tool‑augmented self‑correction : With a Python executor and a DuckDuckGo web‑search tool the model answered seven mixed‑domain questions correctly, providing citations.

Uncensored output : Provides detailed answers in security, red‑team, biology, pharmacology and clinical medicine without refusal.

1 M‑Token Context Details

The config.json sets yarn_factor to 4.0, which statically scales the context length to 1,048,576 tokens. For short‑text tasks where quality loss is observed, revert rope_type to default in config.json.pre_yarn.

Deployment Commands

vllm serve empero-ai/Qwythos-9B-Claude-Mythos-5-1M --max-model-len 1010000

SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server \
  --model-path empero-ai/Qwythos-9B-Claude-Mythos-5-1M --context-length 1010000

Evaluation

Benchmarks were run with EleutherAI’s lm-evaluation-harness on the HuggingFace backend, using Qwen3.5 sampling parameters and a limit of 100 examples per task.

gsm8k (flex) exact_match: 0.670 → 0.860 (+0.190)

gsm8k (strict) exact_match: 0.510 → 0.810 (+0.300)

mmlu accuracy: 0.232 → 0.575 (+0.343)

arc_challenge accuracy: 0.470 → 0.490 (+0.020)

arc_challenge_norm: 0.400 → 0.410 (+0.010)

gpqa_diamond (CoT) exact_match: 0.630 → 0.580 (‑0.050)

The base model’s absolute MMLU score (0.232) is unusually low and highly sensitive to evaluation settings; therefore the relative improvements are the primary indicator.

Tool‑Calling Test

Seven questions spanning mathematics, network security, clinical pharmacology and biochemistry were evaluated with python_executor (subprocess Python, 12 s timeout) and web_search (DuckDuckGo). All seven were answered correctly with source citations.

Question                                      Tool Used          Result
----------------------------------------------------------------
sin(π/7)×cos(π/11) to 10 decimal places      python_executor    ✅ 0.4163083990
Prime numbers < 100 000                     python_executor    ✅ 9592 (Eratosthenes sieve)
Latest stable CPython 3 version               web_search         ✅ 3.14.6 (cited)
Kerberos TGS‑REP hashcat mode                 web_search         ✅ -m 13100 (4 sources)
PrintNightmare CVE number                     web_search         ✅ CVE‑2021‑34527
Can toxic bean alkaloid treat organophosphate poisoning?  web_search  ✅ No, harmful (LITFL citation)
GLP‑1 DPP‑4 cleavage site                    web_search         ✅ Ala⁸–Glu⁹ (Aib‑modified semaglutide)

Four additional hard factual questions that normally fail in a closed‑book setting were solved perfectly when tool support was enabled, demonstrating suitability for retrieval‑augmented agent scenarios.

Quantization Options

Q4_K_M – 5.24 GiB (recommended default, best quality‑size trade‑off)

Q5_K_M – 6.02 GiB

Q6_K – 6.85 GiB

Q8_0 – 8.87 GiB (near‑lossless)

BF16 – 16.69 GiB (full precision)

Version 2 renamed the original files with a -MTP- suffix; users who downloaded the repository before v2 must re‑download the GGUF files, otherwise tokenizer metadata and chat templates are broken and tool calling will fail.

Running Methods

Ollama :

ollama run hf.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M-GGUF:Q4_K_M

llama.cpp CLI :

llama-cli \
  -m Qwythos-9B-Claude-Mythos-5-1M-Q4_K_M.gguf \
  -p "Walk through the biochemistry of how organophosphate nerve agents inhibit acetylcholinesterase." \
  -n 8192 \
  --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.05 \
  -c 16384

LM Studio (GUI) : Load the model; the interface automatically detects the accompanying mmproj-*.gguf for image support.

Advanced options include MTP‑accelerated draft decoding (requires a recent llama.cpp build) and extending the context length by setting -c 1010000 (or any value ≤ 1 M). Single H100/H200 cards reliably handle 256 k–512 k tokens; reaching the full 1 M window needs tensor‑parallel multi‑GPU or aggressive KV‑cache offload.

Multimodal Support

Because the Qwen3.5 base is multimodal, adding the visual projection file mmproj-Qwythos-9B-Claude-Mythos-5-1M-F16.gguf (0.86 GiB) enables image description, OCR and chart reading via llama‑mtmd‑cli:

llama-mtmd-cli \
  -m Qwythos-9B-Claude-Mythos-5-1M-Q4_K_M.gguf \
  --mmproj mmproj-Qwythos-9B-Claude-Mythos-5-1M-F16.gguf \
  --image ./photo.jpg \
  -p "Describe this image in detail." \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  -c 16384

For OpenAI‑compatible serving, launch llama-server with --mmproj and send requests to /v1/chat/completions. The visual tower is frozen; visual performance inherits the base model and has not been independently benchmarked.

Sampling Parameters (required)

temperature: 0.6

top_p: 0.95

top_k: 20

repeat_penalty: 1.05

max_new_tokens: 16384

Greedy decoding or temperature ≤ 0.3 cause repetitive loops; the recommended T = 0.6 avoids this issue. A slightly higher repeat penalty (1.05 vs. Qwen’s default 1.0) further prevents runaway generation in long‑text scenarios.

Conclusion

Qwythos‑9B is suited for local, single‑GPU long‑context inference, tool‑augmented retrieval agents, and researchers in security, biology or pharmacology who need uncensored answers. Strengths: 1 M‑token window, ~5 GiB footprint after quantization, native function calling, reliable tool‑driven factuality. Weaknesses: modest drop on gpqa physics, short‑text quality trade‑off from YaRN, occasional over‑confidence on precise identifiers, and the need for an additional application‑level safety layer before end‑user deployment.

Model URLs:

Base model card: https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M

GGUF quantized version: https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M-GGUF

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Function CallingClaudeLLM evaluationGGUFQwen3.51M token
Old Zhang's AI Learning
Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.