Old Zhang's AI Learning
Mar 28, 2026 · Artificial Intelligence

Qwen3.5-27B Outperforms the 397B Model in Tool Calling – Q6 Quantization Is Optimal

Using the open‑source ToolCall‑15 benchmark, the author shows that the 27‑billion‑parameter Qwen3.5 model consistently scores full marks while the 397‑billion‑parameter version fails on several tasks, and that the Q6 quantized variant offers the best trade‑off between size and tool‑calling accuracy.

AI · LLM benchmark · Qwen3.5
7 min read
SuanNi
Mar 23, 2026 · Artificial Intelligence

Can AI Agents Master Long-Term Memory? Supermemory’s Near‑99% Accuracy Breakthrough

The Supermemory team’s new ASMR (Agentic Search and Memory Retrieval) system achieves almost 99% accuracy on the LongMemEval benchmark by replacing vector‑database retrieval with parallel, specialized AI agents that ingest, search, and synthesize massive conversational histories entirely in memory, offering a potential solution to longstanding AI memory challenges.

AI memory · ASMR · LLM benchmark
8 min read
PaperAgent
Mar 9, 2026 · Artificial Intelligence

Which LLM Wins the Agent Benchmark? PinchBench Success, Speed, and Cost Rankings Revealed

PinchBench evaluates 32 mainstream large language models on success rate, execution speed, and cost across real‑world agent tasks, highlighting top performers such as Gemini‑3‑flash‑preview, MiniMax‑M2.1, and Kimi‑K2.5, and explains why traditional AI benchmarks no longer predict agent effectiveness.

Execution Speed · LLM benchmark · OpenClaw
4 min read
Old Zhang's AI Learning
Feb 3, 2026 · Artificial Intelligence

Step‑3.5‑Flash: Lightning‑Fast Inference with 196B Params, Only 11B Active (vLLM)

Step‑3.5‑Flash, a 196‑billion‑parameter open‑source LLM that activates only 11 billion parameters per token via a Mixture‑of‑Experts design, delivers more than 3× faster inference, matches top‑tier closed‑source models on SWE‑bench and other benchmarks, supports 256K context, runs on consumer‑grade hardware, and is already integrated into vLLM, SGLang, and Claude Code, though it has known token‑efficiency and domain‑stability limitations.

LLM benchmark · MoE · Multi‑Token Prediction
11 min read
Baobao Algorithm Notes
Dec 24, 2025 · Artificial Intelligence

GLM-4.7 Review: How the New Model Beats Competitors in Coding and Reasoning

The GLM-4.7 model launches with record‑breaking benchmark scores in coding, reasoning, and real‑world programming tasks, outperforming both open‑source and commercial LLMs while introducing interleaved, retained, and round‑level thinking modes that improve complex task execution.

AI model comparison · Coding AI · GLM-4.7
9 min read
Volcano Engine Developer Services
Apr 14, 2025 · Artificial Intelligence

Introducing Multi‑SWE‑bench: The First Multilingual Code‑Fix Benchmark for LLMs

ByteDance’s Doubao model team has open‑sourced Multi‑SWE‑bench, a multilingual benchmark covering seven major programming languages with 1,632 real‑world bug‑fix tasks, complete Docker environments, difficulty grading, and strict human validation, aiming to evaluate and advance large‑language‑model code‑repair capabilities beyond Python.

LLM benchmark · code repair · dataset
11 min read