Old Zhang's AI Learning
Mar 28, 2026 · Artificial Intelligence

Qwen3.5-27B Outperforms the 397B Model in Tool Calling – Q6 Quantization Is Optimal

Using the open‑source ToolCall‑15 benchmark, the author shows that the 27‑billion‑parameter Qwen3.5 model consistently scores full marks while the 397‑billion‑parameter version fails on several tasks, and that the Q6 quantized variant offers the best trade‑off between size and tool‑calling accuracy.

AI · LLM benchmark · Qwen3.5
7 min read
SuanNi
Mar 23, 2026 · Artificial Intelligence

Can AI Agents Master Long-Term Memory? Supermemory’s Near‑99% Accuracy Breakthrough

The Supermemory team’s new ASMR (Agentic Search and Memory Retrieval) system achieves almost 99% accuracy on the LongMemEval benchmark by replacing vector‑database retrieval with parallel, specialized AI agents that ingest, search, and synthesize massive conversational histories entirely in memory, offering a potential solution to longstanding AI memory challenges.

AI memory · ASMR · LLM benchmark
8 min read
PaperAgent
Mar 9, 2026 · Artificial Intelligence

Which LLM Wins the Agent Benchmark? PinchBench Success, Speed, and Cost Rankings Revealed

PinchBench evaluates 32 mainstream large language models on success rate, execution speed, and cost across real‑world agent tasks, highlighting top performers such as Gemini‑3‑flash‑preview, MiniMax‑M2.1, and Kimi‑K2.5, and explains why traditional AI benchmarks no longer predict agent effectiveness.

Execution Speed · LLM benchmark · OpenClaw
4 min read
Old Zhang's AI Learning
Feb 3, 2026 · Artificial Intelligence

Step‑3.5‑Flash: Lightning‑Fast Inference with 196B Params, Only 11B Active (vLLM)

Step‑3.5‑Flash, a 196‑billion‑parameter open‑source LLM that activates only 11 billion parameters per token via a Mixture‑of‑Experts design, delivers more than 3× faster inference, matches top‑tier closed‑source models on SWE‑bench and other benchmarks, supports 256K context, runs on consumer‑grade hardware, and is already integrated into vLLM, SGLang, and Claude Code, though it has known token‑efficiency and domain‑stability limitations.

LLM benchmark · MoE · Multi‑Token Prediction
11 min read
Baobao Algorithm Notes
Dec 24, 2025 · Artificial Intelligence

GLM-4.7 Review: How the New Model Beats Competitors in Coding and Reasoning

The GLM-4.7 model launches with record‑breaking benchmark scores in coding, reasoning, and real‑world programming tasks, outperforming both open‑source and commercial LLMs while introducing interleaved, retained, and round‑level thinking modes that improve complex task execution.

AI model comparison · Coding AI · GLM-4.7
9 min read
Volcano Engine Developer Services
Apr 14, 2025 · Artificial Intelligence

Introducing Multi‑SWE‑bench: The First Multilingual Code‑Fix Benchmark for LLMs

ByteDance’s Doubao model team has open‑sourced Multi‑SWE‑bench, a multilingual benchmark covering seven major programming languages with 1,632 real‑world bug‑fix tasks, complete Docker environments, difficulty grading, and strict human validation, aiming to evaluate and advance large‑language‑model code‑repair capabilities beyond Python.

LLM benchmark · code repair · dataset
11 min read