Tagged articles

LLM benchmarks

5 articles · Page 1 of 1

Jun 16, 2026 · Artificial Intelligence

A Systematic Approach to AI Evaluation: From Benchmarks to Real‑World Scenarios

This article outlines a comprehensive methodology for evaluating large language models, covering classic benchmarks, human and multimodal assessments, common pitfalls such as data contamination and benchmark overfitting, and practical guidelines for building a scientific, multi‑layered AI evaluation framework.

AI evaluation · LLM benchmarks · LLM-as-judge

0 likes · 27 min read

SuanNi

Jun 12, 2026 · Artificial Intelligence

Kimi K2.7 Code Goes Open: 30% Token Savings and Major Coding Performance Boost

Kimi K2.7 Code, now open‑source on HuggingFace, reduces token consumption by ~30% and raises coding benchmark scores: Kimi Code Bench v2 climbs from 50.9 to 62.0, Program‑Bench from 48.3 to 53.6, and MLS Bench Lite from 26.7 to 35.1. The gains narrow the gap with GPT‑5.5 and Claude Opus. The model is built on a 1‑trillion‑parameter MoE architecture with INT4 quantization and a 256K‑token context.

HuggingFace · Kimi K2.7 · LLM benchmarks

0 likes · 6 min read

Old Zhang's AI Learning

Jun 10, 2026 · Artificial Intelligence

Anthropic’s Claude Fable 5 and Mythos 5: Twin Models with a Shockingly Low Price and New Safety Switches

Anthropic released Claude Fable 5 and Mythos 5 as twin large language model variants that share the same base and differ only in safety‑classifier settings. Both offer a 1M‑token context, 128K‑token output, a halved price, and a three‑layer real‑time safety system that routes risky requests to Claude Opus 4.8.

AI safety · Anthropic · Claude Fable 5

0 likes · 12 min read

Machine Heart

Jun 9, 2026 · Artificial Intelligence

Claude Fable 5 Unveiled: Record-Breaking Performance and New Pricing

Anthropic has launched Claude Fable 5, its most powerful LLM to date, claiming top‑tier results across software engineering, knowledge work, vision, and scientific benchmarks, while offering higher token efficiency, new safety layers, and pricing of $10 per million input tokens and $50 per million output tokens.

AI safety · Anthropic · Claude Fable 5

0 likes · 7 min read

Old Zhang's AI Learning

Apr 29, 2026 · Artificial Intelligence

Top 10 Open‑Source LLM Benchmarks: Scores, Rankings, and What They Test

This article walks through ten mainstream open‑source LLM benchmarks (SWE‑bench Verified and Pro, MMLU‑Pro, GPQA Diamond, HLE, AIME, HMMT, olmOCR‑bench, Terminal‑Bench 2.0, and EvasionBench), explaining their data, evaluation metrics, current leading models, and the capability dimensions they reveal.

AI evaluation · LLM benchmarks · MMLU-Pro

0 likes · 20 min read
