Tagged articles

Benchmarking

129 articles · Page 1 of 2

Jul 3, 2026 · Artificial Intelligence

From Prediction to Planning: WLA Unifies World Modeling, Language Reasoning, and Action Generation

The paper introduces the World‑Language‑Action (WLA) model, which replaces pixel‑level world‑action predictions with combined textual intent and fine‑grained physical dynamics. The 2 B‑parameter model runs real‑time inference at 40 ms, doubles success rates on the RMBench benchmark, and outperforms prior WAM and VLA baselines in simulation and real‑robot tests.

Action Synthesis · Benchmarking · Cross-embodiment Transfer

0 likes · 9 min read

AI Engineer Programming

Jun 30, 2026 · Artificial Intelligence

How to Quickly Validate LLM Capabilities Without Standard Benchmarks

Standard benchmarks often suffer from data leakage, mismatched real‑world scenarios, and limited metrics, so this guide proposes a practical, self‑crafted evaluation framework with diverse question types, clear scoring dimensions, and a step‑by‑step SOP to reliably assess LLM code‑generation abilities.

AI model assessment · Benchmarking · LLM evaluation

0 likes · 18 min read

DataFunTalk

Jun 23, 2026 · Artificial Intelligence

Can a Fable‑Level AI Model That Evades Export Controls Beat Claude Mythos?

Amid the sudden shutdown of Anthropic's Claude Fable 5, Sakana AI unveils Fugu—an orchestration‑based, Fable‑level model that sidesteps export restrictions, matches or exceeds Fable 5 and Mythos on engineering, scientific, and reasoning benchmarks, and demonstrates a new trend toward model scheduling over sheer scale.

AI orchestration · Benchmarking · Sakana AI

0 likes · 8 min read

Ops Development & AI Practice

Jun 23, 2026 · Artificial Intelligence

Sovereign‑Free Routing: How Sakana AI’s Fugu Beats Claude Fable 5 Amid Geopolitical Constraints

Sakana AI’s newly released Fugu system uses a tiny 7B “commander” model to dynamically orchestrate a pool of global and local AI models, achieving a 73.7 % SWE‑bench Pro score that outperforms GPT‑5.5 and the heavily sanctioned Claude Fable 5, while illustrating a sovereign‑free routing strategy born from geopolitical and compute limitations.

AI Geopolitics · Benchmarking · Evolutionary Algorithms

0 likes · 8 min read

Machine Heart

Jun 22, 2026 · Artificial Intelligence

How ZhiZi XinYuan’s AI‑Driven Compute Platform Is Disrupting the Industry After Two Funding Rounds

The article explains how the emerging "AI for Computing" paradigm (using large models, operations‑research optimization, and automated algorithm discovery) enables ZhiZi XinYuan to automate hardware‑level performance tuning, achieve SOTA benchmark results with its KernelCAT platform, and attract nearly one hundred million yuan in funding in just two months.

AI for Computing · Benchmarking · compute acceleration

0 likes · 14 min read

SuanNi

Jun 17, 2026 · Artificial Intelligence

Can a 3B Small Model Match Top Closed‑Source LLMs? VibeThinker-3B’s Limits

VibeThinker-3B, a newly open‑sourced 3‑billion‑parameter model, achieves near‑state‑of‑the‑art scores on math competitions (AIME, IMO‑AnswerBench), coding (LiveCodeBench), and verification benchmarks, rivaling trillion‑parameter closed models, thanks to a Spectrum‑to‑Signal training pipeline, multi‑stage SFT, RL, and offline distillation, supporting a new parametric compression‑coverage hypothesis.

AI research · Benchmarking · Parameter Efficiency

0 likes · 8 min read

IT Services Circle

Jun 16, 2026 · Artificial Intelligence

Microsoft’s Open‑Source SkillOpt Supercharges AI Agent Skills, Surpasses 5K GitHub Stars

SkillOpt, an open‑source framework from Microsoft Research, treats skill markdown files as trainable parameters and applies neural‑network optimization techniques across six ReflACT stages, achieving up to 39‑point accuracy gains on 52 benchmark evaluations and demonstrating cross‑model transferability, all while requiring zero inference cost.

AI Agents · Benchmarking · GitHub

0 likes · 10 min read

Machine Heart

Jun 12, 2026 · Artificial Intelligence

Recursive AI Takes Its First Step: Automated Research System Sets New SOTA Benchmarks

Recursive Superintelligence unveiled an open‑source system that automates the AI research loop, achieving state‑of‑the‑art results on three distinct benchmarks—NanoChat autoresearch, NanoGPT speedrun, and SOL‑ExecBench—while illustrating the practical progress toward recursive self‑improvement warned about by Anthropic.

AI Automation · Anthropic · Benchmarking

0 likes · 12 min read

Machine Heart

Jun 9, 2026 · Artificial Intelligence

Can a $10 Million Inference Budget Uncover AI’s Real Upper Limit?

The article argues that as large language models grow more capable, single‑score benchmarks no longer capture true performance; instead, evaluating models across varying inference budgets—measured in tokens, cost, or time—reveals their real capabilities and safety risks, prompting a shift toward performance‑cost curves and new industry standards.

AI evaluation · AI safety · Benchmarking

0 likes · 13 min read

SuanNi

Jun 1, 2026 · Artificial Intelligence

Rewriting Claude Code in 90k Lines of Python: How CheetahClaws Tests Harness Scaling

The article analyzes why AI agents need system‑level scaling, explains the UC Berkeley "Harness" framework, and details how the open‑source CheetahClaws project rewrites Claude Code in Python to evaluate system scaling across memory, context, routing, orchestration and governance components.

AI Agents · Benchmarking · CheetahClaws

0 likes · 13 min read

SuanNi

Jun 1, 2026 · Artificial Intelligence

MiniMax M3 Beats GPT‑5.5 in Programming and Goes Open‑Source

MiniMax M3, a domestically developed LLM, combines a new sparse‑attention MSA architecture, native multimodal support, and million‑token context to match or surpass top closed‑source models in programming and agent benchmarks, while achieving a 9.4× speedup on FP8 GEMM and preparing for open‑source release.

AI · Benchmarking · FP8 GEMM

0 likes · 12 min read

SuanNi

May 28, 2026 · Artificial Intelligence

How a 3.8B Model Beats 6B+ Models Using Just 20% of the Compute – Inside Microsoft Lens

Microsoft’s Lens team shows that a 3.8 B‑parameter image‑generation model can match or surpass 6 B‑plus models while consuming only about 19 % of the GPU compute, thanks to aggressive model compression, dense captioning, mixed‑resolution training, optimized VAE and language encoders, and targeted RL fine‑tuning.

Benchmarking · Model Efficiency · dense captioning

0 likes · 14 min read

Machine Heart

May 25, 2026 · Artificial Intelligence

Claude’s Pass Rate Under 4%: SaaS‑Bench Shatters the “Fully Automated Office” Dream

SaaS‑Bench evaluates AI agents on 23 real SaaS applications and 106 cross‑app, long‑horizon tasks, revealing that even the strongest model, Claude Opus 4.7, passes fewer than four percent of tasks and exposing four structural failure modes that separate benchmark scores from true office productivity.

AI Agents · Benchmarking · Claude Opus

0 likes · 10 min read

Machine Heart

May 24, 2026 · Artificial Intelligence

From High‑Scoring Agent to Reliable Employee: What Gaps Remain in Production?

The article examines how AI agent benchmarks, once focused on single‑answer quality, now emphasize task completion, tool use, and state maintenance, yet still miss critical production concerns such as pre‑deployment evaluation, runtime observability, safety, cost efficiency, and organizational metrics, as highlighted by reports from Galileo, Datadog, and Harness.io.

AI Agents · Benchmarking · Enterprise AI

0 likes · 8 min read

Xiaomi Tech

May 14, 2026 · Artificial Intelligence

500 M Videos Yield the Largest Open‑Source GUI Dataset; 3B Model Cuts Inference Tokens 71% and Beats Larger Models (Xiaomi AI at ICML 2026)

Xiaomi’s AI team extracted 5 billion video frames to create the world’s largest open‑source GUI dataset, demonstrated that a 3 B‑parameter model can reduce inference tokens by 71% while surpassing larger models, and presented a suite of ICML 2026 papers covering data scaling, benchmarking, reasoning, multimodal perception, and training stability for GUI agents and other AI tasks.

Benchmarking · GUI Agent · Large Language Model

0 likes · 21 min read

Linux Tech Enthusiast

May 14, 2026 · Operations

9 Visual Guides to Linux Performance Tuning Tools

The article presents nine diagrams that illustrate Linux performance tooling categories—including observability, static analysis, benchmarking, tuning, sar, perf-tools, tracing, and BPF tools—providing a quick visual reference for system engineers.

BPF · Benchmarking · Linux

0 likes · 2 min read

AI Engineer Programming

May 4, 2026 · Artificial Intelligence

RAG in the Long-Context Era: Challenges, Benchmarks, and Context Engineering

The article analyzes how expanding LLM context windows to millions of tokens reshapes Retrieval‑Augmented Generation, detailing chunking trade‑offs, embedding retrieval limits, the U‑shaped attention distribution, benchmark results, and the emerging practice of Context Engineering for optimal end‑to‑end pipelines.

Benchmarking · Embedding Retrieval · LLM

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 1, 2026 · Artificial Intelligence

Agentic Harness Engineering Enables Agents to Self‑Evolve and Outperform Codex in 10 Rounds

The Agentic Harness Engineering (AHE) framework lets coding agents automatically read massive execution traces, identify failure patterns, and iteratively modify harness components—prompt, tools, middleware, and memory—achieving a pass@1 increase from 69.7% to 77.0% and surpassing human‑tuned Codex‑CLI after ten automated evolution rounds.

Agentic Harness Engineering · Benchmarking · Observability

0 likes · 9 min read

Machine Learning Algorithms & Natural Language Processing

Apr 25, 2026 · Artificial Intelligence

GPT-5.5 Arrives: Faster, Stronger, Costlier—Nvidia Engineer Says Losing Access Feels Like Amputation

GPT-5.5, co‑designed with Nvidia hardware, breaks the traditional scaling‑law trade‑off by delivering higher intelligence while keeping token latency similar, achieves over 20% faster token generation, outperforms competitors across coding, knowledge‑work, and math benchmarks, and even proves new Ramsey‑number results verified by Lean.

Benchmarking · Codex · GPT-5.5

0 likes · 11 min read

Data Party THU

Apr 24, 2026 · Artificial Intelligence

OpenAI Unveils GPT‑Rosalind: A New AI Model for Accelerating Life‑Science Research

OpenAI introduced GPT‑Rosalind, a purpose‑built reasoning model for biology, drug discovery and translational medicine that streamlines evidence synthesis, hypothesis generation and experiment planning, and demonstrates leading performance on benchmarks such as BixBench and LABBench2 while offering free plugins that connect to over fifty scientific tools and data sources.

Benchmarking · BixBench · GPT-Rosalind

0 likes · 8 min read

AI Engineer Programming

Apr 23, 2026 · Artificial Intelligence

From Zero to One: A Roadmap for Building Trustworthy AI Agent Evaluations

The article outlines why rigorous, automated evaluation is essential for AI agents, defines core concepts such as tasks, trials, graders, and frameworks, compares code‑based, model‑based and human graders, and presents an eight‑step roadmap—from early testing to open‑source maintenance—to create reliable, scalable agent assessments.

AI Agents · Agent development · Benchmarking

0 likes · 22 min read

Qborfy AI

Apr 20, 2026 · Artificial Intelligence

How Harness Engineering Lifted LangChain Agents into the Top 5 on Terminal Bench 2.0

LangChain’s Harness Engineering framework tuned system prompts, tool selection, and middleware to turn a rank‑30 programming agent into a top‑5 performer on Terminal Bench 2.0, using trace‑driven analysis, inference‑sandwich scheduling, and context engineering without changing the underlying model.

AI Agents · Benchmarking · Harness Engineering

0 likes · 12 min read

SuanNi

Apr 18, 2026 · Artificial Intelligence

How GPT‑Rosalind Is Accelerating Drug Discovery with AI

OpenAI's GPT‑Rosalind model, designed for chemistry and genomics, demonstrates superior performance on scientific benchmarks, outperforms human experts, offers a rich plugin ecosystem, and implements strict access controls to help accelerate early-stage drug research while ensuring responsible AI use in life sciences.

AI Governance · Benchmarking · Large Language Model

0 likes · 10 min read

Architect's Must-Have

Apr 18, 2026 · Artificial Intelligence

Claude Opus 4.7 Unpacked: Engineering Boost, Vision Leap, and Safety Test

Claude Opus 4.7, Anthropic’s latest publicly released model, extends engineering intelligence with autonomous verification loops, triples visual resolution, and introduces layered safety deployment and new API controls. Benchmarked against GPT‑5.4 and Gemini 3.1, it delivers record SWE‑bench scores and comes with detailed real‑world security evaluations.

AI safety · API features · Benchmarking

0 likes · 36 min read

AI Engineer Programming

Apr 16, 2026 · Artificial Intelligence

Choosing the Right LLM: A Complete Guide to Selecting from Over 2 Million Models

With more than two million LLMs available, this guide explains how to evaluate functional capabilities, latency, throughput, cost, tool‑calling reliability, context‑window size and compliance, and presents a step‑by‑step framework for picking the most suitable model for each business scenario.

Benchmarking · LLM · Observability

0 likes · 25 min read

SuanNi

Apr 13, 2026 · Artificial Intelligence

How AI Researchers Built a 400% Better Multimodal Memory System with AutoResearchClaw

A fully automated AI research pipeline called AutoResearchClaw enabled a team from top universities to redesign a multimodal memory architecture, OMNIMEM, achieving over 400% performance gains on LoCoMo and Mem‑Gallery benchmarks by iteratively fixing code bugs, restructuring the system, and optimizing retrieval strategies.

AI research automation · AutoResearchClaw · Benchmarking

0 likes · 12 min read

SuanNi

Apr 12, 2026 · Artificial Intelligence

How TDM‑R1 Achieves 4‑Step Image Generation that Beats 80‑Step Models

Researchers from HKUST, CUHK and XiaoHongShu introduced TDM‑R1, a reinforcement‑learning‑based method that enables 4‑step diffusion image generation to surpass 80‑step models in speed, fidelity, and complex instruction adherence, as demonstrated on the GenEval benchmark and multiple quality metrics.

AI image synthesis · Benchmarking · Diffusion Models

0 likes · 9 min read

Old Zhang's AI Learning

Apr 8, 2026 · Artificial Intelligence

GLM‑5.1 Outperforms Claude Opus in Benchmarks – The Open‑Source LLM’s Edge

GLM‑5.1, the new 744 B‑parameter open‑source LLM from Zhipu, tops SWE‑Bench Pro with a score of 58.4, outpacing Claude Opus, GPT‑5.4 and Gemini, excels at long‑duration autonomous tasks, yet shows gaps in single‑turn generation and pure mathematical reasoning.

Agent Programming · Benchmarking · GLM-5.1

0 likes · 22 min read

Code Mala Tang

Mar 28, 2026 · Artificial Intelligence

How MiniMax M2.7 Achieves SOTA Agent Performance Through Self‑Evolving Loops

MiniMax M2.7 is a self‑evolving LLM that combines a persistent Agent Harness, multi‑level memory, and autonomous improvement cycles to reach SOTA benchmark scores, cost efficiency, and real‑world software‑engineering capabilities, illustrating the emerging skill‑economy of agent ecosystems.

Benchmarking · Self-Improving Models · agent architecture

0 likes · 13 min read

PaperAgent

Mar 26, 2026 · Artificial Intelligence

TurboQuant: How Google’s New Vector Quantization Cuts KV Memory 6× and Boosts Speed

TurboQuant, presented at ICLR 2026, introduces a theoretically grounded vector quantization technique that reduces large‑language‑model key‑value cache memory by at least six times, achieves up to eight‑fold speedups, and maintains zero accuracy loss by combining PolarQuant’s polar‑coordinate compression with a 1‑bit QJL error‑correction step, as demonstrated on benchmarks such as LongBench and GloVe.

AI inference · Benchmarking · Memory compression

0 likes · 10 min read

DataFunTalk

Mar 16, 2026 · Artificial Intelligence

Unlocking Anthropic’s Skill‑Creator: New Evaluation, Benchmarking, and Parallel Testing Features

The article explains Anthropic’s latest Skill‑Creator update, which adds an evaluation system, benchmark testing, parallel agent execution, and description optimization, and demonstrates how these capabilities dramatically improve skill reliability, trigger accuracy, and overall performance through concrete examples and quantitative results.

AI Agents · Anthropic · Benchmarking

0 likes · 13 min read

Data Party THU

Mar 12, 2026 · Artificial Intelligence

Can a 30B LLM Truly Conduct Autonomous Scientific Research? Inside UniScientist

UniScientist, a 30‑billion‑parameter open‑source model from UniPat AI, demonstrates a closed‑loop scientific research workflow—generating hypotheses, gathering evidence, performing reproducible derivations, and iteratively refining conclusions—while achieving benchmark scores comparable to much larger proprietary systems across multiple scientific evaluation suites.

Benchmarking · Large Language Model · Scientific research

0 likes · 10 min read

AI Engineering

Mar 6, 2026 · Artificial Intelligence

Anthropic Adds a Full Evaluation Framework to Skill Creator

Anthropic's latest Skill Creator update introduces a code‑free evaluation framework that lets non‑engineer skill authors run tests, benchmark regressions, and optimize trigger descriptions, while supporting parallel multi‑agent execution and A/B comparisons to keep skills reliable as models evolve.

AI evaluation · Anthropic · Benchmarking

0 likes · 8 min read

Woodpecker Software Testing

Mar 1, 2026 · Artificial Intelligence

Optimizing RAG System Performance: A Practical Testing Guide

The article presents a systematic framework for testing and optimizing Retrieval‑Augmented Generation (RAG) systems, detailing performance‑sensitive bottlenecks, a three‑dimensional test matrix, real‑world case studies, and test‑driven engineering practices to ensure stable, fast, and accurate AI services.

AI · Benchmarking · Observability

0 likes · 9 min read

Alibaba Cloud Infrastructure

Feb 23, 2026 · Cloud Native

Deploying Qwen 3.5 Multimodal Model on Alibaba Cloud ACK with RoleBasedGroup

This guide details how to deploy the open‑source Qwen 3.5‑397B‑A17B multimodal LLM on Alibaba Cloud ACK using the RoleBasedGroup (RBG) engine, covering model preparation, Kubernetes resources, role‑based orchestration, performance tuning, and benchmark testing.

Benchmarking · Cloud Native AI · Qwen3.5

0 likes · 24 min read

Open Source Tech Hub

Feb 12, 2026 · Artificial Intelligence

How GLM-5 Advances AI with Bigger Scale, Sparse Attention, and Agent Capabilities

GLM-5, a new large language model with 744 B parameters and 28.5 T tokens of training data, introduces DeepSeek sparse attention and an asynchronous RL system called slime, delivering strong benchmark gains on complex system engineering and long‑horizon agent tasks and surpassing many open‑source competitors.

AI · Benchmarking · GLM-5

0 likes · 6 min read

Old Zhang's AI Learning

Feb 6, 2026 · Artificial Intelligence

GPT-5.3 Codex vs Claude Opus 4.6: Late‑Night Showdown for the Programming Champion

Anthropic and OpenAI released Claude Opus 4.6 and GPT‑5.3‑Codex within minutes, prompting a detailed side‑by‑side analysis of their programming abilities, long‑context windows, agentic features, benchmark scores, pricing, and real‑world use‑case recommendations.

AI model comparison · Agentic AI · Benchmarking

0 likes · 12 min read

Java Tech Enthusiast

Feb 4, 2026 · Artificial Intelligence

Claude Sonnet 5 (Fennec) – The Next‑Gen Coding LLM Set to Outperform All Rivals

Claude Sonnet 5, codenamed Fennec, is about to launch on Google’s infrastructure with a 1‑million‑token context window, a price half that of Opus 4.5, and benchmark scores surpassing 80.9% on SWE‑Bench, while introducing an autonomous “Dev Team” swarm that can generate, test, and deliver full software modules without human intervention.

Benchmarking · Multi-Agent Systems · model release

0 likes · 9 min read

Baidu Intelligent Cloud Tech Hub

Jan 20, 2026 · Artificial Intelligence

How LoongFlow Enables Expert‑Level AI Agents to Outperform Human Mathematicians

LoongFlow is an open‑source AI agent framework that combines a Plan‑Execute‑Summarize (PES) paradigm with a Hybrid Evolutionary Memory system, allowing agents to perform directed, iterative problem solving and achieve state‑of‑the‑art results on mathematical challenges, Kaggle‑style benchmarks, and real‑world tasks with dramatically higher efficiency.

Benchmarking · Evolutionary Algorithms · LoongFlow

0 likes · 15 min read

Amazon Cloud Developers

Dec 23, 2025 · Artificial Intelligence

Evaluating Agent Quality: A Practical Guide for Agentic AI

This article explains why evaluating AI agents is essential, outlines a multi‑dimensional metric system covering performance, safety, cost and bias, describes common evaluation frameworks such as AgentBoard, AgentBench and τ‑bench, and provides step‑by‑step instructions, example datasets and code for building a robust agent assessment pipeline.

AI Agents · Agent evaluation · Benchmarking

0 likes · 35 min read

NetEase LeiHuo Testing Center

Dec 12, 2025 · User Experience Design

Unlock Better Game UX: A Lazy‑Person’s Guide to QA, Personas, and Benchmarking

This article presents a step‑by‑step, low‑effort framework for improving game product quality by combining QA‑focused requirement analysis, detailed user personas, role‑mapping, systematic benchmark research, and realistic scenario validation, enabling designers to uncover hidden pain points, prioritize core user needs, and generate actionable design insights without excessive overhead.

Benchmarking · QA · Scenario Testing

0 likes · 32 min read

Bighead's Algorithm Notes

Dec 5, 2025 · Artificial Intelligence

Quantitative Finance Paper Summaries (Nov 29–Dec 5 2025)

This article presents concise summaries of five recent AI‑driven finance papers, covering a stress‑testing framework for LLM trading agents, an orchestration framework for financial agents, an event‑reflection memory model for stock forecasting, a hybrid LLM‑Bayesian network architecture for options wheel strategies, and their experimental results.

Benchmarking · LLM · Risk analysis

0 likes · 12 min read

JD Tech Talk

Nov 3, 2025 · Artificial Intelligence

How JoyCode Agent Achieves 74.6% Pass@1 on SWE‑bench Verified with Patch‑Test Co‑generation

JoyCode Agent reaches a 74.6% pass rate on the authoritative SWE‑bench Verified benchmark, ranking in the global top‑3, and is now open‑source, showcasing a high‑efficiency, test‑driven, iterative approach to automated code repair that dramatically reduces token consumption while improving success rates.

Automated Code Repair · Benchmarking · SWE‑Bench

0 likes · 44 min read

Data Party THU

Oct 24, 2025 · Artificial Intelligence

How 78 Samples Outperform 10,000: The LIMI Breakthrough in Agent AI

The paper introduces the LIMI framework, which achieves state‑of‑the‑art agent performance on AgencyBench using only 78 carefully crafted samples—outperforming baseline models trained on thousands of examples—by focusing on high‑quality, strategic data construction and demonstrating superior generalization across code, research, and tool‑use tasks.

AgencyBench · Agent AI · Benchmarking

0 likes · 11 min read

Meituan Technology Team

Oct 15, 2025 · Artificial Intelligence

What’s New in Large Model Research? Top Meituan AI Papers Up to Oct 2025

This curated list showcases Meituan’s latest large‑model breakthroughs and academic papers up to October 2025, spanning LLM system optimizations, multimodal generation, evaluation benchmarks, quantization techniques, and reinforcement‑learning‑driven improvements, offering researchers valuable insights and resources across the AI landscape.

AI research · Benchmarking · Multimodal AI

0 likes · 10 min read

AI2ML AI to Machine Learning

Sep 24, 2025 · Artificial Intelligence

Key Points for Evaluating AI Agents

The article explains how Coze's Compass introduces a flexible evaluation system for AI agents, outlines a four‑dimensional submodule assessment (planning, tool use, self‑reflection, memory), and details specific testing criteria and challenges for web, scientific, dialogue, and programming agents.

AI Agents · Benchmarking · Coze

0 likes · 6 min read

DataFunTalk

Sep 1, 2025 · Artificial Intelligence

Unlocking 560B‑Parameter AI: Inside LongCat‑Flash‑Chat’s Zero‑Computation MoE

LongCat‑Flash‑Chat, a 560‑billion‑parameter Mixture‑of‑Experts model with Zero‑Computation Experts, delivers top‑tier benchmark scores and fast inference while activating only a fraction of its parameters, and is fully open‑sourced with easy deployment scripts.

Benchmarking · Large Language Model · Mixture of Experts

0 likes · 6 min read

Architects' Tech Alliance

Jul 22, 2025 · Fundamentals

Understanding Computer Performance Metrics: A Deep Dive into SPEC CPU Benchmarks

This article explains the key performance evaluation metrics for computer systems, illustrates them with real‑world benchmark results from the Phoronix Test Suite, and provides a comprehensive overview of the SPEC CPU benchmark suites (2000, 2006, 2017) and how their scores are calculated.

Benchmarking · Computer Architecture · Hardware Evaluation

0 likes · 20 min read

Alibaba Cloud Developer

Jun 23, 2025 · Artificial Intelligence

How to Systematically Conduct Large Model Evaluation in Real-World Scenarios

This guide walks readers through a complete, business‑oriented workflow for evaluating large language models—from requirement analysis and test‑set design to metric definition, execution, result aggregation, and report generation—while addressing common challenges such as data imbalance, annotation quality, and automation.

AI evaluation · Benchmarking · Reporting

0 likes · 24 min read

ShiZhen AI

May 23, 2025 · Artificial Intelligence

Claude 4 and Claude Code Released – Anthropic API Adds Four Powerful New Features

Anthropic unveiled Claude Opus 4 and Claude Sonnet 4, the strongest coding models to date, along with detailed benchmark results, new memory and tool‑use capabilities, Claude Code IDE extensions, and four new API features that together expand AI agent development.

AI Agents · API · Anthropic

0 likes · 13 min read

Architects' Tech Alliance

May 20, 2025 · Industry Insights

What Do GPU Core Specs Really Mean? A Deep Dive into Modern GPU Performance

This article provides a comprehensive analysis of GPU core parameters—including compute units, memory systems, floating‑point performance, power consumption, and manufacturing process—while comparing leading international and domestic GPU products to help readers choose the right accelerator for AI, HPC, or graphics workloads.

AI · Benchmarking · GPU

0 likes · 19 min read

Python Programming Learning Circle

May 15, 2025 · Fundamentals

Benchmarking Python 3.11 Performance Against C++ Using Monte Carlo Pi Estimation

This article benchmarks Python 3.11's speed with a Monte Carlo Pi estimation script, compares it to earlier Python releases and a C++ implementation, shows Docker‑based testing methodology, presents performance results, and extrapolates when Python might overtake C++ in execution speed.

Benchmarking · Monte Carlo · Python

0 likes · 9 min read

AI Frontier Lectures

Apr 19, 2025 · Artificial Intelligence

Why Recent AI Model Gains May Be Illusory: Benchmark Gaps and Real‑World Limits

The author argues that since August 2023 AI large‑model improvements have stalled in practical applications, with benchmark scores diverging from user experience, citing security‑scanning experiments, possible benchmark gaming, and alignment bottlenecks that undermine confidence in claimed progress.

AI · Benchmarking · Industry insight

0 likes · 13 min read

FunTester

Apr 14, 2025 · Backend Development

Common Mistakes in Go Unit Testing and How to Avoid Them

This article examines nine frequent errors developers make when writing Go unit tests—such as improper test classification, neglecting the race detector, ignoring parallel and shuffle flags, avoiding table‑driven tests, using sleep, mishandling time APIs, overlooking httptest/iotest, misusing benchmarks, and skipping fuzz testing—providing analysis and concrete code‑based solutions to improve test reliability and efficiency.

Benchmarking · Go · Unit Testing

0 likes · 11 min read

Python Programming Learning Circle

Apr 9, 2025 · Fundamentals

Python Code Optimization Techniques for Faster Execution

This article presents a comprehensive collection of Python performance‑boosting techniques, covering fundamental optimization principles, avoiding global variables and attribute access, eliminating unnecessary abstraction and data copying, loop optimizations, just‑in‑time compilation with numba, and selecting appropriate built‑in data structures to achieve significant speed improvements.

Benchmarking · Optimization · Python

0 likes · 15 min read

DataFunTalk

Apr 8, 2025 · Artificial Intelligence

Meta AI VP Responds to Llama 4 Controversies and Allegations of Benchmark Manipulation

Meta AI Vice President Ahmad Al‑Dahle addressed recent criticisms of the newly released Llama 4 model, denying claims of test‑set cheating, explaining quality variations as post‑release optimization, and acknowledging internal concerns that led to staff resignations and calls for transparency.

Benchmarking · Llama 4 · Meta AI

0 likes · 5 min read

Xiaokun's Architecture Exploration Notes

Apr 6, 2025 · Operations

Mastering Performance Testing: Why It Matters and How to Use wrk Effectively

This article explains what performance testing is, why it is essential for reliable systems, outlines practical steps for conducting effective tests, and introduces the wrk benchmarking tool as a lightweight solution for generating realistic load and measuring key performance metrics.

Benchmarking · Operations · load testing

0 likes · 2 min read

Linux Code Review Hub

Apr 5, 2025 · Operations

Inside Linux Perf: How the Kernel’s Powerful Tracing Tool Works

The article introduces Linux’s built‑in performance analysis tool perf, explains its event‑driven sampling, tracing and profiling capabilities, shows how to install it on various distributions, demonstrates common commands with real code examples, and discusses practical scenarios for locating and optimizing kernel and application performance issues.

Benchmarking · System Optimization · flamegraph

0 likes · 36 min read

Su San Talks Tech

Mar 20, 2025 · Backend Development

How to Crush the One Billion Row Challenge: Java Performance Secrets Revealed

This article walks through the One Billion Row Challenge, explaining the problem, baseline Java solution, and a series of deep performance optimizations—from parallel streams and custom hash tables to unsafe memory access and SIMD techniques—that shrink execution time from minutes to under two seconds.

Benchmarking · Java · Large Data Processing

0 likes · 21 min read

DataFunSummit

Feb 25, 2025 · Artificial Intelligence

Tiny‑R1‑32B‑Preview: A 5% Parameter Model Matching Deepseek‑R1‑671B Performance

On February 24, 2025, 360 and Peking University unveiled Tiny‑R1‑32B‑Preview, a medium‑scale inference model that has only about 5% of the parameters of the 671‑billion‑parameter Deepseek‑R1 yet achieves comparable performance, with leading results on math, programming, and scientific benchmarks.

AI model · Benchmarking · Open-source AI

0 likes · 7 min read

Software Engineering 3.0 Era

Feb 19, 2025 · Artificial Intelligence

Three Breakthroughs in AI Inference Models: 1% Data for 99% Performance and More

The article reviews three recent AI inference model advances—open‑source models surpassing OpenAI, the LIMO approach that gains 99% performance with just 1% of the data, and the CoAT framework that combines Monte‑Carlo tree search with associative memory to enable iterative, self‑correcting reasoning.

AI inference · Benchmarking · CoAT

0 likes · 7 min read

Liangxu Linux

Jan 14, 2025 · Fundamentals

How to Measure Execution Time in C with time(), clock() and gettimeofday()

This guide shows how to benchmark C code by measuring elapsed time using the standard time() function for second‑level precision, clock() for higher CPU‑time accuracy, and gettimeofday() for microsecond‑level resolution, including complete example programs and key considerations.

Benchmarking · C# · clock

0 likes · 4 min read

Baobao Algorithm Notes

Jan 11, 2025 · Artificial Intelligence

Why Phi‑4’s 14B Model Outperforms GPT‑4 on STEM and Reasoning Tasks

Microsoft Research’s Phi‑4 model, a 14‑billion‑parameter LLM, leverages extensive synthetic data, advanced tokenization, and a two‑stage training pipeline to achieve superior performance on STEM question answering, long‑context reasoning, and safety benchmarks, rivaling larger models like GPT‑4.

AI safety · Benchmarking · Phi-4

0 likes · 15 min read

BirdNest Tech Talk

Nov 26, 2024 · Industry Insights

Which Language Wins the 1 Billion Loop Benchmark? C, Rust, and Zig Lead

Ben Dicken benchmarked a double‑nested loop of 10 000 × 100 000 iterations across dozens of languages, publishing the source code and fastest‑run results that show C, Zig and Rust consistently topping the performance chart, illustrated with an animated speed comparison.

Benchmarking · C# · Performance Benchmark

0 likes · 7 min read

FunTester

Oct 14, 2024 · Backend Development

Mastering Go Benchmarking: A Practical Guide to Performance Testing

This article introduces Go's benchmarking framework, explains its purpose and best practices, provides step‑by‑step code examples for measuring string concatenation performance, shows how to run benchmarks from the command line, and teaches how to interpret the detailed test reports.

Benchmarking · Go · backend

0 likes · 11 min read

Test Development Learning Exchange

Oct 11, 2024 · Fundamentals

Fundamentals of Performance Testing: Concepts, Metrics, Tools, and Best Practices

This article provides a comprehensive overview of performance testing fundamentals, covering core concepts, key metrics, common testing tools, test design, load generation, result analysis, bottleneck identification, optimization techniques, cloud and micro‑service testing, monitoring, reporting, challenges, and cost‑benefit considerations.

Benchmarking · Monitoring · Optimization

0 likes · 12 min read

NewBeeNLP

Oct 11, 2024 · Artificial Intelligence

Inside Llama 3: Training, Architecture, and Performance Secrets

An extensive review of Meta’s Llama 3 model breaks down its pre‑training data pipeline, scaling laws, architectural tweaks like GQA and RoPE, post‑training methods such as SFT, DPO, and reward modeling, and evaluates benchmark results, offering practical insights for researchers and engineers building large language models.

Benchmarking · Llama 3 · Quantization

0 likes · 32 min read

21CTO

Aug 26, 2024 · Fundamentals

Which Programming Languages Use the Least Power? Findings from a 2017 Study

A 2017 study by six Portuguese researchers compared the energy consumption, execution time, and memory usage of 27 programming languages across ten benchmark problems, revealing that faster languages aren't always more energy‑efficient and that compiled languages generally outperform interpreted ones in both speed and power usage.

Benchmarking · compiled languages · energy efficiency

0 likes · 9 min read

NewBeeNLP

Jul 31, 2024 · Artificial Intelligence

How Continual Pre‑Training Boosts Llama‑3’s Chinese and Scientific Reasoning

This report presents a continual pre‑training approach that significantly enhances Llama‑3 (8B)’s Chinese language proficiency and scientific reasoning by using a carefully mixed corpus of existing and synthetic data, detailing the bilingual adaptation and synthetic‑enhancement stages, data‑mixing and curriculum strategies, and demonstrating strong results across multilingual and scientific benchmarks without sacrificing original capabilities.

Benchmarking · LLM · Llama-3

0 likes · 9 min read

Python Programming Learning Circle

Jul 17, 2024 · Fundamentals

Simple Techniques to Speed Up Python For Loops by Up to 970×

This article demonstrates a collection of straightforward Python performance tricks, including list comprehensions, external length calculation, set usage, loop skipping, code inlining, generators, map(), memoization, vectorization, filterfalse, and string joining, whose measured speed‑ups over plain for‑loops range from a modest 1.3× to a dramatic 970×, with detailed benchmark results and code examples.

Benchmarking · Loops · Optimization

0 likes · 15 min read

php Courses

Jun 25, 2024 · Backend Development

Improving PHP Performance with OPcache: Benchmarks, Configuration, and Deployment Strategies

This article examines how enabling and tuning OPcache can dramatically boost PHP request throughput, presents benchmark results before and after optimization, discusses configuration trade‑offs, and outlines safe deployment and cache‑clearing strategies for high‑traffic backend systems.

Benchmarking · Caching · OPcache

0 likes · 8 min read

Spring Full-Stack Practical Cases

Jun 25, 2024 · Backend Development

Master Java Performance Testing with JMH: From Setup to Advanced Benchmarks

This article introduces JMH, explains why it is preferable to simple timing loops or other tools, and provides step‑by‑step Maven setup, benchmark creation, execution, and advanced annotations such as @Warmup, @Fork, @Setup, Blackhole usage, and Spring Boot integration for accurate Java micro‑benchmarking.

Benchmarking · JMH · maven

0 likes · 12 min read

Ops Development & AI Practice

Apr 16, 2024 · Backend Development

Master Go Benchmarking: Write, Run, and Analyze Performance Tests

This guide explains how to create Go benchmark tests using the standard library, run them with the appropriate go test flags, and interpret the detailed output to identify performance bottlenecks and optimize code effectively.

Backend Development · Benchmarking · Go

0 likes · 4 min read

FunTester

Apr 15, 2024 · Fundamentals

Using JMH to Benchmark GUID Generation Strategies in Java

This article introduces JMH, explains its key features, and presents microbenchmark results comparing thread‑exclusive, thread‑shared, Snowflake, UUID, and Snowflake‑algorithm GUID generation methods under various thread counts, accompanied by the full Java test code.

Benchmarking · GUID · JMH

0 likes · 8 min read

Architects' Tech Alliance

Feb 6, 2024 · Industry Insights

How to Evaluate Data Center Compute Power: From Supercomputer Benchmarks to PUE

This article explains the concept of data‑center compute power, reviews mature evaluation methods such as TOP500/FLOPS for supercomputers and SPEC CPU, SPECpower, and MLPerf for conventional servers, introduces the PUE efficiency metric, and summarizes the four core components that together define a data‑center's computing capability.

Benchmarking · Data Center · HPC

0 likes · 9 min read

Architects' Tech Alliance

Aug 11, 2023 · Fundamentals

Survey of General CPU Performance Benchmarking and Emerging Trends (2023)

This article reviews the evolution of mainstream CPU performance benchmarks such as SPEC and TPC, compares their methodologies and tools, discusses challenges in evaluating heterogeneous CPUs, and outlines future research directions, providing a comprehensive overview for researchers and practitioners.

Benchmarking · CPU · Spec

0 likes · 10 min read

ITPUB

Jul 31, 2023 · Databases

How to Choose the Right Database: Key Steps for Successful Selection

This guide walks you through the essential stages of database selection—from assessing project requirements and comparing candidate systems to performance testing, long‑term impact analysis, and making the final decision—ensuring you pick a solution that fits both current and future needs.

Benchmarking · Long-term Planning · NoSQL

0 likes · 10 min read

StarRocks

Apr 23, 2023 · Databases

Why Query Performance Optimization Matters and How to Master It

This guide explains the importance of query performance optimization for database products and engineers, outlines latency and throughput goals, shows how to locate bottlenecks with observability tools and Linux profilers, and provides practical high‑level and low‑level optimization techniques along with testing best practices.

Benchmarking · CPU profiling · Query Optimization

0 likes · 16 min read

Top Architect

Apr 19, 2023 · Backend Development

Using JMH for Java Microbenchmarking: Demo Project, Annotations and Result Analysis

This article explains why simple timing is unreliable in Java due to JIT compilation, introduces the official JMH tool for microbenchmarking, outlines best‑practice tips, demonstrates how to set up a Maven project, write benchmark code, run tests, interpret results, and details each JMH annotation.

Benchmarking · JMH · Java

0 likes · 14 min read

Ziru Technology

Mar 31, 2023 · Backend Development

Master Java Performance Testing with JMH: From Basics to Advanced Benchmarks

This article explains what benchmarking is, introduces the JMH framework for Java, shows how to add JMH dependencies, walks through a simple "Hello JMH" example with full source and output, demonstrates using JMH with Spring Boot, details the most important options and annotations, and highlights common pitfalls to avoid when writing reliable micro‑benchmarks.

Benchmarking · JMH · Java

0 likes · 17 min read

Alibaba Cloud Developer

Mar 6, 2023 · Backend Development

Master Go Performance: Practical Optimization Tips, Tools, and Real-World Cases

This comprehensive guide walks Go developers through performance tuning fundamentals, recommended profiling tools, code-level optimizations, and real-world case studies, offering actionable insights to measure, diagnose, and improve CPU, memory, and concurrency efficiency in high‑throughput Go services.

Benchmarking · Go · Memory Management

0 likes · 41 min read

Architect's Tech Stack

Feb 16, 2023 · Backend Development

A Comprehensive Guide to Java Microbenchmarking with JMH

This article introduces Java Microbenchmark Harness (JMH), explains why warm‑up is necessary, details common annotations, shows how to set up a Maven project, provides a complete benchmark example comparing LinkedList iteration methods, and demonstrates how to run and interpret the results.

Benchmarking · JMH · Java

0 likes · 13 min read

Top Architect

Nov 19, 2022 · Operations

Guidelines for Sizing and Benchmarking Elasticsearch Clusters

This article provides a comprehensive guide on allocating hardware resources, calculating cluster size based on data volume, and conducting index and search benchmark tests for Elasticsearch, offering practical formulas, test configurations, and performance conclusions to help engineers design stable, high‑throughput clusters.

Benchmarking · Cluster Sizing · performance testing

0 likes · 12 min read

Architecture Digest

Oct 21, 2022 · Operations

Benchmarking and Sizing Your Elasticsearch Cluster for Logs and Metrics

This article explains how to assess hardware resources, calculate required Elasticsearch cluster size based on data volume, and perform indexing and search benchmark tests to ensure stable performance and optimal throughput for log and metric workloads in production environments.

Benchmarking · Cluster Sizing · Elasticsearch

0 likes · 10 min read

dbaplus Community

Oct 19, 2022 · Operations

How to Size and Benchmark Your Elasticsearch Cluster for Logs and Metrics

This guide explains how to allocate hardware resources, calculate Elasticsearch cluster size based on data volume, and conduct indexing and search benchmarks using Rally to ensure production‑grade performance and capacity planning.

Benchmarking · Cluster Sizing · Elasticsearch

0 likes · 12 min read

Top Architect

Oct 9, 2022 · Backend Development

JMH – Java Microbenchmark Harness: Introduction, Demo, and Annotation Guide

This article introduces JMH, the official Java microbenchmarking tool, explains why warm‑up is needed, shows how to build a Maven project, provides a complete LinkedList iteration benchmark example, demonstrates common JMH annotations, and outlines how to run and interpret benchmark results.

Benchmarking · JMH · Java

0 likes · 16 min read

FunTester

Jun 24, 2022 · Operations

Performance Testing Resource Collection

A comprehensive catalog of performance testing articles ranging from Linux monitoring tools and test frameworks to concurrency utilities, distributed load testing strategies, QPS modeling and language-specific benchmark studies, providing developers with practical insights and techniques for optimizing system performance.

Benchmarking · load testing · performance testing

0 likes · 6 min read

DataFunTalk

Jun 17, 2022 · Artificial Intelligence

Issues with Recommender System Benchmarks and Insights from the BARS Paper

This article examines the shortcomings of current recommender system benchmarks, explains why standardized datasets and metrics are essential, and highlights key findings from the recent BARS paper that propose a more open and reproducible benchmarking framework for recommendation research.

AI · BARS · Benchmarking

0 likes · 6 min read

MaGe Linux Operations

Jun 3, 2022 · Backend Development

How Rewriting Hasura Storage in Go Boosted Performance Fivefold

The Hasura Storage team rewrote their Node.js service in Go and ran k6 benchmarks: the Go version handled up to five times as many requests, used half the memory, and delivered significantly lower response times across multiple download scenarios, demonstrating the scalability benefits of Go for backend services.

Benchmarking · backend · golang

0 likes · 6 min read

FunTester

Feb 27, 2022 · Operations

Performance Testing Articles Collection (Chinese Resources)

This collection compiles dozens of Chinese articles on performance testing, covering tools, frameworks, case studies, and techniques such as netdata monitoring, load generators, concurrency utilities, distributed testing, QPS modeling, and comparisons of JMeter, K6, Gatling, and FunTester.

Benchmarking · Operations · load testing

0 likes · 8 min read

Top Architect

Feb 20, 2022 · Backend Development

Using JMH for Java Microbenchmarking: Demo, Annotations, and Best Practices

This article introduces Java Microbenchmark Harness (JMH), explains why warm‑up is needed, shows how to build a benchmark project with Maven, provides a complete LinkedList iteration benchmark example with all relevant JMH annotations, demonstrates execution commands, and interprets the resulting performance reports.

Benchmarking · JMH · Java

0 likes · 13 min read

Java Interview Crash Guide

Feb 18, 2022 · Backend Development

Master Java Microbenchmarking with JMH: From Setup to Results

This article explains how to use JMH for precise Java micro‑benchmarks, covering JVM warm‑up, project setup with Maven, writing benchmark methods, configuring annotations, running tests, interpreting results, and provides practical code examples and tips for reliable performance measurement.

Benchmarking · JMH · JVM

0 likes · 13 min read

Alimama Tech

Feb 16, 2022 · Big Data

Target Group Discovery: Framework, Models, and Case Study

The article presents a target‑group discovery framework that combines goal definition, rule‑ or model‑based segmentation, tiered metrics, benchmarking and quadrant analysis to identify and characterize advantageous, problematic, or weak consumer, product, or merchant sub‑groups, illustrated by a FMCG e‑commerce case study diagnosing high‑share, low‑growth categories.

Benchmarking · Big Data · Marketing Analytics

0 likes · 13 min read

Python Programming Learning Circle

Dec 1, 2021 · Fundamentals

The Fastest Way to Loop in Python: Using Built‑in Functions and Formulas Instead of While/For Loops

This article benchmarks Python while and for loops, shows that for loops are faster due to fewer operations, demonstrates how built‑in functions like sum and direct arithmetic formulas can achieve orders‑of‑magnitude speedups, and concludes that the quickest way to "loop" in Python is to avoid loops altogether.

Algorithmic Efficiency · Benchmarking · Optimization

0 likes · 8 min read

Laravel Tech Community

Nov 1, 2021 · Databases

Vitess 12 Release: New Gen4 Planner, VTAdmin Enhancements, RBAC, and Benchmarking Improvements

Vitess 12, the latest major release of the MySQL clustering solution, introduces the experimental Gen4 query planner, enhanced VTAdmin multi‑cluster management, role‑based access control, updated benchmarking tools, and more inclusive naming conventions to improve scalability and cloud deployment.

Benchmarking · Query Planner · RBAC

0 likes · 5 min read

Python Programming Learning Circle

Aug 11, 2021 · Databases

Generating One Billion SQLite Rows in Under a Minute: Python, PyPy, and Rust Performance Comparison

A programmer needed to create a billion‑row SQLite test database within a minute, found a naïve Python script unbearably slow, applied batch inserts and SQLite PRAGMA tweaks, then compared CPython, PyPy, and Rust implementations, ultimately achieving sub‑minute runtimes with Rust and highlighting best‑practice optimizations.

Benchmarking · PyPy · Python

0 likes · 6 min read

Python Programming Learning Circle

Aug 9, 2021 · Backend Development

How to Choose the Fastest JSON Library for Python: A Practical Benchmarking Guide

This article explains a systematic process for evaluating and selecting the most suitable high‑performance JSON library for Python, covering the need assessment, benchmark definition, filtering by additional requirements, and detailed benchmark results that highlight orjson as the fastest option for small‑message encoding while discussing trade‑offs such as safety, customizability, and ecosystem support.

Benchmarking · Python · RapidJSON

0 likes · 5 min read

FunTester

Jul 27, 2021 · Operations

How I Boosted FunTester QPS by 14% and Halved Memory Usage

After a weekend of code refactoring, asynchronous processing, and removing unnecessary statistics, the author increased FunTester's QPS from 104,375 to 118,904 (≈13.9% gain), reduced memory consumption by over 57%, and documented detailed performance impacts of various optimizations with code samples and benchmark tables.

Benchmarking · FunTester · Java

0 likes · 13 min read

Python Programming Learning Circle

May 7, 2021 · Fundamentals

Why Python Is Perceived as Slow and How to Make It Faster

The article explains that Python’s reputation for slowness stems more from algorithmic choices and costly import patterns than the language itself, and it offers practical measurements, tooling insights, and optimization suggestions to improve Python’s performance in real‑world projects.

Benchmarking · Optimization · imports

0 likes · 8 min read

Aikesheng Open Source Community

Apr 30, 2021 · Databases

Why mysqlslap Shows Smoother Results Than sysbench for SQL Performance Testing

The article explains that mysqlslap appears to produce smoother latency results than sysbench because mysqlslap reports metrics per test round rather than per individual SQL statement, leading to mis‑interpretation of Max/Avg/Min values, while sysbench can reveal per‑statement latency variations.

Benchmarking · SQL latency · Sysbench

0 likes · 4 min read
