Tagged articles

performance benchmarking

31 articles · Page 1 of 1

Jun 17, 2026 · Artificial Intelligence

Model Quantization: INT8, INT4, and AWQ/GPTQ – Choosing the Right Compression for Production

This article explains how INT8, INT4, bitsandbytes, GPTQ, and AWQ quantization methods can dramatically cut memory usage, boost inference speed, and lower costs for large language models, while detailing their trade‑offs, practical workflows, benchmark results, and common pitfalls to help engineers decide which technique best fits their production scenario.

AWQ · GPTQ · INT4

0 likes · 22 min read

Geek Labs

Jun 9, 2026 · Artificial Intelligence

Why Rapid-MLX Is the Fastest Local AI Engine for Apple Silicon (4.2× Faster Than Ollama)

Rapid-MLX leverages Apple’s MLX framework and optimizations such as model caching and reasoning separation to deliver up to 4.2× faster token throughput than Ollama on Apple Silicon Macs. It also offers a lightweight 460 MB install, a full OpenAI‑compatible API, tool calling, prompt caching, and easy setup via Homebrew or pip.

Apple Silicon · OpenAI compatibility · Rapid-MLX

0 likes · 6 min read

Architects' Tech Alliance

Jun 1, 2026 · Industry Insights

Intel’s World‑First 1.8nm Data‑Center CPU Packs 288 Cores – A Performance Leap

Intel unveiled the world’s first 1.8nm data‑center CPU, the Xeon 6+ with 288 cores, leveraging RibbonFET, PowerVia and 3D chiplet stacking to achieve up to 2.26× higher performance and 55% better performance‑per‑watt than the previous generation, while adding SGX/TDX security, alongside 200 GbE networking and a new AI‑focused GPU.

1.8nm · AI inference · Intel

0 likes · 10 min read

Old Zhang's AI Learning

May 13, 2026 · Artificial Intelligence

Why vLLM Now Leads Open‑Source LLM Inference Benchmarks

vLLM tops the Artificial Analysis ranking by delivering the highest throughput for DeepSeek V3.2, Qwen 3.5 397B, and MiniMax‑M2.5 on identical NVIDIA Blackwell Ultra hardware, thanks to extensive kernel‑fusion optimizations that remain in the main branch.

DeepSeek · LLM Inference · Qwen

0 likes · 7 min read

Machine Heart

May 9, 2026 · Artificial Intelligence

Can QuantClaw Cut OpenClaw Costs by 21% and Speed Up Inference by 15%?

QuantClaw, an open‑source plug‑in for the OpenClaw AI agent framework, uses a systematic quantization study to dynamically route tasks to appropriate model precisions, achieving up to 21% cost reduction, 8‑15% latency improvement, and even higher task scores across diverse workloads.

AI Agents · Model Quantization · OpenClaw

0 likes · 8 min read

Code Mala Tang

Apr 28, 2026 · Backend Development

Redis No Longer Dominates: Discover the Best Python Caching Alternatives

A benchmark of Redis, Memcached, DragonflyDB, and Cashews using the same FastAPI workload reveals that Redis falls behind on latency, throughput, and memory efficiency, while DragonflyDB and Cashews offer superior performance and developer experience for Python caching.

Caching · Cashews · DragonflyDB

0 likes · 11 min read

PaperAgent

Apr 5, 2026 · Artificial Intelligence

Can AI Make Code Faster? Problem‑Oriented Optimization and Anchor Verification Breakthrough

A recent ICLR 2026 study from Zhejiang University, Ant Group, and Stony Brook introduces a problem‑oriented dataset and an anchor‑verification framework that enable large language models to not only generate correct code but also significantly improve its execution speed, achieving up to six‑fold acceleration while maintaining high correctness.

AI code generation · anchor verification · code optimization

0 likes · 8 min read

Old Zhang's AI Learning

Feb 24, 2026 · Industry Insights

How Taalas HC1 Embeds Llama 3.1 8B in Silicon to Achieve 17k tokens/s

Taalas embeds the Llama 3.1 8B model directly into a 6nm ASIC, delivering 17,000 tokens per second, nearly ten times faster than top NVIDIA GPUs, while cutting system cost more than tenfold and power consumption tenfold, albeit with limited flexibility and quantization trade‑offs.

AI hardware · ASIC · Llama 3.1

0 likes · 10 min read

TonyBai

Feb 18, 2026 · Backend Development

Why We Chose Go Over Python for Building an LLM Gateway

The Bifrost team replaced Python with Go for their LLM gateway, achieving roughly 700× lower latency, 68% less memory usage, and three‑fold higher throughput. The article explains Python’s performance bottlenecks, Go’s concurrency model, deployment advantages, and future AI infrastructure trends.

AI Infrastructure · Go · LLM Gateway

0 likes · 14 min read

Network Intelligence Research Center (NIRC)

Dec 23, 2025 · Artificial Intelligence

ClusterAttn: Compressing KV Cache with Intrinsic Attention Clustering

ClusterAttn tackles the KV‑cache bottleneck of large language models by exploiting the natural clustering of attention scores, achieving up to 92% compression without accuracy loss, boosting throughput 2.6–4.8×, handling 128K‑token sequences on a single GPU, and outperforming existing training‑free compression methods.

KV cache compression · attention clustering · density clustering

0 likes · 8 min read

Network Intelligence Research Center (NIRC)

Jul 15, 2025 · Fundamentals

How to Write High‑Performance GPU Code with OpenAI Triton

This article introduces OpenAI's Triton language, compares its block‑wise programming model to traditional CUDA, walks through vector‑addition and fused‑softmax kernel implementations, and presents benchmark results that demonstrate significant speedups over native PyTorch operations.

CUDA · GPU programming · PyTorch

0 likes · 10 min read

Linux Kernel Journey

Feb 27, 2025 · Cloud Native

Designing FUSE: From Kernel VFS to Userspace and JuiceFS Performance

This article explains the evolution of file system architecture from kernel‑level VFS to userspace via FUSE, reviews the historical role of NFS, details JuiceFS's implementation on top of FUSE, and presents benchmark results that demonstrate its high throughput and practical limitations.

Distributed storage · FUSE · JuiceFS

0 likes · 15 min read

AIWalker

Jan 14, 2025 · Artificial Intelligence

Pure 3×3 Convolutions for Image‑Generation Diffusion Models: The DiC Approach

The paper introduces DiC, a fully convolutional diffusion model that rethinks 3×3 convolutions, adds sparse skip connections, stage‑specific embeddings and conditional gating, and demonstrates superior FID/IS scores and faster inference compared to diffusion Transformers across multiple scales.

AI · Diffusion Models · convolutional networks

0 likes · 19 min read

Alibaba Cloud Big Data AI Platform

Dec 18, 2024 · Artificial Intelligence

Can GPU Graph Algorithms Boost Vector Search Performance by 10×?

This article explains how OpenSearch's GPU‑accelerated vector search leverages parallel graph algorithms to achieve up to tenfold speed improvements over CPU solutions, detailing ANNS techniques, performance benchmarks, and practical GPU specifications for high‑QPS AI applications.

GPU Acceleration · OpenSearch · approximate nearest neighbor

0 likes · 11 min read

Alibaba Cloud Developer

Dec 3, 2024 · Operations

How to Boost Logtail Multiline Log Collection Speed by Up to 7×

This article investigates why enabling line‑prefix regex for multiline logs slows Logtail down, explains the underlying regex matching mechanism, and demonstrates how switching from boost::regex_match to boost::regex_search with proper flags can dramatically improve collection throughput, achieving a seven‑fold speed increase.

boost::regex · log collection · logtail

0 likes · 10 min read

21CTO

Nov 7, 2024 · Databases

Can Memcached Match ScyllaDB? Deep Performance Comparison and Choosing the Right Solution

This article presents a comprehensive benchmark comparing memcached and ScyllaDB across memory, disk, and read‑only workloads, analyzes architectural trade‑offs, and offers practical guidance on when to prefer a simple in‑memory cache versus a persistent wide‑column database.

Caching · Database Comparison · Memcached

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Sep 16, 2024 · Artificial Intelligence

How TAG Makes LLM Inference Fully Asynchronous for Higher Throughput

With the growing complexity of LLM architectures like GQA, MLA, and MoE, runtime overhead has become a bottleneck; this article analyzes Python performance, communication costs, and synchronous execution in current inference frameworks, introduces the fully asynchronous TAG architecture, and demonstrates its superior throughput and latency through benchmarks.

GPU Utilization · LLM Inference · asynchronous scheduling

0 likes · 12 min read

21CTO

Apr 23, 2024 · Artificial Intelligence

Deploy Large Language Models with vLLM and Quantization for Low Latency

This guide explains how to deploy open‑source large language models using vLLM, benchmark latency and throughput, and apply 8‑bit/4‑bit quantization techniques such as bitsandbytes and NF4 to achieve faster inference on limited‑GPU hardware.

LLM deployment · Python · Quantization

0 likes · 13 min read

Aikesheng Open Source Community

Nov 29, 2023 · Databases

When to Use Distributed vs. Centralized Databases: Analysis, Benchmarks, and Best Practices

This article examines the trade‑offs between centralized and distributed OLTP databases, presents industry usage statistics, performance benchmarks, practical questions for migration, and detailed guidance on sharding, SQL design, and operational considerations to help decide when a distributed solution is truly needed.

Database Architecture · OLTP · Sharding

0 likes · 12 min read

Architects' Tech Alliance

Aug 31, 2022 · Artificial Intelligence

Performance Evaluation of Transformer Models on the Inspur NF5488A5 GPU Server

This article presents a detailed benchmark of four Transformer models of varying sizes trained on the high‑end Inspur NF5488A5 GPU server, compares its NVSwitch‑based interconnect with a PCIe‑based system, and analyzes the impact of model scale, tensor parallelism, and hardware bandwidth on training efficiency.

DeepSpeed · GPU server · Megatron-LM

0 likes · 12 min read

Baidu Tech Salon

Jun 28, 2022 · Artificial Intelligence

How Kunlun XPU‑R Redefines AI Compute: Architecture, Performance, and Future Trends

The article presents a detailed technical review of Kunlun Chip's XPU‑R AI accelerator, covering its evolution from early FPGA prototypes to the current 7nm, 256 TOPS chip, the architectural choices that address AI workload demands, performance advantages over CPUs/GPUs, and the product ecosystem supporting diverse AI scenarios.

AI acceleration · AI hardware · Kunlun chip

0 likes · 20 min read

Code DAO

May 21, 2022 · Artificial Intelligence

How Quantization and Fusion Accelerate CNN Inference on Edge Devices

The article explains CNN inference optimization by applying PyTorch quantization and module‑fusion techniques, compares model size and latency before and after quantization, shows code for building, quantizing, and fusing a simple CNN, and presents benchmark results on CPU, highlighting a four‑fold size reduction and up to 1.7× speed‑up.

CNN · PyTorch · Quantization

0 likes · 11 min read

Volcano Engine Developer Services

Mar 16, 2022 · Artificial Intelligence

How veGiantModel Boosts Large Language Model Training Up to 6.9× Faster

The article introduces Volcano Engine's veGiantModel, a high‑performance large‑model training framework built on PyTorch, Megatron and DeepSpeed, details its distributed parallel strategies, hardware setups, benchmark results showing up to 6.9× speedup over Megatron and DeepSpeed, and provides open‑source links for further use.

ByteCCL · distributed training · large language models

0 likes · 6 min read

FunTester

Jul 5, 2021 · Industry Insights

Which Load Testing Tool Wins at 100k QPS? K6 vs Gatling vs FunTester Benchmarks

In a series of local benchmarks on a 2.6 GHz six‑core Intel i7 machine, the author compares K6, Gatling, and FunTester under 10 k to 20 k QPS loads, detailing CPU, memory, and response‑time metrics, analyzing script languages, JVM settings, and offering optimization suggestions for FunTester.

FunTester · Gatling · JVM

0 likes · 11 min read

FunTester

Jun 29, 2021 · Operations

Which Load‑Testing Tool Performs Best? JMeter, k6, Locust & FunTester Compared

This article benchmarks four load‑testing frameworks—JMeter, k6, Locust, and FunTester—across multiple concurrency levels, measuring CPU, memory, QPS and response time to reveal each tool’s strengths, weaknesses, and scalability limits.

FunTester · JMeter · K6

0 likes · 11 min read

NetEase Media Technology Team

Jan 15, 2021 · Backend Development

Go Language Practice and Ngo Framework Development at NetEase Media

Facing high memory usage and slow startup after containerizing its Java services, NetEase Media adopted Go in 2020, leveraging its fast compilation, low‑resource footprint and goroutine‑based concurrency to build the high‑performance Ngo framework, which outperforms Spring‑Boot in throughput while using far less memory.

Backend Development · Go language · Goroutine

0 likes · 32 min read

ITPUB

Dec 24, 2020 · Databases

How TDSQL Achieves Multi‑Level Strong Consistency with 4×‑3× Performance Gains

This article explains how Tencent's TDSQL database tackles the combined challenges of transaction and distributed consistency by introducing a multi‑level strong consistency model that delivers several‑fold performance improvements over Spanner, CockroachDB, and native Greenplum while preserving ACID guarantees.

Database Research · TDSQL · distributed transactions

0 likes · 12 min read

Xiao Lou's Tech Notes

May 19, 2020 · Backend Development

Can You Build a Faster Counter Than Java’s LongAdder? A Deep Dive

An in‑depth Java performance study explores LongAdder, compares it with AtomicLong and lock‑based counters using JMH, and walks through successive custom implementations (V0‑V5) that apply striping, modulo optimization, false‑sharing elimination, and advanced hash probing to approach or surpass LongAdder’s throughput.

JMH · Java concurrency · false sharing

0 likes · 16 min read

360 Zhihui Cloud Developer

Sep 3, 2019 · Big Data

QuickSQL: 360’s Unified Multi-Source Query Engine Explained

This article outlines how 360’s data center built QuickSQL, a federated SQL engine that unifies queries across heterogeneous sources such as Hive, MySQL, and Elasticsearch, detailing the business challenges, architectural design, performance benchmarks, and future roadmap for multi‑source data analysis.

Big Data · Data Integration · Federated Query

0 likes · 12 min read

High Availability Architecture

Jun 7, 2017 · Databases

Evaluating Pilosa on Dense, Low‑Cardinality Data Using the NYC Taxi Dataset

This article examines whether Pilosa, a bitmap index originally built for sparse high‑cardinality data, can efficiently handle dense relational datasets by benchmarking it on a billion‑row NYC taxi trip dataset and comparing query performance with other database systems.

Bitmap Index · NYC taxi dataset · Pilosa

0 likes · 6 min read

21CTO

May 22, 2017 · Backend Development

Why Rewriting a Laravel App in Go Boosted Performance and Simplicity

The author rewrote a Laravel‑based Boxzilla application in Go, detailing migration steps, code‑size reduction, benchmark results, and testing advantages, showing how Go delivers faster response times, lower latency, and a more maintainable backend.

Code size reduction · Go · Laravel migration

0 likes · 7 min read
