Two‑Stage History‑Resampling Policy Optimization (SRPO) for Large‑Scale LLM Reinforcement Learning
The article introduces SRPO, a two-stage, history-resampling reinforcement-learning framework. SRPO systematically addresses common failure modes of GRPO training, achieves state-of-the-art performance on both math and code benchmarks with far fewer training steps, and reveals emergent self-reflection behaviors in large language models.