Tagged articles
21 articles
Data Party THU
May 17, 2026 · Artificial Intelligence

Personalizing AI Agents: Memory, Rolling Context, and Advanced Retrieval Techniques

The article explains how AI agents use memory to retain conversation context, why sending the full history to large language models is inefficient, and presents rolling context windows, inverted‑index pruning, semantic embedding retrieval, and GraphRAG as complementary strategies to build more accurate and personalized agents.

AI memory · GraphRAG · LLM optimization
0 likes · 10 min read
Machine Heart
Apr 23, 2026 · Artificial Intelligence

DeepSeek Unveils Tile Kernels and DeepEP V2 – Is V4 on the Horizon?

DeepSeek recently opened the Tile Kernels repository and released DeepEP V2, detailing new GPU kernel features, a fully JIT-enabled expert parallelism redesign that boosts peak performance by up to 1.3× while cutting SM usage fourfold, and hinting at an upcoming V4 release.

DeepEP V2 · DeepSeek · Expert Parallelism
0 likes · 6 min read
Woodpecker Software Testing
Mar 17, 2026 · Artificial Intelligence

5 Proven Strategies to Boost Large Language Model Performance

The article presents five actionable strategies—defining a three‑dimensional performance baseline, applying layered injection load tests, co‑optimizing dynamic quantization with cache, employing SLO‑driven chaos engineering, and shifting testing left to compilation—to reliably measure and improve LLM throughput, latency, and resource efficiency in production.

LLM optimization · Large Language Models · Load Testing
0 likes · 7 min read
High Availability Architecture
Mar 12, 2026 · Artificial Intelligence

How Claude Code Hits 92% Prompt Cache Rate and Slashes AI Agent Costs by 81%

This article explains the prompt‑caching mechanism used by Claude Code, showing how separating static prefixes from dynamic tails and leveraging KV‑tensor caching reduces the O(n²) complexity of transformer pre‑fill to O(n), achieving a 92% cache hit rate and up to 81% cost savings in long‑running AI agent sessions.

AI agents · Claude · Cost reduction
0 likes · 12 min read
PaperAgent
Mar 3, 2026 · Artificial Intelligence

How CharacterFlywheel Scales Engaging LLMs: 15 Iterations of Production Optimization

The article presents CharacterFlywheel, a 15‑generation flywheel methodology that iteratively improves social‑dialogue LLMs in production using data‑driven reward models, rejection sampling, and a mix of SFT, DPO, and RL, with detailed experiments and best‑practice insights.

AI Safety · LLM optimization · Reinforcement Learning
0 likes · 12 min read
Baobao Algorithm Notes
Feb 4, 2026 · Artificial Intelligence

Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

This article reviews recent 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.

Deep Learning Compiler · LLM optimization · Linear Attention
0 likes · 22 min read
Instant Consumer Technology Team
Dec 18, 2025 · Artificial Intelligence

How a Multi‑Agent Framework Boosts Graph Chain‑of‑Thought Reasoning Efficiency

The paper introduces GLM, a multi‑agent Graph‑CoT framework with an optimized LLM serving architecture that dramatically improves accuracy, reduces token consumption, lowers latency, and increases throughput across diverse domains, as demonstrated by extensive GRBench evaluations.

LLM optimization · Multi-Agent · Token efficiency
0 likes · 10 min read
Old Meng AI Explorer
Nov 24, 2025 · Artificial Intelligence

How ktransformers Lets Your Laptop Run 13B LLMs Without a GPU

ktransformers is an open‑source AI model optimization framework that dramatically reduces memory usage and speeds up loading and inference, enabling ordinary laptops, even those without a GPU, to run 7B‑13B large language models for coding, content creation, and academic assistance.

KTransformers · LLM optimization · Local AI
0 likes · 10 min read
Instant Consumer Technology Team
Oct 17, 2025 · Artificial Intelligence

Mastering Context Engineering for AI Agents: Overcome Overload with Smart Strategies

This article distills Anthropic’s “Effective Context Engineering for AI Agents” into key insights, explaining why context engineering matters, how it differs from prompt engineering, what constitutes good practice, and practical techniques—system prompts, tool design, few‑shot prompting, compaction, structured note‑taking, and sub‑agent architectures—to mitigate context overload in large language model agents.

AI agents · Agent Design · Context Engineering
0 likes · 10 min read
DataFunTalk
Oct 6, 2025 · Artificial Intelligence

Mastering Context Engineering: 5 Proven Strategies to Boost AI Agent Performance

This article explores the emerging concept of context engineering for AI agents, explains why managing long‑range context is critical, and details five practical strategies—Offload, Reduce, Retrieve, Isolate, and Cache—backed by insights from leading industry teams and the "Bitter Lesson" philosophy.

AI agents · Context Engineering · LLM optimization
0 likes · 30 min read
Instant Consumer Technology Team
Sep 3, 2025 · Artificial Intelligence

Why Context Modeling Could Replace RAG – Insights from DeepVista CEO Jing Conan Wang

In a two‑hour interview, DeepVista CEO Jing Conan Wang explains how his new "context modeling" paradigm addresses the rigidity, lack of personalization, and performance limits of current RAG‑based AI agents, proposing a dual‑model architecture that learns and adapts context dynamically for faster, more accurate results.

AI Architecture · LLM optimization · Personalized AI
0 likes · 15 min read
Alibaba Cloud Big Data AI Platform
Jul 23, 2025 · Artificial Intelligence

Unlock Efficient LLMs: How Alibaba’s PAI EasyDistill Powers Model Post‑Training

This article explains how Alibaba Cloud's AI platform PAI leverages the EasyDistill framework for post‑training model optimization, covering knowledge distillation concepts, data synthesis techniques, basic and advanced distillation training, the DistilQwen model family, real‑world customer cases, and step‑by‑step practical demos.

AI Platform · EasyDistill · LLM optimization
0 likes · 12 min read
Kuaishou Tech
Apr 24, 2025 · Artificial Intelligence

Two‑Stage History‑Resampling Policy Optimization (SRPO) for Large‑Scale LLM Reinforcement Learning

The article introduces SRPO, a two‑stage history‑resampling reinforcement‑learning framework that systematically tackles common GRPO training issues and achieves state‑of‑the‑art performance on both math and code benchmarks with far fewer training steps, while also revealing emergent self‑reflection behaviors in large language models.

LLM optimization · Reinforcement Learning · SRPO
0 likes · 12 min read
AIWalker
Feb 19, 2025 · Artificial Intelligence

DeepSeek’s NSA Attention Cuts Inference Time 11× – CEO Liang Co‑author

DeepSeek introduces the NSA sparse attention mechanism, combining dynamic hierarchical sparsity, coarse token compression and fine token selection to achieve up to 11.6× faster inference, lower pre‑training cost, and superior benchmark performance across general, long‑context, and chain‑of‑thought tasks.

DeepSeek · LLM optimization · NSA
0 likes · 9 min read
AI2ML AI to Machine Learning
Feb 5, 2025 · Artificial Intelligence

What Optimizations Power DeepSeek’s High‑Efficiency LLMs?

The article enumerates DeepSeek’s extensive technical optimizations—including Grouped Query Attention, Multi‑head Latent Attention, Mixture‑of‑Experts, 4D parallelism, quantization, and multi‑token prediction—that together enable cheap, high‑performance large language models.

4D parallelism · DeepSeek · Grouped Query Attention
0 likes · 8 min read
AsiaInfo Technology: New Tech Exploration
Dec 13, 2024 · Artificial Intelligence

Optimizing Graph RAG: Boosting Global QA with Better Chunking, Prompts, and Entity Extraction

This article presents a comprehensive analysis of Graph RAG, detailing its implementation workflow, step‑by‑step execution guide, four targeted optimization strategies, and experimental validation that demonstrates significant improvements in global and local question answering for industry scenarios.

Graph RAG · LLM optimization · Prompt engineering
0 likes · 18 min read
Baobao Algorithm Notes
Oct 21, 2024 · Artificial Intelligence

Unraveling RLHF: From PPO to DPO and Beyond – A Comprehensive Guide

This article provides a thorough, four‑part overview of RLHF for large language models, covering preference‑optimization algorithms (PPO‑based and offline RL approaches), reward‑model training techniques, inference‑time exploration strategies, and practical implementation details including the OpenRLHF framework and resource‑allocation tricks.

DPO · LLM optimization · OpenRLHF
0 likes · 27 min read
Tencent Cloud Developer
Jul 30, 2024 · Artificial Intelligence

A Systematic Guide to Prompt Engineering: From Zero to One

This guide walks readers from beginner to proficient Prompt Engineer by outlining the evolution of prompting, introducing a universal four‑component template, and detailing a five‑step workflow—including refinement, retrieval‑augmented generation, chain‑of‑thought reasoning, and advanced tuning techniques—plus evaluation metrics for LLM performance.

AI prompting · Chain-of-Thought · LLM optimization
0 likes · 51 min read
Alibaba Cloud Developer
Jun 27, 2024 · Artificial Intelligence

How to Supercharge Retrieval‑Augmented Generation: Papers, Techniques, and Real‑World Tips

This article surveys the main challenges of deploying large language models, introduces key RAG optimization papers such as RAPTOR, Self‑RAG, and CRAG, and compiles practical engineering tricks—including chunking, query rewriting, hybrid and progressive retrieval—to help practitioners build more accurate and efficient RAG systems.

AI research · LLM optimization · RAG
0 likes · 22 min read
Baidu Tech Salon
May 20, 2024 · Artificial Intelligence

Boosting Ad Efficiency with Baidu’s Multi‑Agent AI Architecture

In the AI‑native era, Baidu's ad platform adopts a multi‑agent architecture that combines large and small LLMs, SOP‑driven workflows, long‑term memory, and vector databases to achieve high query accuracy, low latency, and significant business gains while tackling challenges such as hallucination, planning, execution, and personalization.

AI agents · LLM optimization · Large Language Models
0 likes · 18 min read
NetEase Cloud Music Tech Team
Dec 28, 2023 · Frontend Development

Lossless Design-Frontend Collaboration: The Evolution of NetEase Cloud Music's Design Collaboration Practice

Since 2021, NetEase Cloud Music’s Design Platform has evolved its design‑frontend workflow through three stages (engineering phase 1.0, engineering phase 2.0, and the AI‑driven intelligent phase) by introducing the C2D2C (Code‑to‑Design‑to‑Code) methodology, unified design systems, LLM‑enhanced code, and generative AI tools, cutting communication overhead and boosting designer and developer productivity by up to 200%.

AI design · C2D2C · D2C
0 likes · 31 min read