Tagged articles

11 articles

Jun 1, 2026 · Artificial Intelligence

Why Do Most Agent Projects Fail Before Launch? LangChain’s Solution

The article explains why many AI Agent projects collapse before production, citing non‑determinism, error propagation, and agents’ creative solutions, and presents LangChain’s Deep Agent evaluation framework, integrated with LangSmith, AWS Bedrock, and Pytest, which provides a reproducible, end‑to‑end testing and monitoring process.

AWS Bedrock · Agent Evaluation · Deep Agent

0 likes · 9 min read

PaperAgent

May 25, 2026 · Artificial Intelligence

DeepSeek’s Harness: How Agent Harness Engineering Is Shaping the Next LLM Agent Era

The article surveys DeepSeek’s Harness initiative, presenting the Binding‑Constraint Thesis, a three‑stage evolution from prompt to harness engineering, the ETCLOVG seven‑layer architecture, and concrete benchmark evidence that harness‑only improvements far outweigh model upgrades. It also details security, observability, and governance considerations for reliable LLM agents.

AI Architecture · Agent Evaluation · Agent Harness Engineering

0 likes · 12 min read

AgentGuide

May 24, 2026 · Artificial Intelligence

Comprehensive AI Agent Interview Guide: From Core Concepts to Engineering Implementation

This curated collection gathers AI Agent interview questions covering fundamentals, tokenization, skill design, RAG, MCP, memory systems, evaluation methods, and practical engineering pathways, offering a complete navigation resource for backend engineers transitioning to AI roles.

AI Agent · Agent Evaluation · Interview Questions

0 likes · 3 min read

ITPUB

May 16, 2026 · Artificial Intelligence

Managing AI‑Generated Code with an Agent‑Based Evaluation Framework: Lessons from Refactoring 310 K Lines

When over 90% of a codebase is produced by AI, the authors show how a unified "people‑align → human‑machine‑align" approach, driven by evaluation agents, transforms technical debt into incremental business work, enabling continuous refactoring, AI‑friendly standards, and a sustainable engineering environment.

AI coding · AI governance · Agent Evaluation

0 likes · 21 min read

Meituan Technology Team

May 7, 2026 · R&D Management

Managing AI‑Generated Code with Agent‑Based Evaluation: Refactoring 310K Lines of Code

When over 90% of a codebase is produced by AI, system quality hinges on constraining the AI rather than on speed. This article details how a team used an agent‑based evaluation framework, unified standards, and incremental refactoring to turn 310,000 lines of AI‑written code into a maintainable, low‑debt system.

AI coding · AI governance · Agent Evaluation

0 likes · 21 min read

AntData

Apr 28, 2026 · Artificial Intelligence

Iterative Agent Evaluation Skill: Automating Bad‑Case Diagnosis with AI Pre‑Annotation

The article presents an end‑to‑end, eight‑phase automated evaluation pipeline for large‑model agents that replaces manual bad‑case inspection with AI‑assisted pre‑annotation, cutting analysis time from a full day to about 30 minutes, an efficiency gain of over 90%, while enabling iterative knowledge‑base refinement.

AI Pre‑annotation · Agent Evaluation · Automated Pipeline

0 likes · 20 min read

Alibaba Cloud Developer

Mar 27, 2026 · Artificial Intelligence

How OpenClaw Empowers a Self‑Evolving Bank Manager Assistant

This article details a three‑day deep dive into OpenClaw, demonstrating how a self‑iterating AI assistant for bank relationship managers can be built, validated, and refined through autonomous agent communication, scheduled tasks, and memory‑driven reflection.

AI Agents · Agent Evaluation · OpenClaw

0 likes · 20 min read

PaperAgent

Dec 23, 2025 · Artificial Intelligence

CATArena: A Competitive Benchmark That Turns Agent Scoring into Evolutionary Learning

CATArena introduces a tournament‑style evaluation framework where AI agents iteratively code, compete, and improve across classic board games, using three‑dimensional quantitative scores to measure strategy programming, global learning, and generalization, and reveals how different LLM‑based agents learn and adapt over multiple rounds.

AI benchmark · Agent Evaluation · CATArena

0 likes · 8 min read

Amazon Cloud Developers

Dec 23, 2025 · Artificial Intelligence

Evaluating Agent Quality: A Practical Guide for Agentic AI

This article explains why evaluating AI agents is essential, outlines a multi‑dimensional metric system covering performance, safety, cost and bias, describes common evaluation frameworks such as AgentBoard, AgentBench and τ‑bench, and provides step‑by‑step instructions, example datasets and code for building a robust agent assessment pipeline.

AI Agents · Agent Evaluation · Benchmarking

0 likes · 35 min read

Fun with Large Models

Aug 20, 2025 · Artificial Intelligence

DeepSeek V3.1 Review: 128K Context, Knowledge, Programming & Agent Skills Near Claude 4

DeepSeek V3.1, released on August 19, expands the context length to 128K tokens and updates its knowledge base to July 2024. The author’s benchmarks show that its programming and agent capabilities now rival those of Claude 4, with detailed prompt examples, code generation demos, and performance comparisons.

Agent Evaluation · Claude 4 · DeepSeek

0 likes · 9 min read

DataFunTalk

Jul 14, 2025 · Artificial Intelligence

Can Kimi K2 Beat Claude and Gemini in Coding and Agent Tasks?

This in‑depth review examines Kimi K2’s new focus on agent and coding abilities, comparing its performance on 3D HTML generation, code generation, and real‑world agent tasks against Claude 4 and Gemini 2.5, while also evaluating cost, openness, and practical usability for developers.

AI coding · Agent Evaluation · Kimi K2

0 likes · 15 min read
