Tagged articles
174 articles
PaperAgent
May 19, 2026 · Artificial Intelligence

Why Long-Term Memory Needs Vision: How MemEye Evaluates Multimodal Agent Recall

MemEye is a multimodal memory benchmark that tests agents across eight real‑world scenarios, measuring visual evidence granularity and reasoning depth, and reveals that captions fall short for fine‑grained visual recall, highlighting the need for true visual memory in long‑term AI agents.

AI Agents · Benchmark · MemEye
0 likes · 4 min read
DataFunTalk
May 19, 2026 · Industry Insights

From Single‑Point Copilot to Platform‑Level Agentic: Real Challenges and Future Forks for Data Platforms

A live discussion dissected the shift from single‑point Copilot assistants to platform‑level Agentic data platforms, exposing hard architectural, security, knowledge‑base, evaluation, stability‑cost, and governance challenges while debating whether the future will favor a super‑agent or a multi‑agent ecosystem.

Agentic AI · Big Data · Data Platform
0 likes · 18 min read
DataFunSummit
May 18, 2026 · Artificial Intelligence

From Single‑Point Copilot to Platform‑Level Agentic: Real Challenges and Future Paths for Data Platforms

A 90‑minute live discussion examined how data platforms must evolve from simple Copilot assistants to fully agentic systems, covering architectural redesign, security guardrails, knowledge‑base integration, evaluation pitfalls, cost management, and whether the future favors a super‑agent or a multi‑agent ecosystem.

Agentic AI · Cost Management · Data Platform
0 likes · 20 min read
James' Growth Diary
May 11, 2026 · Artificial Intelligence

Mastering RAG Evaluation: Recall@K, MRR, NDCG, and RAGAS Explained

This article breaks down RAG evaluation into a two‑layer framework, explains the four core metrics—Recall@K, MRR, NDCG, and the four RAGAS scores—shows how to implement them with LangChain.js, highlights common pitfalls, and offers scenario‑specific metric combinations for reliable performance monitoring.

LangChain · MRR · NDCG
0 likes · 20 min read
Machine Heart
May 10, 2026 · Artificial Intelligence

Stop Fragmenting Long Texts: HiLight Lets AI Highlight Key Points Directly

The HiLight approach inserts lightweight highlight tags into full-length inputs, training a small Emphasis Actor to score token importance and guide a frozen large language model, improving performance on tasks like recommendation and QA without modifying the solver, while keeping low latency and training cost.

LLM · Low latency · evaluation
0 likes · 9 min read
Architect
May 4, 2026 · Artificial Intelligence

What Skills Architects Must Master in the Agent Era and Which Will Last Six Months

In the fast‑changing Agent era, architects should focus on durable engineering capabilities—context management, tool design, evaluation, harness, permissions, and cost control—rather than chasing the latest frameworks, ensuring agents remain stable and controllable in production systems.

AI Agents · Context management · Harness
0 likes · 26 min read
PaperAgent
May 4, 2026 · Artificial Intelligence

Why Claude 4.6 Scores Only 66%: Claw‑Eval‑Live Shows Terminal Skills Aren’t Enough

The article explains that modern AI agents must be judged on actual task execution and audit evidence, and Claw‑Eval‑Live reveals that while agents can use terminals, they still fail dramatically on cross‑system workflows such as HR, management, and operations, with no model surpassing a 70% pass rate.

AI Agents · Benchmark · Claw-Eval
0 likes · 7 min read
AI Engineering
May 4, 2026 · Artificial Intelligence

Why the Big‑Model Race Is Over: Where Real Value Lies in AI Infrastructure

The article argues that the competition over which large language model will dominate is outdated, explaining that true value now comes from building multi‑model routing, context engineering, standardized tool protocols, intelligent orchestration, and robust evaluation layers that turn models into reliable AI infrastructure.

AI Infrastructure · MCP · Orchestration
0 likes · 6 min read
PMTalk Product Manager Community
May 4, 2026 · Product Management

2026 AI Product Manager: The Essential Capability Model

By 2026, AI product managers must shift from merely using models to delivering stable, valuable results, mastering seven core abilities—demand judgment, evaluation-driven iteration, context design, RAG strategy, agent orchestration, solution planning, and rapid Vibe Coding—to close the loop between business needs and AI capabilities.

AI product management · Agent Design · Context Engineering
0 likes · 13 min read
AI Architecture Hub
May 3, 2026 · Artificial Intelligence

What to Learn, Build, and Skip in AI Agents

The article analyzes the fast‑changing AI‑agent landscape, proposes five concrete criteria for filtering new technologies, outlines essential concepts such as context engineering, tool design, scheduler‑subagent patterns, and evaluation frameworks, and recommends a stable 2026 tech stack while warning against hype‑driven tools.

AI Agents · Context Engineering · LangGraph
0 likes · 27 min read
AI Engineer Programming
May 2, 2026 · Artificial Intelligence

From Demo to Production: How to Evaluate RAG Effectively

This guide outlines a comprehensive RAG evaluation framework covering failure modes, multi‑layer metrics, test‑set construction, open‑source tools, CI/CD quality gates, production monitoring, and special considerations for agentic RAG to ensure reliable, trustworthy retrieval‑augmented generation systems.

AI · Generation · LLM
0 likes · 18 min read
MaGe Linux Operations
Apr 28, 2026 · Artificial Intelligence

Why Your RAG Performance Is Poor: Common Issues and Optimization Strategies

This article systematically analyzes why Retrieval‑Augmented Generation pipelines often underperform—covering embedding model selection, chunking strategies, hybrid retrieval, reranking, context window waste, evaluation metrics, and a detailed troubleshooting checklist—while providing concrete code examples and best‑practice recommendations for engineers.

Embedding · Hybrid Retrieval · RAG
0 likes · 19 min read
PaperAgent
Apr 27, 2026 · Artificial Intelligence

A Comprehensive Review of Modern LLM Agent Memory Frameworks

The article surveys recent LLM‑based agent memory research, presenting a unified framework that breaks memory systems into four components, detailing their design choices, experimental evaluation on LOCOMO and LONGMEMEVAL, key findings, and a new low‑token SOTA architecture.

Agent Memory · LLM · Memory Management
0 likes · 8 min read
AI Engineer Programming
Apr 23, 2026 · Artificial Intelligence

From Zero to One: A Roadmap for Building Trustworthy AI Agent Evaluations

The article outlines why rigorous, automated evaluation is essential for AI agents, defines core concepts such as tasks, trials, graders, and frameworks, compares code‑based, model‑based and human graders, and presents an eight‑step roadmap—from early testing to open‑source maintenance—to create reliable, scalable agent assessments.

AI Agents · Automated Testing · Benchmarking
0 likes · 22 min read
MaGe Linux Operations
Apr 22, 2026 · Artificial Intelligence

5 Essential Design Principles for Building High‑Quality RAG Systems

This article outlines five critical design principles for constructing high‑quality Retrieval‑Augmented Generation (RAG) systems, covering document chunking strategies, embedding model selection, hybrid retrieval architectures, metadata filtering with multi‑level indexes, and reranking mechanisms, and provides concrete code snippets and evaluation metrics.

Embedding · Hybrid Retrieval · RAG
0 likes · 17 min read
Su San Talks Tech
Apr 21, 2026 · Artificial Intelligence

How to Turn Bad Prompts into High‑Scoring AI Prompts: A Step‑by‑Step Guide

This article walks through a complete prompt‑engineering workflow—starting from a weak baseline, building an evaluation pipeline, and applying four concrete techniques (clarity, specificity, XML structuring, and examples) that lift a Claude score from 3.4 to over 9, with code, metrics, and real‑world examples.

AI · Claude · Prompt Engineering
0 likes · 19 min read
FunTester
Apr 20, 2026 · Artificial Intelligence

Why Self‑Evaluating Agents Fail and How to Build Reliable Multi‑Agent Systems

The article analyzes why letting the same AI agent both generate and evaluate its own output produces over‑confident but flawed results, especially for subjective tasks, and proposes a three‑stage multi‑agent architecture with independent evaluation, concrete standards, and prompt‑based calibration to improve reliability as models evolve.

AI · Multi-Agent · Prompt Engineering
0 likes · 9 min read
Java One
Apr 20, 2026 · Artificial Intelligence

From Bad Prompts to 9.5 Scores: A Step‑by‑Step Prompt Engineering Guide

This article walks through an iterative prompt‑engineering workflow—starting with a weak baseline, applying four concrete techniques (clarity & directness, specificity, XML structuring, and examples), evaluating each change with a PromptEvaluator, and showing how scores jump from 3.4 to over 9.5 using real code snippets and concrete data.

AI · Claude · Prompt Engineering
0 likes · 20 min read
Machine Heart
Apr 17, 2026 · Artificial Intelligence

Can LLMs Truly Mimic Human Shopping Behavior? The OPeRA Dataset and Evaluation

The paper introduces OPeRA, a step‑wise online‑shopping dataset capturing observations, personas, rationales, and actions from real users, and uses it to benchmark LLMs on next‑action prediction, revealing that even top models like GPT‑4.1 achieve only about 20% accuracy on fine‑grained actions, with persona information offering limited benefit while rationales prove crucial.

AI · Dataset · LLM
0 likes · 9 min read
Data Party THU
Apr 16, 2026 · Artificial Intelligence

Can Multimodal LLMs Truly Understand Emotions? Inside the MME-Emotion Benchmark

The MME-Emotion benchmark, introduced by researchers from CUHK and Alibaba Tongyi and accepted at ICLR 2026, provides a large‑scale, multimodal evaluation of emotional intelligence in large language models, revealing current models’ limited emotion recognition and reasoning abilities across diverse real‑world scenarios.

AI · Benchmark · MME-Emotion
0 likes · 10 min read
Machine Heart
Apr 10, 2026 · Artificial Intelligence

Why Generalist’s Success Shifts Embodied AI Competition From Models to Infrastructure

The launch of Generalist AI’s GEN‑1 model demonstrates a breakthrough in success rate, speed and resilience, but the article argues that the true competitive frontier has moved from model performance to the underlying data, simulation and evaluation infrastructure that enables continuous learning and scalable testing for embodied intelligence.

AI models · Embodied AI · Robotics
0 likes · 12 min read
DataFunSummit
Apr 10, 2026 · Artificial Intelligence

How Can AI Agents Truly Remember? A Deep Dive into Long‑Term Memory Engineering

This article examines the shortcomings of current AI assistants, outlines the ideal of long‑term memory engineering, reviews mainstream industry solutions such as hard‑context models and Retrieval‑Augmented Generation, proposes a four‑layer memory loop architecture, and looks ahead to online learning and collective intelligence for future agents.

AI · Agent · Hybrid Architecture
0 likes · 15 min read
Data STUDIO
Apr 10, 2026 · Artificial Intelligence

Step‑by‑Step Guide to Writing Effective Agent Skill.md Files

This article explains what Agent Skills are, shows the folder layout and SKILL.md format, introduces the progressive‑disclosure design, provides concrete best‑practice tips, testing and evaluation methods, and demonstrates how to package scripts for reliable AI‑assistant automation.

AI Assistant · Agent Skills · Automation
0 likes · 29 min read
AI Step-by-Step
Apr 8, 2026 · Operations

How to Light Up the Black Box of LLM Agents with Full‑Stack Observability

The article explains why traditional logs are insufficient for LLM agents, outlines five observability dimensions—tracing, metrics, behavioral governance, state & memory, and evaluation—and provides concrete, open‑source‑based steps to instrument, monitor, and act on agent workloads in production.

Behavioral Governance · LLM agents · Observability
0 likes · 11 min read
Machine Heart
Apr 5, 2026 · Artificial Intelligence

How Imitation Learning Powers Dexterous Manipulation: A 2021‑2025 Technical Roadmap

This survey maps the 2021‑2025 progress of imitation learning for dexterous manipulation, detailing theoretical foundations, datasets, algorithms, hardware platforms, and evaluation protocols, and highlights challenges such as data quality, hardware dependence, and the need for standardized benchmarks to advance embodied AI.

Algorithms · Datasets · Dexterous Manipulation
0 likes · 11 min read
AI Code to Success
Apr 3, 2026 · Artificial Intelligence

Can Your AI Agent Earn a College Degree? Exploring Clawvard’s Evaluation Platform

The author explores Clawvard, an AI‑agent assessment platform that tests agents across eight dimensions, shares personal test results showing an initial A‑ rating with a critical retrieval weakness, details the customized improvement rules applied, and demonstrates a subsequent A+ rating, while also discussing the platform’s limits and practical use cases.

AI Agent · Prompt Engineering · artificial intelligence
0 likes · 8 min read
AgentGuide
Apr 3, 2026 · Artificial Intelligence

How to Evaluate RAG Systems: Key Metrics and the Ragas Framework

The article explains how to assess Retrieval-Augmented Generation (RAG) projects using the Ragas automated evaluation framework, detailing four key dimensions—recall quality, answer faithfulness, answer relevance, and context utilization—and describes the underlying metrics for both retrieval and generation stages.

LLM · RAG · RAGAS
0 likes · 5 min read
Top Architecture Tech Stack
Mar 28, 2026 · Artificial Intelligence

How Anthropic’s Multi‑Agent Harness Keeps Claude Running for Hours

Anthropic’s engineering recap reveals a GAN‑inspired multi‑agent framework that separates generation, evaluation, and planning to overcome Claude’s context anxiety and self‑evaluation bias, enabling the model to sustain multi‑hour, high‑quality tasks across frontend design, full‑stack apps, and game‑engine projects.

AI · Claude · evaluation
0 likes · 19 min read
SuanNi
Mar 26, 2026 · Artificial Intelligence

Unveiling Omni-WorldBench: How 18 AI Video Models Stack Up on 4D Interaction Tests

The Omni-WorldBench framework introduces a comprehensive 4D evaluation suite with 1,068 test cases and three interaction levels, applying novel metrics to assess video quality, controllability, and physical interaction fidelity across 18 state‑of‑the‑art AI video models, revealing strengths, weaknesses, and future research directions.

4D interaction · Benchmark · Omni-WorldBench
0 likes · 14 min read
SuanNi
Mar 25, 2026 · Artificial Intelligence

Can Harness Engineering Enable AI Agents to Master Complex Long‑Running Tasks?

This article analyses the concept of Harness engineering introduced by OpenAI and Anthropic, explains how multi‑agent architectures decompose and manage long‑running AI tasks, examines practical experiments such as a retro game maker and a web‑audio workstation, and distills lessons for future AI system design.

AI Engineering · Anthropic · Claude
0 likes · 16 min read
o-ai.tech
Mar 25, 2026 · Artificial Intelligence

From Code Writing to Continuous Development: Anthropic’s Long‑Running Agent Harness Design

Anthropic’s article dissects a three‑role harness—planner, generator, evaluator—for building long‑running AI applications, explaining how structured specs, sprint contracts, iterative evaluation, and context management transform a single model into a reliable software‑engineering pipeline, with concrete front‑end and full‑stack case studies.

AI Agents · Evaluator · Harness
0 likes · 23 min read
Frontend AI Walk
Mar 25, 2026 · Artificial Intelligence

Slow Learning Agents: 7 Cognitive Shifts from Using ChatGPT to Truly Understanding Agents

The article outlines seven essential mindset transitions for building robust LLM agents—recognizing agents as autonomous decision loops, prioritizing harness over model size, layering context, designing tools for agent goals, structuring multi‑layer memory, coordinating multiple agents with isolation and protocols, and aligning evaluation with the real environment.

Context management · Harness · LLM agents
0 likes · 16 min read
AgentGuide
Mar 22, 2026 · Artificial Intelligence

How to Design Prompt Engineering in Your Project: A Complete Workflow

The article outlines a systematic Prompt Engineering process that starts with defining task goals and metrics, structures prompts into modular components, uses offline evaluation and bad‑case analysis, incorporates RAG or tools when needed, and continuously monitors accuracy, hallucination, latency and cost.

AI workflow · Few-Shot · Prompt Engineering
0 likes · 7 min read
ByteDance SE Lab
Mar 20, 2026 · Artificial Intelligence

How to Build a Multi‑Repo Semantic Code Q&A System with OpenViking

This guide explains the challenges of multi‑repository code retrieval, presents an experimental evaluation of OpenViking's semantic search, and provides step‑by‑step instructions for installing, configuring, importing repositories, and integrating the system into AI agents and chatbots.

AI Assistant · Multi-repo · OpenViking
0 likes · 16 min read
AI Frontier Lectures
Mar 16, 2026 · Artificial Intelligence

Can Multimodal LLMs Truly Understand Human Emotions? Introducing the MME-Emotion Benchmark

This article presents MME-Emotion, a large‑scale multimodal benchmark that evaluates both emotion recognition and reasoning abilities of multimodal large language models across 27 real‑world scenarios, revealing current models’ significant gaps in emotional intelligence and outlining future research directions.

AI · Benchmark · Dataset
0 likes · 9 min read
Alibaba Cloud Developer
Mar 16, 2026 · Artificial Intelligence

HeartBench: Building the First Chinese AI Humanization Benchmark

This article details the creation of HeartBench, a Chinese benchmark for evaluating large language models' emotional and social intelligence, describing its background, design principles, data pipeline, evaluation methods, multi‑stage versioning, blind‑test validation, and lessons for building transferable AI assessment frameworks.

AI Benchmark · Emotion AI · Humanization
0 likes · 25 min read
PaperAgent
Mar 15, 2026 · Artificial Intelligence

Why LLM Tool‑Calling Benchmarks Miss Real Users: Introducing WildToolBench

WildToolBench reveals that existing LLM tool‑calling benchmarks overlook real‑world user behavior, and a comprehensive evaluation of 58 models shows even the strongest agents achieve less than 15% session accuracy, highlighting a huge gap between reported performance and practical usability.

Agentic AI · Benchmark · LLM
0 likes · 10 min read
Old Zhang's AI Learning
Mar 11, 2026 · Artificial Intelligence

Upgrade All Your Claude Skills Now: Harness the New Skill‑Creator Engine

Anthropic’s updated skill‑creator turns Skills into a core, engineering‑focused capability for Claude, offering a systematic workflow—baseline A/B testing, quantitative assertions, visual evaluation, and iterative description optimization—so developers can rebuild, refine, and reliably trigger their Skills for higher productivity.

AI Agents · Anthropic · Automation
0 likes · 13 min read
PaperAgent
Mar 9, 2026 · Artificial Intelligence

How SkillNet Turns AI Agent Experience into Reusable Skills

SkillNet proposes a three‑layer infrastructure that extracts, evaluates, and connects over 200,000 AI‑agent skills into a structured graph, dramatically improving performance across benchmark environments while turning transient agent experience into durable, reusable assets.

AI Agents · LLM · SkillNet
0 likes · 6 min read
AI Tech Publishing
Mar 7, 2026 · Artificial Intelligence

A Practical Guide to Evaluating Agent Skills

This article explains why many Agent Skills are released without testing, defines measurable success criteria, and presents a lightweight evaluation framework—including prompt set creation, deterministic checks, optional LLM‑based qualitative checks, and best‑practice recommendations—demonstrated by improving a Gemini Interactions API skill from 66.7% to 100% pass rate.

AI Agents · Agent Skills · Gemini
0 likes · 13 min read
Amap Tech
Mar 5, 2026 · Artificial Intelligence

How MobilityBench Measures the Real Power of AI Route‑Planning Agents

MobilityBench is an open‑source benchmark built from over 100 000 real user queries that evaluates AI route‑planning agents with a deterministic sandbox, multi‑dimensional metrics, and support for ReAct and Plan‑and‑Execute frameworks, revealing performance gaps between open‑source and closed‑source models.

AI Agents · Benchmark · MobilityBench
0 likes · 6 min read
Data Party THU
Feb 18, 2026 · Artificial Intelligence

Why Top AI Agents Fail in Real Work: Inside the Trainee‑Bench Benchmark

The article analyzes the gap between high benchmark scores and poor real‑world performance of AI agents, introduces the Trainee‑Bench workplace simulator, details its three evaluation dimensions and construction steps, and reveals that even state‑of‑the‑art models achieve low success rates, highlighting the need for autonomous learning and zero‑hand‑over.

AI Agents · Trainee-Bench · continuous learning
0 likes · 11 min read
AI Engineering
Jan 29, 2026 · Artificial Intelligence

How a Tiny AGENTS.md Change Boosted AI Coding Accuracy from 53% to 100%

A Vercel team experiment shows that replacing the Skills approach with a small 8 KB AGENTS.md file raised AI coding agents' pass rate from 53% to a perfect 100%, revealing the fragility of explicit tool calls and the strength of passive, always‑available context.

AGENTS.md · AI coding agents · Next.js
0 likes · 11 min read
JD Tech
Jan 27, 2026 · Artificial Intelligence

How Uni-Layout Unifies Cross‑Task Layout Generation with Human‑Like Evaluation

Uni-Layout introduces a unified framework that integrates a universal layout generator, a human‑feedback‑simulating evaluator, and a dynamic margin preference optimization technique to align generation and evaluation across diverse e‑commerce design tasks, backed by a new 100k human‑annotated dataset.

Human Feedback · dynamic margin optimization · e-commerce design
0 likes · 11 min read
Architect
Jan 19, 2026 · Artificial Intelligence

How Cursor Scales Autonomous Coding Agents to Hundreds: Architecture Lessons for AI Systems

This article analyzes Cursor's engineering choices for running autonomous coding agents at scale, detailing the long‑running, drift, and evaluation concepts, the Planner‑Worker‑Judge pipeline, concurrency challenges, experimental results, and actionable rules for building robust multi‑agent systems.

Software Engineering · System Architecture · evaluation
0 likes · 17 min read
Old Zhao – Management Systems Only
Jan 15, 2026 · Operations

Why Most Supplier Evaluation Systems Fail and the 4 Metrics That Actually Matter

The article explains why traditional supplier evaluation forms often become meaningless, introduces four decisive metrics—delivery stability, quality consistency, cost transparency, and collaboration willingness—provides concrete scoring formulas for each, and shows how an SRM system can automate and visualize these indicators to help companies decide whether to replace a supplier.

Operations · SRM · evaluation
0 likes · 10 min read
JD Cloud Developers
Jan 15, 2026 · Artificial Intelligence

Uni-Layout: Unifying Layout Generation with Human Feedback and Dynamic Alignment

Uni-Layout introduces a unified framework that combines a multimodal large language model‑based generator, a human‑like evaluator trained on the large Layout‑HF100k dataset, and a Dynamic Margin Preference Optimization (DMPO) method to align generation and evaluation, achieving state‑of‑the‑art results across diverse layout tasks.

DMPO · Human Feedback · evaluation
0 likes · 11 min read
JD Tech Talk
Jan 15, 2026 · Artificial Intelligence

Uni-Layout: Harnessing Human Feedback for Unified Layout Generation and Evaluation

Uni-Layout introduces a unified framework that generates layouts across diverse tasks, simulates human evaluation with a novel feedback dataset, and aligns generation and assessment through dynamic margin preference optimization, achieving state‑of‑the‑art performance on multiple benchmarks.

AI design · Human Feedback · evaluation
0 likes · 11 min read
AI Insight Log
Jan 10, 2026 · Artificial Intelligence

Anthropic’s Full Practical Guide to Evaluating AI Agents – Key Insights

The article explains why evaluating AI agents is far more complex than testing deterministic code, outlines Anthropic’s anatomy of a complete evaluation system—including tasks, transcripts, and three grader types—and offers concrete best‑practice recommendations for building reliable agent pipelines.

AI Agents · Anthropic · LLM testing
0 likes · 9 min read
JD Retail Technology
Jan 8, 2026 · Artificial Intelligence

Uni-Layout: Unified Cross-Task Layout Generation with Human-Aligned Evaluation

Uni-Layout introduces a unified layout generation framework that consolidates diverse design tasks, leverages multimodal large language models for flexible generation, and aligns outputs with human perception through a novel human‑feedback dataset (Layout‑HF100k) and a dynamic margin preference optimization (DMPO) evaluator.

ACM Multimedia · Human Feedback · dynamic margin optimization
0 likes · 11 min read
DataFunSummit
Jan 3, 2026 · Artificial Intelligence

What Is Memory Engineering? Unlocking AI’s Long‑Term Recall and Future Potential

A comprehensive dialogue among industry experts explores the concept of memory engineering for AI agents, covering its definition, system‑level challenges from edge to cloud, hybrid technical routes, evaluation metrics, privacy safeguards, audience questions, future directions, and practical advice for developers.

AI Agents · Hybrid Architecture · evaluation
0 likes · 24 min read
AI Product Manager Community
Dec 27, 2025 · Product Management

Embracing Uncertainty: Redesigning AI Product Requirements

The article explores how product managers must shift from deterministic PRDs to uncertainty‑driven specifications for AI chatbots, replacing exhaustive logic with value‑based constraints, fuzzy‑evaluation metrics, dynamic benchmarks, and sample‑based requirements to better align with probabilistic large‑model behavior.

AI · PRD · Prompt Engineering
0 likes · 9 min read
Alibaba Cloud Native
Dec 19, 2025 · Artificial Intelligence

What Enterprises Are Learning from the State of Agent Engineering Report

The recent LangChain "State of Agent Engineering" report, combined with data from the AI‑Native Application Architecture whitepaper, reveals rapid production adoption of AI agents, persistent quality challenges, widespread observability, multi‑model strategies, and evolving evaluation practices across organizations of all sizes.

AI Agents · LLM · Observability
0 likes · 10 min read
Model Perspective
Dec 19, 2025 · Fundamentals

How a Multi‑Dimensional Model Ranks China’s Historical TV Dramas

This study builds a comprehensive evaluation model for Chinese historical drama series, defining four primary and nine secondary indicators, standardizing data, applying weighted calculations and a time‑compensation factor to score 127 candidates and produce a TOP‑100 ranking that highlights the influence of audience reputation, market impact, professional recognition, and historical value.

Model · evaluation · historical drama
0 likes · 18 min read
Youzan Coder
Nov 21, 2025 · Artificial Intelligence

How to Build, Evaluate, and Optimize AI Test Agents: A Practical Guide

This guide walks you through creating AI‑powered test agents, defining success metrics, building evaluation datasets, crafting and refining system prompts with techniques like chain‑of‑thought, XML, few‑shot and concise inputs, and scaling the workflow by splitting agents and managing prompt versions.

AI Agents · LLM · Prompt Engineering
0 likes · 21 min read
Wu Shixiong's Large Model Academy
Nov 4, 2025 · Artificial Intelligence

Why Financial RAG Fails and How to Solve Its Core Challenges

This article explains why Retrieval‑Augmented Generation (RAG) projects in the financial sector often underperform, highlighting data‑structure complexities, document‑parsing hurdles, chunking strategies, compliance constraints, evaluation metrics, and engineering requirements, and offers practical solutions and code examples.

Engineering · Financial AI · RAG
0 likes · 10 min read
Open Source Tech Hub
Oct 23, 2025 · Backend Development

Boost PHP Performance with CEL-PHP: A Fast, Safe Expression Engine

This guide introduces CEL-PHP, a high‑performance, non‑Turing‑complete expression engine for PHP 8+, showing how to install it, evaluate simple and contextual expressions, handle parsing and optimization, integrate caching, register custom functions, and avoid common pitfalls for robust backend rule evaluation.

CEL · Expression Language · caching
0 likes · 8 min read
AI2ML AI to Machine Learning
Oct 20, 2025 · Artificial Intelligence

nanochat Source Code Deep Dive: Data Prep, Model Design, Training & Evaluation

This article revisits nanochat's core components, detailing the preparation of diverse training datasets, the scaling calculations for tokens and parameters, the model's MQA and KV‑cache design, the full training pipeline with gradient accumulation and mixed‑precision, cost breakdown, inference optimizations, evaluation tasks, and identified limitations with suggested improvements.

KV cache · LLM · MQA
0 likes · 9 min read
Alibaba Cloud Developer
Oct 15, 2025 · Artificial Intelligence

Mastering Structured Output in Large Language Models: Techniques, Challenges, and Future Trends

Large language models are evolving from free‑form text generators to reliable data providers by mastering structured output through prompt engineering, validation frameworks, constrained decoding, supervised fine‑tuning, reinforcement learning, and API‑level capabilities, enabling seamless integration with software systems while addressing hallucinations and format reliability.

API · LLM · Prompt Engineering
0 likes · 28 min read
Old Zhao – Management Systems Only
Oct 13, 2025 · Operations

How to Build a Fail‑Proof Procurement Process with Data‑Driven SRM

This article explains why many procurement processes fail despite formal procedures and provides a step‑by‑step, data‑driven approach—clarifying requirements, using SRM templates, screening suppliers with performance data, scoring comprehensively, ensuring traceability, and conducting post‑award reviews—to select the right suppliers and turn procurement into a strategic advantage.

Data-driven · SRM · evaluation
0 likes · 8 min read
Data Thinking Notes
Sep 10, 2025 · Artificial Intelligence

Why Do Language Models Hallucinate? Uncovering the Statistical Roots

OpenAI’s latest research reveals that language model hallucinations stem from training and evaluation incentives that favor confident guesses over acknowledging uncertainty, and proposes revised scoring methods that reward modesty, highlighting statistical mechanisms behind false answers and offering pathways to reduce hallucinations.

AI Safety · evaluation · hallucination
0 likes · 10 min read
Architect
Sep 9, 2025 · Artificial Intelligence

Why Do Language Models Hallucinate? Insights from OpenAI’s New Study

This article explains why large language models often produce confident but incorrect answers, detailing statistical inevitability, data scarcity, and model capacity limits, and proposes concrete solutions such as confidence thresholds and allowing abstention to reduce hallucinations.

AI Safety · Prompt Engineering · evaluation
0 likes · 8 min read
DataFunSummit
Aug 23, 2025 · Artificial Intelligence

Mastering Role‑Playing AI Agents: Challenges, Techniques, and Future Directions

This article surveys the latest research on role‑playing AI agents, covering their definition, core components, application scenarios, three main challenges—role fidelity, long‑term memory, and evaluation—and presents four technical approaches for each challenge along with future research directions and references.

AI Agents · Memory · Prompt Engineering
0 likes · 22 min read
Data Party THU
Aug 23, 2025 · Artificial Intelligence

How MiroMind‑M1 Sets New Benchmarks in Open‑Source Math Reasoning

The article presents MiroMind‑M1, an open‑source math‑reasoning language model that combines a 719K high‑quality SFT dataset, a novel CAMPO reinforcement‑learning algorithm, and extensive evaluations on AIME24, AIME25, and MATH‑500, demonstrating state‑of‑the‑art performance while reducing token usage.

CAMPO · evaluation · math reasoning
0 likes · 11 min read
JD Tech Talk
Jul 27, 2025 · Artificial Intelligence

Evaluating JoyAgent‑JDGenie: A Lightweight Multi‑Agent AI Framework in Action

This article presents a thorough evaluation of the open‑source JoyAgent‑JDGenie multi‑agent AI framework, covering its background, test cases for restaurant recommendation and travel planning, deployment steps, performance metrics, and concluding recommendations, highlighting its efficiency, ease of deployment, and result quality.

AI · Deployment · agents
0 likes · 8 min read
Zhihu Tech Column
Jul 25, 2025 · Artificial Intelligence

Boost Creative Writing with Zhi-Create-Qwen3-32B: Training, Eval & Deployment

This article introduces the open‑source Zhi‑Create‑Qwen3‑32B model, detailing its fine‑tuned training on creative‑writing data, the multi‑domain dataset strategy, curriculum‑learning based SFT, evaluation on WritingBench, and practical deployment options across various hardware and inference frameworks.

Deployment · creative writing · evaluation
0 likes · 11 min read
ELab Team
Jul 9, 2025 · Artificial Intelligence

How Fast‑Apply AI Models Revolutionize Code Editing with Speculative Decoding

This article explains the design of the edit_file tool, the fast‑apply model that rewrites whole files instead of diffs, its training and evaluation methodology, speculative decoding speed gains, and future research directions for large‑scale code‑editing AI systems.

AI · Model Training · code editing
0 likes · 14 min read
DataFunTalk
Jul 3, 2025 · Artificial Intelligence

How Vivo’s Blue Heart XiaoV Leverages LLMs to Transform Conversational Recommendations

In an interview with Vivo AI engineer Liang Tianan, the article explores the challenges of post‑Q&A recommendation, the integration of large language models into recall, ranking and evaluation pipelines, and the engineering trade‑offs required to deliver high‑quality, diverse suggestions on mobile devices.

LLM · Mobile AI · Recommendation Systems
0 likes · 15 min read
DataFunSummit
Jun 19, 2025 · Artificial Intelligence

How Large Models Are Revolutionizing Douyin’s User Experience – Expert Insights

In a detailed interview, ByteDance AI specialist Cai Conghuai explains how large‑model techniques such as SFT, DPO and RAG address Douyin’s multimodal user‑experience challenges, improve signal detection, root‑cause analysis, and outline future AI‑agent breakthroughs for content platforms.

AI Algorithms · Multimodal Learning · RAG
0 likes · 11 min read
Aikesheng Open Source Community
Jun 17, 2025 · Artificial Intelligence

Introducing SCALE: An Open‑Source Benchmark Redefining LLM SQL Capabilities

This article presents SCALE, a community‑driven, open‑source benchmark that expands beyond simple Text‑to‑SQL accuracy to evaluate large language models on performance, dialect conversion, and deep SQL understanding, offering developers, researchers, and CTOs a realistic measure of AI‑assisted database tasks.

AI · Benchmark · LLM
0 likes · 10 min read
Tencent Technical Engineering
Jun 16, 2025 · Artificial Intelligence

Mastering RAG and AI Agents: Practical Tips, Code Samples, and Evaluation Strategies

This comprehensive guide walks you through the fundamentals of Retrieval‑Augmented Generation (RAG) and AI agents, explains their inner workings, shares optimization tricks, provides ready‑to‑run code snippets, and demonstrates how to evaluate performance with metrics such as recall, faithfulness, and answer relevance.

AI Agents · LLM · Prompt Engineering
0 likes · 36 min read
Model Perspective
May 25, 2025 · Fundamentals

Why We Pretend to Win: The Hidden Math Behind Evaluation Bias

The article explores how people manipulate evaluation systems by redefining variables, adjusting weights, and shifting perspectives, turning losses into perceived wins, and reveals the psychological and statistical biases that create this illusion, urging more honest, multi‑dimensional, transparent modeling for genuine assessment.

Bias · Modeling · Psychology
0 likes · 9 min read
DataFunSummit
May 9, 2025 · Artificial Intelligence

Practical Experience Building Zhihu Direct Answer: An AI‑Powered Search Product

This article presents a comprehensive overview of Zhihu Direct Answer, describing its AI‑driven search architecture, RAG framework, query understanding, retrieval, chunking, reranking, generation, evaluation mechanisms, engineering optimizations, and the professional edition, while sharing concrete performance‑boosting practices and future development plans.

AI · Generation · Product Development
0 likes · 14 min read
Architect
Apr 17, 2025 · Artificial Intelligence

The Second Half of AI: From Model Innovation to Real‑World Utility

The article argues that artificial intelligence has entered a new phase where reinforcement learning finally generalizes, evaluation becomes more important than pure model performance, and researchers must redesign benchmarks and utility‑focused tasks to drive truly transformative progress.

evaluation · research strategy
0 likes · 16 min read
Nightwalker Tech
Apr 1, 2025 · Artificial Intelligence

Evaluation of AutoGLM: Features, Architecture, and Practical Test Results

This article reviews AutoGLM, the first "think‑while‑doing" AI agent released by Zhipu AI, detailing its core capabilities, full‑stack architecture, user experience, identified limitations, and the outcomes of three hands‑on tests using both the client application and a Chrome extension.

AI Agent · AutoGLM · evaluation
0 likes · 4 min read
Meituan Technology Team
Mar 27, 2025 · Artificial Intelligence

Q-Eval-100K Dataset and Q-Eval-Score Evaluation Framework for Text-to-Visual Generation

The Q‑Eval‑100K dataset, comprising 100 k AIGC images and videos with separate visual‑quality and textual‑consistency annotations, powers the open‑source Q‑Eval‑Score framework that fine‑tunes multimodal models to deliver state‑of‑the‑art, scalable, and objective evaluation—including a “vague‑to‑specific” strategy for long prompts—surpassing existing benchmarks.

AIGC · Dataset · evaluation
0 likes · 9 min read
Alibaba Cloud Developer
Mar 24, 2025 · Artificial Intelligence

Why LLM Internet Search Fails and How to Fix It: A Deep Dive into Qwen, Doubao, and DeepSeek

This article analyses the shortcomings of large‑model internet search—such as unverifiable sources, fabricated content, and poor instruction compliance—by comparing Qwen‑max, Doubao‑1.5‑pro‑256k, and DeepSeek‑v3, and proposes prompt engineering, post‑processing, and custom tool improvements to boost reliability.

AI · LLM · evaluation
0 likes · 22 min read
DaTaobao Tech
Mar 19, 2025 · Artificial Intelligence

Retrieval Augmented Generation (RAG): Principles, Challenges, and Implementation Techniques

Retrieval‑augmented generation (RAG) enhances large language models by integrating a preprocessing pipeline—cleaning, chunking, embedding, and vector storage—with a query‑driven retrieval and prompt‑injection workflow, leveraging vector databases, multi‑stage recall, advanced prompting, and comprehensive evaluation metrics to mitigate knowledge cut‑off, hallucinations, and security issues.

LLM · RAG · Retrieval Augmented Generation
0 likes · 27 min read
Efficient Ops
Mar 12, 2025 · Operations

How BizDevOps Is Accelerating Digital Transformation in Finance

This article explains the governmental push for digital transformation in financial institutions, introduces the BizDevOps integration model and its domestic and international standards, outlines the evaluation framework and process, showcases case studies, and announces the open registration for the 2025 BizDevOps assessment.

BizDevOps · Digital Transformation · Financial Industry
0 likes · 9 min read
AI Algorithm Path
Feb 20, 2025 · Artificial Intelligence

What Is Perplexity in Large Language Models?

The article explains perplexity as a metric for evaluating large language models, walks through a step‑by‑step probability calculation for a sample sentence, shows how to normalize by sentence length using the geometric mean, and demonstrates that lower perplexity indicates a more accurate and less uncertain model.

AI · Language Model · Perplexity
0 likes · 6 min read
JD Tech
Feb 14, 2025 · Artificial Intelligence

JD Merchant Intelligent Assistant – Multi‑Agent System Architecture, Planning, and Evaluation

JD’s Merchant Intelligent Assistant leverages a large‑language‑model‑based multi‑agent architecture to provide 24/7 e‑commerce support, detailing its evolution, planning techniques, online inference, evaluation methods, sample generation, and practical insights for scalable AI‑driven operations.

Automation · E-commerce AI · LLM
0 likes · 22 min read
JD Retail Technology
Feb 10, 2025 · Artificial Intelligence

JD Merchant Intelligent Assistant: Multi‑Agent Architecture and Technical Exploration

The JD Merchant Intelligent Assistant employs a large‑language‑model‑driven multi‑agent architecture with dynamic ReAct planning, enabling merchants to query and execute store operations in under a second with over 90 % decision accuracy, while reducing inference cost, hallucinations, and engineering effort across diverse e‑commerce tasks.

AI · LLM · Multi-Agent
0 likes · 25 min read
DataFunSummit
Jan 25, 2025 · Artificial Intelligence

AI-Driven Next-Generation Sales: Project Overview, Core Technologies, System Deployment, and Future Outlook

This article explores how AI transforms next‑generation sales by detailing project background and goals, core technologies such as efficient sample generation, model training and evaluation, system deployment impact, practical case studies, challenges, solutions, and future directions across multiple industries.

AI · Model Training · Sales Automation
0 likes · 25 min read
Zhihu Tech Column
Jan 17, 2025 · Artificial Intelligence

Zhihu Direct Answer: Product Overview and Technical Practices

This article summarizes the key technical insights from Zhihu Direct Answer, an AI-powered search product, covering its product overview, RAG framework, query understanding, retrieval strategies, chunking, reranking, generation techniques, evaluation methods, and engineering optimizations for cost and performance.

AI search · Engineering Optimization · Generation
0 likes · 13 min read
NewBeeNLP
Jan 17, 2025 · Artificial Intelligence

Unlocking Multimodal Intelligence: A Deep Dive into Next Token Prediction

This comprehensive survey examines the foundations, tokenization techniques, model architectures, training paradigms, evaluation benchmarks, and open challenges of multimodal next‑token prediction (MMNTP), offering researchers a clear roadmap for future advances in multimodal AI.

Model architecture · Multimodal AI · Next Token Prediction
0 likes · 9 min read
Data Thinking Notes
Jan 7, 2025 · Databases

Unlocking LLM-Powered Text-to-SQL: From Basics to Cutting-Edge Techniques

This article provides a comprehensive overview of LLM-based Text-to-SQL technology, covering its background, evolution, challenges, various LLM-driven methods, benchmark datasets, evaluation metrics, and future research directions to guide researchers and practitioners in advancing natural language interfaces for databases.

LLM · Prompt Engineering · Text-to-SQL
0 likes · 18 min read
DataFunSummit
Jan 1, 2025 · Artificial Intelligence

Challenges and Evaluation Strategies for LLM Agents in 2024

The article outlines the rapid progress of LLM agents in 2024 while highlighting key difficulties in planning capabilities, evaluation methods, dataset generation, and metric design, and suggests practical combinations and product‑level enhancements to improve efficiency, accuracy, and usability.

AI · Agent · Dataset
0 likes · 3 min read
Alibaba Cloud Big Data AI Platform
Dec 12, 2024 · Artificial Intelligence

How PertEval Reveals the Real Knowledge Limits of Large Language Models

At NeurIPS 2024, Alibaba Cloud's PAI team presented the Spotlight paper PertEval, which introduces knowledge‑invariant perturbations to expose the true knowledge capacity of LLMs, critiques over‑optimistic static benchmarks, and showcases responsible AI solutions and platform demos for enterprise use.

Alibaba Cloud · NeurIPS 2024 · PertEval
0 likes · 6 min read
DevOps
Nov 21, 2024 · Product Management

Comprehensive Product KPI Metrics and Evaluation Guidelines

This article presents a detailed collection of product key performance indicators (KPIs) covering user growth, retention, activity, satisfaction, market share, revenue, development cycles, resource utilization, team satisfaction, brand awareness, and strategic goal achievement, along with formulas, weighting, and scoring methods for systematic performance assessment.

KPIs · evaluation · performance metrics
0 likes · 13 min read
Fighter's World
Nov 18, 2024 · Product Management

Uncovering AI Product Design Challenges: Insights from OpenAI and Anthropic CPOs

The article distills a fireside chat between OpenAI’s CPO Kevin Weil and Anthropic’s CPO Mike Krieger, highlighting how uncertainty, iterative co‑design, evolving product‑manager skills, human‑AI collaboration, non‑deterministic UI, and emerging trends like proactivity, asynchrony, multimodality and personalization shape modern AI product development.

AI product design · co-design · evaluation
0 likes · 13 min read
Baobao Algorithm Notes
Nov 4, 2024 · Artificial Intelligence

Uncovering 16 Limits of AI Search Engines and 16 Design Recommendations

A user study with 21 participants reveals sixteen critical limitations of generative AI search engines, maps them to eight quantitative metrics, proposes sixteen design recommendations, and evaluates You.com, Perplexity and BingChat against this framework to highlight current performance gaps.

AI search · Generative Search · LLM
0 likes · 12 min read