Tag: evaluation


Model Perspective
May 25, 2025 · Fundamentals

Why We Pretend to Win: The Hidden Math Behind Evaluation Bias

The article explores how people game evaluation systems by redefining variables, adjusting weights, and shifting perspectives so that losses read as wins. It traces the psychological and statistical biases behind this illusion and urges more honest, multi‑dimensional, transparent modeling for genuine assessment.

bias · decision making · evaluation
9 min read
DataFunSummit
May 9, 2025 · Artificial Intelligence

Practical Experience Building Zhihu Direct Answer: An AI‑Powered Search Product

This article presents a comprehensive overview of Zhihu Direct Answer, describing its AI‑driven search architecture, RAG framework, query understanding, retrieval, chunking, reranking, generation, evaluation mechanisms, engineering optimizations, and the professional edition, while sharing concrete performance‑boosting practices and future development plans.

AI · Generation · RAG
14 min read
Architect
Apr 17, 2025 · Artificial Intelligence

The Second Half of AI: From Model Innovation to Real‑World Utility

The article argues that artificial intelligence has entered a new phase where reinforcement learning finally generalizes, evaluation becomes more important than pure model performance, and researchers must redesign benchmarks and utility‑focused tasks to drive truly transformative progress.

Artificial Intelligence · Research Strategy · evaluation
16 min read
Nightwalker Tech
Apr 1, 2025 · Artificial Intelligence

Evaluation of AutoGLM: Features, Architecture, and Practical Test Results

This article reviews AutoGLM, the first "think‑while‑doing" AI agent released by Zhipu AI, detailing its core capabilities, full‑stack architecture, user experience, identified limitations, and the outcomes of three hands‑on tests using both the client application and a Chrome extension.

AI Agent · Artificial Intelligence · AutoGLM
4 min read
Architect
Mar 31, 2025 · Artificial Intelligence

A Comprehensive Study of Failure Modes in Large‑Language‑Model Based Multi‑Agent Systems

This paper presents a systematic investigation of failure patterns in LLM‑driven multi‑agent systems, introducing a 14‑type taxonomy (MASFT) derived from over 150 annotated dialogues, evaluating it with an LLM‑as‑a‑judge pipeline, and exploring modest intervention strategies while releasing all data and tools for future research.

AI · LLM · Multi-Agent Systems
29 min read
DaTaobao Tech
Mar 19, 2025 · Artificial Intelligence

Retrieval Augmented Generation (RAG): Principles, Challenges, and Implementation Techniques

Retrieval‑augmented generation (RAG) enhances large language models by integrating a preprocessing pipeline—cleaning, chunking, embedding, and vector storage—with a query‑driven retrieval and prompt‑injection workflow, leveraging vector databases, multi‑stage recall, advanced prompting, and comprehensive evaluation metrics to mitigate knowledge cut‑off, hallucinations, and security issues.

LLM · RAG · Retrieval-Augmented Generation
27 min read
Efficient Ops
Mar 12, 2025 · Operations

How BizDevOps Is Accelerating Digital Transformation in Finance

This article explains the governmental push for digital transformation in financial institutions, introduces the BizDevOps integration model and its domestic and international standards, outlines the evaluation framework and process, showcases case studies, and announces the open registration for the 2025 BizDevOps assessment.

BizDevOps · Digital Transformation · Financial Industry
9 min read
JD Tech
Feb 14, 2025 · Artificial Intelligence

JD Merchant Intelligent Assistant – Multi‑Agent System Architecture, Planning, and Evaluation

JD’s Merchant Intelligent Assistant leverages a large‑language‑model‑based multi‑agent architecture to provide 24/7 e‑commerce support, detailing its evolution, planning techniques, online inference, evaluation methods, sample generation, and practical insights for scalable AI‑driven operations.

Automation · E-commerce AI · LLM
22 min read
JD Retail Technology
Feb 10, 2025 · Artificial Intelligence

JD Merchant Intelligent Assistant: Multi‑Agent Architecture and Technical Exploration

The JD Merchant Intelligent Assistant employs a large‑language‑model‑driven multi‑agent architecture with dynamic ReAct planning, enabling merchants to query and execute store operations in under a second with over 90% decision accuracy, while reducing inference cost, hallucinations, and engineering effort across diverse e‑commerce tasks.

AI · LLM · React
25 min read
DataFunSummit
Jan 25, 2025 · Artificial Intelligence

AI-Driven Next-Generation Sales: Project Overview, Core Technologies, System Deployment, and Future Outlook

This article explores how AI transforms next‑generation sales by detailing project background and goals, core technologies such as efficient sample generation, model training and evaluation, system deployment impact, practical case studies, challenges, solutions, and future directions across multiple industries.

AI · Sales Automation · Sample Generation
25 min read
Zhihu Tech Column
Jan 17, 2025 · Artificial Intelligence

Zhihu Direct Answer: Product Overview and Technical Practices

This article summarizes the key technical insights from Zhihu Direct Answer, an AI-powered search product, covering its product overview, RAG framework, query understanding, retrieval strategies, chunking, reranking, generation techniques, evaluation methods, and engineering optimizations for cost and performance.

AI Search · Chunking · Engineering Optimization
13 min read
DataFunSummit
Jan 1, 2025 · Artificial Intelligence

Challenges and Evaluation Strategies for LLM Agents in 2024

The article outlines the rapid progress of LLM agents in 2024 while highlighting key difficulties in planning capabilities, evaluation methods, dataset generation, and metric design, and suggests practical combinations and product‑level enhancements to improve efficiency, accuracy, and usability.

AI · LLM · agent
3 min read
DataFunSummit
Dec 23, 2024 · Artificial Intelligence

Huolala's Large Model Evaluation Framework (LaLaEval) and Application Practices

This article presents Huolala's comprehensive LaLaEval framework for evaluating large language models, detailing the challenges of model deployment, the five‑step assessment process, two real‑world case studies in freight and driver invitation, and future directions toward more automated, product‑driven evaluation.

AI · Logistics · RAG
24 min read
DevOps
Nov 21, 2024 · Product Management

Comprehensive Product KPI Metrics and Evaluation Guidelines

This article presents a detailed collection of product key performance indicators (KPIs) covering user growth, retention, activity, satisfaction, market share, revenue, development cycles, resource utilization, team satisfaction, brand awareness, and strategic goal achievement, along with formulas, weighting, and scoring methods for systematic performance assessment.

KPIs · Performance Metrics · evaluation
13 min read
Bilibili Tech
Sep 18, 2024 · Artificial Intelligence

Index-1.9B-32K: A 2% GPT-Size Model with Powerful Long-Context Capabilities

Index-1.9B-32K is a 1.9B-parameter model with a 32K-token context window. At only about 2% of GPT‑4’s size, it achieves long‑text performance comparable to larger models. It was trained via long pre‑training and supervised fine‑tuning, at the cost of reduced short‑context ability.

AI · Fine-tuning · Pretraining
12 min read
Architect
Jul 13, 2024 · Artificial Intelligence

Practical Guide to Building LLM Products: Prompt Engineering, RAG, Evaluation, and Operations

This article provides a comprehensive, step‑by‑step guide for developing large‑language‑model (LLM) applications, covering prompt design techniques, n‑shot and chain‑of‑thought strategies, retrieval‑augmented generation, structured I/O, workflow optimization, evaluation pipelines, operational best practices, and team organization to create reliable, scalable AI products.

AI operations · LLM · RAG
54 min read
DataFunTalk
Jul 7, 2024 · Artificial Intelligence

Large Model Application Development: Architecture, Lifecycle, and Prompt Engineering

This article presents a comprehensive knowledge map for developing large‑model applications, covering a four‑layer technical architecture, the full development lifecycle, core elements such as prompt engineering and model fine‑tuning, evaluation methods, and practical case studies, offering guidance for both enterprises and startups.

AI Application Development · evaluation · large model
15 min read
Continuous Delivery 2.0
Jul 3, 2024 · Artificial Intelligence

Applying Large Language Models to Software Engineering: Challenges, Cross‑File Editing Issues, Bug‑Fixing Evaluation, and SWE‑Bench Results

This article examines the practical challenges of using large language models in software development, including handling long contexts, cross‑file editing, bug‑fixing evaluation methods, and presents benchmark results from SWE‑Bench and its Lite subset to assess model capabilities.

Cross-File Editing · LLM · SWE-bench
7 min read