Tagged articles

Evaluation

213 articles · Page 2 of 3

Oct 23, 2025 · Backend Development

Boost PHP Performance with CEL-PHP: A Fast, Safe Expression Engine

This guide introduces CEL-PHP, a high‑performance, non‑Turing‑complete expression engine for PHP 8+, showing how to install it, evaluate simple and contextual expressions, handle parsing and optimization, integrate caching, register custom functions, and avoid common pitfalls for robust backend rule evaluation.

CEL · Caching · Evaluation

0 likes · 8 min read

AI2ML AI to Machine Learning

Oct 20, 2025 · Artificial Intelligence

nanochat Source Code Deep Dive: Data Prep, Model Design, Training & Evaluation

This article revisits nanochat's core components, detailing the preparation of diverse training datasets, the scaling calculations for tokens and parameters, the model's MQA and KV‑cache design, the full training pipeline with gradient accumulation and mixed‑precision, cost breakdown, inference optimizations, evaluation tasks, and identified limitations with suggested improvements.

Evaluation · KV cache · LLM

0 likes · 9 min read

Alibaba Cloud Developer

Oct 15, 2025 · Artificial Intelligence

Mastering Structured Output in Large Language Models: Techniques, Challenges, and Future Trends

Large language models are evolving from free‑form text generators to reliable data providers by mastering structured output through prompt engineering, validation frameworks, constrained decoding, supervised fine‑tuning, reinforcement learning, and API‑level capabilities, enabling seamless integration with software systems while addressing hallucinations and format reliability.

API · Constrained Decoding · Evaluation

0 likes · 28 min read

HyperAI Super Neural

Oct 14, 2025 · Artificial Intelligence

NeurIPS 2025: OCRBench v2 Shows Gemini Leads Chinese OCR Ranking Yet Only Earns a Passing Score

OCRBench v2, introduced at NeurIPS 2025, evaluates 58 multimodal models on 23 OCR‑related tasks in Chinese and English, revealing that even top models like Gemini‑2.5‑Pro barely exceed the passing threshold and that most models struggle with fine‑grained text localization and multilingual performance.

Evaluation · Gemini · NeurIPS 2025

0 likes · 8 min read

Old Zhao – Management Systems Only

Oct 13, 2025 · Operations

How to Build a Fail‑Proof Procurement Process with Data‑Driven SRM

This article explains why many procurement processes fail despite formal procedures and provides a step‑by‑step, data‑driven approach—clarifying requirements, using SRM templates, screening suppliers with performance data, scoring comprehensively, ensuring traceability, and conducting post‑award reviews—to select the right suppliers and turn procurement into a strategic advantage.

Data-Driven · Evaluation · SRM

0 likes · 8 min read

Fun with Large Models

Sep 17, 2025 · Artificial Intelligence

Evaluating Fine-Tuned Large Model Performance: Methods and Interview Tips

The article explains how to assess fine‑tuned large models using both human judgment and dataset‑driven metrics, outlines common pitfalls, introduces benchmark datasets and evaluation frameworks, and provides concise answers to related interview questions.

EvalScope · Evaluation · benchmark datasets

0 likes · 7 min read

Data Thinking Notes

Sep 10, 2025 · Artificial Intelligence

Why Do Language Models Hallucinate? Uncovering the Statistical Roots

OpenAI’s latest research reveals that language model hallucinations stem from training and evaluation incentives that favor confident guesses over acknowledging uncertainty, and proposes revised scoring methods that reward modesty, highlighting statistical mechanisms behind false answers and offering pathways to reduce hallucinations.

AI safety · Evaluation · Hallucination

0 likes · 10 min read

Architect

Sep 9, 2025 · Artificial Intelligence

Why Do Language Models Hallucinate? Insights from OpenAI’s New Study

This article explains why large language models often produce confident but incorrect answers, detailing statistical inevitability, data scarcity, and model capacity limits, and proposes concrete solutions such as confidence thresholds and allowing abstention to reduce hallucinations.

AI safety · Evaluation · Hallucination

0 likes · 8 min read

DataFunSummit

Aug 23, 2025 · Artificial Intelligence

Mastering Role‑Playing AI Agents: Challenges, Techniques, and Future Directions

This article surveys the latest research on role‑playing AI agents, covering their definition, core components, application scenarios, three main challenges—role fidelity, long‑term memory, and evaluation—and presents four technical approaches for each challenge along with future research directions and references.

AI Agents · Evaluation · Prompt Engineering

0 likes · 22 min read

Data Party THU

Aug 23, 2025 · Artificial Intelligence

How MiroMind‑M1 Sets New Benchmarks in Open‑Source Math Reasoning

The article presents MiroMind‑M1, an open‑source math‑reasoning language model that combines a 719K high‑quality SFT dataset, a novel CAMPO reinforcement‑learning algorithm, and extensive evaluations on AIME24, AIME25, and MATH‑500, demonstrating state‑of‑the‑art performance while reducing token usage.

CAMPO · Evaluation · math reasoning

0 likes · 11 min read

JD Tech Talk

Jul 27, 2025 · Artificial Intelligence

Evaluating JoyAgent‑JDGenie: A Lightweight Multi‑Agent AI Framework in Action

This article presents a thorough evaluation of the open‑source JoyAgent‑JDGenie multi‑agent AI framework, covering its background, test cases for restaurant recommendation and travel planning, deployment steps, performance metrics, and concluding recommendations, highlighting its efficiency, ease of deployment, and result quality.

AI · Agents · Evaluation

0 likes · 8 min read

Zhihu Tech Column

Jul 25, 2025 · Artificial Intelligence

Boost Creative Writing with Zhi-Create-Qwen3-32B: Training, Eval & Deployment

This article introduces the open‑source Zhi‑Create‑Qwen3‑32B model, detailing its fine‑tuned training on creative‑writing data, the multi‑domain dataset strategy, curriculum‑learning based SFT, evaluation on WritingBench, and practical deployment options across various hardware and inference frameworks.

Evaluation · Large Language Model · creative writing

0 likes · 11 min read

ELab Team

Jul 9, 2025 · Artificial Intelligence

How Fast‑Apply AI Models Revolutionize Code Editing with Speculative Decoding

This article explains the design of the edit_file tool, the fast‑apply model that rewrites whole files instead of diffs, its training and evaluation methodology, speculative decoding speed gains, and future research directions for large‑scale code‑editing AI systems.

AI · Evaluation · Model Training

0 likes · 14 min read

DataFunTalk

Jul 3, 2025 · Artificial Intelligence

How Vivo’s Blue Heart XiaoV Leverages LLMs to Transform Conversational Recommendations

In an interview with Vivo AI engineer Liang Tianan, the article explores the challenges of post‑Q&A recommendation, the integration of large language models into recall, ranking and evaluation pipelines, and the engineering trade‑offs required to deliver high‑quality, diverse suggestions on mobile devices.

Evaluation · LLM · Multimodal

0 likes · 15 min read

DataFunSummit

Jun 19, 2025 · Artificial Intelligence

How Large Models Are Revolutionizing Douyin’s User Experience – Expert Insights

In a detailed interview, ByteDance AI specialist Cai Conghuai explains how large‑model techniques such as SFT, DPO and RAG address Douyin’s multimodal user‑experience challenges, improve signal detection, root‑cause analysis, and outline future AI‑agent breakthroughs for content platforms.

AI Algorithms · Evaluation · Multimodal Learning

0 likes · 11 min read

Aikesheng Open Source Community

Jun 17, 2025 · Artificial Intelligence

Introducing SCALE: An Open‑Source Benchmark Redefining LLM SQL Capabilities

This article presents SCALE, a community‑driven, open‑source benchmark that expands beyond simple Text‑to‑SQL accuracy to evaluate large language models on performance, dialect conversion, and deep SQL understanding, offering developers, researchers, and CTOs a realistic measure of AI‑assisted database tasks.

AI · Evaluation · LLM

0 likes · 10 min read

Tencent Technical Engineering

Jun 16, 2025 · Artificial Intelligence

Mastering RAG and AI Agents: Practical Tips, Code Samples, and Evaluation Strategies

This comprehensive guide walks you through the fundamentals of Retrieval‑Augmented Generation (RAG) and AI agents, explains their inner workings, shares optimization tricks, provides ready‑to‑run code snippets, and demonstrates how to evaluate performance with metrics such as recall, faithfulness, and answer relevance.

AI Agents · Evaluation · LLM

0 likes · 36 min read

Baobao Algorithm Notes

May 26, 2025 · Artificial Intelligence

Why Do Reasoning LLMs Lose Instruction-Following Ability? A Deep Dive into Recent Findings

This article compares two recent papers that investigate why large reasoning models such as Llama and Qwen show degraded instruction‑following performance when using chain‑of‑thought prompting, analyzing attention patterns, training effects, and proposed mitigation strategies.

Chain-of-Thought · Evaluation · LLM

0 likes · 11 min read

Model Perspective

May 25, 2025 · Fundamentals

Why We Pretend to Win: The Hidden Math Behind Evaluation Bias

The article explores how people manipulate evaluation systems by redefining variables, adjusting weights, and shifting perspectives, turning losses into perceived wins, and reveals the psychological and statistical biases that create this illusion, urging more honest, multi‑dimensional, transparent modeling for genuine assessment.

Bias · Evaluation · decision-making

0 likes · 9 min read

DataFunSummit

May 9, 2025 · Artificial Intelligence

Practical Experience Building Zhihu Direct Answer: An AI‑Powered Search Product

This article presents a comprehensive overview of Zhihu Direct Answer, describing its AI‑driven search architecture, RAG framework, query understanding, retrieval, chunking, reranking, generation, evaluation mechanisms, engineering optimizations, and the professional edition, while sharing concrete performance‑boosting practices and future development plans.

AI · Evaluation · Product Development

0 likes · 14 min read

Architect

Apr 17, 2025 · Artificial Intelligence

The Second Half of AI: From Model Innovation to Real‑World Utility

The article argues that artificial intelligence has entered a new phase where reinforcement learning finally generalizes, evaluation becomes more important than pure model performance, and researchers must redesign benchmarks and utility‑focused tasks to drive truly transformative progress.

Evaluation · research strategy

0 likes · 16 min read

Nightwalker Tech

Apr 1, 2025 · Artificial Intelligence

Evaluation of AutoGLM: Features, Architecture, and Practical Test Results

This article reviews AutoGLM, the first "think‑while‑doing" AI agent released by Zhipu AI, detailing its core capabilities, full‑stack architecture, user experience, identified limitations, and the outcomes of three hands‑on tests using both the client application and a Chrome extension.

AI Agent · AutoGLM · Evaluation

0 likes · 4 min read

Meituan Technology Team

Mar 27, 2025 · Artificial Intelligence

Q-Eval-100K Dataset and Q-Eval-Score Evaluation Framework for Text-to-Visual Generation

The Q‑Eval‑100K dataset, comprising 100 k AIGC images and videos with separate visual‑quality and textual‑consistency annotations, powers the open‑source Q‑Eval‑Score framework that fine‑tunes multimodal models to deliver state‑of‑the‑art, scalable, and objective evaluation—including a “vague‑to‑specific” strategy for long prompts—surpassing existing benchmarks.

AIGC · Evaluation · Multimodal

0 likes · 9 min read

Alibaba Cloud Developer

Mar 24, 2025 · Artificial Intelligence

Boost LLM Evaluation with Semantic Enrichment and Vector Search

This article explains how semantic enrichment, vector and hybrid search, and clustering techniques can be applied to large language model logs to evaluate inputs and outputs, improve compliance auditing, and enhance model iteration across various business scenarios.

AI · Evaluation · LLM

0 likes · 12 min read

Alibaba Cloud Developer

Mar 24, 2025 · Artificial Intelligence

Why LLM Internet Search Fails and How to Fix It: A Deep Dive into Qwen, Doubao, and DeepSeek

This article analyses the shortcomings of large‑model internet search—such as unverifiable sources, fabricated content, and poor instruction compliance—by comparing Qwen‑max, Doubao‑1.5‑pro‑256k, and DeepSeek‑v3, and proposes prompt engineering, post‑processing, and custom tool improvements to boost reliability.

AI · Evaluation · LLM

0 likes · 22 min read

DaTaobao Tech

Mar 19, 2025 · Artificial Intelligence

Retrieval Augmented Generation (RAG): Principles, Challenges, and Implementation Techniques

Retrieval‑augmented generation (RAG) enhances large language models by integrating a preprocessing pipeline—cleaning, chunking, embedding, and vector storage—with a query‑driven retrieval and prompt‑injection workflow, leveraging vector databases, multi‑stage recall, advanced prompting, and comprehensive evaluation metrics to mitigate knowledge cut‑off, hallucinations, and security issues.

Evaluation · LLM · RAG

0 likes · 27 min read

Efficient Ops

Mar 12, 2025 · Operations

How BizDevOps Is Accelerating Digital Transformation in Finance

This article explains the governmental push for digital transformation in financial institutions, introduces the BizDevOps integration model and its domestic and international standards, outlines the evaluation framework and process, showcases case studies, and announces the open registration for the 2025 BizDevOps assessment.

BizDevOps · Evaluation · Financial Industry

0 likes · 9 min read

AI Algorithm Path

Feb 20, 2025 · Artificial Intelligence

What Is Perplexity in Large Language Models?

The article explains perplexity as a metric for evaluating large language models, walks through a step‑by‑step probability calculation for a sample sentence, shows how to normalize by sentence length using the geometric mean, and demonstrates that lower perplexity indicates a more accurate and less uncertain model.

AI · Evaluation · Language Model

0 likes · 6 min read

JD Tech

Feb 14, 2025 · Artificial Intelligence

JD Merchant Intelligent Assistant – Multi‑Agent System Architecture, Planning, and Evaluation

JD’s Merchant Intelligent Assistant leverages a large‑language‑model‑based multi‑agent architecture to provide 24/7 e‑commerce support, detailing its evolution, planning techniques, online inference, evaluation methods, sample generation, and practical insights for scalable AI‑driven operations.

Automation · Evaluation · LLM

0 likes · 22 min read

JD Retail Technology

Feb 10, 2025 · Artificial Intelligence

JD Merchant Intelligent Assistant: Multi‑Agent Architecture and Technical Exploration

The JD Merchant Intelligent Assistant employs a large‑language‑model‑driven multi‑agent architecture with dynamic ReAct planning, enabling merchants to query and execute store operations in under a second with over 90 % decision accuracy, while reducing inference cost, hallucinations, and engineering effort across diverse e‑commerce tasks.

AI · Evaluation · LLM

0 likes · 25 min read

DataFunSummit

Jan 25, 2025 · Artificial Intelligence

AI-Driven Next-Generation Sales: Project Overview, Core Technologies, System Deployment, and Future Outlook

This article explores how AI transforms next‑generation sales by detailing project background and goals, core technologies such as efficient sample generation, model training and evaluation, system deployment impact, practical case studies, challenges, solutions, and future directions across multiple industries.

AI · Evaluation · Large Language Model

0 likes · 25 min read

Zhihu Tech Column

Jan 17, 2025 · Artificial Intelligence

Zhihu Direct Answer: Product Overview and Technical Practices

This article summarizes the key technical insights from Zhihu Direct Answer, an AI-powered search product, covering its product overview, RAG framework, query understanding, retrieval strategies, chunking, reranking, generation techniques, evaluation methods, and engineering optimizations for cost and performance.

AI Search · Chunking · Engineering Optimization

0 likes · 13 min read

NewBeeNLP

Jan 17, 2025 · Artificial Intelligence

Unlocking Multimodal Intelligence: A Deep Dive into Next Token Prediction

This comprehensive survey examines the foundations, tokenization techniques, model architectures, training paradigms, evaluation benchmarks, and open challenges of multimodal next‑token prediction (MMNTP), offering researchers a clear roadmap for future advances in multimodal AI.

Evaluation · Multimodal AI · Next Token Prediction

0 likes · 9 min read

Data Thinking Notes

Jan 7, 2025 · Databases

Unlocking LLM-Powered Text-to-SQL: From Basics to Cutting-Edge Techniques

This article provides a comprehensive overview of LLM-based Text-to-SQL technology, covering its background, evolution, challenges, various LLM-driven methods, benchmark datasets, evaluation metrics, and future research directions to guide researchers and practitioners in advancing natural language interfaces for databases.

Evaluation · LLM · Prompt Engineering

0 likes · 18 min read

DataFunSummit

Jan 1, 2025 · Artificial Intelligence

Challenges and Evaluation Strategies for LLM Agents in 2024

The article outlines the rapid progress of LLM agents in 2024 while highlighting key difficulties in planning capabilities, evaluation methods, dataset generation, and metric design, and suggests practical combinations and product‑level enhancements to improve efficiency, accuracy, and usability.

AI · Agent · Evaluation

0 likes · 3 min read

Alibaba Cloud Big Data AI Platform

Dec 12, 2024 · Artificial Intelligence

How PertEval Reveals the Real Knowledge Limits of Large Language Models

At NeurIPS 2024, Alibaba Cloud's PAI team presented the Spotlight paper PertEval, which introduces knowledge‑invariant perturbations to expose the true knowledge capacity of LLMs, critiques over‑optimistic static benchmarks, and showcases responsible AI solutions and platform demos for enterprise use.

Alibaba Cloud · Evaluation · NeurIPS 2024

0 likes · 6 min read

DevOps

Nov 21, 2024 · Product Management

Comprehensive Product KPI Metrics and Evaluation Guidelines

This article presents a detailed collection of product key performance indicators (KPIs) covering user growth, retention, activity, satisfaction, market share, revenue, development cycles, resource utilization, team satisfaction, brand awareness, and strategic goal achievement, along with formulas, weighting, and scoring methods for systematic performance assessment.

Evaluation · KPIs · performance metrics

0 likes · 13 min read

Fighter's World

Nov 18, 2024 · Product Management

Uncovering AI Product Design Challenges: Insights from OpenAI and Anthropic CPOs

The article distills a fireside chat between OpenAI’s CPO Kevin Weil and Anthropic’s CPO Mike Krieger, highlighting how uncertainty, iterative co‑design, evolving product‑manager skills, human‑AI collaboration, non‑deterministic UI, and emerging trends like proactivity, asynchrony, multimodality and personalization shape modern AI product development.

AI product design · Evaluation · Human-AI Collaboration

0 likes · 13 min read

Baobao Algorithm Notes

Nov 4, 2024 · Artificial Intelligence

Uncovering 16 Limits of AI Search Engines and 16 Design Recommendations

A user study with 21 participants reveals sixteen critical limitations of generative AI search engines, maps them to eight quantitative metrics, proposes sixteen design recommendations, and evaluates You.com, Perplexity and BingChat against this framework to highlight current performance gaps.

AI Search · Evaluation · LLM

0 likes · 12 min read

Baobao Algorithm Notes

Oct 17, 2024 · Artificial Intelligence

How Meta’s Movie Gen Pushes Text‑to‑Video Generation to New Heights

Meta’s newly released 92‑page Movie Gen paper introduces a multimodal LLM that unifies text‑to‑image, text‑to‑video, personalized video, precise video editing, and audio generation, detailing its dual‑model architecture, training pipeline, temporal auto‑encoder design, scaling strategies, evaluation benchmark, and ablation studies.

Evaluation · Model Scaling · deep learning

0 likes · 34 min read

Bilibili Tech

Sep 18, 2024 · Artificial Intelligence

Index-1.9B-32K: A 2% GPT-Size Model with Powerful Long-Context Capabilities

Index-1.9B-32K is a 1.9B-parameter model with a 32K token context window, achieving strong long‑text performance comparable to larger models while using only about 2% of GPT‑4’s compute, trained via long pre‑training and supervised fine‑tuning, with a trade‑off of reduced short‑context ability.

AI · Evaluation · Long Context

0 likes · 12 min read

Model Perspective

Jul 20, 2024 · Fundamentals

Why Evaluation Is Only the First Step Toward Effective Optimization in Mathematical Modeling

The article explains that evaluation in mathematical modeling is essential for analyzing and comparing solutions, but it must be followed by continuous optimization and actionable improvements to avoid stagnation and achieve real progress.

Evaluation · Optimization · continuous improvement

0 likes · 3 min read

Architect

Jul 13, 2024 · Artificial Intelligence

Practical Guide to Building LLM Products: Prompt Engineering, RAG, Evaluation, and Operations

This article provides a comprehensive, step‑by‑step guide for developing large‑language‑model (LLM) applications, covering prompt design techniques, n‑shot and chain‑of‑thought strategies, retrieval‑augmented generation, structured I/O, workflow optimization, evaluation pipelines, operational best practices, and team organization to create reliable, scalable AI products.

AI Operations · Evaluation · LLM

0 likes · 54 min read

DataFunTalk

Jul 7, 2024 · Artificial Intelligence

Large Model Application Development: Architecture, Lifecycle, and Prompt Engineering

This article presents a comprehensive knowledge map for developing large‑model applications, covering a four‑layer technical architecture, the full development lifecycle, core elements such as prompt engineering and model fine‑tuning, evaluation methods, and practical case studies, offering guidance for both enterprises and startups.

AI application development · Evaluation · Prompt Engineering

0 likes · 15 min read

AI Large Model Application Practice

Jul 4, 2024 · Artificial Intelligence

Mastering Multimodal RAG: From PDF Parsing to Advanced Query Rewriting

This article explains how to handle complex multimodal PDFs in RAG systems, outlines extraction, indexing, and multimodal model integration, details four query‑rewriting strategies (HyDE, stepwise, sub‑question, backward), and presents key evaluation metrics and tools for assessing RAG performance.

Document Parsing · Evaluation · Multimodal

0 likes · 12 min read

Continuous Delivery 2.0

Jul 3, 2024 · Artificial Intelligence

Applying Large Language Models to Software Engineering: Challenges, Cross‑File Editing Issues, Bug‑Fixing Evaluation, and SWE‑Bench Results

This article examines the practical challenges of using large language models in software development, including handling long contexts, cross‑file editing, bug‑fixing evaluation methods, and presents benchmark results from SWE‑Bench and its Lite subset to assess model capabilities.

Cross-File Editing · Evaluation · LLM

0 likes · 7 min read

DataFunSummit

Jun 16, 2024 · Artificial Intelligence

Reinforcement Learning in Recommendation Systems: Practice, Challenges, and Industry Advances

This article presents a comprehensive overview of applying reinforcement learning to recommendation systems, covering background challenges, practical exploration, frontier research directions, multi‑agent and inverse RL approaches, evaluation methods, and future outlooks, based on a KDD‑published study and industry experience.

Evaluation · Inverse RL · Offline RL

0 likes · 24 min read

Bilibili Tech

Jun 14, 2024 · Artificial Intelligence

Technical Report on the Index-1.9B Series: Model Variants, Pre‑training Optimizations, and Alignment Experiments

The report presents the open‑source Index‑1.9B family—base, pure, chat, and character variants—detailing benchmark results, pre‑training optimizations such as a normalized LM‑Head and deeper‑slim architectures, the importance of modest instruction data, alignment via SFT/DPO, role‑play enhancements with RAG, and acknowledges remaining safety and factual limitations.

Evaluation · Instruction Tuning · LLM

0 likes · 15 min read

DataFunSummit

Jun 10, 2024 · Artificial Intelligence

Xiaomi Agent Technology: Architecture, Prompt Management, and Evaluation

This article presents Xiaomi's work on LLM‑based Agent technology, covering its perception‑thinking‑action pipeline, technical framework, prompt management, executor and API platform, workflow, optimization strategies, evaluation metrics, and future directions for AI assistants.

AI assistant · Agent · Evaluation

0 likes · 17 min read

DevOps

May 23, 2024 · Information Security

Guidelines for Evaluating Large Language Models in Cybersecurity Tasks

The article examines the opportunities and risks of applying large language models (LLMs) to cybersecurity, outlines fourteen practical recommendations for assessing their real‑world capabilities, and concludes with an invitation to the upcoming R&D Efficiency Conference covering AI, product management, and related topics.

AI safety · Evaluation · LLM

0 likes · 11 min read

NewBeeNLP

May 18, 2024 · Artificial Intelligence

How to Detect Test Set Contamination in Black‑Box Language Models

Researchers propose a black‑box method to expose test‑set leakage in large language models by comparing log‑probability shifts when test items are shuffled, using Monte‑Carlo estimation and a sharded likelihood test, and demonstrate its effectiveness on several models including Mistral‑7B.

Evaluation · LLM · black-box detection

0 likes · 8 min read

DataFunTalk

May 7, 2024 · Artificial Intelligence

Large Language Models and Knowledge Graphs: Recent Advances, Synergies, and Future Directions

This article reviews the rapid progress of large language models, compares them with knowledge graphs, explores how LLMs can aid knowledge extraction and completion, discusses how knowledge graphs can evaluate and enhance LLMs, and outlines future interactive integration between the two technologies.

AI · Evaluation · Knowledge Graphs

0 likes · 12 min read

AI Large Model Application Practice

May 3, 2024 · Artificial Intelligence

Can Giant Context LLMs Replace RAG? Exploring the Limits of Long‑Context Retrieval

This article examines whether the rapid growth of large‑language‑model context windows can eliminate the need for retrieval‑augmented generation, presenting experimental needle‑in‑a‑haystack tests, analysis of model performance across token lengths and needle positions, and practical guidance using an open‑source evaluation tool.

AI · Evaluation · LLM

0 likes · 13 min read

360 Tech Engineering

Apr 17, 2024 · Artificial Intelligence

HiCo: A Hierarchical Controllable Diffusion Model for Layout‑to‑Image Generation

The 360 AI Research Institute introduces HiCo, a hierarchical controllable diffusion model that enables fine‑grained layout control across up to eight image regions, integrates seamlessly with existing Stable Diffusion ecosystems, and demonstrates superior performance on the GRIT‑VAL benchmark for layout‑aware image synthesis.

AI drawing · Evaluation · HiCo

0 likes · 8 min read

Tech Architecture Stories

Jan 29, 2024 · R&D Management

Mastering Tech Promotion Reviews: Proven Strategies to Accelerate Your Career

This guide shares years of promotion‑review experience from major tech firms, outlining company‑specific promotion processes and five essential content elements—systematic design, detailed data, derivation reasoning, upstream/downstream context, and comparative analysis—plus practical presentation and logical techniques to help engineers secure promotions and salary raises.

Evaluation · R&D Management · career advancement

0 likes · 8 min read

DataFunSummit

Jan 14, 2024 · Artificial Intelligence

Large Language Model Innovations for the Financial Industry: From General to Finance‑Specific Models, Training Techniques, Evaluation Methods, and Real‑World Applications

This article details how the financial sector is adopting large language models, describing the shift from generic to finance‑specific models, the technical challenges and cost considerations, the XuanYuan model releases, novel training and evaluation approaches, and a range of practical applications such as marketing, service, operations, office assistance, and risk control.

AI · Applications · Evaluation

0 likes · 17 min read

DataFunTalk

Jan 2, 2024 · Artificial Intelligence

Mid‑Stage Reflections on Large‑Model Technology and Its Industry Impact

This article offers a comprehensive mid‑stage analysis of large‑model technology, discussing its rapid development, emerging challenges such as cost and hallucinations, positioning, scenario applications, cost‑value trade‑offs, and strategic pathways for future research and deployment.

AI · Applications · Evaluation

0 likes · 21 min read

Rare Earth Juejin Tech Community

Dec 29, 2023 · Artificial Intelligence

Overview of Major Benchmark Datasets for Evaluating Large Language Models

This article provides a comprehensive overview of major benchmark datasets—including CMMLU, MMLU, C‑Eval, GSM8K, Gaokao‑Bench, AGIEval, MATH, BBH, HumanEval, and MBPP—used to evaluate large language models' knowledge, reasoning, and coding abilities, and summarizes related leaderboards and evaluation tools.

Evaluation · LLM · artificial-intelligence

0 likes · 14 min read

Baidu Geek Talk

Dec 20, 2023 · Artificial Intelligence

A Unified Platform for Prompt Development, Evaluation, and Iteration in Large Language Model Applications

The proposed unified platform centralizes prompt creation, evaluation, and iteration for large‑model applications, offering one‑stop hosting, metric‑driven testing, seamless resource integration, model switching, fine‑grained traffic control, and an automated data‑flywheel with QEP scoring, cutting optimization cycles from weeks to days while paving the way for advanced fine‑tuning techniques.

AI platform · Automation · Data Flywheel

0 likes · 17 min read

AntTech

Dec 19, 2023 · Artificial Intelligence

RJUA‑QA: A Comprehensive Urology QA Dataset for Large Language Model Evaluation

RJUA‑QA is a newly released, large‑scale urology question‑answer dataset constructed from virtual patient records based on clinical experience, featuring 2,132 QA pairs with extensive context, designed to benchmark and improve large language models’ medical reasoning, diagnosis, and treatment recommendation capabilities.

Evaluation · QA dataset · Urology

0 likes · 12 min read

JD Cloud Developers

Nov 28, 2023 · Backend Development

Choosing the Right Java Expression Engine: Performance, Security, and Community Insights

This article provides a comprehensive overview and comparative analysis of popular Java expression engines—including AviatorScript, MVEL, OGNL, SpEL, QLExpress, JEXL, JUEL, and Janino—covering their features, community support, size, performance benchmarks, security settings, usage cases, and syntax differences to guide developers in selecting the most suitable engine for their projects.

Evaluation · Expression Engine · Java

0 likes · 23 min read

Ant R&D Efficiency

Nov 24, 2023 · Artificial Intelligence

CodeFuseEval: An Enterprise‑Level Multi‑Task Benchmark for Evaluating Code Large Models

CodeFuseEval is an enterprise‑grade, multi‑task benchmark that evaluates code‑generation large models across six languages and thousands of real‑world tasks using both objective metrics (pass@k, BLEU, CodeBLEU) and expert human review, with an open‑source framework, continuous dataset expansion, and a focus on correctness, efficiency, robustness, and service‑level quality.

AI · Evaluation · benchmark

0 likes · 12 min read

Baobao Algorithm Notes

Oct 23, 2023 · Artificial Intelligence

Why Multimodal AI Agents Could Be the Next Killer App for Large Models

The article recounts a personal test of a multimodal AI agent in Newport Beach and expands into a detailed analysis of current multimodal LLM architectures, memory mechanisms, task planning, tool usage, personality modeling, cost constraints, evaluation challenges, and the broader social and reliability implications of deploying such agents.

AI Agents · Evaluation · Multimodal

0 likes · 44 min read

Software Development Quality

Oct 19, 2023 · Artificial Intelligence

Beyond ROUGE: GLUE, SuperGLUE, MMLU, C‑Eval & HELM Transform NLP Evaluation

Evaluating language models solely with ROUGE or BLEU is insufficient, so comprehensive benchmarks like GLUE, SuperGLUE, MMLU, C‑Eval, and HELM provide diverse tasks and metrics that more accurately assess linguistic understanding, knowledge acquisition, and robustness across English and Chinese NLP systems.

AI · Evaluation · Language Models

0 likes · 9 min read

Architecture and Beyond

Sep 3, 2023 · R&D Management

Effective Team Management: Definitions, Development Stages, and Best Practices

This article explains what a team is, describes its open‑system nature and three‑layer composition, outlines the Tuckman development model and leadership growth stages, and provides practical guidance on direction, leadership, roles, systems, communication, relationships, and evaluation for managing high‑performing technical teams.

Evaluation · Leadership · Team Development

0 likes · 45 min read

Baobao Algorithm Notes

Aug 22, 2023 · Artificial Intelligence

Why Do Large Language Models Hallucinate? Definitions, Causes, and Mitigation Strategies

This article defines hallucination in LLMs as a failure of faithfulness or factualness, explores data‑level and model‑level origins, reviews reference‑based and reference‑free evaluation metrics, and surveys current research on data‑centric and model‑centric mitigation techniques along with future directions.

Evaluation · Hallucination · factuality

0 likes · 16 min read

DataFunTalk

Aug 11, 2023 · Artificial Intelligence

Multimodal Dialogue Large Model mPLUG-Owl: Technology, Applications, and Evaluation

mPLUG-Owl is a modular multimodal dialogue large model from Alibaba DAMO Academy that builds on the mPLUG series, offering advanced image, video, OCR, and multilingual capabilities, with extensive evaluations showing superior performance over MiniGPT‑4, LLaVA, and other multimodal LLMs across various tasks.

Evaluation · Multimodal AI · mPLUG-Owl

0 likes · 17 min read

Rare Earth Juejin Tech Community

Jul 24, 2023 · Artificial Intelligence

Comprehensive Survey of Large Language Models: History, Key Technologies, Resources, and Future Directions

This article provides a detailed overview of large language models (LLMs), tracing their evolution from statistical and neural language models to modern pre‑trained transformers, discussing scaling, training, adaptation, utilization, evaluation methods, available resources, and outlining current challenges and future research directions.

Evaluation · Model Scaling · Pre‑training

0 likes · 26 min read

DevOps

May 19, 2023 · Cloud Computing

Comprehensive Guide to Cloud Migration: Evaluation, Planning, Execution, and Cost Optimization

This article provides a detailed guide to cloud migration, covering evaluation and analysis, pilot projects, assessment strategies, planning and design with cloud services, verification and implementation steps, continuous measurement, and cost optimization through FinOps to ensure successful and secure migration.

Cloud Computing · Cloud Migration · Evaluation

0 likes · 10 min read

DataFunSummit

May 4, 2023 · Artificial Intelligence

LLM Ranking Arena: Elo‑Based Competitive Evaluation of Open‑Source Chatbots

A recent study by the LMSYS organization introduces an Elo‑rated, 1v1 battle arena for large language models, ranking open‑source chatbots like Vicuna, Koala, and ChatGLM, while discussing the limitations of traditional benchmarks and the advantages of crowd‑sourced, scalable evaluation.

AI benchmarking · Chatbot Arena · Elo rating

0 likes · 7 min read

Architect

Apr 9, 2023 · Artificial Intelligence

Evaluating the Commonsense Knowledge and Reasoning Capabilities of ChatGPT and Other Large Language Models

This study systematically evaluates ChatGPT and other large language models on their ability to answer commonsense questions, assess their knowledge awareness, and utilize generated knowledge for reasoning, revealing strong QA performance but notable gaps in social and temporal commonsense and in leveraging contextual knowledge.

ChatGPT · Evaluation · NLP

0 likes · 20 min read

Programmer DD

Apr 9, 2023 · Artificial Intelligence

How Does Alibaba’s Tongyi Qianwen Compare to ChatGPT? A Hands‑On Evaluation

This article reviews Alibaba’s Tongyi Qianwen large‑language model by testing its self‑introduction, code generation, literary creation, mathematical reasoning, Chinese language understanding, and casual chatting abilities, summarizing strengths, weaknesses, and overall performance compared with other LLMs.

Chinese Language · Evaluation · artificial-intelligence

0 likes · 7 min read

Programmer DD

Apr 7, 2023 · Artificial Intelligence

How Vicuna-13B Achieves ChatGPT‑Level Performance with Low‑Cost Open‑Source Training

The Vicuna-13B open‑source chatbot, fine‑tuned from LLaMA on 70k ShareGPT conversations, matches over 90% of ChatGPT and Google Bard quality while costing only about $300 to train, thanks to memory optimizations, multi‑turn dialogue handling, and cheap spot‑instance training.

AI · Chatbot · Evaluation

0 likes · 8 min read

DataFunTalk

Apr 6, 2023 · Artificial Intelligence

A Comprehensive Survey of Large Language Models: Background, Capabilities, Key Technologies, and Future Directions

This article reviews the rapid progress of large language models (LLMs), covering their historical development, scaling laws, emergent abilities, core technologies such as training and alignment, resource ecosystems, evaluation methods, safety concerns, and prospective research challenges.

AI research · Evaluation · LLM

0 likes · 21 min read

DataFunSummit

Mar 19, 2023 · Artificial Intelligence

Complex Question Answering Evaluation of ChatGPT

This paper presents a large‑scale evaluation of ChatGPT on knowledge‑base complex question answering, introducing a feature‑driven multi‑label annotation framework and CheckList‑based functional, robustness, and controllability tests, and comparing its performance with other LLMs across multiple English and multilingual datasets.

Chain-of-Thought · ChatGPT · Complex QA

0 likes · 25 min read

Model Perspective

Nov 6, 2022 · Fundamentals

Unlock Objective Decision-Making with the Entropy Weight Method

The Entropy Weight Method (EWM) offers an objective, data‑driven way to calculate indicator weights by measuring information entropy, avoiding subjective bias and improving the reliability of multi‑criteria evaluations across fields such as water quality and resource management.

Evaluation · decision making · entropy weight method

0 likes · 4 min read

DataFunSummit

Sep 23, 2022 · Artificial Intelligence

A Comprehensive Overview of Automatic Text Summarization: Methods, Datasets, Evaluation, and Future Directions

This article surveys automatic text summarization, detailing system classifications, extractive, abstractive and hybrid techniques, notable recent research, multi‑document and cross‑lingual challenges, major datasets, evaluation metrics, and promising future research avenues in the field.

Evaluation · NLP · abstractive

0 likes · 21 min read

Laiye Technology Team

Sep 23, 2022 · Artificial Intelligence

Overview of Automatic Text Summarization: Methods, Datasets, and Future Directions

This article provides a comprehensive overview of automatic text summarization, covering extractive, abstractive, and hybrid methods, system classifications, applications, datasets, evaluation metrics, and future research directions within the field of artificial intelligence.

Evaluation · NLP · abstractive

0 likes · 23 min read

DataFunSummit

Sep 5, 2022 · Artificial Intelligence

Comprehensive Evaluation of Long‑Audio Speech‑to‑Text Services from Major Cloud Providers

This article presents a systematic, multi‑dimensional benchmark of six leading cloud speech‑recognition platforms—Alibaba Cloud, Tencent Cloud, iFlytek, Baidu Cloud, Huawei Cloud, and Microsoft Azure—using a 22.6‑hour, 81‑file Mandarin dataset, scoring with the CORR metric and SCTK tool, and discusses each provider's workflow, strengths, pitfalls, and cost.

AI · Cloud Services · Evaluation

0 likes · 15 min read

DataFunSummit

Aug 20, 2022 · Information Security

Content Risk Control Industry Overview and Evaluation System

The article reviews the development background of the digital economy‑driven content risk control industry, examines current content moderation technologies and challenges, describes the establishment of a content technology promotion alliance, outlines its research directions and evaluation standards, and includes a Q&A on regulatory collaboration.

Evaluation · Standards · artificial-intelligence

0 likes · 16 min read

Model Perspective

Jul 2, 2022 · Operations

Top Resources for Evaluation & Optimization Models – A Curated Guide

This article compiles and categorizes recent model‑related publications, offering a comprehensive list of evaluation‑model resources—including concepts, preprocessing techniques, weighting methods, and various algorithms—and optimization‑model references covering linear and integer programming, graph theory, network flows, and meta‑heuristics.

Evaluation · Linear Programming · Operations

0 likes · 4 min read

Architecture and Beyond

May 1, 2022 · R&D Management

Effective Questioning Techniques for Promotion Review Panels

The article outlines systematic questioning strategies for judges in corporate promotion defenses, detailing how to clarify definitions, probe processes, assess difficulty, evaluate big‑picture thinking, explore methodology, and link technical work to business value, thereby ensuring fair and insightful evaluations.

Evaluation · R&D Management · career development

0 likes · 13 min read

Full-Stack Internet Architecture

Nov 13, 2021 · R&D Management

Understanding Performance Evaluation in Big Tech: Logic, Rules, and Strategies

The article explains why performance assessments are crucial in large tech firms, outlines the KPI/OKR systems and the typical 271 distribution, reveals hidden rules that affect bonuses and promotions, and offers short‑ and long‑term strategies for employees to navigate and improve their ratings.

Career Advice · Evaluation · KPI

0 likes · 8 min read

DataFunTalk

Oct 5, 2021 · Artificial Intelligence

From Technology to Experience: Vivo Machine Translation Deployment Practice

This article presents a comprehensive guide to deploying machine translation at Vivo, covering business analysis, algorithm choices beyond standard NMT, language detection challenges, data collection and cleaning, scientific evaluation methods, and engineering optimizations to deliver a seamless user experience.

AI · Evaluation · Machine Translation

0 likes · 20 min read

IT Architects Alliance

Aug 14, 2021 · Backend Development

When to Adopt Microservices: Evaluation, Risks, and Best Practices

This article examines the trade‑offs of microservice architecture, comparing it with monolithic design, and provides practical guidance on when to transition, how to assess technical and team readiness, and what risks and splitting strategies to consider.

Evaluation · Microservices · architecture

0 likes · 17 min read

IT Architects Alliance

Jul 26, 2021 · R&D Management

How to Conduct a Comprehensive Architecture Evaluation: A Step-by-Step Guide

This article outlines a thorough methodology for evaluating software, hardware, and overall system architectures, detailing assessment criteria, a five‑stage evaluation process, quality‑assurance measures, and best‑practice checkpoints to ensure high availability, scalability, security, and cost‑effectiveness of complex engineering projects.

Evaluation · System Design · architecture

0 likes · 12 min read

Liulishuo Tech Team

Jul 7, 2021 · Frontend Development

Evaluation and Evolution of Mini‑Program Development Frameworks for Frontend Teams

This article reviews the background, key considerations, architectural principles, evolution, performance comparison, and a customized solution for building mini‑programs using frameworks such as WePY, Taro, and UniApp, highlighting cross‑platform support, TypeScript integration, and development experience improvements.

Evaluation · framework · performance

0 likes · 12 min read

Efficient Ops

Apr 16, 2021 · Operations

How Anxin Securities Achieved Top RPA Maturity: Insights from China’s First RPA Standard Evaluation

Anxin Securities’ RPA Unified Management Platform earned the highest 3+ maturity rating at China’s inaugural RPA standard assessment, showcasing extensive automation across finance, operations, and disaster recovery, while outlining future SmartRPA initiatives and AI‑driven enhancements for digital transformation.

AI integration · Evaluation · RPA

0 likes · 10 min read

21CTO

Feb 26, 2021 · Artificial Intelligence

Why One Metric Isn't Enough: Multi‑Dimensional Evaluation of Recommendation Systems

The article explains why relying on a single metric like click‑through rate is insufficient for recommendation systems, and outlines a comprehensive, multi‑dimensional evaluation framework that combines business indicators, user behavior metrics, and algorithmic performance measures such as recall, precision, and AUC.

AB testing · AI · AUC

0 likes · 10 min read

ITFLY8 Architecture Home

Feb 26, 2021 · Artificial Intelligence

Inside Toutiao's Transparent Real-Time Recommendation Engine

This article details how Toutiao's senior algorithm architect designs a transparent recommendation system, covering system overview, three-dimensional feature modeling, real-time training pipelines, recall strategies, content analysis, user tagging, evaluation methods, and content safety measures.

Content Safety · Evaluation · Real-time Training

0 likes · 17 min read

21CTO

Jan 11, 2021 · Artificial Intelligence

How to Build a Recommendation System from Scratch: Key Concepts and Strategies

This article explains the fundamentals of recommendation systems, covering data collection, user and content profiling, system architecture, algorithmic pipelines such as recall, filtering, ranking, and evaluation metrics, while also discussing practical challenges like echo chambers and long‑term user value.

Evaluation · Ranking · algorithm

0 likes · 16 min read

NetEase Yanxuan Technology Product Team

Nov 27, 2020 · Product Management

How to Build Effective Decision‑Making Products: A Practical Blueprint

This article outlines a comprehensive framework for designing decision‑type products, covering their evolution stages, core elements of model‑data‑strategy, domain modeling techniques, data‑to‑knowledge transformation, business and process value, and a feedback‑driven decision loop with evaluation and simulation.

Business Analytics · Decision Products · Evaluation

0 likes · 20 min read

Programmer DD

Oct 24, 2020 · Cloud Native

Should You Switch to Microservices? Evaluation Tips and Migration Steps

This article examines the fundamentals of monolithic and microservice architectures, outlines the advantages and drawbacks of each, provides criteria for deciding when to adopt microservices, and offers practical guidance on technical, talent, and organizational considerations for a successful migration.

Evaluation · architecture · cloud-native

0 likes · 16 min read

ITFLY8 Architecture Home

Oct 1, 2020 · Cloud Native

When Should You Adopt Microservices? A Practical Evaluation Guide

This article explores the fundamentals of monolithic and microservice architectures, assesses the benefits, costs, and risks of adopting microservices, and provides practical criteria—including business complexity, team size, and technical readiness—to help decide the optimal moment for migration.

Evaluation · Microservices · backend

0 likes · 16 min read

Top Architect

Sep 19, 2020 · Artificial Intelligence

Architecture and Evaluation of Toutiao's Large-Scale Recommendation System

The article details the end‑to‑end architecture of Toutiao's massive recommendation platform, covering system overview, content and user feature extraction, model training, recall strategies, evaluation methodology, and content safety mechanisms, while highlighting practical challenges and engineering solutions.

Content Safety · Evaluation · Model Training

0 likes · 18 min read

Sohu Tech Products

Sep 16, 2020 · Artificial Intelligence

Open-Domain Dialogue Systems: Current State, Challenges, and Future Directions

This article reviews the latest advances in open-domain dialogue systems, covering classification, end‑to‑end generation challenges, knowledge‑controlled generation, automated evaluation, large‑scale latent‑space models such as PLATO, and outlines future research directions for building more coherent and controllable conversational AI.

Dialogue Systems · Evaluation · knowledge grounding

0 likes · 14 min read

Efficient Ops

Aug 20, 2020 · Operations

Understanding China’s First DevOps Capability Maturity Model and Evaluation Process

This article introduces China’s inaugural DevOps Capability Maturity Model, outlines its eight-part structure—including system and tool requirements—describes the standardized evaluation methodology, registration details, and provides contact information for organizations seeking certification.

Capability Maturity Model · Evaluation · Standardization

0 likes · 6 min read

DataFunTalk

Apr 2, 2020 · Artificial Intelligence

Practical Guide to Evaluating Recommendation Systems: Metrics, Scenarios, and Best Practices

This article explains how to choose and combine appropriate evaluation metrics for recommendation systems by considering the specific scenario, business model, offline versus online testing, ecosystem balance, and user behavior, providing practical methods and a concise summary of common metric types.

AI · Evaluation · Metrics

0 likes · 18 min read

21CTO

Feb 18, 2020 · Artificial Intelligence

Inside Toutiao’s Real‑Time Recommendation Engine: Architecture, Features, and Evaluation

This article details Toutiao’s large‑scale recommendation system, explaining how it models content, user, and environment features, the variety of algorithms and real‑time training pipelines used, feature engineering categories, recall strategies, content analysis, user tagging, evaluation methods, and content‑safety mechanisms.

Content Safety · Evaluation · Real-time Training

0 likes · 18 min read

Sohu Tech Products

Dec 25, 2019 · Artificial Intelligence

Hot Topic Detection Algorithms and Article Deduplication Evaluation without a Test Set

The article discusses how to discover hot topics using algorithms such as TextRank, BERT embeddings, and BM25, outlines the lifecycle of a hot topic, and proposes practical methods for evaluating article deduplication accuracy and recall when no labeled test set is available.

BERT · Evaluation · NLP

0 likes · 4 min read
