Tag: AI evaluation


Architect
Jun 12, 2025 · Artificial Intelligence

Why Large Reasoning Models Collapse Under Complex Tasks: Insights from Apple’s Study

Apple’s research reveals that large reasoning models, despite sophisticated self‑reflection mechanisms, experience a complete performance collapse when problem complexity exceeds a threshold, highlighting fundamental limits in their ability to achieve generalized reasoning.

AI evaluation · large reasoning models · model limitations
0 likes · 7 min read
DataFunTalk
Apr 3, 2025 · Artificial Intelligence

Large Language Models GPT-4.5 and LLaMa-3.1-405B Pass Standard Turing Test in UCSD Study

A UC San Diego study found that GPT-4.5 was judged human 73% of the time and LLaMa-3.1-405B 56%, demonstrating that both large language models can pass a standard three‑party Turing test, with detailed methodology, results, and analysis of judge behavior.

AI evaluation · GPT-4.5 · LLaMa-3.1
0 likes · 5 min read
Nightwalker Tech
Mar 28, 2025 · Artificial Intelligence

Comprehensive Evaluation of GPT-4o Multimodal Image Generation Capabilities

This article presents a thorough assessment of GPT‑4o’s new image generation features, detailing test scenarios that range from simple portrait creation and style transfer to UI design, product rendering, and educational illustrations. It compares GPT‑4o’s output with Claude‑3.7‑Sonnet’s, highlighting strengths in realism and weaknesses in Chinese text handling.

AI evaluation · GPT-4o · Image Generation
0 likes · 16 min read
Model Perspective
Mar 23, 2025 · Artificial Intelligence

How to Quantify AI’s Role in Mathematical Modeling with a Contribution Index

This article proposes an AI Contribution Index for mathematical modeling, explains its weighted‑average construction, provides concrete formulas and examples, and discusses broader applications and philosophical implications of quantifying AI involvement across various stages of problem solving.

AI contribution · AI evaluation · mathematical modeling
0 likes · 6 min read
AntTech
Feb 26, 2025 · Artificial Intelligence

Ant Group’s 18 Accepted Papers at AAAI 2025: Summaries and Highlights

This article presents concise English summaries of the 18 Ant Group papers accepted at AAAI 2025, covering topics such as privacy‑preserving large‑model tuning, knowledge‑graph integration, AI‑generated image detection, multi‑task learning, generative retrieval, role‑playing evaluation, and video hallucination mitigation.

AAAI 2025 · AI evaluation · Video Hallucination
0 likes · 29 min read
DataFunTalk
Feb 11, 2025 · Artificial Intelligence

Roundtable on Enhancing Large Model Effectiveness: RAG, Tool Use, and Knowledge Engineering

Experts from Dipu, Ant Financial, iKang, and Zhihu discuss practical strategies for improving large model performance, covering RAG, tool use, offline knowledge engineering, multimodal training, evaluation metrics, and future trends, while sharing case studies from manufacturing, healthcare, retail, and consumer-facing applications.

AI evaluation · RAG · knowledge engineering
0 likes · 9 min read
Alimama Tech
Dec 25, 2024 · Artificial Intelligence

WiS Platform: Evaluating LLM Multi-Agent Systems via Game-Based Analysis

The WiS Platform provides a game‑based environment for benchmarking large language models in multi‑agent settings, measuring reasoning, deception and collaboration through dynamic scenarios, offering fair experimental design, real‑time competition, visualizations, detailed metrics, and open‑source tools, with GPT‑4o outperforming other models such as Qwen2.5‑72B‑Instruct.

AI evaluation · Defense Strategies · Game-Based Testing
0 likes · 8 min read
DataFunSummit
Dec 3, 2024 · Artificial Intelligence

Applying Large Language Models to NPC Role‑Playing and Game Localization at Tencent

This article details Tencent's practical exploration of large language model deployment in overseas game scenarios, covering the design of customized NPC role‑playing models, multilingual localization pipelines, data construction, training, evaluation frameworks, multi‑agent improvement loops, and insights from a comprehensive Q&A session.

AI evaluation · NPC AI · Tencent
0 likes · 17 min read
Kuaishou Tech
Sep 20, 2024 · Artificial Intelligence

Building an LLM-Based Agent Platform for Enterprise Commercialization: Strategies, Architecture, and Practical Insights

This article details the strategic development and technical architecture of SalesCopilot, an LLM-driven agent platform designed for enterprise commercialization, highlighting the implementation of RAG and agent technologies, addressing practical challenges, and sharing key insights for building scalable AI applications.

AI agents · AI evaluation · Platform Architecture
0 likes · 15 min read
Kuaishou Tech
Jul 18, 2024 · Artificial Intelligence

Multidimensional Preference Model (MPS) for Text-to-Image Generation: Dataset, Architecture, and Experimental Analysis

This article introduces the Multidimensional Preference Model (MPS), the first multi‑dimensional scoring system for evaluating text‑to‑image generation, built on the newly released MHP dataset with extensive human annotations across aesthetic, semantic alignment, detail quality, and overall preference dimensions, and demonstrates its superior performance through comprehensive experiments and RLHF integration.

AI evaluation · MHP dataset · MPS
0 likes · 10 min read
Java Tech Enthusiast
Jul 16, 2024 · Artificial Intelligence

LLMs Misjudge Simple Number Comparison: 9.11 vs 9.9

Recent tests reveal that popular large language models, including GPT‑4o, Gemini Advanced, and Claude 3.5, often claim 9.11 is larger than 9.9 because their tokenizers split the numbers into misleading segments; rephrasing the question, zero‑shot chain‑of‑thought prompting, or asking the model to treat the values as floating‑point numbers can correct the mistake, a pattern also observed to varying degrees in Chinese models.

AI evaluation · LLM · numeric comparison
0 likes · 7 min read
DataFunSummit
Jul 6, 2024 · Artificial Intelligence

Synergy Between Large Language Models and Knowledge Graphs: Recent Advances, Evaluation, and Future Integration

This article reviews the rapid progress of large language models and their complementary relationship with knowledge graphs, covering comparative strengths, knowledge extraction and completion, evaluation benchmarks, deployment benefits, complex reasoning support, and prospects for interactive fusion toward more reliable and explainable AI systems.

AI evaluation · Knowledge Graphs · knowledge extraction
0 likes · 12 min read
DataFunSummit
Apr 13, 2024 · Artificial Intelligence

Understanding and Mitigating Hallucinations in Large Language Model Industry Q&A with Knowledge Graphs

This article examines why large language models often produce hallucinations in industry question‑answering, defines the phenomenon, explores its data and training origins, proposes evaluation metrics, and presents practical strategies—including high‑quality fine‑tuning data, honest refusal mechanisms, advanced decoding methods, and external knowledge‑graph augmentation—to reduce hallucinations and improve reliability.

AI evaluation · Retrieval-Augmented Generation · hallucination
0 likes · 21 min read
Rare Earth Juejin Tech Community
Jan 3, 2024 · Artificial Intelligence

Llama 2: Open Foundation and Fine‑Tuned Chat Models – Ghost Attention, RLHF Results, and Safety Evaluation

This article summarizes the Llama 2 series, describing the Ghost Attention technique for maintaining system‑message consistency across multi‑turn dialogs, presenting RLHF and human evaluation results, and discussing extensive safety pre‑training, benchmark assessments, and model release details.

AI evaluation · Ghost Attention · Llama 2
0 likes · 20 min read
Rare Earth Juejin Tech Community
Dec 17, 2023 · Artificial Intelligence

Levels of AGI: A Framework for Evaluating Artificial General Intelligence

The article presents Google DeepMind's AGI evaluation framework, outlining six guiding principles, nine representative definitions, and a hierarchical five‑level classification system to assess AGI performance, autonomy, and societal impact, aiming to provide a common language for model comparison, risk assessment, and progress tracking.

AGI · AI evaluation · Artificial General Intelligence
0 likes · 15 min read
DataFunTalk
Aug 9, 2023 · Artificial Intelligence

Key Technologies for Domain‑Specific Large Models: Insights from the World AI Conference

This report, based on Professor Xiao Yanghua’s presentation at the World AI Conference, examines why vertical domains need general large models, outlines their key capabilities such as open‑world understanding, combinatorial innovation, evaluation, complex instruction execution, task planning, and symbolic reasoning, and discusses current limitations and optimization strategies for domain‑specific deployment.

AI evaluation · Data Governance · large language models
0 likes · 17 min read
Baidu Tech Salon
Aug 8, 2023 · Artificial Intelligence

Tsinghua University Report Ranks Baidu Wenxin Yiyan First Among Chinese Large Language Models

A Tsinghua University evaluation of seven large language models found Baidu’s Wenxin Yiyan topping the domestic rankings with the highest overall score across 20 metrics, especially in Chinese semantic understanding and safety, surpassing ChatGPT and matching GPT‑4, while also demonstrating fast training and inference speeds and broad industry adoption.

AI evaluation · Baidu Wenxin · Chinese NLP
0 likes · 4 min read
php中文网 Courses
Aug 2, 2023 · Artificial Intelligence

Stanford and UC Berkeley Study Finds Significant Decline in GPT-4 Capabilities Across Math, Coding, and Visual Reasoning

A joint Stanford and UC Berkeley study reveals that GPT‑4’s performance on mathematics, code generation, and visual‑reasoning tasks sharply declined between March and June 2023, with accuracy dropping from 97.6% to 2.4% on a prime‑checking benchmark and executable code rates falling from 52% to 10%.

AI evaluation · GPT-4 · Natural Language Processing
0 likes · 3 min read
DataFunTalk
Mar 27, 2023 · Artificial Intelligence

GPT-4 Shows Early Signs of Artificial General Intelligence: Insights from the "Sparks of AGI" Paper

A recent 154‑page Microsoft paper titled "Sparks of Artificial General Intelligence: Early Experiments with GPT‑4" argues that GPT‑4, despite being an early prototype, already exhibits many capabilities—multimodal reasoning, programming, mathematics, and human‑like interaction—suggesting it may be an early form of AGI, though experts highlight significant limitations and ongoing debates.

AI evaluation · Artificial General Intelligence · GPT-4
0 likes · 15 min read
IT Services Circle
Feb 7, 2023 · Artificial Intelligence

ChatGPT’s Bug‑Fixing Ability Reaches State‑of‑the‑Art on the QuixBugs Benchmark

Researchers from Germany and the UK evaluated ChatGPT and three other AI models on the QuixBugs benchmark, finding that ChatGPT correctly fixed 31 of 40 bugs—outperforming CodeX, CoCoNut, and Standard APR—and sparked mixed reactions about its impact on software engineering and OpenAI’s broader strategies.

AI evaluation · ChatGPT · QuixBugs
0 likes · 8 min read