Uncovering 16 Limits of AI Search Engines and 16 Design Recommendations

A user study with 21 participants reveals sixteen critical limitations of generative AI search engines, maps them to eight quantitative metrics, proposes sixteen design recommendations, and evaluates You.com, Perplexity and BingChat against this framework to highlight current performance gaps.

Baobao Algorithm Notes

Nov 4, 2024

Background

Generative search engines that use large language models (LLMs) are replacing traditional keyword‑based search. A user study with 21 participants compared AI‑augmented search to conventional search, identifying 16 limitations across answer text, citation, sources, and user interface.

Evaluation Framework (AEE)

The AI Search Engine Evaluation (AEE) framework defines eight quantitative metrics: One‑Sided Answer, Overconfident Answer, Relevant Statements, Unsupported Statements, Citation Accuracy, Citation Thoroughness, Source Necessity, and Uncited Sources. Automated evaluation was applied to three popular engines—You.com, Perplexity.ai, and BingChat.
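To make two of these metrics concrete, here is a minimal Python sketch of Unsupported Statements and Uncited Sources. The article does not give the paper's exact formulas, so the definitions below are assumptions: an answer is split into statements, each statement carries the indices of the sources it cites, and a separate judgment (human or automated) records which sources actually support it. The names Statement, unsupported_rate and uncited_source_rate are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    text: str
    # Indices of listed sources that the engine attached to this statement.
    cited: set[int] = field(default_factory=set)
    # Indices of listed sources judged to support the statement (assumed input).
    supported_by: set[int] = field(default_factory=set)


def unsupported_rate(statements: list[Statement]) -> float:
    """Fraction of statements that no listed source supports (assumed definition)."""
    if not statements:
        return 0.0
    return sum(1 for s in statements if not s.supported_by) / len(statements)


def uncited_source_rate(statements: list[Statement], num_sources: int) -> float:
    """Fraction of listed sources that no statement cites (assumed definition)."""
    if num_sources == 0:
        return 0.0
    cited = set().union(*(s.cited for s in statements))
    return (num_sources - len(cited)) / num_sources


# Illustrative answer with four statements and four listed sources.
answer = [
    Statement("A", cited={0}, supported_by={0}),
    Statement("B", cited={1}),                 # cited, but the source does not support it
    Statement("C", supported_by={0, 2}),       # supported, but never cited
    Statement("D"),                            # neither cited nor supported
]
print(unsupported_rate(answer))        # 0.5 (B and D)
print(uncited_source_rate(answer, 4))  # 0.5 (sources 2 and 3 are never cited)
```

Keeping "cited" and "supported_by" as separate fields is the key design choice: it lets the same record feed both the citation metrics and the source metrics.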

Figure: Diagram of AI search engine components and evaluation framework

Identified Limitations

Answer Text

Insufficient objective detail – all participants noted shallow answers.

Lack of diverse viewpoints – many answers were biased.

Over‑confident language – statements were presented with unwarranted certainty.

Over‑simplified writing – limited creativity and critical reasoning.

Citation

Misattribution and misunderstanding of sources.

Context‑driven selective information – models cherry‑pick data.

Missing citations for key statements.

Opaque source selection – lack of transparency in ranking.

Sources

Low‑frequency source usage – few sources cited.

More retrieved than used sources – mismatch between retrieved set and those actually used.

Distrust of source types.

Redundant source content – duplicate information across sources.

User Interface

Missing source‑filtering controls.

Limited human input in generation.

Extra effort required to verify answers.

Non‑standard citation format.

Design Recommendations

Answer Text

Provide balanced answers that avoid reinforcing user bias.

Include objective details such as data and statistics.

Eliminate irrelevant filler; keep every sentence on‑topic.

Make source selection transparent to enhance trust.

Citation

Ensure every statement has a proper supporting reference.

Cross‑check citation accuracy against external sources.

Reference all relevant sources for multi‑point statements.

Match the number of listed sources to those actually used.
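The last recommendation above can be sketched as a post-processing step: drop every listed source that no statement cites, then renumber the remaining citations. This is an illustrative assumption about how an engine might do it, not something the study prescribes; prune_sources and the example URLs are hypothetical.

```python
def prune_sources(
    sources: list[str], citations: list[list[int]]
) -> tuple[list[str], list[list[int]]]:
    """Keep only sources cited at least once and renumber the citations.

    sources   -- the engine's listed sources, in display order
    citations -- for each statement, the indices into `sources` that it cites
    """
    used = sorted({i for cits in citations for i in cits})
    remap = {old: new for new, old in enumerate(used)}
    kept = [sources[i] for i in used]
    renumbered = [[remap[i] for i in cits] for cits in citations]
    return kept, renumbered


sources = ["a.example", "b.example", "c.example", "d.example"]
citations = [[0, 2], [2], []]  # third statement cites nothing
kept, renumbered = prune_sources(sources, citations)
print(kept)        # ['a.example', 'c.example']
print(renumbered)  # [[0, 1], [1], []]
```

After pruning, the displayed source count equals the number of sources the answer actually relies on, which is what the recommendation asks for.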

Sources

Prioritize expert and authoritative sources.

Retrieve and use only necessary sources for each answer.

Distinguish model‑generated content from source‑derived content.

Explicitly evaluate source types for credibility.

User Interface

Incorporate human feedback on both sources and generated text.

Implement interactive citations (e.g., hover pop‑ups).

Provide paragraph‑level local citations indicating exact provenance.

Avoid forced answers when information is insufficient.

Quantitative Evaluation of Three Engines

Using the eight AEE metrics, the study measured performance of You.com, Perplexity.ai, and BingChat.

Figure: Performance chart of three AI search engines across eight metrics

One‑Sided Answer: All engines frequently produce one‑sided answers (50‑80%); Perplexity performs worst.

Overconfident Answer: Perplexity shows the highest rate of overconfident responses on debate questions.

Relevant Statements: Similar rates across engines (≈75‑82%).

Unsupported Statements: A sizable portion of statements lack supporting citations.

Citation Accuracy: All engines struggle to correctly cite sources.

Citation Thoroughness: No engine cites all possible accurate sources.

Source Necessity: Engines often list more sources than needed.

Uncited Sources: You.com ensures most listed sources are used; BingChat has the highest proportion of uncited sources.
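One way to reason about Source Necessity is to ask how few of the listed sources would suffice to support every supportable statement. The article does not say how the study computes this, so the sketch below swaps in a standard greedy set-cover approximation as an assumption; greedy_necessary_sources is a hypothetical name, and the greedy choice is not guaranteed to find the true minimum.

```python
def greedy_necessary_sources(supported: list[set[int]]) -> set[int]:
    """Greedy set cover over sources.

    supported[i] is the set of source indices judged to support statement i.
    Returns a (not necessarily minimal) set of sources that together support
    every statement that has at least one supporting source.
    """
    # Invert the relation: source index -> statements it supports.
    covers: dict[int, set[int]] = {}
    for stmt_idx, srcs in enumerate(supported):
        for src in srcs:
            covers.setdefault(src, set()).add(stmt_idx)

    uncovered = {i for i, srcs in enumerate(supported) if srcs}
    chosen: set[int] = set()
    while uncovered:
        # Pick the source covering the most uncovered statements;
        # ties go to the lowest source index.
        best = max(sorted(covers), key=lambda s: len(covers[s] & uncovered))
        chosen.add(best)
        uncovered -= covers[best]
    return chosen


# Three listed sources; source 1 alone supports every supportable statement.
supported = [{0, 1}, {1}, {1, 2}, set()]
needed = greedy_necessary_sources(supported)
print(needed)  # {1}
```

Comparing the size of the returned set with the number of listed sources gives a rough sense of how many sources were listed beyond what the answer needed.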

Overall, no engine excels across most metrics, indicating substantial room for improvement in handling hallucinations, unsupported statements, and citation fidelity. You.com shows modest advantages in confidence handling and source presentation, while Perplexity scores lowest due to overconfidence and citation issues. BingChat falls in the middle, listing many sources without consistent coverage improvement.

Eight Quantitative Metrics (AEE)

One‑Sided Answer

Overconfident Answer

Relevant Statements

Unsupported Statements

Citation Accuracy

Citation Thoroughness

Source Necessity

Uncited Sources
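Citation Accuracy and Citation Thoroughness can be read as precision and recall over (statement, source) pairs. That reading is an assumption made here for illustration, since the article does not spell out the formulas: accuracy is the share of emitted citations whose source actually supports the statement, and thoroughness is the share of supporting pairs that were actually cited. The function name citation_scores is hypothetical.

```python
def citation_scores(
    cited: list[set[int]], supported: list[set[int]]
) -> tuple[float, float]:
    """Return (accuracy, thoroughness) under the assumed definitions.

    cited[i]     -- source indices the engine cited for statement i
    supported[i] -- source indices judged to support statement i
    """
    if len(cited) != len(supported):
        raise ValueError("cited and supported must describe the same statements")
    n_cited = sum(len(c) for c in cited)
    n_supported = sum(len(s) for s in supported)
    n_correct = sum(len(c & s) for c, s in zip(cited, supported))
    accuracy = n_correct / n_cited if n_cited else 0.0
    thoroughness = n_correct / n_supported if n_supported else 0.0
    return accuracy, thoroughness


cited = [{0, 1}, {2}, set()]
supported = [{0}, {2, 3}, {1}]
accuracy, thoroughness = citation_scores(cited, supported)
# 2 of the 3 emitted citations are correct; 2 of the 4 supporting pairs are cited.
print(accuracy, thoroughness)
```

An engine can score well on one and poorly on the other: citing a single correct source per statement keeps accuracy high while thoroughness stays low.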

Reference

https://arxiv.org/pdf/2410.22349

Search Engines in an AI Era: The False Promise of Factual and Verifiable Source‑Cited Responses
