Why LLM Internet Search Fails and How to Fix It: A Deep Dive into Qwen, Doubao, and DeepSeek

This article analyses the shortcomings of large‑model internet search—such as unverifiable sources, fabricated content, and poor instruction compliance—by comparing Qwen‑max, Doubao‑1.5‑pro‑256k, and DeepSeek‑v3, and proposes prompt engineering, post‑processing, and custom tool improvements to boost reliability.

Alibaba Cloud Developer

1. Background and Principles

Large‑model internet search lets a model query real‑time information via external search tools, addressing the timeliness gap of pre‑trained knowledge bases. Typical scenarios include current date, weather, news, and prices, where the model splits the query, invokes a search tool, and integrates the retrieved results.

The search tool is essentially a function that the model calls when the input matches its description. For example, a translation tool def translate_text(text, target_language): ... is registered with a name, a description, and a parameter schema, and the model decides whether to invoke it based on the similarity between the user's query and that description.

From a workflow perspective, the model first assesses whether a tool is needed, extracts arguments, calls the function, receives the result, and then composes the final answer.
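This loop can be sketched with a minimal tool registry. The names translate_text, TOOLS, and handle_tool_call below are illustrative only, not part of any specific SDK:

```python
# A minimal sketch of the tool-calling loop, with a hypothetical translate_text tool.
def translate_text(text, target_language):
    # Stand-in implementation; a real tool would call a translation service.
    return f"[{target_language}] {text}"

# The tool is registered with a name, description, and parameter schema;
# the model matches the user query against the description to decide on a call.
TOOLS = {
    "translate_text": {
        "func": translate_text,
        "description": "Translate text into a target language.",
        "parameters": {"text": "string", "target_language": "string"},
    }
}

def handle_tool_call(name, arguments):
    # Invoke the registered function with the model-extracted arguments;
    # the result is handed back to the model to compose the final answer.
    return TOOLS[name]["func"](**arguments)
```

For instance, handle_tool_call("translate_text", {"text": "hello", "target_language": "fr"}) runs the stand-in and returns "[fr] hello".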

2. Current Issues

Testing three mainstream models (Qwen‑max, Doubao‑1.5‑pro‑256k, DeepSeek‑v3) on news and weather queries revealed several problems with Qwen‑max:

Unverifiable sources: many returned URLs are dead or unrelated; in testing, only 1/3 of the returned links were reachable at all, and 0% actually pointed to the reported news.

Fabricated content: the model sometimes generates false answers (e.g., wrong weather) and invents URLs like https://www.caixin.com/2025-01/gdp-growth that do not exist.

Poor instruction following: even when prompts restrict the search to specific domains (e.g., Sina News), Qwen still returns links from other sources.

Root causes include limited semantic understanding, inadequate instruction‑following ability, and sub‑optimal similarity‑based ranking of search results.

3. Optimization Directions

3.1 Prompt reinforcement: constrain the model to output only the domain homepage when the exact URL cannot be verified, preventing 404 links.

## Role
You are a search expert who can quickly find information on the internet.
## Task: search for news
Use an internet search engine to find the top 3 major livelihood news stories of January 2025, and give each story's source website.
## Constraints
If the exact URL carries no information, output the accessible homepage of the corresponding domain; fabricating sources is strictly forbidden.
## Output format
Standard JSON: {"news1":"xx","link1":"xx","news2":"xx","link2":"xx","news3":"xx","link3":"xx"}

After applying this prompt, Qwen‑max achieved 93.3% accessible links, though only 10% matched the news content.

3.2 Post‑processing: a script extracts the domain from each returned URL, checks its HTTP status, and replaces invalid URLs with the homepage.

import requests

# Extract the domain from each returned link and verify reachability
links = [res['link1'], res['link2'], res['link3']]  # res: the model's parsed JSON output
new_links = []
for link in links:
    if 'http' in link:
        parts = link.split('/')
        if len(parts) >= 3:
            # 'https://www.caixin.com/2025-01/...' -> 'https://www.caixin.com'
            new_links.append('/'.join(parts[:3]))

def check(url):
    # A URL counts as valid only if it answers with HTTP 200
    try:
        r = requests.get(url, timeout=5)
        return r.status_code == 200
    except requests.RequestException:
        return False

for nl in new_links:
    if check(nl):
        print(nl)

This converts fabricated URLs like https://www.caixin.com/2025-01/gdp-growth to https://www.caixin.com, which is reachable.
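As an alternative to splitting on '/', the standard library's urllib.parse handles edge cases such as ports, userinfo, and scheme-less strings. This helper is a sketch, not part of the original script:

```python
from urllib.parse import urlparse

def to_homepage(url):
    # Reduce any URL to scheme + domain, so a fabricated deep link
    # collapses to the (usually reachable) site homepage.
    parts = urlparse(url)
    if parts.scheme and parts.netloc:
        return f"{parts.scheme}://{parts.netloc}"
    return None

# to_homepage("https://www.caixin.com/2025-01/gdp-growth") -> "https://www.caixin.com"
```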

3.3 Custom search tool: replace the built‑in search function with a more robust one that fetches top‑20 results, validates URLs, filters by keywords, and re‑ranks using a similarity model (e.g., DashScopeRerank).

import requests
from langchain_google_community import GoogleSearchAPIWrapper
from llama_index.core.schema import NodeWithScore, TextNode
from llama_index.postprocessor.dashscope_rerank import DashScopeRerank

def google_search(text):
    # Fetch the top-20 raw results; GoogleSearchAPIWrapper.results returns
    # dicts with "title", "link", and "snippet" keys
    search = GoogleSearchAPIWrapper()
    results = search.results(text, num_results=20)
    keywords = text.split(',')
    new_result = []
    for each in results:
        content, link = each.get('snippet', ''), each.get('link', '')
        try:
            # Keep only reachable links whose snippets contain every keyword
            if requests.get(link, timeout=5).status_code == 200 \
                    and all(k in content for k in keywords):
                new_result.append((content, link))
        except requests.RequestException:
            continue
    # Re-rank the surviving candidates by similarity to the query
    nodes = [NodeWithScore(node=TextNode(text=link + '&' + content))
             for content, link in new_result]
    rerank = DashScopeRerank(top_n=3)
    ranked = rerank.postprocess_nodes(nodes, query_str=text)
    return sorted([(r.node.get_content(), r.score) for r in ranked],
                  key=lambda x: x[1], reverse=True)

search_tool = {
    "type": "function",
    "function": {
        "name": "google_search",
        "description": "When real-time information is needed, use this tool",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string", "description": "The query"}},
            "required": ["text"],
        },
    },
}
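Once the schema is registered (e.g., via an OpenAI-compatible tools parameter), the model answers a real-time query with a tool call that the runtime must dispatch. The payload and dispatch helper below are an illustrative sketch of that plumbing, not a specific SDK's API:

```python
import json

# Hypothetical tool-call payload, shaped like what an OpenAI-compatible
# endpoint returns: a function name plus JSON-encoded arguments.
tool_call = {
    "function": {
        "name": "google_search",
        "arguments": json.dumps({"text": "major livelihood news, January 2025"}),
    }
}

def dispatch(tool_call, registry):
    # Look up the registered function by name and invoke it with the
    # model-extracted arguments; the result goes back to the model as a
    # "tool" message so it can compose the final answer.
    fn = registry[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return fn(**args)
```

Here registry would map "google_search" to the custom function above, keeping the model-facing schema and the actual implementation decoupled.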

Using this custom tool, Qwen‑max achieved 100% non‑404 links and 90% real‑link accuracy in 30 news‑search tests.


Written by Alibaba Cloud Developer, Alibaba's official tech channel, featuring all of its technology innovations.