How to Build a Multimodal AI Assistant with FastAPI, Alibaba Cloud and DashScope
This guide walks through configuring Alibaba Cloud credentials and building a FastAPI backend that combines email function calling, Alibaba OpenSearch web search, DashScope image generation, and speech recognition, along with a responsive HTML/CSS/JavaScript front‑end that supports text chat, image recognition, image synthesis, and voice interaction.
Prerequisites
Set the following environment variables before running the application:
ALIYUN_ACCESS_KEY_ID
ALIYUN_ACCESS_KEY_SECRET
ALIYUN_APP_KEY
ALIYUN_ASR_APP_KEY (same value as ALIYUN_APP_KEY)
Aliyun_Search_Key (used for web search)
AMAP_API_KEY (used for weather queries)
Dashscope_API_Key (DashScope SDK authentication)
Backend Modules
func_calling.py
Implements an email‑sending utility using smtplib and email.mime. The function send_email(receiver, content, subject=None) builds a MIMEMultipart message, logs into smtp.126.com with the password from Mail_Password, and sends the email. The module also defines a function‑calling schema for OpenAI‑compatible tools API, exposing send_email with required parameters receiver and content.
import smtplib, os
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from dotenv import load_dotenv
load_dotenv()
def send_email(receiver, content, subject=None):
    sender = '[email protected]'
    msg = MIMEMultipart()
    msg['Subject'] = subject if subject is not None else f"来自{sender}的问候邮件"
    msg['From'] = sender
    msg['To'] = receiver
    body = MIMEText(content, 'html', 'utf-8')
    msg.attach(body)
    smtpObj = smtplib.SMTP_SSL('smtp.126.com', 465)
    smtpObj.login(user=sender, password=os.getenv("Mail_Password"))
    smtpObj.sendmail(sender, receiver, msg.as_string())  # serialize the MIME message explicitly
    smtpObj.quit()
    return "邮件已经成功发送到:" + receiver
system_prompt = """
你是一名AI助手,具备函数调用的能力,但是如果提供的信息已经足够回答用户的问题,则不需要再进行函数调用。
同时,请严格按照函数调用的方式进行处理,如果用户未提供函数所需参数,则必须询问,而不能自作主张。
"""
functions = [{
    "type": "function",
    "function": {
        "name": "send_email",
        "description": "向指定邮箱地址发送一封邮件",
        "parameters": {
            "type": "object",
            "properties": {
                "receiver": {"type": "string", "description": "邮件的收件地址"},
                "content": {"type": "string", "description": "邮件的正文内容,支持HTML格式"},
                "subject": {"type": "string", "description": "邮件的标题,如果没有标题,可以设置为空"}
            },
            "required": ["receiver", "content"]
        }
    }
}]
if __name__ == '__main__':
    send_email("[email protected]", "祝你节日快乐,工作顺利。")
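When the model answers with a tool call, the arguments arrive as a JSON string and must be routed to the matching Python function. A minimal dispatch sketch (send_email is swapped for a stub so it runs without SMTP, and the plain dict stands in for the SDK's tool-call object):

```python
import json

def send_email_stub(receiver, content, subject=None):
    # Stand-in for send_email so the dispatch can be demonstrated offline.
    return f"邮件已经成功发送到:{receiver}"

# Map tool names declared in the `functions` schema to local callables.
TOOL_REGISTRY = {"send_email": send_email_stub}

def dispatch_tool_call(tool_call):
    func = TOOL_REGISTRY[tool_call["function"]["name"]]
    # Arguments are delivered as a JSON string; json.loads is safer than eval().
    args = json.loads(tool_call["function"]["arguments"])
    return func(**args)

result = dispatch_tool_call({
    "function": {
        "name": "send_email",
        "arguments": '{"receiver": "user@example.com", "content": "hi"}',
    }
})
print(result)  # → 邮件已经成功发送到:user@example.com
```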
module.py
Provides a thin client for Alibaba Cloud OpenSearch. It sends a POST request to the OpenSearch endpoint, prints debugging information, and extracts search results from several possible response structures.
import requests, os, json
from dotenv import load_dotenv
load_dotenv()
def aliyun_search(content):
    print(f"🔍 搜索内容: {content}")
    print(f"📝 使用的密钥前几位: {os.getenv('Aliyun_Search_Key')[:15] if os.getenv('Aliyun_Search_Key') else '无密钥'}")
    url = "http://default-cu35.platform-cn-shanghai.opensearch.aliyuncs.com/v3/openapi/workspaces/default/web-search/ops-web-search-001"
    header = {"Content-Type": "application/json", "Authorization": f"Bearer {os.getenv('Aliyun_Search_Key')}"}
    data = {"query": content, "top_k": 3, "way": "full", "content_type": "summary"}
    print(f"🌐 请求URL: {url}")
    print(f"📦 请求数据: {data}")
    try:
        resp = requests.post(url, headers=header, json=data, timeout=10)
        print(f"📡 状态码: {resp.status_code}")
        if resp.status_code != 200:
            print(f"❌ 请求失败! 响应内容: {resp.text}")
            return []
        response_json = resp.json()
        print("📄 完整响应结构:")
        print(json.dumps(response_json, indent=2, ensure_ascii=False, default=str))
        if 'result' in response_json and 'search_result' in response_json['result']:
            results = response_json['result']['search_result']
        elif 'search_result' in response_json:
            results = response_json['search_result']
        elif 'data' in response_json and 'search_result' in response_json['data']:
            results = response_json['data']['search_result']
        elif 'items' in response_json:
            results = response_json['items']
        elif 'hits' in response_json and 'hits' in response_json['hits']:
            results = response_json['hits']['hits']
        else:
            print("⚠️ 警告: 未找到预期的响应结构")
            results = response_json if isinstance(response_json, list) else []
        print(f"✅ 成功解析 {len(results)} 个结果")
        return results
    except requests.exceptions.Timeout:
        print("⏰ 请求超时")
        return []
    except json.JSONDecodeError:
        print("❌ 响应不是有效的JSON格式")
        print(f"原始响应: {resp.text[:500]}...")
        return []
    except Exception as e:
        print(f"💥 搜索请求出错: {type(e).__name__}: {e}")
        return []
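The cascade of fallback branches can be factored into a standalone helper and exercised against sample payloads without hitting the network (a sketch mirroring the branches above):

```python
def extract_results(response_json):
    """Pull the result list out of an OpenSearch-style response,
    trying the same candidate structures as aliyun_search."""
    if isinstance(response_json, dict):
        if 'result' in response_json and 'search_result' in response_json['result']:
            return response_json['result']['search_result']
        if 'search_result' in response_json:
            return response_json['search_result']
        if 'data' in response_json and 'search_result' in response_json['data']:
            return response_json['data']['search_result']
        if 'items' in response_json:
            return response_json['items']
        if 'hits' in response_json and 'hits' in response_json['hits']:
            return response_json['hits']['hits']
        return []  # unrecognised dict shape
    return response_json if isinstance(response_json, list) else []

print(extract_results({'result': {'search_result': [1, 2, 3]}}))  # → [1, 2, 3]
print(extract_results({'unexpected': True}))                      # → []
```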
generate.py
Exposes a FastAPI router that calls DashScope's ImageSynthesis SDK (model wanx2.1-t2i-turbo) to generate a single image from a text prompt. The image is downloaded and saved under ./static/images/, and the relative URL is returned.
from fastapi import APIRouter, Body
from http import HTTPStatus
import os, requests
from dashscope import ImageSynthesis
from dotenv import load_dotenv
load_dotenv()
generate = APIRouter()
@generate.post("/generate")
def generate_image(data: dict = Body()):
    api_key = os.getenv("Dashscope_API_Key")
    rsp = ImageSynthesis.call(
        api_key=api_key,
        model="wanx2.1-t2i-turbo",
        prompt=data['content'],
        n=1,
        size='1024*1024'
    )
    if rsp.status_code == HTTPStatus.OK:
        for result in rsp.output.results:
            # Derive a local file name from the remote URL, dropping the query string
            file_name = result.url.split('/')[-1].split('?')[0]
            with open(f'./static/images/{file_name}', 'wb') as f:
                f.write(requests.get(result.url).content)
        return {"message": "successful", "image_url": f"/static/images/{file_name}"}
    # Surface an explicit error instead of silently returning None
    return {"message": "failed", "code": rsp.status_code}
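The file name is derived directly from the remote URL. A slightly more defensive variant (a sketch, not part of the original code) uses urllib.parse so that only the final path component can ever reach the images directory:

```python
import os
from urllib.parse import urlparse

def safe_filename(url):
    # Parse the URL properly: the query string is excluded from .path,
    # and basename() discards any directory components.
    path = urlparse(url).path
    return os.path.basename(path)

print(safe_filename("https://example.com/a/b/img_123.png?Expires=1"))  # → img_123.png
```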
recognize.py
Implements image recognition by sending a Base64-encoded image to the DashScope model qwen-vl-max-latest. The endpoint streams the model's response back to the client as newline-delimited JSON (served with the text/event-stream media type).
from fastapi import APIRouter, Body
from fastapi.responses import StreamingResponse
import json, os, base64
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
recog = APIRouter()
@recog.post("/recognize")
def recognize_image(data: dict = Body()):
    b64str = data['base64'].split(',')[1]  # payload after the data-URL prefix
    def stream_chat():
        client = OpenAI(api_key=os.getenv("Dashscope_API_Key"), base_url="https://dashscope.aliyuncs.com/compatible-mode/v1")
        completion = client.chat.completions.create(
            model="qwen-vl-max-latest",
            messages=[
                {"role": "system", "content": "你是一名专业的AI助手,可以帮助用户解答任何问题,也能以精准简洁的语言识别并描述出图像的内容。"},
                # The OpenAI-compatible API expects image_url as an object: {"url": ...}
                {"role": "user", "content": [{"type": "image_url", "image_url": {"url": data['base64']}}, {"type": "text", "text": data['content']}]}
            ],
            stream=True
        )
        for chunk in completion:
            choice = chunk.choices[0].delta.content
            if choice:  # skip role/usage chunks whose content is None
                yield json.dumps({"content": choice}) + "\n"
    return StreamingResponse(stream_chat(), media_type="text/event-stream")
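Each streamed chunk is one JSON object terminated by a newline. A client can reassemble the reply from such a stream like this (a sketch over an in-memory list standing in for the HTTP body):

```python
import json

def read_ndjson_stream(lines):
    """Accumulate the 'content' fields from a newline-delimited JSON
    stream, skipping blank lines."""
    reply = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        reply += json.loads(line)["content"]
    return reply

chunks = ['{"content": "你"}\n', '{"content": "好"}\n', "\n"]
print(read_ndjson_stream(chunks))  # → 你好
```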
qa.py
Provides a streaming chat endpoint that optionally performs an external knowledge search via aliyun_search, then sends the combined prompt to the qwen-plus model with function calling enabled. If the model returns a tool call, the corresponding Python function (e.g., send_email) is executed and its result is fed back into the conversation.
from fastapi import APIRouter, Body
from fastapi.responses import StreamingResponse
import json, os
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
from module import aliyun_search
from func_calling import send_email, functions
qa = APIRouter()
messages = [{"role": "system", "content": "你是一名专业的AI助手,可以帮助用户解答任何问题。"}]
@qa.post("/stream")
def stream(question: dict = Body()):
    content = question['content']
    search = question['search']
    if search:
        search_result = aliyun_search(content)
        message = {"role": "user", "content": f"请使用以下内容:\n{search_result}\n并基于用户的提问:\n{content}\n来进行回答。"}
    else:
        message = {"role": "user", "content": content}
    def check_func_call(msg):
        messages.append(msg)
        client = OpenAI(api_key=os.getenv("Dashscope_API_Key"), base_url="https://dashscope.aliyuncs.com/compatible-mode/v1")
        completion = client.chat.completions.create(
            model="qwen-plus",
            messages=messages,
            stream=False,
            tools=functions
        )
        return completion.choices[0].message
    def stream_chat():
        client = OpenAI(api_key=os.getenv("Dashscope_API_Key"), base_url="https://dashscope.aliyuncs.com/compatible-mode/v1")
        completion = client.chat.completions.create(
            model="qwen-plus",
            messages=messages,
            stream=True,
            stream_options={"include_usage": False}
        )
        reply = ""
        for chunk in completion:
            choice = chunk.choices[0].delta.content
            if choice:  # final chunks can carry None content
                reply += choice
                yield json.dumps({"content": choice}) + "\n"
        messages.append({"role": "assistant", "content": reply})
    output = check_func_call(message)
    if output.tool_calls:
        func_name = output.tool_calls[0].function.name
        # Tool-call arguments arrive as a JSON string; parse them instead of eval()
        func_args = json.loads(output.tool_calls[0].function.arguments)
        func = globals()[func_name]
        result = func(**func_args)
        messages.append({"role": "user", "content": f"请将以下内容直接回复给用户: {result}"})
    # No else branch: check_func_call has already appended the user message
    return StreamingResponse(stream_chat(), media_type="text/event-stream")
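qa.py stores history in a single module-level messages list, so every client shares one conversation. A per-session alternative could key histories by an identifier (a sketch; session IDs are not part of the original API):

```python
from collections import defaultdict

SYSTEM = {"role": "system", "content": "你是一名专业的AI助手,可以帮助用户解答任何问题。"}

# One independent history per session id, each seeded with the system prompt.
sessions = defaultdict(lambda: [dict(SYSTEM)])

def append_message(session_id, role, content):
    sessions[session_id].append({"role": role, "content": content})
    return sessions[session_id]

append_message("alice", "user", "你好")
append_message("bob", "user", "hi")
print(len(sessions["alice"]))  # → 2 (system prompt + one user turn)
```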
main.py
Entry point that creates the FastAPI app, configures CORS, mounts static files, includes the routers defined above, and provides two simple endpoints: the root HTML page and /api/info, which reports service metadata and configuration status.
from fastapi import FastAPI, Request
from fastapi.templating import Jinja2Templates
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from qa import qa
from recognize import recog
from generate import generate
import uvicorn, os
from dotenv import load_dotenv
load_dotenv()
app = FastAPI(
    title="AI Assistant API",
    description="Multi-modal AI assistant with Alibaba Cloud services",
    version="1.0.0"
)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"]
)
app.mount("/static", StaticFiles(directory="static"), name="static")
templates = Jinja2Templates(directory="templates")
app.include_router(qa)
app.include_router(recog)
app.include_router(generate)
@app.get('/')
def chat(request: Request):
    return templates.TemplateResponse(request=request, name="index.html")

@app.get('/api/info')
async def api_info():
    return {
        "service": "AI Assistant",
        "version": "1.0.0",
        "features": ["普通聊天", "高德地图MCP集成"],
        "endpoints": {
            "POST /stream": "流式聊天接口",
            "GET /health": "健康检查",
            "GET /api/info": "API信息"
        },
        "config_status": {
            "dashscope": bool(os.getenv("Dashscope_API_Key")),
            "amap": bool(os.getenv("AMAP_API_KEY"))
        }
    }
if __name__ == '__main__':
    uvicorn.run(app, host="0.0.0.0", port=8000, log_level="info")
Front‑End
The UI is a single-page application defined in templates/index.html. It consists of a header with a logo, a mode selector (currently only "chat"), a chat area that displays messages, and an input panel with controls for text entry, image upload, voice start/stop, and image generation. Font Awesome icons supply the button glyphs, and three JavaScript bundles drive the interactive behaviour.
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<title>智能多模态AI助手</title>
<script src="/static/script.js"></script>
<script src="/static/script1.js"></script>
<script src="/static/script_moblie.js"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
<link rel="stylesheet" href="../static/styles_mobile.css">
</head>
<body>
<div class="app-container">
<header>
<div class="logo">
<div class="logo-icon"><i class="fas fa-robot"></i></div>
<div><h1>多模态AI助手</h1><p class="tagline">支持对话、识图、绘图与语音交互</p></div>
</div>
<div class="mode-selector">
<button class="mode-btn active" data-mode="chat"><i class="fas fa-comments"></i> 对话</button>
</div>
</header>
<div class="main-content">
<div class="chat-container">
<div class="chat-header">
<h2>AI对话</h2>
<button class="clear-chat-btn" id="clearChat" onclick="clearChat()"><i class="fas fa-trash-alt"></i> 清空对话</button>
</div>
<div class="net-search">
<input type="checkbox" id="net-search">联网
</div>
<div class="messages-container" id="chatbox">
<!-- 动态消息插入点 -->
</div>
<div class="input-area">
<div class="action-buttons">
<button class="action-btn" id="recognize-image" title="上传图片" onclick="addImage()"><i class="fas fa-image"></i></button>
<button class="action-btn" id="voiceBtn" title="语音输入" onclick="connectWebSocket()"><i class="fas fa-microphone"></i></button>
<button class="action-btn" id="attachBtn" title="停止录音" onclick="stopRecording()"><i class="fas fa-stop-circle"></i></button>
</div>
<textarea id="question" class="text-input" placeholder="输入您的问题或指令..." onkeyup="doEnter(event)"></textarea>
<div class="send-buttons">
<button class="send-btn" id="qa-button" onclick="doAsk()"><i class="fas fa-paper-plane"></i></button>
<button class="generate-image" id="generate-image" onclick="generateImage()"><i class="fas fa-wand-magic-sparkles"></i></button>
</div>
</div>
<div class="status info" id="typingStatus"><i class="fas fa-circle-notch fa-spin"></i> AI正在思考...</div>
</div>
<div class="side-panel">
<!-- 预留的图像识别、生成结果展示区域 -->
</div>
</div>
<footer>
<p>多模态AI助手 © 2025 | 支持文本、图像、语音交互</p>
</footer>
</div>
</body>
</html>
styles_mobile.css
Mobile-first CSS that defines layout, colour themes, flexbox containers, and responsive adjustments for larger screens. It also includes touch-device optimisations and custom scrollbar styling.
static/script.js
Handles voice input via Alibaba NLS WebSocket. It obtains a temporary token from /api/speech/token , opens a WebSocket connection, streams PCM‑16 audio data captured from the microphone, and processes intermediate and final transcription results. The script updates the UI status, logs messages, and inserts recognized text into the question textarea.
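script.js itself is not shown here; as an illustration of the PCM-16 format it streams to the NLS service, the Float32-to-PCM-16 conversion can be sketched in Python (the function name and clamping behaviour are assumptions, not the actual front-end code):

```python
import struct

def float32_to_pcm16(samples):
    """Convert [-1.0, 1.0] float samples to little-endian 16-bit PCM bytes."""
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))          # clamp to the valid range
        out += struct.pack('<h', int(s * 32767))
    return bytes(out)

pcm = float32_to_pcm16([0.0, 0.5, -1.0])
print(len(pcm))  # → 6 (3 samples × 2 bytes each)
```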
static/script1.js
Manages the chat workflow:
Maintains a switch_voice flag for optional speech synthesis via the browser SpeechSynthesis API.
Provides doAsk() to send a user query (or image-recognition request) to the appropriate backend endpoint: /stream for chat, /recognize for image recognition, /generate for image generation.
Uses the Fetch API with a streaming response reader to render incremental AI replies, appending a "朗读" (read aloud) button that triggers readText(this) for speech synthesis.
Implements image upload handling, Base64 conversion, preview rendering, and session storage of the image data.
Provides utility functions such as clearChat(), scrollToBottom(), and UI state toggles.
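The upload path encodes the image as a data URL, which /recognize then splits on the first comma (data['base64'].split(',')[1]). The round trip can be sketched in Python (a hypothetical helper, not the actual script1.js code):

```python
import base64

def to_data_url(image_bytes, mime="image/png"):
    # Mirror what the browser's FileReader.readAsDataURL() produces.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

payload = to_data_url(b"\x89PNG fake bytes")
# Server side: strip the "data:image/png;base64," prefix, as recognize.py does.
decoded = base64.b64decode(payload.split(',')[1])
print(decoded == b"\x89PNG fake bytes")  # → True
```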
Generated Sample Images
Two example outputs produced by the /generate endpoint (images omitted here):
Prompt: "武松打虎" (Wu Song fights the tiger)
Prompt: "帮我生成一幅美丽的风景画" (generate a beautiful landscape painting for me)
Key Takeaways
The project demonstrates how to combine FastAPI, Alibaba Cloud OpenSearch, DashScope multimodal models, and browser‑side JavaScript to build a multi‑modal AI assistant capable of text dialogue, image generation, image recognition, and voice interaction, while also supporting function calling (e.g., sending emails) and optional external knowledge retrieval.
Woodpecker Software Testing
The Woodpecker Software Testing public account shares software testing knowledge, connects testing enthusiasts, founded by Gu Xiang, website: www.3testing.com. Author of five books, including "Mastering JMeter Through Case Studies".