Build an AI Agent with FastAPI & Alibaba Cloud: Text Q&A, Image Recognition, and Text‑to‑Image

This guide walks through designing and implementing an AI assistant that connects FastAPI to Alibaba Cloud large‑model services. The assistant supports streaming text Q&A, image understanding, text‑to‑image generation, network search, and MCP‑based map queries, and the guide includes front‑end and back‑end code examples.

Woodpecker Software Testing

Jan 21, 2026

Overview

This article describes the end‑to‑end development of an AI agent that integrates FastAPI with Alibaba Cloud large‑model APIs to provide interactive text Q&A, image recognition, and text‑to‑image generation. It also offers network search and map services through AMap tools exposed over MCP (Model Context Protocol).

Core Feature List

FastAPI integration with Alibaba Cloud large‑model endpoints and streaming responses.

Pure‑text AI chat similar to consumer chat applications.

Vision‑language model (qwen‑vl‑max‑latest) for image recognition.

Text‑to‑image generation using the Tongyi Wanxiang model.

Network search that injects search results into the model prompt.

Map queries via the AMap (Gaode) MCP service (geolocation, weather, routing, etc.).

JavaScript‑based local speech synthesis for voice read‑out.

Frontend Design

The UI is built with plain HTML+CSS. Dynamic chat bubbles are created with document.createElement('div'). Image upload elements are hidden and triggered programmatically. A checkbox toggles network search, and buttons invoke the respective back‑end APIs.

Key Front‑end Functions

function doAsk() {
  const chatbox = document.getElementById('chatbox');
  const text = document.getElementById('question').value;
  const image = sessionStorage.getItem('image');

  const ask = document.createElement('div');
  ask.className = 'ask-box';
  if (image) {
    // Show the uploaded image above the question text
    const img = document.createElement('img');
    img.src = image;
    img.style.width = '100%';
    ask.append(img, document.createElement('br'));
  }
  // Append the question as a text node so user input is never parsed as HTML
  ask.append(text);
  chatbox.append(ask);
  scrollToBottom();

  // With an image attached, call image recognition; otherwise plain text Q&A
  if (image) {
    recognizeImage();
  } else {
    doAnswer();
  }
}

Speech synthesis is handled by creating a SpeechSynthesisUtterance and toggling playback with a global flag.

Back‑end Structure

FastAPI is split into multiple routers for modularity:

qa_1.py implements the streaming text Q&A endpoint.

recognize.py handles image‑recognition requests using the qwen‑vl‑max‑latest model.

generate.py calls the Tongyi Wanxiang V2 image‑synthesis SDK.

mcp_client_2.py wraps AMap MCP tool discovery, invocation, and result streaming.

func_calling.py defines a Python email‑sending function and its OpenAI function schema.

Streaming Text Q&A

from fastapi import APIRouter, Body
from fastapi.responses import StreamingResponse
import json, os
from openai import OpenAI

qa = APIRouter()
messages = [{"role": "system", "content": "You are a helpful AI assistant."}]

@qa.post("/stream")
async def stream(question: dict = Body(...)):
    content = question["content"]
    search = question["search"]
    # Build the request payload
    params = {"content": content, "search": search}
    # Forward to the model with streaming enabled
    async def generate_response():
        # omitted for brevity – see source for full implementation
        yield json.dumps({"content": "..."}) + "\n"
    return StreamingResponse(generate_response(), media_type="text/event-stream")
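The generator body is elided in the listing above, and the source does not reproduce it. As a rough illustration of the general pattern only (not the author's implementation), the sketch below turns OpenAI‑style streamed chunks into newline‑delimited JSON. The names to_ndjson and fake_chunk are hypothetical, and the chunks are simulated with SimpleNamespace so the example runs without a network call.

```python
import json
from types import SimpleNamespace
from typing import Iterable, Iterator


def to_ndjson(chunks: Iterable) -> Iterator[str]:
    """Yield one JSON line per non-empty text delta in an OpenAI-style stream."""
    for chunk in chunks:
        # Some chunks carry no choices at all (for example a usage-only chunk)
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        # The final delta often has content=None; skip it instead of emitting null
        if text:
            yield json.dumps({"content": text}, ensure_ascii=False) + "\n"


def fake_chunk(text):
    """Hypothetical stand-in with the same attribute path as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    stream = [fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo")]
    print("".join(to_ndjson(stream)), end="")
```

A generator of this shape could be handed to StreamingResponse directly. Each line is a complete JSON object, so the front end can split the response on newlines and parse every piece independently.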

Image Recognition Endpoint

from fastapi import APIRouter, Body
from fastapi.responses import StreamingResponse
import json, os, base64
from openai import OpenAI

recog = APIRouter()

@recog.post("/recognize")
async def recognize_image(data: dict = Body(...)):
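    # data['base64'] is a full data URL ("data:image/...;base64,..."), which the
    # OpenAI-compatible endpoint accepts as the image URL without further decoding.
    def stream_chat():
        # A plain (sync) generator: StreamingResponse iterates it in a thread pool,
        # so the blocking SDK call does not stall the event loop.
        client = OpenAI(api_key=os.getenv("Dashscope_API_Key"), base_url="https://dashscope.aliyuncs.com/compatible-mode/v1")
        completion = client.chat.completions.create(
            model="qwen-vl-max-latest",
            messages=[
                {"role": "system", "content": "You are an AI assistant that describes images concisely."},
                {"role": "user", "content": [
                    # image_url must be an object with a "url" field, not a bare string
                    {"type": "image_url", "image_url": {"url": data['base64']}},
                    {"type": "text", "text": data['content']},
                ]},
            ],
            stream=True
        )
        for chunk in completion:
            # Skip chunks with no choices or an empty delta (e.g. the final chunk)
            if not chunk.choices or not chunk.choices[0].delta.content:
                continue
            yield json.dumps({"content": chunk.choices[0].delta.content}) + "\n"
    return StreamingResponse(stream_chat(), media_type="text/event-stream")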
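The endpoint receives the image as a data URL from the browser. The source does not show any validation step, so the following is only a sketch of one way to check the upload before forwarding it to the model. The helper name parse_data_url and the restriction to image/* media types are assumptions made for illustration.

```python
import base64
import binascii
from typing import Tuple


def parse_data_url(data_url: str) -> Tuple[str, bytes]:
    """Split 'data:<mime>;base64,<payload>' into (mime, decoded bytes)."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    mime = header[len("data:"):-len(";base64")]
    # Assumption: only image uploads are meaningful for this endpoint
    if not mime.startswith("image/"):
        raise ValueError(f"unsupported media type: {mime!r}")
    try:
        raw = base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("payload is not valid base64") from exc
    return mime, raw


if __name__ == "__main__":
    url = "data:image/png;base64," + base64.b64encode(b"\x89PNG").decode()
    mime, raw = parse_data_url(url)
    print(mime, len(raw))
```

Rejecting malformed uploads early would let the server answer with a clear client error instead of passing a bad payload on to the model service.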

Text‑to‑Image Generation

from fastapi import APIRouter, Body, HTTPException
from http import HTTPStatus
import os, requests
from dashscope import ImageSynthesis

generate = APIRouter()

@generate.post("/generate")
def generate_image(data: dict = Body(...)):
    api_key = os.getenv("Dashscope_API_Key")
    rsp = ImageSynthesis.call(
        api_key=api_key,
        model="wanx2.1-t2i-turbo",
        prompt=data['content'],
        n=1,
        size='1024*1024'
    )
    if rsp.status_code != HTTPStatus.OK:
        # Surface the upstream failure instead of silently returning null
        raise HTTPException(status_code=502, detail=f"{rsp.code}: {rsp.message}")
    # n=1, so only the first result is saved and returned
    result = rsp.output.results[0]
    file_name = result.url.split('/')[-1].split('?')[0]
    # Make sure the target directory exists before writing the image
    os.makedirs('./static/images', exist_ok=True)
    with open(f'./static/images/{file_name}', 'wb') as f:
        f.write(requests.get(result.url, timeout=60).content)
    return {"message": "successful", "image_url": f"/static/images/{file_name}"}

Network Search Integration

When the "联网" (network search) checkbox is selected, the front‑end sends {"search": true}. The back‑end calls Alibaba OpenSearch, formats the top‑k results, and prepends them to the user prompt before invoking the large model.
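The prompt‑building step described above can be sketched as follows. This is a minimal illustration under assumptions: the function name build_search_prompt, the result fields title, snippet, and url, and the instruction wording are all hypothetical, since the source does not show the shape of the OpenSearch response.

```python
from typing import Dict, List


def build_search_prompt(question: str, results: List[Dict[str, str]], top_k: int = 3) -> str:
    """Prepend the top_k search hits to the user's question as numbered context."""
    hits = results[:top_k]
    if not hits:
        # Nothing found: fall back to the bare question
        return question
    lines = [
        f"[{i}] {hit['title']}: {hit['snippet']} ({hit['url']})"
        for i, hit in enumerate(hits, start=1)
    ]
    context = "\n".join(lines)
    return (
        "Use the search results below when they are relevant.\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    demo = [{"title": "A", "snippet": "s1", "url": "u1"}]
    print(build_search_prompt("What is X?", demo))
```

Numbering the hits gives the model a way to refer back to a specific result, and capping them at top_k keeps the prompt length bounded.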

MCP (Model Context Protocol) Client

The MCP client uses the asynchronous mcp SDK to discover available tools (e.g., routing, weather, POI search), invoke them, and stream the combined result back to the user. Initialization and cleanup are performed with await client.init_session() and await client.cleanup().
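Once tools have been discovered, they need to be described to the model before it can choose one. The source does not show how mcp_client_2.py does this, so the sketch below illustrates only one common approach: mapping each MCP tool descriptor (which carries name, description, and inputSchema) to an OpenAI‑style tool entry. The function name mcp_tools_to_openai and the demo tool are hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def mcp_tools_to_openai(tools: Iterable[Any]) -> List[Dict[str, Any]]:
    """Map MCP tool descriptors to OpenAI-style function tool entries."""
    converted = []
    for tool in tools:
        converted.append({
            "type": "function",
            "function": {
                "name": tool.name,
                # description is optional in MCP, so default to an empty string
                "description": tool.description or "",
                # MCP describes arguments with JSON Schema, which "parameters" also expects
                "parameters": tool.inputSchema,
            },
        })
    return converted


if __name__ == "__main__":
    # Hypothetical descriptor standing in for one entry of a tool listing
    demo_tool = SimpleNamespace(
        name="weather_lookup",
        description="Look up the weather for a city.",
        inputSchema={"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
    )
    print(mcp_tools_to_openai([demo_tool]))
```

The converted list could then be passed as the tools argument of a chat completion request, with any tool call the model returns routed back to the MCP session for execution.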

Function Calling for Email

A simple send_email function is exposed to the model via OpenAI‑compatible function schemas. If the model decides a function call is needed, the server executes the function and returns the result as a normal chat message.
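The schema‑plus‑dispatch pattern described above might look roughly like this. It is a sketch, not the contents of func_calling.py: the send_email signature, the schema fields, and the helper dispatch_tool_call are assumptions, and the stub returns a string instead of sending any mail.

```python
import json
from typing import Callable, Dict


def send_email(to: str, subject: str, body: str) -> str:
    """Stub standing in for the article's email function; it sends nothing."""
    return f"Email to {to} queued: {subject}"


# OpenAI-compatible tool schema describing the stub above (field names assumed)
SEND_EMAIL_TOOL = {
    "type": "function",
    "function": {
        "name": "send_email",
        "description": "Send an email to one recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string", "description": "Recipient address"},
                "subject": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "subject", "body"],
        },
    },
}

REGISTRY: Dict[str, Callable[..., str]] = {"send_email": send_email}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the function the model asked for; arguments arrive as a JSON string."""
    func = REGISTRY.get(name)
    if func is None:
        return f"Unknown function: {name}"
    try:
        kwargs = json.loads(arguments_json)
        return func(**kwargs)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or missing/unexpected argument names from the model
        return f"Invalid arguments for {name}: {exc}"


if __name__ == "__main__":
    args = '{"to": "a@example.com", "subject": "Hi", "body": "Hello"}'
    print(dispatch_tool_call("send_email", args))
```

Returning an error string instead of raising keeps the chat loop alive when the model produces bad arguments, and the message can be fed back to the model so it can retry.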

Putting It All Together

The main FastAPI application mounts the static directory, registers all routers, and adds CORS middleware. Health and API‑info endpoints provide configuration status. The front‑end HTML page (chat.html) loads script.js and style.css, presenting a chat window where users can type, upload images, toggle network search, or request map‑based queries.

Overall, the article demonstrates a complete workflow: requirement analysis, modular back‑end design, front‑end interaction handling, integration of multiple Alibaba Cloud services, and optional function calling, enabling a versatile AI assistant.


Written by

Woodpecker Software Testing

The Woodpecker Software Testing public account, founded by Gu Xiang, shares software testing knowledge and connects testing enthusiasts. Website: www.3testing.com. Author of five books, including "Mastering JMeter Through Case Studies".
