How MaxCompute Enables Multimodal Storage and Hybrid Computing for Powerful Digital Agents
The article details MaxCompute's three‑stage approach—production‑ready Agent access via MCP and Skill, a business‑oriented semantic layer, and multimodal Blob storage with hybrid compute—culminating in a CPU‑only home‑design demo that showcases end‑to‑end Agent workflows, security controls, and mobile integration.
As large‑model applications deepen, digital agents begin handling data queries, report generation, operations diagnostics, and multimodal content processing. MaxCompute product expert Sun Daibo presented a three‑stage practice: production‑ready access, a semantic layer, and multimodal storage & hybrid compute, ending with an end‑to‑end home‑design demonstration.
Production‑ready access : MaxCompute offers a Model Context Protocol (MCP) service that announces capabilities such as table lookup, SQL execution, and metadata reading. To turn capabilities into tasks, a Skill package supplies reusable ability‑combination templates (SQL generation, Python coding, metadata query, audit, ops). For example, when an Agent receives “diagnose today’s failed jobs”, the Skill guides sequential calls to log read, status check, and cause classification. Authentication supports AK/SK, STS, and password‑less login in trusted ECS environments.
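The "diagnose today's failed jobs" flow above can be sketched as a fixed sequence of tool calls. This is only an illustration of the ordering a Skill might impose: the job records, the function names `read_logs`, `check_status`, and `classify_cause`, and the cause categories are all hypothetical and are not MaxCompute or MCP APIs.

```python
from collections import Counter

# Hypothetical job records standing in for what a log-read tool might return.
JOBS = [
    {"id": "job-1", "status": "FAILED", "log": "quota exceeded for project"},
    {"id": "job-2", "status": "SUCCESS", "log": "ok"},
    {"id": "job-3", "status": "FAILED", "log": "table not found: sales_dw"},
    {"id": "job-4", "status": "FAILED", "log": "quota exceeded"},
]


def read_logs():
    """Step 1: read job logs (here, an in-memory stub)."""
    return JOBS


def check_status(jobs):
    """Step 2: keep only the jobs whose status is FAILED."""
    return [job for job in jobs if job["status"] == "FAILED"]


def classify_cause(job):
    """Step 3: map a log message to a coarse cause category."""
    text = job["log"].lower()
    if "quota" in text:
        return "resource"
    if "not found" in text:
        return "missing_object"
    return "unknown"


def diagnose_failed_jobs():
    """Run the three steps in the order the Skill prescribes."""
    failed = check_status(read_logs())
    return dict(Counter(classify_cause(job) for job in failed))


print(diagnose_failed_jobs())
```

Keeping each step as a separate function mirrors the idea that the Skill composes existing capabilities rather than adding new ones.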
Security boundaries : The first layer applies dynamic masking to sensitive fields before returning results, keeping raw data hidden while processing continues. The second layer enforces behavior constraints via MCP configuration, defining per‑Agent role boundaries (read‑only with optional cost‑approval, write, delete) to achieve task‑level permission isolation.
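The two layers can be illustrated with a minimal sketch. The column names, the masking rule, the role names, and the `ROLE_ACTIONS` table below are assumptions made for the example; the source does not describe how MaxCompute configures either layer.

```python
# Hypothetical set of sensitive column names.
SENSITIVE_COLUMNS = {"phone", "id_card"}

# Hypothetical per-Agent role boundaries.
ROLE_ACTIONS = {
    "analyst": {"read"},
    "etl": {"read", "write"},
    "admin": {"read", "write", "delete"},
}


def mask_value(value):
    """Keep the first and last two characters, mask everything in between."""
    text = str(value)
    if len(text) <= 4:
        return "*" * len(text)
    return text[:2] + "*" * (len(text) - 4) + text[-2:]


def mask_rows(rows, sensitive=SENSITIVE_COLUMNS):
    """Layer 1: mask sensitive fields before results are returned."""
    return [
        {col: mask_value(val) if col in sensitive else val for col, val in row.items()}
        for row in rows
    ]


def is_allowed(role, action):
    """Layer 2: check an action against the role's boundary."""
    return action in ROLE_ACTIONS.get(role, set())


rows = [{"name": "Li", "phone": "13812345678"}]
print(mask_rows(rows))
print(is_allowed("analyst", "delete"))
```

Masking happens on the way out, so the query itself still runs against raw values; only the returned rows are redacted.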
Multiple entry points : Besides MCP, MaxCompute provides a CLI with --help and JSON output, plus a desktop client for cross‑validation. Mobile bots (DingTalk, Feishu, WeChat) let users scale resources on the fly, rerun failed jobs, and check quota without logging in to the console, turning passive tickets into proactive collaboration.
Semantic layer : This layer maps low‑level tables, fields, and compute logic to business‑friendly terms, metrics, and scenarios. Construction proceeds in three stages: (1) users build a semantic package from their own data; MaxCompute scans historical usage to extract joins and common metrics, and an LLM then calibrates the terminology. (2) A Skill drives the Agent to scan the data and generate the semantic mappings. (3) A cloud‑wide semantic layer is shared across Agents so that they do not diverge in how they interpret the data. An example package (Information_Schema) contains 54 terms, 34 entities, 16 associations, 35 metrics, and 17 scenarios (Playbooks). Playbooks guide Agents to answer questions such as “why is storage growing so fast?” by checking top tables, partition granularity, and cold data, or by breaking down CU consumption per owner or task type.
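A Playbook can be pictured as an ordered list of named checks that the Agent runs against table statistics. The sketch below is a guess at the shape of such a structure, not the actual package format: the `TableStat` fields, the thresholds, and the sample tables are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableStat:
    name: str
    size_gb: float
    partitions: int
    days_since_access: int


def top_tables(stats, n=2):
    """Largest tables first."""
    ranked = sorted(stats, key=lambda t: t.size_gb, reverse=True)
    return [t.name for t in ranked[:n]]


def over_partitioned(stats, max_partitions=10_000):
    """Tables whose partition granularity looks too fine."""
    return [t.name for t in stats if t.partitions > max_partitions]


def cold_tables(stats, idle_days=90):
    """Tables nobody has read for a long time."""
    return [t.name for t in stats if t.days_since_access >= idle_days]


# Hypothetical Playbook for "why is storage growing so fast?"
STORAGE_GROWTH_PLAYBOOK = [
    ("top_tables", top_tables),
    ("over_partitioned", over_partitioned),
    ("cold_tables", cold_tables),
]


def run_playbook(playbook, stats):
    """Run every check in order and collect the findings by name."""
    return {name: check(stats) for name, check in playbook}


SAMPLE_STATS = [
    TableStat("orders_log", 900.0, 36_500, 1),
    TableStat("clickstream", 1200.0, 400, 0),
    TableStat("archive_2019", 300.0, 12, 400),
]

print(run_playbook(STORAGE_GROWTH_PLAYBOOK, SAMPLE_STATS))
```

Encoding the checks as data rather than prose gives every Agent the same diagnostic path for the same question.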
Multimodal storage & hybrid compute : MaxCompute introduces a Blob data type that stores binary large objects (images, audio, video, documents) alongside structured columns (ID, tags, duration). Structured columns reside in columnar Data Files (supporting predicate push‑down) while Blob content lives in Blob Files referenced by pointers. Benefits include transparent compression that halves storage cost, batch reads that dramatically reduce I/O requests (e.g., reading 10 000 files individually vs a single column read), and predicate push‑down that enables fast filtering such as “pictures containing cats” by first filtering tag columns.
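The layout described above can be modeled in a few lines: structured columns hold a pointer (offset and length) into one shared blob file, filtering touches only the structured columns, and the matching blobs are fetched in a single batched request. This is a toy in-memory model; the class `BlobTable`, its methods, and the `io_requests` counter are hypothetical and say nothing about MaxCompute's actual file format.

```python
class BlobTable:
    """Toy model: structured rows plus pointers into one shared blob file."""

    def __init__(self):
        self.rows = []                 # structured columns and blob pointers
        self.blob_file = bytearray()   # all blob payloads, concatenated
        self.io_requests = 0           # counts requests against the blob file

    def insert(self, row_id, tags, payload):
        """Append the payload to the blob file and store a pointer to it."""
        pointer = (len(self.blob_file), len(payload))
        self.blob_file.extend(payload)
        self.rows.append({"id": row_id, "tags": set(tags), "ptr": pointer})

    def read_batch(self, pointers):
        """One request returns every requested blob."""
        self.io_requests += 1
        return [bytes(self.blob_file[off:off + size]) for off, size in pointers]

    def select_blobs(self, tag):
        """Filter on the tag column first, then fetch only matching blobs."""
        matches = [row for row in self.rows if tag in row["tags"]]
        blobs = self.read_batch([row["ptr"] for row in matches])
        return {row["id"]: blob for row, blob in zip(matches, blobs)}


table = BlobTable()
table.insert(1, ["cat", "indoor"], b"img-1")
table.insert(2, ["dog"], b"img-2")
table.insert(3, ["cat"], b"img-3")

print(table.select_blobs("cat"))
print(table.io_requests)
```

In this model the dog image is never read, and the two cat images cost one request instead of two, which is the same effect the 10 000-file comparison describes at larger scale.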
Multimodal operators : The platform provides OCR, grayscale/histogram processing, document parsing (PDF/Excel/Word/PPT/Markdown/HTML), and speech‑to‑text, and it can run Hugging Face models at scale. The MaxFrame Coding Skill injects distributed data‑processing knowledge into an AI programming assistant, offering operators such as df.apply (model batch inference), df.mf.flatmap (video frame slicing), and df.mf.rebalance (parallel tuning). Operator code reaches OSS through an @with fs mount decorator and caches loaded resources with ctx.
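The frame-slicing use of a flatmap operator comes down to one input row producing several output rows. The sketch below shows that semantics in plain Python only; it does not use MaxFrame, and `flat_map`, `slice_frames`, and the row fields are hypothetical names chosen for the illustration.

```python
def flat_map(rows, fn):
    """Apply fn to every row and concatenate the lists it returns."""
    out = []
    for row in rows:
        out.extend(fn(row))
    return out


def slice_frames(video, every_s=2):
    """Emit one row per sampled timestamp of a video."""
    return [
        {"video_id": video["video_id"], "frame_ts": ts}
        for ts in range(0, video["duration_s"], every_s)
    ]


videos = [
    {"video_id": "v1", "duration_s": 5},
    {"video_id": "v2", "duration_s": 3},
]

frames = flat_map(videos, slice_frames)
print(len(frames))
print(frames[:3])
```

Because the row count changes after a flatmap, a repartitioning step such as the rebalance operator mentioned above is a natural follow-up before per-frame inference.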
End‑to‑end home‑design demo (CPU‑only) : Three core tables – customer consultation (audio Blob, budget, household), material price, and rendering (image Blob) – are used. Pipeline: upload audio → speech‑to‑text → translation → an 8B model extracts style keywords → Stable Diffusion generates renderings. All steps are triggered by a single SQL UPDATE that calls the audio‑to‑text operator, requiring no drag‑and‑drop. The semantic layer then builds a package from the three tables, extracting joins, metrics, and enumerations; the Agent uses the package to answer queries like “show design renderings for a specific house”, performing ID resolution, cross‑table joins, and returning correct images. The workflow supports idempotent reruns and Delta Table TimeTravel rollback.
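The pipeline and its idempotent reruns can be sketched as a list of stages, each filling one output column from one input column and skipping rows where the output already exists. Every stage function here is a stub that returns a placeholder string; no real speech, translation, language, or image model is called, and the column names are assumptions.

```python
# Stub stages: each returns a placeholder instead of calling a real model.
def speech_to_text(audio):
    return f"transcript({len(audio)} bytes)"


def translate(text):
    return f"translated({text})"


def extract_keywords(text):
    return f"keywords({text})"


def generate_rendering(keywords):
    return f"image({keywords})"


# (output column, input column, stage function), in execution order.
STAGES = [
    ("transcript", "audio", speech_to_text),
    ("translation", "transcript", translate),
    ("keywords", "translation", extract_keywords),
    ("rendering", "keywords", generate_rendering),
]


def run_pipeline(row, executed=None):
    """Fill every missing output column; skip stages already completed."""
    for out_col, in_col, stage in STAGES:
        if row.get(out_col) is None:
            row[out_col] = stage(row[in_col])
            if executed is not None:
                executed.append(out_col)
    return row


row = {"customer_id": 42, "audio": b"\x00" * 16}

first, second = [], []
run_pipeline(row, first)    # runs all four stages
run_pipeline(row, second)   # rerun: nothing left to do
print(first)
print(second)
```

Skipping completed stages is what makes a rerun safe: repeating the pipeline after a partial failure only redoes the missing steps.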
Key takeaway : By combining MCP, Skill, a semantic layer, and Blob‑based table storage, MaxCompute lowers the cognitive and engineering cost of Agent‑driven data access. Coupled with AI assistants and mobile bot integration, enterprises can achieve “chat‑as‑a‑service” data workflows without GPU clusters, using only CPUs and lightweight models.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
DataFunSummit
