From Hackathon to Scalable AI Customer Service: Lessons and Best Practices
This article chronicles the end‑to‑end development of an AI‑driven customer service system, detailing the shift from a rapid‑prototype Dify platform to a hybrid engineering architecture, model selection strategies, workflow design, knowledge engineering, evaluation methods, and future directions for continuous improvement.
01 Introduction
Industry trends show that large‑model‑driven intelligent customer service is reshaping traditional support tools. High repetition in online support makes cost reduction and efficiency gains a primary value proposition. Large‑model agents improve product capability and user experience while dramatically lowering deployment and operation costs.
Youzan, an e‑commerce SaaS platform, has long faced strong demand for robust customer‑service solutions. In early 2025 the team began exploring this space, investing heavily and documenting the zero‑to‑one journey from an R&D perspective, focusing on core implementation points, critical steps, and reflective insights.
02 Intelligent Agent Development Platform vs. Engineered Code
The first version of the AI customer‑service system emerged from an internal hackathon. To meet tight deadlines the team chose a low‑code agent platform, Dify, which offered fast iteration and tool integration, enabling rapid product‑market fit validation.
As traffic grew, production‑grade challenges surfaced:
Performance bottlenecks: Simple Python nodes sometimes took 40‑100 ms.
Retrieval latency: Knowledge‑search components were too slow for real‑time support.
Ops standards: No established release or version‑control workflow.
The adopted solution follows three phases:
Validation phase: Use Dify to quickly build an MVP and verify the business model and user need; speed is paramount.
Growth phase: Combine Dify with custom engineered services, moving high‑concurrency and complex logic to self‑built code while keeping non‑core flows on the platform. Early on, the knowledge‑retrieval component was migrated to the engineered system.
Maturity phase: When the platform can no longer meet performance, customisation, and stability requirements, transition fully to engineered code. Tools such as Spring AI Alibaba Studio can export Dify‑designed logic into standard Spring AI projects.
03 Model Selection
The model acts as the AI’s brain; choosing the right one is crucial. Key take‑aways:
Prompt optimisation can sometimes be bypassed entirely by switching to a more suitable model.
Model capability evaluation started with manual checks and later incorporated the Langfuse evaluation tool.
In later stages, avoid frequent model changes: re‑evaluation and prompt tuning are costly and the outcomes uncertain.
Shorter model outputs mean faster responses and lower costs; avoid returning JSON when a constant label suffices, as the sketch below shows.
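For example, an intent node can be constrained to return a bare label. A minimal sketch, assuming an OpenAI‑compatible client; the model name and label set are illustrative, not the team's production choices:
# Sketch: constrain the model to a single-token label instead of JSON.
# Assumes an OpenAI-compatible client; model and labels are illustrative.
INTENT_PROMPT = (
    "Classify the user's message into exactly one label: "
    "PRESALE, POSTSALE, CHITCHAT, or OTHER. "
    "Reply with the label only, no JSON, no explanation."
)

def classify_intent(client, message: str) -> str:
    resp = client.chat.completions.create(
        model="qwen-plus",            # a smaller, cheaper model (assumed name)
        messages=[
            {"role": "system", "content": INTENT_PROMPT},
            {"role": "user", "content": message},
        ],
        max_tokens=4,                 # a constant label needs very few tokens
        temperature=0,
    )
    return resp.choices[0].message.content.strip()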
3.1 Selection Process
[Figure: model evolution across the AI customer‑service scenarios]
04 Workflow vs. Agent Choice
Early on the team debated between a pure Agent approach and a Workflow‑centric design. Agents offer high autonomy but sacrifice determinism and performance, so a conservative Workflow‑first strategy was chosen to build stable, controllable processes before gradually introducing Agents.
05 Workflow Design and Iteration
5.1 Supporting Pre‑sale Inquiries
The initial 1.0 version delegated almost all responsibilities to a single, well‑known model, including intent detection, question rewriting, duplicate detection, and sentiment analysis, enabling rapid launch.
5.2 Cost Reduction
After 1.0 went live, the team focused on raising the AI handling rate while cutting costs. Measures included compressing prompts, chunking knowledge, optimising retrieval, switching the intent‑recognition node from GPT‑4.1 to Qwen, and splitting that node. Qwen proved more sensitive to Dify's memory mechanism, causing hallucinations, so a global variable was introduced to store conversation history, which is appended to the system prompts of intent‑related nodes, as sketched below.
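A rough sketch of that workaround, assuming a simple in‑process history store (Dify's actual global‑variable mechanics differ; names are illustrative):
# Sketch: keep conversation history in our own variable and inject it
# into the system prompt of intent-related nodes, bypassing Dify's memory.
history: list[tuple[str, str]] = []    # (role, text) pairs, oldest first

def record_turn(role: str, text: str, max_turns: int = 10) -> None:
    history.append((role, text))
    del history[:-2 * max_turns]       # keep only the most recent turns

def build_intent_system_prompt(base_prompt: str) -> str:
    lines = [f"{role}: {text}" for role, text in history]
    return base_prompt + "\nConversation so far:\n" + "\n".join(lines)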
5.3 Improving AI Handling Rate
To boost AI coverage, the team enriched knowledge sources and began handling post‑sale queries. A key decision was to separate pre‑sale and post‑sale flows because their response styles differ (proactive vs. rigorous) and they require distinct knowledge bases.
5.4 Process Summary
The team adopted a conservative intent‑recognition and dispatch strategy: prefer no answer over a wrong answer. This approach eventually hits a ceiling; further improvements require expanding scenarios.
Current limitations include a single node handling both intent detection and question rewriting, and overlap between post‑sale classification and intent detection. Future iterative refinements may involve:
Splitting intent detection from question rewriting to reduce node complexity.
Enhancing the intent node with business rules, clarification intents, and small‑talk handling.
06 Context Engineering
Effective context management is vital; information overload leads to memory loss, repetition, or hallucinations. Even with large‑window models, indiscriminate filling is harmful. The team’s first steps focused on information acquisition, filtering, and assembly.
6.1 Information Acquisition
Multimodal knowledge extraction: Use multimodal models to slice and store product information.
Real‑time dynamic injection: Distinguish static knowledge from real‑time data; retrieve dynamic info via API at query time (see the sketch after this list).
Historical dialogue recall: Extract high‑frequency Q&A from past conversations to supplement implicit knowledge.
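A sketch of the static/dynamic split; every name here is a hypothetical stand‑in for the real retrieval service and business API:
# Sketch: static knowledge comes from a vector store; volatile facts
# (price, stock) come from business APIs at query time, never pre-indexed.
def retrieve_static(query: str, product_id: str) -> list[str]:
    """Stand-in for vector-store retrieval of pre-learned product knowledge."""
    return ["Material: aluminium alloy", "Weight: 1.2 kg"]

def fetch_live(product_id: str) -> dict:
    """Stand-in for a real-time business API call."""
    return {"price": 199.0, "in_stock": True}

def build_product_context(product_id: str, query: str) -> str:
    static_chunks = retrieve_static(query, product_id)
    live = fetch_live(product_id)      # fetched fresh on every query
    dynamic = f"current_price={live['price']}, in_stock={live['in_stock']}"
    return "\n".join(static_chunks) + "\n" + dynamic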
6.2 Filtering and Purification
Increasing sources raises the risk of hallucination. A routing‑and‑filter pipeline ensures a high signal‑to‑noise ratio.
Scene‑directed retrieval
Product detail page: retrieve only product‑specific information.
Pre‑sale / post‑sale routing: guide retrieval toward guidance‑oriented or policy‑oriented knowledge respectively.
Semantic relevance filtering
Improve knowledge quality by discarding unrelated fragments based on similarity to the current query.
Introduce weight‑based scoring (e.g., different weights for homepage vs. detail‑page queries) and combine similarity with weight to decide inclusion, as sketched below.
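A minimal sketch of that combined decision, assuming cosine‑similarity scores and hand‑tuned scene weights (all values illustrative, not production numbers):
# Sketch: include a fragment only when similarity * scene weight clears
# a threshold. Weights and threshold are illustrative assumptions.
SCENE_WEIGHTS = {
    ("homepage", "faq"): 1.0,
    ("homepage", "product_detail"): 0.6,
    ("detail_page", "product_detail"): 1.0,
    ("detail_page", "faq"): 0.8,
}
THRESHOLD = 0.55

def filter_fragments(fragments, scene: str):
    kept = []
    for frag in fragments:             # frag: {"text", "source", "similarity"}
        weight = SCENE_WEIGHTS.get((scene, frag["source"]), 0.5)
        if frag["similarity"] * weight >= THRESHOLD:
            kept.append(frag)
    return sorted(kept, key=lambda f: f["similarity"], reverse=True)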
6.3 Information Assembly
Early attempts concatenated knowledge snippets into a plain text string, which still caused overlap, conflict, and overload, leading to bad cases. The team switched to a structured assembly approach, pairing knowledge blocks with priority prompts to curb hallucinations.
Typical context structure (simplified):
<knowledge_base>
<product_info>
<product_name>knowledge_fragment_x</product_name>
<product_detail>knowledge_fragment_x</product_detail>
</product_info>
<logistics_policy>knowledge_fragment_x</logistics_policy>
<...>knowledge_fragment_x</...>
</knowledge_base>
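A sketch of how such a context might be assembled, pairing tagged blocks with a priority order so the prompt can tell the model which knowledge wins on conflict (block names follow the example above; the priority scheme is an assumption):
# Sketch: assemble retrieved fragments into tagged blocks in priority
# order (high to low). Block names and priorities are assumed.
def assemble_context(blocks: dict[str, list[str]]) -> str:
    priority = ["product_info", "logistics_policy", "faq"]
    parts = ["<knowledge_base>"]
    for name in priority:
        fragments = blocks.get(name)
        if not fragments:
            continue                   # skip empty blocks entirely
        parts.append(f"<{name}>")
        parts.extend(fragments)
        parts.append(f"</{name}>")
    parts.append("</knowledge_base>")
    return "\n".join(parts)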
07 Knowledge Engineering
7.1 Product Knowledge
Product knowledge consists of pre‑learned static data and real‑time queries. Pre‑learned data includes product images, specifications, and details. Multimodal models extract information from images, handling structured data better than OCR and filtering irrelevant content.
Model‑extracted knowledge can be erroneous; for example, a product image showed a purple car that does not exist in the specification, causing a wrong answer.
To mitigate such errors, the team later introduced knowledge‑priority mechanisms to reduce incorrect model outputs.
7.2 Historical Dialogue Knowledge
High‑frequency Q&A pairs are mined from historical chats to improve AI coverage. Challenges included massive data volume, difficulty extracting useful knowledge, and inefficient deduplication. The measures taken:
Switching from engineered code to Dify, which simplified debugging and integration.
Analyzing historical dialogues with RAG capabilities, de‑duplicating repeated questions, and classifying Q&A pairs (a dedup sketch follows this list).
Recalling historical‑dialogue knowledge separately, with higher score thresholds, to reduce bad cases.
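A sketch of embedding‑based deduplication; the embedding function is passed in as a stand‑in for a real model, and the 0.92 threshold is illustrative:
# Sketch: drop a mined question when it is a near-duplicate of one
# already kept, by cosine similarity of embeddings.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedupe(questions: list[str], embed, threshold: float = 0.92) -> list[str]:
    kept: list[str] = []
    kept_vecs: list[list[float]] = []
    for q in questions:
        v = embed(q)
        if all(cosine(v, kv) < threshold for kv in kept_vecs):
            kept.append(q)
            kept_vecs.append(v)
    return kept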
7.3 Document Knowledge
Merchant‑uploaded documents are chunked with chunk_size = 600 and chunk_overlap = 100 to balance context completeness and information density, preserving semantic continuity across chunk boundaries.
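A minimal sketch of that chunking in plain Python; the production system may well use a library splitter, this only shows the size/overlap mechanics:
# Sketch: fixed-size character chunking with overlap, matching the
# chunk_size=600 / chunk_overlap=100 settings described above.
def chunk_text(text: str, chunk_size: int = 600, chunk_overlap: int = 100) -> list[str]:
    step = chunk_size - chunk_overlap  # 500 new characters per chunk
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break                      # last chunk reached the end
    return chunks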
08 Evaluation and Feedback Optimization
8.1 Why Evaluate
Agent testing differs from traditional software testing because agents are nondeterministic, open‑ended, and scenario‑rich. Model swaps, prompt tweaks, or feature updates invalidate conventional test suites.
8.2 Evaluation Implementation
An effective evaluation system covers four pillars: evaluation targets, metrics, datasets, and feedback‑driven optimisation.
Evaluation targets: Business scenarios (e.g., the product‑detail entry point), full‑process coverage, and node‑level capabilities such as intent‑recognition accuracy.
Evaluation metrics: Custom metrics beyond manual correctness checks, leveraging similarity scores on historical dialogue data.
Evaluation dataset: Pre‑launch, a small test set built from historical dialogues and manually crafted cases; post‑launch, Goodcase/Badcase samples are collected continuously.
Feedback optimisation: Real‑time traffic monitoring feeds a "problem identification → root‑cause analysis → optimisation iteration" loop, turning Badcases into model improvements. A node‑level evaluation sketch follows this list.
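A sketch of a node‑level evaluation loop for intent accuracy, where the dataset shape and the classify_intent function are assumptions; in practice, results would also be logged to a tool such as Langfuse:
# Sketch: measure intent-recognition accuracy against a labeled dataset
# and surface Badcases for the feedback loop.
def evaluate_intent(dataset: list[dict], classify_intent) -> float:
    correct = 0
    for case in dataset:               # {"message": ..., "expected": ...}
        predicted = classify_intent(case["message"])
        if predicted == case["expected"]:
            correct += 1
        else:
            print(f"Badcase: {case['message']!r} -> {predicted}, "
                  f"expected {case['expected']}")
    return correct / len(dataset) if dataset else 0.0
Running such a loop before and after every model swap or prompt tweak is what makes the "evaluate → optimise → re‑evaluate" cycle repeatable.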
8.3 Process Summary
Evaluation is a continuous loop requiring collaboration across product, R&D, and testing teams. Emphasis should be on building robust evaluation foundations, expanding test case libraries, and maintaining an efficient "evaluate → optimise → re‑evaluate" cycle.
09 Collaboration and Project Management
AI projects differ from traditional software development: prompts, context, and business logic are tightly coupled, blurring role boundaries. Collaboration shifts from hand‑off interfaces to joint iteration of prompts, logic, and data. The flexibility of Agent/Workflow architectures disperses responsibility, and varying familiarity with new technologies creates a learning‑while‑doing environment.
Establish an AI‑native collaboration flow: Prompt review mechanisms, metric alignment, and product‑research cooperation.
Document decisions: Record why Workflow was chosen over Agent, capture the specific considerations, and maintain process logs.
10 Conclusion and Outlook
The journey from a hackathon prototype to a stable, evolving AI customer‑service system represents not only a technology‑stack upgrade but also a profound shift in development mindset.
Future focus areas include:
Transitioning from single‑responsibility flows to intelligent collaboration by introducing more autonomous Agents for complex interactions.
Deepening multi‑dimensional knowledge exploitation—automating historical dialogue cleaning, QA conversion, and product learning to turn massive data into AI evolution drivers.
Building a more agile AI‑native collaboration framework with prompt review processes and metric alignment to break down product‑research silos.
As the AI wave continues, intelligent customer service will keep evolving, guided by the principle of "prefer no answer to a wrong answer" and continuously expanding AI's application boundaries for merchants.
