Data Quality and Interface Semantic Monitoring for Algorithm Testing Platform
This article shares the challenges algorithm testing teams face in data quality and interface semantic monitoring, and describes how they built a unified business monitoring platform that checks table-, storage-, and service-level consistency and validates response semantics. Through dashboards, alerts, and correction tools, the platform quickly surfaced dozens of offline and online issues. The article also presents practical results and offers guidance for other development and testing teams to co‑create data quality and interface semantic monitoring capabilities.
Background
Business characteristics
Algorithm services heavily depend on data quality. As the saying goes, "Data and features determine the upper bound of machine learning, while models and algorithms only approximate that bound." In model training, large amounts of offline and feature data are required; any deviation in data quality leads to model drift and inaccurate predictions. Real‑time predictions also rely on timely data; data quality issues directly affect business outcomes.
Examples include the algorithm model deployment flow (illustrated in the original diagram) and the Haro map data ETL process, where any data quality problem in the ETL chain can cause abnormal online business data.
Strong reliance on external services
The intelligent customer service system depends on internal platforms (marketing, transaction, payment, account, risk) and external services (voice‑to‑text, hotline, Alipay). Failure of any dependent service disrupts the entire workflow.
Problem Introduction
Battery‑swap Scheduling
Scenario 1: Operations report many low‑battery vehicles at stations, but the swap scheduling algorithm does not generate swap tasks.
Scenario 2: Operations receive a swap task, go to the station, and find the vehicle is not actually low on battery.
Data Processing Flow
The real‑time data warehouse ingests app events, binlog, and IoT data via Kafka, processes them with Flink, and stores the results in Elasticsearch (ES) for downstream services.
Root Cause Analysis
1. Flink message processing pressure caused backlog and delayed data updates, so the latest battery level was not available for scheduling.
2. The code only handled specific battery‑type IDs (e.g., 591, 371, 668, 663). Vehicles with IDs starting with 376 were not in the logic, resulting in a default battery level of 0, causing false low‑battery alerts.
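The effect of the missing branch can be illustrated with a small sketch (Python, with hypothetical function and constant names; the real logic lives in a Flink job):

```python
# Hypothetical sketch of the battery-level parsing bug: only known
# battery-type IDs are handled, so unhandled types fall through to a
# default level of 0 and look like low-battery vehicles.
KNOWN_BATTERY_TYPES = {"591", "371", "668", "663"}

def parse_battery_level(battery_type_id: str, raw_level: int) -> int:
    """Return the battery level, or the default 0 if the type is not handled."""
    if battery_type_id in KNOWN_BATTERY_TYPES:
        return raw_level
    return 0  # bug: IDs starting with "376" land here and read as empty

assert parse_battery_level("591", 80) == 80
assert parse_battery_level("376001", 80) == 0  # false low-battery alert
```

Adding the missing type IDs (or alerting on unknown types instead of defaulting to 0) removes the false low‑battery signal.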
Offline Testing Challenges
High cost: requires dedicated performance testing.
Huge scenario space: multiple topics, sub‑fields, and constantly changing message schemas.
Incomplete test‑environment data prevents realistic validation of Flink scenarios.
Site Data Update
Scenario
Operations notice that site basic information (name, coordinates, capacity) is outdated or incorrect.
Data Processing Flow
Site data is updated via APP or PC, sent to the Oasis service, which publishes MQ messages to downstream services (e.g., Battery service). The Battery service consumes the MQ, processes the data, and provides SOA queries. Any failure in MQ publishing/consumption or secondary processing can cause data anomalies.
Root Cause Analysis
1. When the APP updates a site, it calls both the tag‑update and site‑update APIs. If the tag‑update API is invoked first, the service does not send an MQ message, so the Battery service misses the change.
2. Legacy flow: write to MongoDB → query MongoDB for success → send MQ. Because MongoDB uses a primary‑secondary architecture, there is a lag between primary write and secondary read. If the query reads the secondary before replication, the MQ is not sent, causing delayed updates. The flow was changed to: write to MongoDB → send MQ directly.
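The read‑after‑write race in the legacy flow can be sketched as follows (Python, with a toy stand‑in for the MongoDB deployment; all names here are illustrative, not the production service):

```python
class FakeMongo:
    """Minimal stand-in for a primary-secondary MongoDB deployment."""
    def __init__(self, lag: bool):
        self.lag = lag           # simulate replication lag to the secondary
        self.primary = set()
    def write(self, site_id: str) -> None:
        self.primary.add(site_id)
    def secondary_find(self, site_id: str) -> bool:
        # Under lag, the secondary has not yet seen the primary's write.
        return (site_id in self.primary) and not self.lag

published = []  # messages that actually reached MQ

def legacy_update(mongo: FakeMongo, site_id: str) -> None:
    mongo.write(site_id)
    if mongo.secondary_find(site_id):   # read-after-write on the secondary
        published.append(site_id)       # MQ publish; silently skipped under lag

def fixed_update(mongo: FakeMongo, site_id: str) -> None:
    mongo.write(site_id)
    published.append(site_id)           # publish unconditionally after the write

legacy_update(FakeMongo(lag=True), "site-1")   # message lost
fixed_update(FakeMongo(lag=True), "site-2")    # message sent
```

The fix removes the verification read entirely, so replication lag can no longer suppress the MQ message.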
Offline Testing Challenges
Covering all abnormal scenarios is difficult and requires deep domain knowledge.
Data Consistency
Asset testing teams ask how to ensure consistency and timeliness during data synchronization from DB to ES.
Data Sync Flow
Asset data is persisted to the database, then synchronized in real time to ES, which serves SOA queries. The challenge is to guarantee that millions of daily changes are correctly synced and that the sync service’s reliability is monitored.
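A cross‑source consistency check of this kind can be sketched as follows (Python, with in‑memory dicts standing in for the DB and ES clients; a real task would sample recently changed rows):

```python
# Sketch of a DB-to-ES sync check: take a sample of rows from the
# database and verify each one is present and identical in ES.
def find_sync_gaps(db_rows: dict, es_docs: dict) -> list:
    """Return asset IDs that are missing from ES or differ from the DB."""
    gaps = []
    for asset_id, row in db_rows.items():
        doc = es_docs.get(asset_id)
        if doc is None or doc != row:
            gaps.append(asset_id)
    return gaps

db = {"a1": {"status": "in_use"}, "a2": {"status": "idle"}}
es = {"a1": {"status": "in_use"}}          # "a2" not yet synced
assert find_sync_gaps(db, es) == ["a2"]
```

Run periodically, the gap list doubles as both a correctness signal (IDs that never sync) and a timeliness signal (IDs that sync late).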
Service Reliability
Customer service sometimes cannot see or receives incorrect transcription of voice recordings.
ASR (speech‑to‑text) is an internal call, not an HTTP API, so traditional status‑code monitoring does not apply. Moreover, some voice inputs legitimately cannot be transcribed, so a semantic‑level check is needed: send a known audio sample and verify that the returned text matches the expected phrase.
Solution
Monitoring Objectives
Improve online issue perception and quickly locate problems.
Minimize cost while reducing the impact of online issues.
The goal is to deliver Data Quality Monitoring 1.0 and Interface Semantic Monitoring 1.0, complementing existing monitoring solutions.
Monitoring Objects and Scenarios
Data Quality Monitoring
Three layers:
Table‑level (using DQC platform to ensure post‑ETL data quality).
Storage‑level (e.g., ES, Redis).
Application‑service level (cross‑source logical consistency checks).
Example: Haro map data quality monitoring covers single‑source quality analysis (field‑level non‑null, uniqueness, range, format) and multi‑source consistency analysis.
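The field‑level checks can be sketched as simple rule functions (Python; the field names and rules are illustrative, not the DQC platform's API — uniqueness, being a cross‑record rule, would run over the full batch):

```python
import re

# Illustrative single-source quality rules: non-null, range, and format.
def check_record(rec: dict) -> list:
    """Return a list of rule violations for one map record."""
    errors = []
    if not rec.get("site_name"):
        errors.append("site_name is null or empty")        # non-null rule
    lat = rec.get("lat")
    if lat is None or not (-90 <= lat <= 90):
        errors.append("lat out of range")                  # range rule
    if not re.fullmatch(r"\d{6}", str(rec.get("site_code", ""))):
        errors.append("site_code format invalid")          # format rule
    return errors

assert check_record({"site_name": "S1", "lat": 31.2, "site_code": "310001"}) == []
assert "lat out of range" in check_record({"site_name": "S1", "lat": 120,
                                           "site_code": "310001"})
```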
Interface Semantic Monitoring
Monitors Web, RPC, message, and storage services. Three scenarios:
Service availability (alive, correct status code).
Response time (threshold‑based alerts).
Semantic correctness (field format, value rules, key‑field changes).
Our platform focuses on the third scenario, complementing the company’s existing monitoring.
Platform Design
Four functional modules plus a platform layer:
Data & Service Module – reads various data sources and invokes different service types.
Rule & Strategy Module – supports custom anomaly rules, alert strategies, scheduling, severity, and shielding.
Alert Feedback Module – manages alert channels, governance workflow, and integrates with ticketing systems.
Basic Function Module – user/permission management, task configuration, alert groups, dashboards, reports.
The platform layer ensures usability, flexibility, and unified integration.
System layering separates data ingestion, processing, storage, and presentation, facilitating scalability and reliability.
Practical Results
Algorithm Testing Platform
The AI testing platform implements data quality and interface semantic monitoring with five core features:
Data dashboard – shows task distribution, key metrics, and abnormal task reports.
Task management – filter, create, edit, trigger, view logs, and status.
Monitoring reports – visualize task information, multi‑form data, highlight anomalies.
Alert governance – end‑to‑end alert handling, notification, confirmation, ticket submission, and tracking.
Data correction – fix abnormal data with permission control and rollback.
Case Study – Data Quality Monitoring
By February 2021, eight data‑quality monitoring tasks (algorithm platform, supply chain, battery service) had discovered 18 issues (61% from offline testing, 39% from online monitoring).
For the battery‑swap scenario, 6 bugs were found (4 offline, 2 online); 67% were data‑processing‑logic bugs and 33% were non‑functional (capacity, performance).
Case Study – Site Data Update
11 issues were identified (7 offline, 4 online); 64% were data‑processing‑logic problems and 36% non‑functional (MQ loss, concurrency, sync gaps).
Case Study – Interface Semantic Monitoring
Two monitoring tasks (algorithm platform, supply chain) discovered one online issue: the ASR service became unavailable due to excessive invalid recordings after a hotline service change.
Implementation Details
Data comparison between the real‑time warehouse (ES) and the reference OssMap service is performed with tolerance thresholds. Example code:

// Alert only when the warehouse and OssMap battery levels diverge
// by more than the tolerance (here, 2 percentage points).
if (Math.abs(warehouseBattery - ossMapBattery) > 2) {
    triggerAlert();
} else {
    // values agree within tolerance; no alert needed
}

For semantic monitoring of the voice‑to‑text service, a known audio file (e.g., one saying "你好") is sent periodically; the returned text is compared with the expected phrase, and any mismatch generates an alert.
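The periodic voice‑to‑text probe described above can be sketched as follows (Python; the transcription client is stubbed, since the real ASR service is an internal call):

```python
# Sketch of a semantic probe for the ASR service: send a known audio
# sample and compare the transcript with the expected phrase, instead
# of relying on a status code that does not exist for internal calls.
def check_asr(transcribe, audio_sample: bytes, expected: str) -> bool:
    """Return True if the transcript matches; False should trigger an alert."""
    transcript = transcribe(audio_sample)   # internal ASR call (stubbed here)
    return transcript.strip() == expected

# Stub standing in for the real ASR client.
fake_transcribe = lambda audio: "你好"
assert check_asr(fake_transcribe, b"<audio>", "你好")
assert not check_asr(fake_transcribe, b"<audio>", "再见")
```

Because some recordings legitimately cannot be transcribed, the probe uses a fixed known‑good sample rather than live traffic.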
Platform Integration Process
Requirement alignment – prepare documentation and discuss with the platform team.
Write monitoring rule scripts – use the provided Python template to define cross‑source comparison logic.
Validate scripts – submit to GitLab and test in a staging environment.
Deploy monitoring tasks – after offline testing, launch tasks on the platform and monitor alerts.
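A monitoring rule script produced by this process might look roughly like the following (Python; the structure, source names, and `run` entry point are all hypothetical, not the platform's actual template):

```python
# Hypothetical skeleton of a cross-source comparison rule script.
def load_source(name: str) -> dict:
    """Fetch vehicle battery levels from one data source (stubbed here)."""
    sources = {
        "warehouse_es": {"v1": 35, "v2": 80},   # real-time warehouse (ES)
        "oss_map":      {"v1": 90, "v2": 80},   # reference OssMap service
    }
    return sources[name]

def run() -> list:
    """Return vehicle IDs whose levels diverge beyond the tolerance of 2."""
    es, ref = load_source("warehouse_es"), load_source("oss_map")
    return [vid for vid in es if abs(es[vid] - ref.get(vid, 0)) > 2]

assert run() == ["v1"]   # only v1 diverges; it would raise an alert
```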
Future Plans
Three dimensions of development:
Enhance existing capabilities: complete the data dashboard, task management, and monitoring reports, and add alert governance and data correction.
Expand business capabilities: extend monitoring to UI element checks by combining APP UI automation, real‑device testing, and the monitoring platform.
Co‑creation: invite more developers and testers to contribute to the functional business monitoring ecosystem.
The team aims to continuously improve reliability, alert effectiveness, and to empower other teams with easy‑to‑use monitoring solutions.
HelloTech
Official Hello technology account, sharing tech insights and developments.