How Qualitis Ensures High‑Availability Data Quality Monitoring on Big Data Platforms
Qualitis is a big‑data‑platform‑based data‑quality‑management service that defines, detects, and reports data‑set quality issues, featuring idempotent backend services, load‑balanced high‑availability, Zookeeper‑coordinated process synchronization, thread‑pool throttling, and clearly separated internal and external APIs.
Introduction
Data quality monitoring is a critical step in big‑data processing, providing the necessary support for data services, analytics and mining.
Project Overview
Qualitis is a data‑quality‑management service built on a big‑data platform. It offers a unified workflow to define, detect, and report data‑set quality issues in a timely manner.
Glossary
Project : a collection of rules that determines alert recipients and severity; it is a unit of task scheduling.
Rule : definition of a data‑quality model for a data source; it decides whether an alert is triggered and serves as the basic unit for task scheduling.
Application : a data‑quality‑checking task; executing the task yields quality verification results.
Overall Design
Architecture
Gray‑Release Design
Because each Qualitis backend service is idempotent, gray‑release is achieved by isolating a single backend instance so that it no longer receives user requests.
High‑Availability and Performance
Qualitis services are idempotent and can be deployed in multiple instances behind a load balancer to achieve both high availability and performance improvement.
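The interplay between load balancing and gray release can be illustrated with a toy round‑robin balancer. This is only a sketch: the class RoundRobinBalancer, its methods, and the instance names are hypothetical and do not come from Qualitis, which would normally sit behind a real load balancer. It relies on the property stated above: because the services are idempotent, any instance still in rotation can serve any request.

```python
from itertools import count


class RoundRobinBalancer:
    """Toy balancer: rotates requests over the instances still in rotation."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._isolated = set()
        self._counter = count()

    def isolate(self, instance):
        # Gray release: stop routing user requests to this instance.
        self._isolated.add(instance)

    def restore(self, instance):
        # Put the instance back into rotation once it has been verified.
        self._isolated.discard(instance)

    def pick(self):
        active = [i for i in self._instances if i not in self._isolated]
        if not active:
            raise RuntimeError("no instance available")
        return active[next(self._counter) % len(active)]


# Hypothetical instance names, for illustration only.
balancer = RoundRobinBalancer(["qualitis-a", "qualitis-b", "qualitis-c"])
balancer.isolate("qualitis-b")  # the instance under gray release
served = [balancer.pick() for _ in range(4)]
```

After isolate("qualitis-b"), requests alternate between the two remaining instances, so the isolated one can be upgraded and checked without user traffic.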
Additional performance ideas (not yet implemented) include query caching using a distributed cache to reduce database load and accelerate response times.
Multi‑Process Synchronization
Process synchronization is required because multiple Qualitis instances may simultaneously refresh monitoring task states. Qualitis uses Zookeeper to coordinate processes; instances compete to create an ephemeral node, and the winner becomes the Monitor responsible for task status updates.
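The competition for the ephemeral node can be sketched without a running Zookeeper ensemble. The example below uses an in‑memory stand‑in, FakeCoordinator, which is purely hypothetical and only mimics two properties of ephemeral nodes: creation is atomic, and the node vanishes when its owner's session ends. The node path and instance names are assumptions as well; a real deployment would use a Zookeeper client instead.

```python
import threading


class FakeCoordinator:
    """In-memory stand-in for Zookeeper ephemeral nodes (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owners = {}  # node path -> owning instance id

    def create_ephemeral(self, path, owner):
        # Creation is atomic: exactly one caller succeeds per path.
        with self._lock:
            if path in self._owners:
                return False
            self._owners[path] = owner
            return True

    def close_session(self, owner):
        # An ephemeral node disappears when its owner's session ends.
        with self._lock:
            for path in [p for p, o in self._owners.items() if o == owner]:
                del self._owners[path]

    def owner_of(self, path):
        with self._lock:
            return self._owners.get(path)


MONITOR_PATH = "/qualitis/monitor"  # hypothetical node path


def compete(coordinator, instance_id, winners):
    # Every instance tries to create the node; only the winner becomes Monitor.
    if coordinator.create_ephemeral(MONITOR_PATH, instance_id):
        winners.append(instance_id)


coordinator = FakeCoordinator()
winners = []
threads = [
    threading.Thread(target=compete, args=(coordinator, f"instance-{n}", winners))
    for n in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# If the Monitor's session ends, the node is released and another
# instance can take over the role.
coordinator.close_session(winners[0])
took_over = coordinator.create_ephemeral(MONITOR_PATH, "instance-standby")
```

Exactly one of the five competing instances ends up in winners, and the standby instance succeeds only after the first Monitor's session has been closed.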
Thread Throttling
When monitoring tasks submit requests to the Hive Metastore, a high request volume can overload it. Qualitis employs a thread‑pool throttling mechanism: if no thread is available, the task waits until one is obtained before connecting to the metastore.
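A minimal sketch of this kind of throttling, assuming a fixed‑size thread pool, is shown below. The limit of 3, the function query_metastore, and the table names are hypothetical; the function only simulates a metastore call and records how many calls run at once. Tasks submitted while all workers are busy simply queue until a thread is free.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_METASTORE_THREADS = 3  # hypothetical limit, not a Qualitis default

stats = {"active": 0, "peak": 0}
stats_lock = threading.Lock()


def query_metastore(table):
    """Placeholder for a real Hive Metastore call; tracks concurrency."""
    with stats_lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # simulate the round trip to the metastore
    with stats_lock:
        stats["active"] -= 1
    return f"metadata:{table}"


tables = [f"table_{n}" for n in range(10)]

# The pool size caps concurrent metastore connections; extra tasks wait
# in the executor's queue until a worker thread becomes available.
with ThreadPoolExecutor(max_workers=MAX_METASTORE_THREADS) as pool:
    results = list(pool.map(query_metastore, tables))
```

All ten tasks complete, but stats["peak"] never exceeds the pool size, which is the property that protects the metastore.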
Module Design
Module Diagram
Use‑Case Diagram
API Design
Internal APIs
Internal APIs fall into two categories, which keeps permissions separate: Administrator APIs (/qualitis/api/v1/admin/*) and User APIs (/qualitis/api/v1/projector/*).
External APIs
External endpoint pattern: /qualitis/outer/api/v1/*. Calls must include the following query parameters:
app_id (string): system‑assigned application identifier.
timestamp (string): millisecond‑level timestamp, valid for 7 days.
nonce (string): random string of length 5.
signature (string): MD5(MD5(app_id + nonce + timestamp) + appToken), a 32‑character lowercase hash.
app_id and appToken must be granted by an administrator.
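The signing scheme above can be sketched as follows. Several details are assumptions not stated in the source: that the inner MD5 is taken as a lowercase hex string before the token is appended, that strings are UTF‑8 encoded, that the nonce is alphanumeric, and that the 7‑day validity is checked as the age of the timestamp. The function names and the demo credentials are hypothetical.

```python
import hashlib
import hmac
import secrets
import string
import time

SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000


def md5_hex(text):
    # Assumption: UTF-8 input, lowercase hex output (32 characters).
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sign(app_id, app_token, nonce, timestamp):
    # MD5(MD5(app_id + nonce + timestamp) + appToken)
    return md5_hex(md5_hex(app_id + nonce + timestamp) + app_token)


def new_nonce(length=5):
    # Assumption: the nonce is drawn from letters and digits.
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def build_query(app_id, app_token, now_ms=None):
    """Build the four query parameters for an external API call."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    timestamp = str(now_ms)
    nonce = new_nonce()
    return {
        "app_id": app_id,
        "timestamp": timestamp,
        "nonce": nonce,
        "signature": sign(app_id, app_token, nonce, timestamp),
    }


def is_valid(params, app_token, now_ms):
    """Server-side check (sketch): timestamp age, then signature match."""
    if now_ms - int(params["timestamp"]) > SEVEN_DAYS_MS:
        return False
    expected = sign(
        params["app_id"], app_token, params["nonce"], params["timestamp"]
    )
    return hmac.compare_digest(expected, params["signature"])


# Demo credentials are placeholders, not real Qualitis values.
params = build_query("demo_app", "demo_token", now_ms=1_700_000_000_000)
```

The resulting dictionary can be appended as query parameters to a /qualitis/outer/api/v1/* request. Because appToken never travels with the request, only a caller holding the token can produce a matching signature.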
System Engineering Structure
The system consists of two layers: a Web layer (Controller and Service) that exposes services, and a Core layer containing core business logic and storage components.
IT Architects Alliance
Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.