Automating Consumer Insight Testing with Spark, Hive, and ClickHouse
This article explains how to build a big‑data consumer insight platform using Spark applications, Hive, MySQL and ClickHouse, and how to automate data validation and algorithm testing to improve coverage, efficiency, and reliability of insight services.
What is Consumer Insight?
Consumer insight builds on big data, adding a layer of analysis to provide valuable reports that reflect consumer status and guide enterprise decisions, effectively turning traditional consulting into a SaaS‑style, data‑driven service.
Business Implementation
Overall Architecture
Data Ingestion and Storage
We wrote a Spark application, deployed on the NetEase Mummu data platform, that syncs wide tables and tag tables from Hive to MySQL and ClickHouse:
Write tag categories, tags, and tag enumerations to MySQL.
Read wide tables from Hive and insert directly into ClickHouse.
Data Service
User Portrait: Provide various tag combinations for different industries, generating visual insight portraits.
Tag Management: Manage tag hierarchy and special tags, exposing APIs for external tag queries; all tag data resides in MySQL.
Testing the correctness of insight algorithms and synchronized tags is critical to ensure reliable user portraits.
Data Flow
Consumer insight data is cleaned, analyzed, and stored in Hive; Spark extracts wide tables to ClickHouse and tag tables to MySQL, and the business layer reads from ClickHouse for presentation.
Current Situation
Large data volume makes manual tag rule verification time‑consuming (often >1 day).
Numerous insight algorithms increase regression workload.
Frequent data updates cause repetitive testing effort.
Solution
Data Validation Automation
After data sync, an automated QA tag‑validation mechanism is triggered via an API. The platform validates tag data asynchronously and alerts on issues.
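A minimal sketch of the asynchronous trigger described above, assuming the check itself is passed in as a function that returns a list of problems. The class name AsyncValidationTrigger, the in-memory alerts list, and the method signature are all hypothetical; the article does not show how the real platform schedules validation or delivers alerts.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Supplier;

/** Hypothetical sketch: run a validation in the background and record alerts. */
final class AsyncValidationTrigger {
    /** Alerts are kept in memory here; a real system would push them to an alerting channel. */
    final List<String> alerts = new CopyOnWriteArrayList<>();

    /**
     * Starts the check on a background thread and returns immediately.
     * The supplier returns the problems it found; an empty list means the data passed.
     */
    CompletableFuture<Boolean> trigger(String projectId, Supplier<List<String>> check) {
        return CompletableFuture.supplyAsync(check)
                .handle((problems, error) -> {
                    if (error != null) {
                        // The check itself failed to run; treat that as an alert too.
                        alerts.add(projectId + ": validation crashed: " + error.getMessage());
                        return false;
                    }
                    for (String problem : problems) {
                        alerts.add(projectId + ": " + problem);
                    }
                    return problems.isEmpty();
                });
    }
}
```

The caller that finishes the sync can fire trigger(...) and move on; anything that later needs the verdict can wait on the returned future.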
DataValidataController exposes the API, and ValidataImpl implements the following checks:
Data completeness checks: volume, tag count, period‑over‑period, tag search.
Data accuracy checks: uniqueness and relational integrity.
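The checks listed above can be sketched in plain Java as follows. This is an illustration under assumptions, not the ValidataImpl code: the class TagChecks, its method names, and the 20% threshold used in the usage note are hypothetical, and the inputs are taken as already-fetched counts and id lists rather than live query results.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of completeness and accuracy checks on synced tag data. */
final class TagChecks {
    /** Completeness: the synced target must hold as many rows as the source. */
    static boolean volumeMatches(long sourceRows, long targetRows) {
        return sourceRows == targetRows;
    }

    /** Completeness: relative change against the previous sync must not exceed maxChange. */
    static boolean periodOverPeriodOk(long previous, long current, double maxChange) {
        if (previous == 0) {
            return current == 0; // no baseline, so any growth is flagged for review
        }
        double change = Math.abs(current - previous) / (double) previous;
        return change <= maxChange;
    }

    /** Accuracy (uniqueness): ids that occur more than once, in first-seen order. */
    static List<String> duplicateIds(List<String> ids) {
        Set<String> seen = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String id : ids) {
            if (!seen.add(id) && !duplicates.contains(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    /** Accuracy (relational integrity): tag ids used in rows but absent from the tag dictionary. */
    static Set<String> orphanTags(Set<String> usedTagIds, Set<String> knownTagIds) {
        Set<String> orphans = new TreeSet<>(usedTagIds);
        orphans.removeAll(knownTagIds);
        return orphans;
    }
}
```

For example, periodOverPeriodOk(1000, 1100, 0.2) accepts a 10% increase, while a jump from 1000 to 1500 is rejected under the same 20% limit.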
Insight Algorithm Service
The service generates algorithmic SQL automatically based on project information, avoiding manual SQL entry, and offers aggregated insight APIs.
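One way such SQL generation could look is sketched below. The class InsightSqlBuilder, the wide_table_ naming scheme, and the column aliases are assumptions made for illustration; the article does not describe how the real service maps a project to its table or which queries it emits. Identifiers are checked against a strict pattern because they are concatenated into the statement.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: derive ClickHouse queries from project information. */
final class InsightSqlBuilder {
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Rejects anything that is not a plain identifier, since it ends up inside SQL text. */
    private static String ident(String name) {
        if (name == null || !IDENTIFIER.matcher(name).matches()) {
            throw new IllegalArgumentException("unsafe identifier: " + name);
        }
        return name;
    }

    /** Assumed naming scheme: one wide table per project. */
    static String tableFor(String projectId) {
        return "wide_table_" + ident(projectId);
    }

    /** Distribution of one tag: number of users per tag value. */
    static String summarySql(String projectId, String key) {
        String column = ident(key);
        return "SELECT " + column + " AS tag_value, count() AS users FROM " + tableFor(projectId)
                + " GROUP BY " + column + " ORDER BY users DESC";
    }

    /** Average of one numeric tag across the project's users. */
    static String avgSql(String projectId, String key) {
        return "SELECT avg(" + ident(key) + ") AS avg_value FROM " + tableFor(projectId);
    }
}
```

Because the test framework and the service can share a builder like this, a test only needs the project id and the tag key to reproduce the query under test.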
Overall Idea: Implement an automated testing framework for consumer insight using ClickHouse.
Application Layer
Aggregates algorithms for product insight, competitor insight, replacement insight, and industry segmentation, exposing them via API services.
@RequestMapping("/userProfile")
public String userProfile(@RequestParam String projectId)
@RequestMapping("/insight/summary")
public String summaryInsight(@RequestParam String projectId, @RequestParam String key)
@RequestMapping("/insight/avg")
public String avgInsight(@RequestParam String projectId, @RequestParam String key)
Connecting to ClickHouse is as simple as connecting to MySQL.
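To illustrate how little differs from a MySQL connection, here is a minimal JDBC sketch. It assumes the official ClickHouse JDBC driver is on the classpath and that the server listens on its default HTTP port 8123; the class ClickHouseClient and its methods are hypothetical helpers, not part of the article's service.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical sketch: query ClickHouse through plain JDBC, as one would MySQL. */
final class ClickHouseClient {
    /** Builds the JDBC URL; only the scheme differs from a jdbc:mysql:// URL. */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:clickhouse://" + host + ":" + port + "/" + database;
    }

    /** Runs a query expected to return one numeric value, such as a row count. */
    static long queryLong(String url, String sql) throws SQLException {
        try (Connection connection = DriverManager.getConnection(url);
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(sql)) {
            if (!resultSet.next()) {
                throw new SQLException("query returned no rows: " + sql);
            }
            return resultSet.getLong(1);
        }
    }
}
```

A test could then call queryLong(jdbcUrl("localhost", 8123, "insight"), "SELECT count() FROM some_table"), where the database and table names are placeholders.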
CI Integration
Integrate goapi (API management) with overmind (R&D efficiency platform) to trigger automated test scenarios and assertions when insight features are submitted.
Common Pitfalls in Big‑Data Insight Validation
Avoid poorly performing ClickHouse functions such as hasAny; use row‑to‑column transformations instead.
Handle high‑precision calculations carefully: ClickHouse floating‑point arithmetic (Float32/Float64) can lose precision, so compare results with a tolerance or use Decimal types.
Complex WHERE clauses can degrade performance, especially when multiple projects sync simultaneously.
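For the precision pitfall above, a test assertion can compare an exactly computed expected value against the floating-point result with a tolerance instead of strict equality. The sketch below is an assumption about how such an assertion might be written; PrecisionCheck and its methods are hypothetical names.

```java
import java.math.BigDecimal;
import java.util.List;

/** Hypothetical sketch: tolerance-based comparison for floating-point query results. */
final class PrecisionCheck {
    /** Sums decimal literals exactly, to serve as the expected value in a test. */
    static BigDecimal exactSum(List<String> decimalLiterals) {
        BigDecimal sum = BigDecimal.ZERO;
        for (String literal : decimalLiterals) {
            sum = sum.add(new BigDecimal(literal));
        }
        return sum;
    }

    /** True when the floating-point result lies within the tolerance of the exact value. */
    static boolean closeEnough(BigDecimal expected, double actual, BigDecimal tolerance) {
        BigDecimal difference = expected.subtract(BigDecimal.valueOf(actual)).abs();
        return difference.compareTo(tolerance) <= 0;
    }
}
```

In double arithmetic 0.1 + 0.2 is not equal to 0.3, so a strict equality assertion on such a sum fails even though the algorithm is correct; closeEnough with a small tolerance accepts it while still rejecting a genuinely wrong value.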
Results
Test Coverage Improvement
Before: Covered tag volume and tag search.
After: Covered data volume, tag volume, period‑over‑period, tag up/down, tag search, uniqueness, and relational checks.
Test Efficiency Improvement
Before: Data sync testing required 1 person‑day, algorithm testing 2 person‑days.
After: No manual testing effort needed.
Conclusion
Consumer insight services will continue to support new industry analyses, but the core remains parsing user tags and behavior into portraits and insights; the data and algorithm pillars can be iteratively improved through automated big‑data testing.
NetEase Smart Enterprise Tech+
