Automating Consumer Insight Testing with Spark, Hive, and ClickHouse

This article explains how to build a big‑data consumer insight platform using Spark applications, Hive, MySQL and ClickHouse, and how to automate data validation and algorithm testing to improve coverage, efficiency, and reliability of insight services.

NetEase Smart Enterprise Tech+

What is Consumer Insight?

Consumer insight builds on big data, adding a layer of analysis to provide valuable reports that reflect consumer status and guide enterprise decisions, effectively turning traditional consulting into a SaaS‑style, data‑driven service.

Business Implementation

Overall Architecture

Data Ingestion and Storage

We wrote a Spark application, deployed on the NetEase Mummu data platform, that syncs wide tables and tag tables from Hive to MySQL and ClickHouse:

Write tag categories, tags, and tag enumerations to MySQL.

Read wide tables from Hive and insert directly into ClickHouse.
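The two sync steps above could be wired up through Spark's generic JDBC data source. The sketch below is not the platform's actual job: the SinkOptions helper, the driver class names, and the table names in the usage comment are assumptions for illustration. It only assembles the option map (url, dbtable, driver, batchsize) that Spark's JDBC writer reads, so it runs without Spark on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds option maps for Spark's "jdbc" data source (hypothetical helper). */
final class SinkOptions {

    // Driver class names are assumptions; check the driver versions you deploy.
    static final String MYSQL_DRIVER = "com.mysql.cj.jdbc.Driver";
    static final String CLICKHOUSE_DRIVER = "com.clickhouse.jdbc.ClickHouseDriver";

    private SinkOptions() {}

    /** Returns the url, dbtable, driver and batchsize options for one sink table. */
    static Map<String, String> jdbc(String url, String table, String driver, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        Map<String, String> options = new LinkedHashMap<>();
        options.put("url", url);
        options.put("dbtable", table);
        options.put("driver", driver);
        options.put("batchsize", Integer.toString(batchSize));
        return options;
    }
}

// Usage inside the Spark job (needs Spark; table names are made up, and the
// target table is assumed to exist already):
//   Dataset<Row> wide = spark.table("dw.user_wide");
//   wide.write().format("jdbc")
//       .options(SinkOptions.jdbc(url, "user_wide", SinkOptions.CLICKHOUSE_DRIVER, 10000))
//       .mode(SaveMode.Append)
//       .save();
```

The same helper serves the MySQL tag tables by passing MYSQL_DRIVER and a MySQL URL, which keeps both sinks on one code path.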

Data Service

User Portrait: Provides tag combinations tailored to different industries and generates visual insight portraits.

Tag Management: Manage tag hierarchy and special tags, exposing APIs for external tag queries; all tag data resides in MySQL.

Testing the correctness of insight algorithms and synchronized tags is critical to ensure reliable user portraits.

Data Flow

Consumer insight data is cleaned, analyzed, and stored in Hive; Spark extracts wide tables to ClickHouse and tag tables to MySQL, and the business layer reads from ClickHouse for presentation.

Current Situation

Large data volume makes manual tag rule verification time‑consuming (often >1 day).

Numerous insight algorithms increase regression workload.

Frequent data updates cause repetitive testing effort.

Solution

Data Validation Automation

After data sync, an automated QA tag‑validation mechanism is triggered via an API. The platform validates tag data asynchronously and alerts on issues.
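A minimal sketch of the "validate asynchronously, alert on issues" flow, using only the JDK. The AsyncValidation class and its parameters are hypothetical; the platform's real trigger and alert channel are not described here. The checks run off the calling thread, and the alerter is invoked only when problems are found.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Runs validation off the request thread and alerts on problems (hypothetical). */
final class AsyncValidation {

    private AsyncValidation() {}

    /**
     * @param checks  runs all checks and returns human-readable problems
     * @param alerter receives the problems; called only when the list is non-empty
     * @return a future that completes with the number of problems found
     */
    static CompletableFuture<Integer> trigger(Supplier<List<String>> checks,
                                              Consumer<List<String>> alerter) {
        return CompletableFuture.supplyAsync(checks)
                // A crashed check is itself a finding, so report it as a problem.
                .exceptionally(e -> List.of("validation crashed: " + e))
                .thenApply(problems -> {
                    if (!problems.isEmpty()) {
                        alerter.accept(problems);
                    }
                    return problems.size();
                });
    }
}
```

An API handler would call trigger(...) and return immediately; a test can call join() on the returned future to wait for the outcome.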

DataValidataController exposes the API, and ValidataImpl implements two groups of checks:

Data completeness checks: data volume, tag count, period‑over‑period change, and tag search.

Data accuracy checks: uniqueness and relational integrity.
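To make the checks above concrete, here is a sketch of four of them (volume, period‑over‑period, uniqueness, relational integrity) as plain Java functions. It is not the ValidataImpl code: the TagChecks class, its method names, and the threshold parameter are assumptions, and in practice the inputs would come from count and key queries against Hive, MySQL, and ClickHouse.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Pure-Java versions of some completeness and accuracy checks (hypothetical). */
final class TagChecks {

    private TagChecks() {}

    /** Volume: the synced row count must equal the source row count. */
    static boolean volumeMatches(long sourceRows, long sinkRows) {
        return sourceRows == sinkRows;
    }

    /** Period-over-period: relative change versus the previous sync is at most maxChange. */
    static boolean periodOverPeriodOk(long previous, long current, double maxChange) {
        if (previous == 0) {
            return current == 0; // no baseline: only "still empty" is accepted
        }
        double change = Math.abs(current - previous) / (double) previous;
        return change <= maxChange;
    }

    /** Uniqueness: returns every key that appears more than once. */
    static Set<String> duplicates(List<String> keys) {
        Set<String> seen = new HashSet<>();
        Set<String> dups = new TreeSet<>();
        for (String key : keys) {
            if (!seen.add(key)) {
                dups.add(key);
            }
        }
        return dups;
    }

    /** Relational integrity: referenced parent ids that do not exist. */
    static Set<String> orphans(Collection<String> referencedIds, Set<String> existingIds) {
        Set<String> missing = new TreeSet<>(referencedIds);
        missing.removeAll(existingIds);
        return missing;
    }
}
```

For example, periodOverPeriodOk(1000, 1040, 0.05) accepts a 4% increase, while a drop from 1000 to 800 rows fails the same 5% threshold.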

Insight Algorithm Service

The service generates algorithmic SQL automatically based on project information, avoiding manual SQL entry, and offers aggregated insight APIs.
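One way such SQL generation might look, as a sketch only: the InsightSql class, the project_id column, and the shape of the two queries are assumptions, not the service's real algorithms. Table and column names are checked against a strict pattern because identifiers cannot be bound as JDBC parameters, while the project id stays a ? placeholder.

```java
import java.util.regex.Pattern;

/** Generates aggregate SQL from project information (hypothetical sketch). */
final class InsightSql {

    // Identifiers cannot be bound as parameters, so restrict them to a safe pattern.
    private static final Pattern IDENT = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    private InsightSql() {}

    /** Average of one numeric column for one project. */
    static String avg(String table, String column) {
        return "SELECT avg(" + ident(column) + ") AS avg_value FROM " + ident(table)
                + " WHERE project_id = ?";
    }

    /** Row count per distinct value of one tag column for one project. */
    static String summary(String table, String column) {
        String col = ident(column);
        return "SELECT " + col + " AS tag, count() AS cnt FROM " + ident(table)
                + " WHERE project_id = ? GROUP BY " + col + " ORDER BY cnt DESC";
    }

    private static String ident(String name) {
        if (name == null || !IDENT.matcher(name).matches()) {
            throw new IllegalArgumentException("unsafe identifier: " + name);
        }
        return name;
    }
}
```

A test scenario can then execute the generated statement and compare the result with an independently computed expectation, instead of a tester writing each query by hand.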

Overall Idea: Implement an automated testing framework for consumer insight using ClickHouse.

Application Layer

Aggregates algorithms for product insight, competitor insight, replacement insight, and industry segmentation, exposing them via API services.

// User portrait for the given project
@RequestMapping("/userProfile")
public String userProfile(@RequestParam String projectId)
// Summary insight for the given project and key
@RequestMapping("/insight/summary")
public String summaryInsight(@RequestParam String projectId, @RequestParam String key)
// Average insight for the given project and key
@RequestMapping("/insight/avg")
public String avgInsight(@RequestParam String projectId, @RequestParam String key)

Connecting to ClickHouse is as simple as connecting to MySQL.
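To illustrate the point, the sketch below goes through plain JDBC for both stores, so only the URL differs. The Stores class and the host and database names used with it are hypothetical, the ports are the usual defaults (3306 for MySQL, 8123 for ClickHouse's HTTP interface), and the matching JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** One JDBC code path for both stores; only the URL differs (hypothetical sketch). */
final class Stores {

    private Stores() {}

    static String mysqlUrl(String host, String db) {
        return "jdbc:mysql://" + host + ":3306/" + db;
    }

    static String clickHouseUrl(String host, String db) {
        return "jdbc:clickhouse://" + host + ":8123/" + db;
    }

    /** Runs a query with one string parameter and returns the first column as a long. */
    static long queryLong(String url, String user, String password,
                          String sql, String param) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, password);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, param);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new SQLException("query returned no rows");
                }
                return rs.getLong(1);
            }
        }
    }
}
```

Calling queryLong with clickHouseUrl(...) or mysqlUrl(...) and the same count query is enough to compare row counts across the two stores.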

CI Integration

Integrate goapi (API management) with overmind (R&D efficiency platform) to trigger automated test scenarios and assertions when insight features are submitted.

Common Pitfalls in Big‑Data Insight Validation

Avoid ClickHouse functions that perform poorly at this scale, such as hasAny; use row‑to‑column transformations instead.

Handle high‑precision calculations carefully, as ClickHouse may lose precision.

Complex WHERE clauses can degrade performance, especially when multiple projects sync simultaneously.
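For the precision pitfall, one option is to assert against a tolerance rather than exact equality. This is a sketch under the assumption that the differences come from floating‑point arithmetic; the PrecisionAssert class and the tolerance values are illustrative and not part of the original framework.

```java
import java.math.BigDecimal;

/** Tolerance-based comparison for computed metrics (hypothetical helper). */
final class PrecisionAssert {

    private PrecisionAssert() {}

    /** True when the absolute difference between the values is at most the tolerance. */
    static boolean withinTolerance(BigDecimal expected, BigDecimal actual, BigDecimal tolerance) {
        return expected.subtract(actual).abs().compareTo(tolerance) <= 0;
    }
}
```

In Java doubles, 0.1 + 0.2 is not exactly 0.3, so an exact comparison of such a sum against 0.3 fails, while withinTolerance with a tolerance of 1e-9 accepts it.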

Results

Test Coverage Improvement

Before: Covered tag volume and tag search.

After: Covered data volume, tag volume, period‑over‑period, tag up/down, tag search, uniqueness, and relational checks.

Test Efficiency Improvement

Before: Data sync testing required 1 person‑day, algorithm testing 2 person‑days.

After: No manual testing effort needed.

Conclusion

Consumer insight services will continue to support new industry analyses, but the core remains parsing user tags and behavior into portraits and insights; the data and algorithm pillars can be iteratively improved through automated big‑data testing.
