Comprehensive Guide to Data Collection, Event Modeling, and Tracking in Big Data Platforms
The guide explains how comprehensive data collection in big‑data platforms relies on a standardized event model, passive and code‑based embedding, multi‑platform SDKs, a log‑middleware layer, precise location tracking, and an embedding management platform that supports workflow, testing, quality monitoring, and scalable infrastructure for future enhancements.
1. Introduction
Big data applications generally consist of five stages: collection, processing, storage, computation, and visualization. The collection stage is the source of everything downstream; only when data is collected comprehensively, accurately, and in a timely manner can the resulting metrics be trusted.
Embedding (埋点) is an important collection method that transforms user behavior into data assets, supporting product analysis, business decisions, and ad recommendation.
When business demand is low, simple methods can quickly collect user behavior. However, with many business lines, terminals, and diverse data needs, a well‑designed embedding model, collection specifications, and tool‑/platform‑/process‑based management are required to ensure quality.
2. Event Model
who: visitor identifier, device fingerprint, login ID
when: event occurrence time, report time
where: device environment, network environment, business environment, etc.
what: event identifier, event parameters
We designed a log model that can carry the above information while maintaining extensibility, mapping data to schema fields to fully record a single user action.
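As a concrete illustration, the who/when/where/what dimensions above can be mapped onto a nested log record. The field names below are hypothetical, not the platform's actual schema; this is only a minimal sketch of the shape such a model might take.

```python
import json
import time

def build_event_log(visitor_id, device_fingerprint, login_id,
                    event_id, event_params, context):
    """Assemble a single behavior log covering who/when/where/what.

    All field names are illustrative; the real schema is defined
    by the platform's log model.
    """
    return {
        # who: visitor identity, device fingerprint, login ID
        "who": {
            "visitor_id": visitor_id,
            "device_fingerprint": device_fingerprint,
            "login_id": login_id,
        },
        # when: event time and report time may differ (caching, retries)
        "when": {
            "event_time": int(time.time() * 1000),
            "report_time": None,  # filled in by the upload layer
        },
        # where: device, network, and business environment
        "where": context,
        # what: event identifier plus free-form business parameters
        "what": {
            "event_id": event_id,
            "params": event_params,
        },
    }

log = build_event_log(
    visitor_id="v-123", device_fingerprint="fp-abc", login_id="u-42",
    event_id="goods_click",
    event_params={"goods_id": 1001},
    context={"platform": "js", "network": "wifi", "shop_id": 7},
)
print(json.dumps(log, indent=2))
```

Keeping each dimension in its own sub-object leaves room for extensibility: new environment or identity fields can be added without breaking existing consumers.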
3. Collection Methods
3.1 Passive (All‑in‑One) Embedding
Passive embedding uses the browser's or the app's built-in listeners to collect page visits, clicks, and similar actions. The collected information includes:
Page URL, app package name, etc.
Element XPath, title, or other agreed-upon DOM attributes
Advantages:
Low front‑end integration cost, no extra development needed
Complete user action collection without loss
Problems:
Both useful and useless data are collected
Cannot capture special actions or business parameters
Collected data requires secondary annotation before analysts can interpret it
Dynamic button positions, duplicate names, or page refactoring make accurate identification difficult
Passive embedding is generally used for coarse‑grained rapid business exploration.
3.2 Code Embedding
Code embedding relies on front‑end developers to customize listeners and collection logic.
Advantages:
Clear event identifiers
Rich business parameters
Flexible trigger mechanisms
Easier and more precise analysis
Problems:
Front‑end development and management cost
Only events after code deployment can be collected
When passive embedding cannot meet analysis needs, code embedding is required.
4. Embedding SDK
To simplify front‑end developers' work, Youzan provides SDKs for multiple platforms (js, mini‑program, Android, iOS, Java) that handle visitor identification, session management, environment parameter collection, parameter lifecycle, default event collection, cross‑platform communication, special business logic, log formatting, merging, lifecycle, and upload mechanisms.
Front‑end developers only need to focus on:
SDK initialization configuration
Event identification
Event parameters
Event triggering
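The four integration concerns above can be sketched as a minimal SDK client. The class and method names here are hypothetical stand-ins, not Youzan's actual SDK API; the point is only to show how little surface area the front-end developer needs to touch.

```python
class Tracker:
    """Minimal stand-in for an embedding SDK client (names are illustrative)."""

    def __init__(self, app_id, auto_track=True):
        # 1. SDK initialization configuration
        self.app_id = app_id
        self.auto_track = auto_track  # whether passive collection is on
        self.queue = []               # logs awaiting batch upload

    def track(self, event_id, params=None):
        # 2. event identifier, 3. event parameters, 4. event trigger
        self.queue.append({"event_id": event_id, "params": params or {}})

tracker = Tracker(app_id="wsc", auto_track=True)
tracker.track("add_to_cart", {"goods_id": 1001, "price": 1999})
print(tracker.queue)
```

Everything else listed in section 4 (visitor identification, session management, merging, upload) would live inside the SDK, invisible to the caller.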
5. Log Middleware Layer
Raw logs are highly compact and need further processing into a middle‑layer format, involving:
Splitting batch‑reported logs
Formatting log model
Secondary processing and dimension expansion (e.g., IP and http_agent parsing)
Cleaning abnormal traffic
Supplementing session information (e.g., landing page, second page, dwell-time calculation)
Business‑based log stream and table partitioning
The real-time middle layer is stored as JSON in Kafka, with corresponding JavaBean classes for stream processing. The offline layer stores the same schema in tables partitioned by date and business; view tables are created automatically, allowing unified adjustments and data-warehouse permission management.
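The first two middle-layer steps, splitting batch-reported logs and expanding dimensions, can be sketched as follows. The schema and the `parse_ip` / `parse_agent` hooks are assumptions standing in for real lookups (GeoIP, user-agent parsing); this is not the production ETL.

```python
import json

def to_middle_layer(raw_batch, parse_ip, parse_agent):
    """Split one batch-reported raw log into per-event middle-layer records.

    `parse_ip` and `parse_agent` are stand-ins for real dimension-expansion
    lookups; the field names are illustrative.
    """
    batch = json.loads(raw_batch)
    common = batch["common"]          # shared who/where fields for the batch
    records = []
    for ev in batch["events"]:        # one raw log may carry many events
        rec = dict(common)            # copy the shared context per event
        rec.update(ev)
        rec["region"] = parse_ip(common["ip"])
        rec["browser"] = parse_agent(common["http_agent"])
        records.append(rec)
    return records

raw = json.dumps({
    "common": {"visitor_id": "v-1", "ip": "1.2.3.4", "http_agent": "Mozilla/5.0"},
    "events": [{"event_id": "enterpage"}, {"event_id": "goods_click"}],
})
rows = to_middle_layer(raw, parse_ip=lambda ip: "cn-hz", parse_agent=lambda ua: "chrome")
print([r["event_id"] for r in rows])
```

In production this logic would run inside the Flink ETL job described in section 9, consuming the raw-log Kafka topic and writing the flattened records back to a middle-layer topic.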
At this stage, with a common log model and SDK, embedding work can be standardized, but increasing business lines bring new challenges.
6. Position Tracking Specification
Precise location tracking is needed for fine‑grained operations and algorithmic recommendation. A unified location specification avoids duplicated development across business lines.
Location is divided into four granularity levels:
Business
Page domain (page type and page ID)
Component domain (component type and index)
Slot domain (slot identifier and index)
Business + page domain + component domain + slot domain + page random code uniquely determines a visit location. The derived dimensions enable easy analysis of exposure, click, and visit metrics at each granularity.
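The composition rule above can be sketched as a key builder. The separator and field order are assumptions for illustration; the real specification defines its own encoding.

```python
def location_key(business, page_type, page_id,
                 component_type, component_index,
                 slot_id, slot_index, page_random_code):
    """Compose a unique visit-location identifier from the four
    granularity levels plus the per-visit page random code.
    (Separator and ordering are illustrative, not the real spec.)
    """
    return "~".join((
        business,                                # business
        f"{page_type}:{page_id}",                # page domain
        f"{component_type}:{component_index}",   # component domain
        f"{slot_id}:{slot_index}",               # slot domain
        page_random_code,                        # distinguishes repeat visits
    ))

key = location_key("wsc", "goods_detail", "1001",
                   "recommend_list", 0, "slot", 2, "r9f3")
print(key)
```

Because each granularity level occupies a fixed position in the key, exposure, click, and visit metrics can be rolled up at any level simply by truncating or grouping on the corresponding prefix.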
7. Embedding Management Platform
Early on, embedding schemes were recorded in a wiki, which caused inconsistencies, missing key information, difficulty in event discovery, fragmented updates, and poor progress monitoring.
The platform provides:
Embedding metadata management and open API
Embedding workflow management
With metadata, additional capabilities become possible:
Automatic embedding testing
Self‑service analysis
Development hints
Quality monitoring
7.1 Embedding Metadata Management
Metadata consists of:
Business (业务), page (页面), component (组件), slot (展位), and event (事件).
Definitions:
Business: uniquely identified by business type (e.g., micro-mall, retail) and SDK type (js / mini-program / android / ios / java). Pages, components, slots, and events belong to a single business.
Page: a class of web or mobile pages sharing the same structure.
Component: a block within a page, possibly reused across pages.
Slot: the finest granularity within a component, identified by incremental order, fixed position, or a regular expression.
Event: the basic unit of embedding, representing a user action (page entry, button click, product exposure, etc.) with optional parameters. Events can be global, page-level, or component-level.
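The containment relationships among these five entities can be sketched as a small data model. The field names are assumptions for illustration; the platform's actual metadata schema will differ.

```python
from dataclasses import dataclass, field

# Illustrative metadata entities mirroring the five definitions above.

@dataclass
class Event:
    event_id: str
    scope: str                       # "global" | "page" | "component"
    params: list = field(default_factory=list)

@dataclass
class Slot:
    slot_id: str
    match_rule: str                  # incremental order, fixed position, or regex

@dataclass
class Component:
    component_type: str
    slots: list = field(default_factory=list)

@dataclass
class Page:
    page_type: str
    components: list = field(default_factory=list)

@dataclass
class Business:
    business_type: str               # e.g. micro-mall, retail
    sdk_type: str                    # js / mini-program / android / ios / java
    pages: list = field(default_factory=list)
    events: list = field(default_factory=list)

biz = Business(
    business_type="micro-mall", sdk_type="js",
    pages=[Page("goods_detail",
                [Component("recommend_list", [Slot("item", "incremental")])])],
    events=[Event("goods_click", "component", ["goods_id"])],
)
print(biz.business_type, len(biz.pages), len(biz.events))
```

Keying everything under a single `Business` matches the rule that pages, components, slots, and events belong to exactly one business.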
7.2 Project Workflow Management
When a new project starts, a batch of embedding requirements is created. Project‑level management involves PMs, front‑end, data, BI, and testing teams, tracking stages from initiation, design, development, integration, to launch.
7.3 Embedding Testing
Pre‑release testing directly impacts data quality. Early testing used packet capture tools with manual verification, which was inefficient and error‑prone. The platform now offers an online testing function:
Test users input project and visitor ID; the module stores the ID in Redis.
Consume real‑time logs, periodically sync embedding metadata and visitor sets, validate logs, and collect them into the platform.
Return collected real‑time logs to the user.
Summarize tested events and generate overview data.
Testing items include log format compliance, completeness of common business parameters, registration of business/page/component/event, and parameter completeness/format.
Detection levels are Warning or Error, with corresponding messages.
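The per-log checks described above can be sketched as a validator that returns Warning/Error findings. The rules, field names, and messages here are illustrative, modeled on the testing items listed, not the platform's actual implementation.

```python
def validate_log(log, metadata):
    """Check one real-time log against embedding metadata.

    Returns a list of (level, message) findings; the rules and
    messages are illustrative.
    """
    findings = []
    # format compliance: required common fields must be present
    for f in ("visitor_id", "event_id", "business"):
        if f not in log:
            findings.append(("Error", f"missing required field: {f}"))
    # registration check: the event must exist in metadata
    if log.get("event_id") not in metadata.get("events", {}):
        findings.append(("Error", f"unregistered event: {log.get('event_id')}"))
    else:
        # parameter completeness against the registered definition
        required = metadata["events"][log["event_id"]]
        missing = [p for p in required if p not in log.get("params", {})]
        if missing:
            findings.append(("Warning", f"missing event params: {missing}"))
    return findings

metadata = {"events": {"goods_click": ["goods_id"]}}
ok = validate_log({"visitor_id": "v-1", "event_id": "goods_click",
                   "business": "wsc", "params": {"goods_id": 1}}, metadata)
bad = validate_log({"visitor_id": "v-1", "event_id": "goods_click",
                    "business": "wsc", "params": {}}, metadata)
print(ok, bad)
```

In the platform's flow, a worker would run checks like these only on logs whose visitor ID matches the Redis-registered test set, then stream the findings back to the tester.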
7.4 Quality Monitoring
Incomplete test coverage or routine development iterations can cause online embedding quality issues. For example, a code change may silently drop an event, producing metric anomalies and lost data that is difficult to backfill.
Real‑time monitoring of traffic logs and immediate alerts to responsible parties are needed to prevent recurrence.
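One simple form such monitoring can take is a volume check against a historical baseline. The threshold and message below are assumptions for illustration; a real system would tune the baseline window and alerting rules per event.

```python
def check_event_volume(event_id, current_count, baseline_count,
                       drop_threshold=0.5):
    """Flag an event whose real-time volume drops sharply versus the
    same window in a baseline period (threshold is illustrative).

    Returns an alert message, or None if volume is within bounds.
    """
    if baseline_count == 0:
        return None  # no baseline to compare against
    drop = 1 - current_count / baseline_count
    if drop >= drop_threshold:
        return (f"event {event_id} volume dropped "
                f"{drop:.0%} vs baseline; notify owner")
    return None

alert = check_event_volume("goods_click", current_count=120, baseline_count=1000)
print(alert)
```

Routing such alerts to the owner recorded in the embedding metadata closes the loop between monitoring and the responsible team.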
8. Embedding Development Process
Initially, the data team handled embedding design and integration, becoming a bottleneck as business scale grew. With the platform, the process is standardized and led by PMs, coordinating resources and controlling progress.
PM defines data requirements in PRD, specifying metric definitions and acquisition methods.
PM confirms resources and schedule; designers register embedding schemes on the platform, which are evaluated by front‑end and analysis teams.
Front‑end developers implement embedding according to the scheme.
After development, front‑end and PM test the embedding to ensure all events pass before launch.
Analysis team prepares code in advance; once embedding goes live, relevant metrics are produced promptly.
If alerts are received from the embedding platform, responsible parties must address issues and report impact.
This workflow frees data team resources for higher‑value tasks.
Core business embedding still requires data team involvement to guarantee quality.
9. Underlying Embedding Framework
Log flow consists of:
Front‑end monitors user behavior and reports via HTTP.
NIO high‑concurrency log receiver forwards logs to rsyslog, then to Logstash, and finally to Kafka (raw logs).
Java embedding sends logs asynchronously to NSQ, which Flume syncs to Kafka.
Flink real‑time ETL transforms raw logs into a standardized middle‑layer format and writes back to Kafka.
Kafka logs are synced by Flume to HDFS, partitioned hourly.
Spark hourly jobs convert HDFS logs into Hive tables.
10. Future Outlook
The platform currently supports dozens of business lines (micro‑mall, retail, beauty, etc.) with over 20 new projects per month. Future directions include:
More user‑friendly platform guidance for non‑technical users.
Enhanced front‑end development efficiency with visual, configurable embedding integrated with frameworks.
Reduced SDK report loss rate.
Quick log lookup across all terminals to improve testing and troubleshooting.
Smarter quality management for rapid issue localization and resolution.
Extended dimensions in real‑time middle layer for business domains.
Automatic page classification for passive embedding.
Improved analysis efficiency, integration with metric libraries, and easier conversion and attribution models.
Support for algorithmic A/B testing.
Finally, the big‑data team is hiring for data warehouse, infrastructure, and product roles. Interested candidates can email [email protected].
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.