
Comprehensive Guide to Data Collection, Event Modeling, and Tracking in Big Data Platforms

The guide explains how comprehensive data collection in big‑data platforms relies on a standardized event model, passive and code‑based embedding, multi‑platform SDKs, a log‑middleware layer, precise location tracking, and an embedding management platform that supports workflow, testing, quality monitoring, and scalable infrastructure for future enhancements.

Youzan Coder

1. Introduction

Big data applications generally consist of five stages: collection, processing, storage, computation, and visualization. Collection is the source of everything downstream; only when data is collected comprehensively, accurately, and promptly can the metrics derived from it be valuable.

Embedding (埋点, also known as event tracking or instrumentation) is an important collection method. It turns user behavior into data assets that support product analysis, business decisions, and ad recommendation.

When business demand is low, simple methods can quickly collect user behavior. However, with many business lines, terminals, and diverse data needs, a well‑designed embedding model, collection specifications, and tool‑/platform‑/process‑based management are required to ensure quality.

2. Event Model

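who: visitor identifier, device fingerprint, login ID

when: event time, report time

where: device environment, network environment, business environment, etc.

what: event identifier, event parameters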

We designed a log model that can carry the above information while maintaining extensibility, mapping data to schema fields to fully record a single user action.
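
As a rough illustration of such a log model, the sketch below maps the who/when/where/what groups onto one record. The article does not publish the actual schema, so every field name here (`visitor_id`, `env`, `params`, and so on) is an assumption; the open-ended `env` and `params` dictionaries stand in for the extensibility the model is said to preserve.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Event:
    """One user action. All field names are hypothetical."""

    # who
    visitor_id: str
    # what
    event_id: str
    # when (epoch milliseconds)
    event_time_ms: int
    report_time_ms: int = 0
    # who (optional identifiers)
    device_fingerprint: Optional[str] = None
    login_id: Optional[str] = None
    # where: device, network, and business environment
    env: Dict[str, Any] = field(default_factory=dict)
    # what: free-form event parameters, kept open for extension
    params: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # Serialize the whole record as a single JSON log line.
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


event = Event(
    visitor_id="v1",
    event_id="click_buy",
    event_time_ms=1000,
    report_time_ms=1500,
    env={"network": "wifi"},
    params={"goods_id": 42},
)
print(event.to_json())
```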

3. Collection Methods

3.1 Passive (All‑in‑One) Embedding

Passive embedding uses the listeners built into the browser or app to collect page visits, clicks, and similar actions. The collected information includes:

Page URL, app package name, etc.

Element xpath, title, or agreed DOM element

Advantages:

Low front‑end integration cost, no extra development needed

Complete user action collection without loss

Problems:

Both useful and useless data are collected

Cannot capture special actions or business parameters

Collected data must be annotated a second time before users can recognize what it represents

Dynamic button positions, duplicate names, or page refactoring make accurate identification difficult

Passive embedding is generally used for coarse‑grained rapid business exploration.

3.2 Code Embedding

Code embedding relies on front‑end developers to customize listeners and collection logic.

Advantages:

Clear event identifiers

Rich business parameters

Flexible trigger mechanisms

Easier and more precise analysis

Problems:

Front‑end development and management cost

Only events after code deployment can be collected

When passive embedding cannot meet analysis needs, code embedding is required.

4. Embedding SDK

To simplify front‑end developers' work, Youzan provides SDKs for multiple platforms (js, mini‑program, Android, iOS, Java) that handle visitor identification, session management, environment parameter collection, parameter lifecycle, default event collection, cross‑platform communication, special business logic, log formatting, merging, lifecycle, and upload mechanisms.

Front‑end developers only need to focus on:

SDK initialization configuration

Event identification

Event parameters

Event triggering
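
The division of labor above can be sketched as a tiny tracker: the caller supplies initialization config, an event identifier, and parameters, while the tracker handles visitor identification, buffering, merging, and upload. This is not Youzan's SDK; the class name `Tracker`, its methods, and the batch payload shape are all assumptions made for illustration.

```python
import time
import uuid
from typing import Any, Callable, Dict, List, Optional


class Tracker:
    """Hypothetical minimal SDK: buffers events and uploads them in batches."""

    def __init__(self, business: str, sdk_type: str,
                 upload: Callable[[Dict[str, Any]], None],
                 batch_size: int = 10) -> None:
        # Common parameters are collected once and attached to every batch.
        self.common = {
            "business": business,
            "sdk_type": sdk_type,
            "visitor_id": uuid.uuid4().hex,  # stand-in for real visitor identification
        }
        self._upload = upload
        self._batch_size = batch_size
        self._buffer: List[Dict[str, Any]] = []

    def track(self, event_id: str, params: Optional[Dict[str, Any]] = None) -> None:
        # Record the event time on the client, then buffer the event.
        self._buffer.append({
            "event_id": event_id,
            "params": dict(params or {}),
            "event_time_ms": int(time.time() * 1000),
        })
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Merge buffered events into one payload and hand it to the uploader.
        if not self._buffer:
            return
        batch = {
            "common": dict(self.common),
            "report_time_ms": int(time.time() * 1000),
            "events": self._buffer,
        }
        self._buffer = []
        self._upload(batch)


sent: List[Dict[str, Any]] = []
tracker = Tracker("micro_mall", "js", upload=sent.append, batch_size=2)
tracker.track("enterpage")
tracker.track("click_buy", {"goods_id": 42})  # second event triggers an upload
```

Passing the uploader in as a function keeps the sketch testable without a network; a real SDK would post the batch over HTTP and handle retries.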

5. Log Middleware Layer

Raw logs are highly compact and need further processing into a middle‑layer format, involving:

Splitting batch‑reported logs

Formatting logs into the log model

Secondary processing and dimension expansion, e.g., parsing IP and http_agent

Cleaning abnormal traffic

Supplementing session information, e.g., landing page, second page, and dwell time

Business‑based log stream and table partitioning

The real-time middle layer is stored as JSON in Kafka, with corresponding JavaBean classes for stream-processing jobs. The offline layer stores the same schema in tables partitioned by date and business, and view tables are created automatically so that adjustments and data-warehouse permissions can be managed in one place.
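
Two of the middle-layer steps, splitting batch-reported logs and supplementing session information, can be sketched as follows. The batch shape (`common`, `events`, `report_time_ms`) and the field names are assumptions, and approximating dwell time as the gap to the next event in the same session is a simplification chosen for the example, not the article's stated rule.

```python
from typing import Any, Dict, List


def split_batch(batch: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one batch payload into flat per-event records."""
    common = batch.get("common", {})
    records = []
    for event in batch.get("events", []):
        record = dict(common)   # copy shared fields onto each event
        record.update(event)
        record["report_time_ms"] = batch.get("report_time_ms")
        records.append(record)
    return records


def add_session_info(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Enrich the records of ONE session with landing page and dwell time."""
    ordered = sorted(records, key=lambda r: r["event_time_ms"])
    if not ordered:
        return []
    landing_page = ordered[0].get("page_url")
    enriched = []
    for current, following in zip(ordered, ordered[1:] + [None]):
        row = dict(current)
        row["landing_page"] = landing_page
        # The last event has no successor, so its dwell time is unknown.
        row["dwell_ms"] = (
            None if following is None
            else following["event_time_ms"] - current["event_time_ms"]
        )
        enriched.append(row)
    return enriched


batch = {
    "common": {"visitor_id": "v1"},
    "report_time_ms": 3000,
    "events": [
        {"event_id": "enterpage", "page_url": "/home", "event_time_ms": 1000},
        {"event_id": "enterpage", "page_url": "/goods", "event_time_ms": 2500},
    ],
}
session = add_session_info(split_batch(batch))
```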

At this stage, with a common log model and SDK, embedding work can be standardized, but increasing business lines bring new challenges.

6. Position Tracking Specification

Precise location tracking is needed for fine‑grained operations and algorithmic recommendation. A unified location specification avoids duplicated development across business lines.

Location is divided into four granularity levels:

Business

Page domain (page type and page ID)

Component domain (component type and index, highlighted in red in the diagram)

Slot domain (slot identifier and index, highlighted in green)

Business + page domain + component domain + slot domain + page random code uniquely determines a visit location. The derived dimensions enable easy analysis of exposure, click, and visit metrics at each granularity.
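
One way to picture the composition is a key built from the four levels plus the page random code, where truncating the key yields the coarser granularities. The separators (`.` and `~`), the field names, and the sample values are assumptions; the article does not specify an encoding.

```python
from dataclasses import dataclass

# Granularity levels, coarsest first; "visit" adds the page random code.
LEVELS = ("business", "page", "component", "slot", "visit")


@dataclass(frozen=True)
class Location:
    business: str
    page_type: str
    page_id: str
    component_type: str
    component_index: int
    slot_id: str
    slot_index: int
    page_random_code: str

    def key(self, level: str = "visit") -> str:
        """Return the location key truncated at the requested granularity."""
        parts = [
            self.business,
            f"{self.page_type}~{self.page_id}",
            f"{self.component_type}~{self.component_index}",
            f"{self.slot_id}~{self.slot_index}",
            self.page_random_code,
        ]
        depth = LEVELS.index(level) + 1
        return ".".join(parts[:depth])


loc = Location("micro_mall", "goods", "123", "recommend", 2, "item", 5, "ab12")
print(loc.key())             # full visit location
print(loc.key("component"))  # aggregate at component granularity
```

Grouping exposure or click logs by `key("page")`, `key("component")`, or `key("slot")` then gives metrics at each granularity from the same record.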

7. Embedding Management Platform

Early on, embedding schemes were recorded in a wiki, which caused inconsistencies, missing key information, difficulty in event discovery, fragmented updates, and poor progress monitoring.

The platform provides:

Embedding metadata management and open API

Embedding workflow management

With metadata, additional capabilities become possible:

Automatic embedding testing

Self‑service analysis

Development hints

Quality monitoring

7.1 Embedding Metadata Management

Metadata consists of:

Business
Page
Component
Slot
Event

Definitions:

Business: uniquely identified by business type (e.g., micro-mall, retail) and SDK type (js/mini-program/android/ios/java). Pages, components, slots, and events each belong to a single business.

Page: a class of web or mobile pages sharing the same structure.

Component: a block within a page, possibly reusable across pages.

Slot: the finest granularity within a component, identified by incremental order, fixed position, or regex.

Event: the basic unit of embedding, representing a user action (page entry, button click, product exposure, etc.) with optional parameters. Events can be global, page-level, or component-level.

7.2 Project Workflow Management

When a new project starts, a batch of embedding requirements is created. Project-level management involves PMs and the front-end, data, BI, and testing teams, and tracks the project through initiation, design, development, integration, and launch.

7.3 Embedding Testing

Pre‑release testing directly impacts data quality. Early testing used packet capture tools with manual verification, which was inefficient and error‑prone. The platform now offers an online testing function:

The test user enters the project and a visitor ID; the module stores the ID in Redis.

The module consumes real-time logs, periodically syncs embedding metadata and the visitor set, validates the matching logs, and collects them into the platform.

The collected real-time logs are returned to the user.

The tested events are summarized into overview data.

Testing items include log format compliance, completeness of common business parameters, registration of business/page/component/event, and parameter completeness/format.

Detection levels are Warning or Error, with corresponding messages.
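
A minimal sketch of such checks is shown below. It validates one flat log record against registered event metadata and reports issues at the two levels. Which problems count as Warning versus Error, the required common fields, and the metadata shape (a mapping from `(business, event_id)` to expected parameter types) are all assumptions for the example.

```python
from typing import Any, Dict, List, Tuple

Issue = Tuple[str, str]  # (level, message)
Metadata = Dict[Tuple[str, str], Dict[str, type]]

# Hypothetical set of common fields every log must carry.
REQUIRED_COMMON = ("visitor_id", "business", "event_id", "event_time_ms")


def validate(log: Dict[str, Any], metadata: Metadata) -> List[Issue]:
    issues: List[Issue] = []

    # 1. Completeness of common parameters.
    for name in REQUIRED_COMMON:
        if log.get(name) in (None, ""):
            issues.append(("Error", f"missing common field: {name}"))

    # 2. The event must be registered for this business.
    key = (log.get("business"), log.get("event_id"))
    expected = metadata.get(key)
    if expected is None:
        issues.append(("Error", f"event not registered: {key[0]}/{key[1]}"))
        return issues

    # 3. Parameter completeness and format.
    params = log.get("params") or {}
    for name, expected_type in expected.items():
        if name not in params:
            issues.append(("Error", f"missing parameter: {name}"))
        elif not isinstance(params[name], expected_type):
            issues.append(
                ("Warning", f"parameter {name} should be {expected_type.__name__}")
            )
    for name in params:
        if name not in expected:
            issues.append(("Warning", f"unregistered parameter: {name}"))
    return issues


metadata: Metadata = {("micro_mall", "click_buy"): {"goods_id": int}}
log = {
    "visitor_id": "v1",
    "business": "micro_mall",
    "event_id": "click_buy",
    "event_time_ms": 1,
    "params": {"goods_id": "7"},  # wrong type: string instead of int
}
for level, message in validate(log, metadata):
    print(level, message)
```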

7.4 Quality Monitoring

Incomplete test coverage or routine development iterations can cause online embedding quality issues. For example, a code change may drop an event, leading to metric anomalies that are hard to recover.

Real‑time monitoring of traffic logs and immediate alerts to responsible parties are needed to prevent recurrence.
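
The dropped-event case can be sketched as a comparison between current per-event counts and a baseline for the same window. The thresholds, the function name, and the idea of using a fixed drop ratio are assumptions for illustration; the article does not describe the actual alerting rule.

```python
from typing import Dict, List, Tuple


def find_dropped_events(
    current: Dict[str, int],
    baseline: Dict[str, int],
    min_baseline: int = 100,
    drop_ratio: float = 0.5,
) -> List[Tuple[str, int, int]]:
    """Return (event_id, observed, expected) for events whose volume collapsed."""
    alerts = []
    for event_id in sorted(baseline):
        expected = baseline[event_id]
        # Skip low-volume events, where small fluctuations look like big drops.
        if expected < min_baseline:
            continue
        observed = current.get(event_id, 0)  # a missing event counts as zero
        if observed < expected * (1 - drop_ratio):
            alerts.append((event_id, observed, expected))
    return alerts


baseline = {"enterpage": 1000, "click_buy": 1000, "rare_event": 10}
current = {"enterpage": 950}  # click_buy vanished after a release
alerts = find_dropped_events(current, baseline)
```

Each alert tuple would then be routed to whoever is registered as responsible for that event.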

8. Embedding Development Process

Initially, the data team handled embedding design and integration, and it became a bottleneck as the business grew. With the platform, the process is standardized and led by PMs, who coordinate resources and control progress.

PM defines data requirements in PRD, specifying metric definitions and acquisition methods.

PM confirms resources and schedule; designers register embedding schemes on the platform, which are evaluated by front‑end and analysis teams.

Front‑end developers implement embedding according to the scheme.

After development, front‑end and PM test the embedding to ensure all events pass before launch.

Analysis team prepares code in advance; once embedding goes live, relevant metrics are produced promptly.

If alerts are received from the embedding platform, responsible parties must address issues and report impact.

This workflow frees data team resources for higher‑value tasks.

Core business embedding still requires data team involvement to guarantee quality.

9. Underlying Embedding Framework

Log flow consists of:

Front‑end monitors user behavior and reports via HTTP.

NIO high‑concurrency log receiver forwards logs to rsyslog, then to Logstash, and finally to Kafka (raw logs).

Java embedding sends logs asynchronously to NSQ, which Flume syncs to Kafka.

Flink real‑time ETL transforms raw logs into a standardized middle‑layer format and writes back to Kafka.

Kafka logs are synced by Flume to HDFS, partitioned hourly.

Spark hourly jobs convert HDFS logs into Hive tables.
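
To make the hourly partitioning concrete, the sketch below derives a partition path from an event timestamp. The directory layout, the `root` path, the partition key names, and the use of UTC are all assumptions; the article states only that HDFS data is partitioned hourly and that offline tables are partitioned by date and business.

```python
from datetime import datetime, timezone


def hourly_partition(event_time_ms: int, business: str,
                     root: str = "/logs/track") -> str:
    """Build a hypothetical partition path for one log record."""
    t = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    return f"{root}/business={business}/dt={t:%Y%m%d}/hour={t:%H}"


# 25 hours after the epoch falls in hour 01 of the second day.
path = hourly_partition(25 * 3_600_000, "micro_mall")
```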

10. Future Outlook

The platform currently supports dozens of business lines (micro‑mall, retail, beauty, etc.) with over 20 new projects per month. Future directions include:

More user‑friendly platform guidance for non‑technical users.

Enhanced front‑end development efficiency with visual, configurable embedding integrated with frameworks.

Reduced SDK report loss rate.

Quick log lookup across all terminals to improve testing and troubleshooting.

Smarter quality management for rapid issue localization and resolution.

Extended dimensions in real‑time middle layer for business domains.

Automatic page classification for passive embedding.

Improved analysis efficiency, integration with metric libraries, and easier conversion and attribution models.

Support for algorithmic A/B testing.

Finally, the big‑data team is hiring for data warehouse, infrastructure, and product roles. Interested candidates can email [email protected].
