
Comprehensive Guide to Big Data Interview Topics: Log Collection, Data Synchronization, Offline Development, Real‑time Technology, Data Services, and Data Mining

This article provides an extensive overview of big‑data interview subjects, covering browser and mobile log collection methods, data synchronization techniques (batch, real‑time, sharding), offline data development platforms, streaming architectures, data service evolution, performance optimization, and data‑mining layers and applications.

Big Data Technology & Architecture

Nov 28, 2022


1. Log Collection

This section introduces Alibaba's log collection system, which consists of two main solutions: Aplus.JS for web (browser) logs and UserTrack for mobile (APP) logs.

1.1 Browser Page Log Collection

Page view logs are collected when a page is rendered, forming the basis for Page View (PV) and Unique Visitor (UV) metrics. The article explains the challenges and methods of collecting these fundamental logs.
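As a rough illustration of how PV and UV are derived from page view logs, the sketch below counts both per page. The record layout `(visitor_id, page)` and the sample data are assumptions made for this example, not the format Aplus.JS actually emits.

```python
from collections import defaultdict

# Hypothetical page-view log records: (visitor_id, page)
page_view_logs = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/home"),
    ("u1", "/item/42"),
]

def pv_uv(logs):
    """Return {page: (pv, uv)} from an iterable of (visitor_id, page) pairs."""
    views = defaultdict(int)      # page -> number of views (PV)
    visitors = defaultdict(set)   # page -> distinct visitor ids (UV)
    for visitor_id, page in logs:
        views[page] += 1
        visitors[page].add(visitor_id)
    return {page: (views[page], len(visitors[page])) for page in views}

metrics = pv_uv(page_view_logs)
```

PV counts every rendering of a page, while UV counts each visitor once, which is why reliable visitor identification matters for the second metric.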

Interaction logs capture user actions after page load, enabling analysis of user interests and experience optimization. Specialized logs such as exposure and real‑time online status are also discussed.

1.2 Mobile Client Log Collection

Mobile log collection uses the UserTrack SDK, which categorizes events (e.g., page events, widget clicks) to facilitate downstream analysis. This part also covers hybrid H5/Native logs, device identification, and log upload and processing.

1.3 Log Collection Challenges

Challenges include handling massive log volumes, structuring and normalizing logs, efficient downstream computation, and providing flexible support for algorithms.

2. Data Synchronization

2.1 Basics of Data Synchronization

Three approaches: direct data extraction, file‑based synchronization, and database log parsing.

2.2 Data Synchronization Strategies

2.2.1 Batch Synchronization

DataX converts values from every source into a unified string representation, reads each source through its own plugin, and performs the entire transfer in memory without disk I/O.
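The idea of a reader, an in-memory channel, and a writer can be sketched as follows. This is not DataX's actual plugin API; the function names `reader` and `writer`, the queue-based channel, and the sample rows are all assumptions for illustration.

```python
import queue
import threading

_DONE = object()  # marker telling the writer that the reader has finished

def reader(source_rows, channel):
    """Read rows and push them into the channel with every value as a string."""
    for row in source_rows:
        channel.put(tuple(str(value) for value in row))
    channel.put(_DONE)

def writer(channel, sink, column_types):
    """Pull string rows from the channel and cast them to the target types."""
    while True:
        row = channel.get()
        if row is _DONE:
            break
        sink.append(tuple(cast(value) for cast, value in zip(column_types, row)))

source = [(1, 9.5, "alice"), (2, 7.25, "bob")]
sink = []
channel = queue.Queue(maxsize=100)  # bounded in-memory buffer, no disk I/O

reader_thread = threading.Thread(target=reader, args=(source, channel))
reader_thread.start()
writer(channel, sink, (int, float, str))
reader_thread.join()
```

Passing strings through the channel means the reader and writer only need to agree on one intermediate type, at the cost of a cast on each side.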

2.2.2 Real‑time Synchronization

Real‑time incremental updates are obtained by parsing MySQL binlog and delivered via a publish‑subscribe model with load balancing and filtering capabilities.
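A minimal sketch of the publish-subscribe delivery with filtering is shown below. The `ChangeBroker` class and the event dictionaries are hypothetical; a real pipeline would obtain events from a binlog parser and would also distribute subscribers across nodes for load balancing, which is omitted here.

```python
class ChangeBroker:
    """Fans change events out to subscribers of the matching table."""

    def __init__(self):
        self._subscribers = {}  # table name -> list of (predicate, callback)

    def subscribe(self, table, callback, predicate=None):
        self._subscribers.setdefault(table, []).append((predicate, callback))

    def publish(self, event):
        for predicate, callback in self._subscribers.get(event["table"], []):
            # A subscriber without a predicate receives every event for the table.
            if predicate is None or predicate(event):
                callback(event)

broker = ChangeBroker()
inserts = []
broker.subscribe("orders", inserts.append, predicate=lambda e: e["op"] == "INSERT")

# Hypothetical events, shaped as a binlog parser might emit them.
broker.publish({"table": "orders", "op": "INSERT", "row": {"id": 1}})
broker.publish({"table": "orders", "op": "DELETE", "row": {"id": 1}})
broker.publish({"table": "users", "op": "INSERT", "row": {"id": 7}})
```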

2.3 Data Synchronization Issues

2.3.1 Sharding and Table Partitioning

A logical middle table aggregates sharded data, allowing external access as if it were a single table.
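One way to picture the logical table is a thin layer that routes each key to its shard and hides the routing from callers. The `LogicalTable` class, the modulo routing rule, and the integer keys below are assumptions for this sketch rather than the middleware's real design.

```python
class LogicalTable:
    """Presents several shards as one table; callers never see the shards."""

    def __init__(self, shard_count):
        self._shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Assumed routing rule: integer key modulo the number of shards.
        return self._shards[key % len(self._shards)]

    def put(self, key, row):
        self._shard_for(key)[key] = row

    def get(self, key):
        return self._shard_for(key).get(key)

    def scan(self):
        """Iterate over every row in every shard."""
        for shard in self._shards:
            yield from shard.items()

orders = LogicalTable(shard_count=4)
for order_id in range(10):
    orders.put(order_id, {"order_id": order_id})
```

A synchronization job can then read `orders.scan()` once instead of being configured separately for each physical shard.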

2.3.2 Efficient and Batch Synchronization

A metadata management platform provides transparent configuration, enabling one‑click table creation, task configuration, publishing, and testing.

2.3.3 Merging Incremental and Full Synchronization

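Instead of merge/update statements, the previous full snapshot is combined with the day's incremental data through a full outer join, and the result is written back with insert overwrite; daily full snapshots are kept only for a short period.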
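The merge can be sketched with dictionaries keyed by primary key. The function name, partition names, and sample rows are illustrative assumptions, and deletes are not handled in this simplified version.

```python
def merge_full_and_increment(previous_full, increment):
    """Full outer join on the primary key, preferring the incremental row."""
    merged = {}
    for key in previous_full.keys() | increment.keys():
        merged[key] = increment[key] if key in increment else previous_full[key]
    return merged

# Partition name -> full snapshot for that day (primary key -> row).
partitions = {
    "2022-11-27": {1: {"status": "created"}, 2: {"status": "paid"}},
}
todays_increment = {2: {"status": "shipped"}, 3: {"status": "created"}}

# "Insert overwrite": the new partition is replaced wholesale by the merge result.
partitions["2022-11-28"] = merge_full_and_increment(
    partitions["2022-11-27"], todays_increment
)
```

Because the new partition is rewritten in full, no row-level update is needed, which suits storage engines that handle bulk writes better than in-place changes.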

2.3.4 Synchronization Performance

...

2.3.5 Data Drift

Data drift occurs around midnight, when records close to the day boundary end up in the wrong daily partition; it is remedied by re‑sorting, deduplicating, and re‑partitioning the affected records.
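A simplified version of the sort, deduplicate, and re-partition steps is sketched below. It assumes each record carries an `id` and a business timestamp `modified_time`, and that the latest version per `id` within a day should win; both the field names and that rule are assumptions for this example.

```python
from datetime import datetime

def repartition(records):
    """Group records by business date, keeping the latest version per id."""
    partitions = {}
    # Sorting by timestamp means later versions overwrite earlier ones.
    for record in sorted(records, key=lambda r: r["modified_time"]):
        day = record["modified_time"].date().isoformat()
        partitions.setdefault(day, {})[record["id"]] = record
    return partitions

# Records pulled from the partitions on both sides of midnight.
records = [
    {"id": 2, "modified_time": datetime(2022, 11, 28, 0, 0, 1), "status": "created"},
    {"id": 1, "modified_time": datetime(2022, 11, 27, 23, 59, 59), "status": "paid"},
    {"id": 1, "modified_time": datetime(2022, 11, 27, 23, 59, 58), "status": "created"},
]
fixed = repartition(records)
```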

3. Offline Data Development

3.1 Unified Computing Platform

Handles data ingestion, storage, and various computations, organized into client, access, logic, and compute layers.

Roles in the logic layer: Worker (handles RESTful requests and job submission), Scheduler (dispatches MaxCompute instances), Executor (executes tasks on the compute cluster).

3.2 Unified Development Platform

Includes development/debugging, code quality control, data quality monitoring, and testing platforms.

3.3 Task Scheduling System

Consists of a scheduling engine and execution engine, with workflow and task state machines, and event‑driven instance generation.

3.4 Features

Dependency management, automatic input‑output table detection, cycle scheduling, and manual execution after automatic release.
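Dependency management driven by input and output tables can be sketched with the standard library's `graphlib` (Python 3.9+). The task names and table names are hypothetical; the point is that once each task's inputs and output are known, the dependency graph and a valid run order follow automatically.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: name -> (input tables, output table)
tasks = {
    "build_dws_daily": (("dwd_orders",), "dws_daily"),
    "load_ods_orders": ((), "ods_orders"),
    "build_dwd_orders": (("ods_orders",), "dwd_orders"),
}

# Which task produces each table.
producer = {output: name for name, (_, output) in tasks.items()}

# A task depends on the producers of its input tables.
graph = {
    name: {producer[table] for table in inputs if table in producer}
    for name, (inputs, _) in tasks.items()
}

run_order = list(TopologicalSorter(graph).static_order())
```

`TopologicalSorter` raises `CycleError` if the tasks depend on each other in a loop, which is a useful check before a workflow is released.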

4. Real‑time Technology

4.1 Streaming Architecture

Four components: data collection, processing, storage, and service.

4.1.1 Data Collection

Data is collected as files, with size and interval thresholds controlling frequency; collected data is then consumed by a streaming engine.
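The size and interval thresholds can be illustrated with a small buffering collector. The class name, parameters, and the choice to check the interval only when a new line arrives are assumptions of this sketch; a production collector would typically also flush from a timer.

```python
import time

class BatchCollector:
    """Buffers log lines and flushes when a size or age threshold is reached."""

    def __init__(self, max_bytes, max_seconds, sink, clock=time.monotonic):
        self._max_bytes = max_bytes
        self._max_seconds = max_seconds
        self._sink = sink      # callable that receives a list of lines
        self._clock = clock    # injectable so the age check can be tested
        self._buffer = []
        self._size = 0
        self._started = clock()

    def add(self, line):
        self._buffer.append(line)
        self._size += len(line.encode("utf-8"))
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._started >= self._max_seconds
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self._buffer:
            self._sink(self._buffer)
        self._buffer = []
        self._size = 0
        self._started = self._clock()

batches = []
collector = BatchCollector(max_bytes=10, max_seconds=5.0, sink=batches.append)
collector.add("aaaa")     # 4 bytes buffered, below the size threshold
collector.add("bbbbbb")   # 10 bytes buffered, triggers a flush
```

Smaller thresholds lower latency but produce more, smaller batches; larger ones improve throughput at the cost of freshness.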

4.1.2 Data Processing

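Supports SQL‑based stream analysis, multi‑layer processing, and data skew handling. Deduplication is either exact (every key is retained, with skew‑handling techniques used to spread the memory load across nodes) or fuzzy (a hash‑based approximation). Transaction handling relies on timeout retries, batch IDs, and backing up state to external storage.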
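The difference between exact and fuzzy deduplication can be sketched as a set versus a Bloom filter. The Bloom filter below is one common hash-based approach, chosen here as an assumption; the source does not say which structure is used, and the sizes are illustrative.

```python
import hashlib

class BloomFilter:
    """Hash-based membership filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self._bits = bytearray(num_bits // 8)
        self._num_bits = num_bits
        self._num_hashes = num_hashes

    def _positions(self, item):
        for seed in range(self._num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self._num_bits

    def add(self, item):
        """Insert item; return True if it was (possibly) seen before."""
        seen = True
        for position in self._positions(item):
            byte_index, bit_index = divmod(position, 8)
            if not self._bits[byte_index] & (1 << bit_index):
                seen = False
                self._bits[byte_index] |= 1 << bit_index
        return seen

visitor_ids = ["u1", "u2", "u1", "u3", "u2"]

# Exact: keeps every key in memory.
exact_count = len(set(visitor_ids))

# Fuzzy: fixed memory, may slightly undercount because of false positives.
bloom = BloomFilter()
fuzzy_count = sum(0 if bloom.add(visitor_id) else 1 for visitor_id in visitor_ids)
```

The exact approach grows with the number of distinct keys, while the filter's memory is fixed up front, which is the trade-off that makes approximate counting attractive for very large streams.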

4.1.3 Data Storage

The storage must support concurrent multi‑threaded reads and writes with low latency; table and rowkey design aim for balanced data distribution while keeping rows that share a primary dimension together.
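A rowkey that serves both goals can be sketched as a short hash prefix of the primary dimension followed by the dimension values themselves. The separator, prefix length, hash function, and the `seller_id`-style example values are assumptions for illustration.

```python
import hashlib

def make_rowkey(primary_dim, *other_dims):
    """Build '<hash prefix>|<primary dimension>|<other dimensions...>'."""
    # The prefix spreads different primary dimensions across the key space.
    prefix = hashlib.sha256(primary_dim.encode("utf-8")).hexdigest()[:4]
    return "|".join((prefix, primary_dim, *other_dims))

key_a = make_rowkey("seller_1001", "2022-11-28", "pv")
key_b = make_rowkey("seller_1001", "2022-11-28", "uv")
key_c = make_rowkey("seller_2002", "2022-11-28", "pv")
```

Because the prefix is derived only from the primary dimension, all rows for one seller share the same leading bytes and sort next to each other, while different sellers are scattered.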

4.2 Streaming Data Model

4.2.1 Data Layering

ODS (raw business data), DWD (detailed facts), DWS (dimensional aggregates for all lines), ADS (business‑specific aggregates), DIM (real‑time dimension tables imported from offline).

4.2.2 Multi‑stream Association

Only matching data flows downstream; unmatched data is stored externally and re‑processed when updates arrive.
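The matching behaviour can be sketched with two keyed stores, one per stream. In this example plain dictionaries stand in for the external storage, and the class and method names are hypothetical.

```python
class StreamJoiner:
    """Joins two streams by key; unmatched records wait until the other side arrives."""

    def __init__(self):
        self._left = {}   # stand-in for external storage of left-stream records
        self._right = {}  # stand-in for external storage of right-stream records

    def on_left(self, key, value):
        self._left[key] = value
        return self._emit(key)

    def on_right(self, key, value):
        self._right[key] = value
        return self._emit(key)

    def _emit(self, key):
        # Only a record with a match on both sides flows downstream.
        if key in self._left and key in self._right:
            return (key, self._left[key], self._right[key])
        return None

joiner = StreamJoiner()
first = joiner.on_left("order_1", {"amount": 10})     # no match yet
second = joiner.on_right("order_1", {"city": "HZ"})   # match, emitted
```

Keeping matched records in the store means a later update on either side produces a refreshed joined record rather than being dropped.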

5. Data Services

5.1 Service Architecture Evolution

Progresses from SOA (multiple interfaces per requirement, low reuse) to OpenAPI (wide tables, single interface), SmartDQ (ORM‑wrapped logical tables), and OneService (mixed simple SQL and custom plugin interfaces).

5.2 Performance Optimization

5.2.1 Resource Allocation

Separates complex logic to a shared layer, uses distinct thread pools for Get and List queries, and optimizes execution plans.
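Isolating Get and List queries in separate thread pools might look like the sketch below, using the standard library's `ThreadPoolExecutor`. The pool sizes and the `submit_query` helper are assumptions; the aim is only to show that slow List queries cannot exhaust the threads serving fast Get queries.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Separate pools so that slow List queries cannot starve fast Get queries.
get_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="get")
list_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="list")

def submit_query(kind, fn, *args):
    """Route a query function to the pool that matches its kind."""
    pool = get_pool if kind == "get" else list_pool
    return pool.submit(fn, *args)

def current_thread_name():
    return threading.current_thread().name

get_future = submit_query("get", current_thread_name)
list_future = submit_query("list", current_thread_name)
get_thread = get_future.result()
list_thread = list_future.result()

get_pool.shutdown()
list_pool.shutdown()
```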

5.2.2 Cache Optimization

Metadata cache, logical‑to‑physical table mapping cache, and query result cache.
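As a small illustration of the mapping cache, the sketch below memoizes a logical-to-physical table lookup with `functools.lru_cache`. The table names and the in-memory mapping are hypothetical stand-ins for a metadata service call.

```python
from functools import lru_cache

# Hypothetical mapping, standing in for a lookup against the metadata store.
_PHYSICAL_TABLES = {"dws_trade_daily": "hbase.trade_daily_v2"}
lookups = []

@lru_cache(maxsize=1024)
def physical_table_for(logical_table):
    lookups.append(logical_table)  # records each real (uncached) lookup
    return _PHYSICAL_TABLES[logical_table]

first_lookup = physical_table_for("dws_trade_daily")
second_lookup = physical_table_for("dws_trade_daily")  # served from the cache
```

A query result cache follows the same pattern but usually needs an expiry policy, since results go stale far sooner than metadata does.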

5.2.3 Query Capability

Combines offline and real‑time queries, switches to real‑time when offline results are missing, and replaces polling with push notifications.
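The fallback from offline to real-time results can be sketched as a lookup that tries the offline store first. The function name, the dictionary-backed stores, and the sample keys are assumptions for this example.

```python
def query_metric(key, offline_results, realtime_results):
    """Prefer the offline result; fall back to real-time when it is missing."""
    value = offline_results.get(key)
    if value is None:
        value = realtime_results.get(key)
    return value

offline_results = {"gmv:2022-11-27": 1200}
realtime_results = {"gmv:2022-11-27": 1195, "gmv:2022-11-28": 310}

yesterday = query_metric("gmv:2022-11-27", offline_results, realtime_results)
today = query_metric("gmv:2022-11-28", offline_results, realtime_results)
```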

6. Data Mining

The data‑mining workflow includes business understanding, data preparation, feature engineering, model training, testing, deployment, online application, and feedback.

Data‑platform layers: Feature Data Mining (FDM) for cleaned feature storage, Individual Data Mining (IDM) for entity‑level indicators, Relational Data Mining (RDM) for relationship indicators, and Application‑oriented Data Mining (ADM) for personalized metrics.

Typical applications: individual mining (user profiling, identity recognition, KPI prediction, anti‑fraud) and relational mining (similarity, competition, recommendation systems).


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
