Comprehensive Guide to Big Data Interview Topics: Log Collection, Data Synchronization, Offline Development, Real‑time Technology, Data Services, and Data Mining
This article provides an extensive overview of big‑data interview subjects, covering browser and mobile log collection methods, data synchronization techniques (batch, real‑time, sharding), offline data development platforms, streaming architectures, data service evolution, performance optimization, and data‑mining layers and applications.
1. Log Collection
This chapter introduces Alibaba's log collection system, which consists of two main solutions: Aplus.JS for web (browser) logs and UserTrack for mobile (APP) logs.
1.1 Browser Page Log Collection
Page view logs are collected when a page is rendered, forming the basis for Page View (PV) and Unique Visitor (UV) metrics. The article explains the challenges and methods of collecting these fundamental logs.
Interaction logs capture user actions after page load, enabling analysis of user interests and experience optimization. Specialized logs such as exposure and real‑time online status are also discussed.
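As a rough illustration of how page view logs feed the PV and UV metrics, the sketch below counts both per page from a list of log records. The record fields (`url`, `visitor_id`) and the sample data are hypothetical and are not taken from Aplus.JS.

```python
from collections import defaultdict

def pv_uv(logs):
    """Return {url: (pv, uv)} computed from page view log records."""
    pv = defaultdict(int)
    visitors = defaultdict(set)
    for rec in logs:
        pv[rec["url"]] += 1                          # every view counts toward PV
        visitors[rec["url"]].add(rec["visitor_id"])  # distinct visitors give UV
    return {url: (pv[url], len(visitors[url])) for url in pv}

# Hypothetical sample records
sample_logs = [
    {"url": "/home", "visitor_id": "u1"},
    {"url": "/home", "visitor_id": "u1"},
    {"url": "/home", "visitor_id": "u2"},
    {"url": "/item", "visitor_id": "u1"},
]
stats = pv_uv(sample_logs)
```

PV grows with every rendered page, while UV only grows when a new visitor appears, which is why the two metrics diverge for repeat visits.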
1.2 Mobile Client Log Collection
Mobile log collection uses the UserTrack SDK, categorizing events (e.g., page events, widget clicks) to facilitate downstream analysis, handling hybrid H5/Native logs, device identification, data upload, and processing.
1.3 Log Collection Challenges
Challenges include handling massive log volumes, structuring and normalizing logs, efficient downstream computation, and providing flexible support for algorithms.
2. Data Synchronization
2.1 Basics of Data Synchronization
Three approaches: direct data extraction, file‑based synchronization, and database log parsing.
2.2 Data Synchronization Strategies
2.2.1 Batch Synchronization
DataX converts source data types to a unified string representation, reads from each data source through a dedicated plugin, and performs the entire transfer in memory without disk I/O.
2.2.2 Real‑time Synchronization
Real‑time incremental updates are obtained by parsing MySQL binlog and delivered via a publish‑subscribe model with load balancing and filtering capabilities.
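The publish-subscribe delivery with filtering can be pictured with a small in-memory sketch. The `ChangeBus` class and the event layout (`table`, `op`, `row`) are assumptions made for illustration; real binlog parsing and load balancing across consumers are not modeled here.

```python
class ChangeBus:
    """Hypothetical in-memory stand-in for a change-event message bus."""

    def __init__(self):
        self._subscribers = []  # list of (predicate, handler) pairs

    def subscribe(self, predicate, handler):
        # predicate decides which change events this subscriber receives
        self._subscribers.append((predicate, handler))

    def publish(self, event):
        """Deliver the event to every subscriber whose filter accepts it."""
        delivered = 0
        for predicate, handler in self._subscribers:
            if predicate(event):
                handler(event)
                delivered += 1
        return delivered

orders_seen = []
bus = ChangeBus()
bus.subscribe(lambda e: e["table"] == "orders", orders_seen.append)

# Hypothetical change events, as a binlog parser might emit them
bus.publish({"table": "orders", "op": "INSERT", "row": {"id": 1}})
bus.publish({"table": "users", "op": "UPDATE", "row": {"id": 7}})
```

Only the `orders` event reaches the subscriber, so each downstream consumer pays only for the tables it cares about.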
2.3 Data Synchronization Issues
2.3.1 Sharding and Table Partitioning
An intermediate logical table aggregates the sharded databases and tables, so external callers can access the data as if it were a single table.
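A minimal sketch of the idea, assuming integer primary keys and modulo routing (both are illustrative choices, not details from the source): the hypothetical `LogicalTable` class hides the shard layout behind a single-table interface.

```python
class LogicalTable:
    """Presents several physical shards as one table (hypothetical sketch)."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Route by primary key modulo shard count; assumes integer keys.
        return self.shards[key % len(self.shards)]

    def put(self, key, row):
        self._shard_for(key)[key] = row

    def get(self, key):
        return self._shard_for(key).get(key)

    def scan(self):
        # Callers iterate all rows without knowing how many shards exist.
        for shard in self.shards:
            yield from shard.items()

orders = LogicalTable(shard_count=4)
for order_id in range(10):
    orders.put(order_id, {"order_id": order_id})
```

A synchronization task can then read through `scan()` once instead of being configured separately for every physical shard.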
2.3.2 Efficient and Batch Synchronization
A metadata management platform provides transparent configuration, enabling one-click table creation, task configuration, publishing, and testing.
2.3.3 Merging Incremental and Full Synchronization
Instead of merge or update statements, the previous full snapshot is combined with the incremental data through a full outer join, and the result is written with insert overwrite. Daily full snapshots are kept for a short period only.
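The merge can be sketched with plain dictionaries keyed by primary key. This is an illustration under simplifying assumptions: the function name and sample rows are hypothetical, and deleted rows are not handled.

```python
def merge_full_and_increment(prev_full, increment):
    """Full outer join on the primary key; the incremental row wins on overlap."""
    merged = {}
    for key in prev_full.keys() | increment.keys():  # union of keys = full outer join
        merged[key] = increment[key] if key in increment else prev_full[key]
    return merged

# Hypothetical data: yesterday's full snapshot and today's changes
prev_full = {1: "a", 2: "b"}
increment = {2: "b2", 3: "c"}

# Writing the result to a fresh partition mirrors insert overwrite:
# the new snapshot replaces the partition rather than updating rows in place.
partitions = {}
partitions["today"] = merge_full_and_increment(prev_full, increment)
```

Because the output is written as a whole, the job is idempotent: rerunning it with the same inputs produces the same partition.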
2.3.4 Synchronization Performance
...
2.3.5 Data Drift
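Data drift occurs around midnight, when records whose timestamps straddle the day boundary land in the wrong daily partition. It is remedied by re-sorting, deduplicating, and re-partitioning the affected records.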
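One possible reading of the remedy, sketched under assumptions: each record carries a business timestamp (`event_time`) and an update timestamp (`modified_time`), and the latest version of each key is kept. The field names and the keep-latest rule are hypothetical choices for illustration.

```python
from datetime import date, datetime

def repartition(records, day):
    """Keep records whose business time falls on `day`, latest version per key."""
    in_day = [r for r in records if r["event_time"].date() == day]
    in_day.sort(key=lambda r: r["modified_time"])  # re-sort by update time
    latest = {}
    for r in in_day:
        latest[r["id"]] = r  # later versions overwrite earlier ones (dedup)
    return list(latest.values())

# Hypothetical records read from the partitions on both sides of midnight
records = [
    {"id": 1, "event_time": datetime(2024, 1, 1, 23, 59, 58),
     "modified_time": datetime(2024, 1, 1, 23, 59, 58), "status": "created"},
    {"id": 1, "event_time": datetime(2024, 1, 1, 23, 59, 58),
     "modified_time": datetime(2024, 1, 2, 0, 0, 3), "status": "paid"},
    {"id": 2, "event_time": datetime(2024, 1, 2, 0, 0, 1),
     "modified_time": datetime(2024, 1, 2, 0, 0, 1), "status": "created"},
]
day_one = repartition(records, date(2024, 1, 1))
```

Reading a small window past midnight and then filtering on the business timestamp is what lets late-arriving rows return to the day they belong to.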
3. Offline Data Development
3.1 Unified Computing Platform
Handles data ingestion, storage, and various computations, organized into client, access, logic, and compute layers.
Roles in the logic layer: Worker (handles RESTful requests and job submission), Scheduler (dispatches MaxCompute instances), Executor (executes tasks on the compute cluster).
3.2 Unified Development Platform
Includes development/debugging, code quality control, data quality monitoring, and testing platforms.
3.3 Task Scheduling System
Consists of a scheduling engine and execution engine, with workflow and task state machines, and event‑driven instance generation.
3.4 Features
Features include dependency management, automatic detection of input and output tables, periodic scheduling, and manual execution after automatic release.
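To show how input and output tables can drive dependency management, the sketch below derives task dependencies from declared tables and orders the tasks with the standard library's `graphlib` (Python 3.9+). The task and table names are hypothetical, and the tables are declared by hand here rather than detected from SQL.

```python
from graphlib import TopologicalSorter

def build_order(tasks):
    """Order tasks so that each runs after the tasks producing its inputs."""
    # Map each output table to the task that produces it.
    producer = {out: name for name, t in tasks.items() for out in t["outputs"]}
    # A task depends on the producers of the tables it reads.
    graph = {
        name: {producer[tbl] for tbl in t["inputs"] if tbl in producer}
        for name, t in tasks.items()
    }
    return list(TopologicalSorter(graph).static_order())

# Hypothetical tasks with their input and output tables
tasks = {
    "build_dws": {"inputs": ["dwd_orders"], "outputs": ["dws_orders_1d"]},
    "load_ods": {"inputs": [], "outputs": ["ods_orders"]},
    "build_dwd": {"inputs": ["ods_orders"], "outputs": ["dwd_orders"]},
}
order = build_order(tasks)
```

`TopologicalSorter` raises `CycleError` if the declared tables form a loop, which is a convenient place to reject an invalid workflow before it is scheduled.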
4. Real‑time Technology
4.1 Streaming Architecture
Four components: data collection, processing, storage, and service.
4.1.1 Data Collection
Data is collected as files, with size and interval thresholds controlling frequency; collected data is then consumed by a streaming engine.
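The size and interval thresholds can be sketched as a small batcher that flushes when either limit is reached. The class name, parameters, and the explicit `now` argument (used to keep the example deterministic) are assumptions for illustration.

```python
class ThresholdBatcher:
    """Buffers log lines and flushes on a size or time threshold."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._buffer = []
        self._bytes = 0
        self._started = None  # time the current batch received its first line

    def add(self, line, now):
        """Buffer a line; return the flushed batch if a threshold is hit, else None."""
        if self._started is None:
            self._started = now
        self._buffer.append(line)
        self._bytes += len(line.encode("utf-8"))
        too_big = self._bytes >= self.max_bytes
        too_old = now - self._started >= self.max_seconds
        return self.flush() if too_big or too_old else None

    def flush(self):
        batch, self._buffer = self._buffer, []
        self._bytes = 0
        self._started = None
        return batch

batcher = ThresholdBatcher(max_bytes=10, max_seconds=5.0)
first = batcher.add("abcd", now=0.0)   # 4 bytes buffered, no flush yet
```

Lower thresholds reduce latency at the cost of more frequent, smaller batches; higher thresholds do the opposite.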
4.1.2 Data Processing
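Supports SQL-based stream analysis and multi-layer processing, and must deal with data skew, deduplication, and transactions. Exact deduplication keeps the full detail and relies on data-skew techniques to spread the memory load across nodes, while fuzzy (approximate) deduplication uses hash-based structures. Transaction handling covers timeout retries, batch IDs, and backing up state to external storage.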
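To contrast the two deduplication styles, the sketch below counts distinct visitors exactly with a set and approximately with a Bloom filter. The Bloom filter is one common hash-based choice and is an assumption here; the source does not name a specific structure. Sizes, hash counts, and sample IDs are illustrative.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        """Insert item; return True if it was possibly seen before."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

stream = ["u1", "u2", "u1", "u3", "u2"]  # hypothetical visitor IDs

exact_count = len(set(stream))           # exact: keeps every ID in memory

bloom = BloomFilter()
approx_count = sum(0 if bloom.add(v) else 1 for v in stream)  # approximate
```

The set needs memory proportional to the number of distinct IDs, whereas the filter uses a fixed number of bits and can only undercount when a false positive occurs.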
4.1.3 Data Storage
Storage must support concurrent multi-threaded reads and writes at low latency. Table and rowkey design aim for evenly distributed data while keeping rows that share the same primary dimension co-located.
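One way to meet both rowkey goals is to prefix the key with a hash bucket derived from the primary dimension. The sketch assumes a seller ID as the primary dimension and a date as the secondary one; the key layout, bucket count, and names are hypothetical.

```python
import hashlib

def make_rowkey(seller_id, day, buckets=16):
    """Build a rowkey of the form <bucket>|<primary dimension>|<secondary dimension>."""
    # A stable hash of the primary dimension picks the bucket, so different
    # sellers spread across buckets while one seller always maps to the same one.
    digest = hashlib.md5(seller_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}|{seller_id}|{day}"

key_a = make_rowkey("seller_42", "20240101")
key_b = make_rowkey("seller_42", "20240102")
```

Because the bucket depends only on the seller, all of a seller's rows sort next to each other and can be read with a single prefix scan.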
4.2 Streaming Data Model
4.2.1 Data Layering
ODS (raw business data), DWD (detailed facts), DWS (common dimensional aggregates shared by all business lines), ADS (business-specific aggregates), and DIM (real-time dimension tables imported from offline).
4.2.2 Multi‑stream Association
Only matching data flows downstream; unmatched data is stored externally and re‑processed when updates arrive.
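The association logic can be sketched as a two-sided join that buffers unmatched records and emits a joined row once both sides are present. The dictionaries stand in for external storage, and the class, stream, and field names are assumptions for illustration.

```python
class StreamJoiner:
    """Joins two streams on a key, buffering records until a match arrives."""

    def __init__(self):
        self.left = {}   # stand-in for external storage of stream A
        self.right = {}  # stand-in for external storage of stream B

    def on_left(self, key, value):
        self.left[key] = value
        return self._emit(key)

    def on_right(self, key, value):
        self.right[key] = value
        return self._emit(key)

    def _emit(self, key):
        # Only matched data flows downstream; otherwise wait for the other side.
        if key in self.left and key in self.right:
            return (key, self.left[key], self.right[key])
        return None

joiner = StreamJoiner()
pending = joiner.on_left("o1", {"amount": 30})     # no payment yet
matched = joiner.on_right("o1", {"paid": True})    # match found, emit
updated = joiner.on_left("o1", {"amount": 35})     # update re-emits the join
```

Keeping both sides after a match is what allows later updates to be re-processed; a production system would also need an expiry policy for the buffered records.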
5. Data Services
5.1 Service Architecture Evolution
Progresses from SOA (multiple interfaces per requirement, low reuse) to OpenAPI (wide tables, single interface), SmartDQ (ORM‑wrapped logical tables), and OneService (mixed simple SQL and custom plugin interfaces).
5.2 Performance Optimization
5.2.1 Resource Allocation
Separates complex logic to a shared layer, uses distinct thread pools for Get and List queries, and optimizes execution plans.
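The thread pool separation can be sketched with `concurrent.futures`: cheap Get lookups and heavier List queries run on different executors, so a burst of slow List calls cannot starve the Gets. Pool sizes, the in-memory table, and function names are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Separate pools isolate fast single-key lookups from slower multi-row queries.
get_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="get")
list_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="list")

TABLE = {1: "a", 2: "b", 3: "c"}  # hypothetical data

def get_query(key):
    return TABLE.get(key)

def list_query(keys):
    return [TABLE[k] for k in keys if k in TABLE]

def submit(kind, arg):
    """Route a query to the pool that matches its type."""
    if kind == "get":
        return get_pool.submit(get_query, arg)
    return list_pool.submit(list_query, arg)

get_result = submit("get", 2).result()
list_result = submit("list", [1, 3, 9]).result()

get_pool.shutdown()
list_pool.shutdown()
```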
5.2.2 Cache Optimization
Metadata cache, logical‑to‑physical table mapping cache, and query result cache.
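A query result cache can be sketched as a dictionary with expiry times. The time-to-live policy, the class name, and the explicit `now` argument (used to keep the example deterministic) are assumptions; the source does not describe the eviction strategy.

```python
class ResultCache:
    """Caches query results for a fixed time-to-live."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute, now):
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]               # still fresh: skip the backend
        value = compute()               # miss or expired: run the real query
        self._store[key] = (now + self.ttl, value)
        return value

backend_calls = []

def run_query():
    backend_calls.append(1)             # count how often the backend is hit
    return {"pv": 100}

cache = ResultCache(ttl_seconds=60)
r1 = cache.get_or_compute("shop_1", run_query, now=0)
r2 = cache.get_or_compute("shop_1", run_query, now=30)   # served from cache
r3 = cache.get_or_compute("shop_1", run_query, now=61)   # expired, recomputed
```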
5.2.3 Query Capability
Combines offline and real‑time queries, switches to real‑time when offline results are missing, and replaces polling with push notifications.
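The fallback from offline to real-time results can be sketched as a lookup that tries the offline store first. The store contents, key format, and function name are hypothetical.

```python
def query(key, offline_store, realtime_store):
    """Return (value, source), preferring offline results when they exist."""
    if key in offline_store:
        return offline_store[key], "offline"
    # Offline output is missing (for example, the batch job has not finished),
    # so fall back to the real-time value.
    return realtime_store.get(key), "realtime"

# Hypothetical stores keyed by metric and date
offline_store = {"gmv:20240101": 1200}
realtime_store = {"gmv:20240101": 1195, "gmv:20240102": 430}

yesterday = query("gmv:20240101", offline_store, realtime_store)
today = query("gmv:20240102", offline_store, realtime_store)
```

Returning the source alongside the value lets the caller tell users whether they are looking at finalized or provisional numbers.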
6. Data Mining
The data‑mining workflow includes business understanding, data preparation, feature engineering, model training, testing, deployment, online application, and feedback.
Data‑platform layers: Feature Data Mining (FDM) for cleaned feature storage, Individual Data Mining (IDM) for entity‑level indicators, Relational Data Mining (RDM) for relationship indicators, and Application‑oriented Data Mining (ADM) for personalized metrics.
Typical applications: individual mining (user profiling, identity recognition, KPI prediction, anti‑fraud) and relational mining (similarity, competition, recommendation systems).
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
