
Interview Review: Spark Stage Logic, Data Warehouse Evaluation, and Flink Late‑Data Handling

This article reviews common interview questions for data development roles, covering Spark stage division and optimization, criteria for evaluating a data warehouse, and Flink's handling of late data, with practical answers and resources to help candidates deliver standout responses.

Big Data Technology & Architecture

Hello everyone, today we share a review of interview questions.

Several members of our advanced big‑data class recently interviewed for data development positions at Meituan. We have compiled some of the best questions they encountered, along with guidance on answering them in a way that exceeds expectations and earns stronger interview evaluations.

We often say that an interview reflects years of accumulated work experience, project retrospectives, skills, thinking patterns, and plans for the future, so these questions can also serve as a useful self‑assessment.

Question 1: What is the logic behind Spark Stage division? How can you view Spark job Stages? What optimization strategies do you use?

This is a classic question; the difficulty lies in the second part about online task optimization.

Spark divides a job into multiple Stages via wide‑dependency operations (e.g., groupByKey, join). Within a Stage, narrow‑dependency operations can be pipelined efficiently. Understanding Stage division helps optimize jobs, such as reducing unnecessary shuffles and adjusting partition counts.
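The division rule can be sketched as cutting the lineage graph at every wide (shuffle) dependency. Below is a toy model in plain Python, not Spark's actual scheduler; the RDD names and lineage are invented for illustration:

```python
# Toy model of Spark's stage-division rule: walking the lineage
# backwards from the final RDD, every wide dependency (shuffle)
# starts a new stage, while chains of narrow dependencies are
# pipelined into the same stage.

# Hypothetical lineage as (child, parent, dependency_kind) edges for:
# textFile -> map -> groupByKey -> mapValues -> join(otherRDD)
EDGES = [
    ("map",        "textFile",   "narrow"),
    ("groupByKey", "map",        "wide"),
    ("mapValues",  "groupByKey", "narrow"),
    ("join",       "mapValues",  "wide"),
    ("join",       "otherRDD",   "wide"),
]

def assign_stages(edges, final_rdd):
    """Map each RDD to a stage id: narrow parents share the child's
    stage; each wide parent opens a new stage."""
    parents = {}
    for child, parent, kind in edges:
        parents.setdefault(child, []).append((parent, kind))
    stages = {final_rdd: 0}      # the action runs in the final stage
    next_stage = 1
    frontier = [final_rdd]
    while frontier:
        rdd = frontier.pop()
        for parent, kind in parents.get(rdd, []):
            if kind == "narrow":
                stages[parent] = stages[rdd]   # pipelined together
            else:
                stages[parent] = next_stage    # shuffle boundary
                next_stage += 1
            frontier.append(parent)
    return stages

stages = assign_stages(EDGES, "join")
```

In this toy lineage the job splits into four stages, with `textFile`/`map` pipelined into one stage and `groupByKey`/`mapValues` into another, mirroring how Spark pipelines narrow transformations between shuffles.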

After launching a Spark job, the Spark Web UI shows Stage division and performance metrics. The Stages tab displays:

Stage list: ID, name, status, task count, input/output data size, etc.

Dependency relationships: Parent Stages indicate upstream/downstream links.

Shuffle metrics: Shuffle Read/Write size and records, reflecting data movement.

We recommend using the Spark Web UI together with the EXPLAIN command to analyze SQL execution plans. For complex scenarios, programmatic APIs can retrieve detailed runtime information, allowing you to pinpoint shuffle bottlenecks, data skew, and apply targeted optimizations.
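One common fix the UI often motivates is key salting for data skew: split a hot key into sub‑keys, aggregate partially, then combine. The sketch below is plain Python with an invented dataset, not Spark code; it only illustrates the two‑phase idea and why the busiest partition shrinks:

```python
# Sketch of key salting for skew mitigation (plain Python, not Spark;
# the dataset and salt factor are invented for illustration).
import random
import zlib
from collections import defaultdict

random.seed(0)  # reproducible salting for this sketch

def partition_counts(pairs, num_partitions):
    """Records per partition under hash partitioning
    (a proxy for per-task load in a shuffle)."""
    counts = defaultdict(int)
    for key, _ in pairs:
        counts[zlib.crc32(key.encode()) % num_partitions] += 1
    return counts

# Skewed dataset: one hot key dominates.
pairs = [("hot", 1)] * 9000 + [(f"k{i}", 1) for i in range(1000)]

# Phase 1: append a random salt so the hot key spreads over SALT sub-keys.
SALT = 8
salted = [(f"{k}#{random.randrange(SALT)}", v) for k, v in pairs]

def aggregate(kv):
    out = defaultdict(int)
    for k, v in kv:
        out[k] += v
    return out

# Phase 2: aggregate per salted key, then strip the salt and
# combine the partial results (two-phase aggregation).
partial = aggregate(salted)
final = defaultdict(int)
for sk, v in partial.items():
    final[sk.rsplit("#", 1)[0]] += v

busiest_before = max(partition_counts(pairs, 8).values())
busiest_after = max(partition_counts(salted, 8).values())
```

After salting, the final totals are unchanged, but the load of the busiest partition drops sharply because the hot key no longer lands in a single task.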

For further reading, see the articles linked below:

Spark Performance Optimization Summary

Apache Doris 3.0 Core Features and Production Practices

Question 2: How would you evaluate the quality of a data warehouse?

This open‑ended question has no single correct answer but reveals a candidate’s depth of experience and thinking. Below is a suggested framework:

1. Data Quality

Accuracy: Data should accurately reflect business reality; verify against source data, primary key uniqueness, and end‑to‑end monitoring.

Completeness: No missing values; all required dimensions and metrics are present; long‑term quality tracking is in place.

Consistency: The same entity should have identical values across datasets and processing stages (e.g., metric definitions, calculation logic).

Timeliness: Data is refreshed within required SLA windows.
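The four dimensions above can be made concrete with automated checks. A minimal sketch in plain Python, assuming a hypothetical orders snapshot (the field names and rows are invented; in practice these checks would run inside a DQC platform or a SQL job):

```python
# Minimal data-quality checks for a hypothetical orders snapshot.
rows = [
    {"order_id": 1, "amount": 120.0, "updated_at": "2024-06-01"},
    {"order_id": 2, "amount": 85.5,  "updated_at": "2024-06-01"},
    {"order_id": 3, "amount": None,  "updated_at": "2024-05-31"},
]

def check_unique(rows, key):
    """Accuracy: the primary key must be unique."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def check_complete(rows, field):
    """Completeness: fraction of non-null values for a required field."""
    return sum(r[field] is not None for r in rows) / len(rows)

def check_timely(rows, field, deadline):
    """Timeliness: every row refreshed on or after the SLA date
    (ISO date strings compare correctly as strings)."""
    return all(r[field] >= deadline for r in rows)

unique_ok = check_unique(rows, "order_id")
completeness = check_complete(rows, "amount")
timely_ok = check_timely(rows, "updated_at", "2024-06-01")
```

Here the primary key check passes, completeness is 2/3 because of the null amount, and the timeliness check fails for the stale row; each failure would normally raise an alert.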

2. Model Design

Rationality: The model meets current business needs and is extensible for future changes, avoiding siloed development.

Pipeline Simplicity: Minimize the number of table joins to reduce complexity and failure risk, and limit cross‑team dependencies.

Layering & Domain Separation: Clear layers (ODS, DWD, DWM, DIM, etc.) and domain boundaries follow warehouse standards.

3. Execution Efficiency

Job Runtime: Tasks complete within reasonable timeframes to meet SLA.

Data Skew: Absence of severe skew that would delay overall job progress.

SQL Optimization: SQL is readable, free of redundant computation, and known optimization opportunities have been applied.

4. Resource Utilization

Queue Resource Requests: CPU, memory allocations are reasonable and utilization stays within normal ranges.

Cost Management: Regular cost monitoring and allocation based on business importance.

5. Data Management & Service

Metadata Management: Tables have clear aliases and field descriptions; a metadata system tracks lineage.

Permission Management: Strict access controls ensure data security.

Data Service Capability: Unified data interfaces for business users with responsive query handling.

6. Business Value

Usage Rate: High query volume or broad coverage indicates strong support for the business.

Decision Support: Provides accurate, timely data for strategic decisions and daily operations.

Expand on the above points during the interview.

Question 3: What quality management practices does your data warehouse implement?

Focus on the capabilities provided by DQC (Data Quality Center) and adapt them to your business scenarios. Reference materials:

Bilibili Data Quality Assurance System Construction and Practice

NetEase Yanxuan Data Quality Practice

Question 4: Introduce the most technically challenging project you have worked on.

This question is left for the candidate to answer freely.

Question 5: How does Flink handle late data? What is your approach?

Flink provides several mechanisms for handling late data, primarily based on Watermarks, Allowed Lateness, and Side Outputs.
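How the three mechanisms interact can be shown with a toy event‑time simulation. This is plain Python, not the Flink API; the delays, window bound, and event stream are invented for illustration:

```python
# Toy event-time model of Flink's three late-data mechanisms
# (not the Flink API; all numbers are illustrative).
WATERMARK_DELAY = 2    # watermark = max event time seen - delay
ALLOWED_LATENESS = 3   # window results may still be updated this long
WINDOW_END = 10        # we track the window [0, 10)

on_time, updates, side_output = [], [], []
max_ts = 0
for ts in [1, 5, 9, 12, 8, 14, 7, 20, 6]:
    max_ts = max(max_ts, ts)
    watermark = max_ts - WATERMARK_DELAY
    if ts >= WINDOW_END:
        continue                       # belongs to a later window
    if watermark < WINDOW_END:
        on_time.append(ts)             # window has not fired yet
    elif watermark < WINDOW_END + ALLOWED_LATENESS:
        updates.append(ts)             # late but within allowed lateness:
                                       # the window result is re-emitted
    else:
        side_output.append(ts)         # too late: routed to a side output
```

Events 1, 5, and 9 arrive before the watermark passes the window end; 8 and 7 arrive late but within the allowed lateness, so the window result is updated; 6 arrives after allowed lateness has expired and goes to the side output for separate handling.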

The concept of “late” data is essentially out‑of‑order events.

In stateless ETL jobs, it is often enough to buffer and sort events by event time, restoring order before writing downstream.

In stateful computations, the common practice is to discard excessively late data. For example:

select * from table_a
where unix_timestamp() - unix_timestamp(event_time) <= 48 * 60 * 60

This query (Hive/Spark SQL syntax; direct timestamp subtraction is not portable across engines) drops records whose event time is more than 48 hours behind the current time, so downstream 48‑hour aggregation results remain accurate.

These are the key questions and answer outlines we share for reference.

Finally, you are welcome to join our knowledge community:

“3 Million Words! The Web's Most Complete Big Data Study and Interview Community Awaits You”.

If this article helped you, please “view”, “like”, and “bookmark”.

Tags: Big Data, Flink, Data Quality, Data Warehouse, Interview, Spark
Written by Big Data Technology & Architecture

Wang Zhiwu, a big data expert dedicated to sharing big data technology.