Comprehensive Big Data Interview Question Guide for Major Tech Companies

This article compiles extensive interview questions and topics covering Hadoop, Spark, Flink, Hive, Kafka, MySQL, Redis, Java fundamentals, and algorithms, organized by companies such as Xiaomi, ByteDance, Alibaba, Shopee, Tencent, Meituan, NetEase, and Baidu, to help candidates prepare effectively for big‑data engineering roles.

Big Data Technology & Architecture

Aug 2, 2021

Many companies have now opened campus recruitment, and graduates with a big‑data background need to prepare for interviews. This guide collects interview material gathered from online discussions and personal experience into a list of topics and questions to study.

Xiaomi / ByteDance / Alibaba

Hadoop

Introduce Hadoop (ByteDance, GoodFuture)

MapReduce processing flow / shuffle process (Alibaba, GoodFuture, NetEase)

How Yarn works (Xiaomi)

Are MapReduce and HDFS a single system? Their relationship (Alibaba)

Data skew generation and solutions (Alibaba, ByteDance, Xiaomi, NetEase, GoodFuture)

Types of joins in MapReduce (ByteDance)

Hadoop HA

Hive

Difference between internal and external tables (ByteDance, GoodFuture)

Data warehouse layering (Xiaomi, GoodFuture, NetEase)

Choosing star schema vs. snowflake schema (ByteDance, GoodFuture)

Differences between data warehouse and traditional databases

Dimension redundancy and third normal form (ByteDance, GoodFuture)

Hive storage formats and compression differences (GoodFuture)

Solving slow HQL execution (ByteDance, Alibaba, Xiaomi)

Spark

Relationship between job, stage, task (Xiaomi)

Spark job submission process (Alibaba, Xiaomi)

Common Spark operators (Xiaomi)

Differences and optimizations between Spark shuffle and MapReduce shuffle

Spark fault tolerance (Alibaba)

Various join implementations in SparkSQL

Introduction to Spark Streaming

Understanding Spark RDD

Flink

Comparison of Spark Streaming and Flink (Xiaomi)

Flink state handling (Xiaomi)

Flink fault tolerance and state consistency (Alibaba)

Implementation of consistent checkpoints – distributed snapshots (Alibaba)

Flink watermark, window mechanism, time (Xiaomi)

Flink runtime architecture

MySQL

Index concepts, B+Tree (Alibaba)

Considerations for creating indexes (frequency, composite indexes, order)

Clustered index, covering index, and lookups back to the clustered index ("back‑to‑table" queries)

When indexes become ineffective and how to detect usage

Transaction basics and concurrency issues (dirty read, non‑repeatable read, phantom read)

ACID properties and isolation levels

Kafka

File storage mechanism of Kafka

Reliability guarantees: producer‑broker communication, ISR, ACK, partition replicas, leader election

Kafka consistency guarantees

Ensuring data ordering

Differences between Kafka and traditional message queues

Redis

Advantages and disadvantages of Redis

Redis data types

Why Redis is highly efficient

Redis master‑slave replication process

AOF vs. RDB advantages, disadvantages, and use cases

Redis eviction policies

Cache avalanche, cache breakdown and their solutions

Java Fundamentals

Includes core Java knowledge, JVM, multithreading, and related concepts.

Open‑ended Questions

Understanding of big‑data processing philosophy (divide‑and‑conquer, moving computation to data)

Understanding of big‑data ecosystem evolution (Alibaba)

Future trends of big‑data systems (Alibaba)

Possible reasons for a frozen Douyin app (ByteDance)

Algorithms

Review of common algorithm topics such as linked lists, binary trees, stacks, queues, dynamic programming, recursion, backtracking, and sorting algorithms (bubble sort, quick sort, merge sort), with emphasis on problem‑solving strategies used in interviews.
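
As one illustration of the sorting topics listed above, here is a minimal quick sort sketch. The choice of the last element as pivot (Lomuto partitioning) is an assumption made for brevity; interviewers may expect a different pivot strategy or an iterative version.

```python
def quick_sort(items):
    """Sort a list in place with a last-element-pivot quick sort and return it."""

    def sort(lo, hi):
        if lo >= hi:
            return
        pivot = items[hi]
        boundary = lo  # everything left of boundary is smaller than the pivot
        for j in range(lo, hi):
            if items[j] < pivot:
                items[boundary], items[j] = items[j], items[boundary]
                boundary += 1
        # Put the pivot between the smaller and the not-smaller elements.
        items[boundary], items[hi] = items[hi], items[boundary]
        sort(lo, boundary - 1)
        sort(boundary + 1, hi)

    sort(0, len(items) - 1)
    return items
```

With this pivot choice, already sorted input degrades to quadratic time and deep recursion, which is a common follow-up discussion point.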

ByteDance

First Round

Dimensional modeling: identify subject area, granularity, metrics, fact tables, dimension tables

Differences between Hive shuffle and Spark shuffle

Why Spark is fast and its execution process

Conversion rate calculation

Handling slowly changing dimensions

Flink state, window, broadcast stream

Second Round

Number of reducers used by Hive count(distinct) and the problems this causes on massive data

Spark optimization

Ensuring exactly‑once consistency in Flink

Flink real‑time top‑N

Ensuring exactly‑once consistency when writing Flink results to Redis

Spark‑Hive solutions for data skew

Fact table classifications

Implementation of accumulating snapshot fact tables

Third Round

HDFS read/write process at source‑code level (including RPC)

MapReduce shuffle principles at source‑code level (locks, multithreading, disk spill)

Why data warehouses need layering

Differences between real‑time and offline processing

Feature mining and management

Shopee

First Round

Self‑introduction focusing on projects

Deep dive into Hadoop‑related project details

Extended Hadoop questions (detailed HDFS write process)

Designing a HashMap and its algorithmic complexity

Implementing quick sort

Discussion of design principles when detailed knowledge is lacking

ClickHouse basics and characteristics

Java JVM memory structure

Volatile keyword purpose

MySQL index concepts and B‑tree vs. B+‑tree

Redis use cases and data structures

Bloom filter‑style solution for checking existence in 4 billion numbers with 10 GB memory

Spark job execution flow

Spark data skew handling methods

Problems that Kafka solves

Hive file storage formats

HQL row‑to‑column and column‑to‑row transformations

HQL query to get the latest date’s name for each id

Zookeeper distributed lock implementation

Linux command to view the highest CPU‑consuming process

Algorithm: implement a queue using two stacks
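
The two‑stack queue question that closes this round can be sketched as follows. The class and method names are illustrative, and plain Python lists stand in for the stacks.

```python
class TwoStackQueue:
    """FIFO queue built from two LIFO stacks (plain Python lists)."""

    def __init__(self):
        self._inbox = []   # receives newly enqueued items
        self._outbox = []  # holds items in dequeue order

    def enqueue(self, item):
        self._inbox.append(item)

    def _refill(self):
        # Move items only when the outbox is empty, so each item is moved at most once.
        if not self._outbox:
            while self._inbox:
                self._outbox.append(self._inbox.pop())

    def dequeue(self):
        self._refill()
        if not self._outbox:
            raise IndexError("dequeue from empty queue")
        return self._outbox.pop()

    def peek(self):
        self._refill()
        if not self._outbox:
            raise IndexError("peek at empty queue")
        return self._outbox[-1]

    def __len__(self):
        return len(self._inbox) + len(self._outbox)
```

Because every item is pushed and popped at most twice, enqueue and dequeue are both amortized O(1).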

Second Round

NameNode responsibilities and metadata format

NameNode failure recovery process

Contents of the heartbeat a DataNode sends to the NameNode

Why HDFS splits files into blocks

HDFS write process

Multithreaded code to determine thread exit based on static member access

One‑line Linux bash command to count lines containing a keyword

Third Round

Thread generating key‑value pairs and another thread aggregating sums

SQL to sum scores for identical student names

Bash script to sum the second column after removing the header

SQL to find users with consecutive logins

SQL to select students with average score > 80 and course 0001 score higher than course 0002

Data skew issues

Difference between JVM heap and stack

Common methods in java.lang.Object and the hashCode return value

Creating a thread and setting its heap size

JVM garbage collection mechanisms

Algorithm: mirror a binary tree

Approach to find common numbers in two 10 GB files with only 256 MB memory
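
One common approach to the two‑large‑files question above is hash partitioning: split both files into buckets with the same hash function so that equal numbers always land in the same bucket pair, then intersect one pair at a time. The sketch below assumes one integer per line; the function names and the default bucket count are hypothetical.

```python
import os
import tempfile


def _partition(path, out_dir, prefix, buckets):
    """Split a one-number-per-line file into bucket files keyed by value % buckets."""
    paths = [os.path.join(out_dir, f"{prefix}_{i}.txt") for i in range(buckets)]
    handles = [open(p, "w") for p in paths]
    try:
        with open(path) as src:
            for line in src:
                line = line.strip()
                if line:
                    handles[int(line) % buckets].write(line + "\n")
    finally:
        for handle in handles:
            handle.close()
    return paths


def common_numbers(path_a, path_b, buckets=64):
    """Yield integers present in both files, holding only one bucket in memory."""
    with tempfile.TemporaryDirectory() as tmp:
        parts_a = _partition(path_a, tmp, "a", buckets)
        parts_b = _partition(path_b, tmp, "b", buckets)
        for part_a, part_b in zip(parts_a, parts_b):
            with open(part_a) as fa:
                seen = {int(line) for line in fa}
            emitted = set()  # avoid reporting a repeated value twice
            with open(part_b) as fb:
                for line in fb:
                    value = int(line)
                    if value in seen and value not in emitted:
                        emitted.add(value)
                        yield value
```

The bucket count has to be chosen so that one bucket's set fits in the memory budget. A Python set of integers costs far more than the raw text, so a real 10 GB input would need many more buckets than the default shown here.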

Tencent

First Round

Self‑introduction

Work responsibilities

Data warehouse layering

Spark job issues and solutions

Dimension tables vs. fact tables

Types of fact tables

Team composition and responsibilities

Number of tables and management methods

SQL questions

Join ON vs. WHERE syntax

User consecutive check‑in days
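
The consecutive check‑in question is normally asked in SQL, but the underlying logic can be sketched in a few lines of Python. The function name and the choice to return the longest streak are assumptions; the interview version may ask for the current streak or for users above a threshold instead.

```python
from datetime import date, timedelta


def longest_streak(checkin_dates):
    """Return the longest run of consecutive calendar days in the input."""
    days = sorted(set(checkin_dates))  # drop repeat check-ins on the same day
    best = current = 0
    previous = None
    for day in days:
        if previous is not None and day - previous == timedelta(days=1):
            current += 1
        else:
            current = 1  # a gap (or the first day) starts a new streak
        best = max(best, current)
        previous = day
    return best
```

In SQL the same idea is often expressed by subtracting a per‑user row number from the check‑in date, so that consecutive days share one group key.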

Second Round

How subject areas are divided, and why modeling starts from a particular layer

Dimension vs. fact tables

Fact table classifications

Data quality assurance methods

Metrics for measuring data‑warehouse quality

Offline task issues (lateness, duplication) and handling

Ensuring data consistency

Differences between data warehouse and data middle‑platform

Data modeling categories in a warehouse

Handling slowly changing dimensions

Building and maintaining user‑profile metrics

Data cleaning implementation steps

Resolving issues when provided metrics are incorrect

Meituan

First Round

Self‑introduction

Mathematical modeling competition experience

SQL exercises

Common OS commands

High I/O usage resolution

MySQL index concepts

MapReduce principles

Explanation of shuffle

Data skew understanding and optimization

Why project data is stored in MongoDB

Hive knowledge

Interviewer encouragement to run data experiments

Second Round

Detailed project responsibilities

Java word‑count implementation

Code improvement discussion

Spark experience

Spark Streaming interface level (RDD/DataFrame/Dataset)

Spark DAG explanation

Spark lazy evaluation mechanism

Hive tuning methods

Partitioning and bucketing logic

OSI seven‑layer model explanation

Common Linux commands

Shell scripts written

Understanding of shell pipelines

Data structures, B‑tree vs. B+‑tree

MySQL insert impact on indexes

Keyboard input to screen display process

Character set encoding/decoding principles

Combinatorial problem: seating arrangements and dish selections

Third Round

Self‑introduction

Undergraduate and graduate computer knowledge

Operating system file system overview

Deep dive into Hadoop, Hive, HBase

Hadoop deployment details

Zookeeper configuration for deployment

Recent big‑data project experience

Table design and rowkey strategy

HBase problem‑solving scenarios

MySQL use cases and table designs

Reasons for not storing MySQL data in HBase

MySQL practical problem solving

Key Java language features

HashMap key/value storage requirements

Implementing a list with primitive types

NetEase

First Round

Self‑introduction (3 minutes)

ETL project introduction (15 minutes)

Why data stored in HDFS is later imported to NoSQL; HDFS OLAP limitations

Spark job execution process

MapReduce vs. Spark comparison

Linux commands (candidate lacked experience)

Statistical concepts: p‑value, median vs. mean

Business case: declining subscription on NetEase Cloud Classroom and resume‑placement effectiveness evaluation

Second Round

Self‑introduction (3 minutes)

ETL project introduction (15 minutes)

Data‑warehouse development and migration project overview (10 minutes)

Machine‑learning project and its integration with big‑data development (5 minutes)

Kafka architecture, preventing split‑brain, and why newer versions avoid Zookeeper for offset management

Hand‑written code: find the second largest number in an array
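
A single‑pass sketch for the second‑largest question follows. It assumes "second largest" means the second largest distinct value and returns None when no such value exists; both choices are assumptions worth confirming with the interviewer.

```python
def second_largest(values):
    """Return the second largest distinct value, or None if there is none."""
    largest = second = None
    for v in values:
        if largest is None or v > largest:
            # New maximum: the old maximum becomes the runner-up.
            largest, second = v, largest
        elif v != largest and (second is None or v > second):
            second = v
    return second
```

This runs in O(n) time with O(1) extra space and avoids sorting the array.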

Third Round

Self‑introduction (3 minutes)

ETL project introduction (15 minutes)

Hand‑written code: derive post‑order traversal from given pre‑order and in‑order traversals (5 minutes)

Business case: user‑function usage duration calculation and optimization; webpage importance ranking using PageRank (Spark implementation)
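
For the traversal question in this round, the post‑order sequence can be produced without building the tree. The sketch below assumes all node values are unique; the function name is illustrative.

```python
def postorder_from(preorder, inorder):
    """Return the post-order sequence for a binary tree with unique node values."""
    index_of = {value: i for i, value in enumerate(inorder)}
    result = []
    next_root = 0  # position in preorder of the next subtree root

    def build(lo, hi):
        # inorder[lo:hi] is the subtree currently being processed.
        nonlocal next_root
        if lo >= hi:
            return
        root = preorder[next_root]
        next_root += 1
        mid = index_of[root]
        build(lo, mid)       # left subtree
        build(mid + 1, hi)   # right subtree
        result.append(root)  # root comes last in post-order

    build(0, len(inorder))
    return result
```

The pre‑order sequence supplies each subtree's root, and the root's position in the in‑order sequence splits the remaining nodes into left and right subtrees. Very deep trees may exceed Python's default recursion limit.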

Fourth Round

Internship responsibilities

Technical architecture of the internship company

Collaboration between front‑end, back‑end, data‑center, and algorithm teams

Baidu

First Round

Project introduction

Join types: left join, inner join, cross join

MapReduce basics

Bubble sort algorithm

Three dimension‑modeling approaches

Awk usage

Dynamic vs. static partitioning

Data skew handling

Join size considerations for large vs. small tables

Second Round

Deep‑dive project introduction

SQL: intersection of two tables, and set differences

SQL: top‑3 student grades

SQL: left join scenarios

Binary tree search

Merging two sorted linked lists

Shell and Linux common commands

Factory design pattern

MyISAM vs. InnoDB differences and use cases
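
The linked‑list merge question from this round can be sketched as below. The ListNode class is a hypothetical minimal node type, and the merge relinks existing nodes rather than copying them.

```python
class ListNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def merge_sorted(a, b):
    """Merge two ascending linked lists by relinking nodes; return the new head."""
    dummy = tail = ListNode(None)  # placeholder head simplifies the loop
    while a is not None and b is not None:
        if a.value <= b.value:
            tail.next, a = a, a.next
        else:
            tail.next, b = b, b.next
        tail = tail.next
    # At most one list still has nodes; attach the remainder as is.
    tail.next = a if a is not None else b
    return dummy.next
```

Using `<=` keeps the merge stable: on ties, nodes from the first list come first.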

Third Round

Hand‑written code: intersection of two sorted arrays

Dynamic partitioning discussion

Linux common commands

Spark data skew and lineage issues
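
The sorted‑array intersection question that opens this round is usually answered with two pointers. This sketch assumes both inputs are sorted in ascending order and returns each common value once.

```python
def intersect_sorted(a, b):
    """Return the distinct values present in both ascending-sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            # Equal values: record once, then advance both pointers.
            if not out or out[-1] != a[i]:
                out.append(a[i])
            i += 1
            j += 1
    return out
```

The loop visits each element at most once, so the cost is O(len(a) + len(b)) with no extra hash structure.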

This guide serves as a detailed reference for candidates preparing for big‑data engineering interviews across major Chinese tech firms.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
