Comprehensive Big Data Interview Question Guide for Major Tech Companies
This article compiles extensive interview questions and topics covering Hadoop, Spark, Flink, Hive, Kafka, MySQL, Redis, Java fundamentals, and algorithms, organized by companies such as Xiaomi, ByteDance, Alibaba, Shopee, Tencent, Meituan, NetEase, and Baidu, to help candidates prepare effectively for big‑data engineering roles.
Campus recruitment is under way at many companies, and graduates with a big data background need to prepare for interviews. This guide collects interview questions gathered from online discussions and personal experience into a list of topics to study.
Xiaomi / ByteDance / Alibaba
Hadoop
Introduce Hadoop (ByteDance, GoodFuture)
MapReduce processing flow / shuffle process (Alibaba, GoodFuture, NetEase)
How Yarn works (Xiaomi)
Are MapReduce and HDFS a single system? Their relationship (Alibaba)
Data skew generation and solutions (Alibaba, ByteDance, Xiaomi, NetEase, GoodFuture)
Types of joins in MapReduce (ByteDance)
Hadoop HA
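For the data skew question in the Hadoop list above, one commonly cited remedy is two-stage aggregation with a random "salt" on the key. The sketch below shows only the idea in plain Python; it is not MapReduce or Hive code, and the function name `salted_count` and the `salts` parameter are hypothetical names chosen for illustration.

```python
import random
from collections import Counter


def salted_count(keys, salts=4, seed=0):
    """Count keys in two stages so a hot key is spread over several groups.

    Returns (partial, total): the stage-1 counts keyed by (key, salt)
    and the stage-2 counts keyed by the original key.
    """
    rng = random.Random(seed)

    # Stage 1: attach a random salt, so one hot key becomes up to `salts`
    # smaller groups that different reducers could process in parallel.
    partial = Counter()
    for key in keys:
        partial[(key, rng.randrange(salts))] += 1

    # Stage 2: drop the salt and merge the partial counts per original key.
    total = Counter()
    for (key, _salt), count in partial.items():
        total[key] += count
    return partial, total
```

The second stage only sees at most `salts` records per key, which is why the final merge is cheap even when the first stage was heavily skewed.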
Hive
Difference between internal and external tables (ByteDance, GoodFuture)
Data warehouse layering (Xiaomi, GoodFuture, NetEase)
Choosing star schema vs. snowflake schema (ByteDance, GoodFuture)
Differences between data warehouse and traditional databases
Dimension redundancy and third normal form (ByteDance, GoodFuture)
Hive storage formats and compression differences (GoodFuture)
Solving slow HQL execution (ByteDance, Alibaba, Xiaomi)
Spark
Relationship between job, stage, task (Xiaomi)
Spark job submission process (Alibaba, Xiaomi)
Common Spark operators (Xiaomi)
Differences and optimizations between Spark shuffle and MapReduce shuffle
Spark fault tolerance (Alibaba)
Various join implementations in SparkSQL
Introduction to Spark Streaming
Understanding Spark RDD
Flink
Comparison of Spark Streaming and Flink (Xiaomi)
Flink state handling (Xiaomi)
Flink fault tolerance and state consistency (Alibaba)
Implementation of consistent checkpoints – distributed snapshots (Alibaba)
Flink watermark, window mechanism, time (Xiaomi)
Flink runtime architecture
MySQL
Index concepts, B+Tree (Alibaba)
Considerations for creating indexes (frequency, composite indexes, order)
Clustered index, covering index, and back‑table queries
When indexes become ineffective and how to detect usage
Transaction basics and concurrency issues (dirty read, non‑repeatable read, phantom read)
ACID properties and isolation levels
Kafka
File storage mechanism of Kafka
Reliability guarantees: producer‑broker communication, ISR, ACK, partition replicas, leader election
Kafka consistency guarantees
Ensuring data ordering
Differences between Kafka and traditional message queues
Redis
Advantages and disadvantages of Redis
Redis data types
Why Redis is highly efficient
Redis master‑slave replication process
AOF vs. RDB advantages, disadvantages, and use cases
Redis eviction policies
Cache avalanche, cache breakdown and their solutions
Java Fundamentals
Includes core Java knowledge, JVM, multithreading, and related concepts.
Open‑ended Questions
Understanding of big‑data processing philosophy (divide‑and‑conquer, moving computation to data)
Understanding of big‑data ecosystem evolution (Alibaba)
Future trends of big‑data systems (Alibaba)
Possible reasons for a frozen Douyin app (ByteDance)
Algorithms
Review of common algorithm topics such as linked lists, binary trees, stacks, queues, dynamic programming, recursion, backtracking, and sorting algorithms (bubble sort, quick sort, merge sort), with emphasis on problem‑solving strategies used in interviews.
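As a reference point for the sorting questions, here is a minimal quick sort sketch in Python. It favors readability over the in-place partitioning that interviewers often ask for, and it assumes the elements are mutually comparable.

```python
def quick_sort(items):
    """Return a new sorted list; the input is left unmodified."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    # Three-way split keeps duplicates of the pivot out of the recursion.
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quick_sort(smaller) + equal + quick_sort(larger)
```

This version uses extra memory for the three sublists; average time is O(n log n), and a consistently bad pivot degrades it to O(n^2).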
ByteDance
First Round
Dimension modeling: identify theme, granularity, metrics, fact tables, dimension tables
Differences between Hive shuffle and Spark shuffle
Why Spark is fast and its execution process
Conversion rate calculation
Handling slowly changing dimensions
Flink state, window, broadcast stream
Second Round
Hive count(distinct) reduce count and issues with massive data
Spark optimization
Ensuring precise consistency in Flink
Flink real‑time top‑N
Ensuring precise consistency when writing Flink results to Redis
Spark‑Hive solutions for data skew
Fact table classifications
Implementation of cumulative snapshot fact tables
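The real-time top-N question in this round usually comes down to two steps: count per key inside a window, then rank within each window. The sketch below shows that logic in plain Python over a finite list of events; it does not use the Flink API, and the names `top_n_per_window` and `window_size` are assumptions made for illustration.

```python
import heapq
from collections import Counter


def top_n(counts, n):
    """Return the n (key, count) pairs with the largest counts, biggest first."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


def top_n_per_window(events, window_size, n):
    """Group (timestamp, key) events into tumbling windows and rank each one.

    Returns a dict mapping window start time to its top-n list.
    """
    per_window = {}
    for timestamp, key in events:
        # Align the timestamp to the start of its tumbling window.
        start = timestamp - timestamp % window_size
        per_window.setdefault(start, Counter())[key] += 1
    return {start: top_n(counts, n) for start, counts in sorted(per_window.items())}
```

A streaming engine would emit each window's ranking when the window closes rather than after reading all events, but the aggregate-then-rank structure is the same.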
Third Round
HDFS read/write process at source‑code level (including RPC)
MapReduce shuffle principles at source‑code level (locks, multithreading, disk spill)
Why data warehouses need layering
Differences between real‑time and offline processing
Feature mining and management
Shopee
First Round
Self‑introduction focusing on projects
Deep dive into Hadoop‑related project details
Extended Hadoop questions (detailed HDFS write process)
Designing a HashMap and its algorithmic complexity
Implementing quick sort
Discussion of design principles when detailed knowledge is lacking
ClickHouse basics and characteristics
Java JVM memory structure
Purpose of the volatile keyword
MySQL index concepts and B‑tree vs. B+‑tree
Redis use cases and data structures
Bloom filter‑style solution for checking existence in 4 billion numbers with 10 GB memory
Spark job execution flow
Spark data skew handling methods
What problems Kafka solves
Hive file storage formats
HQL row‑to‑column and column‑to‑row transformations
HQL query to get the latest date’s name for each id
Zookeeper distributed lock implementation
Linux command to view the highest CPU‑consuming process
Algorithm: implement a queue using two stacks
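The last question in this round, a queue built from two stacks, can be sketched as follows. Python lists stand in for the stacks (only `append` and `pop` are used), and the class name `TwoStackQueue` is a hypothetical choice.

```python
class TwoStackQueue:
    """FIFO queue backed by two LIFO stacks."""

    def __init__(self):
        self._inbox = []   # receives newly enqueued items
        self._outbox = []  # holds items in dequeue order

    def enqueue(self, item):
        self._inbox.append(item)

    def dequeue(self):
        # Refill the outbox only when it is empty; moving items reverses
        # their order, so the oldest item ends up on top.
        if not self._outbox:
            while self._inbox:
                self._outbox.append(self._inbox.pop())
        if not self._outbox:
            raise IndexError("dequeue from empty queue")
        return self._outbox.pop()

    def __len__(self):
        return len(self._inbox) + len(self._outbox)
```

Each item is moved between stacks at most once, so `dequeue` is O(1) amortized even though a single call can take O(n).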
Second Round
NameNode responsibilities and metadata format
NameNode failure recovery process
Contents of the heartbeat a DataNode sends to the NameNode
Reason for block partitioning
HDFS write process
Multithreaded code to determine thread exit based on static member access
One‑line Linux bash command to count lines containing a keyword
Third Round
Thread generating key‑value pairs and another thread aggregating sums
SQL to sum scores for identical student names
Bash script to sum the second column after removing the header
SQL to find users with consecutive logins
SQL to select students with average score > 80 and course 0001 score higher than course 0002
Data skew issues
Difference between JVM heap and stack
Common methods in java.lang.Object and hashCode return value
Creating a thread and setting its heap size
JVM garbage collection mechanisms
Algorithm: mirror a binary tree
Approach to find common numbers in two 10 GB files with only 256 MB memory
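One common approach to the two-large-files question is hash partitioning: split both files into bucket files using the same function, then compare matching buckets one pair at a time so only one bucket needs to fit in memory. The sketch below assumes one integer per line and collects the result in memory for simplicity; the function names and the default `buckets=64` are illustrative assumptions, and the real bucket count would have to be chosen so that each bucket fits in the available memory.

```python
import os
import tempfile


def _partition(path, out_dir, prefix, buckets):
    """Split a one-integer-per-line file into bucket files by value modulo."""
    paths = [os.path.join(out_dir, f"{prefix}_{i}.txt") for i in range(buckets)]
    handles = [open(p, "w") for p in paths]
    try:
        with open(path) as src:
            for line in src:
                line = line.strip()
                if line:
                    handles[int(line) % buckets].write(line + "\n")
    finally:
        for handle in handles:
            handle.close()
    return paths


def common_numbers(path_a, path_b, buckets=64):
    """Return the set of integers that appear in both files."""
    common = set()
    with tempfile.TemporaryDirectory() as tmp:
        parts_a = _partition(path_a, tmp, "a", buckets)
        parts_b = _partition(path_b, tmp, "b", buckets)
        # Equal numbers always land in the same bucket index, so only
        # matching bucket pairs need to be compared.
        for part_a, part_b in zip(parts_a, parts_b):
            with open(part_a) as fa:
                seen = {int(line) for line in fa}
            with open(part_b) as fb:
                for line in fb:
                    value = int(line)
                    if value in seen:
                        common.add(value)
    return common
```

In a real interview answer the matches would be written to an output file instead of a set, since the result itself may not fit in memory.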
Tencent
First Round
Self‑introduction
Work responsibilities
Data warehouse layering
Spark job issues and solutions
Dimension tables vs. fact tables
Types of fact tables
Team composition and responsibilities
Number of tables and management methods
SQL questions
Join ON vs. WHERE syntax
User consecutive check‑in days
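The consecutive check-in question is normally answered in SQL, but the underlying logic is easy to see in a few lines of Python: deduplicate and sort the dates, then track the length of the current run of adjacent days. The function name `longest_streak` is a hypothetical choice, and the sketch assumes the input is a list of `datetime.date` values for a single user.

```python
from datetime import date, timedelta


def longest_streak(days):
    """Return the length of the longest run of consecutive calendar days."""
    best = current = 0
    previous = None
    for day in sorted(set(days)):  # duplicates on one day count once
        if previous is not None and day - previous == timedelta(days=1):
            current += 1
        else:
            current = 1  # a gap starts a new run
        best = max(best, current)
        previous = day
    return best
```

For example, check-ins on January 1, 2, 4, 5 and 6 contain two runs, and the function reports the longer one (three days).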
Second Round
How themes are divided and why starting from a certain layer
Dimension vs. fact tables
Fact table classifications
Data quality assurance methods
Metrics for measuring data‑warehouse quality
Offline task issues (lateness, duplication) and handling
Ensuring data consistency
Differences between a data warehouse and a data middle platform
Data modeling categories in a warehouse
Handling slowly changing data
Building and maintaining user‑portrait metrics
Data cleaning implementation steps
Resolving issues when provided metrics are incorrect
Meituan
First Round
Self‑introduction
Mathematical modeling competition experience
SQL exercises
Common OS commands
High I/O usage resolution
MySQL index concepts
MapReduce principles
Explanation of shuffle
Data skew understanding and optimization
Why project data is stored in MongoDB
Hive knowledge
Interviewer encouragement to run data experiments
Second Round
Detailed project responsibilities
Java word‑count implementation
Code improvement discussion
Spark experience
Spark Streaming interface level (RDD/DataFrame/Dataset)
Spark DAG explanation
Spark lazy evaluation mechanism
Hive tuning methods
Partitioning and bucketing logic
OSI seven‑layer model explanation
Common Linux commands
Shell scripts written
Understanding of shell pipelines
Data structures, B‑tree vs. B+‑tree
MySQL insert impact on indexes
Keyboard input to screen display process
Character set encoding/decoding principles
Combinatorial problem: seating arrangements and dish selections
Third Round
Self‑introduction
Undergraduate and graduate computer knowledge
Operating system file system overview
Deep dive into Hadoop, Hive, HBase
Hadoop deployment details
Zookeeper configuration for deployment
Recent big‑data project experience
Table design and rowkey strategy
HBase problem‑solving scenarios
MySQL use cases and table designs
Reasons for not storing MySQL data in HBase
MySQL practical problem solving
Key Java language features
HashMap key/value storage requirements
Implementing a list with primitive types
NetEase
First Round
Self‑introduction (3 minutes)
ETL project introduction (15 minutes)
Why data stored in HDFS is later imported to NoSQL; HDFS OLAP limitations
Spark job execution process
MapReduce vs. Spark comparison
Linux commands (candidate lacked experience)
Statistical concepts: p‑value, median vs. mean
Business case: declining subscription on NetEase Cloud Classroom and resume‑placement effectiveness evaluation
Second Round
Self‑introduction (3 minutes)
ETL project introduction (15 minutes)
Data‑warehouse development and migration project overview (10 minutes)
Machine‑learning project and its integration with big‑data development (5 minutes)
Kafka architecture, preventing split‑brain, and why newer versions avoid Zookeeper for offset management
Hand‑written code: find the second largest number in an array
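A single-pass sketch for the second-largest question is shown below. It assumes "second largest" means the second largest distinct value and returns `None` when no such value exists; an interviewer may define either point differently.

```python
def second_largest(nums):
    """Return the second largest distinct value, or None if there is none."""
    largest = second = None
    for x in nums:
        if largest is None or x > largest:
            # The old maximum becomes the runner-up.
            second, largest = largest, x
        elif x != largest and (second is None or x > second):
            second = x
    return second
```

The loop keeps only two variables, so it runs in O(n) time and O(1) extra space without sorting the array.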
Third Round
Self‑introduction (3 minutes)
ETL project introduction (15 minutes)
Hand‑written code: derive post‑order traversal from given pre‑order and in‑order traversals (5 minutes)
Business case: user‑function usage duration calculation and optimization; webpage importance ranking using PageRank (Spark implementation)
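For the traversal question in this round, the post-order sequence can be derived without building the tree: the next pre-order element is the root, its position in the in-order sequence splits the left and right subtrees, and the root is emitted after both subtrees. The sketch assumes all node values are unique; the function name is a hypothetical choice.

```python
def postorder_from(preorder, inorder):
    """Derive the post-order traversal from pre-order and in-order lists."""
    index_of = {value: i for i, value in enumerate(inorder)}
    result = []
    pre_pos = 0  # next unread position in the pre-order list

    def build(lo, hi):
        """Process the subtree whose in-order slice is inorder[lo..hi]."""
        nonlocal pre_pos
        if lo > hi:
            return
        root = preorder[pre_pos]
        pre_pos += 1
        mid = index_of[root]
        build(lo, mid - 1)   # left subtree
        build(mid + 1, hi)   # right subtree
        result.append(root)  # root comes last in post-order

    build(0, len(inorder) - 1)
    return result
```

The dictionary lookup avoids rescanning the in-order list, which keeps the whole derivation at O(n).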
Fourth Round
Internship responsibilities
Technical architecture of the internship company
Collaboration between front‑end, back‑end, data‑center, and algorithm teams
Baidu
First Round
Project introduction
Join types: left join, inner join, cross join
MapReduce basics
Bubble sort algorithm
Three dimension‑modeling approaches
Awk usage
Dynamic vs. static partitioning
Data skew handling
Join size considerations for large vs. small tables
Second Round
Deep‑dive project introduction
SQL: intersection of two tables, and set differences
SQL: top‑3 student grades
SQL: left join scenarios
Binary tree search
Merging two sorted linked lists
Shell and Linux common commands
Factory design pattern
MyISAM vs. InnoDB differences and use cases
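The linked-list question from this round, merging two sorted lists, can be sketched with a dummy head node. The `ListNode` class and the `from_list` / `to_list` helpers are hypothetical scaffolding added so the example runs on its own.

```python
class ListNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def merge_sorted(a, b):
    """Merge two ascending linked lists by relinking their nodes."""
    dummy = tail = ListNode(None)
    while a is not None and b is not None:
        if a.value <= b.value:
            tail.next, a = a, a.next
        else:
            tail.next, b = b, b.next
        tail = tail.next
    # At most one list still has nodes; append it as-is.
    tail.next = a if a is not None else b
    return dummy.next


def from_list(values):
    """Build a linked list from a Python list (helper for the example)."""
    head = None
    for value in reversed(values):
        head = ListNode(value, head)
    return head


def to_list(node):
    """Collect linked-list values into a Python list (helper for the example)."""
    out = []
    while node is not None:
        out.append(node.value)
        node = node.next
    return out
```

The merge reuses the existing nodes rather than copying them, so it needs O(1) extra space beyond the dummy head.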
Third Round
Hand‑written code: intersection of two sorted arrays
Dynamic partitioning discussion
Linux common commands
Spark data skew and lineage issues
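The hand-written question that opens this round, intersecting two sorted arrays, is typically solved with two pointers. The sketch below assumes both inputs are sorted in ascending order and that each common value should appear once in the result.

```python
def intersect_sorted(a, b):
    """Return the distinct values present in both ascending lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            # Skip the value if it was already emitted (handles duplicates).
            if not out or out[-1] != a[i]:
                out.append(a[i])
            i += 1
            j += 1
    return out
```

Each pointer only moves forward, so the running time is O(len(a) + len(b)).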
This guide serves as a detailed reference for candidates preparing for big‑data engineering interviews across major Chinese tech firms.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
