Big Data Lessons from Baidu: Pitfalls, Language Choices, and NewSQL Insights
In this expert Q&A, Baidu’s senior big-data specialists reveal common project pitfalls, argue for Java in Hadoop-style systems, discuss MongoDB deployment, outline criteria for choosing open-source versus self-built solutions, and evaluate the viability of NewSQL/Spanner-type startups.
Introduction
“坐而论道” (roughly, "sitting down to discuss principles") is a rotating Q&A series; this article records part of its big-data themed week, which featured leading Chinese experts.
Question 1: Typical pitfalls in Baidu’s big-data projects
Avoid over-design and iterate quickly.
Practice defensive programming, but do not take it to the point of over-protection.
Quantify work whenever possible; “if you can’t measure it, you can’t improve it”.
Treat external interfaces cautiously; maintain compatibility.
Prepare comprehensive monitoring and incident-response plans.
Provide flexible architecture so business teams can adjust configurations.
Automate routine manual tasks; assume humans and systems are unreliable.
Additional tips:
Strictly follow coding standards.
Use the most stringent compiler options.
Conduct thorough code reviews.
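As a rough illustration of two of the points above (defensive programming without over-protection, and quantifying work), here is a minimal Python sketch. It is not Baidu's code: the tab-separated `key<TAB>count` record format and the names `Stats`, `parse_record`, and `ingest` are assumptions made up for this example. Input is validated once at the boundary, bad records are counted rather than silently dropped or allowed to fail the whole batch, and basic metrics are collected so the job can be measured.

```python
import time
from dataclasses import dataclass


@dataclass
class Stats:
    """Hypothetical counters so the work can be measured, not guessed at."""
    accepted: int = 0
    rejected: int = 0
    elapsed_seconds: float = 0.0


def parse_record(line):
    """Validate one 'key<TAB>count' record (assumed format) at the boundary.

    All checks happen here, once; downstream code trusts the result
    instead of re-validating the same data at every layer.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        raise ValueError(f"expected 2 fields, got {len(parts)}")
    key, raw_count = parts
    if not key:
        raise ValueError("empty key")
    count = int(raw_count)  # raises ValueError on non-numeric input
    if count < 0:
        raise ValueError("negative count")
    return key, count


def ingest(lines, stats):
    """Sum counts per key, skipping bad records and recording metrics."""
    totals = {}
    start = time.perf_counter()
    for line in lines:
        try:
            key, count = parse_record(line)
        except ValueError:
            stats.rejected += 1  # count the failure instead of aborting the batch
            continue
        totals[key] = totals.get(key, 0) + count
        stats.accepted += 1
    stats.elapsed_seconds += time.perf_counter() - start
    return totals
```

In a real pipeline the rejected count and elapsed time would typically feed the monitoring system, so that a sudden rise in bad records triggers the incident-response plan rather than going unnoticed.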
Question 2: Language choice for a Hadoop-like project
Java is the preferred language for open-source projects like Hadoop because it attracts a large community. Clojure, used by Storm, suffers from a limited talent pool, reducing community participation.
Other considerations include supporting Python, PHP, and C++ for internal services, using C++ for performance-critical components, and leveraging existing libraries for serialization and RPC.
Question 3: Factors when deciding between self-developed and open-source solutions
Clarify the business scenario and requirements.
Thoroughly research existing solutions (papers, code, community activity) and test at scale.
Assess whether the solution can be understood deeply, maintained, and promoted.
If the solution aligns with the underlying research and has no critical flaws, adopt it.
Because few solutions meet all criteria, many projects end up being self-developed.
Question 4: MongoDB usage at Baidu
MongoDB is relatively niche at Baidu: it runs on a few hundred nodes, and each business line manages its own deployment independently. Baidu Cloud also offers a shared-instance MongoDB service within the BAE product.
Question 5: Is a NewSQL/Spanner-style distributed database a good startup direction?
Target customers are large enterprises (banks, energy) that can afford existing solutions; convincing them to trust a startup is difficult.
The technical challenges are significant. However, a Redshift-like system may have market potential, because open-source alternatives are lacking and commercial offerings are costly.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
