Choosing the Right Open‑Source Big Data Stack for Advertising: Expert Insights
This article records a WeChat Q&A in which industry experts discuss how to select open‑source big‑data solutions, describe advertising‑specific data scenarios, and share a practical lambda‑style platform architecture built on Hadoop, Spark, Storm, Elasticsearch, Redis and MySQL.
Key Questions and Answers
The article records a Q&A session from the "Efficient Operations" WeChat group that focused on big‑data practices in the advertising industry.
Q1: How to choose among many open‑source big‑data solutions?
In advertising, offline processing typically uses Hadoop, while real‑time tasks rely on Storm or Spark, with Spark also covering graph computation. Common resource‑management frameworks include:
Mesos
YARN
Corona
Torca
Omega
Mesos
Originating from a UC Berkeley research project, Mesos started in the Apache Incubator and later became a top‑level Apache project; it is used by companies such as Twitter. It follows a Master/Slave architecture in which the Master keeps only lightweight state about frameworks and slaves, so a failed Master can be recovered easily through ZooKeeper.
Advantages: supports both short‑lived tasks and long‑running services, and its coarse‑grained resource allocation fits environments with multiple coexisting computation frameworks.
Drawback: the DRF (Dominant Resource Fairness) scheduling algorithm focuses heavily on fairness and may ignore specific application needs.
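To make the fairness point concrete, here is a minimal sketch of the DRF idea, not Mesos's actual allocator. It assumes each user runs identical tasks with a fixed, positive resource demand; the function name `drf_allocate` and the cluster numbers in the usage note are illustrative only.

```python
def drf_allocate(capacity, demands):
    """Greedy DRF sketch: keep giving one task to the user whose
    dominant share (largest fraction of any single resource) is smallest.

    capacity: {"cpu": 9, "mem": 18}            total cluster resources
    demands:  {"A": {"cpu": 1, "mem": 4}, ...} per-task demand, assumed > 0
    """
    resources = list(capacity)
    used = {r: 0.0 for r in resources}
    tasks = {user: 0 for user in demands}
    active = set(demands)

    def dominant_share(user):
        return max(tasks[user] * demands[user][r] / capacity[r] for r in resources)

    while active:
        # sorted() makes tie-breaking deterministic
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in resources):
            for r in resources:
                used[r] += need[r]
            tasks[user] += 1
        else:
            # The next task for this user no longer fits; stop considering them.
            active.discard(user)
    return tasks
```

With a 9‑CPU, 18‑GB cluster, a memory‑heavy user A (1 CPU, 4 GB per task) and a CPU‑heavy user B (3 CPUs, 1 GB per task), the sketch gives A three tasks and B two, so both end at a dominant share of two thirds. Nothing in the loop looks at priorities, deadlines, or data locality, which is the limitation the drawback above describes.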
YARN
YARN is Hadoop 2.0’s resource manager. It was quickly adopted by Hadoop components and offers several built‑in schedulers. Its ResourceManager allocates cluster resources across all applications, while each application's ApplicationMaster schedules its own tasks; integrating traditional database workloads can be inefficient.
Corona
Corona, an open‑source next‑generation MapReduce framework from Facebook, shares design goals with YARN. In practice, YARN and Mesos remain the primary choices in most Hadoop deployments.
Advertising‑specific big‑data stack
Advertising systems often combine:
Storm for billing and anti‑fraud real‑time calculations
Spark’s MLlib for machine‑learning tasks such as click‑through‑rate prediction, clustering, and collaborative filtering
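As a rough illustration of the kind of real‑time anti‑fraud check mentioned above, the sketch below counts clicks per user in a sliding time window. It is plain Python rather than a Storm topology, and the class name `ClickRateMonitor`, the window length, and the threshold are all hypothetical; the source does not describe the actual rules used.

```python
from collections import defaultdict, deque


class ClickRateMonitor:
    """Flags a user whose click count inside a sliding window exceeds a limit.

    Assumes timestamps (in seconds) arrive in non-decreasing order per user.
    """

    def __init__(self, window_seconds=60, max_clicks=5):
        self.window_seconds = window_seconds
        self.max_clicks = max_clicks
        self._clicks = defaultdict(deque)  # user_id -> timestamps in window

    def record(self, user_id, timestamp):
        """Record one click; return True if the user now looks suspicious."""
        window = self._clicks[user_id]
        window.append(timestamp)
        # Evict clicks that have fallen out of the window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) > self.max_clicks
```

In a streaming system the same per‑key state would typically live inside a processing node partitioned by user ID, so that all clicks from one user reach the same counter.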
The original article includes a diagram of an internal DMP (data management platform) data‑processing architecture that incorporates Hadoop, Spark, Storm, Elasticsearch, Redis and MySQL.
Q2: Advertising industry big‑data use cases and challenges
Massive scale: millions of pages, billions of users, and billions of ad‑transaction requests per day, with strict latency limits (e.g., a 100 ms bid response).
Dynamic user targeting: user interests change rapidly, so profiles must be updated promptly to avoid delivering irrelevant ads.
Frequent context changes: user contexts and page content vary constantly, which demands adaptive ad selection.
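One common way to keep profiles current as interests shift is to let older signals fade over time. The sketch below is a hypothetical illustration of exponential time decay, not the method described in the Q&A; the class name `InterestProfile` and the half‑life value are assumptions.

```python
import math


class InterestProfile:
    """Per-user interest scores that halve every `half_life_seconds`."""

    def __init__(self, half_life_seconds=3600.0):
        self.decay_rate = math.log(2) / half_life_seconds
        self._scores = {}  # category -> (score, time of last update)

    def score(self, category, now):
        """Return the decayed score for a category at time `now`."""
        if category not in self._scores:
            return 0.0
        value, last = self._scores[category]
        return value * math.exp(-self.decay_rate * (now - last))

    def observe(self, category, timestamp, weight=1.0):
        """Decay the existing score to `timestamp`, then add the new signal."""
        current = self.score(category, timestamp)
        self._scores[category] = (current + weight, timestamp)

    def top(self, now, k=3):
        """Return the k categories with the highest decayed score."""
        ranked = sorted(self._scores, key=lambda c: self.score(c, now), reverse=True)
        return ranked[:k]
```

A shorter half‑life reacts faster to new behaviour but forgets stable interests sooner, so the value would need tuning against real traffic.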
Q3: Company’s big‑data platform architecture (lambda‑style)
The platform consists of the following open‑source components:
Hadoop for offline reporting and user‑profile generation
Storm for low‑latency real‑time billing and anti‑fraud
Spark (MLlib) for machine‑learning tasks such as click‑through‑rate prediction
Elasticsearch for near‑real‑time indexing and time‑series queries
HBase and MySQL for final result storage and front‑end queries
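In a lambda‑style design, a query usually combines a batch view with a real‑time view. The sketch below shows that merge step for simple per‑ad counters; the function name `merge_views` and the sample data in the usage note are illustrative, and it assumes the real‑time view holds only events that arrived after the latest batch run.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Combine batch totals with real-time increments for serving.

    batch_view:    {ad_id: count} produced by the offline job
    realtime_view: {ad_id: count} accumulated since that job's cutoff
    """
    merged = Counter(batch_view)
    merged.update(realtime_view)  # Counter.update adds counts per key
    return dict(merged)
```

For example, merging a batch view of `{"ad1": 100, "ad2": 40}` with a real‑time view of `{"ad1": 3, "ad3": 1}` yields 103 for `ad1`, 40 for `ad2`, and 1 for `ad3`. When a new batch result is published, the real‑time view would be reset to the new cutoff so that events are not counted twice.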
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
