User Behavior Data Collection and Real-Time Processing Architecture at Qunar

This article describes Qunar's end‑to‑end user behavior data pipeline, covering offline and real‑time ETL processes, system architecture, Dubbo service interfaces, monitoring, optimizations, and the numerous product applications that leverage the unified behavior dataset.

Qunar Tech Salon

Dec 7, 2017

Qunar, an online travel agency, needed a unified, structured, low‑latency user behavior dataset across multiple business lines (hotels, flights, tickets, etc.) to support intelligent, personalized services such as recommendation and search.

Project Background

The company built a client‑side behavior collection system that aggregates logs from various departments, cleanses them, and serves the data through a Dubbo interface or as raw HDFS files.

Data Overview

The dataset captures five behavior types (search, click, order fill, favorite, purchase) across seven business lines. It retains the most recent 40 days of data for most actions and 400 days of history for orders, and it processes tens of millions of records daily with sub‑100 ms latency.
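To make the shape of such a record concrete, here is a minimal Python sketch of a behavior event and its retention rule. The field names, the BehaviorEvent class, and the is_retained helper are hypothetical and not taken from Qunar's schema. The sketch also assumes that the 400‑day window applies to purchase records and that every other behavior type uses the 40‑day window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Behavior(Enum):
    SEARCH = "search"
    CLICK = "click"
    ORDER_FILL = "order_fill"
    FAVORITE = "favorite"
    PURCHASE = "purchase"


# Assumption: the 400-day window applies to purchase (order) records;
# every other behavior type uses the 40-day window.
RETENTION_DAYS = {b: 40 for b in Behavior}
RETENTION_DAYS[Behavior.PURCHASE] = 400


@dataclass(frozen=True)
class BehaviorEvent:
    """Hypothetical unified record; field names are illustrative only."""
    user_id: str
    business_line: str  # e.g. "hotel", "flight", "ticket"
    behavior: Behavior
    item_id: str
    timestamp: datetime


def is_retained(event: BehaviorEvent, now: datetime) -> bool:
    """Return True if the event is still inside its retention window."""
    age = now - event.timestamp
    return age <= timedelta(days=RETENTION_DAYS[event.behavior])
```

Keeping the retention period in a per‑behavior lookup table means a change to one window does not touch the filtering logic.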

System Overview

The pipeline consists of offline batch processing on Hadoop and a real‑time stream built with Qflume, Logstash, Kafka, and Spark Streaming, with results cached in Redis and served through Dubbo.

Offline Framework

Offline jobs ingest hotdog, Kylin, and backend logs; filter, join, standardize, and enrich their fields; and then merge the daily data (T‑1) with historical data (T‑2) to produce full‑history and incremental datasets, which are stored in HDFS and MySQL.
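The merge step can be illustrated with a small Python sketch. It is not the production Hadoop job: the record keys ("user_id", "item_id", "behavior", "day"), the merge_history function, and the deduplication rule are assumptions made for illustration. The 40‑day default mirrors the retention window described in the Data Overview.

```python
from datetime import date, timedelta


def merge_history(history, daily, run_date, retention_days=40):
    """Merge the T-1 daily batch into the T-2 history.

    history, daily: lists of dicts with hypothetical keys
    "user_id", "item_id", "behavior", and "day" (a datetime.date).
    Returns (full_history, incremental).
    """
    cutoff = run_date - timedelta(days=retention_days)

    def key(rec):
        return (rec["user_id"], rec["item_id"], rec["behavior"], rec["day"])

    seen = {key(r) for r in history}

    # Incremental set: daily records that are not already in the history.
    incremental = []
    for rec in daily:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            incremental.append(rec)

    # Full history: old plus new records, with expired days dropped.
    full = [r for r in history + incremental if r["day"] >= cutoff]
    return full, incremental
```

Returning both outputs from one pass lets downstream consumers choose between reloading the full history and applying only the increment.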

Real‑Time Framework

Real‑time ingestion uses Qflume agents to push logs to Kafka topics, with optional Logstash preprocessing and Spark Streaming for more complex logic. Intermediate results are cached in Redis (≈100 GB), which was chosen for its performance and handles tens of thousands of QPS.
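As a rough illustration of the caching step, the Python sketch below groups one micro‑batch of events by user and appends them to a per‑user entry that expires. The TtlCache class is an in‑memory stand‑in, not the Redis client API, and the "behavior:<user_id>" key layout and the event fields are hypothetical. The 48‑hour expiry matches the Redis TTL listed under Exception Handling and Optimizations.

```python
import time
from collections import defaultdict


class TtlCache:
    """In-memory stand-in for a Redis-like store with per-key expiry."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}  # key -> (expires_at, list of events)

    def append(self, key, events):
        """Append events to a key and refresh the expiry of the whole key."""
        now = self.clock()
        expires_at, current = self._data.get(key, (0, []))
        if expires_at <= now:
            current = []  # the previous entry has expired
        self._data[key] = (now + self.ttl, current + list(events))

    def get(self, key):
        """Return the cached events, or an empty list if missing or expired."""
        expires_at, current = self._data.get(key, (0, []))
        return list(current) if expires_at > self.clock() else []


def process_micro_batch(batch, cache):
    """Group one micro-batch of events by user and append them to the cache."""
    by_user = defaultdict(list)
    for event in batch:
        by_user[event["user_id"]].append(event)
    for user_id, events in by_user.items():
        cache.append(f"behavior:{user_id}", events)
```

Grouping by user before writing means each micro‑batch issues one cache write per active user rather than one per event, which matters at tens of thousands of QPS.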

Interface and Service

The unified dataset is exposed via a Dubbo service using Protocol Buffers for fast serialization, while raw JSON files are also available for offline analysis.

Exception Handling and Optimizations

Monitoring of Kafka delays, log counts, and exception rates with alert thresholds.

Completeness checks on offline sources and statistical comparison with real‑time results.

Real‑time tracking of recent user IDs for time‑sensitive push notifications.

48‑hour Redis TTL with 10‑day persistence for backup and discrepancy analysis.

Blacklist handling for abusive crawlers and API misuse.

Logstash‑level cleaning for oversized log entries.

Delta‑load strategy to reduce daily bulk import load.
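Two of the safeguards above, early filtering and threshold‑based alerting, can be sketched in a few lines of Python. The size cap, the blacklist entry, the metric names, and the limits are all hypothetical values chosen for illustration. The sketch also assumes that oversized entries are dropped, whereas the source only says they are cleaned at the Logstash level.

```python
MAX_ENTRY_BYTES = 64 * 1024      # hypothetical size cap for one log entry
BLACKLIST = {"crawler-001"}      # hypothetical abusive client IDs
ALERT_THRESHOLDS = {             # hypothetical alert limits
    "kafka_lag_seconds": 60,
    "exception_rate": 0.01,
}


def accept(raw_line, client_id):
    """Return True if a log entry should continue down the pipeline."""
    if client_id in BLACKLIST:
        return False
    # Assumption: oversized entries are dropped rather than truncated.
    return len(raw_line.encode("utf-8")) <= MAX_ENTRY_BYTES


def breached(metrics):
    """Return the sorted names of metrics that exceed their threshold."""
    return sorted(
        name
        for name, limit in ALERT_THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    )
```

Rejecting blacklisted and oversized entries before the streaming job keeps bad input from consuming processing capacity further downstream.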

Application Overview

The dataset powers more than 30 internal projects, including "Big Search" recommendations, pre‑filled search terms, homepage suggestions, ride‑hailing recommendations, vacation home ranking, targeted advertising, geo‑fence push notifications, search result boosting, and a user behavior analysis system for case studies.

Experience Summary

Effective logging is essential; neglecting it hampers product insight.

Early investment in data architecture prevents costly retrofits.

Stable foundational data yields measurable business gains.

Logging should be treated as a core product feature.
