How We Built a Real‑Time Log Analytics Platform with Storm and Cardinality Counting
To monitor hundreds of web apps on UAE’s PaaS platform in near‑real time, we combined Storm with lightweight log transport, a memcached‑based fqueue, and adaptive cardinality counting to efficiently compute PV, UV, response times, and custom metrics while handling cross‑cluster log aggregation.
Background
UAE (UC App Engine) is an internal UC PaaS platform whose architecture resembles CloudFoundry. It hosts hundreds of web applications, and all requests pass through UAE’s router, generating terabytes of Nginx access logs daily.
We needed a solution to monitor each business’s traffic trends, ad data, page latency, access quality, custom reports, and anomaly alerts in real time.
Hadoop could not meet the sub‑second latency requirement, Spark Streaming was overkill and we lacked Spark expertise, and writing a custom distributed scheduler would be complex. We chose Apache Storm for its lightweight, flexible, and easily extensible message‑driven processing.
Cross‑cluster log transmission was also a major challenge due to the many UC clusters.
Technical Preparation
Cardinality Counting
In distributed big‑data computation, Page Views (PV) can be summed directly, but Unique Visitors (UV) cannot. Counting UV across hundreds of services and tens of thousands of URLs per minute would consume prohibitive memory.
Probabilistic data structures, such as Adaptive Counting (implemented via stream-2.7.0.jar), provide accurate UV estimates with minimal memory overhead and acceptable error bounds.
Real‑time Log Transport
Real‑time computation requires second‑level log delivery. UAE already provides a lightweight log transport tool (client mca and server mcs) that we adopted.
The client watches log file changes in each cluster and streams them to the Storm cluster, where they are stored as ordinary log files. We adjusted the transport strategy so that each Storm node receives logs of roughly equal size, allowing the Spout to read only local data.
Data Source Queue
Instead of heavyweight queues like Kafka, we used fqueue , a lightweight memcached‑protocol queue that converts ordinary log files into a memcached service. Storm’s Spout reads entries via the memcached protocol.
fqueue does not support replay; once a record is consumed it disappears. It is backed by local files with a thin cache, so its throughput is limited by disk I/O, which is sufficient for our current load.
Architecture
With the above components, we can obtain user logs within seconds of a request.
The topology includes two types of computation bolts to ensure even load distribution across services with vastly different traffic volumes.
The Spout normalizes each raw log line, groups by URL (using fieldsGrouping for balanced distribution), and forwards to the appropriate stat_bolt .
stat_bolt performs the main calculations: PV, UV (using cardinality counting), total and backend response times, HTTP status code statistics, URL ranking, traffic aggregation, etc.
merge_bolt merges per‑service metrics such as PV and UV.
A custom Coordinator class (streamId “coordinator”) handles time coordination, batch splitting, task completion checks, and timeout handling, similar to Storm’s Transactional Topology.
A Scheduler obtained parameters via API to dynamically adjust Spout and Bolt distribution across servers.
Topology upgrade support allows the new topology to run alongside the old one; once the new topology takes over the fqueue, the old topology is terminated.
Key operational notes:
Storm nodes should be placed in the same rack to avoid bandwidth bottlenecks.
Nginx logs are split hourly; inaccurate split times cause noticeable data spikes at the hour mark, so using an Nginx module for log rotation is preferred over cron‑based signals.
Insufficient JVM heap size can cause workers to be killed; configure -Xmx appropriately.
Custom Features
Static resource filtering by Content‑Type or file extension.
URL merging for RESTful resources to simplify presentation.
Custom dimensions and metrics defined via ANTLR v3 for syntax and lexical analysis, with support for custom alert expressions.
Additional Monitoring
Process‑level (CPU/MEM/port) monitoring for each service.
Monitoring of dependent services such as MySQL and memcached.
Server‑level monitoring of disk, memory, I/O, kernel parameters, locale, environment variables, and compilation environment.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
