From Bare Metal to Cloud‑Native: How Zhuanzhuan Reinvented Log Collection

This article traces Zhuanzhuan's evolution of log collection—from a bare‑metal scribe + flume pipeline, through a container‑aware log‑pilot solution, to a cloud‑native filebeat and fb‑advisor architecture—detailing the motivations, technical designs, performance gains, and trade‑offs of each stage.

dbaplus Community

Mar 13, 2023

Background

Since 2018 Zhuanzhuan has been migrating its services to containers, and log collection has been a critical component for error debugging, statistical analysis, and strategic decision‑making.

1. Bare Metal Era

In the early days the logging stack consisted of scribe and flume. Adding a new service required the following workflow:

Operations submits a ticket requesting log collection for the service.

The ticket is approved.

The scribe + flume components are automatically deployed on the target server.

A scribe configuration file is rendered, pointing the service’s log directory to the collection directory.

Scribe reads the logs, forwards them to flume, and flume distributes them to Kafka, HDFS, etc.

The workflow worked well because service deployment nodes changed rarely on bare metal.

2. Container Era

When containers were introduced, Zhuanzhuan chose a “most stable” migration path that kept the existing release system, login mechanism, and overall log‑processing logic unchanged. The solution combined log‑pilot with flume:

Log‑pilot automatically discovers containers, extracts metadata, and generates a flume configuration.

Flume collects container logs, writes them to a host‑side directory, and triggers log‑pilot to update the scribe configuration and restart scribe.

This kept the legacy scribe + flume pipeline intact while supporting containers.

As container adoption grew, log volume exploded: disks came under pressure, iowait exceeded 90 %, and log retention fell from 30 days to 3 days.

ByteCompass – Optimized Component

To eliminate the extra log‑pilot + flume hop, the operations team built ByteCompass, a systemd‑managed daemon that watches the Docker API, renders a new scribe.conf directly from container events, and restarts scribe.
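The event-to-config loop described above can be sketched roughly as follows. This is a minimal illustration, not ByteCompass itself: the `Container` fields, the `start`/`die`/`stop` action names, and the stanza layout are all assumptions, and the rendered text is a placeholder rather than real scribe.conf syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    """Minimal view of a running container (hypothetical fields)."""
    name: str     # service name, used here as the log category
    log_dir: str  # host-side directory holding the container's logs


def render_scribe_conf(containers):
    """Render one illustrative stanza per container.

    The layout is a placeholder for illustration only; it is NOT the
    real scribe.conf syntax.
    """
    stanzas = [
        f"[{c.name}]\nlog_dir = {c.log_dir}\n"
        for c in sorted(containers, key=lambda c: c.name)
    ]
    return "\n".join(stanzas)


def apply_event(state, action, container):
    """Apply one container event to `state` (name -> Container).

    Returns True when the set of collected containers changed, which is
    the signal to re-render the config and restart scribe.
    """
    if action == "start":
        changed = state.get(container.name) != container
        state[container.name] = container
        return changed
    if action in ("die", "stop"):
        return state.pop(container.name, None) is not None
    return False  # ignore unrelated events
```

Returning a change flag lets the daemon skip the restart when an event does not alter the set of collected directories, so scribe is restarted only when the config would actually differ.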

Performance improvements:

Average node iowait fell from 10 % to 1 % (peak from >25 % to <3 %).

Overall iowait dropped by 92 %.

Log retention recovered from 3 days to more than 7 days.
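As a sanity check on percentages like these, a relative reduction is simply (before - after) / before. The sketch below applies that to the rounded figures quoted above. The article does not say how its 92 % overall figure was derived, so the helper name and the choice of inputs are assumptions.

```python
def relative_reduction(before, after):
    """Fraction by which a metric fell, relative to its starting value."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Rounded averages from the text: 10 % -> 1 %.
avg = relative_reduction(10.0, 1.0)

# Peak went from >25 % to <3 %, so 25 -> 3 gives a lower bound on the drop.
peak_floor = relative_reduction(25.0, 3.0)

print(f"average iowait reduction: {avg:.0%}")
print(f"peak iowait reduction: more than {peak_floor:.0%}")
```

The rounded averages give a figure close to, but not identical with, the quoted 92 %, which would be consistent with that number having been computed from unrounded measurements.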

Comparison of bare‑metal and ByteCompass log collection

3. Cloud‑Native Era

With the container‑native approach reaching its limits, Zhuanzhuan designed a cloud‑native pipeline based on filebeat and a custom helper called fb‑advisor:

fb‑advisor watches the kube‑apiserver pod API, captures new pod events, extracts the host‑mounted log path, and writes a filebeat configuration.

Filebeat reloads automatically, ships logs to Kafka, and a custom consumer processes them downstream.
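The translation step that fb-advisor performs, from a pod object to a Filebeat input, might look roughly like the sketch below. It is not the actual fb-advisor code: the volume name `logs`, the `*.log` glob, the chosen `fields`, and the use of the `log` input type are all assumptions made for illustration. The pod dictionary follows the shape of the Kubernetes Pod JSON.

```python
import json


def host_log_paths(pod, volume_name="logs"):
    """Return hostPath directories of volumes named `volume_name`.

    The volume name is an assumed convention, not something the
    article specifies.
    """
    volumes = pod.get("spec", {}).get("volumes", [])
    return [
        v["hostPath"]["path"]
        for v in volumes
        if v.get("name") == volume_name and "hostPath" in v
    ]


def filebeat_input(pod, volume_name="logs"):
    """Build one Filebeat `log` input for a pod, or None without a log volume."""
    paths = host_log_paths(pod, volume_name)
    if not paths:
        return None
    meta = pod["metadata"]
    return {
        "type": "log",
        "paths": [p.rstrip("/") + "/*.log" for p in paths],
        # Only the metadata chosen here is attached to events.
        "fields": {"pod": meta["name"], "namespace": meta["namespace"]},
    }


def render_inputs_file(pods):
    """Serialize the inputs as a JSON list, which is also valid YAML flow syntax."""
    inputs = [i for i in map(filebeat_input, pods) if i is not None]
    return json.dumps(inputs, indent=2)
```

Writing the result into a directory that Filebeat watches through `filebeat.config.inputs` with `reload.enabled: true` is one way to get the automatic reload mentioned above, assuming that is how the deployment is configured.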

General HostPath Solution

For a more generic approach, the standard add_kubernetes_metadata processor of filebeat is used to attach pod metadata to log entries. The relevant configuration is shown below:

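processors:
  - add_kubernetes_metadata:
      # Do not use in-cluster credentials; connect with the kubeconfig file below.
      in_cluster: false
      # Node this Filebeat instance is scoped to.
      host: 10.140.24.108
      kube_config: /pathto/kubeconfig
      namespace: default
      # Turn off the default indexers and matchers; only those listed below apply.
      default_indexers.enabled: false
      default_matchers.enabled: false
      sync_period: 60m
      # Index pod metadata by pod UID.
      indexers:
        - pod_uid:
      # Take the pod UID from the log file path under logs_path.
      matchers:
        - logs_path:
            logs_path: '/var/lib/kubelet/pods/'
            resource_type: 'pod'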

This processor connects to the kube‑apiserver, extracts the pod UID from the log file's host path (/var/lib/kubelet/pods/<pod UID>/volumes/…), and attaches the matching pod's metadata to each log event.
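The matching idea can be imitated in a few lines: take the path segment that follows the configured `logs_path` prefix and use it as the key into an index of pod metadata. This is a simplified sketch, not Filebeat's implementation. The function names, the sample UID, and the metadata contents are hypothetical.

```python
def pod_uid_from_path(path, logs_path="/var/lib/kubelet/pods/"):
    """Return the path segment right after `logs_path`, or None if absent."""
    if not path.startswith(logs_path):
        return None
    rest = path[len(logs_path):]
    uid, sep, _ = rest.partition("/")
    # Require a trailing "/" so a bare prefix match is not mistaken for a UID.
    return uid if sep and uid else None


def enrich(event, index):
    """Attach pod metadata found via the UID in the event's log file path.

    `index` maps pod UID -> metadata dict, playing the role of the
    pod_uid indexer. Events that do not match are returned unchanged.
    """
    uid = pod_uid_from_path(event["log"]["file"]["path"])
    meta = index.get(uid)
    if meta is None:
        return event
    return {**event, "kubernetes": meta}
```

A quick usage example with made-up values:

```python
uid = "0f1e2d3c-aaaa-bbbb-cccc-1234567890ab"
index = {uid: {"pod": {"name": "order-7f9"}, "namespace": "default"}}
event = {
    "message": "started",
    "log": {"file": {"path": f"/var/lib/kubelet/pods/{uid}/volumes/logs/app.log"}},
}
enriched = enrich(event, index)
```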

Comparison and Security Considerations

The custom fb‑advisor solution offers higher configurability, allowing selective metadata attachment, while the generic processor automatically adds all pod labels. From a security perspective, the fb‑advisor approach is safer because it reads logs from the host‑mounted directory, decoupling collection from the pod lifecycle and avoiding data loss when pods are rescheduled.

Conclusion

Log collection in the cloud‑native era has many viable designs—there is no one‑size‑fits‑all solution. Zhuanzhuan’s journey from scribe + flume to filebeat + fb‑advisor illustrates how evolving requirements, performance constraints, and operational safety drive architectural choices.
