Business Observability and Real-Time Event Streaming Architecture for Content Production
The paper proposes a business‑observability framework for a content‑production pipeline—illustrated by Bilibili’s workflow—by modeling archives as entities, assigning global AIDs for end‑to‑end tracing, and leveraging a Kafka‑Flink‑ClickHouse event‑streaming platform to monitor real‑time latency, bottlenecks, and safety audits across the entire production line.
This document introduces the need for business observability in a content production platform, using Bilibili's workflow as a concrete example. It describes the end‑to‑end process from creator inspiration, video editing, upload, transcoding, content safety review, to final CDN distribution, and draws an analogy to a Manufacturing Execution System (MES) to highlight the importance of real‑time tracking and scheduling.
The top‑level design defines "business observability" as the monitoring of business entities (archives) and events throughout their lifecycle, rather than merely technical metrics such as QPS or latency. It emphasizes the need to observe the business layer (the "goods" on the production line) to detect bottlenecks, resource contention, and abnormal user behavior.
Key business entities are modeled as an Archive aggregate consisting of Party/Place/Thing, Role, Description, and Moment‑Interval. The document references Peter Coad’s four‑color modeling method to illustrate how static and temporal aspects of an entity are captured.
To achieve end‑to‑end traceability, a custom business trace mechanism is proposed. A global identifier (AID) serves as the trace ID linking client‑side events (captured before the archive is created) with server‑side events. The trace flow includes mapping clientTraceID → serverTraceID → AID.
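The two-hop mapping can be sketched as a small in-process store. The store and method names below are hypothetical; in practice this state would live in a shared service or cache rather than in memory.

```go
package main

import "fmt"

// TraceMapper stitches client-side spans (emitted before the archive
// exists) to the server-side trace and, finally, to the global AID.
type TraceMapper struct {
	clientToServer map[string]string // clientTraceID -> serverTraceID
	serverToAID    map[string]int64  // serverTraceID -> AID
}

func NewTraceMapper() *TraceMapper {
	return &TraceMapper{
		clientToServer: make(map[string]string),
		serverToAID:    make(map[string]int64),
	}
}

// BindServer is called when the upload request first reaches the server.
func (t *TraceMapper) BindServer(clientTraceID, serverTraceID string) {
	t.clientToServer[clientTraceID] = serverTraceID
}

// BindAID is called once the archive row exists and the AID is known.
func (t *TraceMapper) BindAID(serverTraceID string, aid int64) {
	t.serverToAID[serverTraceID] = aid
}

// Resolve walks clientTraceID -> serverTraceID -> AID.
func (t *TraceMapper) Resolve(clientTraceID string) (int64, bool) {
	sid, ok := t.clientToServer[clientTraceID]
	if !ok {
		return 0, false
	}
	aid, ok := t.serverToAID[sid]
	return aid, ok
}

func main() {
	m := NewTraceMapper()
	m.BindServer("c-123", "s-456")
	m.BindAID("s-456", 10001)
	aid, ok := m.Resolve("c-123")
	fmt.Println(aid, ok) // prints "10001 true"
}
```

Once the AID is bound, all earlier client-side spans can be retroactively attributed to the archive, which is what makes the trace truly end to end.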
Event collection is performed via an event‑streaming pipeline. All sub‑systems emit events that are aggregated into a real‑time event platform built on Kafka, Flink, and ClickHouse. This enables snapshot reconstruction, duration calculation, and statistical analysis (zoom‑out for aggregate metrics, zoom‑in for single‑archive trace).
Sample tracing code (illustrating OpenTracing‑style span creation) is shown below:
```go
// Client-side spans: each production step opens as a child of the
// previous step, forming a chain rooted at the archive-creation span.
parentSpan := tracer.StartSpan("稿件创作") // step 1: archive creation (root)
parentSpan = tracer.StartSpan("获取素材", opentracing.ChildOf(parentSpan.Context())) // step 2: fetch material
parentSpan = tracer.StartSpan("上传视频", opentracing.ChildOf(parentSpan.Context())) // step 3: upload video
parentSpan = tracer.StartSpan("稿件预检", opentracing.ChildOf(parentSpan.Context())) // step 4: archive pre-check
parentSpan = tracer.StartSpan("提交稿件", opentracing.ChildOf(parentSpan.Context())) // step 5: submit archive
```

On the server side, the trace IDs are mapped and the event is reported:
```go
ts.Trace(c,
	ts.EventTraceName("审核通过"), // "audit passed"
	ts.EventKey(te.CreativeArchiveAuditEditEvent),
	ts.EventNodeType(ts.Start),
	ts.EventLevel(ts.Info),
	ts.AID(aid),
)
```

The platform provides real-time dashboards for event-latency percentiles, event-loss detection, and safety-audit reconciliation. Alerting integrates with a monitoring system (Prometheus) and an alarm platform to notify the responsible teams of anomalies such as missing safety reviews or delayed releases.
Statistical queries (e.g., using window functions) are used to compute per‑archive durations and percentile distributions. Example SQL snippet:
```sql
SELECT
    aid,
    lagInFrame(event_time) OVER win AS previous_time,
    event_time
FROM xxx
WHERE xxx
WINDOW win AS (PARTITION BY aid ORDER BY event_time, event_key)
```

Finally, the document concludes that a global event-sourcing platform gives precise insight into production efficiency, supports cross-domain observability, and can be extended to business scenarios beyond content production.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.