Design and Implementation of a Business System Trace and Log Reporting Tool
This article presents the challenges of complex business systems, compares distributed tracing and traditional ELK solutions, and details the design, integration steps, usage workflow, and future enhancements of a lightweight SDK-based trace and log reporting platform that improves debugging efficiency and reduces operational overhead.
Business System Challenges
The storefront guide channel page system relies on multiple middle-platform capabilities, so its business logic and overall complexity keep growing. As traffic rises, developers spend more time on operations and troubleshooting, which makes restoring service quickly a key challenge.
Accurate business data tracing and rapid issue investigation therefore become critical. They require tools that record the entire execution process, so that the original scene of a failure can be reconstructed and the fault analyzed and located precisely.
Horizontal Product Research
Two mainstream approaches for business tracing are examined: distributed tracing systems (e.g., SkyWalking, Pinpoint) and log‑based ELK solutions.
The following outlines their usage scenarios and drawbacks.
2.1 Distributed Tracing Systems
The core principle is to link calls across servers with a common TraceId. For example, a user request that flows from Application A to B and C, and then on to D and E, forms a directed acyclic graph.
Distributed tracing collects logs at a configured sampling rate, writes them to data files, and pipelines them into BigTable keyed by TraceId, which gives full-chain visibility across services. However, the log volume is large and maintenance costs are high, which makes the approach hard to scale and hard to adapt to evolving business needs.
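The TraceId principle can be sketched in a few lines. This is only an illustration, not how SkyWalking or Pinpoint are implemented: the names `TraceContext` and `CallChainDemo` are hypothetical, the exact topology (A calls B and C, C calls D and E) is an assumed reading of the example above, and a single thread stands in for the separate servers. In a real deployment the id would travel between processes in request metadata rather than in a `ThreadLocal`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical illustration: every hop reuses the TraceId created at the entry point.
final class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Start a new trace at the entry application, or reuse the one already set.
    static String getOrCreate() {
        String id = TRACE_ID.get();
        if (id == null) {
            id = UUID.randomUUID().toString();
            TRACE_ID.set(id);
        }
        return id;
    }

    static void clear() {
        TRACE_ID.remove();
    }
}

final class CallChainDemo {
    // Collected spans, each rendered as "traceId:caller->callee".
    static final List<String> SPANS = new ArrayList<>();

    static void call(String caller, String callee) {
        SPANS.add(TraceContext.getOrCreate() + ":" + caller + "->" + callee);
    }

    // Assumed topology: A calls B and C; C then calls D and E.
    static List<String> run() {
        SPANS.clear();
        TraceContext.clear();
        call("A", "B");
        call("A", "C");
        call("C", "D");
        call("C", "E");
        return new ArrayList<>(SPANS);
    }
}
```

Because every span carries the same id, grouping the stored spans by TraceId is enough to rebuild the whole graph for one request.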
2.2 Traditional ELK Log System
ELK requires developers to log extensively, then filter logs in Elasticsearch to reconstruct execution scenes. Its drawbacks include complex environment setup, cumbersome log collection, difficulty filtering overlapping logs, and time‑consuming manual analysis.
Both approaches have limitations, prompting the design of a hybrid solution that combines their strengths, uses timestamps and unique link identifiers for precise filtering, and leverages business attributes for accurate data selection.
Design Philosophy
High System Stability: Independent thread pool with discard-when-full policy; asynchronous reporting via MQ.
Low Integration Cost: SDK package with thread pool and messaging; annotation, AOP, and manual reporting options.
Traceability: Integration with JD pfinder for full-link identification.
Visualization & Data Isolation: Micro-application capability for isolated visual dashboards and customized pages.
Instant Notification: JD me instant messaging integrated with micro-application user settings.
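The stability point in the list above (an independent thread pool that discards when full) maps onto standard JDK building blocks. The following is a minimal sketch under stated assumptions, not the SDK's actual code: `AsyncReporter` is a hypothetical name, the pool and queue sizes are arbitrary, and the `Runnable` passed to `report` stands in for publishing a message to the MQ.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reporter: the pool is separate from business threads, and
// ThreadPoolExecutor.DiscardPolicy silently drops tasks once the queue is full.
final class AsyncReporter {
    private final ThreadPoolExecutor pool;
    private final AtomicInteger sent = new AtomicInteger();

    AsyncReporter(int threads, int queueCapacity) {
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.DiscardPolicy());
    }

    // Never blocks or throws on overload; the business call returns immediately.
    void report(Runnable send) {
        pool.execute(() -> {
            send.run();            // stand-in for publishing to the MQ
            sent.incrementAndGet();
        });
    }

    int sentCount() {
        return sent.get();
    }

    // Stop accepting work and wait briefly for queued reports to drain.
    void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The trade-off is deliberate: under load some reports are lost, but the business request path is never slowed down or failed by the reporting tool.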
[Figure: overall architecture diagram]
The design enables precise data reporting, reduces integration overhead via an independent SDK, and supports configurable, low‑intrusion deployment.
Usage Workflow
The end‑to‑end process consists of five steps, with blue parts indicating business‑system actions and red parts indicating micro‑application capabilities.
Step 1: Create a micro‑application and obtain a unique agentId.
(1) Application creation for data isolation and visual page setup.
(2) Retrieve agentId for data isolation.
Step 2: Add the reporting SDK.
Insert the Maven dependency into pom.xml and set tools.version to 2.0.0-SNAPSHOT.
<dependency>
    <groupId>com.jd.tools.log</groupId>
    <artifactId>tools-api</artifactId>
    <version>${tools.version}</version>
</dependency>
Step 3: Configure basic settings.
Set the system code (agentId) in *.properties or *.yml files.
In *.properties:
lc.systemCode=${agentId}
In *.yml:
lc:
  systemCode: ${agentId}
Step 4: Report business data using one of three methods.
| Reporting Method | Advantages | Disadvantages | Use Cases |
| --- | --- | --- | --- |
| Annotation | Flexible, no code, simple | Manual on key methods, not for private methods, single format | New systems, fine-grained method splitting |
| API | Flexible, controllable format, high portability | More code, invasive | Both new and legacy systems, need to report private methods |
| AOP | Path interception, simple config, wide coverage | Private methods unsupported, single format, cannot identify key points | Full-scope data reporting without focusing on specific points |
Combining methods can improve data quality.
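To make the difference between the reporting styles compared above concrete, here is a self-contained sketch that combines the annotation style with the manual API style. Everything in it is hypothetical: the article does not give the SDK's real annotation or method names, so `TraceReport`, `ReportApi`, and `ReportingProxy` are invented for illustration, and a JDK dynamic proxy is used in place of whatever interception mechanism the SDK actually relies on.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical marker annotation; the real SDK's annotation name is not given in the article.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface TraceReport {
    String module();
}

// Hypothetical manual API: the caller decides what to report and in which format.
final class ReportApi {
    static final List<String> REPORTED = new ArrayList<>();

    static void report(String module, String message) {
        REPORTED.add(module + "|" + message);
    }
}

interface OrderService {
    @TraceReport(module = "order")
    String create(String pin);

    String ping(); // not annotated, so the proxy does not report it
}

final class OrderServiceImpl implements OrderService {
    public String create(String pin) {
        // API style: works anywhere, including inside private helpers.
        ReportApi.report("order", "validated pin=" + pin);
        return "order-for-" + pin;
    }

    public String ping() {
        return "pong";
    }
}

final class ReportingProxy {
    // Annotation style: intercept annotated interface methods and report
    // arguments and result in one fixed format.
    static OrderService wrap(OrderService target) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object result = method.invoke(target, args);
            TraceReport meta = method.getAnnotation(TraceReport.class);
            if (meta != null) {
                ReportApi.report(meta.module(),
                        method.getName() + Arrays.toString(args) + "=" + result);
            }
            return result;
        };
        return (OrderService) Proxy.newProxyInstance(
                OrderService.class.getClassLoader(),
                new Class<?>[] {OrderService.class}, handler);
    }
}
```

Calling `ReportingProxy.wrap(new OrderServiceImpl()).create("u1")` records two entries: the hand-written one from inside the method and the fixed-format one from the interceptor. The sketch also hints at why interception-based styles cannot cover private methods: only calls that pass through the proxy are visible to it. An AOP-style setup would apply the same kind of handler to every method matching a path instead of only annotated ones.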
Step 5: View reports in the micro‑application.
Reports contain custom system fields, business fields (account, channelOrigin, className, createdDate, methodName, module), systemCode, traceId, and key business data.
Common Questions
1. Does integration significantly impact performance? – The solution uses an independent thread pool and MQ with a discard‑when‑full policy to minimize impact.
2. How to quickly locate a problematic link for a user in a specific time window? – Use time + user PIN to filter data, then traceId to reconstruct the full call chain.
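The two-step lookup in the second answer can be sketched against an in-memory list. This is an assumption-laden illustration rather than the platform's query layer: `ReportEntry` and `TraceLookup` are hypothetical names, only four of the report fields are modeled, and the `account` field is assumed to hold the user PIN.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical in-memory view of reported rows; field names follow the report fields.
record ReportEntry(String account, String traceId, Instant createdDate, String methodName) {}

final class TraceLookup {
    // Step 1: time window [from, to) plus user PIN narrows the data to candidate traceIds.
    static Set<String> traceIdsFor(List<ReportEntry> all, String pin, Instant from, Instant to) {
        return all.stream()
                .filter(e -> pin.equals(e.account()))
                .filter(e -> !e.createdDate().isBefore(from) && e.createdDate().isBefore(to))
                .map(ReportEntry::traceId)
                .collect(Collectors.toSet());
    }

    // Step 2: one traceId pulls back every record of that call chain, in time order,
    // including records that carry no user PIN themselves.
    static List<String> callChain(List<ReportEntry> all, String traceId) {
        return all.stream()
                .filter(e -> traceId.equals(e.traceId()))
                .sorted(Comparator.comparing(ReportEntry::createdDate))
                .map(ReportEntry::methodName)
                .collect(Collectors.toList());
    }
}
```

The point of the second step is that downstream records often lack the business attribute used in the first filter, so the traceId is what ties them back to the user's request.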
Dada Group Technology
Sharing insights and experiences from Dada Group's R&D department on product refinement and technology advancement, connecting with fellow geeks to exchange ideas and grow together.