
How to Build a Robust Log Analysis System for Stable Microservices

In microservice and distributed architectures, a single request can span many services. This article explains how to design a comprehensive log analysis system that supports system stability, covering collection, storage, and consumption, the key data points to record, collection methods, and practical use cases such as automated test generation, issue localization, and real-time exception monitoring.

Taobao Frontend Technology

Mar 18, 2021


Introduction

In a microservice or distributed architecture, even a simple interface call may involve multiple business service units and servers. How can we understand the running state of business systems accurately, efficiently, and in real time, keep them stable, and quickly locate online anomalies? This article discusses one answer: building a complete log analysis system to safeguard business system stability.

Note: The discussion is limited to business‑service stability and does not cover server‑level (e.g., network, CPU) stability.

Infrastructure

To understand system operation, we first need to collect data that reflects it; logs are an excellent carrier. Collect dispersed logs from all servers in a consistent way and consume them as needed. This basic log analysis pipeline requires three core services:

Log collection service: gathers logs from the various servers and processes the raw log data. It typically consists of an agent deployed alongside each application service and a backend service that handles further processing and storage.

Log storage service: persists the collected log information.

Log consumption service: consumes log information according to specific needs.

Key guarantees for a complete log analysis product include:

Real‑time collection of information.

Reliable log collection, so that logs are gathered successfully rather than lost.

Elastic scaling of storage.

Support for text queries and complex analytical queries.

Many mature log analysis products are available, such as the widely used open-source ELK stack (Elasticsearch, Logstash, Kibana) and Prometheus, which is primarily a metrics monitoring system. Cloud providers also offer their own log services, often integrated with their other services into a productized solution. Comparative articles are easy to find (e.g., Prometheus vs. ELK, SLS vs. ELK).

Recommended Standards

Key Collection Information

Commonly collected business‑system data includes:

Information about calls into this system and calls from it to external dependencies.

Details of each call:

Unique flow identifier (to link the entire call chain).

Request start time, end time, and duration.

Interface protocol, request method, and interface name.

Input and output parameters (potentially a large volume of data, depending on request scale).

Caller information and other useful context.

Relevant business information.
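As a rough illustration of the fields listed above, a single call record might be assembled as follows. The field names (flowId, biz, and so on) and the buildCallRecord helper are assumptions made for this sketch, not a format prescribed by any particular log product.

```javascript
// Hypothetical helper: builds one structured log record for a single call.
function buildCallRecord({ flowId, protocol, method, name, input, output, caller, biz, startTime, endTime }) {
  return {
    flowId,                          // unique flow identifier linking the call chain
    startTime,                       // epoch milliseconds
    endTime,
    durationMs: endTime - startTime,
    protocol,                        // e.g. 'http' or 'rpc'
    method,                          // e.g. 'POST'
    name,                            // interface name
    input,
    output,
    caller,
    biz,                             // business identifiers such as document numbers
  };
}

const record = buildCallRecord({
  flowId: 'flow-001',
  protocol: 'http',
  method: 'POST',
  name: '/api/order/create',
  input: { sku: 'A1', count: 2 },
  output: { success: true, orderNo: 'NO-1' },
  caller: 'web-client',
  biz: { orderNo: 'NO-1' },
  startTime: 1000,
  endTime: 1042,
});

// One JSON object per line keeps the record easy for an agent to ship and parse.
console.log(JSON.stringify(record));
```

Storing the duration alongside the start and end times is redundant, but it lets consumers filter on latency without recomputing it.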

Collection Methods

Use aspect‑oriented programming to uniformly capture key interface information and exceptions, centralizing logging policies and reducing duplicate logging code.

Explicitly print special business‑logic information where needed, including the unique flow identifier to associate it with a specific request.

Attach business identifiers (e.g., document numbers) to the flow context so that aspects can retrieve them when logging.
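The three methods above can be sketched together in Node.js. This is a minimal illustration, assuming the flow context is carried by AsyncLocalStorage from node:async_hooks; the withCallLog wrapper, the attachBiz helper, and the in-memory sink are hypothetical names introduced here, and a real project would more likely use its framework's interceptor or decorator mechanism.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

// Per-request flow context: holds the flow identifier and business identifiers.
const flowContext = new AsyncLocalStorage();
const sink = []; // stand-in for a real logger, so the sketch is self-contained

// Hypothetical "aspect": wraps an async function and logs every call uniformly.
function withCallLog(name, fn) {
  return async (...args) => {
    const ctx = flowContext.getStore() || {};
    const startTime = Date.now();
    const entry = { flowId: ctx.flowId, name, input: args, startTime };
    try {
      entry.output = await fn(...args);
      entry.success = true;
      return entry.output;
    } catch (err) {
      entry.success = false;
      entry.error = err.message;
      throw err;
    } finally {
      entry.endTime = Date.now();
      entry.durationMs = entry.endTime - startTime;
      // Read at the end, so identifiers attached during the call are included.
      entry.biz = { ...ctx.biz };
      sink.push(entry);
    }
  };
}

// Business code attaches identifiers to the flow context instead of logging them itself.
function attachBiz(key, value) {
  const ctx = flowContext.getStore();
  if (ctx) ctx.biz[key] = value;
}

const createOrder = withCallLog('order.create', async (sku) => {
  attachBiz('orderNo', 'NO-1');
  return { orderNo: 'NO-1', sku };
});

// Each incoming request would run inside its own context.
const done = flowContext.run({ flowId: 'flow-001', biz: {} }, () => createOrder('A1'));
```

Because the wrapper reads the context rather than receiving it as an argument, business functions keep their original signatures.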

Log Retrieval Methods

Logs are first written to local disk; an agent monitors the files and reports data. Ensure proper log rotation and cleanup to avoid disk exhaustion.

Application services can output logs directly to a log service (e.g., Aliyun Log Log4j Appender).
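For the local-disk route, the rotation concern can be illustrated with a small size-based sketch. The appendWithRotation function, the single ".1" backup, and the 100-byte limit are assumptions chosen to keep the example short; in practice rotation is usually delegated to a tool such as logrotate or to the logging library's rolling appender.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Appends one JSON line per record. When the file would exceed maxBytes it is
// renamed to <file>.1 (replacing any older backup) and a fresh file is started.
function appendWithRotation(file, record, maxBytes) {
  const line = JSON.stringify(record) + '\n';
  const size = fs.existsSync(file) ? fs.statSync(file).size : 0;
  if (size > 0 && size + Buffer.byteLength(line) > maxBytes) {
    fs.renameSync(file, file + '.1');
  }
  fs.appendFileSync(file, line);
}

// Write a few records into a temporary directory with a deliberately tiny limit.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'applog-'));
const logFile = path.join(dir, 'app.log');
for (let i = 0; i < 5; i++) {
  appendWithRotation(logFile, { flowId: 'flow-' + i, name: 'order.create' }, 100);
}
```

Keeping only one backup bounds disk usage at roughly twice the size limit, at the cost of dropping older data that the agent has not yet shipped.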

Use Cases

Automated Test Case Generation

Test cases verify that system logic is correct, but writing them by hand is labor-intensive. A hand-written test typically follows the four steps below, and with the log data described above, each step can be filled in automatically from a recorded flow.

describe('some ut', () => {
  it('should be that', async () => {
    // 1. MOCK external service calls
    // 2. Prepare input and environment
    // 3. Invoke test interface
    // 4. Assert results
  });
});

The automation takes just the unique flow identifier as input; the system retrieves the corresponding request and response data from log storage and generates test code from a predefined template such as the following.

describe('auto ut', () => {
  it('should <%= url %> <%= method %> OK', async () => {
    // mock environment
    const ctx = app.mockContext({
      // environment parameters (user, etc.)
    });

    // mock dependent service calls
    mm.classMethod(<%= service %>, '<%= method %>', (...args) => {
      if (/* args match input1 */) {
        // return response1
      }
      // a call may be made with several different inputs
      if (/* args match input2 */) {
        // return response2
      }
    });

    // http interface test
    await app
      .httpRequest()
      .<%= method %>('<%= url %>')
      .query(<%= query %>)
      .set('referer', '<%= referrer %>')
      .send(<%= input %>)
      .expect(<%= statusCode %>)
      .expect((res) => {
        // result assertion
      });

    // service test
    const service = await ctx.requestContext.getAsync('<%= serviceName %>');
    const rst = await service.<%= method %>(<%= input %>);
    // result assertion
  });
});

The only manual step is providing the unique flow identifier.
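The generation step itself can be sketched as a lookup followed by template rendering. This is only an illustration: the in-memory logStore, the record fields, and the render function are assumptions, and a real generator would likely query the log storage service and use a template engine such as EJS instead of the simple placeholder substitution shown here.

```javascript
// Stand-in for the log storage service: records looked up by flow identifier.
const logStore = [
  { flowId: 'flow-001', type: 'http', url: '/api/order/create', method: 'post',
    input: { sku: 'A1' }, statusCode: 200 },
];

const template = [
  "describe('auto ut', () => {",
  "  it('should <%= url %> <%= method %> OK', async () => {",
  "    await app.httpRequest()",
  "      .<%= method %>('<%= url %>')",
  "      .send(<%= input %>)",
  "      .expect(<%= statusCode %>);",
  "  });",
  "});",
].join('\n');

// Minimal placeholder substitution: strings are inserted as-is, everything else as JSON.
function render(tpl, data) {
  return tpl.replace(/<%=\s*(\w+)\s*%>/g, (match, key) => {
    if (!(key in data)) throw new Error('missing template value: ' + key);
    const value = data[key];
    return typeof value === 'string' ? value : JSON.stringify(value);
  });
}

// The flow identifier is the only input.
function generateTest(flowId) {
  const record = logStore.find((r) => r.flowId === flowId && r.type === 'http');
  if (!record) throw new Error('no http record for flow ' + flowId);
  return render(template, record);
}

const generated = generateTest('flow-001');
console.log(generated);
```

Failing loudly on a missing template value is deliberate: a silently empty placeholder would produce a test that looks valid but asserts nothing useful.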

Online Issue Localization

Effective debugging requires reconstructing the execution chain. By linking logs with the unique flow identifier, we obtain input/output parameters of the service, external calls, and context information. Comparing these against expectations pinpoints the problematic step.
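A minimal sketch of that reconstruction, assuming each record carries the flowId, startTime, and success fields used here: filter the logs by flow identifier, order them by start time, and look at the failed calls. The firstSuspect heuristic is an assumption for illustration, not a rule from the article.

```javascript
// Sample records as they might come back from log storage, in arbitrary order.
const logs = [
  { flowId: 'flow-001', name: 'inventory.lock', startTime: 1010, endTime: 1030, success: false },
  { flowId: 'flow-002', name: '/api/order/query', startTime: 1005, endTime: 1008, success: true },
  { flowId: 'flow-001', name: '/api/order/create', startTime: 1000, endTime: 1042, success: false },
  { flowId: 'flow-001', name: 'user.get', startTime: 1002, endTime: 1009, success: true },
];

// Returns the calls of one flow in start-time order (the input array is not modified).
function reconstructChain(records, flowId) {
  return records
    .filter((r) => r.flowId === flowId)
    .sort((a, b) => a.startTime - b.startTime);
}

// Heuristic: the failed call that started last is often the deepest failure,
// so it is a reasonable first place to look.
function firstSuspect(chain) {
  const failed = chain.filter((r) => !r.success);
  return failed.length ? failed[failed.length - 1] : null;
}

const chain = reconstructChain(logs, 'flow-001');
const suspect = firstSuspect(chain);
console.log(chain.map((r) => r.name), suspect && suspect.name);
```

Here the entry call fails because a downstream call failed, so the suspect is the downstream call rather than the entry point.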

System Exception Monitoring

Exceptions can be detected in real time by monitoring standardized response fields (e.g., success flag, error code) in the collected logs. An alerting mechanism can then notify the team when the number of failure logs exceeds a threshold. This only works if the system follows a consistent error-code convention.
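The threshold check can be sketched as a count of failed records per interface within a time window. The findAlerts function, the window length, and the threshold value are assumptions for illustration; a production setup would normally express the same rule in the log service's own query and alerting features.

```javascript
// Counts failed records per interface inside a time window and reports
// the interfaces whose failure count is at or above the threshold.
function findAlerts(records, { now, windowMs, threshold }) {
  const failures = new Map();
  for (const r of records) {
    const inWindow = r.endTime > now - windowMs && r.endTime <= now;
    if (inWindow && r.success === false) {
      failures.set(r.name, (failures.get(r.name) || 0) + 1);
    }
  }
  return [...failures]
    .filter(([, count]) => count >= threshold)
    .map(([name, count]) => ({ name, count }));
}

const recent = [
  { name: 'order.create', endTime: 9500, success: false, errorCode: 'STOCK_EMPTY' },
  { name: 'order.create', endTime: 9800, success: false, errorCode: 'STOCK_EMPTY' },
  { name: 'order.create', endTime: 9900, success: true },
  { name: 'order.query', endTime: 9900, success: false, errorCode: 'TIMEOUT' },
  { name: 'order.create', endTime: 1000, success: false, errorCode: 'STOCK_EMPTY' }, // outside the window
];

const alerts = findAlerts(recent, { now: 10000, windowMs: 5000, threshold: 2 });
console.log(alerts);
```

The check depends entirely on every interface reporting failure through the same success field, which is why the consistent convention matters.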

Other Applications

Analyzing log data reveals interface latency, which makes slow endpoints easy to identify and optimize. Full-chain logs also show how time is split between internal processing and external calls, and how often each call is made, which points to opportunities for batching requests and removing redundant calls.
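Both analyses can be sketched over the same call records. The nearest-rank percentile, the function names, and the assumption that external calls run one after another without overlapping are simplifications made for this example.

```javascript
// Nearest-rank percentile over a list of durations in milliseconds.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Groups durations by interface name and reports call count and p95 latency.
function latencyByInterface(records) {
  const groups = new Map();
  for (const r of records) {
    if (!groups.has(r.name)) groups.set(r.name, []);
    groups.get(r.name).push(r.endTime - r.startTime);
  }
  return [...groups].map(([name, durations]) => ({
    name,
    calls: durations.length,
    p95: percentile(durations, 95),
  }));
}

// Splits an entry call's duration into external-call time and internal time,
// assuming the external calls ran sequentially (no overlap).
function timeComposition(entry, externalCalls) {
  const total = entry.endTime - entry.startTime;
  const external = externalCalls.reduce((sum, c) => sum + (c.endTime - c.startTime), 0);
  return { total, external, internal: total - external };
}

const entryCall = { name: '/api/order/create', startTime: 1000, endTime: 1042 };
const externalCalls = [
  { name: 'user.get', startTime: 1002, endTime: 1009 },
  { name: 'inventory.lock', startTime: 1010, endTime: 1030 },
];

const composition = timeComposition(entryCall, externalCalls);
const stats = latencyByInterface([entryCall, ...externalCalls]);
console.log(composition, stats);
```

If external calls run in parallel, summing their durations overstates external time, so the split would need to merge overlapping intervals first.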

User‑related log information can also inform product decisions by exposing usage patterns.
