Spring Cloud Microservices Series #7: Implementing Distributed Tracing with SkyWalking

This article explains why distributed tracing is essential for Spring Cloud microservices, introduces SkyWalking’s core concepts, compares it with other tracing tools, shows how to deploy SkyWalking via Docker Compose, integrate the Java agent, and use the UI to analyze performance, errors, and alerts.

Coder Trainee

1. Need for Distributed Tracing

A request often traverses three to five microservices. When one step becomes slow or fails, locating the problem quickly is hard: logs are spread across different servers and cannot be correlated to a single request, the latency of each step is unknown, and service dependencies are unclear.

User request → Gateway → Article Service → User Service → Comment Service → Search Service
            ↓
            Which step is slow?
            Which step failed?

Traditional logging problems:

Log dispersion – each service writes logs on different servers.

Cannot correlate – logs for the same request are scattered.

Unknown latency – the bottleneck cannot be identified.

Unclear dependencies – service call relationships are not obvious.

1.1 What Tracing Can Do

┌─────────────────────────────────────────────────────────────────┐
│                     Trace ID: abc123                         │
├─────────────────────────────────────────────────────────────────┤
│ Span 1: Gateway (0ms - 150ms)                                 │
│   └── Span 2: Article Service (10ms - 130ms)                  │
│        └── Span 3: User Service (20ms - 50ms)                 │
│        └── Span 4: Comment Service (55ms - 120ms)            │
│               └── Span 5: Database (60ms - 110ms)             │
└─────────────────────────────────────────────────────────────────┘

Core concepts:

Trace: the complete call chain of one request.

Span: a single unit of work within a trace, such as a service call or a database query.

Trace ID: a globally unique identifier shared by every span in the trace.

Span ID: a unique identifier for each operation.
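The relationship between these concepts can be sketched as a small data model. This is a simplified illustration, not SkyWalking's internal representation: the `Span` record, its field names, and the `render` helper are assumptions made for this example, and the timings are the ones from the diagram above.

```java
import java.util.List;

public class TraceModelSketch {

    // One operation inside a trace; parentSpanId is -1 for the root span.
    public record Span(String traceId, int spanId, int parentSpanId,
                       String name, long startMs, long endMs) {
        public long durationMs() {
            return endMs - startMs;
        }
    }

    // Renders the spans as an indented tree, each child below its parent.
    public static String render(List<Span> spans, int parentId, int depth) {
        StringBuilder out = new StringBuilder();
        for (Span s : spans) {
            if (s.parentSpanId() == parentId) {
                out.append("  ".repeat(depth))
                   .append(s.name())
                   .append(" (").append(s.durationMs()).append("ms)\n");
                out.append(render(spans, s.spanId(), depth + 1));
            }
        }
        return out.toString();
    }

    // The example trace from the diagram: every span shares one trace ID.
    public static List<Span> exampleTrace() {
        String traceId = "abc123";
        return List.of(
            new Span(traceId, 1, -1, "Gateway", 0, 150),
            new Span(traceId, 2, 1, "Article Service", 10, 130),
            new Span(traceId, 3, 2, "User Service", 20, 50),
            new Span(traceId, 4, 2, "Comment Service", 55, 120),
            new Span(traceId, 5, 4, "Database", 60, 110));
    }

    public static void main(String[] args) {
        System.out.print(render(exampleTrace(), -1, 0));
    }
}
```

The shared trace ID is what ties log lines and spans from different services to the same request, while the parent span ID is what lets a UI rebuild the call tree.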

2. SkyWalking Overview

2.1 Why Choose SkyWalking?

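Storage: SkyWalking supports Elasticsearch, MySQL, and TiDB; Zipkin is typically backed by Elasticsearch, Jaeger by Elasticsearch or Cassandra, and Pinpoint by HBase.

Performance overhead: SkyWalking is generally rated low. Pinpoint, which also instruments byte code through an agent, is usually rated the heaviest of the four, with Zipkin and Jaeger in between. Actual overhead depends on the sampling rate and the plugins enabled.

Invasiveness: SkyWalking and Pinpoint attach as Java agents and need no code changes, while Zipkin and Jaeger are normally integrated through client libraries or SDKs added to the application.

UI features: SkyWalking provides a rich UI with built-in alerting and Chinese documentation; the Zipkin UI is basic, while the Jaeger and Pinpoint UIs are rich.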

Core advantages:

Non-intrusive: runs as a Java agent, no code changes required.

Good performance: minimal impact on business logic.

All-in-one: tracing, metrics, and alerting integrated.

Open source: an Apache project that originated in China, with good Chinese documentation.

2.2 Architecture

┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│  User Service   │   │ Article Service │   │ Comment Service │
│     + Agent     │   │     + Agent     │   │     + Agent     │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         └─────────────────────┼─────────────────────┘
                               ▼
                      ┌─────────────────┐
                      │   OAP Server    │ (analysis, aggregation)
                      └────────┬────────┘
                               │
           ┌───────────────────┼───────────────────┐
           ▼                   ▼                   ▼
   ┌───────────────┐   ┌───────────────┐   ┌───────────────┐
   │ Elasticsearch │   │     MySQL     │   │   Alerting    │
   │   (storage)   │   │               │   │               │
   └───────────────┘   └───────────────┘   └───────────────┘
                               │
                               ▼
                      ┌─────────────────┐
                      │   UI Service    │ (visualization)
                      └─────────────────┘

3. Deploying SkyWalking

3.1 One‑click Docker Compose Deployment

# docker-compose.yml
version: '3.8'
services:
  elasticsearch:
    image: elasticsearch:7.17.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - es-data:/usr/share/elasticsearch/data

  oap:
    image: apache/skywalking-oap-server:9.5.0
    container_name: oap
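    depends_on:
      - elasticsearch
    # depends_on only orders container startup; the restart policy lets OAP retry until Elasticsearch accepts connections
    restart: always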
    environment:
      - SW_STORAGE=elasticsearch
      - SW_STORAGE_ES_CLUSTER_NODES=elasticsearch:9200
    ports:
      - "11800:11800"   # gRPC (Agent reporting)
      - "12800:12800"   # HTTP (UI calls)
    volumes:
      - oap-config:/skywalking/config

  ui:
    image: apache/skywalking-ui:9.5.0
    container_name: ui
    depends_on:
      - oap
    environment:
      - SW_OAP_ADDRESS=http://oap:12800
    ports:
      - "8088:8080"   # UI access port

volumes:
  es-data:
  oap-config:

# Start services
docker-compose up -d

# Access UI
http://localhost:8088

3.2 Java Agent Download

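# Download Agent
# The Java agent is released separately from the OAP server, under java-agent/<version>/,
# and has its own version number. 9.0.0 below is an example; check the SkyWalking
# downloads page for the version that suits your OAP server.
AGENT_VERSION=9.0.0
wget https://archive.apache.org/dist/skywalking/java-agent/${AGENT_VERSION}/apache-skywalking-java-agent-${AGENT_VERSION}.tgz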

# Extract
tar -zxvf apache-skywalking-java-agent-*.tgz

# Directory layout
skywalking-agent/
├── skywalking-agent.jar
├── config/
│   └── agent.config
└── plugins/
    ├── apm-spring-cloud-gateway-plugin
    ├── apm-feign-default-http-9.x-plugin
    ├── apm-mysql-8.x-plugin
    └── ...

4. Service Integration with SkyWalking

4.1 JVM Parameters

# Start service with agent
java -javaagent:/path/to/skywalking-agent/skywalking-agent.jar \
     -Dskywalking.agent.service_name=user-service \
     -Dskywalking.collector.backend_service=localhost:11800 \
     -jar user-service.jar

4.2 Docker Integration

# In docker-compose.yml for a service
user-service:
  build: ./user-service
  environment:
    - JAVA_TOOL_OPTIONS=-javaagent:/skywalking-agent/skywalking-agent.jar
    - SW_AGENT_NAME=user-service
    - SW_AGENT_COLLECTOR_BACKEND_SERVICES=oap:11800
  volumes:
    - ./skywalking-agent:/skywalking-agent

4.3 IDE (IntelliJ) Integration

# VM options
-javaagent:/path/to/skywalking-agent/skywalking-agent.jar
-Dskywalking.agent.service_name=user-service
-Dskywalking.collector.backend_service=localhost:11800

4.4 Agent Configuration (agent.config)

# Service name
agent.service_name=${SW_AGENT_NAME:user-service}

# OAP address
collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:localhost:11800}

# Sampling: maximum number of traces sampled per 3 seconds (zero or negative = sample everything)
agent.sample_n_per_3_secs=10

# Paths to ignore
agent.ignore_suffix=.jpg,.jpeg,.png,.css,.js

# Log level
logging.level=${SW_LOGGING_LEVEL:INFO}

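# Plugin settings
# (recent agents name the SQL option plugin.jdbc.trace_sql_parameters; older releases used plugin.mysql.trace_sql_parameters)
plugin.jdbc.trace_sql_parameters=true
plugin.springmvc.collect_http_params=true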
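
The `${NAME:default}` placeholders in agent.config mean "use the environment variable NAME if it is set, otherwise the default after the first colon". The following sketch illustrates that lookup rule only; `PlaceholderSketch` and `resolve` are hypothetical names, and this is not the agent's actual parser.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderSketch {

    // Matches ${NAME:default}; the default may itself contain colons (e.g. host:port).
    private static final Pattern PLACEHOLDER =
        Pattern.compile("\\$\\{([^:}]+):([^}]*)\\}");

    // Replaces every placeholder with the env value, or with its default when unset.
    public static String resolve(String value, Map<String, String> env) {
        Matcher m = PLACEHOLDER.matcher(value);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = env.getOrDefault(m.group(1), m.group(2));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "${SW_AGENT_COLLECTOR_BACKEND_SERVICES:localhost:11800}";
        // Uses the real process environment; prints the default when the variable is unset.
        System.out.println(resolve(line, System.getenv()));
    }
}
```

This is why the Docker setup in 4.2 only needs to set SW_AGENT_NAME and SW_AGENT_COLLECTOR_BACKEND_SERVICES: the same agent.config then works unchanged on a developer machine and inside a container.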

5. Tracing in Action

5.1 View Full Call Chain

After starting the services, open http://localhost:8088 to access the SkyWalking UI.

① Topology diagram shows service dependencies:

┌─────────┐     ┌─────────┐     ┌─────────┐
│ Gateway │────→│ Article │────→│  User   │
└─────────┘     │ Service │     │ Service │
                └─────────┘     └─────────┘
                     │
                     ▼
                 ┌─────────┐
                 │ Comment │
                 │ Service │
                 └─────────┘

② Trace list (example rows):

Trace ID: abc123, Endpoint: GET /api/article/1, Latency: 156ms, Status: Success

Trace ID: abc124, Endpoint: GET /api/article/2, Latency: 3021ms, Status: Slow

Trace ID: abc125, Endpoint: POST /api/comment, Latency: 500ms, Status: Failure

③ Detailed Span for Trace ID abc124:

Trace ID: abc124
Total Duration: 3021ms

Timeline:
├── gateway: /api/article/2 (0ms - 3021ms)
    ├── article-service: ArticleController.getArticle (5ms - 3015ms)
        ├── user-service: UserController.getUser (10ms - 2010ms)  ← slow here
        │   └── mysql: SELECT * FROM user (15ms - 2005ms)   ← SQL slow
        └── comment-service: CommentController.list (2015ms - 3010ms)
            └── mysql: SELECT * FROM comment (2020ms - 3005ms)
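
Reading a timeline like this amounts to finding the span with the largest self time, that is, its own duration minus the time spent in its direct children. The sketch below applies that calculation to the abc124 example. The `Span` record and helper names are assumptions for illustration, and the subtraction assumes child spans do not overlap, which holds for the serial calls shown here.

```java
import java.util.Comparator;
import java.util.List;

public class SelfTimeSketch {

    public record Span(int id, int parentId, String name, long startMs, long endMs) {
        public long durationMs() {
            return endMs - startMs;
        }
    }

    // Self time = own duration minus the summed durations of direct children.
    public static long selfTimeMs(Span span, List<Span> all) {
        long childTime = all.stream()
            .filter(s -> s.parentId() == span.id())
            .mapToLong(Span::durationMs)
            .sum();
        return span.durationMs() - childTime;
    }

    // Returns the span that spends the most time in its own work.
    public static Span slowest(List<Span> all) {
        return all.stream()
            .max(Comparator.comparingLong((Span s) -> selfTimeMs(s, all)))
            .orElseThrow();
    }

    // Spans of trace abc124, copied from the timeline above.
    public static List<Span> example() {
        return List.of(
            new Span(1, 0, "gateway: /api/article/2", 0, 3021),
            new Span(2, 1, "article-service: ArticleController.getArticle", 5, 3015),
            new Span(3, 2, "user-service: UserController.getUser", 10, 2010),
            new Span(4, 3, "mysql: SELECT * FROM user", 15, 2005),
            new Span(5, 2, "comment-service: CommentController.list", 2015, 3010),
            new Span(6, 5, "mysql: SELECT * FROM comment", 2020, 3005));
    }

    public static void main(String[] args) {
        List<Span> spans = example();
        Span worst = slowest(spans);
        System.out.println(worst.name() + " self time: " + selfTimeMs(worst, spans) + "ms");
    }
}
```

For this trace the gateway and service spans each contribute only a few milliseconds of their own, so the user-table query is where the time actually goes.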

5.2 Performance Analysis

Slow query identification (Top 10 Slow Spans):

1. user-service: SELECT * FROM user (2000ms)
2. comment-service: SELECT * FROM comment (1000ms)
3. article-service: ArticleController.getArticle (500ms)

Before optimization (serial calls):

Article Service → User Service (2000ms)
                → Comment Service (1000ms)
Total time: 3000ms

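After optimization (parallel calls):

// Start both remote calls at once instead of waiting for the first to finish
CompletableFuture<User> userFuture =
    CompletableFuture.supplyAsync(() -> userClient.getUser(id));
CompletableFuture<List<Comment>> commentFuture =
    CompletableFuture.supplyAsync(() -> commentClient.list(articleId));

User user = userFuture.join();                 // about 2000ms
List<Comment> comments = commentFuture.join(); // about 1000ms, already running in parallel
// Actual total time = max(2000, 1000) = 2000ms
// Note: supplyAsync runs the lambdas on other threads. To keep those calls in the same
// trace, wrap them with SupplierWrapper.of(...) from the apm-toolkit-trace dependency.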
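
The effect can be measured with a small standalone program. In this sketch `slowCall` is a hypothetical stand-in for the remote calls, implemented with `Thread.sleep`, and the delays are scaled down from the article's figures. A dedicated two-thread pool is used here as an assumption, since blocking calls are usually kept off the common pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelCallSketch {

    // Stand-in for a remote call: blocks for delayMs, then returns a label.
    public static String slowCall(String name, long delayMs) {
        try {
            Thread.sleep(delayMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return name;
    }

    // Runs both calls one after the other and returns the elapsed milliseconds.
    public static long serialMs(long userDelayMs, long commentDelayMs) {
        long start = System.nanoTime();
        slowCall("user", userDelayMs);
        slowCall("comments", commentDelayMs);
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Runs both calls concurrently and returns the elapsed milliseconds.
    public static long parallelMs(long userDelayMs, long commentDelayMs) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            long start = System.nanoTime();
            CompletableFuture<String> user =
                CompletableFuture.supplyAsync(() -> slowCall("user", userDelayMs), pool);
            CompletableFuture<String> comments =
                CompletableFuture.supplyAsync(() -> slowCall("comments", commentDelayMs), pool);
            CompletableFuture.allOf(user, comments).join();
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("serial:   " + serialMs(200, 100) + " ms");
        System.out.println("parallel: " + parallelMs(200, 100) + " ms");
    }
}
```

The serial figure should land near the sum of the two delays and the parallel figure near the larger one, which is the same pattern the trace timeline shows before and after the change.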

5.3 Error Analysis

# Error Trace
Trace ID: abc125
Status: HTTP 500 Internal Server Error

Span Detail:
├── article-service (0ms - 500ms)
    └── user-service (10ms - 300ms)
        └── Exception: NullPointerException at UserService.getUser:45

Stack Trace:
java.lang.NullPointerException
    at com.laok.service.UserService.getUser(UserService.java:45)
    at com.laok.controller.UserController.getUser(UserController.java:20)

6. Alert Configuration

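6.1 Built-in Alert Rules (alarm-settings.yml)

# Rules are declared under the top-level "rules" key
rules:
  # Service response time alert
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000   # milliseconds
    period: 3         # look at the last 3 minutes
    count: 3          # fire when 3 of those minutes breach the threshold
    message: "Response time of service {name} exceeded 1000ms"

  # Service success rate alert (service_sla is scaled by 100, so 9500 = 95%)
  service_sla_rule:
    metrics-name: service_sla
    op: "<"
    threshold: 9500
    period: 3
    count: 2
    message: "Success rate of service {name} fell below 95%"

  # Database slow query alert (count must not exceed period, or the rule can never fire)
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    op: ">"
    threshold: 500
    period: 3
    count: 3
    message: "Query time of database {name} exceeded 500ms"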
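
How `period` and `count` interact is easy to misread, so here is a minimal sketch of the evaluation as described for these rules: the metric yields one value per minute, the rule looks at the last `period` values, and it fires when at least `count` of them breach the threshold. `AlarmWindowSketch` and `shouldFire` are hypothetical names, only the ">" operator is modeled, and this is not the OAP implementation.

```java
public class AlarmWindowSketch {

    // values holds one metric value per minute, oldest first.
    // Only the most recent `period` entries take part in the check.
    public static boolean shouldFire(long[] values, int period, int count, long threshold) {
        int from = Math.max(0, values.length - period);
        int breaches = 0;
        for (int i = from; i < values.length; i++) {
            if (values[i] > threshold) {
                breaches++;
            }
        }
        return breaches >= count;
    }

    public static void main(String[] args) {
        long[] respTimeMs = {800, 1200, 1500, 1100};
        // Last 3 minutes are 1200, 1500, 1100: all above 1000, so the rule fires.
        System.out.println(shouldFire(respTimeMs, 3, 3, 1000));
    }
}
```

One consequence of this model is that a rule whose `count` is larger than its `period` can never fire, because the window never contains enough values to reach the count.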

6.2 DingTalk Alert Integration

# alarm-settings.yml (DingTalk hook)
# Key names follow the dingtalkHooks section of the SkyWalking 9.5 alarm documentation;
# later OAP releases restructured hook settings, so verify against your version.
# %s is replaced with the alarm message produced by the rule.
dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "markdown",
      "markdown": {
        "title": "SkyWalking Alert",
        "text": "## Service Alert\n\n- **Message**: %s"
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=xxx

7. Custom Instrumentation

7.1 Business Method Tracing

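import lombok.extern.slf4j.Slf4j;
import org.apache.skywalking.apm.toolkit.trace.Trace;
import org.springframework.stereotype.Service;

// @Trace comes from the org.apache.skywalking:apm-toolkit-trace dependency
@Service
@Slf4j
public class BusinessService {

    @Trace(operationName = "business:processOrder")
    public void processOrder(Long orderId) {
        // business logic
        doStep1();
        doStep2();
    }

    @Trace(operationName = "business:doStep1")
    private void doStep1() {
        // sub-step
    }

    @Trace(operationName = "business:doStep2")
    private void doStep2() {
        // sub-step
    }
}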

7.2 Parameter and Return Value Recording

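import org.apache.skywalking.apm.toolkit.trace.Tag;
import org.apache.skywalking.apm.toolkit.trace.Tags;
import org.apache.skywalking.apm.toolkit.trace.Trace;
import org.springframework.stereotype.Service;

// Records the first argument and the return value as span tags, using the
// @Tag/@Tags annotations from the apm-toolkit-trace dependency.
// UserQueryService and User are hypothetical project types for illustration.
@Service
public class UserQueryService {

    @Trace(operationName = "business:getUser")
    @Tags({
        @Tag(key = "params", value = "arg[0]"),      // first method argument
        @Tag(key = "result", value = "returnedObj")  // return value
    })
    public User getUser(Long id) {
        // business logic
        return new User(id);
    }
}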

7.3 Exception Recording

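import org.apache.skywalking.apm.toolkit.trace.ActiveSpan;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

@RestControllerAdvice
public class GlobalExceptionHandler {

    @ExceptionHandler(Exception.class)
    public Result handleException(Exception e) {
        // Mark the active span as failed and attach the exception to it
        ActiveSpan.error(e);
        // getMessage() can be null (e.g. for a NullPointerException), so convert defensively
        ActiveSpan.tag("error.message", String.valueOf(e.getMessage()));
        return Result.error("System error");
    }
}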

8. Log and Trace Correlation

8.1 Logback Pattern with TraceId

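<configuration>
    <!-- Requires the org.apache.skywalking:apm-toolkit-logback-1.x dependency; %tid prints the SkyWalking trace ID -->
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
            <layout class="org.apache.skywalking.apm.toolkit.log.logback.v1.x.TraceIdPatternLogbackLayout">
                <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] [%tid] %-5level %logger{36} - %msg%n</pattern>
            </layout>
        </encoder>
    </appender>
</configuration>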

8.2 gRPC Log Reporting

# agent.config additions
plugin.toolkit.log.grpc.reporter.server_host=${SW_GRPC_LOG_SERVER_HOST:localhost}
plugin.toolkit.log.grpc.reporter.server_port=${SW_GRPC_LOG_SERVER_PORT:11800}
plugin.toolkit.log.grpc.reporter.max_message_size=${SW_GRPC_LOG_MAX_MESSAGE_SIZE:10485760}

9. Common Issues and Pitfalls

9.1 Agent Not Effective (Service Not Visible)

Symptom: The UI shows no services.

Checklist:

Verify the javaagent path is correct.

Ensure the OAP address is reachable.

Check the agent logs (e.g., logs/skywalking-api.log).
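
The reachability item on this checklist can be tested from the machine that runs the service with a plain TCP connect. The sketch below assumes the ports from the Docker Compose file (11800 for agent reporting, 12800 for HTTP); `OapReachabilitySketch` and `canConnect` are hypothetical names. A successful connect only proves the port is open, not that the OAP server is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class OapReachabilitySketch {

    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    public static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Host defaults to localhost; pass e.g. "oap" when running inside the Compose network.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println("gRPC 11800 reachable: " + canConnect(host, 11800, 1000));
        System.out.println("HTTP 12800 reachable: " + canConnect(host, 12800, 1000));
    }
}
```

If the gRPC port is unreachable from inside a container, the usual culprit is a `localhost` backend address where the Compose service name (`oap:11800`) is needed.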

9.2 Incomplete Trace (Only Some Services Appear)

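Cause: Some plugins were not loaded.

Solution: Inspect the skywalking-agent/plugins/ directory and confirm the required plugins (e.g., apm-spring-cloud-gateway-plugin, apm-feign-default-http-9.x-plugin) are present. The Spring Cloud Gateway plugins ship in skywalking-agent/optional-plugins/ and must be copied into plugins/ before the gateway shows up in traces.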

9.3 Performance Impact After Integration

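Symptom: Services become slower.

Solution: Lower the sampling limit and ignore non-essential paths.

# agent.config: sample at most 1 trace every 3 seconds (a count per 3 seconds, not a percentage)
agent.sample_n_per_3_secs=1

# agent.config: ignore static resources by suffix
agent.ignore_suffix=.jpg,.jpeg,.png,.css,.js

# Ignore health-check paths: copy apm-trace-ignore-plugin from optional-plugins/ into plugins/,
# then add this line to config/apm-trace-ignore-plugin.config
trace.ignore_path=/actuator/**,/health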
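
Because agent.sample_n_per_3_secs is a count rather than a ratio, its effect depends on traffic: a limit of 1 keeps a large share of traces on a quiet service and a tiny share on a busy one. The following is a simplified model of such a fixed-window limiter, with an injectable clock so the behavior can be checked deterministically. `SamplerSketch` is a hypothetical class, not the agent's sampling code.

```java
import java.util.function.LongSupplier;

public class SamplerSketch {

    private static final long WINDOW_MS = 3000;

    private final int maxPerWindow;
    private final LongSupplier clockMs;
    private long windowStartMs;
    private int sampledInWindow;

    public SamplerSketch(int maxPerWindow, LongSupplier clockMs) {
        this.maxPerWindow = maxPerWindow;
        this.clockMs = clockMs;
        this.windowStartMs = clockMs.getAsLong();
    }

    // Returns true if the current request should be traced.
    public synchronized boolean trySample() {
        if (maxPerWindow <= 0) {
            return true; // zero or negative means no limit
        }
        long now = clockMs.getAsLong();
        if (now - windowStartMs >= WINDOW_MS) {
            // A new 3-second window starts: reset the counter.
            windowStartMs = now;
            sampledInWindow = 0;
        }
        if (sampledInWindow < maxPerWindow) {
            sampledInWindow++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SamplerSketch sampler = new SamplerSketch(1, System::currentTimeMillis);
        System.out.println(sampler.trySample()); // first request in the window is sampled
        System.out.println(sampler.trySample()); // second request in the same window is skipped
    }
}
```

When choosing a limit, it helps to estimate requests per 3 seconds at peak and pick the count that leaves enough slow or failing traces visible in the UI.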