Automating Android Startup Performance: Perfetto Tracing, Gradle Instrumentation, and Automated Analysis
This article explains how to build a high‑precision Android startup performance pipeline by selecting the right tracing tool, extending Perfetto with custom Gradle instrumentation, handling edge‑case trace mismatches, and using Trace Processor’s Python API for automated wall‑time and CPU‑time regression detection.
Introduction
Startup latency is a critical factor for user experience in Android apps. On low‑end devices the Baidu app suffers from noticeable stalls and black screens, and existing tools (TraceView, CPU Profiler, Systrace) either add too much overhead or lack the necessary analysis capabilities. A custom solution is required to collect, instrument, and analyze every Java/Kotlin method efficiently.
Tool Selection
Four mainstream Android tracing tools were evaluated:
TraceView: high overhead; supports MethodTracing and Sampling modes, but produces imprecise timings and heavyweight flame-graph analysis.
CPU Profiler: similar to TraceView, with a large performance impact.
Systrace: low overhead via kernel ftrace, but requires manual trace points in the app and is limited to system-level events.
Perfetto: low overhead; supports multiple data sources (Java/Kotlin, native, ftrace) and provides SQL-based analysis plus a web UI. Chosen as the primary collector despite limited support on Android versions below 9.
Perfetto Overview
Perfetto is a Google‑open‑source performance framework divided into three functional blocks:
Record traces: collects data from user space (atrace, custom Trace API) and kernel space (ftrace, perf events).
Analyze traces: parses trace files into an in-memory SQLite database via the Trace Processor module, which exposes a Python API for custom SQL queries.
Visualize traces: a web UI (https://ui.perfetto.dev/) renders flame graphs and supports ad-hoc SQL queries.
Trace Collection Command
./record_android_trace -c atrace.cfg -n -o trace.html

The -c option points to a configuration file that defines the buffer size, fill policy, and data-source settings. A minimal config example:
buffers: { size_kb: 522240 fill_policy: DISCARD }
data_sources: { config { name: "linux.ftrace" ftrace_config { ftrace_events: "sched/sched_switch" atrace_categories: "dalvik" atrace_apps: "com.example.app" } } }
duration_ms: 30000

Automatic Instrumentation
To avoid manual insertion of Trace.beginSection / Trace.endSection calls, a Gradle Transform plugin was built using ASM bytecode manipulation. The plugin inserts a trace start at the method entry and ensures a matching end at every exit point.
Handling "Did Not Finish" Issues
When an exception bypasses the explicit endSection, the trace becomes corrupted. The solution mirrors the JVM’s try‑finally semantics: wrap the original method body in a try block, insert Trace.beginSection before it, and place Trace.endSection in a finally block. This guarantees paired calls even when the method throws.
public void testMethod(boolean a, boolean b) {
    Trace.beginSection("com.sample.Test.testMethod");
    try {
        if (!a) {
            throw new RuntimeException("test throw");
        }
        Log.e("testa", "com.sample.Test.testMethod");
        if (b) return;
        Log.e("testb", "com.sample.Test.testMethod");
    } finally {
        Trace.endSection();
    }
}

Instrumenting System Calls (Object.wait)
System classes cannot be directly transformed, but their usage can be monitored by replacing the bytecode instruction that calls Object.wait with a custom static method that adds tracing before delegating to the original implementation.
public static void wait(Object lock, long timeout, int nanos) throws InterruptedException {
    boolean isMain = Looper.getMainLooper() == Looper.myLooper();
    try {
        if (isMain) Trace.beginSection("Main Thread Wait");
        lock.wait(timeout, nanos);
    } finally {
        if (isMain) Trace.endSection();
    }
}

During bytecode scanning, the instruction INVOKEVIRTUAL java/lang/Object.wait (JI)V is replaced with INVOKESTATIC com/baidu/systrace/SystraceInject.wait (Ljava/lang/Object;JI)V. Operand order is preserved: the receiver already on the operand stack becomes the first argument of the static method.
Trace Analysis with Trace Processor
Trace Processor converts trace files into a SQLite in‑memory database. Supported formats include Perfetto protobuf, Linux ftrace, Android systrace, Chrome JSON, and others. Two key duration metrics are extracted:
Wall Duration: total elapsed time (CPU execution plus time spent waiting, e.g., on I/O and scheduling).
CPU Duration: pure CPU execution time, derived by intersecting slice rows (user-space events) with sched_slice rows (kernel scheduling slices) on the same thread.
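The intersection that yields CPU duration can be sketched in pure Python (an illustration only; the real analysis runs as SQL over the trace tables, and the span values here are made up):

```python
def cpu_time_ns(slice_span, sched_spans):
    """Intersect one user-space slice (ts, dur) with the kernel sched_slice
    spans (ts, dur) of the same thread; the summed overlap is the CPU time."""
    s_start, s_dur = slice_span
    s_end = s_start + s_dur
    total = 0
    for ts, dur in sched_spans:
        # Clip each scheduling span to the slice's window.
        overlap = min(s_end, ts + dur) - max(s_start, ts)
        if overlap > 0:
            total += overlap
    return total

# A 100-unit slice overlapping two scheduling spans (30 + 20 units on-CPU):
print(cpu_time_ns((0, 100), [(10, 30), (80, 40)]))  # 50
```

The gap between wall duration (100) and CPU duration (50) is time the thread spent off-CPU, waiting or preempted.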
Core tables used for analysis:
process: maps a process name to a unique upid.
thread: maps each thread (utid) to its owning upid.
thread_track: links a utid to the track_id used by slice.
sched_slice: kernel-level scheduling events (timestamp, duration, CPU, thread).
slice: user-space trace events (timestamp, duration, name, track_id, depth).
Example SQL to approximate the CPU duration of a method (sched slices are matched by their start timestamp; partial overlaps at the window edges are not clipped):
SELECT s.name,
SUM(ss.dur) AS cpu_time_ns
FROM slice s
JOIN thread_track tt ON s.track_id = tt.id
JOIN sched_slice ss ON ss.utid = tt.utid
WHERE s.name = 'com.sample.Test.testMethod'
AND ss.ts BETWEEN s.ts AND s.ts + s.dur
GROUP BY s.name;

Automated Analyses Implemented
List of methods whose wall duration exceeds a configurable threshold.
Regression detection by comparing a baseline (release) trace with a test trace.
Top‑N asynchronous thread CPU consumption.
Main‑thread lock contention detection (trace names prefixed with monitor contention).
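The analyses above can be driven from the Trace Processor Python API. A minimal sketch of the threshold analysis, assuming the perfetto pip package; the trace path, query shape, and 50 ms threshold are placeholders:

```python
# Flag methods whose wall duration exceeds a configurable threshold.
WALL_QUERY = """
SELECT s.name AS name, s.dur AS dur_ns
FROM slice s
ORDER BY s.dur DESC
"""

def slow_methods(rows, threshold_ms):
    """Keep (name, dur_ns) rows whose wall duration exceeds threshold_ms;
    returns (name, duration_ms) pairs."""
    limit_ns = threshold_ms * 1_000_000
    return [(name, dur_ns / 1e6) for name, dur_ns in rows if dur_ns > limit_ns]

if __name__ == "__main__":
    # Requires: pip install perfetto
    from perfetto.trace_processor import TraceProcessor
    tp = TraceProcessor(trace="trace.perfetto-trace")
    rows = [(r.name, r.dur_ns) for r in tp.query(WALL_QUERY)]
    for name, ms in slow_methods(rows, threshold_ms=50):
        print(f"{name}: {ms:.1f} ms")
```

The same pattern (query, extract rows, post-process in Python) drives the regression, top-N thread, and lock-contention analyses.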
Best Practices and Performance Optimizations
Full-method instrumentation adds roughly 10 MB to the APK. To limit size and runtime overhead, a blacklist file can exclude trivial getters/setters, empty methods, or whole packages from instrumentation. A second blacklist for high-frequency system calls (e.g., Object.wait) removes unnecessary trace points.
Profiling showed that the EventBus component contributed a large number of trace events yet yielded only a 50 ms startup gain after optimization, illustrating the importance of focusing on high-impact sections.
End‑to‑End Pipeline
The final pipeline automates the following steps:
Gradle plugin builds an instrumented APK (baseline and test builds).
Automated real‑device tests launch the app, trigger record_android_trace, and collect trace files.
Python scripts using the Trace Processor API run the predefined SQL analyses.
Results are aggregated into a performance report with flame graphs and regression tables.
All steps run without manual intervention, reducing a task that previously took about two person-days to under half a day, spent mostly reviewing exceptional cases.
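The regression-detection step can be sketched as a pure-Python diff of per-method wall durations from the baseline and test traces (the method names and the 10 ms tolerance below are hypothetical):

```python
def find_regressions(baseline_ms, test_ms, min_delta_ms=10.0):
    """Return methods whose test duration grew by more than min_delta_ms
    relative to the baseline (durations are per-method wall times in ms)."""
    regressions = {}
    for name, base in baseline_ms.items():
        delta = test_ms.get(name, 0.0) - base
        if delta > min_delta_ms:
            regressions[name] = delta
    return regressions

# Hypothetical per-method wall durations aggregated from two traces:
baseline = {"App.onCreate": 120.0, "Splash.init": 40.0}
candidate = {"App.onCreate": 155.0, "Splash.init": 42.0}
print(find_regressions(baseline, candidate))  # {'App.onCreate': 35.0}
```

In the real pipeline the two dictionaries would be filled from Trace Processor queries over the baseline and test trace files, and the output feeds the regression table in the report.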
Conclusion
The Baidu Android startup performance framework combines Perfetto’s low‑overhead tracing, a Gradle‑based automatic instrumentation layer, and a Python‑driven analysis suite. While the approach introduces APK size growth and some runtime cost, ongoing refinements—such as selective instrumentation, deeper Perfetto SDK integration, and smarter trace filtering—will further improve efficiency.