How We Uncovered Hidden Bottlenecks in Rust Services with Profiling

After migrating nearly ten thousand cores of Java services to Rust, the team used jemalloc and pprof profiling to pinpoint why a few services improved by only about 10%, refactored the OSS client for reuse, and achieved up to 20% CPU reduction plus significant memory savings, demonstrating the value of deep performance analysis in production Rust services.

DeWu Technology

1. Profiling: The "Mirror" Revealing Bottlenecks

After migrating nearly ten‑thousand cores of Java services to Rust, most services showed large performance gains, but a small subset improved by only about 10%. To investigate the cause, we introduced a profiling system.

2. Project Configuration

Add the following dependencies to Cargo.toml (Rust 1.87):

[target.'cfg(all(not(target_env = "msvc"), not(target_os = "windows")))'.dependencies]
# Memory allocator with profiling
tikv-jemallocator = { version = "0.6", features = ["profiling", "unprefixed_malloc_on_supported_platforms"] }
tikv-jemalloc-ctl = { version = "0.6", features = ["use_std", "stats"] }
tikv-jemalloc-sys = { version = "0.6", features = ["profiling"] }
jemalloc_pprof = { version = "0.7", features = ["symbolize", "flamegraph"] }
pprof = { version = "0.14", features = ["flamegraph", "protobuf-codec"] }
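For readable function names in the flame graphs, release builds also need debug symbols; a common companion setting in Cargo.toml (an assumption about your build setup, not part of the original configuration):

```toml
[profile.release]
debug = true   # keep symbols so pprof can symbolize stack frames
```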

3. Global Configuration

In main.rs set Jemalloc as the global allocator and enable profiling:

// Enable jemalloc profiling: jemalloc reads the `malloc_conf` symbol at startup,
// so the static must be exported under exactly that name.
// (On edition 2024, write #[unsafe(export_name = "malloc_conf")].)
#[allow(non_upper_case_globals)]
#[export_name = "malloc_conf"]
pub static malloc_conf: &[u8] = b"prof:true,prof_active:true,lg_prof_sample:16\0";

#[cfg(not(target_env = "msvc"))]
use tikv_jemallocator::Jemalloc;

#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

The lg_prof_sample:16 parameter is the base-2 logarithm of the average sampling interval, so jemalloc records a sample roughly every 2^16 bytes (64 KiB) of allocation activity, a reasonable trade-off between overhead and resolution.
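The relationship between the setting and the sampling interval can be sketched in a couple of lines (the function name is ours, purely for illustration):

```rust
// lg_prof_sample is the base-2 logarithm of the average number of
// allocated bytes between two consecutive heap samples.
fn sample_interval_bytes(lg_prof_sample: u32) -> u64 {
    1u64 << lg_prof_sample
}

fn main() {
    // lg_prof_sample:16 -> one sample per ~64 KiB allocated
    assert_eq!(sample_interval_bytes(16), 64 * 1024);
    println!("{}", sample_interval_bytes(16)); // prints 65536
}
```

Lower values sample more often and cost more CPU; jemalloc's own default is 19 (512 KiB).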

4. Profile Generation Functions

Two async functions generate pprof‑compatible profile files on demand.

#[cfg(not(target_env = "msvc"))]
async fn dump_memory_profile() -> Result<String, String> {
    let prof_ctl = jemalloc_pprof::PROF_CTL.as_ref()
        .ok_or_else(|| "Profiling controller not available".to_string())?;
    let mut ctl = prof_ctl.lock().await;
    if !ctl.activated() { return Err("Jemalloc profiling is not activated".to_string()); }
    let data = ctl.dump_pprof().map_err(|e| format!("Failed to dump pprof: {}", e))?;
    let filename = format!("memory_profile_{}.pb", chrono::Utc::now().format("%Y%m%d_%H%M%S"));
    std::fs::write(&filename, data).map_err(|e| format!("Failed to write profile file: {}", e))?;
    info!("Memory profile dumped to: {}", filename);
    Ok(filename)
}

#[cfg(not(target_env = "msvc"))]
async fn dump_cpu_profile() -> Result<String, String> {
    use pprof::protos::Message; // brings `write_to_writer` into scope (protobuf-codec feature)
    use pprof::ProfilerGuard;
    // Sample call stacks at 100 Hz for the next 60 seconds
    let guard = ProfilerGuard::new(100).map_err(|e| format!("Failed to create profiler: {}", e))?;
    tokio::time::sleep(std::time::Duration::from_secs(60)).await;
    let report = guard.report().build().map_err(|e| format!("Failed to build report: {}", e))?;
    let filename = format!("cpu_profile_{}.pb", chrono::Utc::now().format("%Y%m%d_%H%M%S"));
    let mut file = std::fs::File::create(&filename).map_err(|e| format!("Failed to create file: {}", e))?;
    report.pprof().map_err(|e| format!("Failed to convert to pprof: {}", e))?
        .write_to_writer(&mut file).map_err(|e| format!("Failed to write profile: {}", e))?;
    info!("CPU profile dumped to: {}", filename);
    Ok(filename)
}

5. Triggering Profiles

Profiles can be started automatically on a timer or manually via HTTP endpoints using Warp:

fn start_profilers() {
    tokio::spawn(async {
        let mut interval = tokio::time::interval(std::time::Duration::from_secs(300));
        loop {
            interval.tick().await;
            #[cfg(not(target_env = "msvc"))]
            {
                match dump_memory_profile().await {
                    Ok(p) => info!("Memory profile dumped: {}", p),
                    Err(e) => error!("Failed to dump memory profile: {}", e),
                }
            }
        }
    });
    // CPU profiler on the same pattern (dump_cpu_profile itself runs for 60 s)
    tokio::spawn(async {
        let mut interval = tokio::time::interval(std::time::Duration::from_secs(300));
        loop {
            interval.tick().await;
            #[cfg(not(target_env = "msvc"))]
            {
                match dump_cpu_profile().await {
                    Ok(p) => info!("CPU profile dumped: {}", p),
                    Err(e) => error!("Failed to dump CPU profile: {}", e),
                }
            }
        }
    });
}

async fn trigger_memory_profile() -> Result<impl warp::Reply, std::convert::Infallible> {
    #[cfg(not(target_env = "msvc"))]
    {
        match dump_memory_profile().await {
            Ok(p) => Ok(warp::reply::with_status(p, warp::http::StatusCode::OK)),
            Err(e) => Ok(warp::reply::with_status(e, warp::http::StatusCode::INTERNAL_SERVER_ERROR)),
        }
    }
    // Without this branch the function has no body on MSVC targets and fails to compile
    #[cfg(target_env = "msvc")]
    Ok(warp::reply::with_status(
        "Profiling is not supported on this platform".to_string(),
        warp::http::StatusCode::NOT_IMPLEMENTED,
    ))
}

fn profile_routes() -> impl warp::Filter<Extract = impl warp::Reply, Error = warp::Rejection> + Clone {
    // POST /profile/memory
    let memory = warp::post()
        .and(warp::path("profile"))
        .and(warp::path("memory"))
        .and(warp::path::end())
        .and_then(trigger_memory_profile);
    // POST /profile/cpu
    let cpu = warp::post()
        .and(warp::path("profile"))
        .and(warp::path("cpu"))
        .and(warp::path::end())
        .and_then(trigger_cpu_profile);
    memory.or(cpu)
}

6. Analyzing Flame Graphs

Loading a dumped profile with go tool pprof -http=localhost:8080 cpu_profile_<timestamp>.pb, we observed that OSS::new consumed ~19% of CPU time: each write created a new OSS client, triggering a fresh TLS handshake.

[Figure: CPU flame graph]
[Figure: Memory flame graph]

7. Optimization

We refactored the code to hold a shared Arc<OSS> client, avoiding per‑request instantiation, and added an automatic recreation mechanism when failures exceed thresholds.

// Create the client once at startup and share it
let oss_client = Arc::new(create_oss_client(oss_config.clone()));

// In each task: clone the cheap Arc handle, not the client itself
let oss_instance = self.oss_client.clone();
let result = oss_instance.append_object(data, file_name, headers, resources).await;
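The effect of sharing can be illustrated with a std-only sketch, where a hypothetical `Client` type stands in for the OSS client and a counter stands in for the TLS handshake cost: the client is constructed once, and every task only clones the `Arc` handle.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Counts constructions so we can observe how many "handshakes" happen.
static CONSTRUCTIONS: AtomicUsize = AtomicUsize::new(0);

struct Client;

impl Client {
    fn new() -> Self {
        CONSTRUCTIONS.fetch_add(1, Ordering::SeqCst); // one handshake per construction
        Client
    }
    fn append_object(&self, _data: &[u8]) {}
}

fn main() {
    // Shared client: constructed once, handle cloned per task.
    let client = Arc::new(Client::new());
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let c = Arc::clone(&client);
            std::thread::spawn(move || c.append_object(b"payload"))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Eight writes, one construction.
    assert_eq!(CONSTRUCTIONS.load(Ordering::SeqCst), 1);
    println!("constructions: {}", CONSTRUCTIONS.load(Ordering::SeqCst));
}
```

Cloning an `Arc` only bumps a reference count, so the per-request cost drops from a full client setup to a single atomic increment.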

8. Results

After deploying the changes:

CPU usage dropped by ~20%.

OSS write latency fell by ~17%, making it the fastest writer in the cluster.

Overall memory usage decreased by ~8.8%, helped in part by switching from mimalloc to jemalloc.

The experience confirms that deep profiling in Rust can uncover hidden performance problems even after a major language migration.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: microservices, Rust, pprof, async, profiling, jemalloc, performance-optimization
Written by DeWu Technology, a platform for sharing and discussing tech knowledge.