How We Uncovered Hidden Bottlenecks in Rust Services with Profiling
After migrating nearly ten thousand cores of Java services to Rust, the team used jemalloc and pprof profiling to pinpoint why a few services improved by only about 10%. Refactoring the OSS client for reuse cut CPU usage by up to 20% and delivered significant memory savings, demonstrating the value of deep performance analysis in production Rust services.
1. Profiling: The "Mirror" Revealing Bottlenecks
After we migrated nearly ten thousand cores of Java services to Rust, most services showed large performance gains, but a small subset improved by only about 10%. To find out why, we introduced a profiling system.
2. Project Configuration
Add the following dependencies to Cargo.toml (Rust 1.87):
[target.'cfg(all(not(target_env = "msvc"), not(target_os = "windows")))'.dependencies]
# Memory allocator with profiling
tikv-jemallocator = { version = "0.6", features = ["profiling", "unprefixed_malloc_on_supported_platforms"] }
tikv-jemalloc-ctl = { version = "0.6", features = ["use_std", "stats"] }
tikv-jemalloc-sys = { version = "0.6", features = ["profiling"] }
jemalloc_pprof = { version = "0.7", features = ["symbolize", "flamegraph"] }
pprof = { version = "0.14", features = ["flamegraph", "protobuf-codec"] }
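The snippets below also rely on a few supporting crates that the excerpt above does not show. As an assumption (the original article does not list them, and the versions here are illustrative), something like the following in the regular [dependencies] table covers them:
# Assumed supporting dependencies (not from the original excerpt); versions are illustrative.
tokio = { version = "1", features = ["full"] }   # async runtime used by the profiling tasks
warp = "0.3"                                     # HTTP endpoints for triggering profiles
chrono = "0.4"                                   # timestamps in profile file names
tracing = "0.1"                                  # provides the info! macro used below (or use the log crate)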
3. Global Configuration
In main.rs, set Jemalloc as the global allocator and enable profiling:
// Enable jemalloc profiling
// jemalloc reads its options from an exported, unmangled `malloc_conf` symbol
// (on Rust edition 2024, write #[unsafe(export_name = "malloc_conf")]).
#[allow(non_upper_case_globals)]
#[export_name = "malloc_conf"]
pub static malloc_conf: &[u8] = b"prof:true,prof_active:true,lg_prof_sample:16\0";
#[cfg(not(target_env = "msvc"))]
use tikv_jemallocator::Jemalloc;
#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;
The lg_prof_sample:16 parameter takes one heap sample per 2^16 bytes, i.e. roughly every 64 KiB of allocation.
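The tikv-jemalloc-ctl dependency listed earlier is not used in the snippets that follow; one way it can be used (a sketch, not taken from the original article) is to read allocator statistics at runtime, for example whenever a profile is dumped:
// Sketch: reading jemalloc statistics via tikv-jemalloc-ctl (requires its "stats" feature).
#[cfg(not(target_env = "msvc"))]
fn log_allocator_stats() {
    use tikv_jemalloc_ctl::{epoch, stats};
    // jemalloc caches its statistics; advancing the epoch refreshes them.
    epoch::advance().expect("failed to advance jemalloc epoch");
    let allocated = stats::allocated::read().expect("failed to read allocated bytes");
    let resident = stats::resident::read().expect("failed to read resident bytes");
    info!("jemalloc allocated: {} bytes, resident: {} bytes", allocated, resident);
}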
4. Profile Generation Functions
Async functions generate pprof‑compatible files on demand.
#[cfg(not(target_env = "msvc"))]
async fn dump_memory_profile() -> Result<String, String> {
    let prof_ctl = jemalloc_pprof::PROF_CTL.as_ref()
        .ok_or_else(|| "Profiling controller not available".to_string())?;
    let mut ctl = prof_ctl.lock().await;
    if !ctl.activated() {
        return Err("Jemalloc profiling is not activated".to_string());
    }
    let data = ctl.dump_pprof().map_err(|e| format!("Failed to dump pprof: {}", e))?;
    let filename = format!("memory_profile_{}.pb", chrono::Utc::now().format("%Y%m%d_%H%M%S"));
    std::fs::write(&filename, data).map_err(|e| format!("Failed to write profile file: {}", e))?;
    info!("Memory profile dumped to: {}", filename);
    Ok(filename)
}
#[cfg(not(target_env = "msvc"))]
async fn dump_cpu_profile() -> Result<String, String> {
    use pprof::ProfilerGuard;
    use pprof::protos::Message; // brings write_to_writer into scope
    // Sample at 100 Hz for 60 seconds, then encode the report as a pprof protobuf.
    let guard = ProfilerGuard::new(100).map_err(|e| format!("Failed to create profiler: {}", e))?;
    tokio::time::sleep(std::time::Duration::from_secs(60)).await;
    let report = guard.report().build().map_err(|e| format!("Failed to build report: {}", e))?;
    let filename = format!("cpu_profile_{}.pb", chrono::Utc::now().format("%Y%m%d_%H%M%S"));
    let mut file = std::fs::File::create(&filename).map_err(|e| format!("Failed to create file: {}", e))?;
    report.pprof().map_err(|e| format!("Failed to convert to pprof: {}", e))?
        .write_to_writer(&mut file).map_err(|e| format!("Failed to write profile: {}", e))?;
    info!("CPU profile dumped to: {}", filename);
    Ok(filename)
}
5. Triggering Profiles
Profiles can be started automatically on a timer or manually via HTTP endpoints using Warp:
fn start_profilers() {
    tokio::spawn(async {
        let mut interval = tokio::time::interval(std::time::Duration::from_secs(300));
        loop {
            interval.tick().await;
            #[cfg(not(target_env = "msvc"))]
            {
                match dump_memory_profile().await {
                    Ok(p) => info!("Memory profile dumped: {}", p),
                    Err(e) => info!("Failed to dump memory profile: {}", e),
                }
            }
        }
    });
    // similar for CPU profiler
}
async fn trigger_memory_profile() -> Result<impl warp::Reply, std::convert::Infallible> {
    #[cfg(not(target_env = "msvc"))]
    {
        match dump_memory_profile().await {
            Ok(p) => Ok(warp::reply::with_status(p, warp::http::StatusCode::OK)),
            Err(e) => Ok(warp::reply::with_status(e, warp::http::StatusCode::INTERNAL_SERVER_ERROR)),
        }
    }
}
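The route table in profile_routes below also references a trigger_cpu_profile handler that the article omits; a minimal version mirroring trigger_memory_profile could look like this:
async fn trigger_cpu_profile() -> Result<impl warp::Reply, std::convert::Infallible> {
    #[cfg(not(target_env = "msvc"))]
    {
        // Mirrors trigger_memory_profile: runs the 60-second CPU capture and returns the file name.
        match dump_cpu_profile().await {
            Ok(p) => Ok(warp::reply::with_status(p, warp::http::StatusCode::OK)),
            Err(e) => Ok(warp::reply::with_status(e, warp::http::StatusCode::INTERNAL_SERVER_ERROR)),
        }
    }
}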
fn profile_routes() -> impl warp::Filter<Extract = impl warp::Reply, Error = warp::Rejection> + Clone {
    let memory = warp::post()
        .and(warp::path("profile"))
        .and(warp::path("memory"))
        .and(warp::path::end())
        .and_then(trigger_memory_profile);
    let cpu = warp::post()
        .and(warp::path("profile"))
        .and(warp::path("cpu"))
        .and(warp::path::end())
        .and_then(trigger_cpu_profile);
    memory.or(cpu)
}
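The article does not show how these routes are served; one plausible wiring (the port number is an arbitrary choice for illustration) is to mount them on a small side server next to the main service:
// Sketch: expose the profiling endpoints on a dedicated port alongside the main service.
fn start_profile_server() {
    tokio::spawn(async {
        warp::serve(profile_routes()).run(([0, 0, 0, 0], 9102)).await;
    });
}
A profile can then be requested with, for example, curl -X POST http://localhost:9102/profile/memory.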
6. Analyzing Flame Graphs
Opening a dumped CPU profile with go tool pprof -http=localhost:8080 <profile file>, we observed that OSS::new consumed ~19% of CPU time: each write created a new OSS client, which triggered a fresh TLS handshake.
7. Optimization
We refactored the code to hold a shared Arc<OSS> client instead of instantiating one per request, and added an automatic recreation mechanism that rebuilds the client when failures exceed a threshold (a sketch of that logic follows the snippet below).
let oss_client = Arc::new(create_oss_client(oss_config.clone()));
// In each task
let oss_instance = self.oss_client.clone();
let result = oss_instance.append_object(data, file_name, headers, resources).await;
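The recreation logic itself is not shown in the article; the following self-contained sketch (OssClient, create_oss_client, and the threshold value are stand-ins, not the production types) illustrates one way to swap in a fresh shared client after repeated failures:
// Sketch only: count consecutive failures and rebuild the shared client past a threshold.
use std::sync::Arc;
use std::sync::atomic::{AtomicU32, Ordering};
use tokio::sync::RwLock;

struct OssClient; // stand-in for the real OSS client type

fn create_oss_client() -> OssClient {
    OssClient
}

struct SharedOss {
    client: RwLock<Arc<OssClient>>,
    consecutive_failures: AtomicU32,
}

impl SharedOss {
    const MAX_FAILURES: u32 = 5; // illustrative threshold

    fn new() -> Self {
        Self {
            client: RwLock::new(Arc::new(create_oss_client())),
            consecutive_failures: AtomicU32::new(0),
        }
    }

    // Cheap clone of the shared client for use in a request.
    async fn current(&self) -> Arc<OssClient> {
        Arc::clone(&*self.client.read().await)
    }

    // Called after each write: reset the counter on success, rebuild the client after too many failures.
    async fn record_result(&self, ok: bool) {
        if ok {
            self.consecutive_failures.store(0, Ordering::Relaxed);
        } else if self.consecutive_failures.fetch_add(1, Ordering::Relaxed) + 1 >= Self::MAX_FAILURES {
            *self.client.write().await = Arc::new(create_oss_client());
            self.consecutive_failures.store(0, Ordering::Relaxed);
        }
    }
}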
8. Results
After deploying the changes:
CPU usage dropped by ~20%.
OSS write latency fell by ~17%, and the service became the fastest OSS writer in the cluster.
Overall memory usage decreased by ~8.8%, helped by switching from mimalloc to jemalloc.
The experience confirms that deep profiling in Rust can uncover hidden performance problems even after a major language migration.