Backend Development 40 min read

Mastering Software Performance: 6 Core Techniques to Trade Time and Space

This article explains how software performance optimization is fundamentally a trade‑off between time and space, introduces six universal techniques—indexing, compression, caching, prefetching, peak‑shaving, and batch processing—covers their principles, common data structures, practical usage patterns, and the associated drawbacks, and finally offers guidance on choosing the right approach.

dbaplus Community

Feb 1, 2023

Mastering Software Performance: 6 Core Techniques to Trade Time and Space

Indexing Techniques

Indexes trade extra storage for faster look‑ups, reducing read complexity from O(n) to O(log n) or O(1). Common index structures and their typical use cases include:

Hash Table : Direct O(1) lookup by hashing the key. Used in language Map/Dict implementations and many key‑value stores.

Binary Search Tree (BST) : Ordered tree; balanced variants such as Red‑Black trees give O(log n) lookup. Used in Java TreeMap/TreeSet, C++ std::map, and the Linux CPU scheduler.

B‑Tree : Multi‑way balanced tree that reduces depth for large data sets. MongoDB uses B‑Tree for its indexes.

B+ Tree : Only leaf nodes store data and leaves are linked, enabling efficient range scans. Used by MySQL and many file systems.

LSM Tree : Log‑Structured Merge Tree optimises write throughput by sequential writes and background merges; read performance is slower. Core to many NoSQL stores (e.g., RocksDB, Cassandra).

Trie (Prefix Tree) : Fast prefix searches, ideal for autocomplete, URL routing, and Redis radix‑tree implementations.

Skip List : Layered ordered list that supports concurrent writes without rebalancing. Redis ZSet combines a hash with a skip‑list.

Inverted Index : Maps terms to the documents they appear in. Fundamental to full‑text search engines such as Elasticsearch and time‑series databases like Prometheus.

When designing database schemas, follow these best practices:

Define a primary key (clustered index) and use it for most queries.

Create indexes on columns frequently used in WHERE, JOIN, ORDER BY, or GROUP BY clauses.

Avoid indexing low‑cardinality or frequently updated columns.

Prefer covering indexes to eliminate row look‑ups.

Caching Techniques

Caching also trades storage for speed and appears at many layers:

Client‑side DNS cache, OS DNS cache, and upstream DNS resolvers.

Server‑side key‑value caches (e.g., Redis, Memcached) for hot data.

Operating‑system page cache for recently accessed files.

Hardware caches in SSDs and CPUs.

Cache invalidation is a hard problem. Common failure modes and mitigations:

Cache penetration : Cache empty values or use a Bloom filter to block queries for non‑existent keys.

Cache breakdown (thundering herd) : Serialize the first miss with a request‑level mutex.

Cache avalanche : Randomise TTLs to avoid massive simultaneous expirations.

Compression Techniques

Compression spends CPU cycles to reduce data size, which can dramatically lower network transfer time and storage I/O. Typical lossless uses:

HTTP Accept‑Encoding: gzip/deflate for text assets (HTML, CSS, JS).

HTTP/2 HPACK header compression.

JS/CSS minification (Uglify, terser).

Binary encoding and compression (Gzip, Snappy, LZ4) in RPC protocols and message queues.

Cache services often store compressed blobs and decompress on read.

Bulk data storage may use higher‑ratio algorithms for archival data.

JVM UseCompressedOops compresses object pointers when heap < 32 GB.

MongoDB BSON is a more compact binary representation than JSON.

Lossy compression (e.g., video/audio codecs, thumbnails) sacrifices fidelity for bandwidth.

Prefetch Techniques

Prefetching adds a “time‑for‑time” trade‑off: spend time now to eliminate the first‑load latency later. Typical scenarios:

Video buffering or HTTP/2 server push.

Client applications that load data in a background process.

Server‑side warm‑up: load hot data into memory caches, pre‑populate CPU caches, JIT‑compile hot methods.

Side effects include longer startup time, extra CPU usage during idle periods, and possible fetching of unused data.

Peak‑Shaving (削峰填谷) Techniques

Peak‑shaving smooths load by moving work to off‑peak periods or limiting request rates:

Message queues (Kafka, RocketMQ, Redis streams) buffer bursts and allow asynchronous processing.

Rate‑limiting, leaky‑bucket, token‑bucket, or circuit‑breaker patterns to protect downstream services.

Scheduling background jobs (e.g., analytics, batch imports) during low‑traffic windows.

Batch Processing Techniques

Batching reduces per‑item overhead by handling many items together. Examples and practical batch‑size guidelines:

Bundle JS/CSS assets or sprite sheets to reduce HTTP requests.

Redis MGET/MSET: 50‑100 keys per batch is a good starting point; benchmark for your workload.

MySQL/Oracle bulk INSERT: 5 000‑10 000 rows per statement balances parsing cost and network overhead.

Message‑queue payloads: keep each batch <1 MB (AWS SQS limits to 256 KB per message).

Too large a batch can increase latency, cause memory pressure, or exceed protocol limits, so empirical benchmarking is essential.

Hardware Latency Reference

Typical operation latencies (order of magnitude):

CPU L1/L2 cache access: 1‑10 ns.

Main memory (RAM) access: ~100 ns.

SSD random read/write: 10 µs‑1 ms.

LAN round‑trip: ~0.5 ms.

Memcached/Redis get/set: 1‑5 ms.

Simple DB query/update: 5‑50 ms.

System call (user↔kernel): 100 ns‑1 µs.

Memory Layout Insights (JVM Example)

Each Java object has a 12‑byte header (mark word + klass pointer). Fields are aligned to 4‑byte boundaries, and objects to 8‑byte boundaries. On a 64‑bit JVM with UseCompressedOops disabled (heap > 32 GB), object references become 8 bytes, dramatically increasing memory consumption.

Optimization Strategies

1. Focus (聚焦)

Reduce system‑call and context‑switch overhead. Useful references:

https://stackoverflow.com/questions/21887797/what-is-the-overhead-of-a-context-switch

https://stackoverflow.com/questions/23599074/system-calls-overhead

Prioritise I/O‑bound workloads: use epoll/event‑driven models, zero‑copy DMA, and CPU affinity to keep threads on the same core.

2. Evolve (蜕变)

Adopt more efficient data structures and algorithms:

Replace linear scans with hash lookups or tree searches.

Use lock‑striped or lock‑free containers (e.g., Java 8 ConcurrentHashMap).

Apply classic algorithms (Fisher‑Yates shuffle, Dijkstra) where appropriate.

3. Adapt (适应)

Tailor optimisations to the runtime environment:

In browsers, minimise DOM reflows, use requestAnimationFrame for batch UI updates, and avoid eval or dynamic prototype changes.

In Node.js, prefer WebSocket over frequent Ajax polling, and use the built‑in profiler to locate hot paths.

On the JVM, enable tiered compilation, stack allocation, and tune GC parameters.

On Linux, tune kernel parameters (e.g., net.core.somaxconn, fs.file‑max) and use readv/writev for vectored I/O.

4. Orchestrate (运筹)

Take a system‑wide view to allocate resources where they yield the highest ROI:

Select appropriate cloud instance types (CPU‑optimized vs. memory‑optimized) based on measured bottlenecks.

Consider ARM servers for cost‑effective throughput when workloads are architecture‑agnostic.

When scaling horizontally, ensure services are stateless, use load balancers, and replicate read‑heavy workloads (CDN, cache replicas).

For stateful services, apply sharding (partitioning) and choose a good shard key to distribute load evenly.

Reduce lock contention by designing lock‑free algorithms or by narrowing lock scope (optimistic locking, CAS).

Advanced Parallel‑Oriented Techniques (Second Part)

1. Resource Exhaustion (八门遁甲)

Eliminate unnecessary work: minimise system calls, use zero‑copy I/O, bind threads to CPUs, and avoid heavyweight abstractions when low‑level APIs suffice.

2. Horizontal Scaling (影分身术)

Scale out stateless services behind a load balancer. Use auto‑scaling based on metrics (CPU, QPS) and ensure replicas are identical.

3. Sharding (奥义)

Partition stateful data across multiple nodes. Key considerations:

Select a shard key with uniform distribution and low cross‑shard joins.

Implement a routing layer that directs requests to the correct shard.

Combine sharding with caching to mitigate hot‑spot contention.

4. Lock‑Free Design (秘术)

Lock contention limits parallelism. Techniques include:

Optimistic concurrency control (version fields, compare‑and‑swap).

Lock‑free data structures (e.g., Java 8 ConcurrentHashMap segment‑free design).

Pipeline processing (Redis pipeline, CPU instruction pipelines) to batch operations.

Protocol‑level optimisations such as QUIC/HTTP‑3 to reduce head‑of‑line blocking.

Conclusion

Performance optimisation is an iterative, measurement‑driven process: profile, locate bottlenecks, apply the most appropriate technique, and avoid premature or excessive optimisation. Aim for high‑impact, low‑cost improvements (the 80/20 rule) and continuously monitor the system to guide future refinements.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Optimization

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.