
Lossless Image Compression Overview and Lepton Optimization for Large‑Scale Storage

The article explains JPEG’s lossy fundamentals, introduces Lepton’s lossless layer and its optimizations—such as arithmetic coding and multithreaded Huffman switching—and describes how vivo’s hybrid physical‑server and Kubernetes deployment achieves roughly 22 % storage reduction across petabytes of JPEG images despite high CPU demands.


Author: vivo Internet Database Team – Li Shihai

This article introduces the overall process and principles of lossless image compression, and discusses the problems discovered during early research on Lepton lossless compression as well as the solutions.

1. A Game to Illustrate Image Size Differences

Participants are asked to find differences between two images within 15 seconds. The two pictures are actually the same visual content: one is a 3.7 MB JPEG original, the other is a 485 KB compressed JPEG. The exercise leads to questions about why the size changed, what information was lost, why quality seems unchanged, and whether further compression is possible.

2. Understanding JPEG Compression

2.1 Human Visual Weakness

The human retina contains cones (which sense color) and rods (which sense brightness). We perceive changes in brightness more acutely than changes in color, so color information can be compressed more aggressively.

2.2 JPEG Pipeline

Before compression, the image is converted from RGB to YCbCr color space (Y = luminance, Cb/Cr = chroma). The image is divided into 8×8 blocks, and a Discrete Cosine Transform (DCT) is applied to obtain frequency coefficients. High‑frequency components, which the eye is less sensitive to, are quantized (lossy step). The remaining data are entropy‑coded (lossless) using methods such as run‑length encoding, Huffman coding, or arithmetic coding.
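The lossy heart of this pipeline is the 8×8 DCT followed by quantization. A minimal sketch in Python (using NumPy; the quantizer value `q = 16` is an illustrative assumption, not an entry from a real JPEG quantization table):

```python
import numpy as np

def dct2(block):
    """Naive orthonormal 2-D DCT-II of an 8x8 block (the transform JPEG applies)."""
    N = 8
    k = np.arange(N)
    # 1-D DCT-II basis matrix: C[k, n] = s_k * cos(pi * (2n + 1) * k / (2N))
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
    s = np.full(N, np.sqrt(2 / N))
    s[0] = np.sqrt(1 / N)
    C = s[:, None] * C
    return C @ block @ C.T

# A horizontal ramp, level-shifted by 128 as JPEG does before the DCT
block = np.tile(np.arange(8, dtype=float) * 16, (8, 1)) - 128.0
coeffs = dct2(block)
# Energy concentrates in the DC term and low horizontal frequencies;
# all vertical-frequency coefficients are zero for this block.

# The lossy step: divide by a quantizer and round, zeroing small
# high-frequency coefficients (q = 16 is an assumed value)
q = 16
quantized = np.round(coeffs / q)
```

After quantization, long runs of zeros remain, which is exactly what the run-length and entropy-coding stages exploit.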

The basic flow is shown in the following diagram:

JPEG Compression is Lossy

After color‑space conversion, chroma subsampling (e.g., 4:2:0) discards about 75 % of Cb/Cr data, which is irreversible. Subsequent entropy coding (e.g., Huffman) is lossless.
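The 75 % figure follows directly from 4:2:0's geometry: one chroma sample survives per 2×2 pixel block. A small illustrative sketch (the array contents are arbitrary test values):

```python
import numpy as np

def subsample_420(chroma):
    """4:2:0 subsampling: average each 2x2 block, keeping 1 of every 4 samples."""
    h, w = chroma.shape
    return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

cb = np.arange(16.0).reshape(4, 4)   # a toy 4x4 Cb plane
cb_sub = subsample_420(cb)           # 16 samples -> 4 samples
# 75% of the chroma data is discarded and cannot be recovered
```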

3. Lepton – The Main Actor

Although JPEG achieves good visual quality, it is still lossy and leaves further compression potential. Lepton adds a lossless layer on top of JPEG.

3.1 Why Choose Lepton

Compared with tools such as JPEGrescan, MozJPEG, PackJPG, and PAQ8PX, Lepton is better suited to production environments. For example, PackJPG requires a global sort of pixel values, which makes decompression single-threaded and memory-intensive.

A comparison from the Lepton paper is shown below:

3.2 Lepton Optimizations

- The JPEG header is compressed with DEFLATE (lossless).
- Image data are encoded with arithmetic coding instead of Huffman coding, improving the compression ratio.
- A "Huffman switching" technique enables multithreaded processing.
- A sophisticated adaptive probability model, built from large-scale image testing, predicts coefficient values.
- Processing is stream-based, line by line, allowing low memory usage and safe I/O.
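Why arithmetic coding with an adaptive model beats Huffman here: Huffman must spend at least one whole bit per symbol, while an arithmetic coder spends -log2(p) fractional bits under a probability model it updates as it codes. A toy sketch of the idea (this simple counting model is illustrative only, not Lepton's actual per-context model):

```python
import math

class AdaptiveBitModel:
    """Toy adaptive model: estimate P(bit = 1) from Laplace-smoothed counts."""
    def __init__(self):
        self.c0 = self.c1 = 1

    def p1(self):
        return self.c1 / (self.c0 + self.c1)

    def update(self, bit):
        if bit:
            self.c1 += 1
        else:
            self.c0 += 1

def ideal_code_length(bits):
    """Total bits an ideal arithmetic coder would spend with this model."""
    model, total = AdaptiveBitModel(), 0.0
    for b in bits:
        p = model.p1() if b else 1.0 - model.p1()
        total += -math.log2(p)  # fractional bits, unlike Huffman's whole bits
        model.update(b)
    return total

skewed = [0] * 90 + [1] * 10
cost = ideal_code_length(skewed)
# Huffman needs at least 1 bit per symbol for a binary alphabet (100 bits);
# the adaptive arithmetic coder comes in well under that on skewed data.
```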

3.3 Lepton in vivo Storage

vivo’s object‑storage cluster holds ~100 PB, with ~70 % images, 90 % of which are JPEG. Assuming an average 23 % compression rate, the potential saving is:

100 PB × 70 % × 90 % × 23 % ≈ 14.5 PB.

Challenges

- Lepton's compression and decompression are CPU-intensive: a 4–5 MB file takes roughly 1 s on a 16-core server at more than 95 % core utilization.
- Idle CPU resources must be fully utilized to reduce cost.
- Dynamic scaling is needed to handle workload spikes.

Solution

A hybrid deployment of physical servers and Kubernetes is adopted. Physical servers are managed via service registration/discovery, with CPU allocation controlled through cgroups and taskset. Containers provide flexible scaling for compute-heavy services.
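On the physical-server side, CPU allocation via taskset/cgroups amounts to restricting a worker's CPU affinity so compression jobs only consume reserved cores. A minimal Linux-only Python sketch of the same idea (which cores to reserve is a hypothetical placeholder, not vivo's actual configuration):

```python
import os

# Cores this process is currently allowed to run on
available = sorted(os.sched_getaffinity(0))

# Hypothetically reserve a subset of cores for the Lepton worker,
# mimicking `taskset -c` on vivo's physical servers
worker_cores = set(available[:2])

os.sched_setaffinity(0, worker_cores)  # 0 = the current process
```

cgroups would additionally cap CPU time shares; affinity alone only restricts placement.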

3.4 Performance Evaluation

Both synchronous and asynchronous compression modes were evaluated, with image read latency as the primary metric. Tests across various file sizes show an average compression ratio of ~22 %.

Compression and decompression time proportions (orange = decompression, blue = compression) are illustrated below:

A scalability test, in which 32 threads each processed 100 files, shows linear performance improvement.

4. Common Issues in Image Compression

4.1 Lossy vs. Lossless Formats

Lossy formats: JPEG, JPG, WMF, …

Lossless formats: BMP, PCX, TIFF, GIF, TGA, PNG, RAW, …

4.2 Typical Lossless Algorithms

Run‑Length Encoding, Shannon‑Fano Coding, Huffman Coding, Arithmetic Coding, Burrows‑Wheeler Transform.
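Run-length encoding, the simplest of these, can stand in for the family. A minimal sketch (byte values appear as integers in the output pairs):

```python
def rle_encode(data):
    """Run-length encoding: collapse each run of a repeated byte to (byte, count)."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        out.append((data[i], j - i))
        i = j
    return out

def rle_decode(pairs):
    """Inverse of rle_encode: expand (byte, count) pairs back to bytes."""
    return bytes(b for b, n in pairs for _ in range(n))

encoded = rle_encode(b"aaaabbbcc")  # [(97, 4), (98, 3), (99, 2)]
```

RLE only wins when runs are common, which is why JPEG applies it after quantization has produced long runs of zero coefficients.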

5. Conclusion

Lepton’s lossless compression provides a high compression ratio without degrading user experience, yielding significant cost savings in massive image‑storage scenarios.

Limitations: high CPU demand, and support limited to JPEG images. Industry approaches such as FPGA acceleration or elastic compute can mitigate these drawbacks.

References

The Design, Implementation, and Deployment of a System to Transparently Compress Hundreds of Petabytes of Image Files For a File‑Storage Service

Research on Cloud Storage of JPEG Images Based on Deep Learning (基于深度学习的JPEG图像云存储研究)

Research on VLSI Architecture Design of Key Modules in JPEG-Lepton Compression (JPEG‑Lepton压缩技术关键模块VLSI结构设计研究)
