High‑Availability Architecture of Upyun Image Processing Service
The article details Upyun's high‑availability image processing architecture, covering workload‑aware system design, custom GmServer implementation, task scheduling, current strengths and limitations, and future directions such as a ServiceServer‑based queue and Docker‑driven dynamic scaling.
Choosing the Right System Architecture for Business Scenarios
Upyun provides a public‑cloud image processing service that must support a wide range of operations (watermarking, blurring, special effects, and so on). For such general‑purpose processing, software ("soft") decoding is the only viable choice; pure thumbnail generation, by contrast, can still benefit from hardware ("hard") decoding for performance.
Upyun Image Processing Cluster Scale and Architecture
Cluster Scale
The cluster consists of 30 servers, each with 24 CPU cores and 48 GB RAM; reserving one core per server leaves 23 × 30 = 690 usable cores for image processing.
The monitoring system tracks QPS and average processing latency for each sub‑service, showing stable processing capacity and volume.
Current Architecture
Eight front‑end Nginx instances perform Layer‑7 load balancing (round‑robin across the eight IPs) with no LVS in front, delegating fault tolerance to application logic.
Nginx’s upstream includes 30 GmServer instances: 28 regular GmServers and 2 special BigGmServers for handling the rare large‑size images (e.g., multi‑megabyte GIFs) that constitute less than 1 % of traffic.
Self‑Developed GmServer
Key strategy: a GmServer processes tasks only up to the number of its CPU cores; excess requests receive a 502 response, prompting Nginx to retry another server.
GmServer is built on Linux + epoll + GraphicsMagick, performing all image manipulation in memory (using /dev/shm) to avoid disk I/O and achieve maximum performance.
The team continuously patches GraphicsMagick, contributes fixes upstream (e.g., WebP support), and handles malformed or heavily compressed images that the original library cannot process.
Task Scheduling Logic Details
Task A (normal image) : Randomly select one of the eight Nginx nodes, which then forwards the request to a randomly chosen GmServer for processing.
Task B (large image or large GIF) : The request first goes to a regular GmServer; if processing exceeds its 5‑second timeout, the GmServer returns 413 and Nginx reroutes the request to a BigGmServer, which allows up to 60 seconds.
Task C (normal image when a GmServer is saturated) : If a GmServer is already running as many tasks as it has cores, it returns 502, causing Nginx to select another server; the retry typically completes within 20 ms.
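The three paths above can be simulated in a short sketch. The status codes (200, 413, 502) follow the article; the pool names, the `send` callback, and the retry count are invented for illustration:

```python
import random

REGULAR_POOL = [f"gm{i:02d}" for i in range(28)]   # 5 s timeout
BIG_POOL = ["biggm01", "biggm02"]                  # 60 s timeout

def try_server(server, task, send):
    """send(server, task, timeout) -> HTTP status from the upstream."""
    timeout = 60 if server in BIG_POOL else 5
    return send(server, task, timeout)

def dispatch(task, send, max_retries=3):
    """Mimic the Nginx behavior: retry another regular server on 502
    (saturation) and fall back to a BigGmServer on 413 (too large)."""
    for _ in range(max_retries):
        server = random.choice(REGULAR_POOL)
        status = try_server(server, task, send)
        if status == 200:
            return server, status
        if status == 413:  # large image: hand off to the big-image pool
            big = random.choice(BIG_POOL)
            return big, try_server(big, task, send)
        # 502: saturated node, loop and pick another regular server
    return server, status
```

Note that the 413 path is a one‑way handoff: once a request is known to be large, it never returns to the regular pool.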
Analysis of Current Architecture Pros and Cons
Advantages
Service stability – overload on a subset of nodes does not affect the entire platform.
Simple and generic design.
Disadvantages
Dynamic scaling is cumbersome because the architecture relies on static Nginx upstream configuration; automated discovery and Lua‑based upstream updates are required for smoother scaling.
Future plans aim to merge image and video processing clusters to improve resource utilization, potentially using Docker for elastic scaling.
Future Architecture Direction
The proposal replaces Nginx with a custom ServiceServer that queues tasks via a message queue; existing GmServers become lightweight GmWorkers that pull tasks from the queue, enabling effortless horizontal scaling and Docker‑driven deployment.
Benefits include automatic worker registration, no need for pre‑configured upstream lists, and straightforward integration with Docker for dynamic scaling.
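The queue‑based model can be approximated with a shared task queue and self‑registering workers. This sketch uses Python's in‑process `queue` module; a real deployment would use a network message queue, and the worker body stands in for actual image processing:

```python
import queue
import threading

tasks = queue.Queue()
results = {}

def gm_worker(worker_id, stop):
    """A GmWorker 'registers' simply by consuming the queue:
    no upstream list needs to know this worker exists."""
    while not stop.is_set():
        try:
            task_id, payload = tasks.get(timeout=0.1)
        except queue.Empty:
            continue
        results[task_id] = payload.upper()  # stand-in for image work
        tasks.task_done()

stop = threading.Event()
workers = [threading.Thread(target=gm_worker, args=(i, stop), daemon=True)
           for i in range(4)]
for w in workers:
    w.start()

for i in range(10):
    tasks.put((i, f"image-{i}"))
tasks.join()       # block until every task has been processed
stop.set()
```

Because capacity is just "number of consumers on the queue", adding a worker (e.g., starting another Docker container) requires no configuration change anywhere else.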
Operational Pitfalls Encountered
Cache invalidation can cause a sudden surge in image‑processing requests; the system mitigates this by returning 503 with a client‑side cache duration of at least one minute.
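That back‑pressure response can be sketched as a status code plus standard HTTP caching hints; the one‑minute figure is from the article, the header choice is an assumption:

```python
def overload_response(retry_after=60):
    """Return 503 with caching hints so clients back off for at least
    a minute instead of hammering the cluster during a cache-miss storm."""
    return 503, {
        "Retry-After": str(retry_after),
        "Cache-Control": f"max-age={retry_after}",
    }
```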
Malicious or mis‑configured tenant requests (e.g., massive thumbnail version changes) can overload the cluster; rate‑limiting and peak‑control mechanisms are applied per tenant to protect overall stability.
Images are processed entirely in memory and never written to disk, though a two‑layer cache exists for already processed results.
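The two‑layer result cache can be sketched as a local in‑memory layer in front of a shared store. Both layers here are plain dicts for illustration; in the real system the second layer is presumably the cloud storage or a KV store:

```python
local_cache = {}    # per-node, fastest
shared_cache = {}   # cluster-wide (e.g., cloud storage / KV store)

def get_processed(key, process):
    """Look up an already-processed image before recomputing it."""
    if key in local_cache:
        return local_cache[key]
    if key in shared_cache:
        local_cache[key] = shared_cache[key]  # promote to local layer
        return local_cache[key]
    result = process(key)        # cache miss: do the work in memory
    shared_cache[key] = result
    local_cache[key] = result
    return result
```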
Q & A
1. Why use only Layer‑7 load balancing with Nginx instead of LVS? The load is low; IP round‑robin plus application‑level fault tolerance is sufficient.
2. Where are processed images stored? In Upyun’s cloud storage; a key/value store is also suitable.
3. Any plans to use GPU for image processing? GPU would increase development complexity; CPU remains the optimal choice for current workloads.
4. Comparison of GraphicsMagick vs. ImageMagick? GraphicsMagick is a fork of ImageMagick focused on stability and performance; ImageMagick carries more features and a heavier interface, at some performance cost, so Upyun builds on GraphicsMagick.
5. Does Upyun provide OCR? No, not at present.
6. How does the new mixed‑business architecture avoid cross‑traffic impact? Workers are pinned to individual CPU cores; synchronous tasks (image processing) receive priority, and per‑tenant resource limits prevent one tenant from overwhelming others.
7. How to reduce repeated processing for direct‑link usage? Implement access control, rate limiting, or use thumbnail versioning to cache results.
8. Should a connection pool replace per‑node attempts? A pool can cause imbalance; the queue‑based worker model is preferred.
9. Where is timeout detection performed? Inside GmServer; it terminates its own process on timeout, avoiding resource waste on the Nginx side.
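Doing timeout detection inside the worker can be sketched by running the heavy step in a child process and killing it on expiry. This is a Unix‑flavored Python sketch (Upyun's GmServer does this in C); the command and status mapping are assumptions, with 413 matching the large‑image code used elsewhere in the article:

```python
import subprocess

def run_with_timeout(cmd, timeout):
    """Run an external processing command (e.g. a `gm convert` call)
    and terminate it ourselves if it exceeds the deadline, instead of
    leaving a runaway process for the proxy layer to clean up."""
    try:
        subprocess.run(cmd, timeout=timeout, check=True)
        return 200
    except subprocess.TimeoutExpired:
        return 413  # "too big / too slow": lets Nginx reroute the request
    except subprocess.CalledProcessError:
        return 500
```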
Published by the High Availability Architecture official account.