High‑Availability Architecture of Upyun Image Processing Service
The article details Upyun's high‑availability image processing architecture, covering workload‑aware system design, custom GmServer implementation, task scheduling, current strengths and limitations, and future directions such as a ServiceServer‑based queue and Docker‑driven dynamic scaling.
Choosing the Right System Architecture for Business Scenarios
Upyun provides a public‑cloud image processing service that must support a wide range of operations (watermarking, blurring, special effects, and so on). For such general‑purpose processing, software ("soft") decoding is the only viable choice; pure thumbnail generation, by contrast, can still benefit from hardware ("hard") decoding for performance.
Upyun Image Processing Cluster Scale and Architecture
Cluster Scale
The cluster consists of 30 servers, each with 24 CPU cores and 48 GB RAM; reserving one core per server leaves 23 × 30 = 690 usable cores for image processing.
The monitoring system tracks QPS and average processing latency for each sub‑service, showing stable processing capacity and volume.
Current Architecture
Eight front‑end Nginx instances perform Layer‑7 load balancing (round‑robin across the eight IPs) with no LVS in front, delegating fault tolerance to application logic.
Nginx’s upstream includes 30 GmServer instances: 28 regular GmServers and 2 special BigGmServers for handling the rare large‑size images (e.g., multi‑megabyte GIFs) that constitute less than 1 % of traffic.
Self‑Developed GmServer
Key strategy: a GmServer processes tasks only up to the number of its CPU cores; excess requests receive a 502 response, prompting Nginx to retry another server.
GmServer is built on Linux + epoll + GraphicsMagick, performing all image manipulation in memory (using /dev/shm) to avoid disk I/O and achieve maximum performance.
The team continuously patches GraphicsMagick, contributes fixes upstream (e.g., WebP support), and handles malformed or heavily compressed images that the original library cannot process.
Task Scheduling Logic Details
Task A (normal image) : Randomly select one of the eight Nginx nodes, which then forwards the request to a randomly chosen GmServer for processing.
Task B (large image or large GIF) : The request first goes to a regular GmServer; if processing exceeds its 5‑second timeout, the GmServer returns 413 and Nginx reroutes the request to a BigGmServer, which allows up to 60 seconds.
Task C (normal image when a GmServer is saturated) : If a GmServer is already running as many tasks as it has cores, it returns 502, causing Nginx to select another server; the retry typically completes within 20 ms.
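The three paths above can be simulated in a short sketch. The status codes (200, 413, 502) follow the article; the pool names, the `send` callback, and the retry count are invented for illustration:

```python
import random

REGULAR_POOL = [f"gm{i:02d}" for i in range(28)]   # 5 s timeout
BIG_POOL = ["biggm01", "biggm02"]                  # 60 s timeout

def try_server(server, task, send):
    """send(server, task, timeout) -> HTTP status from the upstream."""
    timeout = 60 if server in BIG_POOL else 5
    return send(server, task, timeout)

def dispatch(task, send, max_retries=3):
    """Mimic the Nginx behavior: retry another regular server on 502
    (saturation) and fall back to a BigGmServer on 413 (too large)."""
    for _ in range(max_retries):
        server = random.choice(REGULAR_POOL)
        status = try_server(server, task, send)
        if status == 200:
            return server, status
        if status == 413:  # large image: hand off to the big-image pool
            big = random.choice(BIG_POOL)
            return big, try_server(big, task, send)
        # 502: saturated node, loop and pick another regular server
    return server, status
```

Note that the 413 path is a one‑way handoff: once a request is known to be large, it never returns to the regular pool.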
Analysis of Current Architecture Pros and Cons
Advantages
Service stability – overload on a subset of nodes does not affect the entire platform.
Simple and generic design.
Disadvantages
Dynamic scaling is cumbersome because the architecture relies on static Nginx upstream configuration; automated discovery and Lua‑based upstream updates are required for smoother scaling.
Future plans aim to merge image and video processing clusters to improve resource utilization, potentially using Docker for elastic scaling.
Future Architecture Direction
The proposal replaces Nginx with a custom ServiceServer that queues tasks via a message queue; existing GmServers become lightweight GmWorkers that pull tasks from the queue, enabling effortless horizontal scaling and Docker‑driven deployment.
Benefits include automatic worker registration, no need for pre‑configured upstream lists, and straightforward integration with Docker for dynamic scaling.
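The queue‑based model can be approximated with a shared task queue and self‑registering workers. This sketch uses Python's in‑process `queue` module; a real deployment would use a network message queue, and the worker body stands in for actual image processing:

```python
import queue
import threading

tasks = queue.Queue()
results = {}

def gm_worker(worker_id, stop):
    """A GmWorker 'registers' simply by consuming the queue:
    no upstream list needs to know this worker exists."""
    while not stop.is_set():
        try:
            task_id, payload = tasks.get(timeout=0.1)
        except queue.Empty:
            continue
        results[task_id] = payload.upper()  # stand-in for image work
        tasks.task_done()

stop = threading.Event()
workers = [threading.Thread(target=gm_worker, args=(i, stop), daemon=True)
           for i in range(4)]
for w in workers:
    w.start()

for i in range(10):
    tasks.put((i, f"image-{i}"))
tasks.join()       # block until every task has been processed
stop.set()
```

Because capacity is just "number of consumers on the queue", adding a worker (e.g., starting another Docker container) requires no configuration change anywhere else.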
Operational Pitfalls Encountered
Cache invalidation can cause a sudden surge in image‑processing requests; the system mitigates this by returning 503 with a client‑side cache duration of at least one minute.
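That back‑pressure response can be sketched as a status code plus standard HTTP caching hints; the one‑minute figure is from the article, the header choice is an assumption:

```python
def overload_response(retry_after=60):
    """Return 503 with caching hints so clients back off for at least
    a minute instead of hammering the cluster during a cache-miss storm."""
    return 503, {
        "Retry-After": str(retry_after),
        "Cache-Control": f"max-age={retry_after}",
    }
```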
Malicious or mis‑configured tenant requests (e.g., massive thumbnail version changes) can overload the cluster; rate‑limiting and peak‑control mechanisms are applied per tenant to protect overall stability.
Images are processed entirely in memory and never written to disk, though a two‑layer cache exists for already processed results.
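The two‑layer result cache can be sketched as a local in‑memory layer in front of a shared store. Both layers here are plain dicts for illustration; in the real system the second layer is presumably the cloud storage or a KV store:

```python
local_cache = {}    # per-node, fastest
shared_cache = {}   # cluster-wide (e.g., cloud storage / KV store)

def get_processed(key, process):
    """Look up an already-processed image before recomputing it."""
    if key in local_cache:
        return local_cache[key]
    if key in shared_cache:
        local_cache[key] = shared_cache[key]  # promote to local layer
        return local_cache[key]
    result = process(key)        # cache miss: do the work in memory
    shared_cache[key] = result
    local_cache[key] = result
    return result
```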
Q & A
1. Why use only Layer‑7 load balancing with Nginx instead of LVS? The load is low; IP round‑robin plus application‑level fault tolerance is sufficient.
2. Where are processed images stored? In Upyun’s cloud storage; a key/value store is also suitable.
3. Any plans to use GPU for image processing? GPU would increase development complexity; CPU remains the optimal choice for current workloads.
4. Comparison of GraphicsMagick vs. ImageMagick? GraphicsMagick is a fork of ImageMagick focused on stability and performance; ImageMagick carries more features and a heavier interface, at some performance cost, so Upyun builds on GraphicsMagick.
5. Does Upyun provide OCR? No, not at present.
6. How does the new mixed‑business architecture avoid cross‑traffic impact? Workers are pinned to individual CPU cores; synchronous tasks (image processing) receive priority, and per‑tenant resource limits prevent one tenant from overwhelming others.
7. How to reduce repeated processing for direct‑link usage? Implement access control, rate limiting, or use thumbnail versioning to cache results.
8. Should a connection pool replace per‑node attempts? A pool can cause imbalance; the queue‑based worker model is preferred.
9. Where is timeout detection performed? Inside GmServer; it terminates its own process on timeout, avoiding resource waste on the Nginx side.
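Doing timeout detection inside the worker can be sketched by running the heavy step in a child process and killing it on expiry. This is a Unix‑flavored Python sketch (Upyun's GmServer does this in C); the command and status mapping are assumptions, with 413 matching the large‑image code used elsewhere in the article:

```python
import subprocess

def run_with_timeout(cmd, timeout):
    """Run an external processing command (e.g. a `gm convert` call)
    and terminate it ourselves if it exceeds the deadline, instead of
    leaving a runaway process for the proxy layer to clean up."""
    try:
        subprocess.run(cmd, timeout=timeout, check=True)
        return 200
    except subprocess.TimeoutExpired:
        return 413  # "too big / too slow": lets Nginx reroute the request
    except subprocess.CalledProcessError:
        return 500
```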
Published by the High Availability Architecture official account.