Scaling KuJiaLe's ExaCloud: Inside the Distributed Rendering Architecture

This article chronicles the evolution of KuJiaLe's ExaCloud rendering platform from its 2013 GPU‑based prototype to a multi‑IDC, 2000‑node distributed system, detailing architectural redesigns, load‑balancing strategies, hybrid CPU/GPU processing, and operational lessons learned to achieve high‑throughput cloud rendering.

CoolHome R&D Department

Dec 30, 2017

Background

Rendering is a core competitive advantage of KuJiaLe. The ExaCloud rendering engine, developed in 2013, uses NVIDIA OptiX ray tracing on GPUs and averages 21 seconds per image. GPU parallelism provides speed, but it also brings longer initialization and higher error rates caused by reduced floating‑point precision.

Since CPU cores have become abundant, a CPU version was released in 2016, cutting the average rendering time to 10 seconds per image.

In 2017 the research lab combined GPU‑accelerated deep learning with ExaCloud to form a heterogeneous rendering mode, improving image quality while keeping rendering time low.

Both the GPU and CPU engines are compute‑intensive; in one captured example, all 64 CPU threads ran at 100 % utilization during rendering.

Distributed Rendering

To turn ExaCloud into a service and accelerate rendering, KuJiaLe adopted distributed rendering: a single frame is split into many tiles, the tiles are rendered on multiple machines, and the results are composited into the final image. This removes single‑machine limits but adds architectural complexity and raises overall resource consumption.
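The split‑and‑composite idea can be illustrated with a minimal sketch. It is not ExaCloud's implementation: the `Tile` type, the `split_frame` and `composite` helpers, and the representation of pixels as nested lists are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Tile:
    """Hypothetical tile descriptor: a rectangle inside the full frame."""
    x: int       # left edge in the full frame, in pixels
    y: int       # top edge in the full frame, in pixels
    width: int
    height: int


def split_frame(frame_w: int, frame_h: int, tile_w: int, tile_h: int) -> List[Tile]:
    """Cover the frame with non-overlapping tiles; edge tiles may be smaller."""
    tiles = []
    for y in range(0, frame_h, tile_h):
        for x in range(0, frame_w, tile_w):
            tiles.append(Tile(x, y, min(tile_w, frame_w - x), min(tile_h, frame_h - y)))
    return tiles


def composite(frame_w: int, frame_h: int,
              rendered: Dict[Tile, List[List[int]]]) -> List[List[int]]:
    """Paste each rendered tile (rows of pixel values) back at its offset."""
    frame = [[0] * frame_w for _ in range(frame_h)]
    for tile, pixels in rendered.items():
        for row in range(tile.height):
            frame[tile.y + row][tile.x:tile.x + tile.width] = pixels[row]
    return frame
```

In this sketch each tile can be rendered independently on a different machine, and the compositing step only needs the tile offsets, which is what makes the work easy to distribute.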

By the end of 2017 the platform ran on four IDC clusters with 2 000 compute nodes, producing 1.3 million images per day. The architecture has been repeatedly refactored to handle growing traffic.

Architecture 1.0 (2013‑2014)

The first SaaS rendering architecture routed client requests through Nginx to a render‑script service, which generated a render request stored in MongoDB. An LVS load balancer distributed request IDs to render nodes, which processed them sequentially, downloading assets from a storage server, rendering with ExaCloud, and uploading results to Alibaba OSS. Clients polled the script service for progress.

Limitations included uneven node load, lack of distributed rendering, single‑point failures in MongoDB, OSS, and storage, limited network bandwidth, and no cross‑IDC load balancing.

Architecture 2.0 (2014‑2016)

To address scaling, the 2.0 redesign introduced a render queue persisted in MySQL, decoupling queue management from the render nodes. Render nodes were grouped into small clusters behind LVS, and each node also performed task slicing, splitting a frame into 800×600 pixel tiles for distributed rendering.

A load‑aware RPC SDK in the render‑script service enabled cross‑IDC load balancing, which made multi‑IDC deployment possible. Distributed rendering was realized by having render nodes request sub‑tiles, merge them, and upload the final image.
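The article does not describe how the SDK measured load. As a rough sketch of load‑aware routing across IDCs, assume each cluster reports its busy nodes, total nodes, and queue depth, and the caller picks the cluster with the most headroom. `ClusterLoad`, `pressure`, and `pick_cluster` are hypothetical names, and the pressure formula is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterLoad:
    name: str          # hypothetical IDC identifier
    busy_nodes: int    # nodes currently rendering
    total_nodes: int   # nodes registered in this IDC
    queued_tasks: int  # tasks waiting in this IDC's queue

    def pressure(self) -> float:
        """Running plus pending work per node; lower means more headroom."""
        return (self.busy_nodes + self.queued_tasks) / self.total_nodes


def pick_cluster(clusters: List[ClusterLoad]) -> ClusterLoad:
    """Route the next request to the IDC with the lowest pressure."""
    candidates = [c for c in clusters if c.total_nodes > 0]
    if not candidates:
        raise ValueError("no cluster with registered nodes")
    return min(candidates, key=ClusterLoad.pressure)
```

A real SDK would also have to account for stale load reports and network distance between the caller and each IDC, which this sketch ignores.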

Heartbeat messages reported task progress and success/failure. An error‑retry mechanism reset failed tasks, reducing error rates. OpenStack Swift provided distributed storage, raising throughput.
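One way to picture the heartbeat and retry loop is a tracker that records the latest report per task and resets tasks that fail or go silent. The sketch below is an assumption‑laden illustration, not the platform's code: `TaskTracker`, the 30‑second timeout, and the three‑attempt limit are all invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskState:
    status: str = "pending"      # pending | running | done | failed
    progress: float = 0.0        # 0.0 .. 1.0, as reported by the node
    last_heartbeat: float = 0.0  # timestamp of the latest heartbeat
    attempts: int = 0


class TaskTracker:
    def __init__(self, timeout_s: float = 30.0, max_attempts: int = 3):
        self.timeout_s = timeout_s        # assumed silence threshold
        self.max_attempts = max_attempts  # assumed retry budget
        self.tasks: Dict[str, TaskState] = {}

    def start(self, task_id: str, now: float) -> None:
        """A node picked up the task; count the attempt."""
        state = self.tasks.setdefault(task_id, TaskState())
        state.status, state.progress = "running", 0.0
        state.last_heartbeat = now
        state.attempts += 1

    def heartbeat(self, task_id: str, progress: float, now: float,
                  failed: bool = False) -> None:
        """Record progress, or reset the task if the node reports failure."""
        state = self.tasks[task_id]
        state.last_heartbeat = now
        state.progress = progress
        if failed:
            self._reset(state)
        elif progress >= 1.0:
            state.status = "done"

    def sweep(self, now: float) -> List[str]:
        """Reset running tasks whose node went silent; return their ids."""
        stale = []
        for task_id, state in self.tasks.items():
            if state.status == "running" and now - state.last_heartbeat > self.timeout_s:
                self._reset(state)
                stale.append(task_id)
        return stale

    def _reset(self, state: TaskState) -> None:
        # Back to pending for another attempt, or give up once the budget is spent.
        state.status = "pending" if state.attempts < self.max_attempts else "failed"
        state.progress = 0.0
```

Timestamps are passed in explicitly so the behavior is deterministic; a scheduler would call `sweep` periodically with the current time.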

After the upgrade, the platform operated 250 nodes, reduced idle rate to near 0 %, and supported 200 k daily renders.

New challenges emerged: render nodes with mixed roles increased code coupling; the MySQL‑based queue came under performance pressure from frequent inserts and updates; supporting multiple rendering engines (photo‑realistic, baking, deep‑learning) required extensive coordination; reliance on external services (OSS, MongoDB, MySQL) introduced availability risks; and monitoring remained limited.

Architecture 3.0 (2015‑2017)

The 3.0 redesign targeted a single cluster of 1 000 nodes while lowering development cost. Render nodes were made stateless, handling only CPU/GPU computation; scheduling logic moved to a new proxy service composed of a master‑slave cluster using ring‑hash for task assignment.
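The article names ring hashing but gives no details. A minimal consistent‑hash ring along the following lines shows how tasks could be assigned to proxy instances so that removing one instance only remaps the tasks it owned. The use of MD5, 100 virtual points per proxy, and the `HashRing` API are assumptions for illustration.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, replicas: int = 100):
        self.replicas = replicas           # virtual points per proxy (assumed)
        self._points: List[int] = []       # sorted ring positions
        self._owner: Dict[int, str] = {}   # ring position -> proxy name

    def add(self, proxy: str) -> None:
        for i in range(self.replicas):
            point = _hash(f"{proxy}#{i}")
            if point in self._owner:
                continue  # extremely unlikely 64-bit collision; skip the point
            bisect.insort(self._points, point)
            self._owner[point] = proxy

    def remove(self, proxy: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != proxy]
        self._owner = {p: o for p, o in self._owner.items() if o != proxy}

    def lookup(self, task_id: str) -> str:
        """Return the proxy owning the first ring point clockwise of the task."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(task_id)) % len(self._points)
        return self._owner[self._points[idx]]
```

Because each task maps to the next point on the ring, taking one proxy out leaves every other task's owner unchanged, which limits the amount of rescheduling after a failure.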

Render‑node RPC calls were simplified; nodes obtained tasks and reported progress via heartbeat to the proxy.

External dependencies were reduced: MongoDB was removed, and a handshake protocol filtered “junk” requests, while OSS and Kingsoft KS3 provided replicated object storage, improving upload reliability.

All components were made horizontally scalable, eliminating expansion bottlenecks. Comprehensive monitoring, alerting, and analytics were added.

Multi‑level queues were introduced: high‑priority tasks reside in an in‑memory queue on the master, which is replicated to Redis for failover, and overflow tasks are persisted to OSS, so the queue can keep growing without risking out‑of‑memory failures.
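The tiering can be sketched as a bounded in‑memory level, a mirrored copy for failover, and a spill level for everything that does not fit. In the sketch below the Redis replica and the OSS overflow store are replaced by in‑process deques, priorities are ignored in favor of plain FIFO order, and `MultiLevelQueue` is a hypothetical name.

```python
from collections import deque
from typing import Deque, Optional


class MultiLevelQueue:
    def __init__(self, memory_limit: int):
        if memory_limit < 1:
            raise ValueError("memory_limit must be at least 1")
        self.memory_limit = memory_limit
        self.memory: Deque[str] = deque()    # hot in-memory level
        self.replica: Deque[str] = deque()   # stand-in for the Redis copy
        self.overflow: Deque[str] = deque()  # stand-in for tasks persisted to OSS

    def push(self, task_id: str) -> None:
        # Once anything has spilled, keep spilling so FIFO order is preserved.
        if self.overflow or len(self.memory) >= self.memory_limit:
            self.overflow.append(task_id)
        else:
            self._admit(task_id)

    def pop(self) -> Optional[str]:
        if not self.memory:
            return None
        task_id = self.memory.popleft()
        self.replica.popleft()  # the replica mirrors the memory level in order
        # Promote spilled tasks into the space that was just freed.
        while self.overflow and len(self.memory) < self.memory_limit:
            self._admit(self.overflow.popleft())
        return task_id

    def recover(self) -> None:
        """Rebuild the in-memory level from the replica after a failover."""
        self.memory = deque(self.replica)

    def _admit(self, task_id: str) -> None:
        self.memory.append(task_id)
        self.replica.append(task_id)
```

Memory use is bounded by `memory_limit` regardless of how many tasks are queued, which is the property the multi‑level design is after.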

Stateless render nodes support hot‑plug render engines; up to five engines (fast GPU, fast CPU, high‑quality CPU, baking, deep‑learning) coexist in the same cluster, selected per job.
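A stateless node that selects an engine per job can be approximated by a small registry keyed by engine name. This is only a sketch of the idea: `EngineRegistry`, the `engine` field on the job, and the callable signature are assumptions, not ExaCloud's plug‑in interface.

```python
from typing import Callable, Dict

# Hypothetical engine signature: takes a job description, returns a result reference.
RenderFn = Callable[[dict], str]


class EngineRegistry:
    def __init__(self) -> None:
        self._engines: Dict[str, RenderFn] = {}

    def register(self, name: str, engine: RenderFn) -> None:
        """Plug an engine in (or replace it) without restarting the node."""
        self._engines[name] = engine

    def unregister(self, name: str) -> None:
        """Unplug an engine; jobs that request it will be rejected."""
        self._engines.pop(name, None)

    def render(self, job: dict) -> str:
        """Dispatch the job to the engine named in its 'engine' field."""
        engine = self._engines.get(job["engine"])
        if engine is None:
            raise KeyError(f"no engine registered as {job['engine']!r}")
        return engine(job)
```

For example, after `registry.register("fast_cpu", some_callable)`, a job such as `{"engine": "fast_cpu", "scene": "s1"}` would be routed to that callable, and other engines could be added or removed on the same node at run time.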

By September 2017 the 3.0 system comprised four IDC clusters with 2 000 nodes, handling 1.3 million daily renders and 1.4 million models, while fault rates dropped dramatically.

Operational improvements included health checks that automatically take unhealthy nodes offline, along with auto‑recovery mechanisms, which reduced manual maintenance.
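The article does not say how a node is judged unhealthy or recovered. One common pattern, shown here purely as an assumption, is to take a node offline after several consecutive failed checks and bring it back after several consecutive passes. `NodeHealth` and both thresholds are hypothetical.

```python
class NodeHealth:
    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 2):
        self.fail_threshold = fail_threshold        # failures before going offline
        self.recover_threshold = recover_threshold  # passes before coming back
        self.online = True
        self._fails = 0
        self._passes = 0

    def record(self, healthy: bool) -> bool:
        """Record one health-check result and return whether the node is online."""
        if healthy:
            self._fails = 0
            self._passes += 1
            if not self.online and self._passes >= self.recover_threshold:
                self.online = True   # auto-recovery
        else:
            self._passes = 0
            self._fails += 1
            if self.online and self._fails >= self.fail_threshold:
                self.online = False  # automatic offlining
        return self.online
```

Requiring consecutive results in both directions keeps a single flaky check from flapping a node in and out of the scheduling pool.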

Future Directions

The main remaining challenge is the “peak‑valley” traffic pattern, with high demand during work hours and low demand otherwise. Attempts such as elastic cloud clusters, adaptive tiling, and deep‑learning‑based resource augmentation have been made, but balancing load through product, technical, and operational measures remains an ongoing focus.
