
Ctrip's Full‑Chain Desktop Operations Platform for Managing Tens of Thousands of PCs

The article presents a comprehensive case study of Ctrip's full‑chain desktop operations system, detailing its architecture, cross‑platform Rust/Tauri agents, SpringBoot server, security measures, operational challenges, performance optimizations, and the measurable improvements in fault detection and repair across a massive corporate PC fleet.

Ctrip Technology

Jul 25, 2024

As Ctrip’s employee base grew, the company faced the challenge of managing tens of thousands of diverse PCs while meeting compliance and information‑security requirements and ensuring stable, efficient work environments.

Existing automated tools reduced some user‑initiated fixes, but many incidents still required manual intervention, leading to low efficiency and missed issues. Ctrip therefore adopted a proactive, automated desktop‑operations model that detects faults early and repairs them automatically.

Architecture selection

Cross‑platform agents are built with Rust and the Tauri framework to run on Windows, macOS, and Linux, providing memory safety and high performance.

The server side uses SpringBoot to expose APIs for policy distribution and data collection.

Client scripts (PowerShell, BAT, EXE) are managed centrally via the server, keeping agents and scripts loosely coupled for extensibility.

The management console is built with Django, Django‑SimpleUI, and Vue, enabling rapid development of rich UI features.

Business evaluation and design

Each PC reports tens of check items every hour, which adds up to tens of millions of records across the fleet; the system uses multithreading, asynchronous queues, and careful index design to handle the load.
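The combination of worker threads, a queue, and an index can be sketched as follows. This is a minimal illustration, not Ctrip's server code (which the article says is built on SpringBoot): the table name `check_result`, its columns, the index name, and the use of SQLite are all assumptions made so the example runs on its own.

```python
import queue
import sqlite3
import threading

# Hypothetical schema: one row per (machine, check item, timestamp).
# The composite index matches the most likely lookup: one machine's
# history for one check item, ordered by time.
SCHEMA = """
CREATE TABLE IF NOT EXISTS check_result (
    machine_id TEXT NOT NULL,
    item_id    TEXT NOT NULL,
    status     TEXT NOT NULL,
    checked_at INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_machine_item_time
    ON check_result (machine_id, item_id, checked_at);
"""


def writer(db_path, q, batch_size=500):
    """Drain the queue and insert rows in batches; None is the stop signal."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    batch = []
    while True:
        row = q.get()
        if row is not None:
            batch.append(row)
        # Flush when the batch is full or when shutting down.
        if batch and (row is None or len(batch) >= batch_size):
            conn.executemany(
                "INSERT INTO check_result VALUES (?, ?, ?, ?)", batch
            )
            conn.commit()
            batch.clear()
        if row is None:
            break
    conn.close()


def ingest(db_path, reports, producers=4):
    """Feed reports through a bounded queue from several producer threads."""
    q = queue.Queue(maxsize=10_000)  # bounded, so producers block under load
    w = threading.Thread(target=writer, args=(db_path, q))
    w.start()

    def produce(chunk):
        for row in chunk:
            q.put(row)

    threads = [
        threading.Thread(target=produce, args=(reports[i::producers],))
        for i in range(producers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    q.put(None)  # all producers are done; tell the writer to stop
    w.join()
```

Funnelling writes through a single batching consumer keeps request handlers fast and turns many small inserts into a few large ones; the bounded queue applies back-pressure instead of letting memory grow without limit.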

Data collection operates in two modes: scheduled client uploads and real‑time external queries, both secured with asymmetric encryption and token‑based authentication.
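The article does not say how the tokens are constructed. As one possible shape, the sketch below issues and validates an HMAC‑signed, expiring session token using only the Python standard library. The token layout (`agent_id:expiry:signature`), the function names, and the shared `secret` are assumptions for illustration; the asymmetric encryption layer is not shown.

```python
import hashlib
import hmac
import time


def issue_token(secret: bytes, agent_id: str, ttl_s: int, now=None) -> str:
    """Return 'agent_id:expiry:signature' (hypothetical token layout)."""
    issued_at = time.time() if now is None else now
    expires = int(issued_at + ttl_s)
    payload = f"{agent_id}:{expires}"
    sig = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"


def verify_token(secret: bytes, token: str, now=None):
    """Return the agent id if the token is authentic and unexpired, else None."""
    try:
        agent_id, expires, sig = token.rsplit(":", 2)
        expires_at = int(expires)
    except ValueError:
        return None  # malformed token
    expected = hmac.new(
        secret, f"{agent_id}:{expires}".encode(), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    current = time.time() if now is None else now
    if current >= expires_at:
        return None  # expired
    return agent_id
```

The `now` parameter exists only to make expiry testable; in normal use it is left unset.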

Operational workflow

After the client logs in, the agent schedules tasks, downloads the required scripts, executes them with System privileges, collects and encrypts the results, and reports them to the server. The server validates and stores the results and, if needed, pushes repair scripts back to the client. Repair scripts can be configured as direct fix, reminder‑fix, or reminder‑only, with UI pop‑ups for user interaction.
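The three repair modes can be pictured as a small dispatch on the client. The sketch below is an interpretation, not the agent's actual logic (the real agent is written in Rust): it assumes that direct fix repairs silently, reminder‑fix asks the user before repairing, and reminder‑only just notifies. The names `RepairMode`, `handle_failed_check`, `run_repair`, and `ask_user` are hypothetical.

```python
from enum import Enum
from typing import Callable


class RepairMode(Enum):
    DIRECT_FIX = "direct_fix"        # assumed: repair without asking
    REMINDER_FIX = "reminder_fix"    # assumed: ask first, repair on consent
    REMINDER_ONLY = "reminder_only"  # assumed: notify, never repair


def handle_failed_check(
    mode: RepairMode,
    run_repair: Callable[[], None],
    ask_user: Callable[[str], bool],
) -> str:
    """Apply the configured mode and return the action that was taken."""
    if mode is RepairMode.DIRECT_FIX:
        run_repair()
        return "repaired"
    if mode is RepairMode.REMINDER_FIX:
        if ask_user("A problem was found on this PC. Fix it now?"):
            run_repair()
            return "repaired"
        return "declined"
    # REMINDER_ONLY: show the pop-up but leave the machine untouched.
    ask_user("A problem was found on this PC. Please contact IT support.")
    return "notified"
```

Passing the repair action and the prompt in as callables keeps the decision logic independent of how pop‑ups are displayed.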

Management modules

Check‑item management allows engineers to define items, thresholds, validation logic, and gray‑release strategies.

Script management links scripts to check items, supports OS selection, parameters, timeout settings, and requires approval before deployment.

Gray‑release controls enable fine‑grained targeting by employee, team, location, etc.
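Targeting by employee, team, or location amounts to a membership test per device. The sketch below shows one way to express such a rule. The `Device` and `GrayRule` types and their fields are assumptions for illustration; the article does not describe the platform's rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    employee_id: str
    team: str
    location: str


@dataclass(frozen=True)
class GrayRule:
    """A device is in scope if it matches any of the listed targets."""
    employees: frozenset = frozenset()
    teams: frozenset = frozenset()
    locations: frozenset = frozenset()

    def matches(self, device: Device) -> bool:
        return (
            device.employee_id in self.employees
            or device.team in self.teams
            or device.location in self.locations
        )


def select_targets(devices, rule):
    """Return the devices that should receive the gray-released item."""
    return [d for d in devices if rule.matches(d)]


fleet = [
    Device("e001", "it-ops", "Shanghai"),
    Device("e002", "finance", "Shanghai"),
    Device("e003", "finance", "Nantong"),
]
pilot = GrayRule(employees=frozenset({"e003"}), teams=frozenset({"it-ops"}))
pilot_devices = select_targets(fleet, pilot)  # e001 (by team) and e003 (by employee)
```

An empty rule matches nothing, so a new check item or script reaches no machine until someone explicitly widens its scope.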

Data query UI provides filtering by machine, user, item, result, and timestamp.

Additional features include alert configuration, permission management, and audit logging.

Security measures

Bidirectional asymmetric encryption secures agent‑server communication; tokens enforce session authentication.

Dual internal/external domain names allow agents to operate across isolated, zero‑trust networks.

MD5 verification of cached files detects tampering.
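A checksum check of this kind can be sketched with Python's standard `hashlib`. The function names and the idea that the server supplies the expected digest are assumptions. MD5 is not collision‑resistant, so it guards against accidental corruption and casual modification rather than a determined attacker; `hashlib.sha256` can be swapped in with no other changes.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Hash the file in chunks so large scripts are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_file_is_intact(path, expected_md5):
    """True only if the cached file exists and its digest matches."""
    if not Path(path).is_file():
        return False
    return md5_of_file(path) == expected_md5.lower()
```

An agent would run the cached script only when `cached_file_is_intact` returns `True`, and download it again otherwise.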

Challenges and solutions

Data volume surged to more than 50 million records; an incremental update strategy reduced growth by over 70%.
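The article does not spell out what the incremental strategy is. One common reading is that a client reports only the check items whose result changed since its last successful upload, which is sketched below. The function name `diff_report` and the dictionary shapes are hypothetical.

```python
_ABSENT = object()  # sentinel, so a stored value of None still compares correctly


def diff_report(last_sent: dict, current: dict) -> dict:
    """Return only what changed since the last upload.

    Both arguments map a check-item id to its latest result.
    """
    changed = {
        item: result
        for item, result in current.items()
        if last_sent.get(item, _ABSENT) != result
    }
    removed = [item for item in last_sent if item not in current]
    return {"changed": changed, "removed": removed}


previous = {"disk_free": "ok", "antivirus": "ok", "vpn_client": "outdated"}
latest = {"disk_free": "ok", "antivirus": "disabled", "vpn_client": "outdated"}
delta = diff_report(previous, latest)  # only the antivirus result is uploaded
```

Because most check items are stable from one hour to the next, uploading deltas rather than full snapshots shrinks both the payload and the number of rows the server has to store.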

Real‑time query latency was improved with parallel calls and caching.
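Parallel calls plus a short‑lived cache can be sketched as below. The `fetch` callable stands in for whatever call retrieves one machine's live state; it, `TTLCache`, and `query_many` are hypothetical names, and the real server is Java rather than Python.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MISS = object()  # distinguishes "not cached" from a cached None


class TTLCache:
    """Tiny time-based cache; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or self._clock() >= entry[0]:
            self._entries.pop(key, None)
            return MISS
        return entry[1]

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)


def query_many(machine_ids, fetch, cache, max_workers=8):
    """Serve cached answers immediately and fetch the rest in parallel."""
    results, missing = {}, []
    for mid in machine_ids:
        hit = cache.get(mid)
        if hit is MISS:
            missing.append(mid)
        else:
            results[mid] = hit
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs ids with their results.
        for mid, value in zip(missing, pool.map(fetch, missing)):
            cache.put(mid, value)
            results[mid] = value
    return results
```

With this shape, total latency for a batch approaches that of the slowest single call rather than the sum of all calls, and repeated queries within the TTL skip the remote call entirely.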

System‑level agents could not display GUI prompts; the solution split the agent into FLT‑System.exe (System privilege) and FLT‑User.exe (User privilege) communicating via RPC with encrypted local sockets.

Coverage for logged‑out machines was increased by running both executables with System privileges during logout.

Script execution time monitoring and timeout enforcement cut abnormal collection durations from >1 hour to the normal 2‑5 minute range.
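Timing and timeout enforcement for a script run can be sketched with Python's `subprocess` module. The function name, the result fields, and the status strings are assumptions; the article's agent is a Rust program and its actual mechanism is not described.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s):
    """Run a command, kill it if it exceeds timeout_s, and record the duration."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        status = "ok" if proc.returncode == 0 else "failed"
        output = proc.stdout
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        status, output = "timeout", ""
    return {
        "status": status,
        "output": output,
        "seconds": time.monotonic() - start,
    }


# Example: a well-behaved script and one that hangs.
quick = run_with_timeout([sys.executable, "-c", "print('disk ok')"], timeout_s=30)
stuck = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(60)"], timeout_s=1
)
```

Reporting the measured duration alongside the status is what makes slow scripts visible on the server, so they can be fixed rather than merely cut off.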

Results

The platform achieved automatic detection and repair of PC faults, reducing the weekly average number of faults by 20‑30% and decreasing manual service tickets by over 10%. It provided near‑real‑time health monitoring, supported diverse employee needs, and maintained information‑security compliance.

Future work includes deeper analysis of script failures, continued performance optimization, and expanding the platform’s capabilities to provide even more robust desktop‑system operations for the enterprise.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
