Comprehensive Case Study of Large‑Scale Desktop IT Management and Automated Fault Detection at Ctrip
This article presents a case study of Ctrip's large‑scale desktop IT management solution. It covers the challenges of managing tens of thousands of office PCs, the full‑link (end‑to‑end) architecture built with Rust, Tauri, Spring Boot and Django, automated health monitoring, fault detection and remediation workflows, security measures, performance optimizations, and the measurable operational improvements achieved.
Introduction – Managing tens of thousands of employee PCs while meeting compliance and information‑security requirements is a complex task that demands efficient, stable, and secure operations.
Current Situation – High fault rates and reliance on manual troubleshooting lead to low efficiency; a proactive, automated desktop‑operations model is needed.
Full‑Link R&D Operations Practice
Architecture Selection – The agent is a cross‑platform Rust application built with Tauri that runs on Windows, macOS and Linux; the server side exposes its APIs with Spring Boot, and the management console is built with Django, Django‑SimpleUI and Vue.
Data Collection & Script Management – Agents periodically execute PowerShell, BAT or EXE scripts, managed centrally via the server; scripts and agents are loosely coupled to enable extensibility.
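The script-execution step can be sketched as follows. This is a minimal illustration in Python for brevity (the article's agent is written in Rust), and the names `LAUNCHERS`, `RunResult`, `build_command` and `run_command` are hypothetical, not taken from Ctrip's code. It shows one way to keep the runner decoupled from the scripts: the runner only knows how to launch a script type, enforce a timeout, and record how long the run took.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mapping from script type to the command prefix that launches it.
LAUNCHERS = {
    "ps1": ["powershell", "-NoProfile", "-ExecutionPolicy", "Bypass", "-File"],
    "bat": ["cmd", "/c"],
    "exe": [],  # executables are started directly
}


@dataclass
class RunResult:
    exit_code: Optional[int]  # None when the run was stopped by the timeout
    stdout: str
    duration_s: float
    timed_out: bool


def build_command(script_type: str, path: str) -> List[str]:
    """Return the full command line for a script of the given type."""
    return LAUNCHERS[script_type] + [path]


def run_command(cmd: List[str], timeout_s: float) -> RunResult:
    """Run a command, enforcing a timeout and measuring wall-clock duration."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return RunResult(proc.returncode, proc.stdout, time.perf_counter() - start, False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return RunResult(None, "", time.perf_counter() - start, True)


if __name__ == "__main__":
    # Stand-in for a real check script: the current Python interpreter.
    result = run_command([sys.executable, "-c", "print('disk ok')"], timeout_s=30)
    print(result.exit_code, result.stdout.strip(), result.timed_out)
```

Because a new script type only needs a new entry in the launcher table, scripts can be added or changed on the server without touching the runner itself.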
Business Process – Agents schedule tasks, download scripts, collect health data, encrypt and upload results, receive remediation scripts, and handle three remediation modes (direct fix, reminder‑fix, reminder only). The server caches task definitions, stores scripts as BLOBs with MD5 verification, validates data against configurable rules, and persists results via asynchronous queues.
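Two pieces of this process lend themselves to a short sketch: checking a downloaded script against its MD5 digest, and choosing what to do for each remediation mode. The code below is illustrative Python, not the agent's actual implementation. The function names and the step names (`run_fix`, `notify_user`) are hypothetical, and the reading of the three modes (fix silently, fix after the user agrees, notify only) is an assumption based on their names.

```python
import hashlib
from enum import Enum
from typing import List


class RemediationMode(Enum):
    DIRECT_FIX = "direct_fix"        # assumed: fix without asking the user
    REMINDER_FIX = "reminder_fix"    # assumed: remind the user, fix once they agree
    REMINDER_ONLY = "reminder_only"  # assumed: remind the user, never fix automatically


def md5_matches(script_bytes: bytes, expected_md5: str) -> bool:
    """Compare the MD5 digest of a downloaded script with the server-supplied value."""
    return hashlib.md5(script_bytes).hexdigest() == expected_md5.lower()


def plan_remediation(mode: RemediationMode, user_accepted: bool = False) -> List[str]:
    """Return the ordered steps the agent would take for a failed check."""
    if mode is RemediationMode.DIRECT_FIX:
        return ["run_fix"]
    if mode is RemediationMode.REMINDER_FIX:
        return ["notify_user", "run_fix"] if user_accepted else ["notify_user"]
    return ["notify_user"]


if __name__ == "__main__":
    payload = b"Write-Output 'fix applied'"
    digest = hashlib.md5(payload).hexdigest()
    if md5_matches(payload, digest):
        print(plan_remediation(RemediationMode.REMINDER_FIX, user_accepted=True))
```

MD5 here guards against a corrupted or incomplete download. It is not a defence against deliberate tampering on its own, which is presumably why the article also mentions encryption of the uploaded data.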
Operations Management Module – Provides UI for check‑item management, script governance, gray‑release strategies, employee batch handling, data query, permission control, and audit logging.
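The article does not say how the gray‑release strategies select machines or employees. One common approach, shown here purely as an assumption, is to hash a stable identifier into a bucket from 0 to 99 and enable the check item for buckets below the rollout percentage. The function name `in_gray_release` and its parameters are hypothetical.

```python
import hashlib


def in_gray_release(employee_id: str, check_item: str, percent: int) -> bool:
    """Decide deterministically whether an employee is in the rollout for a check item.

    The same employee always lands in the same bucket for a given check item,
    so raising `percent` only ever adds employees to the rollout.
    """
    key = f"{check_item}:{employee_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    staff = [f"E{n:05d}" for n in range(1000)]
    enabled = sum(in_gray_release(e, "disk-space-check", 10) for e in staff)
    print(f"{enabled} of {len(staff)} employees receive the new check")
```

Including the check item in the hash key means different check items roll out to different subsets, so the same employees are not always the first to receive every change.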
Challenges & Solutions
1) Massive data volume – over 50 million records – mitigated by incremental updates and active‑record flags, reducing storage growth by >70%.
2) Real‑time query latency – addressed with parallel queries and caching.
3) GUI interaction from System‑level agents – a process running under the System account cannot display UI on the logged‑in user's desktop, so the agent is split into FLT‑System.exe (system tasks) and FLT‑User.exe (user‑visible tasks), which communicate over RPC on encrypted channels.
4) Coverage in logged‑out state – both executables run with System privileges when no user is logged in.
5) Rapid fault localization – audit logs record all configuration changes for quick troubleshooting.
6) Script execution time – added execution‑time statistics, timeout enforcement, and a “collection‑time” field to improve overall efficiency.
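The storage approach in challenge 1 can be illustrated with a small sketch. The article gives no schema, so the table `health_record`, its columns, and the function `store_result` are assumptions. The idea shown is that a new row is written only when a check result changes, the previous row is flagged inactive, and an unchanged result merely refreshes its collection time. SQLite is used so the example runs as written.

```python
import sqlite3

# Hypothetical schema: one row per distinct value, with a flag marking the current one.
SCHEMA = """
CREATE TABLE health_record (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    pc_id        TEXT NOT NULL,
    check_item   TEXT NOT NULL,
    value        TEXT NOT NULL,
    collected_at TEXT NOT NULL,
    is_active    INTEGER NOT NULL DEFAULT 1
);
"""


def store_result(conn: sqlite3.Connection, pc_id: str, check_item: str,
                 value: str, collected_at: str) -> bool:
    """Store a check result incrementally. Returns True if a new row was written."""
    row = conn.execute(
        "SELECT id, value FROM health_record "
        "WHERE pc_id = ? AND check_item = ? AND is_active = 1",
        (pc_id, check_item),
    ).fetchone()
    if row is not None and row[1] == value:
        # Unchanged result: refresh the collection time instead of adding a row.
        conn.execute("UPDATE health_record SET collected_at = ? WHERE id = ?",
                     (collected_at, row[0]))
        return False
    if row is not None:
        # Changed result: keep the old row as history but mark it inactive.
        conn.execute("UPDATE health_record SET is_active = 0 WHERE id = ?", (row[0],))
    conn.execute(
        "INSERT INTO health_record (pc_id, check_item, value, collected_at) "
        "VALUES (?, ?, ?, ?)",
        (pc_id, check_item, value, collected_at),
    )
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    store_result(conn, "PC-001", "antivirus", "running", "2024-01-01T09:00")
    store_result(conn, "PC-001", "antivirus", "running", "2024-01-01T10:00")
    store_result(conn, "PC-001", "antivirus", "stopped", "2024-01-01T11:00")
    print(conn.execute("SELECT COUNT(*) FROM health_record").fetchone()[0])
```

Queries for current health then filter on the active flag, which keeps them small even as history accumulates. Since most checks return the same value run after run, writing only on change is where a reduction in storage growth of the kind reported would come from.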
Results – Automated detection and remediation reduced weekly PC fault volume by 20‑30%, lowered manual service tickets by >10%, and improved overall PC health metrics. The system now supports real‑time monitoring and provides a solid foundation for future desktop‑operations enhancements.
Future Work – Further analysis of script failures, performance tuning, and expanding functionality to maintain a leading‑edge desktop‑operations platform.
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.