How Qcmd Revolutionizes Automated Operations for 7,000+ Servers
Qcmd, the command execution system behind 360’s private HULK cloud platform, replaces SaltStack with an asynchronous, Golang‑based architecture. It provides high availability, encrypted messaging, and reliable command execution across thousands of hosts, sharply reducing task timeouts and operational overhead.
Background
The HULK cloud platform requires a command execution system that can batch‑run scripts on hosts and present the results. SaltStack was used initially because it can execute commands on many hosts, but growing network complexity, a rising host count, and evolving requirements exposed its limitations:
Pub/Sub model demands high network reliability; time‑outs often require manual re‑execution.
Managing a large number of Minions becomes cumbersome, and restarting the Master is risky.
Inconsistent SaltStack versions lead to poor compatibility with new features.
Minion maintenance, especially upgrades, is costly.
Optimizations to SaltStack could not fully resolve these issues, and task timeouts still required manual host inspection. The team therefore explored a more stable messaging approach and chose asynchronous communication with Golang instead of Python.
Goals
Message delivery success rate – eliminate message loss.
Performance – support massive host command execution.
High availability – implement a Multi‑Master architecture.
Encryption – use AES‑CBC for secure messaging.
Common interfaces – provide essential Master APIs.
Maintainability – enable self‑upgrading.
Distribution – support multi‑datacenter deployment.
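The encryption goal can be illustrated with Go's standard library. The following is a minimal sketch of AES‑CBC with PKCS#7 padding and a random IV prepended to the ciphertext. The article does not describe Qcmd's actual key handling, padding, or wire format, so the function names `Encrypt` and `Decrypt`, the padding scheme, and the IV layout are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
	"io"
)

// pkcs7Pad returns a padded copy whose length is a multiple of blockSize.
func pkcs7Pad(data []byte, blockSize int) []byte {
	n := blockSize - len(data)%blockSize
	return append(append([]byte{}, data...), bytes.Repeat([]byte{byte(n)}, n)...)
}

// pkcs7Unpad validates and strips PKCS#7 padding.
func pkcs7Unpad(data []byte, blockSize int) ([]byte, error) {
	if len(data) == 0 || len(data)%blockSize != 0 {
		return nil, errors.New("invalid padded length")
	}
	n := int(data[len(data)-1])
	if n == 0 || n > blockSize {
		return nil, errors.New("invalid padding")
	}
	for _, b := range data[len(data)-n:] {
		if int(b) != n {
			return nil, errors.New("invalid padding")
		}
	}
	return data[:len(data)-n], nil
}

// Encrypt returns IV || ciphertext, using a fresh random IV per message.
// (Hypothetical layout; not taken from Qcmd.)
func Encrypt(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	padded := pkcs7Pad(plaintext, aes.BlockSize)
	out := make([]byte, aes.BlockSize+len(padded))
	iv := out[:aes.BlockSize]
	if _, err := io.ReadFull(rand.Reader, iv); err != nil {
		return nil, err
	}
	cipher.NewCBCEncrypter(block, iv).CryptBlocks(out[aes.BlockSize:], padded)
	return out, nil
}

// Decrypt reverses Encrypt: it splits off the IV, decrypts, and unpads.
func Decrypt(key, msg []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	if len(msg) < 2*aes.BlockSize || len(msg)%aes.BlockSize != 0 {
		return nil, errors.New("ciphertext too short or misaligned")
	}
	iv, ct := msg[:aes.BlockSize], msg[aes.BlockSize:]
	pt := make([]byte, len(ct))
	cipher.NewCBCDecrypter(block, iv).CryptBlocks(pt, ct)
	return pkcs7Unpad(pt, aes.BlockSize)
}

func main() {
	key := []byte("0123456789abcdef") // 16-byte demo key (AES-128); never hard-code real keys
	enc, err := Encrypt(key, []byte("test.ping"))
	if err != nil {
		panic(err)
	}
	dec, err := Decrypt(key, enc)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", dec)
}
```

CBC mode provides confidentiality only. A production system would normally also authenticate each message, for example with an HMAC over the IV and ciphertext, so that tampered messages are rejected before decryption.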
Overall Design
Qcmd consists of separate scheduling and execution modules, which together are referred to as Qcmd. The system runs on the HULK cloud platform and has evolved over four years of production use.
The diagram above shows the multi‑datacenter command dispatch flow.
Qcmd Design
Master Internal Message Passing
When a client calls the Pub interface, the Master validates and filters data, then serializes and encrypts the message before publishing it to the appropriate Topic and Channel, each of which maps one‑to‑one with a Minion.
Each Topic spawns a MessageDump goroutine that reads from MsgChan and pushes messages into the corresponding Channel.
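The Topic and Channel flow described above could be sketched roughly as follows. The names `MsgChan`, `MessageDump`, and `MinionMsgChan` come from the article; the struct layout, buffer size, `NewTopic`, `topicFor`, and `Subscribe` are hypothetical, and serialization and encryption are left out.

```go
package main

import (
	"fmt"
	"sync"
)

// Channel is the per-Minion queue that a Minion reads from.
type Channel struct {
	MinionMsgChan chan []byte
}

// Topic buffers published messages in MsgChan until messageDump moves them on.
// In this sketch each Topic owns exactly one Channel (one per Minion).
type Topic struct {
	MsgChan chan []byte
	channel *Channel
}

// NewTopic creates a Topic and starts its MessageDump goroutine.
func NewTopic(buf int) *Topic {
	t := &Topic{
		MsgChan: make(chan []byte, buf),
		channel: &Channel{MinionMsgChan: make(chan []byte, buf)},
	}
	go t.messageDump()
	return t
}

// messageDump drains MsgChan into the Channel until MsgChan is closed.
func (t *Topic) messageDump() {
	for msg := range t.MsgChan {
		t.channel.MinionMsgChan <- msg
	}
	close(t.channel.MinionMsgChan)
}

// Master maps each Minion ID to its Topic.
type Master struct {
	mu     sync.Mutex
	topics map[string]*Topic
}

func NewMaster() *Master {
	return &Master{topics: make(map[string]*Topic)}
}

// topicFor returns the Minion's Topic, creating it on first use.
func (m *Master) topicFor(minionID string) *Topic {
	m.mu.Lock()
	defer m.mu.Unlock()
	t, ok := m.topics[minionID]
	if !ok {
		t = NewTopic(16)
		m.topics[minionID] = t
	}
	return t
}

// Pub queues an (already serialized and encrypted) payload for one Minion.
func (m *Master) Pub(minionID string, payload []byte) {
	m.topicFor(minionID).MsgChan <- payload
}

// Subscribe returns the receive side of a Minion's Channel.
func (m *Master) Subscribe(minionID string) <-chan []byte {
	return m.topicFor(minionID).channel.MinionMsgChan
}

func main() {
	m := NewMaster()
	m.Pub("minion-1", []byte("test.ping"))
	fmt.Printf("%s\n", <-m.Subscribe("minion-1"))
}
```

Giving every Minion its own Topic and Channel keeps a slow or disconnected Minion from blocking delivery to the others, at the cost of one goroutine and two buffers per Minion.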
Master Subscribe Design
When a message arrives via Pub, the Master processes it, determines the target Minion ID, and writes the message into the Topic’s MsgChan. The Channel’s MessageDump then forwards it to the MinionMsgChan.
Minion Subscribe Design
Minions actively subscribe to their assigned Channels. When a Minion receives a message, it places the message into CmdChan, decrypts it, and dispatches it to the appropriate function. If the operation completes before its timeout, the result is returned to the Master; otherwise a callback notifies the Master of the timeout.
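The dispatch-with-timeout behavior can be sketched with Go's `context` package. This is only an illustration: the `Result` type, the `Handler` signature, the `handlers` table, and the `test.sleep` function are hypothetical, and the callback to the Master is reduced to a returned value.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Result is what the Minion would report back to the Master (hypothetical shape).
type Result struct {
	Output   string
	TimedOut bool
}

// Handler is a hypothetical signature for a dispatchable function.
type Handler func(ctx context.Context) (string, error)

// handlers maps function names to implementations.
var handlers = map[string]Handler{
	"test.ping": func(ctx context.Context) (string, error) { return "pong", nil },
	"test.sleep": func(ctx context.Context) (string, error) {
		select {
		case <-time.After(time.Second):
			return "woke up", nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	},
}

// runWithTimeout dispatches name and reports either its output or a timeout.
func runWithTimeout(name string, timeout time.Duration) Result {
	h, ok := handlers[name]
	if !ok {
		return Result{Output: "unknown function: " + name}
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	type outcome struct {
		out string
		err error
	}
	// Buffered so the worker goroutine can exit even after a timeout.
	done := make(chan outcome, 1)
	go func() {
		out, err := h(ctx)
		done <- outcome{out, err}
	}()

	select {
	case o := <-done:
		if errors.Is(o.err, context.DeadlineExceeded) {
			return Result{TimedOut: true}
		}
		if o.err != nil {
			return Result{Output: "error: " + o.err.Error()}
		}
		return Result{Output: o.out}
	case <-ctx.Done():
		return Result{TimedOut: true}
	}
}

func main() {
	fmt.Printf("%+v\n", runWithTimeout("test.ping", time.Second))
	fmt.Printf("%+v\n", runWithTimeout("test.sleep", 20*time.Millisecond))
}
```

Reporting the timeout explicitly, rather than staying silent, is what lets the Master distinguish a slow task from a lost message.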
Minion Heartbeat Design
Minions periodically send heartbeats to the Master and check the Master’s key status. If the Master’s IP changes, the Minion performs a RetryDns sequence: disconnect from the old Master, connect to the new Master, and re‑authenticate.
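The RetryDns sequence might look something like the sketch below. It covers only the reconnect steps named above (disconnect, connect, re‑authenticate). The `Conn` interface, the `Minion` fields, the injectable `resolve` function, and the host name are assumptions; the article does not say how the Minion detects the IP change, so DNS re‑resolution is assumed here based on the name.

```go
package main

import "fmt"

// Conn abstracts the Master connection; all methods are hypothetical.
type Conn interface {
	Connect(ip string) error
	Disconnect() error
	Authenticate() error
}

// Minion holds the Master's host name and the IP it is currently connected to.
type Minion struct {
	MasterHost string
	masterIP   string
	resolve    func(host string) (string, error) // e.g. a wrapper around net.LookupHost
	conn       Conn
}

// retryDNS re-resolves the Master and, if the IP changed, disconnects from
// the old Master, connects to the new one, and re-authenticates.
// It returns true when a switch happened.
func (m *Minion) retryDNS() (bool, error) {
	ip, err := m.resolve(m.MasterHost)
	if err != nil {
		return false, err
	}
	if ip == m.masterIP {
		return false, nil // Master unchanged; nothing to do
	}
	if err := m.conn.Disconnect(); err != nil {
		return false, err
	}
	if err := m.conn.Connect(ip); err != nil {
		return false, err
	}
	if err := m.conn.Authenticate(); err != nil {
		return false, err
	}
	m.masterIP = ip
	return true, nil
}

// recordingConn is a fake connection that records the steps it is asked to do.
type recordingConn struct{ steps []string }

func (c *recordingConn) Connect(ip string) error {
	c.steps = append(c.steps, "connect "+ip)
	return nil
}

func (c *recordingConn) Disconnect() error {
	c.steps = append(c.steps, "disconnect")
	return nil
}

func (c *recordingConn) Authenticate() error {
	c.steps = append(c.steps, "authenticate")
	return nil
}

func main() {
	conn := &recordingConn{}
	m := &Minion{
		MasterHost: "master.example.internal", // placeholder host name
		masterIP:   "10.0.0.1",
		conn:       conn,
		resolve:    func(string) (string, error) { return "10.0.0.2", nil },
	}
	switched, err := m.retryDNS()
	fmt.Println(switched, err, conn.steps)
}
```

In a running Minion this check would be driven by the heartbeat timer, for example from a `time.Ticker` loop that also sends the heartbeat itself.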
Performance
Testing on a production Master (16 CPU cores, 32 GB RAM) with over 7,000 Minions showed that 20 parallel test.ping tasks completed in roughly 21 seconds. Network traffic was modest, but CPU usage was high because of encryption and decryption.
Summary
Qcmd was designed and delivered as an Alpha version within two weeks. Key takeaways include:
Choose an automation tool that matches your workload; for HULK, only a small fraction of hosts require batch execution, making a lightweight, reliable system essential.
Reliability is paramount—message loss leads to manual intervention and delays.
Since October 2016, over 20,000 hosts have deployed Qcmd, with the largest Master handling more than 7,000 Minions and executing thousands of tasks daily, virtually eliminating timeouts. Future versions will encapsulate the communication protocol and enhance task controllability.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.