Scaling Command Execution to Tens of Thousands of Machines Using SaltStack Syndic and Redis

Discover how 360’s private cloud platform leverages SaltStack’s Syndic architecture and a Redis Pub/Sub bridge to efficiently manage and execute commands across tens of thousands of servers, addressing multi‑datacenter challenges, reducing message loss, and improving automation performance.

360 Zhihui Cloud Developer

Background

360’s private cloud platform (HULK) manages more than 90% of the company’s business lines. To control the massive fleet of servers, a complete automation tool is required. The command system is built on SaltStack, allowing batch script execution on many machines.

The main difficulty is the large number of data centers and the sheer volume of machines, which makes a single SaltStack Master impractical. A multi-level, multi-data-center Master architecture is used instead.

SaltStack Architecture

In SaltStack, the Master node publishes commands, while managed machines run a Minion that listens for those commands.

Salt Syndic Introduction

When the number of Minions grows, the Master becomes a performance bottleneck. Deploying multiple Masters solves the bottleneck but introduces complexity: commands must be sent to the appropriate Master. SaltStack 0.9.0 introduced Syndic, a special Minion that connects a lower‑level Master to a higher‑level Master (the “Master of Masters”).
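To illustrate the routing burden that multiple independent Masters create, here is a minimal sketch that groups target machines by the Master that owns them. The host names, the `MINION_TO_MASTER` inventory, and the function name are hypothetical and are not part of SaltStack or of HULK.

```python
from collections import defaultdict


def group_targets_by_master(targets, minion_to_master):
    """Group target minion IDs by the Master that manages them.

    Returns (routed, unknown): routed maps master -> list of minion IDs,
    unknown lists targets that no Master claims.
    """
    by_master = defaultdict(list)
    unknown = []
    for minion_id in targets:
        master = minion_to_master.get(minion_id)
        if master is None:
            unknown.append(minion_id)
        else:
            by_master[master].append(minion_id)
    return dict(by_master), unknown


# Hypothetical inventory: which Master manages which Minion.
MINION_TO_MASTER = {
    "web01.dc1": "master.dc1",
    "web02.dc1": "master.dc1",
    "db01.dc2": "master.dc2",
}

routed, unknown = group_targets_by_master(
    ["web01.dc1", "db01.dc2", "cache01.dc3", "web02.dc1"],
    MINION_TO_MASTER,
)
```

Without Syndic, the caller has to maintain this kind of mapping and send one request per Master. With Syndic, a single command issued on the Master of Masters is relayed downward.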

Advantages

A multi-layer architecture lets all commands be issued from the top-level Master, forming a flexible "Master of Masters" topology.

Because a Syndic only subscribes to the Master of Masters, other services (e.g., the file server) are handled by the lower-level Masters, which greatly reduces pressure on the top-level Master.

Disadvantages

Configuration of file_roots and pillar_roots on Syndic must match the Master of Masters.

The Master of Masters cannot see how many Minions are under lower‑level Masters; network jitter can cause message loss or delay without the Master of Masters being aware, leading to incomplete task results.
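One way to make incomplete results visible is to compare the set of machines a task was meant for against the set that actually returned. The sketch below assumes the caller already knows the expected Minion IDs from its own inventory. The function names and sample data are hypothetical.

```python
def find_missing_returns(expected_minions, returns):
    """Return the sorted minion IDs that were targeted but never returned."""
    return sorted(set(expected_minions) - set(returns))


def summarize_job(expected_minions, returns):
    """Summarize a job: returns maps minion ID -> result for minions that replied."""
    expected = set(expected_minions)
    missing = find_missing_returns(expected, returns)
    return {
        "expected": len(expected),
        "returned": len(expected & set(returns)),
        "missing": missing,
        "complete": not missing,
    }


# Hypothetical job: three machines targeted, only two replied.
summary = summarize_job(
    ["web01", "web02", "db01"],
    {"web01": "ok", "db01": "ok"},
)
```

A task whose summary is not complete can then be retried or flagged rather than silently treated as finished.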

Architecture Diagram

Problem

In unreliable network conditions, Syndic’s message transmission reliability drops. If Syndic does not receive a message, its Minions never see the task. The official recommendation to increase syndic_wait only partially mitigates the issue.

Thought

Since Syndic is essentially a Minion, could other solutions replace it? ZeroMQ and Redis both support Pub/Sub and can be deployed in a master‑slave fashion across data centers. Redis Pub/Sub offers good performance.

Test

ZeroMQ and Redis Pub/Sub were tested in the same data‑center environment.
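The article does not describe how loss was measured. A common approach, sketched here as an assumption, is to number every published message and have the subscriber report which sequence numbers never arrived. The function name and sample numbers are illustrative only and are not test data from the article.

```python
def loss_report(sent_count, received_seqs):
    """Summarize loss for messages numbered 0 .. sent_count - 1.

    Duplicates and out-of-range sequence numbers in received_seqs are ignored.
    """
    seen = {seq for seq in received_seqs if 0 <= seq < sent_count}
    lost = sorted(set(range(sent_count)) - seen)
    return {
        "sent": sent_count,
        "received": len(seen),
        "lost": lost,
        "loss_rate": len(lost) / sent_count if sent_count else 0.0,
    }


# Illustrative run: 10 messages sent, two never arrived, one arrived twice.
report = loss_report(10, [0, 1, 2, 4, 5, 5, 7, 8, 9])
```

Running the same publisher against both transports and comparing the two reports gives a like-for-like view of loss.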

Conclusion

The same-data-center test showed that ZeroMQ delivers messages faster than Redis but also loses some of them. Redis performance was acceptable, so Redis Pub/Sub can be considered as a replacement for ZeroMQ on the link between the Master of Masters and the lower-level Masters.

Main Process

1. Subscribe Process on Master of Masters

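import salt.utils
import zmq
from redis import ConnectionPool, Redis

# Resolve the IP address of the Master
self.opts['master_addr'] = salt.utils.dns_check(self.opts['master'])
context = zmq.Context()
master_pub = 'tcp://{0}:{1}'.format(self.opts['master_addr'], self.opts['master_publish_port'])
# Subscribe to every message on the Master's ZeroMQ publish port
sub_sock = context.socket(zmq.SUB)
sub_sock = set_tcp_keepalive(sub_sock, opts=self.opts)
sub_sock.connect(master_pub)
sub_sock.setsockopt(zmq.SUBSCRIBE, b'')
pool = ConnectionPool(host=self.opts['redis_host'], port=self.opts['redis_port'], db=self.opts['redis_db'], password=self.opts['redis_pass'])
# One client is enough; it borrows connections from the pool as needed
r = Redis(connection_pool=pool)
while True:
    # Receive a multipart message from ZeroMQ
    message = sub_sock.recv_multipart()
    # Publish it to the Redis channel "salttest". The frame list is sent as
    # its repr() string, which the subscribing side parses back into a list.
    r.publish("salttest", repr(message))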

2. Publish Process on Secondary Master (original Syndic)

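import ast
import salt.utils
import zmq
from redis import client

self.opts['master_addr'] = salt.utils.dns_check(self.opts['master'])
context = zmq.Context()
# Bind the local ZeroMQ publish port that this Master's Minions connect to
pub_uri = 'tcp://{interface}:{publish_port}'.format(**self.opts)
pub_sock = context.socket(zmq.PUB)
pub_sock = set_tcp_keepalive(pub_sock, opts=self.opts)
pub_sock.bind(pub_uri)
conn_pool = client.ConnectionPool(host=self.opts['redis_host'], port=self.opts['redis_port'], db=self.opts['redis_db'], password=self.opts['redis_pass'])
# Subscribe to the Redis channel fed by the Master of Masters
sub = client.PubSub(conn_pool)
sub.subscribe('salttest')
for msg in sub.listen():
    if msg['type'] == 'message':
        # The payload is the string form of the frame list. literal_eval only
        # accepts Python literals, so it is safer than eval on network data.
        data = ast.literal_eval(msg['data'].decode('ascii'))
        # Publish the message down to the Minions via ZeroMQ
        pub_sock.send_multipart(data)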
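The two processes above pass each multipart message through Redis as the string form of a Python list. As an alternative that does not depend on parsing Python syntax on the receiving side, the frames could be carried as JSON with base64-encoded parts. This is a sketch of that idea and not what the original implementation does. `encode_frames` and `decode_frames` are hypothetical helper names, and the sample frames are made up.

```python
import base64
import json


def encode_frames(frames):
    """Encode a list of bytes frames as a JSON string safe to publish."""
    return json.dumps([base64.b64encode(frame).decode("ascii") for frame in frames])


def decode_frames(payload):
    """Inverse of encode_frames; accepts the payload as str or bytes."""
    if isinstance(payload, bytes):
        payload = payload.decode("ascii")
    return [base64.b64decode(item) for item in json.loads(payload)]


# Made-up frames containing arbitrary binary data.
frames = [b"\x82\xa3enc\xa3aes", b"payload \xff\x00 bytes"]
wire = encode_frames(frames)
restored = decode_frames(wire)
```

In the publishing process, `encode_frames(message)` would replace the string conversion before `r.publish`. In the subscribing process, `decode_frames(msg['data'])` would replace the parsing step before `pub_sock.send_multipart`.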

3. Configuration on Master

syndic_master: node1.example.com
syndic_master_port: 4516
syndic_master_publish_port: 4515

Summary

After implementation, the system is essentially a Salt Syndic that relays messages over Redis Pub/Sub. After it had been running for some time, message loss had dropped significantly.
