Securing Ray Clusters on Alibaba Cloud ACK: Best Practices and Configurations
This guide details comprehensive security best practices for deploying Ray clusters on Alibaba Cloud ACK, covering TLS communication, namespace isolation, resource quotas, RBAC, security contexts, image scanning, resource limits, RRSA integration, multi‑cluster isolation, and recommendations for protecting dashboards and services from unauthorized access.
1. RayCluster Communication Domain Security Settings
1.1. RayCluster Head and Worker Data Communication
If you need TLS-encrypted communication between the RayCluster head and worker pods, refer to the official Ray TLS documentation at https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/tls.html and the RayCluster TLS configuration example below.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-tls
spec:
  rayVersion: '2.9.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        initContainers:
          - name: ray-head-tls
            image: rayproject/ray:2.9.0
            command: ["/bin/sh", "-c", "cp -R /etc/ca/tls /etc/ray && /etc/gen/tls/gencert_head.sh"]
            volumeMounts:
              - mountPath: /etc/ca/tls
                name: ca-tls
                readOnly: true
              - mountPath: /etc/ray/tls
                name: ray-tls
              - mountPath: /etc/gen/tls
                name: gen-tls-script
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"]
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
              - mountPath: /etc/ca/tls
                name: ca-tls
                readOnly: true
              - mountPath: /etc/ray/tls
                name: ray-tls
            env:
              - name: RAY_USE_TLS
                value: "1"
              - name: RAY_TLS_SERVER_CERT
                value: "/etc/ray/tls/tls.crt"
              - name: RAY_TLS_SERVER_KEY
                value: "/etc/ray/tls/tls.key"
              - name: RAY_TLS_CA_CERT
                value: "/etc/ca/tls/ca.crt"
        volumes:
          # Volume names must match the volumeMounts above. The Secret and ConfigMap
          # names below are placeholders for your own CA material and cert-generation scripts.
          - name: ray-logs
            emptyDir: {}
          - name: ca-tls
            secret:
              secretName: ca-tls
          - name: ray-tls
            emptyDir: {}
          - name: gen-tls-script
            configMap:
              name: tls
              # The gencert scripts must be executable by the init container.
              defaultMode: 0777
  workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 10
      groupName: small-group
      template:
        spec:
          initContainers:
            - name: ray-worker-tls
              image: rayproject/ray:2.9.0
              command: ["/bin/sh", "-c", "cp -R /etc/ca/tls /etc/ray && /etc/gen/tls/gencert_worker.sh"]
              volumeMounts:
                - mountPath: /etc/ca/tls
                  name: ca-tls
                  readOnly: true
                - mountPath: /etc/ray/tls
                  name: ray-tls
                - mountPath: /etc/gen/tls
                  name: gen-tls-script
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
                - mountPath: /etc/ca/tls
                  name: ca-tls
                  readOnly: true
                - mountPath: /etc/ray/tls
                  name: ray-tls
              env:
                - name: RAY_USE_TLS
                  value: "1"
                - name: RAY_TLS_SERVER_CERT
                  value: "/etc/ray/tls/tls.crt"
                - name: RAY_TLS_SERVER_KEY
                  value: "/etc/ray/tls/tls.key"
                - name: RAY_TLS_CA_CERT
                  value: "/etc/ca/tls/ca.crt"
          volumes:
            # Same placeholder Secret / ConfigMap names as in the head group above.
            - name: ray-logs
              emptyDir: {}
            - name: ca-tls
              secret:
                secretName: ca-tls
            - name: ray-tls
              emptyDir: {}
            - name: gen-tls-script
              configMap:
                name: tls
                defaultMode: 0777
1.2. RayClient / Ray Dashboard Internal Access
Ray executes any code submitted to it; therefore, when using RayClient, developers must ensure that only trusted code is run and that credentials are protected.
1.3. RayClient Public Access
Exposing the Ray Client server (default port 10001) or the GCS service (default port 6379) to the public internet is discouraged: Ray has no built-in authentication, so any external user could connect, submit arbitrary jobs, exhaust resources, or crash the cluster.
1.4. RayCluster Internal Communication Restrictions
Use Kubernetes NetworkPolicy to restrict traffic to Ray components. Example policies are provided below.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-head-ingress
spec:
  podSelector:
    matchLabels:
      app: ray-cluster-head
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 6380
    - from:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 8265
    - from:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 10001
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-head-egress
spec:
  podSelector:
    matchLabels:
      app: ray-cluster-head
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - protocol: TCP
          port: 6379
    - to:
        - podSelector:
            matchLabels:
              app: ray-cluster-worker
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-worker-ingress
spec:
  podSelector:
    matchLabels:
      app: ray-cluster-worker
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: ray-cluster-head
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-worker-egress
spec:
  podSelector:
    matchLabels:
      app: ray-cluster-worker
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: ray-cluster-head
1.5. Ray Dashboard Public Access
The Dashboard (default port 8265) exposes read APIs for debugging and observability as well as write APIs for job management (such as job submission). Exposing it directly to the public internet is discouraged; if remote access is required, put an authenticating proxy or an access control list (ACL) in front of it. The main access options are:
kubectl port‑forward (recommended)
Ray HistoryServer (recommended)
Public ACL / authentication
1.5.1. Port-Forward
Forward the Dashboard port locally to access it securely, for example:
kubectl port-forward svc/myfirst-ray-cluster-head-svc --address 0.0.0.0 8265 -n ${RAY_CLUSTER_NS}
1.5.2. ACK Ray HistoryServer
The HistoryServer retains dashboards for terminated clusters, integrates with ARMS monitoring, and eliminates the need for a separate Prometheus/Grafana stack.
1.5.3. ACL / Authentication
If the Dashboard must be reachable from outside the cluster, configure its Service as type LoadBalancer and attach a restrictive SLB ACL. For authentication, front the Dashboard with an NGINX Ingress that enforces basic auth backed by a Secret, for example:
htpasswd -c auth foo
kubectl create secret generic basic-auth --from-file=auth
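With the basic-auth Secret in place, an Ingress along these lines can enforce the credentials in front of the Dashboard Service. This is a sketch, assuming the NGINX Ingress Controller is installed; the host name and ingress class are placeholders, and the backend Service name matches the head Service used earlier in this guide.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-dashboard
  annotations:
    # Enforce basic auth using the "basic-auth" Secret created above.
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
spec:
  ingressClassName: nginx
  rules:
    - host: ray-dashboard.example.com   # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myfirst-ray-cluster-head-svc
                port:
                  number: 8265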
1.6. RayCluster / RayJob Configuration
Prefer ClusterIP Services for both the Ray Dashboard (8265) and GCS (6379) so that neither is exposed publicly.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: myfirst-ray-cluster
spec:
  headGroupSpec:
    serviceType: ClusterIP
---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: myjob
spec:
  submissionMode: "K8sJobMode"
  rayClusterSpec:
    headGroupSpec:
      serviceType: ClusterIP
2. Namespace Isolation
Separate Ray clusters into different Kubernetes namespaces to leverage namespace‑level policies such as ResourceQuota and NetworkPolicy.
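A minimal sketch (the namespace names and labels are placeholders): create one namespace per team or workload, then set metadata.namespace on each RayCluster or RayJob accordingly so that the ResourceQuota, NetworkPolicy, and RBAC objects in the following sections apply per namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: ray-team-a
  labels:
    team: team-a   # placeholder label for namespace-scoped policy tooling
---
apiVersion: v1
kind: Namespace
metadata:
  name: ray-team-b
  labels:
    team: team-b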
3. ResourceQuota / ElasticQuotaTree
ResourceQuota: limit CPU, GPU, memory, and other resources to prevent denial of service through resource exhaustion (see the sketch after this list).
ElasticQuotaTree: use ACK's ElasticQuotaTree for finer-grained quota and queue management (see the ACK documentation).
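A minimal ResourceQuota sketch for a Ray namespace; the namespace name and quota values are placeholders and should be sized to your workloads.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ray-quota
  namespace: ray-team-a
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "8"   # cap GPU requests in this namespace
    limits.cpu: "64"
    limits.memory: 256Gi
    pods: "100"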
4. RBAC
Assign a dedicated ServiceAccount to each RayCluster and grant the minimal required permissions.
If the cluster does not need to access Kubernetes resources, set automountServiceAccountToken: false to disable token mounting.
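A sketch of this setup (the names are placeholders): a dedicated ServiceAccount with token automounting disabled, plus an optional narrowly scoped Role/RoleBinding for workloads that do need Kubernetes API access. The head and worker pod templates reference it via serviceAccountName.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ray-cluster-sa
  namespace: ray-team-a
# Skip API token mounting when the Ray workload does not call the Kubernetes API.
automountServiceAccountToken: false
---
# Only needed when the workload does call the Kubernetes API: grant minimal verbs/resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ray-cluster-readonly
  namespace: ray-team-a
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ray-cluster-readonly
  namespace: ray-team-a
subjects:
  - kind: ServiceAccount
    name: ray-cluster-sa
    namespace: ray-team-a
roleRef:
  kind: Role
  name: ray-cluster-readonly
  apiGroup: rbac.authorization.k8s.io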
5. Security Context
Disable privileged mode and running as root.
Restrict hostPath usage to read‑only mounts with specific prefixes.
Disallow privilege escalation (see the securityContext sketch after this list).
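A sketch of a pod/container securityContext implementing the first and third rules, as an excerpt of a head or worker pod template; the user ID and dropped capabilities are illustrative and should be validated against your Ray image.
# Excerpt of a head/worker pod template
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000          # illustrative non-root UID
  containers:
    - name: ray-head
      image: rayproject/ray:2.9.0
      securityContext:
        privileged: false
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
The hostPath restriction is typically enforced with an admission policy (for example, Pod Security Admission or an OPA/Gatekeeper constraint) rather than in the pod spec itself.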
6. Head / Worker Pod Secure Images
Scan Ray container images with Alibaba Cloud image security scanning before production deployment.
7. Request / Limit
Define resource requests and limits for every pod so that uncontrolled consumption cannot destabilize the node's kubelet or trigger eviction of other pods.
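For example (illustrative values), each Ray container declares both requests and limits:
# Excerpt of a worker container spec
containers:
  - name: ray-worker
    image: rayproject/ray:2.9.0
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 8Gi
Keeping requests equal to limits gives the pod the Guaranteed QoS class, which makes it the last candidate for eviction under node memory pressure.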
8. RRSA
When Ray jobs need to access Alibaba Cloud resources (for example, OSS), use RRSA (RAM Roles for Service Accounts) instead of embedding AccessKey (AK/SK) credentials in environment variables or images.
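A rough sketch of the RRSA wiring, assuming RRSA has been enabled for the ACK cluster and a RAM role with the appropriate OIDC trust policy already exists; the role and ServiceAccount names are placeholders, and the annotation/label keys should be verified against the current ACK RRSA documentation.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ray-oss-sa
  namespace: ray-team-a
  annotations:
    # Placeholder RAM role granting only the OSS permissions the job needs.
    pod-identity.alibabacloud.com/role-name: ray-oss-access-role
The head and worker pod templates then set serviceAccountName: ray-oss-sa and carry the pod-identity.alibabacloud.com/injection: 'on' label so the RRSA webhook injects an OIDC token; the Alibaba Cloud SDK inside the container exchanges that token for temporary STS credentials, so no long-lived AccessKey appears in the manifest or image.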
9. Multi‑RayCluster Isolation / One‑Job‑One‑Cluster
Submit different jobs to separate RayClusters using RayJob, leveraging namespace isolation and RBAC to prevent cross‑cluster impact.
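A sketch of the one-job-one-cluster pattern (the entrypoint, namespace, and image are placeholders): each RayJob carries its own rayClusterSpec, and shutdownAfterJobFinishes tears the per-job cluster down when the job ends, so jobs never share a cluster.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: job-a
  namespace: ray-team-a
spec:
  entrypoint: python /home/ray/samples/my_job.py   # placeholder entrypoint
  shutdownAfterJobFinishes: true                    # delete the per-job cluster after completion
  rayClusterSpec:
    rayVersion: '2.9.0'
    headGroupSpec:
      serviceType: ClusterIP
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 4
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0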
10. Other Recommendations
Refer to ACK security system documentation and Ray on ACK best‑practice guides for additional hardening steps.