Securing Ray Clusters on Alibaba Cloud ACK: Best Practices and Configurations

This guide details comprehensive security best practices for deploying Ray clusters on Alibaba Cloud ACK, covering TLS communication, namespace isolation, resource quotas, RBAC, security contexts, image scanning, resource limits, RRSA integration, multi‑cluster isolation, and recommendations for protecting dashboards and services from unauthorized access.


1. RayCluster Communication Domain Security Settings

1.1. RayCluster Head and Worker Data Communication

If you need TLS‑encrypted communication between the RayCluster head and worker pods, refer to the official Ray TLS documentation at https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/tls.html and the RayCluster TLS configuration example below.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-tls
spec:
  rayVersion: '2.9.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        initContainers:
        - name: ray-head-tls
          image: rayproject/ray:2.9.0
          command: ["/bin/sh", "-c", "cp -R /etc/ca/tls /etc/ray && /etc/gen/tls/gencert_head.sh"]
          volumeMounts:
          - mountPath: /etc/ca/tls
            name: ca-tls
            readOnly: true
          - mountPath: /etc/ray/tls
            name: ray-tls
          - mountPath: /etc/gen/tls
            name: gen-tls-script
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "ray stop"]
          volumeMounts:
          - mountPath: /tmp/ray
            name: ray-logs
          - mountPath: /etc/ca/tls
            name: ca-tls
            readOnly: true
          - mountPath: /etc/ray/tls
            name: ray-tls
          env:
          - name: RAY_USE_TLS
            value: "1"
          - name: RAY_TLS_SERVER_CERT
            value: "/etc/ray/tls/tls.crt"
          - name: RAY_TLS_SERVER_KEY
            value: "/etc/ray/tls/tls.key"
          - name: RAY_TLS_CA_CERT
            value: "/etc/ca/tls/ca.crt"
        volumes:
        # Volume names follow the referenced KubeRay TLS example: the ca-tls
        # Secret and the tls ConfigMap (with the gencert scripts) must be
        # created beforehand.
        - name: ray-logs
          emptyDir: {}
        - name: ca-tls
          secret:
            secretName: ca-tls
        - name: ray-tls
          emptyDir: {}
        - name: gen-tls-script
          configMap:
            name: tls
            defaultMode: 0777
            items:
            - key: gencert_head.sh
              path: gencert_head.sh
  workerGroupSpecs:
  - replicas: 1
    minReplicas: 1
    maxReplicas: 10
    groupName: small-group
    template:
      spec:
        initContainers:
        - name: ray-worker-tls
          image: rayproject/ray:2.9.0
          command: ["/bin/sh", "-c", "cp -R /etc/ca/tls /etc/ray && /etc/gen/tls/gencert_worker.sh"]
          volumeMounts:
          - mountPath: /etc/ca/tls
            name: ca-tls
            readOnly: true
          - mountPath: /etc/ray/tls
            name: ray-tls
          - mountPath: /etc/gen/tls
            name: gen-tls-script
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "ray stop"]
          volumeMounts:
          - mountPath: /tmp/ray
            name: ray-logs
          - mountPath: /etc/ca/tls
            name: ca-tls
            readOnly: true
          - mountPath: /etc/ray/tls
            name: ray-tls
          env:
          - name: RAY_USE_TLS
            value: "1"
          - name: RAY_TLS_SERVER_CERT
            value: "/etc/ray/tls/tls.crt"
          - name: RAY_TLS_SERVER_KEY
            value: "/etc/ray/tls/tls.key"
          - name: RAY_TLS_CA_CERT
            value: "/etc/ca/tls/ca.crt"
        volumes:
        # Same ca-tls Secret and tls ConfigMap as the head group; the worker
        # runs gencert_worker.sh instead.
        - name: ray-logs
          emptyDir: {}
        - name: ca-tls
          secret:
            secretName: ca-tls
        - name: ray-tls
          emptyDir: {}
        - name: gen-tls-script
          configMap:
            name: tls
            defaultMode: 0777
            items:
            - key: gencert_worker.sh
              path: gencert_worker.sh

1.2. RayClient / Ray Dashboard Internal Access

Ray executes any code submitted to it; therefore, when using RayClient, developers must ensure that only trusted code is run and that credentials are protected.

1.3. RayClient Public Access

Exposing the Ray GCS service (default port 6379) to the public internet is discouraged because Ray lacks built‑in authentication, allowing any external user to submit arbitrary jobs that could crash the cluster.

1.4. RayCluster Internal Communication Restrictions

Use Kubernetes NetworkPolicy to restrict traffic to Ray components. Example policies are provided below.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-head-ingress
spec:
  podSelector:
    matchLabels:
      app: ray-cluster-head
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}
    ports:
    - protocol: TCP
      port: 6379
  - from:
    - podSelector: {}
    ports:
    - protocol: TCP
      port: 8265
  - from:
    - podSelector: {}
    ports:
    - protocol: TCP
      port: 10001
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-head-egress
spec:
  podSelector:
    matchLabels:
      app: ray-cluster-head
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379
  - to:
    - podSelector:
        matchLabels:
          app: ray-cluster-worker
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-worker-ingress
spec:
  podSelector:
    matchLabels:
      app: ray-cluster-worker
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: ray-cluster-head
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-worker-egress
spec:
  podSelector:
    matchLabels:
      app: ray-cluster-worker
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: ray-cluster-head

1.5. Ray Dashboard Public Access

The Dashboard (default port 8265) exposes read APIs for debugging and write APIs for job management, including job submission. Exposing it publicly is discouraged; if external access is required, put an authenticating proxy in front of it or restrict access with ACLs.

kubectl port‑forward (recommended)

Ray HistoryServer (recommended)

Public ACL / authentication

Port‑forward example:

kubectl port-forward svc/myfirst-ray-cluster-head-svc 8265:8265 -n ${RAY_CLUSTER_NS}

1.5.1. Port‑Forward

Forward the Dashboard port locally to access it securely.

1.5.2. ACK Ray HistoryServer

The HistoryServer retains dashboards for terminated clusters, integrates with ARMS monitoring, and eliminates the need for a separate Prometheus/Grafana stack.

1.5.3. ACL / Authentication

Configure the head Service as type LoadBalancer, then attach a restrictive access control list (ACL) to the SLB instance. For authentication, enable basic auth through an NGINX ingress controller backed by a Secret:

htpasswd -c auth foo
kubectl create secret generic basic-auth --from-file=auth
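With the basic-auth Secret in place, a minimal ingress-nginx resource might look like the following sketch; the host name is a placeholder, and the annotations are the standard ingress-nginx basic-auth settings:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-dashboard
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
spec:
  ingressClassName: nginx
  rules:
  - host: ray-dashboard.example.com   # placeholder host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myfirst-ray-cluster-head-svc
            port:
              number: 8265
```

Requests without valid credentials from the basic-auth Secret are rejected by the ingress controller before reaching the Dashboard.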

1.6. RayCluster / RayJob Configuration

Prefer ClusterIP services for both RayDashboard (8265) and GCS (6379) to avoid public exposure.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: myfirst-ray-cluster
spec:
  headGroupSpec:
    serviceType: ClusterIP
---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: myjob
spec:
  submissionMode: "K8sJobMode"
  rayClusterSpec:
    headGroupSpec:
      serviceType: ClusterIP

2. Namespace Isolation

Separate Ray clusters into different Kubernetes namespaces to leverage namespace‑level policies such as ResourceQuota and NetworkPolicy.
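For example, each team or workload can be given its own namespace, and RayClusters are then created inside it; namespace-scoped ResourceQuota and NetworkPolicy objects apply automatically to everything in it (the name and label here are illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ray-team-a   # illustrative: one namespace per team or workload
  labels:
    team: team-a
```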

3. ResourceQuota / ElasticQuotaTree

ResourceQuota: limit CPU, GPU, memory, and other resources per namespace to prevent denial of service through resource exhaustion.

ElasticQuotaTree: use ACK's ElasticQuotaTree for finer‑grained quota and queue management across namespaces (see the ACK documentation).
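A sketch of a per‑namespace ResourceQuota capping CPU, memory, and GPU requests; the namespace name and quota values are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ray-quota
  namespace: ray-team-a   # illustrative namespace
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "8"
    limits.cpu: "64"
    limits.memory: 256Gi
```

Once the quota is in place, pods whose aggregate requests would exceed it are rejected at admission time rather than starving other workloads.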

4. RBAC

Assign a dedicated ServiceAccount to each RayCluster and grant the minimal required permissions.

If the cluster does not need to access Kubernetes resources, set automountServiceAccountToken: false to disable token mounting.
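A minimal sketch of this wiring: a dedicated ServiceAccount with token auto-mounting disabled, referenced from the head group's pod template (the ServiceAccount name is illustrative; a workerGroupSpecs template would reference it the same way):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ray-cluster-sa              # illustrative name
automountServiceAccountToken: false
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: myfirst-ray-cluster
spec:
  headGroupSpec:
    template:
      spec:
        serviceAccountName: ray-cluster-sa
```

If the cluster does need Kubernetes API access, grant it through a namespace-scoped Role and RoleBinding rather than a ClusterRole.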

5. Security Context

Disable privileged mode and running as root.

Restrict hostPath usage to read‑only mounts with specific prefixes.

Disallow privilege escalation.
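These constraints can be expressed directly on each Ray container in the head and worker pod templates, for example:

```yaml
# Container-level securityContext for the ray-head / ray-worker containers
securityContext:
  privileged: false
  allowPrivilegeEscalation: false
  runAsNonRoot: true
  runAsUser: 1000        # the rayproject/ray images run as the non-root "ray" user (uid 1000)
  capabilities:
    drop: ["ALL"]
```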

6. Head / Worker Pod Secure Images

Scan Ray container images with Alibaba Cloud image security scanning before production deployment.

7. Request / Limit

Define resource requests and limits for each pod to avoid uncontrolled consumption that could crash the Kubelet or evict other pods.
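For example, on each Ray container (the values are illustrative; keeping requests equal to limits gives the most predictable behavior, since Ray schedules tasks against these values):

```yaml
containers:
- name: ray-worker
  image: rayproject/ray:2.9.0
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 4Gi
```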

8. RRSA

When Ray jobs need to access Alibaba Cloud resources (e.g., OSS), use the RRSA mechanism instead of embedding AK/SK credentials in environment variables.
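With RRSA, the pod receives temporary RAM role credentials through a projected OIDC token instead of long-lived AK/SK secrets. A sketch of the ServiceAccount wiring; the annotation key and prerequisites here follow the ACK RRSA documentation and should be verified against it, and the names are illustrative:

```yaml
# Assumes the RRSA components are enabled on the cluster and the namespace
# has RRSA injection turned on (see the ACK RRSA documentation).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ray-oss-sa                                        # illustrative name
  annotations:
    pod-identity.alibabacloud.com/role-name: ray-oss-role # RAM role the pod assumes
```

Pods that use this ServiceAccount can then obtain OSS credentials via the RAM role, with no AK/SK in environment variables or images.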

9. Multi‑RayCluster Isolation / One‑Job‑One‑Cluster

Submit different jobs to separate RayClusters using RayJob, leveraging namespace isolation and RBAC to prevent cross‑cluster impact.
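With RayJob, each job can provision its own short-lived cluster, and shutdownAfterJobFinishes tears the cluster down when the job completes; the namespace, job name, and entrypoint below are illustrative:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: job-a
  namespace: ray-team-a                           # illustrative: per-team namespace
spec:
  entrypoint: python /home/ray/samples/job_a.py   # illustrative entrypoint
  shutdownAfterJobFinishes: true                  # delete the cluster when the job ends
  rayClusterSpec:
    headGroupSpec:
      serviceType: ClusterIP
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
```

Because each job gets a fresh, namespaced cluster, a misbehaving job cannot affect other jobs' clusters, and its resources are reclaimed automatically.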

10. Other Recommendations

Refer to ACK security system documentation and Ray on ACK best‑practice guides for additional hardening steps.

Written by Alibaba Cloud Infrastructure