How to Plan, Build, and Optimize a High‑Performance Alibaba Cloud Kubernetes Cluster

This article walks through practical planning, creation, and fine‑tuning of an Alibaba Cloud Kubernetes cluster, covering network design, API server exposure, security groups, master and worker sizing, deployment manifests, service decoupling, and operational best practices.

Alibaba Cloud Native

Sep 3, 2019

Introduction

The author shares a year of experience migrating internal systems to Alibaba Cloud Container Service (Kubernetes), a period marked by rapid user growth and a need for reliable, scalable cluster operations.

Cluster Planning

Network Planning

Network plugin options: Flannel or Alibaba‑custom Terway (Terway is fully compatible with Flannel; choose Flannel for a conservative approach).

Pod network CIDR: a /16 by default. Any non‑overlapping range within 10.0.0.0/8, 172.16.0.0 through 172.31.0.0 (prefix length /12 to /16), or 192.168.0.0/16 can be used.

Service CIDR: a /20 by default. Selectable ranges are 10.0.0.0, 172.16.0.0 through 172.31.0.0, and 192.168.0.0, each with a prefix length of /16 to /24. The CIDR blocks must not overlap, and they cannot be changed after the cluster is created.

API Server Access

For high‑security production clusters, keep the API server private behind an internal SLB and do not expose it publicly. The trade‑off is that Cloud Eff releases cannot be used with a private API server.

For development or pre‑release clusters, expose the API server via a public SLB and immediately apply strict access control.

Note: Most Kubernetes security vulnerabilities involve the API server; keep it patched or private.

Security Group

Define security‑group rules that restrict inbound traffic, and apply them to both master and worker nodes.

Master Node Sizing

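Size the master nodes according to the number of worker nodes in the cluster:

1‑5 nodes: 4 vCPU, 8 GB memory

6‑20 nodes: 4 vCPU, 16 GB memory

21‑100 nodes: 8 vCPU, 32 GB memory

101‑200 nodes: 16 vCPU, 64 GB memory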

Use high‑performance SSDs (50‑100 GB) for etcd storage; OS memory should not exceed 8 GB.

Worker Node Sizing

Prefer Alibaba Cloud “Shenlong” (X‑Dragon) ECS Bare Metal instances; if they are unavailable, select high‑spec ECS instances.

Example configuration used: an ECS instance with 32 vCPU and 64 GB memory, a 100 GB SSD system disk, a 400 GB ultra disk as the data disk, and 64‑bit CentOS 7.4.

Cluster Creation and Configuration

Use the console’s one‑click cluster creation wizard.

Apply the planned master/worker specifications and mount a data disk at /var/lib/docker.

Set appropriate Pod CIDR and Service CIDR.

Decide whether to expose the API server; if exposed, enforce strict SLB access control.

Choose Ingress type (internal or external) via the console.

Prefer IPVS mode for kube-proxy over iptables; with many Services, iptables rule updates contend for the iptables lock and can stall.

Lower the per‑node Pod limit from the default of 128 to 64.

Optionally enlarge NodePort / SLB port ranges if needed.
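
On Container Service the proxy mode and the per‑node Pod limit are chosen in the console wizard. For readers who want to see what those two choices correspond to in upstream Kubernetes, here is a minimal sketch of the equivalent component configuration. It assumes a self‑managed kube-proxy and kubelet; the two documents normally live in separate files and are shown together only for brevity. How Container Service stores these settings internally is not covered by the original article.

```yaml
# kube-proxy component configuration (upstream Kubernetes format).
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"        # use the IPVS proxier instead of iptables
---
# kubelet component configuration (upstream Kubernetes format).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 64         # per-node Pod limit used in this article
```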

Configuration Adjustments

Scale out the cluster by adding existing ECS instances as nodes; make sure each one has a data disk mounted at /var/lib/docker.

Upgrade master specifications as needed.

Re‑configure or remove worker nodes using commands such as:

kubectl drain --ignore-daemonsets {node.name}
kubectl delete node {node.name}

Resize or replace ECS instances for workers.

Create namespaces per application and set resource quotas for high‑consumption workloads.

Grant RBAC permissions to sub‑accounts and configure bastion‑host access for developers.
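
The namespace, quota, and RBAC steps above can be sketched as a single manifest. This is an illustration only: the names demo-app, demo-app-quota, demo-app-developers, and developer-sub-account are hypothetical, the quota numbers are placeholders rather than recommendations, and how a sub‑account maps to a Kubernetes user name depends on how the cluster issues credentials.

```yaml
# Namespace dedicated to one application (hypothetical name).
apiVersion: v1
kind: Namespace
metadata:
  name: demo-app
---
# Cap the total resources that Pods in the namespace may claim.
# The values are illustrative, not taken from the article.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: demo-app-quota
  namespace: demo-app
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 32Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "20"
---
# Grant one developer edit rights in this namespace only, by binding
# the built-in "edit" ClusterRole. The user name is a placeholder.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: demo-app-developers
  namespace: demo-app
subjects:
- kind: User
  name: developer-sub-account   # hypothetical
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
```

Once a quota covers CPU and memory requests and limits, every Pod in the namespace has to declare them or it is rejected at admission, so workloads deployed there should set both.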

Stateless Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '34'
  labels:
    app: {app_name}-aone
  name: {app_name}-aone-1
  namespace: {app_name}
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: {app_name}-aone
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: {app_name}-aone
    spec:
      containers:
      - env:
        - name: TZ
          value: Asia/Shanghai
        image: registry-vpc.cn-north-2-gov-1.aliyuncs.com/{namespace}/{app_name}:20190820190005
        imagePullPolicy: Always
        lifecycle:
          preStop:
            exec:
              command:
              - sudo
              - '-u'
              - admin
              - /home/{user_name}/{app_name}/bin/appctl.sh
              - {app_name}
              - stop
        livenessProbe:
          failureThreshold: 10
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 5900
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 10
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 5900
          timeoutSeconds: 1
        resources:
          limits:
            cpu: '4'
            memory: 8Gi
          requests:
            cpu: '4'
            memory: 8Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /home/{user_name}/logs
          name: volume-1553755418538
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: {app_name}-987
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /var/lib/docker/logs/{app_name}
          type: ''
        name: volume-1553755418538

Service Configuration

To prevent the Cloud Controller Manager from automatically deleting the associated SLB when a Service is modified, decouple Service from SLB by using NodePort and manually bind the SLB backend to cluster worker nodes.

apiVersion: v1
kind: Service
metadata:
  name: {app_name}
  namespace: {namespace}
spec:
  clusterIP: 10.1.50.65
  externalTrafficPolicy: Cluster
  ports:
  - name: {app_name}-80-7001
    nodePort: 32653
    port: 80
    protocol: TCP
    targetPort: 7001
  - name: {app_name}-5908-5908
    nodePort: 30835
    port: 5108
    protocol: TCP
    targetPort: 5108
  selector:
    app: {app_name}-aone
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}

After creating the Service, configure the SLB backend to point to the worker nodes on the specified NodePort (e.g., 32653). This prevents accidental SLB deletion during Service updates and allows controlled traffic shifting.
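
The manifest above looks like an export from a live cluster, so it carries server‑assigned fields such as clusterIP and status. A hand‑written version can be shorter. The sketch below is one possible minimal form, assuming a hypothetical application named demo-app whose Pods carry the label app: demo-app; the port numbers are reused from the example above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-app            # hypothetical name
  namespace: demo-app       # hypothetical namespace
spec:
  type: NodePort            # no SLB is created or managed for this Service
  selector:
    app: demo-app           # must match the Pod template labels
  ports:
  - name: http
    protocol: TCP
    port: 80                # port inside the cluster
    targetPort: 7001        # container port
    nodePort: 32653         # pinned so the SLB backend port stays stable
```

Pinning nodePort matters here: if it is left out, Kubernetes picks a free port from the NodePort range, and a manually configured SLB backend would have to be updated whenever the Service is recreated.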

Conclusion

Alibaba Cloud Container Service offers a simple one‑click deployment experience, but real‑world production requires careful planning of network, security, node sizing, and Service‑SLB decoupling. Integrating the console with other cloud products such as Cloud Eff, EDAS, Cloud Monitor, and Log Service can further streamline operations.
