How to Plan, Build, and Optimize a High‑Performance Alibaba Cloud Kubernetes Cluster
This article walks through practical planning, creation, and fine‑tuning of an Alibaba Cloud Kubernetes cluster, covering network design, API server exposure, security groups, master and worker sizing, deployment manifests, service decoupling, and operational best practices.
Introduction
The author shares lessons from a year of migrating internal systems to Alibaba Cloud Container Service (Kubernetes), a period in which rapid user growth made reliable, scalable cluster operations a necessity.
Cluster Planning
Network Planning
Network plugin options: Flannel or Alibaba‑custom Terway (Terway is fully compatible with Flannel; choose Flannel for a conservative approach).
Pod network CIDR: default prefix length /16. Any non‑overlapping private range can be used: 10.0.0.0/8, 172.16.0.0 through 172.31.0.0 with a /12 to /16 prefix, or 192.168.0.0/16.
Service CIDR: default prefix length /20. Selectable ranges are 10.0.0.0, 172.16.0.0 through 172.31.0.0, and 192.168.0.0, each with a /16 to /24 prefix. CIDR blocks must not overlap and cannot be changed after cluster creation.
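Because the CIDR blocks cannot be changed later, it can help to check a plan for overlaps before creating the cluster. The following is a minimal sketch using Python's standard `ipaddress` module. The function name `find_overlaps` and the addresses in `plan` are hypothetical, and the VPC entry is included on the assumption that the VPC range must also stay clear of the Pod and Service ranges.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs):
    """Return every pair of named CIDR blocks that overlap."""
    networks = {name: ipaddress.ip_network(block) for name, block in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


# Hypothetical plan: these addresses are examples, not values from the article.
plan = {
    "vpc": "192.168.0.0/16",
    "pod": "172.20.0.0/16",
    "service": "172.21.0.0/20",
}

# An empty list means no two blocks in the plan conflict.
print(find_overlaps(plan))
```

An empty result means the plan is free of conflicts; any returned pair names the two blocks that need to be moved apart.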
API Server Access
For high‑security production clusters, keep the API server private behind an internal SLB and avoid public exposure. The trade‑off is that Cloud Eff release pipelines cannot be used against a private API server.
For development or pre‑release clusters, expose the API server via a public SLB and immediately apply strict access control.
Note: Most Kubernetes security vulnerabilities involve the API server; keep it patched or private.
Security Group
Define security‑group rules that restrict inbound traffic to master and worker nodes only.
Master Node Sizing
1‑5 nodes: 4 CPU × 8 GB
6‑20 nodes: 4 CPU × 16 GB
21‑100 nodes: 8 CPU × 32 GB
101‑200 nodes: 16 CPU × 64 GB
Use high‑performance SSDs (50‑100 GB) for etcd storage; OS memory should not exceed 8 GB.
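The sizing tiers above can be expressed as a small lookup, which is convenient when the check is part of a provisioning script. This is only a sketch: the name `master_spec` is hypothetical, and a cluster of exactly 100 nodes is assumed to belong to the 21‑100 tier.

```python
# Tiers copied from the sizing list above: (max worker nodes, vCPU, memory in GB).
MASTER_TIERS = [
    (5, 4, 8),
    (20, 4, 16),
    (100, 8, 32),
    (200, 16, 64),
]


def master_spec(worker_nodes):
    """Return (vCPU, memory_gb) for the smallest tier that fits the cluster."""
    if worker_nodes < 1:
        raise ValueError("worker_nodes must be at least 1")
    for max_nodes, cpu, memory_gb in MASTER_TIERS:
        if worker_nodes <= max_nodes:
            return cpu, memory_gb
    raise ValueError("the sizing list does not cover more than 200 nodes")


print(master_spec(30))  # (8, 32)
```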
Worker Node Sizing
Prefer Alibaba Cloud “Shenlong” (X‑Dragon bare metal) instances; if unavailable, select high‑spec ECS instances.
Example configuration used: 32 CPU × 64 GB ECS, 100 GB SSD system disk, 400 GB ultra (high‑efficiency) cloud disk as the data disk, CentOS 7.4 64‑bit.
Cluster Creation and Configuration
Use the console’s one‑click cluster creation wizard.
Apply the planned master/worker specifications and mount /var/lib/docker to a data disk.
Set appropriate Pod CIDR and Service CIDR.
Decide whether to expose the API server; if exposed, enforce strict SLB access control.
Choose Ingress type (internal or external) via the console.
Prefer IPVS mode for kube-proxy over iptables to avoid lock‑up issues.
Adjust default pod limit per node from 128 to 64.
Optionally enlarge NodePort / SLB port ranges if needed.
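One side effect of the per‑node pod limit is worth estimating alongside the Pod CIDR. The sketch below assumes a Flannel‑style allocation in which every node receives one power‑of‑two slice of the Pod CIDR large enough for its pod limit; that allocation model, the function name `max_nodes`, and the example address are assumptions rather than details from the article.

```python
import ipaddress


def max_nodes(pod_cidr, pods_per_node):
    """Estimate how many nodes a Pod CIDR supports.

    Assumes each node receives one power-of-two slice of the Pod CIDR
    that holds `pods_per_node` addresses (Flannel-style allocation).
    """
    network = ipaddress.ip_network(pod_cidr)
    # Number of host bits needed to hold pods_per_node addresses.
    host_bits = (pods_per_node - 1).bit_length()
    node_prefix = network.max_prefixlen - host_bits
    if node_prefix < network.prefixlen:
        raise ValueError("Pod CIDR is too small for even one node")
    return 2 ** (node_prefix - network.prefixlen)


# Hypothetical /16 Pod CIDR with the two pod limits mentioned above.
print(max_nodes("172.20.0.0/16", 128))  # 512
print(max_nodes("172.20.0.0/16", 64))   # 1024
```

Under this assumption, halving the pod limit from 128 to 64 doubles the number of nodes the same Pod CIDR can accommodate.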
Configuration Adjustments
Scale the cluster by adding existing nodes (ensure data‑disk mount for /var/lib/docker).
Upgrade master specifications as needed.
Re‑configure or remove worker nodes using commands such as:
kubectl drain --ignore-daemonsets {node.name}
kubectl delete node {node.name}
Resize or replace ECS instances for workers.
Create namespaces per application and set resource quotas for high‑consumption workloads.
Grant RBAC permissions across sub‑accounts and configure bastion‑host access for developers.
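As an illustration of the per‑namespace resource quota mentioned above, the sketch below builds a standard Kubernetes `ResourceQuota` object and prints it as JSON, which `kubectl apply -f -` accepts. The helper name `resource_quota`, the namespace `demo-app`, and the CPU and memory figures are hypothetical and should be replaced with values that suit the workload.

```python
import json


def resource_quota(namespace, cpu, memory):
    """Build a ResourceQuota manifest that caps total requests and limits."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
            }
        },
    }


# Hypothetical namespace and limits, chosen only for illustration.
manifest = resource_quota("demo-app", "16", "32Gi")
print(json.dumps(manifest, indent=2))
```

Saved as a script, the output could be piped straight into `kubectl apply -f -` once the namespace exists.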
Stateless Deployment Example
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '34'
  labels:
    app: {app_name}-aone
  name: {app_name}-aone-1
  namespace: {app_name}
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: {app_name}-aone
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: {app_name}-aone
    spec:
      containers:
        - env:
            - name: TZ
              value: Asia/Shanghai
          image: registry-vpc.cn-north-2-gov-1.aliyuncs.com/{namespace}/{app_name}:20190820190005
          imagePullPolicy: Always
          lifecycle:
            preStop:
              exec:
                command:
                  - sudo
                  - '-u'
                  - admin
                  - /home/{user_name}/{app_name}/bin/appctl.sh
                  - {app_name}
                  - stop
          livenessProbe:
            failureThreshold: 10
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: 5900
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 10
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: 5900
            timeoutSeconds: 1
          resources:
            limits:
              cpu: '4'
              memory: 8Gi
            requests:
              cpu: '4'
              memory: 8Gi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /home/{user_name}/logs
              name: volume-1553755418538
      dnsPolicy: ClusterFirst
      imagePullSecrets:
        - name: {app_name}-987
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - hostPath:
            path: /var/lib/docker/logs/{app_name}
            type: ''
          name: volume-1553755418538
Service Configuration
To prevent the Cloud Controller Manager from automatically deleting the associated SLB when a Service is modified, decouple Service from SLB by using NodePort and manually bind the SLB backend to cluster worker nodes.
apiVersion: v1
kind: Service
metadata:
  name: {app_name}
  namespace: {namespaces}
spec:
  clusterIP: 10.1.50.65
  externalTrafficPolicy: Cluster
  ports:
    - name: {app_name}-80-7001
      nodePort: 32653
      port: 80
      protocol: TCP
      targetPort: 7001
    - name: {app_name}-5908-5908
      nodePort: 30835
      port: 5108
      protocol: TCP
      targetPort: 5108
  selector:
    app: {app_name}
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}
After creating the Service, configure the SLB backend to point to the worker nodes on the specified NodePort (e.g., 32653). This prevents accidental SLB deletion during Service updates and allows controlled traffic shifting.
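The manual backend binding described above amounts to pairing every worker node with each NodePort. The following sketch lists those pairs and rejects a NodePort outside the Kubernetes default range of 30000‑32767. The function name `slb_backends` and the worker IP addresses are hypothetical, and the sketch only prints the targets; it does not call any SLB API.

```python
DEFAULT_NODEPORT_RANGE = range(30000, 32768)  # Kubernetes default: 30000-32767


def slb_backends(service_ports, worker_ips, allowed=DEFAULT_NODEPORT_RANGE):
    """Map each Service port to the (worker IP, NodePort) pairs an SLB listener should target."""
    backends = {}
    for port in service_ports:
        node_port = port["nodePort"]
        if node_port not in allowed:
            raise ValueError(f"nodePort {node_port} is outside the allowed range")
        backends[port["port"]] = [(ip, node_port) for ip in worker_ips]
    return backends


# Ports taken from the Service manifest above; the worker IPs are hypothetical.
ports = [{"port": 80, "nodePort": 32653}, {"port": 5108, "nodePort": 30835}]
workers = ["192.168.1.10", "192.168.1.11"]

for listener_port, targets in slb_backends(ports, workers).items():
    print(listener_port, targets)
```

If the NodePort range was enlarged at cluster creation, pass the wider range through the `allowed` parameter.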
Conclusion
Alibaba Cloud Container Service offers a simple one‑click deployment experience, but real‑world production requires careful planning of network, security, node sizing, and Service‑SLB decoupling. Integrating the console with other cloud products such as Cloud Eff, EDAS, Cloud Monitor, and Log Service can further streamline operations.
Alibaba Cloud Native
