How I Transitioned from Traditional Ops to SRE/DevOps in 18 Months
This detailed guide shares a step‑by‑step 18‑month roadmap, covering self‑assessment, skill acquisition (Python, Kubernetes, monitoring), project execution, interview preparation, and real‑world outcomes for engineers moving from legacy operations to SRE/DevOps roles.
Overview
The author transformed from a traditional operations engineer to an SRE in 18 months, doubling salary and moving to a top internet company. The journey is divided into four phases with concrete learning goals, projects, and interview preparation.
SRE vs. DevOps
Dimension | DevOps | SRE
----------------|-------------------|-----------------
Origin | Agile movement | Google engineering practice
Core focus | Delivery speed | System reliability
Programming requirement | Moderate (mostly scripting) | High (development skills required)
System design | Moderate | High
Math/algorithms | Low | Medium-high (capacity planning, performance analysis)
On-call | At some companies | Mandatory
Typical stack | CI/CD tools, IaC | Programming languages, monitoring, distributed systems
Career ceiling | Senior DevOps, architect | SRE expert, technical director, CTO

Phase 0 – Decision & Planning (Month 0‑1)
Self‑assessment of Linux, Shell, Python, Go, Docker, Kubernetes, monitoring.
Chose the SRE path for its higher technical ceiling and salary.
Created an 18‑month roadmap with weekly time budget (~20 h/week).
Phase 1 – Skill Building (Month 1‑6)
Month 1‑2: Python
Learned core syntax, OOP, exception handling, and libraries (requests, paramiko, pandas). Built a batch‑management tool:
```python
import paramiko
from concurrent.futures import ThreadPoolExecutor

class ServerManager:
    def __init__(self, servers):
        self.servers = servers

    def exec_command(self, server, command):
        ip, user, password = server
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            ssh.connect(ip, username=user, password=password, timeout=10)
            stdin, stdout, stderr = ssh.exec_command(command)
            # Read each stream exactly once; a second read() returns empty bytes,
            # which would make the success check always pass
            output = stdout.read().decode()
            error = stderr.read().decode()
            return {"ip": ip, "output": output, "error": error, "success": len(error) == 0}
        except Exception as e:
            return {"ip": ip, "output": "", "error": str(e), "success": False}
        finally:
            ssh.close()

    def batch_exec(self, command, max_workers=10):
        # Run the command on all servers concurrently
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [executor.submit(self.exec_command, s, command) for s in self.servers]
            return [f.result() for f in futures]

servers = [("192.168.1.10", "root", "pwd"), ("192.168.1.11", "root", "pwd")]
manager = ServerManager(servers)
results = manager.batch_exec("df -h | grep -w /")
for r in results:
    if r["success"]:
        print(f"✅ {r['ip']}:\n{r['output']}")
    else:
        print(f"❌ {r['ip']}: {r['error']}")
```

Month 3‑4: Kubernetes
Studied core concepts (Pod, Service, Deployment, ConfigMap, Secret, PV/PVC, Namespace). Practiced with Minikube and Alibaba Cloud K8s. Deployed Nginx, configured Ingress, and implemented rolling updates. Example Deployment YAML:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: registry.example.com/web-app:v1.2.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: web-app
  namespace: production
spec:
  selector:
    app: web-app
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
```
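The write-up mentions configuring Ingress but no manifest is shown; a minimal sketch of an NGINX Ingress fronting the web-app Service above (the hostname and ingress class are assumptions, not from the original):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app
  namespace: production
spec:
  ingressClassName: nginx          # assumes an NGINX ingress controller is installed
  rules:
  - host: web.example.com          # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-app          # routes to the ClusterIP Service defined above
            port:
              number: 80
```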
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 10
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Month 5‑6: Observability
Deep dive into Prometheus, Alertmanager, custom exporters, and Grafana dashboards. Built a Python exporter:
```python
from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

app = Flask(__name__)

http_requests_total = Counter('http_requests_total', 'Total HTTP requests',
                              ['method', 'endpoint', 'status'])
http_request_duration_seconds = Histogram('http_request_duration_seconds',
                                          'HTTP request duration', ['method', 'endpoint'])
active_connections = Gauge('active_connections', 'Number of active connections')

# Example usage inside a Flask route
@app.route('/api/users')
def get_users():
    start = time.time()
    users = fetch_users_from_db()  # application-specific query
    http_requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
    http_request_duration_seconds.labels(method='GET', endpoint='/api/users').observe(time.time() - start)
    return jsonify(users)

start_http_server(9090)  # expose /metrics on port 9090
```

Implemented an ELK/EFK pipeline (Filebeat → Kafka → Logstash → Elasticsearch → Kibana) and Jaeger tracing via OpenTelemetry.
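The Filebeat → Kafka leg of that pipeline can be sketched as follows; this is a minimal illustrative config, with broker addresses and the topic name assumed rather than taken from the original:

```yaml
# filebeat.yml — ship container logs to Kafka (hosts and topic are assumptions)
filebeat.inputs:
- type: container
  paths:
    - /var/log/containers/*.log
  processors:
    - add_kubernetes_metadata:       # enrich each event with pod/namespace labels
        host: ${NODE_NAME}

output.kafka:
  hosts: ["kafka-0.kafka:9092", "kafka-1.kafka:9092"]
  topic: "app-logs"
  compression: gzip
```

Logstash then consumes from the same topic, parses, and writes to Elasticsearch for Kibana to query.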
Phase 2 – Project Practice (Month 7‑12)
SLO/SLI Framework (Month 7‑8)
```
# Example SLI calculation (Prometheus query)
sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m]))
```
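The availability SLI feeds directly into the error-budget arithmetic for the order service's 99.95% SLO; a quick check of that arithmetic in Python:

```python
# Error budget implied by a 99.95% availability SLO over a 30-day window
SLO = 0.9995
WINDOW_DAYS = 30

# Budget = allowed unavailability fraction times the window length in minutes
budget_minutes = (1 - SLO) * WINDOW_DAYS * 24 * 60
print(f"Error budget: {budget_minutes:.1f} minutes")  # → Error budget: 21.6 minutes
```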
```
# SLO definition for order service
availability >= 99.95%
latency P95 < 300ms
error_budget = (1 - 0.9995) * 30 days ≈ 21.6 minutes
```

CI/CD Pipeline (Month 9‑10)
Built a GitLab CI → Docker Registry → ArgoCD → Kubernetes workflow with canary releases and automatic rollback.
```yaml
# .gitlab-ci.yml
stages:
  - build
  - test
  - deploy-staging
  - deploy-production

variables:
  IMAGE_TAG: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

build:
  stage: build
  script:
    - docker build -t $IMAGE_TAG .
    - docker push $IMAGE_TAG
  only:
    - master
    - develop

test:
  stage: test
  script:
    - go test -v ./...
    - go test -cover ./... -coverprofile=coverage.out
    - go tool cover -func=coverage.out
  coverage: '/total:.*\s(\d+\.\d+)%/'

deploy_staging:
  stage: deploy-staging
  script:
    - kubectl set image deployment/app app=$IMAGE_TAG -n staging
    - kubectl rollout status deployment/app -n staging --timeout=5m
    - ./scripts/health_check.sh staging
  environment:
    name: staging
  only:
    - develop

deploy_production:
  stage: deploy-production
  script:
    - kubectl set image deployment/app-canary app=$IMAGE_TAG -n production
    - kubectl rollout status deployment/app-canary -n production
    - ./scripts/canary_analysis.sh
    - kubectl set image deployment/app app=$IMAGE_TAG -n production
    - kubectl rollout status deployment/app -n production
  environment:
    name: production
  when: manual
  only:
    - master
```
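The deploy_production job gates the full rollout on `./scripts/canary_analysis.sh`, whose contents aren't shown. A minimal sketch of the comparison logic such a script might implement, querying Prometheus for canary vs. stable error rates (the Prometheus address, label names, and thresholds are all assumptions):

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.monitoring:9090"  # assumed in-cluster address

def error_rate(query: str) -> float:
    """Run an instant query against the Prometheus HTTP API, return a scalar (0 if empty)."""
    url = f"{PROM_URL}/api/v1/query?query={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    result = data["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def canary_ok(canary_rate: float, stable_rate: float,
              abs_limit: float = 0.01, rel_limit: float = 2.0) -> bool:
    """Pass only if the canary's error rate stays under an absolute ceiling
    and is not dramatically worse than the stable deployment's."""
    if canary_rate > abs_limit:
        return False
    if stable_rate > 0 and canary_rate > stable_rate * rel_limit:
        return False
    return True
```

The pipeline would exit non-zero when `canary_ok` returns False, which aborts the job before the stable Deployment is updated.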
```yaml
# ArgoCD Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/k8s-configs.git
    targetRevision: HEAD
    path: apps/myapp
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Internal Developer Platform (IDP) (Month 11‑12)
Developed a FastAPI + Vue3 platform for application deployment, CMDB, monitoring, and log access. Key endpoints:
```python
# FastAPI endpoint for listing apps
@app.get("/api/apps")
async def list_apps(db: Session = Depends(get_db)):
    apps = db.query(Application).all()
    return [app.to_dict() for app in apps]

# Deploy request handling with Celery
@app.post("/api/deploy")
async def deploy_app(request: DeployRequest, db: Session = Depends(get_db)):
    if not check_permission(request.user, request.app_name):
        raise HTTPException(403, "Permission denied")
    task = DeployTask(app_name=request.app_name, version=request.version,
                      env=request.env, user=request.user)
    db.add(task)
    db.commit()
    deploy_task.delay(task.id)  # hand off to the Celery worker asynchronously
    return {"task_id": task.id, "status": "pending"}

# Celery worker
@celery_app.task
def deploy_task(task_id):
    task = db.query(DeployTask).get(task_id)
    try:
        update_deployment_image(task.app_name, task.version, task.env)
        wait_for_rollout(task.app_name, task.env)
        if not health_check(task.app_name, task.env):
            rollback_deployment(task.app_name, task.env)
            task.status = "failed"
            task.error = "Health check failed; rolled back automatically"
        else:
            task.status = "success"
    except Exception as e:
        task.status = "failed"
        task.error = str(e)
        send_alert(f"Deployment failed: {task.app_name} {task.version}")
    finally:
        task.completed_at = datetime.now()
        db.commit()
```

Result: deployment time reduced by 80 %, on‑call tickets down 60 %.
Phase 3 – Interview & Job Search (Month 13‑15)
Resume rewritten with data‑driven impact statements (e.g., “Reduced MTTR from 15 min to 30 s, 96 % improvement”).
LeetCode practice: 50 Easy, 30 Medium, 10 Hard problems.
System design preparation using the SNAKE framework (Scenario, Necessary, Application, Kilobit, Evolve).
Interview experiences at ByteDance, Alibaba, Meituan, and Tencent, resulting in offers of 35‑45 K RMB monthly, with total annual packages of 350,000‑600,000 RMB (35‑60 W).
Phase 4 – Onboarding (Month 16‑18)
Joined ByteDance as an SRE, handling on‑call, architecture design, and scaling. Reported outcomes: salary doubled, higher job satisfaction, clear career trajectory.
Practical Migration Paths
Ops → DevOps (9‑12 months, focus on CI/CD, containers, IaC). Expected salary 20‑35 K.
Ops → SRE (12‑18 months, deep programming, Kubernetes, observability, system design). Expected salary 25‑45 K.
Developer → SRE (6‑9 months, leverage existing coding skills, fill gaps in Linux, K8s, monitoring). Expected salary 30‑50 K.
Key Success Factors
Clear motivation and measurable goals.
Project‑driven learning (build tools, automate, deploy).
Continuous public output (technical blogs, open‑source repos).
Apply SRE principles in current job before switching.
Find mentors or community support.
Allocate ~20 h/week (≈1560 h total) and set milestone rewards.
Industry Outlook
Demand for SRE/DevOps continues to rise with cloud‑native adoption. Salaries for SREs at major Chinese internet firms range from 350,000 to 600,000 RMB (35‑60 W) in annual total compensation. Technical requirements increasingly include Python/Go programming, Kubernetes, observability stacks, and high‑availability design. Career paths can progress to architect, technical director, and CTO.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.