Operations · 35 min read

How I Transitioned from Traditional Ops to SRE/DevOps in 18 Months

This detailed guide shares a step‑by‑step 18‑month roadmap, covering self‑assessment, skill acquisition (Python, Kubernetes, monitoring), project execution, interview preparation, and real‑world outcomes for engineers moving from legacy operations to SRE/DevOps roles.

Raymond Ops

Overview

The author transitioned from traditional operations engineer to SRE in 18 months, doubling his salary and moving to a top internet company. The journey is divided into four phases with concrete learning goals, projects, and interview preparation.

SRE vs. DevOps

Dimension            | DevOps                     | SRE
---------------------|----------------------------|---------------------------------
Origin               | Agile movement             | Google engineering practice
Core focus           | Delivery speed             | System reliability
Programming required | Moderate (mostly scripting)| High (development skills needed)
System design skills | Moderate                   | High
Math/algorithms      | Low                        | Medium-high (capacity planning, performance analysis)
On-call requirement  | At some companies          | Mandatory
Typical tech stack   | CI/CD tools, IaC           | Programming languages, monitoring, distributed systems
Career ceiling       | Senior DevOps, architect   | SRE expert, technical director, CTO

Phase 0 – Decision & Planning (Month 0‑1)

Self‑assessment of Linux, Shell, Python, Go, Docker, Kubernetes, monitoring.

Chosen SRE path for higher technical ceiling and salary.

Created an 18‑month roadmap with weekly time budget (~20 h/week).

Phase 1 – Skill Building (Month 1‑6)

Month 1‑2: Python

Learned core syntax, OOP, exception handling, and libraries (requests, paramiko, pandas). Built a batch‑management tool:

import paramiko
from concurrent.futures import ThreadPoolExecutor

class ServerManager:
    """Run a shell command on many servers in parallel over SSH."""

    def __init__(self, servers):
        self.servers = servers

    def exec_command(self, server, command):
        ip, user, password = server
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            ssh.connect(ip, username=user, password=password, timeout=10)
            stdin, stdout, stderr = ssh.exec_command(command)
            output = stdout.read().decode()
            # Read stderr only once: the stream is exhausted after the first read,
            # so a second read() would always return empty bytes.
            error = stderr.read().decode()
            return {"ip": ip, "output": output, "error": error, "success": not error}
        except Exception as e:
            return {"ip": ip, "output": "", "error": str(e), "success": False}
        finally:
            ssh.close()

    def batch_exec(self, command, max_workers=10):
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [executor.submit(self.exec_command, s, command) for s in self.servers]
            return [f.result() for f in futures]

servers = [("192.168.1.10", "root", "pwd"), ("192.168.1.11", "root", "pwd")]
manager = ServerManager(servers)
results = manager.batch_exec("df -h | grep -w /")
for r in results:
    if r["success"]:
        print(f"✅ {r['ip']}:\n{r['output']}")
    else:
        print(f"❌ {r['ip']}: {r['error']}")

Month 3‑4: Kubernetes

Studied core concepts (Pod, Service, Deployment, ConfigMap, Secret, PV/PVC, Namespace). Practiced with Minikube and Alibaba Cloud K8s. Deployed Nginx, configured Ingress, and implemented rolling updates. Example Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: registry.example.com/web-app:v1.2.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: web-app
  namespace: production
spec:
  selector:
    app: web-app
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 10
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Month 5‑6: Observability

Deep dive into Prometheus, Alertmanager, custom exporters, and Grafana dashboards. Built a Python exporter:

from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

app = Flask(__name__)

# Metric definitions
http_requests_total = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
http_request_duration_seconds = Histogram('http_request_duration_seconds', 'HTTP request duration', ['method', 'endpoint'])
active_connections = Gauge('active_connections', 'Number of active connections')

def fetch_users_from_db():
    """Placeholder for the real database query."""
    return [{"id": 1, "name": "alice"}]

# Example usage inside a Flask route
@app.route('/api/users')
def get_users():
    start = time.time()
    users = fetch_users_from_db()
    http_requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
    http_request_duration_seconds.labels(method='GET', endpoint='/api/users').observe(time.time() - start)
    return jsonify(users)

start_http_server(9090)  # expose /metrics on a separate port for Prometheus to scrape

Implemented ELK/EFK pipeline (Filebeat → Kafka → Logstash → Elasticsearch → Kibana) and Jaeger tracing via OpenTelemetry.
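A Filebeat → Kafka → Logstash → Elasticsearch pipeline is far easier to operate when applications emit one JSON object per log line instead of free-form text. A minimal stdlib-only formatter might look like the sketch below; the field names are illustrative, not the schema the author used:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line, ready for Filebeat/Logstash."""
    def format(self, record):
        return json.dumps({
            "@timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("web-app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("order created")  # emits a one-line JSON document
```

Structured lines like these let Logstash skip fragile grok parsing and map fields straight into Elasticsearch.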

Phase 2 – Project Practice (Month 7‑12)

SLO/SLI Framework (Month 7‑8)

# Example SLI calculation (Prometheus query)
sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m]))

# SLO definition for order service
availability >= 99.95%
latency P95 < 300ms
error_budget = (1 - 0.9995) * 30 days ≈ 21.6 minutes
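The error-budget arithmetic above is easy to sanity-check in Python. This is a throwaway sketch, not part of the author's tooling, and `error_budget_minutes` is a hypothetical helper name:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    return (1 - slo) * window_days * 24 * 60

# 99.95% availability over 30 days leaves about 21.6 minutes of error budget
print(round(error_budget_minutes(0.9995), 1))  # 21.6
```

Dropping one "nine" is dramatic: at 99.9% the same window allows roughly 43.2 minutes, double the budget.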

CI/CD Pipeline (Month 9‑10)

Built a GitLab CI → Docker Registry → ArgoCD → Kubernetes workflow with canary releases and automatic rollback.

# .gitlab-ci.yml
stages:
  - build
  - test
  - deploy-staging
  - deploy-production

variables:
  IMAGE_TAG: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

build:
  stage: build
  script:
    - docker build -t $IMAGE_TAG .
    - docker push $IMAGE_TAG
  only:
    - master
    - develop

test:
  stage: test
  script:
    - go test -v ./...
    - go test -cover ./... -coverprofile=coverage.out
    - go tool cover -func=coverage.out
  coverage: '/total:.*\s(\d+\.\d+)%/'

deploy_staging:
  stage: deploy-staging
  script:
    - kubectl set image deployment/app app=$IMAGE_TAG -n staging
    - kubectl rollout status deployment/app -n staging --timeout=5m
    - ./scripts/health_check.sh staging
  environment:
    name: staging
  only:
    - develop

deploy_production:
  stage: deploy-production
  script:
    - kubectl set image deployment/app-canary app=$IMAGE_TAG -n production
    - kubectl rollout status deployment/app-canary -n production
    - ./scripts/canary_analysis.sh
    - kubectl set image deployment/app app=$IMAGE_TAG -n production
    - kubectl rollout status deployment/app -n production
  environment:
    name: production
  when: manual
  only:
    - master

# ArgoCD Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/k8s-configs.git
    targetRevision: HEAD
    path: apps/myapp
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Internal Developer Platform (IDP) (Month 11‑12)

Developed a FastAPI + Vue3 platform for application deployment, CMDB, monitoring, and log access. Key endpoints:

# FastAPI endpoint for listing apps
@app.get("/api/apps")
async def list_apps(db: Session = Depends(get_db)):
    apps = db.query(Application).all()
    return [app.to_dict() for app in apps]

# Deploy request handling with Celery
@app.post("/api/deploy")
async def deploy_app(request: DeployRequest, db: Session = Depends(get_db)):
    if not check_permission(request.user, request.app_name):
        raise HTTPException(403, "Permission denied")
    task = DeployTask(app_name=request.app_name, version=request.version, env=request.env, user=request.user)
    db.add(task)
    db.commit()
    deploy_task.delay(task.id)
    return {"task_id": task.id, "status": "pending"}

# Celery worker
@celery_app.task
def deploy_task(task_id):
    task = db.query(DeployTask).get(task_id)
    try:
        update_deployment_image(task.app_name, task.version, task.env)
        wait_for_rollout(task.app_name, task.env)
        if not health_check(task.app_name, task.env):
            rollback_deployment(task.app_name, task.env)
            task.status = "failed"
            task.error = "Health check failed; rolled back automatically"
        else:
            task.status = "success"
    except Exception as e:
        task.status = "failed"
        task.error = str(e)
        send_alert(f"Deployment failed: {task.app_name} {task.version}")
    finally:
        task.completed_at = datetime.now()
        db.commit()

Result: deployment time reduced by 80%, on-call tickets down 60%.

Phase 3 – Interview & Job Search (Month 13‑15)

Resume rewritten with data-driven impact statements (e.g., "Reduced MTTR from 15 min to 30 s, a 96% improvement").

LeetCode practice: 50 Easy, 30 Medium, 10 Hard problems.

System design preparation using the SNAKE framework (Scenario, Necessary, Application, Kilobit, Evolve).

Interview experiences at ByteDance, Alibaba, Meituan, and Tencent, resulting in offers with monthly salaries of 35-45K RMB and total annual packages of 350-600K RMB.

Phase 4 – Onboarding (Month 16‑18)

Joined ByteDance as an SRE, handling on‑call, architecture design, and scaling. Reported outcomes: salary doubled, higher job satisfaction, clear career trajectory.

Practical Migration Paths

Ops → DevOps (9-12 months; focus on CI/CD, containers, IaC). Expected salary: 20-35K RMB/month.

Ops → SRE (12-18 months; deep programming, Kubernetes, observability, system design). Expected salary: 25-45K RMB/month.

Developer → SRE (6-9 months; leverage existing coding skills, fill gaps in Linux, K8s, monitoring). Expected salary: 30-50K RMB/month.

Key Success Factors

Clear motivation and measurable goals.

Project‑driven learning (build tools, automate, deploy).

Continuous public output (technical blogs, open‑source repos).

Apply SRE principles in current job before switching.

Find mentors or community support.

Allocate ~20 h/week (≈1560 h total) and set milestone rewards.

Industry Outlook

Demand for SRE/DevOps continues to rise with cloud-native adoption. Annual total compensation for SREs at major Chinese internet firms ranges from 350K to 600K RMB. Technical requirements increasingly include Python/Go programming, Kubernetes, observability stacks, and high-availability design. Career paths can progress to architect, technical director, and CTO.

Monitoring · Python · CI/CD · Kubernetes · SRE · Career transition
Written by Raymond Ops

Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
