How I Doubled My Salary by Switching from Traditional Ops to SRE in 18 Months

Over 18 months, the author details a step‑by‑step transformation from a fire‑fighting traditional operations role to a high‑paying SRE/DevOps career, covering motivations, skill gaps, learning plans, project implementations, interview preparation, and real‑world outcomes, offering a practical roadmap for engineers seeking similar growth.

MaGe Linux Operations

Oct 4, 2025

Introduction

"Transition to SRE, salary rose from 20K to 40K in 18 months" – this was a status I posted on my social feed at the beginning of 2023 after completing a successful move from traditional operations to SRE and receiving my desired offer. Four years of firefighting operations with no clear career path felt exhausting; the 18‑month journey involved doubts, anxiety, breakthroughs, and growth. Below I share the complete transformation method: why switch, how to prepare, how to implement, and how to interview.

Technical Background: What Are SRE and DevOps?

SRE vs DevOps Core Differences

Many think SRE and DevOps are the same, but they have distinct focuses.

DevOps (Development + Operations):

Position: A culture and practice emphasizing collaboration between development and operations.

Core Goal: Accelerate software delivery and shorten time‑to‑production.

Key Skills: CI/CD, automation, agile development, continuous improvement.

Typical Work: Build and maintain CI/CD pipelines, automate deployment and configuration, monitor and log, integrate toolchains. Most internet companies have DevOps roles.

SRE (Site Reliability Engineering):

Position: An engineering practice that originated at Google and applies software‑engineering methods to operations problems.

Core Goal: Ensure system reliability while balancing stability and rapid iteration.

Key Skills: Programming, system design, reliability engineering, capacity planning.

Typical Work: Design high‑availability architecture, build SLO/SLI systems, capacity planning, on‑call incident response, develop ops platforms and tools. Adopted by Google, Facebook, Alibaba, ByteDance, etc.

Core Difference Comparison:

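Dimension                  | DevOps                     | SRE
---------------------------|----------------------------|-----------------
Origin                     | Agile movement             | Google engineering practice
Core focus                 | Delivery speed             | System reliability
Programming requirement    | Medium (mostly scripting)  | High (development skills required)
System design ability      | Medium                     | High
Math/algorithm requirement | Low                        | Medium to high (capacity planning, performance analysis)
On‑call requirement        | At some companies          | Mandatory
Typical tech stack         | CI/CD tools, IaC           | Programming languages, monitoring, distributed systems
Career ceiling             | Senior DevOps, architect   | SRE expert, technical director, CTO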

My Understanding:

DevOps is like “advanced ops”, emphasizing automation and collaboration.

SRE is like “software engineer + ops expert”, emphasizing solving problems with code.

DevOps has a lower entry barrier; SRE requires higher skills but offers a higher ceiling.

Why Switch? The Pain of Traditional Ops

Typical Day Before Transition (Traditional Ops):

08:30 - Start work, check monitoring and alerts
09:00 - Handle overnight alerts (high DB connections)
10:00 - Product requirement review (evaluate release risk)
11:00 - Manual deployment (30 min, stressful)
12:00 - Lunch
13:00 - Write weekly report
14:00 - Receive dev request: provision 3 servers
14:30 - Manually create servers in cloud console
15:00 - Configure servers (SSH, install software)
16:00 - Incident post‑mortem meeting (why monitoring didn’t pre‑alert?)
17:00 - Review next day’s release plan, prepare scripts
18:00 - End of day
19:00 - Dinner, receive disk‑full alert
19:30 - Log in to clean logs
20:00 - Finally off work

Weekend:
- On‑call, ready at any time
- Immediate incident handling

Feelings:
- 😫 Fire‑fighting, no sense of achievement
- 😫 Repetitive tasks
- 😫 Little technical accumulation
- 😫 Slow salary growth
- 😫 No career outlook

After Transition (SRE):

08:30 - Start work, view automated monitoring dashboard
09:00 - Code Review of team PRs (ops platform features)
10:00 - Develop auto‑scaling feature (write Go code)
12:00 - Lunch
13:00 - Technical design review: capacity planning
14:00 - Continue development, write unit tests
15:30 - Submit PR, trigger CI/CD pipeline
16:00 - Architecture discussion: design high‑availability for new service
17:00 - Write technical doc: "Capacity Planning Best Practices"
18:00 - End of day

On‑call (one week per month):
- Carry laptop, respond quickly (few incidents because system is stable)
- Resolve incidents within 5 min, write post‑mortem next day

Feelings:
- ✅ Writing code, creating value
- ✅ Continuous technical growth
- ✅ Strong sense of achievement (system stability improved)
- ✅ Salary doubled
- ✅ Clear career path

Core Reasons for Transition:

Career Development: Traditional ops ceiling is low; SRE/DevOps offers higher growth.

Salary: SRE salaries are 50‑100 % higher on average.

Technical Growth: SRE requires programming, system design, reliability engineering, capacity planning.

Work Experience: Move from repetitive tasks to value‑creating work.

Market Demand: Traditional ops roles shrink, SRE/DevOps demand surges.

SRE/DevOps Market Demand and Salary (2024 Data)

Job Demand (Lagou):

Traditional Ops: 4,500 positions (‑25 % YoY)
SRE: 3,800 positions (+45 % YoY)
DevOps: 8,200 positions (+38 % YoY)

Salary Comparison (Beijing, 3‑5 yr experience):

Traditional Ops: 15‑25K
DevOps: 20‑35K (30‑40 % higher)
SRE: 25‑45K (60‑80 % higher)

Top‑tier internet company SRE total packages: Alibaba P6 350‑500 k, ByteDance 3‑1 400‑600 k, Tencent T3‑1 350‑500 k, Meituan L6 350‑500 k.

Skill Requirements Comparison:

Traditional Ops: Linux, Shell scripts, monitoring tools, incident handling.

SRE/DevOps: Python/Go, Kubernetes & cloud‑native, CI/CD pipelines, system design & architecture, monitoring & observability, high‑availability design, capacity planning.

Conclusion: Transitioning to SRE/DevOps is not just a skill upgrade; it is a career leap.

Core Content: My 18‑Month Transformation Process

Phase 0 – Decision & Planning (Month 0, 1 month)

Self‑Assessment Before Transition

Spent a full month evaluating strengths and gaps.

Technical Ability:
✅ Linux (4 yr, proficient)
✅ Shell scripts (can write, not expert)
✅ Python (basic syntax only)
❌ Go (none)
✅ Docker (basic, no production use)
❌ Kubernetes (heard of, not used)
✅ Monitoring (Zabbix proficient, Prometheus unfamiliar)
❌ Programming depth (big weakness)

Soft Skills:
✅ Strong learning ability, willing to invest time
✅ Strong problem‑solving, good at fault isolation
✅ English: can read docs
✅ Communication: medium

Personal:
- Age 27, time to transition
- Single, no dependents
- Can spare 3‑4 h daily for learning
- Savings allow half‑year of lower income

Assessment:
✅ Need to transition (career bottleneck)
✅ Possible (young, time available)
⚠️ Biggest challenge: weak programming
⚠️ Secondary: no cloud‑native experience

Choosing SRE vs DevOps

Decided on SRE because:

DevOps Advantages:
- Lower barrier, many opportunities
- Focus on engineering practice and tools

DevOps Disadvantages:
- Lower technical depth
- Limited long‑term growth ceiling
- Skill gap with developers

SRE Advantages:
- High technical depth, large growth space
- Higher salary ceiling
- Closer to software engineering, can code
- Clear path: SRE → Architect → Technical Director

SRE Disadvantages:
- High barrier, requires programming
- Fewer positions (mainly big firms)
- On‑call pressure

My Choice: SRE
Reasons:
1. Want higher technical ceiling
2. Will invest time learning programming
3. Goal: join a top‑tier internet company

Creating the Transition Plan

Designed an 18‑month roadmap:

Phase 1 (Month 1‑6): Skill‑building
- Month 1‑2: Python fundamentals
- Month 3‑4: Kubernetes & cloud‑native
- Month 5‑6: Monitoring & observability

Phase 2 (Month 7‑12): Project practice
- Month 7‑8: Build SLO/SLI system
- Month 9‑10: Automation & CI/CD pipelines
- Month 11‑12: Develop internal developer platform (IDP)

Phase 3 (Month 13‑15): Interview prep & job search
- Month 13: Resume polishing, project review and write‑up
- Month 14: Algorithm & system design practice
- Month 15: Apply, interview

Phase 4 (Month 16‑18): Onboarding & adaptation
- Month 16: Offer acceptance, handover
- Month 17‑18: Ramp up in new role

Budget:
- Time: 20 h/week (≈1,560 h total)
- Money: Courses ¥5,000, books ¥1,000
- Opportunity cost: possible short‑term salary dip

Phase 1 – Skill‑Building (Month 1‑6)

Month 1‑2: Python Breakthrough

Why Python first?

SRE must code; Python is most used in ops.

Rich ecosystem for automation.

Easier to pick up than Go.

Learning Path:

Week 1‑2: Python basics – data types, control flow, functions, OOP, exceptions, modules.
Practice: 20 Easy LeetCode problems.

Week 3‑4: Advanced Python – decorators, generators, context managers, multithreading/multiprocessing/asyncio, requests, paramiko, pandas.

Project 1: Batch server management tool (SSH, parallel command execution).

Code Example:

import paramiko
from concurrent.futures import ThreadPoolExecutor

class ServerManager:
    def __init__(self, servers):
        self.servers = servers

    def exec_command(self, server, command):
        """Execute command on a single server"""
        ip, user, password = server
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            ssh.connect(ip, username=user, password=password, timeout=10)
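            stdin, stdout, stderr = ssh.exec_command(command)
            output = stdout.read().decode()
            error = stderr.read().decode()
            # Judge success by the remote exit status: many tools write
            # warnings to stderr even when the command succeeds.
            exit_code = stdout.channel.recv_exit_status()
            return {"ip": ip, "output": output, "error": error, "success": exit_code == 0}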
        except Exception as e:
            return {"ip": ip, "output": "", "error": str(e), "success": False}
        finally:
            ssh.close()

    def batch_exec(self, command, max_workers=10):
        """Execute command on all servers in parallel"""
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [executor.submit(self.exec_command, s, command) for s in self.servers]
            return [f.result() for f in futures]

servers = [("192.168.1.10", "root", "pwd"), ("192.168.1.11", "root", "pwd")]
manager = ServerManager(servers)
results = manager.batch_exec("df -h | grep -w /")
for r in results:
    if r["success"]:
        print(f"✅ {r['ip']}:\n{r['output']}")
    else:
        print(f"❌ {r['ip']}: {r['error']}")

Week 5‑6: Web development basics (FastAPI) – needed for building ops platforms.

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn

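app = FastAPI(title="Ops Platform API")

# deploy_application() and get_app_status() are the platform's own helper
# functions; they are not shown in this excerpt.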

class DeployRequest(BaseModel):
    app_name: str
    version: str
    servers: list[str]

@app.post("/api/deploy")
async def deploy(request: DeployRequest):
    """Deploy application"""
    try:
        result = deploy_application(request.app_name, request.version, request.servers)
        return {"status": "success", "message": "Deploy succeeded", "result": result}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/api/status/{app_name}")
async def get_status(app_name: str):
    """Get app status"""
    status = get_app_status(app_name)
    return {"app": app_name, "status": status}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Month 1‑2 Summary:

✅ Mastered core Python syntax and common libraries.

✅ Developed two useful automation tools.

✅ Built programming mindset.

✅ Wrote five technical blogs.

⚠️ Code quality still needs improvement.

Month 3‑4: Kubernetes & Cloud‑Native

Learning Path:

Week 1‑2: K8s basics – Minikube locally, Alibaba Cloud K8s trial.
Core concepts: Pod, Service, Deployment, ConfigMap, Secret, PV, PVC, Namespace, Labels.

Practice:
1. Deploy Nginx
2. Configure Service & Ingress
3. Manage config via ConfigMap
4. Perform rolling updates & rollbacks

Week 3‑4: Advanced K8s – architecture, scheduler, CNI, CSI, RBAC.
Monitoring: Metrics Server, Prometheus + Grafana.
Logging: EFK stack.
Auto‑scaling: HPA, VPA, Cluster Autoscaler.
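
The core of HPA is a simple ratio: the desired replica count is the current count scaled by how far the observed metric sits from its target, rounded up. The sketch below is a simplified illustration of that rule, not the controller's full logic (it ignores pod readiness, missing metrics, and stabilization windows). The function name and the min/max defaults are assumptions made for this example; the 10 % tolerance mirrors the controller's documented default.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replica count an HPA-style controller would ask for (simplified)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Close enough to the target: keep the current size to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical defaults here).
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 pods at 90 % CPU against a 60 % target -> ceil(4 * 1.5) = 6
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

Working a few cases by hand like this made it much easier to predict how a Deployment would react before touching real HPA manifests.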

Project: Containerize a company application.
- Analyze architecture & dependencies.
- Write optimized Dockerfile.
- Create K8s YAML manifests.
- Deploy to test env, set up monitoring & alerts.
- Write ops documentation.

Month 3‑4 Summary:

✅ Mastered core K8s concepts and operations.

✅ Completed application containerization in test env.

✅ Understood cloud‑native philosophy.

✅ Published eight K8s articles.

⏭️ Next step: production rollout.

Month 5‑6: Monitoring & Observability

Learning Path:

Week 1‑2: Deep dive into Prometheus – architecture, data model, PromQL, exporter development, alert rules, Alertmanager.
Hands‑on: Build full monitoring stack (Node Exporter, kube‑state‑metrics, custom app exporter, MySQL/Redis/Nginx exporters).

Week 3‑4: ELK/EFK logging – Filebeat → Kafka → Logstash → Elasticsearch → Kibana.
Configure Filebeat for app logs, set up pipelines.

Week 5‑6: Distributed tracing with Jaeger – OpenTelemetry integration, request chain analysis, performance bottleneck identification.

Week 7‑8: Build a unified observability platform integrating metrics (Prometheus), logs (ELK), and traces (Jaeger) with shared dashboards and smart alerting.

Month 5‑6 Summary:

✅ Established complete monitoring & alerting system.

✅ Implemented centralized log management and analysis.

✅ Integrated distributed tracing.

✅ Reduced mean‑time‑to‑detect from 15 min to 30 s.

✅ Wrote ten observability articles.

Phase 1 (Month 1‑6) Overall Summary

✅ Programming: Python from zero to tool development.

✅ Cloud‑Native: K8s from novice to production‑ready.

✅ Monitoring: Full observability stack built.

✅ Output: 23 technical blogs.

✅ Projects: 2 tools, 1 platform.

⏭️ Next phase: apply SRE principles in the company.

Phase 2 – Project Practice (Month 7‑12)

Month 7‑8: Build SLO/SLI System

SRE Core Practice:

What is SLO/SLI?
- SLI (Service Level Indicator): metric measuring service performance.
- SLO (Service Level Objective): target for SLI, e.g., 99.95% availability.
- SLA (Service Level Agreement): contractual guarantee.

Example:
SLI: API request success rate
SLO: 99.95% (≈21.6 min downtime per month)
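
The downtime figure follows directly from the SLO: the allowed downtime is the window length multiplied by the fraction of time the service may be unavailable. A small sketch of that arithmetic, assuming a 30‑day month (the function name is just for illustration):

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over a window."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for slo in (99.95, 99.9, 99.5):
        print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min per 30 days")
    # 99.95% -> 21.6 min per 30 days
    # 99.9% -> 43.2 min per 30 days
    # 99.5% -> 216.0 min per 30 days
```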

Steps:
1. Identify key user journeys (e.g., browse, search, add‑to‑cart, checkout, view order).
2. Define SLIs (availability, latency).
3. Set SLO targets (P0 99.95%, P1 99.9%, P2 99.5%).
4. Implement monitoring (Prometheus queries).
5. Use error‑budget mechanism to balance releases vs reliability.
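
Step 5 can be made concrete with request counts: the error budget is the number of failures the SLO tolerates in a window, and releases pause once that budget is spent. The sketch below is one possible way to express it; the class name, function name, and the rule "freeze when 100 % of the budget is consumed" are assumptions for illustration, not a prescribed policy.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float   # failures the SLO tolerates in this window
    consumed_fraction: float  # 1.0 means the budget is fully spent
    can_release: bool


def error_budget(total_requests: int, failed_requests: int,
                 slo: float = 0.9995, freeze_at: float = 1.0) -> BudgetStatus:
    """Compare observed failures with what a request-based SLO allows."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed = total_requests * (1.0 - slo)
    consumed = failed_requests / allowed
    return BudgetStatus(allowed, consumed, consumed < freeze_at)


if __name__ == "__main__":
    # 10 million requests at a 99.95 % SLO tolerate about 5,000 failures.
    status = error_budget(total_requests=10_000_000, failed_requests=2_000)
    print(f"budget used: {status.consumed_fraction:.0%}, release allowed: {status.can_release}")
```

In practice the two counts would come from the Prometheus queries in step 4, evaluated over the SLO window.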

Result: Balanced fast iteration with reliability, data‑driven decisions, reduced noise alerts.

Month 9‑10: Automation & CI/CD

Full CI/CD Pipeline:

Goal: Automate from code commit to production release.

Architecture:
GitLab → GitLab CI → Docker Registry → ArgoCD → Kubernetes

Pipeline stages:
1. Build – Docker image, push to registry.
2. Test – unit tests, coverage.
3. Deploy‑staging – update image, health check.
4. Deploy‑production – canary release, automated verification.
5. Rollback on failure.
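
The "automated verification" in stage 4 is essentially a comparison between the canary and the stable baseline. A minimal sketch of such a decision, assuming error counts are already available from monitoring; the function name and all thresholds are hypothetical and would need tuning per service:

```python
def canary_verdict(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   max_ratio: float = 1.5, max_error_rate: float = 0.01,
                   min_requests: int = 500) -> str:
    """Return 'wait', 'rollback', or 'promote' for a canary release."""
    # Too little traffic to judge: keep observing.
    if canary_total < min_requests:
        return "wait"
    canary_rate = canary_errors / canary_total
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    # Absolute guard: the canary must stay under a hard error-rate ceiling.
    if canary_rate > max_error_rate:
        return "rollback"
    # Relative guard: the canary must not be clearly worse than the baseline.
    if baseline_rate > 0 and canary_rate > baseline_rate * max_ratio:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    print(canary_verdict(100, 100_000, 1, 1_000))   # same error rate as baseline
    print(canary_verdict(100, 100_000, 30, 1_000))  # 3 % errors on the canary
```

Keeping both an absolute and a relative check avoids promoting a canary that is "no worse than baseline" while the baseline itself is unhealthy.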

GitLab CI .gitlab-ci.yml snippet:

stages:
  - build
  - test
  - deploy-staging
  - deploy-production

variables:
  IMAGE_TAG: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

build:
  stage: build
  script:
    - docker build -t $IMAGE_TAG .
    - docker push $IMAGE_TAG
  only:
    - master
    - develop

test:
  stage: test
  script:
    - go test -v ./...
    - go test -cover ./... -coverprofile=coverage.out
    - go tool cover -func=coverage.out
  coverage: '/total:.*\s(\d+\.\d+)%/'

deploy_staging:
  stage: deploy-staging
  script:
    - kubectl set image deployment/app app=$IMAGE_TAG -n staging
    - kubectl rollout status deployment/app -n staging --timeout=5m
    - ./scripts/health_check.sh staging
  environment:
    name: staging
  only:
    - develop

deploy_production:
  stage: deploy-production
  script:
    # Canary release: update the canary deployment, analyze it, then promote
    - kubectl set image deployment/app-canary app=$IMAGE_TAG -n production
    - kubectl rollout status deployment/app-canary -n production
    - ./scripts/canary_analysis.sh
    - kubectl set image deployment/app app=$IMAGE_TAG -n production
    - kubectl rollout status deployment/app -n production
  environment:
    name: production
  when: manual
  only:
    - master
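
The pipeline calls a health check script after the staging rollout, but its contents are not shown. As a rough idea of what such a check can do, here is a sketch in Python that polls an endpoint until it has answered successfully several times in a row. The function names, the defaults, and the idea of requiring consecutive successes are assumptions for illustration, not the actual script.

```python
import http.client
import time
import urllib.request


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True when the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts, and non-2xx responses count as unhealthy.
        return False


def wait_until_healthy(probe, attempts: int = 10, delay: float = 5.0,
                       required_successes: int = 3, sleep=time.sleep) -> bool:
    """Poll until the probe succeeds several times in a row, or give up."""
    streak = 0
    for i in range(attempts):
        if probe():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # a single failure resets the streak
        if i < attempts - 1:
            sleep(delay)
    return False
```

A CI job could call wait_until_healthy(lambda: http_probe(url)) with the service's health URL and exit non-zero when it returns False, which fails the stage and blocks promotion.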

Result: Deployment time reduced from 30 min to 5 min, frequency increased from weekly to 5‑10 times per day, failure rate dropped from 5 % to 0.3 %.

Month 11‑12: Internal Developer Platform (IDP)

Background: Company had 50+ applications with fragmented deployment, monitoring, and log access.

Solution: Build a unified ops platform.

Application Management (CMDB)

One‑click deployment, canary, rollback

Monitoring dashboard (health, QPS, latency, error rate)

Log query (real‑time, historical search)

Ticket system (resource, permission, change requests)

Tech Stack: Frontend Vue3 + Element Plus, Backend FastAPI + Celery, PostgreSQL + Redis, Docker + Kubernetes.

Key Backend API Example:

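from fastapi import FastAPI, Depends, HTTPException
from sqlalchemy.orm import Session
import kubernetes

# Not shown in this excerpt: get_db, Application, DeployTask, DeployRequest
# (here also carrying user and env), check_permission, and the Celery task
# deploy_task. The Kubernetes client configuration must also be loaded once at
# startup, e.g. kubernetes.config.load_incluster_config() when running in a pod.

app = FastAPI()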

@app.get("/api/apps")
async def list_apps(db: Session = Depends(get_db)):
    """List applications"""
    apps = db.query(Application).all()
    return [a.to_dict() for a in apps]  # "a" avoids reusing the name of the FastAPI app

@app.post("/api/deploy")
async def deploy_app(request: DeployRequest, db: Session = Depends(get_db)):
    """Deploy application"""
    if not check_permission(request.user, request.app_name):
        raise HTTPException(403, "No permission")
    task = DeployTask(app_name=request.app_name, version=request.version, env=request.env, user=request.user)
    db.add(task)
    db.commit()
    deploy_task.delay(task.id)
    return {"task_id": task.id, "status": "pending"}

@app.get("/api/logs/{app_name}")
async def get_logs(app_name: str, namespace: str = "production", lines: int = 100):
    """Fetch application logs"""
    k8s_client = kubernetes.client.CoreV1Api()
    pods = k8s_client.list_namespaced_pod(namespace=namespace, label_selector=f"app={app_name}")
    logs = []
    for pod in pods.items:
        pod_logs = k8s_client.read_namespaced_pod_log(name=pod.metadata.name, namespace=namespace, tail_lines=lines)
        logs.append({"pod": pod.metadata.name, "logs": pod_logs})
    return logs

Result: Platform adopted by >100 engineers, deployment time cut by 80 %, ops tickets reduced by 60 %.

Phase 2 (Month 7‑12) Overall Summary

✅ Established SLO/SLI system.

✅ Implemented full CI/CD pipeline.

✅ Developed internal platform used company‑wide.

✅ Team stability improved (incident time ↓70 %).

✅ Release efficiency ↑5×.

✅ Authored 15 technical articles.

✅ Presented at company tech conference.

⏭️ Gained solid SRE project experience for interviews.

Phase 3 – Interview Preparation & Job Search (Month 13‑15)

Month 13: Resume & Project Polishing

Resume Principles:

Use data: e.g., "Built Prometheus monitoring covering 200+ servers, reduced mean time to detect (MTTD) from 15 min to 30 s."

Highlight business value: e.g., "CI/CD platform increased release frequency from weekly to daily, supporting rapid product iteration."

Show technical depth: e.g., "Optimized K8s scheduler, improved cluster utilization from 60 % to 85 %."

Structure: Personal info, strengths, work experience (with quantified achievements), technical skills, education, open‑source contributions.

Month 14: Technical Prep

Focus areas:

Algorithms & data structures (LeetCode): 50 Easy, 30 Medium, 10 Hard (optional).

System design (SNAKE framework): Scenario, Necessary, Application, Kilobit, Evolve.

Linux & networking fundamentals.

Deep dive into projects – be ready to discuss architecture, challenges, solutions, and impact.

Month 15: Applications & Interviews

Target companies: top‑tier internet firms (Alibaba, ByteDance, Tencent, Meituan), unicorns, and foreign tech giants.

Interview record (selected):

ByteDance – Offer: 3‑1 level, 40K × 16.

Alibaba – Offer: P6, 35K × 15.5.

Meituan – Offer: L6, 38K × 16.

Tencent – Rejected on hard algorithm questions.

Key takeaways: Strong project experience and clear articulation win; algorithm skills still need improvement.

Phase 4 – Onboarding & Adaptation (Month 16‑18)

Month 16: Hand over current role, join ByteDance, familiarize environment.

Month 17: Rapid learning – new tech stack (Go‑centric), large‑scale systems, intensive training.

Month 18: Take ownership of a service, participate in on‑call, contribute to architecture design.

Outcomes: Salary doubled (20K → 40K), clear career path, significant technical growth.

Key Takeaways

Define clear motivation and goals.

Identify and fill skill gaps (programming, cloud‑native, monitoring).

Drive learning with real projects.

Apply SRE principles in current role.

Continuously output (blogs, open‑source) to build personal brand.

Find mentors or community support.

Set milestones and reward yourself.

Transitioning from traditional ops to SRE is challenging but highly rewarding, offering higher salary, broader technical scope, and clearer career advancement.



Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.

