How I Doubled My Salary by Switching from Traditional Ops to SRE in 18 Months
Over 18 months, the author details a step‑by‑step transformation from a fire‑fighting traditional operations role to a high‑paying SRE/DevOps career, covering motivations, skill gaps, learning plans, project implementations, interview preparation, and real‑world outcomes, offering a practical roadmap for engineers seeking similar growth.
Introduction
"Transition to SRE, salary rose from 20K to 40K in 18 months" – this was a status I posted on my social feed at the beginning of 2023 after completing a successful move from traditional operations to SRE and receiving my desired offer. Four years of firefighting operations with no clear career path felt exhausting; the 18‑month journey involved doubts, anxiety, breakthroughs, and growth. Below I share the complete transformation method: why switch, how to prepare, how to implement, and how to interview.
Technical Background: What Are SRE and DevOps?
SRE vs DevOps Core Differences
Many think SRE and DevOps are the same, but they have distinct focuses.
DevOps (Development + Operations):
Position: A culture and practice emphasizing collaboration between development and operations.
Core Goal: Accelerate software delivery and shorten time‑to‑production.
Key Skills: CI/CD, automation, agile development, continuous improvement.
Typical Work: Build and maintain CI/CD pipelines, automate deployment and configuration, monitor and log, integrate toolchains. Most internet companies have DevOps roles.
SRE (Site Reliability Engineering):
Position: An engineering practice originating at Google that applies software‑engineering methods to operations problems.
Core Goal: Ensure system reliability while balancing stability and rapid iteration.
Key Skills: Programming, system design, reliability engineering, capacity planning.
Typical Work: Design high‑availability architecture, build SLO/SLI systems, capacity planning, on‑call incident response, develop ops platforms and tools. Adopted by Google, Facebook, Alibaba, ByteDance, etc.
Core Difference Comparison:
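Dimension | DevOps | SRE
----------------|---------------------|-----------------
Origin | Agile movement | Google engineering practice
Core focus | Delivery speed | System reliability
Programming requirement | Medium (mostly scripting) | High (development skills required)
System design ability | Medium | High
Math/algorithm requirement | Low | Medium to high (capacity planning, performance analysis)
On‑call requirement | At some companies | Mandatory
Typical tech stack | CI/CD tools, IaC | Programming languages, monitoring, distributed systems
Career ceiling | Senior DevOps, architect | SRE expert, technical director, CTO
My Understanding: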
DevOps is like “advanced ops”, emphasizing automation and collaboration.
SRE is like “software engineer + ops expert”, emphasizing solving problems with code.
DevOps has a lower entry barrier; SRE requires higher skills but offers a higher ceiling.
Why Switch? The Pain of Traditional Ops
Typical Day Before Transition (Traditional Ops):
08:30 - Start work, check monitoring and alerts
09:00 - Handle overnight alerts (high DB connections)
10:00 - Product requirement review (evaluate release risk)
11:00 - Manual deployment (30 min, stressful)
12:00 - Lunch
13:00 - Write weekly report
14:00 - Receive dev request: provision 3 servers
14:30 - Manually create servers in cloud console
15:00 - Configure servers (SSH, install software)
16:00 - Incident post‑mortem meeting (why monitoring didn’t pre‑alert?)
17:00 - Review next day’s release plan, prepare scripts
18:00 - End of day
19:00 - Dinner, receive disk‑full alert
19:30 - Log in to clean logs
20:00 - Finally off work
Weekend:
- On‑call, ready at any time
- Immediate incident handling
Feelings:
- 😫 Fire‑fighting, no sense of achievement
- 😫 Repetitive tasks
- 😫 Little technical accumulation
- 😫 Slow salary growth
- 😫 No career outlook
After Transition (SRE):
08:30 - Start work, view automated monitoring dashboard
09:00 - Code Review of team PRs (ops platform features)
10:00 - Develop auto‑scaling feature (write Go code)
12:00 - Lunch
13:00 - Technical design review: capacity planning
14:00 - Continue development, write unit tests
15:30 - Submit PR, trigger CI/CD pipeline
16:00 - Architecture discussion: design high‑availability for new service
17:00 - Write technical doc: "Capacity Planning Best Practices"
18:00 - End of day
On‑call (one week per month):
- Carry laptop, respond quickly (few incidents because system is stable)
- Resolve incidents within 5 min, write post‑mortem next day
Feelings:
- ✅ Writing code, creating value
- ✅ Continuous technical growth
- ✅ Strong sense of achievement (system stability improved)
- ✅ Salary doubled
- ✅ Clear career path
Core Reasons for Transition:
Career Development: Traditional ops ceiling is low; SRE/DevOps offers higher growth.
Salary: SRE salaries are 50‑100 % higher on average.
Technical Growth: SRE requires programming, system design, reliability engineering, capacity planning.
Work Experience: Move from repetitive tasks to value‑creating work.
Market Demand: Traditional ops roles shrink, SRE/DevOps demand surges.
SRE/DevOps Market Demand and Salary (2024 Data)
Job Demand (Lagou):
Traditional Ops: 4,500 positions (‑25 % YoY)
SRE: 3,800 positions (+45 % YoY)
DevOps: 8,200 positions (+38 % YoY)
Salary Comparison (Beijing, 3‑5 yr experience, monthly):
Traditional Ops: 15‑25K
DevOps: 20‑35K (30‑40 % higher)
SRE: 25‑45K (60‑80 % higher)
Top‑tier internet company SRE total annual packages: Alibaba P6 350‑500 k, ByteDance 3‑1 400‑600 k, Tencent T3‑1 350‑500 k, Meituan L6 350‑500 k.
Skill Requirements Comparison:
Traditional Ops: Linux, Shell scripts, monitoring tools, incident handling.
SRE/DevOps: Python/Go, Kubernetes & cloud‑native, CI/CD pipelines, system design & architecture, monitoring & observability, high‑availability design, capacity planning.
Conclusion: Transitioning to SRE/DevOps is not just a skill upgrade; it is a career leap.
Core Content: My 18‑Month Transformation Process
Phase 0 – Decision & Planning (Month 0, 1 month)
Self‑Assessment Before Transition
Spent a full month evaluating strengths and gaps.
Technical Ability:
✅ Linux (4 yr, proficient)
✅ Shell scripts (can write, not expert)
✅ Python (basic syntax only)
❌ Go (none)
✅ Docker (basic, no production use)
❌ Kubernetes (heard of, not used)
✅ Monitoring (Zabbix proficient, Prometheus unfamiliar)
❌ Programming depth (big weakness)
Soft Skills:
✅ Strong learning ability, willing to invest time
✅ Strong problem‑solving, good at fault isolation
✅ English: can read docs
✅ Communication: medium
Personal:
- Age 27, time to transition
- Single, no dependents
- Can spare 3‑4 h daily for learning
- Savings allow half‑year of lower income
Assessment:
✅ Need to transition (career bottleneck)
✅ Possible (young, time available)
⚠️ Biggest challenge: weak programming
⚠️ Secondary: no cloud‑native experience
Choosing SRE vs DevOps
Decided on SRE because:
DevOps Advantages:
- Lower barrier, many opportunities
- Focus on engineering practice and tools
DevOps Disadvantages:
- Lower technical depth
- Limited long‑term growth ceiling
- Skill gap with developers
SRE Advantages:
- High technical depth, large growth space
- Higher salary ceiling
- Closer to software engineering, can code
- Clear path: SRE → Architect → Technical Director
SRE Disadvantages:
- High barrier, requires programming
- Fewer positions (mainly big firms)
- On‑call pressure
My Choice: SRE
Reasons:
1. Want higher technical ceiling
2. Will invest time learning programming
3. Goal: join a top‑tier internet company
Creating the Transition Plan
Designed an 18‑month roadmap:
Phase 1 (Month 1‑6): Skill‑building
- Month 1‑2: Python fundamentals
- Month 3‑4: Kubernetes & cloud‑native
- Month 5‑6: Monitoring & observability
Phase 2 (Month 7‑12): Project practice
- Month 7‑8: Build SLO/SLI system
- Month 9‑10: Automation & CI/CD pipelines
- Month 11‑12: Develop internal developer platform (IDP)
Phase 3 (Month 13‑15): Interview prep & job search
- Month 13: Resume polishing, project review
- Month 14: Algorithm & system design practice
- Month 15: Apply, interview
Phase 4 (Month 16‑18): Onboarding & adaptation
- Month 16: Offer acceptance, handover
- Month 17‑18: Ramp up in new role
Budget:
- Time: 20 h/week (≈1,560 h total)
- Money: Courses ¥5,000, books ¥1,000
- Opportunity cost: possible short‑term salary dip
Phase 1 – Skill‑Building (Month 1‑6)
Month 1‑2: Python Breakthrough
Why Python first?
SRE must code; Python is most used in ops.
Rich ecosystem for automation.
Easier to pick up than Go.
Learning Path:
Week 1‑2: Python basics – data types, control flow, functions, OOP, exceptions, modules.
Practice: 20 Easy LeetCode problems.
Week 3‑4: Advanced Python – decorators, generators, context managers, multithreading/multiprocessing/asyncio, requests, paramiko, pandas.
Project 1: Batch server management tool (SSH, parallel command execution).
Code Example:
import paramiko
from concurrent.futures import ThreadPoolExecutor


class ServerManager:
    """Run shell commands on many servers over SSH in parallel."""

    def __init__(self, servers):
        # servers: list of (ip, user, password) tuples
        self.servers = servers

    def exec_command(self, server, command):
        """Execute command on a single server"""
        ip, user, password = server
        ssh = paramiko.SSHClient()
        # AutoAddPolicy trusts unknown host keys: fine for a lab, risky in production
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            ssh.connect(ip, username=user, password=password, timeout=10)
            stdin, stdout, stderr = ssh.exec_command(command)
            output = stdout.read().decode()
            error = stderr.read().decode()
            # Judge success by exit status; some commands write warnings to stderr
            exit_code = stdout.channel.recv_exit_status()
            return {"ip": ip, "output": output, "error": error, "success": exit_code == 0}
        except Exception as e:
            return {"ip": ip, "output": "", "error": str(e), "success": False}
        finally:
            ssh.close()

    def batch_exec(self, command, max_workers=10):
        """Execute command on all servers in parallel"""
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [executor.submit(self.exec_command, s, command) for s in self.servers]
            # Results come back in the same order as self.servers
            return [f.result() for f in futures]


# Example credentials only; prefer SSH keys over passwords in real use
servers = [("192.168.1.10", "root", "pwd"), ("192.168.1.11", "root", "pwd")]
manager = ServerManager(servers)
results = manager.batch_exec("df -h | grep -w /")
for r in results:
    if r["success"]:
        print(f"✅ {r['ip']}:\n{r['output']}")
    else:
        print(f"❌ {r['ip']}: {r['error']}")
Week 5‑6: Web development basics (FastAPI) – needed for building ops platforms.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn

app = FastAPI(title="Ops Platform API")


class DeployRequest(BaseModel):
    app_name: str
    version: str
    servers: list[str]


def deploy_application(app_name, version, servers):
    """Placeholder: replace with the real deployment logic."""
    return {"app": app_name, "version": version, "servers": servers}


def get_app_status(app_name):
    """Placeholder: replace with a real status lookup."""
    return "unknown"


@app.post("/api/deploy")
async def deploy(request: DeployRequest):
    """Deploy application"""
    try:
        result = deploy_application(request.app_name, request.version, request.servers)
        return {"status": "success", "message": "Deploy succeeded", "result": result}
    except Exception as e:
        # Surface any deployment failure as an HTTP 500 with the error text
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/api/status/{app_name}")
async def get_status(app_name: str):
    """Get app status"""
    status = get_app_status(app_name)
    return {"app": app_name, "status": status}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
Month 1‑2 Summary:
✅ Mastered core Python syntax and common libraries.
✅ Developed two useful automation tools.
✅ Built programming mindset.
✅ Wrote five technical blogs.
⚠️ Code quality still needs improvement.
Month 3‑4: Kubernetes & Cloud‑Native
Learning Path:
Week 1‑2: K8s basics – Minikube locally, Alibaba Cloud K8s trial.
Core concepts: Pod, Service, Deployment, ConfigMap, Secret, PV, PVC, Namespace, Labels.
Practice:
1. Deploy Nginx
2. Configure Service & Ingress
3. Manage config via ConfigMap
4. Perform rolling updates & rollbacks
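As a rough illustration of practice items 1 and 4, the sketch below builds a Deployment manifest for Nginx with an explicit rolling‑update strategy and prints it as JSON, a format `kubectl apply` also accepts. The name `nginx-demo`, the image tag, and the replica count are placeholders chosen for this example, not values from the original exercise.

```python
import json


def nginx_deployment(name: str, image: str, replicas: int) -> dict:
    """Build an apps/v1 Deployment manifest with a rolling-update strategy."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the Pod template labels
            "selector": {"matchLabels": labels},
            "strategy": {
                "type": "RollingUpdate",
                # Add one new Pod at a time and never drop below full capacity
                "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": 80}],
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only
    manifest = nginx_deployment("nginx-demo", "nginx:1.25", replicas=3)
    print(json.dumps(manifest, indent=2))
```

Piping the output into `kubectl apply -f -` creates the Deployment. Changing the image tag and applying again triggers a rolling update, and `kubectl rollout undo deployment/nginx-demo` rolls it back.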
Week 3‑4: Advanced K8s – architecture, scheduler, CNI, CSI, RBAC.
Monitoring: Metrics Server, Prometheus + Grafana.
Logging: EFK stack.
Auto‑scaling: HPA, VPA, Cluster Autoscaler.
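To make the HPA part less abstract: the Horizontal Pod Autoscaler derives its target replica count from the ratio of the observed metric to the desired metric. The sketch below reproduces that core calculation under simplifying assumptions (a single metric, no stabilization window, no handling of unready Pods). The function name and the min/max defaults are illustrative.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         min_replicas: int = 1,
                         max_replicas: int = 10,
                         tolerance: float = 0.1) -> int:
    """Approximate the HPA rule: desired = ceil(current * currentMetric / targetMetric)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the autoscaler leaves the replica count alone
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 Pods averaging 90% CPU against a 60% target
    print(hpa_desired_replicas(4, current_metric=90, target_metric=60))  # 6
```

The clamp matters in practice: without a sensible `max_replicas`, a traffic spike or a broken metric can scale a workload far past what the cluster can schedule.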
Project: Containerize a company application.
- Analyze architecture & dependencies.
- Write optimized Dockerfile.
- Create K8s YAML manifests.
- Deploy to test env, set up monitoring & alerts.
- Write ops documentation.
Month 3‑4 Summary:
✅ Mastered core K8s concepts and operations.
✅ Completed application containerization in test env.
✅ Understood cloud‑native philosophy.
✅ Published eight K8s articles.
⏭️ Next step: production rollout.
Month 5‑6: Monitoring & Observability
Learning Path:
Week 1‑2: Deep dive into Prometheus – architecture, data model, PromQL, exporter development, alert rules, Alertmanager.
Hands‑on: Build full monitoring stack (Node Exporter, kube‑state‑metrics, custom app exporter, MySQL/Redis/Nginx exporters).
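For the custom exporter, the essential idea is small: an exporter is an HTTP endpoint that returns metrics in the Prometheus text exposition format. The sketch below uses only the standard library to show that format. A real exporter would more likely use the official `prometheus_client` package. The metric name `demo_disk_used_ratio` and the port are made up for this example.

```python
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_disk_metrics(path: str = "/") -> str:
    """Render one gauge in the Prometheus text exposition format."""
    usage = shutil.disk_usage(path)
    used_ratio = usage.used / usage.total
    lines = [
        "# HELP demo_disk_used_ratio Fraction of disk space in use.",
        "# TYPE demo_disk_used_ratio gauge",
        f'demo_disk_used_ratio{{path="{path}"}} {used_ratio:.4f}',
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on /metrics, which Prometheus scrapes."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_disk_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Start the exporter; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


if __name__ == "__main__":
    # Print one scrape's worth of output; call serve() to expose it over HTTP
    print(render_disk_metrics(), end="")
```

An alert rule on a gauge like this one is what turns the evening "disk full" page into an early warning.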
Week 3‑4: ELK/EFK logging – Filebeat → Kafka → Logstash → Elasticsearch → Kibana.
Configure Filebeat for app logs, set up pipelines.
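A log pipeline like this is easier to run when the application writes one JSON object per line, since downstream stages can parse fields without fragile regular expressions. The sketch below shows one way to do that with the standard `logging` module. The field names and the logger name `demo-app` are assumptions for illustration, not the format used in the original setup.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback in the same JSON line instead of spilling onto new lines
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, ensure_ascii=False)


def build_logger(name: str = "demo-app") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("order created, order_id=%s", 1001)
```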
Week 5‑6: Distributed tracing with Jaeger – OpenTelemetry integration, request chain analysis, performance bottleneck identification.
Week 7‑8: Build unified observability platform integrating Metrics (Prometheus), Logs (ELK), Traces (Jaeger) with unified dashboards and smart alerting.
Month 5‑6 Summary:
✅ Established complete monitoring & alerting system.
✅ Implemented centralized log management and analysis.
✅ Integrated distributed tracing.
✅ Reduced mean‑time‑to‑detect from 15 min to 30 s.
✅ Wrote ten observability articles.
Phase 1 (Month 1‑6) Overall Summary
✅ Programming: Python from zero to tool development.
✅ Cloud‑Native: K8s from novice to test‑environment deployment.
✅ Monitoring: Full observability stack built.
✅ Output: 23 technical blogs.
✅ Projects: 2 tools, 1 platform.
⏭️ Next phase: apply SRE principles in the company.
Phase 2 – Project Practice (Month 7‑12)
Month 7‑8: Build SLO/SLI System
SRE Core Practice:
What is SLO/SLI?
- SLI (Service Level Indicator): metric measuring service performance.
- SLO (Service Level Objective): target for SLI, e.g., 99.95% availability.
- SLA (Service Level Agreement): contractual guarantee.
Example:
SLI: API request success rate
SLO: 99.95% (≈21.6 min downtime per month)
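The downtime figure in the example is plain arithmetic: the share of the window that the SLO leaves uncovered, multiplied by the length of the window. A minimal sketch, assuming a 30‑day month:

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for slo in (99.95, 99.9, 99.5):
        print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min per 30 days")
    # 99.95% -> 21.6 min per 30 days
    # 99.9% -> 43.2 min per 30 days
    # 99.5% -> 216.0 min per 30 days
```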
Steps:
1. Identify key user journeys (e.g., browse, search, add‑to‑cart, checkout, view order).
2. Define SLIs (availability, latency).
3. Set SLO targets (P0 99.95%, P1 99.9%, P2 99.5%).
4. Implement monitoring (Prometheus queries).
5. Use an error‑budget mechanism to balance releases against reliability.
Result: Fast iteration balanced against reliability, data‑driven release decisions, and fewer noisy alerts.
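Step 5 can be made concrete with a small calculation. For a request‑based SLI, the error budget is the number of failed requests the SLO tolerates in a window, and a simple policy is to pause risky releases once that budget is spent. The sketch below is one possible shape for such a check. The class name, the request counts, and the freeze threshold are illustrative assumptions, not the policy actually used.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_percent: float     # e.g. 99.95
    total_requests: int    # requests served in the SLO window
    failed_requests: int   # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates for this request volume."""
        return self.total_requests * (1 - self.slo_percent / 100)

    @property
    def consumed_fraction(self) -> float:
        """Share of the budget already used (1.0 means fully spent)."""
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests == 0 else float("inf")
        return self.failed_requests / self.allowed_failures

    def release_allowed(self, freeze_threshold: float = 1.0) -> bool:
        """Allow releases only while consumption is below the threshold."""
        return self.consumed_fraction < freeze_threshold


if __name__ == "__main__":
    healthy = ErrorBudget(99.95, total_requests=10_000_000, failed_requests=3_000)
    print(f"{healthy.consumed_fraction:.0%} used, release allowed: {healthy.release_allowed()}")
    exhausted = ErrorBudget(99.95, total_requests=10_000_000, failed_requests=6_000)
    print(f"{exhausted.consumed_fraction:.0%} used, release allowed: {exhausted.release_allowed()}")
```

The value of the mechanism is that the release decision becomes a number both developers and SREs can read, rather than a negotiation.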
Month 9‑10: Automation & CI/CD
Full CI/CD Pipeline:
Goal: Automate from code commit to production release.
Architecture:
GitLab → GitLab CI → Docker Registry → ArgoCD → Kubernetes
Pipeline stages:
1. Build – Docker image, push to registry.
2. Test – unit tests, coverage.
3. Deploy‑staging – update image, health check.
4. Deploy‑production – canary release, automated verification.
5. Rollback on failure.
GitLab CI .gitlab-ci.yml snippet:
stages:
  - build
  - test
  - deploy-staging
  - deploy-production

variables:
  IMAGE_TAG: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

build:
  stage: build
  script:
    - docker build -t $IMAGE_TAG .
    - docker push $IMAGE_TAG
  only:
    - master
    - develop

test:
  stage: test
  script:
    - go test -v ./...
    - go test -cover ./... -coverprofile=coverage.out
    - go tool cover -func=coverage.out
  coverage: '/total:.*\s(\d+\.\d+)%/'

deploy_staging:
  stage: deploy-staging
  script:
    - kubectl set image deployment/app app=$IMAGE_TAG -n staging
    - kubectl rollout status deployment/app -n staging --timeout=5m
    - ./scripts/health_check.sh staging
  environment:
    name: staging
  only:
    - develop

deploy_production:
  stage: deploy-production
  script:
    # Canary release: update the canary first, analyze it, then roll out fully
    - kubectl set image deployment/app-canary app=$IMAGE_TAG -n production
    - kubectl rollout status deployment/app-canary -n production
    - ./scripts/canary_analysis.sh
    - kubectl set image deployment/app app=$IMAGE_TAG -n production
    - kubectl rollout status deployment/app -n production
  environment:
    name: production
  when: manual
  only:
    - master
Result: Deployment time reduced from 30 min to 5 min, release frequency increased from weekly to 5‑10 times per day, and the failure rate dropped from 5 % to 0.3 %.
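The production job calls `./scripts/canary_analysis.sh`, whose contents are not shown. One plausible shape for such a check, sketched here in Python rather than shell, is to compare the canary's error rate with the stable baseline and fail when it is clearly worse. The thresholds and the function name are assumptions for illustration. In a real pipeline the counts would come from the monitoring system.

```python
def canary_passes(baseline_errors: int, baseline_total: int,
                  canary_errors: int, canary_total: int,
                  max_ratio: float = 1.5,
                  min_requests: int = 500,
                  abs_floor: float = 0.001) -> bool:
    """Return True if the canary's error rate is acceptable versus the baseline."""
    # Too little traffic to judge: treat as a failure rather than guess
    if canary_total < min_requests:
        return False
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # Allow up to max_ratio times the baseline rate, but never demand
    # better than abs_floor (avoids failing on noise when the baseline is near zero)
    allowed_rate = max(baseline_rate * max_ratio, abs_floor)
    return canary_rate <= allowed_rate


if __name__ == "__main__":
    # Hypothetical counts: baseline 0.03% errors, canary 0.05% errors
    print(canary_passes(30, 100_000, 1, 2_000))   # True
    # Canary at 1% errors is far above the allowed rate
    print(canary_passes(30, 100_000, 20, 2_000))  # False
```

Used as a pipeline gate, a script like this would exit with a non‑zero status on `False` so that the full rollout steps after it never run.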
Month 11‑12: Internal Developer Platform (IDP)
Background: Company had 50+ applications with fragmented deployment, monitoring, and log access.
Solution: Build a unified ops platform.
Application Management (CMDB)
One‑click deployment, canary, rollback
Monitoring dashboard (health, QPS, latency, error rate)
Log query (real‑time, historical search)
Ticket system (resource, permission, change requests)
Tech Stack: Frontend Vue3 + Element Plus, Backend FastAPI + Celery, PostgreSQL + Redis, Docker + Kubernetes.
Key Backend API Example:
from fastapi import FastAPI, Depends, HTTPException
from sqlalchemy.orm import Session
from kubernetes import client, config

# get_db, Application, DeployRequest, DeployTask, check_permission and the Celery
# task deploy_task are project-specific and defined elsewhere in the platform code.

app = FastAPI()

# Load cluster credentials once at startup. This assumes the platform runs
# inside the cluster; use config.load_kube_config() when running outside it.
config.load_incluster_config()


@app.get("/api/apps")
async def list_apps(db: Session = Depends(get_db)):
    """List applications"""
    apps = db.query(Application).all()
    # Use a name other than "app" to avoid confusion with the FastAPI instance
    return [a.to_dict() for a in apps]


@app.post("/api/deploy")
async def deploy_app(request: DeployRequest, db: Session = Depends(get_db)):
    """Deploy application"""
    if not check_permission(request.user, request.app_name):
        raise HTTPException(403, "No permission")
    # Record the task, then hand the actual deployment to a Celery worker
    task = DeployTask(app_name=request.app_name, version=request.version, env=request.env, user=request.user)
    db.add(task)
    db.commit()
    deploy_task.delay(task.id)
    return {"task_id": task.id, "status": "pending"}


@app.get("/api/logs/{app_name}")
async def get_logs(app_name: str, namespace: str = "production", lines: int = 100):
    """Fetch application logs"""
    k8s_client = client.CoreV1Api()
    pods = k8s_client.list_namespaced_pod(namespace=namespace, label_selector=f"app={app_name}")
    logs = []
    for pod in pods.items:
        # Tail the last `lines` log lines from each matching Pod
        pod_logs = k8s_client.read_namespaced_pod_log(name=pod.metadata.name, namespace=namespace, tail_lines=lines)
        logs.append({"pod": pod.metadata.name, "logs": pod_logs})
    return logs
Result: Platform adopted by >100 engineers, deployment time cut by 80 %, ops tickets reduced by 60 %.
Phase 2 (Month 7‑12) Overall Summary
✅ Established SLO/SLI system.
✅ Implemented full CI/CD pipeline.
✅ Developed internal platform used company‑wide.
✅ Team stability improved (incident time ↓70 %).
✅ Release efficiency ↑5×.
✅ Authored 15 technical articles.
✅ Presented at company tech conference.
⏭️ Gained solid SRE project experience for interviews.
Phase 3 – Interview Preparation & Job Search (Month 13‑15)
Month 13: Resume & Project Polishing
Resume Principles:
Use data: e.g., "Built Prometheus monitoring covering 200+ servers, reduced mean time to detect from 15 min to 30 s."
Highlight business value: e.g., "CI/CD platform increased release frequency from weekly to daily, supporting rapid product iteration."
Show technical depth: e.g., "Optimized K8s scheduler, improved cluster utilization from 60 % to 85 %."
Structure: Personal info, strengths, work experience (with quantified achievements), technical skills, education, open‑source contributions.
Month 14: Technical Prep
Focus areas:
Algorithms & data structures (LeetCode): 50 Easy, 30 Medium, 10 Hard (optional).
System design (SNAKE framework): Scenario, Necessary, Application, Kilobit, Evolve.
Linux & networking fundamentals.
Deep dive into projects – be ready to discuss architecture, challenges, solutions, and impact.
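System design interviews for SRE roles often include a back‑of‑envelope capacity estimate (the "Kilobit" step in SNAKE). The sketch below shows the kind of arithmetic involved: size a fleet from peak traffic, per‑instance throughput, and a utilization target that leaves headroom. All numbers are hypothetical, and the function is an illustration rather than a method taken from the original preparation notes.

```python
import math


def instances_needed(peak_qps: float,
                     qps_per_instance: float,
                     target_utilization: float = 0.6,
                     redundancy: int = 1) -> int:
    """Estimate instance count so peak load stays at the target utilization."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Each instance is only planned to run at a fraction of its measured capacity
    usable_qps = qps_per_instance * target_utilization
    base = math.ceil(peak_qps / usable_qps)
    # Extra instances so that losing one does not push the rest over target
    return base + redundancy


if __name__ == "__main__":
    # Hypothetical service: 10,000 QPS at peak, 800 QPS per instance from load tests
    print(instances_needed(10_000, 800))  # 22
```

Stating the utilization target and the redundancy assumption out loud is usually worth more in an interview than the final number.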
Month 15: Applications & Interviews
Target companies: top‑tier internet firms (Alibaba, ByteDance, Tencent, Meituan), unicorns, and foreign tech giants.
Interview record (selected):
ByteDance – Offer: 3‑1 level, 40K × 16.
Alibaba – Offer: P6, 35K × 15.5.
Meituan – Offer: L6, 38K × 16.
Tencent – Rejected on hard algorithm questions.
Key takeaways: Strong project experience and clear articulation win; algorithm skills still need improvement.
Phase 4 – Onboarding & Adaptation (Month 16‑18)
Month 16: Hand over current role, join ByteDance, familiarize environment.
Month 17: Rapid learning – new tech stack (Go‑centric), large‑scale systems, intensive training.
Month 18: Take ownership of a service, participate in on‑call, contribute to architecture design.
Outcomes: Salary doubled (20K → 40K), clear career path, significant technical growth.
Key Takeaways
Define clear motivation and goals.
Identify and fill skill gaps (programming, cloud‑native, monitoring).
Drive learning with real projects.
Apply SRE principles in current role.
Continuously output (blogs, open‑source) to build personal brand.
Find mentors or community support.
Set milestones and reward yourself.
Transitioning from traditional ops to SRE is challenging but highly rewarding, offering higher salary, broader technical scope, and clearer career advancement.
MaGe Linux Operations
