Comprehensive Guide to Modern IT Operations: Roles, Responsibilities, and Evolution

This article outlines the service‑centric principles of internet operations, details the various categories of work such as system, application, database, and security operations, and traces the evolution of operational practices from manual management to automated, platform‑driven workflows.

Efficient Ops

Jun 25, 2017

Operations Work Classification

Internet operations focus on service, emphasizing stability, security, and efficiency to ensure 24/7 high‑quality service for users.

Operations engineers strengthen the stability of underlying infrastructure, services, and online applications, conduct daily inspections to uncover hidden risks, optimize architecture to prevent common failures, and improve disaster‑recovery capability through multi‑data‑center integration.

Through monitoring, log analysis, and other techniques, they quickly detect and respond to faults, reducing downtime and meeting availability targets. They also maintain security at every layer: network boundaries and ACLs, traffic analysis and DDoS defense, OS and open‑source vulnerability patching, and application‑level protection against XSS and SQL injection.
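To make "availability targets" concrete, the following minimal sketch converts a target percentage into the downtime it allows. The function name and the 30‑day period are assumptions chosen for illustration; the article does not prescribe a particular target or window.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target allows over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare a few common targets over a 30-day window.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% -> {budget:.1f} min per 30 days")
```

Under these assumptions, a 99.9% target leaves about 43.2 minutes of downtime in 30 days, which shows why fast detection and response matter as much as prevention.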

Operations also optimize performance, for example through I/O tuning for databases and image compression to reduce bandwidth, and they improve internal release efficiency with tooling platforms.

System Operations

IDC Data‑Center Construction : Collect business requirements, forecast scale, evaluate the backbone network, floor space, external links, and on‑site support, then build and maintain the data center.

Network Construction : Design and plan production network architecture, including data‑center, transport, and CDN networks, and perform daily network tuning.

LVS Load Balancing and SNAT : Deploy load‑balancing clusters as traffic entry points, provide high‑performance, high‑availability dispatch and unified attack defense.

CDN Planning and Construction : Manage third‑party and self‑built CDNs, select nodes, monitor CDN health, and handle traffic‑hijacking incidents that affect users.

Server Selection, Delivery, and Maintenance : Test and select servers, reduce power consumption, increase rack density, and diagnose hardware faults.

OS and Kernel Selection & Maintenance : Choose and customize OS, optimize kernels, manage patches, build YUM repositories, and handle OS‑related incidents.

Asset Management : Record and manage physical resources (data‑center, network, cabinets, servers, ACLs, IPs) and provide accurate information via APIs for automation.

Basic Service Construction : Design highly available DNS, NTP, SYSLOG services to avoid single points of failure.
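As one illustration of the asset‑management item above, automation often needs to ask the inventory for an unused address. The sketch below is a minimal, in‑memory version of such a query using Python's standard `ipaddress` module. The function name `next_free_ip` and the sample records are hypothetical; a real system would read allocations from its asset database.

```python
import ipaddress
from typing import Iterable, Optional


def next_free_ip(subnet: str, allocated: Iterable[str]) -> Optional[str]:
    """Return the first unallocated host address in the subnet, or None if it is full."""
    used = {ipaddress.ip_address(addr) for addr in allocated}
    for host in ipaddress.ip_network(subnet).hosts():
        if host not in used:
            return str(host)
    return None


# Hypothetical inventory records for a small subnet.
allocated = ["10.0.0.1", "10.0.0.2", "10.0.0.4"]
print(next_free_ip("10.0.0.0/29", allocated))  # 10.0.0.3
```

Exposing a lookup like this behind an API is what lets delivery and deployment tooling rely on the inventory instead of on spreadsheets.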

Application Operations

Design Review : Participate in product design reviews to ensure high‑availability requirements are met.

Service Management : Define upgrade and rollback plans, track service dependencies, set stability metrics, and improve monitoring accuracy.

Resource Management : Manage server assets, data‑center distribution, bandwidth, and allocate resources according to service needs.

Routine Checks : Establish and continuously improve regular inspection points to identify and eliminate hidden risks.

Plan Management : Set thresholds for monitoring indicators, create and update incident response plans, and conduct regular drills.

Data Backup : Define backup strategies, ensure backup availability and integrity, and perform regular restore tests.
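The backup item above asks for integrity as well as availability. One common way to check integrity, shown here as a sketch and not as the article's prescribed method, is to record a SHA‑256 checksum when the backup is taken and compare it before any restore. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(path: Path, expected_sha256: str) -> bool:
    """Return True if the backup file exists and matches the recorded checksum."""
    return path.is_file() and sha256_of(path) == expected_sha256.lower()
```

A checksum match only proves the file is unchanged since it was written. It does not prove the backup can be restored, which is why the regular restore tests remain necessary.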

Database Operations

Design Review : Contribute DBA perspective on storage, schema, index, and SQL standards during product design.

Capacity Planning : Monitor database capacity limits, identify bottlenecks, and trigger optimization or scaling.

Backup & Disaster Recovery : Establish backup and DR strategies, and regularly test recovery procedures.

Database Monitoring : Implement health and performance monitoring to detect issues promptly.

Database Security : Build account hierarchy, enforce strict permissions, manage offline backups, and apply encryption/obfuscation.

High Availability & Performance Optimization : Design failover schemes, introduce new storage, hardware, filesystem, and SQL optimizations while controlling cost.

Automation System : Develop automated deployment, scaling, sharding, permission management, backup/recovery, and SQL review tools.

Ops R&D

Common Platforms : Build generic platforms for asset management, monitoring, and data‑permission systems, exposing APIs for higher‑level automation.

Monitoring System : Design and develop collection, alerting, storage, analysis, and visualization of server and business metrics.

Automated Deployment System : Provide PaaS‑style high‑availability platforms, improve deployment speed, and enhance resource utilization.
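The alerting part of a monitoring system has to balance speed against false alarms. A minimal sketch of one common rule follows: fire only after several consecutive samples exceed a threshold. The class name, the threshold of 90, and the window of three samples are assumptions for illustration, not values from the article.

```python
from collections import deque
from typing import Deque


class ConsecutiveThresholdAlert:
    """Fire only after `required` consecutive samples exceed `threshold`."""

    def __init__(self, threshold: float, required: int = 3) -> None:
        if required < 1:
            raise ValueError("required must be at least 1")
        self.threshold = threshold
        self.required = required
        # Keep only the most recent `required` comparison results.
        self._recent: Deque[bool] = deque(maxlen=required)

    def observe(self, value: float) -> bool:
        """Record one sample and return True if the alert should fire."""
        self._recent.append(value > self.threshold)
        return len(self._recent) == self.required and all(self._recent)


# Hypothetical CPU utilization samples, in percent.
alert = ConsecutiveThresholdAlert(threshold=90.0, required=3)
samples = [95, 50, 92, 93, 97]
print([alert.observe(v) for v in samples])
# [False, False, False, False, True]
```

The single spike at the start does not page anyone; the alert fires only once three samples in a row are above the threshold. Raising `required` reduces noise at the cost of slower detection.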

Operations Security

Security Policy Establishment : Create practical internal security policies.

Security Training : Deliver targeted training and assessments, and designate security owners.

Risk Assessment : Conduct black‑box/white‑box testing to evaluate network, server, application, and data risks.

Security Construction : Harden the weakest links, deploy security devices, patch promptly, scan source code, and apply data encryption and anonymization.

Security Compliance : Meet external compliance requirements such as payment licensing.

Emergency Response : Build alert systems, collect third‑party findings, coordinate remediation, and perform post‑incident analysis.
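The data‑protection step under Security Construction can take several forms. As one assumed example, the sketch below replaces an identifier with a keyed hash (HMAC‑SHA256) so that records can still be joined without exposing the raw value. Strictly speaking this is pseudonymization, since anyone holding the key can recompute the mapping. The function name and the sample key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable, keyed SHA-256 token for a sensitive value."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would come from a secrets manager, not source code.
key = b"example-key-do-not-use"
token_a = pseudonymize("user@example.com", key)
token_b = pseudonymize("user@example.com", key)
print(token_a == token_b)  # True: the same input and key give the same token
```

Using a keyed hash instead of a plain hash makes it harder to reverse tokens by hashing guessed values, provided the key stays secret.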

Evolution of Operations Work

Early teams performed basic data‑center construction, network setup, and server procurement with minimal online service involvement.

As products matured, teams added server monitoring and LVS/Nginx layer‑4/7 operations, and began handling service changes manually or with simple batch scripts, using open‑source tools like Nagios and Cacti.

Growth led to division into System Operations and Application Operations. Application teams took over online services, implementing monitoring, backup, and change management, while writing tools for bulk operations.

Increasing scale introduced multi‑data‑center disaster recovery and plan management. Open‑source monitoring could no longer meet performance needs, and manual processes became insufficient.

Security incidents prompted stronger defensive measures, resulting in five major work categories each requiring specialized talent.

System Operations now focus on infrastructure stability and resource delivery; Application Operations concentrate on service health and efficiency; Database Operations specialize in automation, performance, and security; Ops R&D and Ops Security provide platforms and tools to boost overall stability, efficiency, and safety.


Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
