
Mastering Internet Operations: Roles, Responsibilities, and Evolution

This article provides a comprehensive overview of internet operations, detailing how service‑centric stability, security, and efficiency are achieved through infrastructure management, monitoring, risk mitigation, and continuous optimization, while outlining the various operational roles, their duties, and the evolution of ops practices.

Efficient Ops

Internet operations focus on service‑centric stability, security, and efficiency, ensuring 24/7 high‑quality service for users.

Operations engineers strengthen the reliability of the underlying infrastructure, basic services, and online applications, performing routine inspections to identify potential issues, optimizing architecture to prevent common failures, and enhancing disaster‑recovery capabilities through multi‑data‑center integration.

By leveraging monitoring, log analysis, and other technical methods, they promptly detect and respond to service faults, reducing downtime and meeting availability targets.
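An availability target translates directly into a downtime budget, which is often easier to reason about during an incident. The sketch below assumes a simple model in which availability is uptime divided by total time over a fixed window. The function names and the 30-day default are illustrative choices, not something the article prescribes.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Return the downtime budget in minutes for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def measured_availability_pct(downtime_minutes: float, period_days: int = 30) -> float:
    """Return the availability percentage achieved given observed downtime."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under this model a 99.9% target over 30 days leaves roughly 43.2 minutes of downtime, so every minute saved in detection and response counts against a small budget.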

On the security side, they protect every layer of the service stack so that users can access online applications securely and with their data intact.

Security measures include network perimeter segmentation, ACL management, traffic analysis, DDoS mitigation, OS and open‑source software vulnerability scanning and patching, as well as protection against XSS and SQL injection. Additional practices cover security process definition, white‑box/black‑box code scanning, permission audits, intrusion detection, and business risk control.

Operations staff must keep services running in a secure, controllable state, protecting both company and user data while resisting malicious attacks.
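The ACL management mentioned above comes down to deciding whether a source address falls inside an allowed network. Here is a minimal sketch using Python's standard `ipaddress` module. The networks in `ALLOWED_NETWORKS` are hypothetical placeholders, and a production ACL would normally be enforced on network devices or firewalls rather than in application code.

```python
import ipaddress

# Hypothetical allow list; real entries would come from the asset or ACL system.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.10.0/24"),
]


def is_allowed(source_ip: str, allowed=ALLOWED_NETWORKS) -> bool:
    """Return True if source_ip belongs to any network in the allow list."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in network for network in allowed)


if __name__ == "__main__":
    for ip in ("10.1.2.3", "192.168.11.1", "8.8.8.8"):
        print(ip, "allowed" if is_allowed(ip) else "denied")
```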

Beyond stability and security, operations engineers also drive efficiency, delivering maximum user value with minimal resources. Examples include optimizing I/O to raise database throughput and compressing images to reduce bandwidth usage. Internally, they build platform tools that speed up product releases, deployments, and routine operational tasks.

Operations Work Classification

As businesses grow, mature internet companies increasingly subdivide operations roles.

Initially, startups may have only system operations, but as service scale and quality demands rise, work is further divided.

The typical classification and the responsibilities of each role are described below.

System Operations

System operations builds out the IDC (data center), network, CDN, and basic services such as LVS, NTP, and DNS, and is also responsible for asset management and for server selection, delivery, and maintenance.

1. IDC Data Center Construction

Gather business requirements, forecast data‑center scale, and evaluate factors such as backbone network distribution, building design, Internet access, attack defense, expansion capacity, space reservation, dedicated lines, and on‑site support before selecting and building the data center.

2. Network Construction

Design and plan production network architecture, including data‑center, transport, and CDN networks, and perform daily network tuning.

3. LVS Load Balancing and SNAT Deployment

LVS serves as the traffic entry point. System operations builds load‑balancing clusters sized to scale and demand, providing high‑performance, high‑availability dispatch and unified network‑layer attack protection.

SNAT offers centralized public‑network access for the data center, ensuring high performance and availability through clustered deployment.
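To make the dispatch step concrete: LVS ships several scheduling algorithms, weighted round‑robin among them. The sketch below illustrates the general idea with a "smooth" weighted round‑robin selection in plain Python. It is an illustration of the technique only, not the LVS kernel implementation, and the real‑server names and weights are made up.

```python
from itertools import islice


def smooth_weighted_round_robin(weights: dict[str, int]):
    """Yield server names forever, in proportion to their weights.

    Each round every server gains its weight, the server with the highest
    running score is picked, and the total weight is subtracted from it.
    This spreads picks out instead of sending bursts to one server.
    """
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be a non-empty map of positive integers")
    total = sum(weights.values())
    current = {name: 0 for name in weights}
    while True:
        for name, weight in weights.items():
            current[name] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total
        yield chosen


if __name__ == "__main__":
    # Hypothetical real servers behind one virtual IP.
    picks = list(islice(smooth_weighted_round_robin({"rs1": 5, "rs2": 1, "rs3": 1}), 7))
    print(picks)
```

Over any window of seven picks with these weights, `rs1` receives five requests and the other two receive one each.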

4. CDN Planning and Construction

CDN work is split between third‑party services and self‑built solutions. It involves vendor selection, node planning, monitoring, and handling faults such as hijacked user traffic.

5. Server Selection, Delivery, and Maintenance

Conduct comprehensive testing of servers and components, reduce power consumption, and increase rack density.

Leverage business knowledge to promote new hardware and solutions, reduce server investment, and develop diagnostic tools for hardware failures.

6. OS and Kernel Selection & Maintenance

Select, customize, and optimize operating systems and kernels, manage patches, maintain YUM repositories, and address OS‑related incidents.

7. Asset Management

Record and manage physical assets such as data centers, networks, racks, servers, ACLs, and IPs, establishing accurate processes and providing API interfaces for automation.

8. Basic Service Construction

Design highly available architectures for DNS, NTP, SYSLOG, and other essential services to avoid single points of failure.

Application Operations

Application operations manage online service changes, monitoring, disaster recovery, and data backup, performing routine inspections and emergency handling.

1. Design Review

Participate in product design reviews, offering operational perspectives to ensure high availability requirements are met.

2. Service Management

Define upgrade and rollback plans and execute changes, understand each service's dependencies, set stability metrics, improve monitoring accuracy, and respond promptly to incidents.
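Knowing service dependencies matters most when planning the order of a change. As a rough sketch, a dependency map can be turned into an upgrade order with the standard library's `graphlib` (Python 3.9+). The service names are hypothetical, and upgrading dependencies before their dependents is an assumed policy rather than a rule stated in the article.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each service -> the services it depends on.
DEPENDENCIES = {
    "web": {"api"},
    "api": {"db", "cache"},
    "db": set(),
    "cache": set(),
}


def upgrade_order(deps: dict[str, set[str]]) -> list[str]:
    """Return services ordered so that dependencies come before dependents."""
    return list(TopologicalSorter(deps).static_order())


def rollback_order(deps: dict[str, set[str]]) -> list[str]:
    """Return the reverse order, undoing dependents before their dependencies."""
    return list(reversed(upgrade_order(deps)))


if __name__ == "__main__":
    print("upgrade: ", upgrade_order(DEPENDENCIES))
    print("rollback:", rollback_order(DEPENDENCIES))
```

`TopologicalSorter` raises `graphlib.CycleError` on circular dependencies, which is itself a useful finding during a design review.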

3. Resource Management

Manage server assets, assess data‑center distribution and network bandwidth, and allocate resources efficiently according to service needs.

4. Routine Inspection

Establish regular inspection points, continuously refine them, and promptly investigate and resolve discovered issues.

5. Contingency Planning

Set thresholds for monitoring metrics, develop response procedures for when they are breached, and maintain contingency documents and exercise them regularly.
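One way to tie thresholds to response procedures is to store them together, so that every alert level points at a documented action. The sketch below assumes two levels (warning and critical) and a `runbook` field holding a document reference. The class, the metric name, and the runbook path are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    warning: float
    critical: float
    runbook: str  # hypothetical pointer to the contingency document


def evaluate(value: float, threshold: Threshold) -> str:
    """Classify a metric sample as 'ok', 'warning', or 'critical'."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    disk = Threshold("disk_used_pct", warning=80, critical=90, runbook="runbooks/disk-full")
    for sample in (75, 85, 95):
        level = evaluate(sample, disk)
        action = f"follow {disk.runbook}" if level != "ok" else "no action"
        print(f"{disk.metric}={sample}: {level} ({action})")
```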

6. Data Backup

Define backup strategies, ensure data availability and integrity, and conduct regular recovery tests.
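A recovery test is only meaningful if the restored data is compared against the original. A minimal sketch of one such check follows, comparing SHA‑256 digests of a source file and its restored copy. It assumes file‑level backups, and the function names are illustrative. Database backups usually need additional, engine‑specific validation.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original: Path, restored: Path) -> bool:
    """Return True if the restored file is byte-for-byte identical in content."""
    return sha256_of(original) == sha256_of(restored)
```

Reading in chunks keeps memory use flat even for large backup files.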

Database Operations

Database operations handle storage design, schema and index planning, SQL optimization, change management, monitoring, backup, and high‑availability design.

1. Design Review

Participate early in product design to propose storage solutions, schema designs, SQL standards, and indexing strategies for high performance and availability.

2. Capacity Planning

Monitor database capacity limits, identify bottlenecks, and perform optimization or scaling before limits are reached.
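Acting before a limit is reached requires some estimate of when it will be reached. The sketch below uses the simplest possible model, a constant average daily growth taken from the first and last samples. That is an assumption made for illustration; real workloads are often seasonal or bursty and call for better forecasting.

```python
from typing import Optional, Sequence


def days_until_full(daily_used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate days until capacity is reached from one usage sample per day.

    Returns None when usage is flat or shrinking, since no limit is in sight.
    """
    if len(daily_used_gb) < 2:
        raise ValueError("need at least two daily samples")
    growth_per_day = (daily_used_gb[-1] - daily_used_gb[0]) / (len(daily_used_gb) - 1)
    if growth_per_day <= 0:
        return None
    remaining_gb = max(capacity_gb - daily_used_gb[-1], 0.0)
    return remaining_gb / growth_per_day


if __name__ == "__main__":
    history = [100, 110, 120, 130]  # hypothetical samples in GB
    print(days_until_full(history, capacity_gb=200))
```

With the sample history above, usage grows by 10 GB per day and 70 GB remain, so the estimate is 7 days.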

3. Backup and Disaster Recovery

Establish backup and disaster‑recovery policies, regularly test restores, and ensure data reliability.

4. Database Monitoring

Implement health and performance monitoring to detect issues promptly.

5. Database Security

Build account systems, enforce strict permissions, manage offline backups, and reduce leakage risk.

6. High Availability & Performance Optimization

Design failover mechanisms, and continuously optimize performance through hardware, storage, filesystem, and SQL improvements while controlling costs.

7. Automation System Development

Develop automated platforms for deployment, scaling, sharding, permission management, backup/recovery, SQL review, and failover.

Operations R&D

Operations R&D designs and develops generic operations platforms such as asset management, monitoring, and data‑permission systems, providing APIs for automation.

1. Operations Platform

Record services and their relationships, and automate routine tasks such as machine management, restarts, renaming, initialization, domain handling, traffic switching, and contingency execution.

2. Monitoring System

Design and develop monitoring solutions that collect, alert, store, analyze, and visualize server and network metrics, improving timeliness, accuracy, and intelligence of alerts.

3. Automated Deployment System

Help build automated deployment tools, covering data handling, permissions, APIs, and web development, and use cloud computing to provide a highly available PaaS platform.
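One common pattern inside such a deployment tool is a rolling release: deploy to a small batch, check health, and roll back if the check fails. The sketch below shows the control flow only. The `deploy`, `healthy`, and `rollback` callables are hypothetical hooks that a real system would implement, and rolling back every host touched so far is an assumed policy.

```python
from typing import Callable, Iterable


def rolling_deploy(
    hosts: Iterable[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    batch_size: int = 1,
) -> bool:
    """Deploy batch by batch; on a failed health check, roll back and stop.

    Returns True if every batch passed its health check, False otherwise.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    host_list = list(hosts)
    touched: list[str] = []
    for start in range(0, len(host_list), batch_size):
        batch = host_list[start:start + batch_size]
        for host in batch:
            deploy(host)
            touched.append(host)
        if not all(healthy(host) for host in batch):
            # Undo in reverse order so the most recent change is reverted first.
            for host in reversed(touched):
                rollback(host)
            return False
    return True
```

Passing the hooks in as parameters keeps the control flow testable without touching real servers.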

Operations Security

Operations security strengthens network, system, and business layers through regular scanning, penetration testing, tool development, and incident response.

1. Security Policy Development

Create practical, enforceable security policies aligned with internal processes.

2. Security Training

Provide targeted training and assessments, establishing security responsibility across the organization.

3. Risk Assessment

Conduct regular black‑box and white‑box testing to evaluate risks across networks, servers, applications, and user data.

4. Security Hardening

Reinforce weak points based on assessments, deploy defenses, patch promptly, and employ encryption, anonymization, and data deletion techniques.
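As one illustration of the anonymization techniques mentioned above, identifiers can be replaced with keyed hashes before data leaves a protected system. Strictly speaking this is pseudonymization rather than full anonymization, and it is only one of several possible approaches. The function name and sample values are hypothetical, and the secret key must itself be stored and rotated securely.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable keyed SHA-256 token that stands in for the raw value."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    key = b"replace-with-a-managed-secret"  # placeholder, not a real key
    token = pseudonymize("user@example.com", key)
    print(token)
```

The same input and key always yield the same token, so records can still be joined for analysis without exposing the original identifier.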

5. Compliance

Handle external compliance requirements such as payment licensing.

6. Incident Response

Maintain an alert system, collect third‑party findings, coordinate remediation, assess impact, and investigate root causes.

Evolution of Operations Work

Early teams performed basic data‑center construction, network setup, and server provisioning with minimal online service involvement.

As products matured, requirements for service quality grew, leading to added responsibilities like server monitoring, LVS/Nginx management, and manual change processes.

Changes were initially made by hand on each server or with simple scripts, and monitoring focused on server health using tools like Nagios or Cacti.

Growth prompted division into system and application operations, with application ops taking over online services, monitoring, backups, and change management.

Further scaling introduced multi‑data‑center disaster recovery, extensive pre‑planning, and the need for automated platforms to handle complex service relationships.

Security incidents also increased, driving deeper investment in defensive measures and leading to five major operational categories each requiring specialized expertise.

System operations now concentrate on infrastructure stability and efficiency, while application operations focus on service performance; database operations specialize in automation, performance, and security; operations R&D and security provide platforms and tools to enhance overall reliability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
