How NoOps Transforms Operations: Automating Service Management
This article outlines the NoOps philosophy of automating routine operational tasks. It describes how a technical, learning‑oriented team builds self‑service platforms, makes pragmatic use of open‑source tools, and invests in research to improve the efficiency, stability, and innovation of modern internet services.
New Book Coming
Our new book is about to be released; this article is a preview for feedback.
No Ops
Our public blog is NoOps.me, and our operations philosophy is "Ops Make No Ops". We aim to automate daily operations as much as possible, reducing manual work and the constant high‑stress state of handling online incidents.
Our goal is to build a technical, learning‑oriented team that, through research and sharing, raises its technical influence and makes operations more efficient. We provide services in the style of SaaS, DaaS, PaaS, and IaaS, encapsulating each layer so that developers can self‑serve resource requests, deployments, monitoring, and disaster recovery while we deliver stable, efficient, and secure internet services. This frees operations engineers to focus on automation frameworks, business architecture design, and performance optimization.
R&D Capability
Operations engineers must ensure 24/7 stable service. Because internet services evolve rapidly, ops staff need not only OS, networking, hardware, and open‑source software skills but also strong development abilities.
With development skills, ops can quickly understand delivered services, anticipate risks, and give precise operational advice, enabling targeted optimization rather than trial‑and‑error. Ideally, each ops engineer becomes a product architect, capable of planning and improving system architecture.
Development skills also let ops turn experience into tools and platforms, advancing automation. In hiring, we prioritize candidates who master at least one scripting language, understand data structures, and demonstrate hands‑on ability.
Unlike some companies where platform developers lack ops insight, our platform team works closely with ops staff on requirement gathering and design reviews, ensuring tools meet real needs.
Our platform team handles core functions and UI development, while ops engineers contribute to monitoring, deployment, environment initialization, and daily tool development. For example, in our deployment system, ops built the build module, client, and controller, while platform developers handled the web UI.
Platform Development Philosophy
We expose a range of APIs so that ops engineers or developers can build their own tools. Previously, tasks such as server provisioning, renaming, or rebooting had to go through system engineers. Those actions are now automated behind APIs that the service management platform calls, so application ops engineers can perform them directly. Over time, we are opening these permissions to developers as well.
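The idea of widening API permissions from one role to the next can be sketched as follows. This is a minimal illustration, not our platform's actual interface: the `ServerAPI` class, the role names, and the action names are all hypothetical, and the "API" only records what it would have done.

```python
class PermissionDenied(Exception):
    """Raised when a role calls an action it has not been granted."""


class ServerAPI:
    """Hypothetical, in-memory stand-in for a server-management API."""

    def __init__(self, grants):
        # Map each role to the set of actions it may perform.
        self.grants = {role: set(actions) for role, actions in grants.items()}
        self.audit_log = []

    def grant(self, role, action):
        """Open an action to a role, e.g. when developers get self-service."""
        self.grants.setdefault(role, set()).add(action)

    def call(self, role, action, host):
        """Check the grant, record the call, and pretend to queue the action."""
        if action not in self.grants.get(role, set()):
            raise PermissionDenied(f"{role} may not {action} {host}")
        self.audit_log.append((role, action, host))
        return f"{action} queued for {host}"


if __name__ == "__main__":
    api = ServerAPI({"app_ops": {"provision", "rename", "reboot"}})
    print(api.call("app_ops", "reboot", "web01"))
    api.grant("developer", "reboot")  # permission opened later
    print(api.call("developer", "reboot", "web01"))
```

Keeping the grants as data rather than code means opening an action to a new role is a configuration change, and the audit log records who triggered what.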
Automation frees ops staff to focus on higher‑value work and improves efficiency. For instance, Hadoop developers use the asset APIs to inform storage scheduling decisions, and our deployment system calls the monitoring APIs so that deployment actions are coordinated with monitoring state.
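One way a deployment system might coordinate with monitoring is sketched below. It assumes a monitoring API that can silence alerts for a host and report its health; `FakeMonitoring` and its methods are hypothetical in‑memory stand‑ins, and the real interfaces are not described in this article.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMonitoring:
    """In-memory stand-in for a monitoring API (hypothetical interface)."""
    unhealthy: set = field(default_factory=set)
    silenced: set = field(default_factory=set)

    def silence(self, host):
        self.silenced.add(host)

    def unsilence(self, host):
        self.silenced.discard(host)

    def is_healthy(self, host):
        return host not in self.unhealthy


def rolling_deploy(hosts, deploy, monitoring):
    """Deploy host by host and stop at the first failed health check.

    Returns (done, failed_host); failed_host is None on full success.
    """
    done = []
    for host in hosts:
        monitoring.silence(host)        # suppress alerts during the restart
        try:
            deploy(host)
        finally:
            monitoring.unsilence(host)  # always restore alerting
        if not monitoring.is_healthy(host):
            return done, host           # halt; remaining hosts are untouched
        done.append(host)
    return done, None


if __name__ == "__main__":
    mon = FakeMonitoring(unhealthy={"web03"})
    touched = []
    print(rolling_deploy(["web01", "web02", "web03", "web04"], touched.append, mon))
```

The point of the sketch is the ordering: alerts are muted only for the host being changed, and the monitoring state decides whether the rollout continues.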
Learning and Sharing
We need talented ops engineers with development skills, and such engineers dislike repetitive tasks. We therefore set aside 20‑40% of their time for research and project development and encourage them to explore new technologies.
Examples include offloading RSA computation for HTTPS to GPU accelerator cards to reduce CPU load, optimizing the Nginx code, and studying virtualization, Docker, and big‑data techniques to improve deployment and detect network attacks.
We promote continuous learning, open communication, and sharing of experiences to benefit individuals, teams, and the industry, which is also the purpose of our book.
Open‑Source Software Usage
We adopt a pragmatic approach to open‑source tools, mastering them at the code level to both use and troubleshoot them effectively, without blindly recreating functionality.
We have used Puppet, Ansible, Zabbix, God, Docker, Mesos, Etcd, and others. We used Puppet only for its DSL in our early deployment scripts, not for its client‑server configuration management. We use Ansible for ad hoc batch operations over SSH and have integrated it with our service tree so that tasks can be run conveniently against a group of hosts.
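To illustrate the service tree integration, the sketch below resolves a node of a service tree into a flat host list and renders it as an INI‑style Ansible inventory group. The tree contents, the host names, and the path‑to‑group naming rule are assumptions made for the example; they are not our real data or tooling.

```python
# Hypothetical service tree: inner nodes are dicts, leaves are host lists.
SERVICE_TREE = {
    "shop": {
        "web": ["web01.example.com", "web02.example.com"],
        "cache": ["cache01.example.com"],
    },
    "search": {
        "index": ["idx01.example.com"],
    },
}


def _collect(node):
    """Gather every host at or below a tree node."""
    if isinstance(node, list):
        return list(node)
    hosts = []
    for child in node.values():
        hosts.extend(_collect(child))
    return hosts


def hosts_under(tree, path):
    """Return the sorted hosts below a slash-separated path such as 'shop/web'."""
    node = tree
    for part in filter(None, path.split("/")):
        node = node[part]  # raises KeyError for an unknown path
    return sorted(_collect(node))


def to_inventory(tree, path):
    """Render one inventory group; '/' becomes '_' to keep the group name simple."""
    group = path.strip("/").replace("/", "_")
    return "\n".join([f"[{group}]"] + hosts_under(tree, path)) + "\n"


if __name__ == "__main__":
    print(to_inventory(SERVICE_TREE, "shop"))
```

Written to a file such as `inventory.ini`, the output could be passed to an ad hoc run along the lines of `ansible shop -i inventory.ini -m ping`.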
We use Etcd to register LVS real servers: each Nginx instance registers its status in Etcd, and LVS reads the resulting list to update its configuration and reload.
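The registration flow can be sketched without a running Etcd. In the example below, `FakeRegistry` is an in‑memory stand‑in for Etcd keys with a TTL, `LvsSyncer` plays the role of the LVS side, and the `real_server` line format is a placeholder; none of these names come from Etcd, LVS, or our own code. The backends heartbeat by re‑registering, and the syncer reloads only when the live set changes.

```python
import time


class FakeRegistry:
    """In-memory stand-in for TTL-based keys in Etcd (not the real client)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}

    def register(self, addr, ttl):
        """Register or refresh a backend; it expires unless refreshed in time."""
        self._expiry[addr] = self._clock() + ttl

    def alive(self):
        """Return the sorted addresses whose registration has not expired."""
        now = self._clock()
        return sorted(a for a, exp in self._expiry.items() if exp > now)


def render(servers):
    """Render a placeholder config: one line per live real server."""
    return "".join(f"real_server {addr}\n" for addr in servers)


class LvsSyncer:
    """Re-render the config and call reload only when the live set changes."""

    def __init__(self, registry, reload):
        self.registry = registry
        self.reload = reload
        self.current = None

    def sync(self):
        servers = self.registry.alive()
        if servers == self.current:
            return False  # nothing changed, skip the reload
        self.current = servers
        self.reload(render(servers))
        return True


if __name__ == "__main__":
    reg = FakeRegistry()
    reg.register("10.0.0.11:80", ttl=10)
    reg.register("10.0.0.12:80", ttl=10)
    LvsSyncer(reg, print).sync()
```

Skipping the reload when nothing changed matters in practice, since a reload on every poll would add needless churn to the load balancer.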
Initially we used Zabbix for monitoring, but when it could not meet our needs, we developed our own monitoring system.
Conclusion
We despise the term "Operator". Our goal is to provide automated platforms and tools that let developers self‑serve product deployment: No Ops!
Follow the Xiaomi ops team by reading the original article.
Article edited and published by Wen Guobing.
Image credit: basistechnologies.com, By Stuart Browne