From Ops Engineer to Cloud Leader: 10 Years of Growth at Alibaba
This article chronicles a senior Alibaba technologist’s decade‑long journey through operations, monitoring, resource management, and product development, sharing practical insights on system automation, team leadership, career promotion, and the mindset needed to evolve from a junior engineer to a cloud‑native solutions architect.
About Me
Song Jian (nickname Song Yi) started working in 2008 and has spent more than 12 years focused on operations. He joined Alipay in June 2010 and has worked on monitoring, SRE, resource management, and operations products, witnessing the evolution from hand-written scripts to tooling and finally to automated, intelligent operations.
2010.6‑2013.1 – Alipay (System Operations Department)
2013.2‑2015.12 – Technical Assurance (Unified BU for Alipay, Alibaba Cloud, Taobao, B2B, etc.)
2016.1‑present – Tianji (Responsible for the digital, automated, intelligent construction of Alibaba’s global data centers and operations system)
My Experience
1 Alipay
Keywords: open‑source monitoring, on‑call duty, emergency response
The author joined the monitoring team just as it was being formed. The team quickly built Alipay's first-generation monitoring system on Nagios. As the business grew, a single Nagios instance could no longer keep up, so Centreon was introduced to scale horizontally. To reduce manual configuration, the team adopted a template-based approach with automatic server discovery, so that new machines eventually received monitoring and alerts without manual work.
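The template-plus-discovery idea can be illustrated with a minimal sketch. This is not the team's actual tooling: the function name, the input format of the discovered hosts, and the `linux-server` template name are all assumptions for illustration. Only the `define host { ... }` object syntax and the `use` directive come from Nagios itself.

```python
from string import Template

# Nagios object definition; "use" inherits settings from a host template.
HOST_TEMPLATE = Template(
    "define host {\n"
    "    use        $template\n"
    "    host_name  $name\n"
    "    address    $address\n"
    "}\n"
)


def render_new_hosts(discovered, known, template="linux-server"):
    """Render host definitions only for discovered hosts not yet monitored.

    discovered: list of {"name": ..., "address": ...} dicts (hypothetical format)
    known: set of host names that already have a definition
    """
    new_hosts = [h for h in discovered if h["name"] not in known]
    return "\n".join(
        HOST_TEMPLATE.substitute(
            template=template, name=h["name"], address=h["address"]
        )
        for h in sorted(new_hosts, key=lambda h: h["name"])
    )


if __name__ == "__main__":
    discovered = [
        {"name": "web02", "address": "10.0.0.2"},
        {"name": "web01", "address": "10.0.0.1"},
    ]
    print(render_new_hosts(discovered, known={"web01"}))
```

Because each host only names the template it inherits from, changing a check for a whole class of machines means editing one template rather than every host entry.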
SMS alerts were separated from business messages. This required procuring dozens of SMS gateways and building a system that could both send and receive SMS, so that on-call staff could close an alert by replying to it.
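Closing an alert by SMS reply might look roughly like the sketch below. The reply format (`ACK <id>` or `CLOSE <id>`), the function name, and the in-memory alert store are assumptions made for illustration. The source does not describe the actual message format or storage.

```python
import re

# Hypothetical reply format: "ACK 101" or "close 101", case-insensitive.
ACK_PATTERN = re.compile(r"^\s*(?:ack|close)\s+(\d+)\s*$", re.IGNORECASE)


def handle_inbound_sms(body, alerts):
    """Close the alert referenced by an inbound SMS reply.

    alerts: dict mapping alert id -> "open" or "closed" (stand-in for a real store)
    Returns the closed alert id, or None if the message did not close anything.
    """
    match = ACK_PATTERN.match(body)
    if match is None:
        return None  # not a recognised reply
    alert_id = int(match.group(1))
    if alerts.get(alert_id) != "open":
        return None  # unknown or already closed alert
    alerts[alert_id] = "closed"
    return alert_id


if __name__ == "__main__":
    alerts = {101: "open", 102: "open"}
    print(handle_inbound_sms("ACK 101", alerts), alerts)
```

Ignoring unrecognised messages and unknown ids, rather than raising, keeps a stray reply from disrupting the alert pipeline.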
After a year spent stabilizing the process, external on-call staff were hired and trained, and a formal on-call and emergency response workflow was established. On-call duties were later brought in-house for security reasons, and the monitoring team continues to operate 24/7 from the Global Operations Command Center.
2 Technical Assurance
Keywords: monitoring unification, OD separation, resource management
In 2013 the department moved to the group level, and the first major project was the unification of monitoring across Alibaba (Alimonitor). The author was the sole monitoring engineer on the team, facing collaboration challenges that were eventually overcome through persistent communication.
Later, the OD (owner‑developer) separation project standardized more than ten products and hundreds of applications, improving tool stability. The author also built a resource lifecycle management system for the group’s testing environment, which tracked tens of thousands of servers, enforced expiration dates, and reclaimed idle resources, saving significant budget and supporting production during peak events.
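The reclamation logic described above can be sketched as follows. The record fields, the CPU-based idleness signal, and the 5% threshold are assumptions for illustration. The source says only that the system enforced expiration dates and reclaimed idle resources.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Server:
    hostname: str
    owner: str
    expires_on: date
    avg_cpu_percent: float  # hypothetical idleness signal


def reclaim_candidates(servers, today, idle_cpu_threshold=5.0):
    """Return (hostname, reason) pairs for servers that could be reclaimed.

    A server qualifies if its lease has expired, or if it is still within
    its lease but its average CPU usage is below the idle threshold.
    """
    candidates = []
    for server in servers:
        if server.expires_on < today:
            candidates.append((server.hostname, "expired"))
        elif server.avg_cpu_percent < idle_cpu_threshold:
            candidates.append((server.hostname, "idle"))
    return candidates


if __name__ == "__main__":
    fleet = [
        Server("test-a", "team-x", date(2015, 1, 1), 40.0),
        Server("test-b", "team-y", date(2015, 12, 31), 1.2),
        Server("test-c", "team-z", date(2015, 12, 31), 55.0),
    ]
    print(reclaim_candidates(fleet, today=date(2015, 6, 1)))
```

Recording a reason with each candidate lets owners be notified differently: an expired lease can be renewed, while an idle machine calls for a usage review.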
3 Tianji
Keywords: StarAgent, Argus, cloud monitoring
In early 2016 the author moved to the product technology team to work on StarAgent, a foundational product that provides a command channel for server operations. He introduced a plugin platform and a web terminal (interactive and batch modes) to make command execution more efficient and secure.
After consolidating many agents, the Argus agent was created to unify monitoring across the group and later extended to public cloud customers.
The author now focuses on delivering end‑to‑end cloud monitoring solutions, covering cloud resources, hosts, business services, and network links.
My Growth
The author describes a personal philosophy of "doing things → doing projects → building products", and emphasizes the shift from "doing things right" to "doing the right things", driven by business value.
About Promotion
The author discusses why promotions matter at Alibaba, the pitfalls of over-emphasizing level, and ways to adjust one's mindset after a failed promotion.
About Transfer
The author shares three moments when he considered a transfer, the reasons behind each, and advice on talking with your manager before making a move.
Doing Things
The author highlights the need to connect daily tasks to larger quarterly and yearly goals, and to balance the 99% (necessary work) with the 1% (breakthrough work).
Leading Teams
The author outlines three principles for leading a team: defining a clear team mission, attracting talent, and collaborating across teams to achieve larger objectives.
Conclusion
The author reflects on a rewarding ten‑year journey at Alibaba, expressing gratitude to colleagues and confidence for the next decade.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.