Inside YY's Security Ops: Real-World Incident Stories and Architecture

This article shares YY's security operations journey: real incident response scenarios, the evolution of its security infrastructure from 2012 onward, and the key factors considered when building a robust security‑ops system, including DDoS protection, WAF, vulnerability scanning, intrusion detection, and data‑driven automation.

Efficient Ops

1. Overview

The article has three parts: personal security‑ops firefighting experiences that many security or ops engineers will recognize, YY's security‑ops development from zero to one (starting in 2012 and spanning four years), and the factors considered when building a security‑ops system.

2. YY's Security‑Ops Perspective

YY's core business is live streaming and games (e.g., Huya Live, YYlive, BIGO LIVE). Security issues are tightly linked to the business: streamer IP hijacking, audio/video hijacking, and other platform‑specific attacks. The mission of security‑ops is to keep the business running securely and stably across servers, network, hardware, and the underlying platform. Quality, cost, and efficiency are essential, but without security they are meaningless.

Whether a dedicated security‑ops team is needed depends on company size and stage; early‑stage startups often rely on system admins or third‑party services.

Incident Scenarios

Intrusion : With tens of thousands of servers, even a 0.01% chance of intrusion per server per day translates into several incidents daily. An intruder can modify switch configurations, cause network outages, or leak data.

Attack : DDoS and CC (Challenge Collapsar, i.e. application‑layer HTTP flood) attacks are common and cheap to buy. A 15‑20 Gbps DDoS attack costs about 120 CNY, so a 20‑server cluster can be taken down for a few hundred yuan. High‑volume packet floods can reach 700 kpps using cheap bandwidth.

Cracking : Mobile malware, audio/video hijacking, and other exploits target business workflows to achieve malicious goals.

Vulnerability : Open‑source component bugs (e.g., Heartbleed) require rapid assessment and patching; business‑level bugs such as audio hijacking or IP leakage also demand urgent fixes.

Data Breach : External intrusions or insider leaks can expose plaintext passwords and other sensitive data.

Hijacking : Hijacked traffic disrupts user experience, especially in audio/video services.
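The arithmetic behind the intrusion figure in the list above can be sketched in a few lines. This is only an illustration: the fleet sizes are hypothetical stand-ins for "tens of thousands of servers", and it assumes the 0.01% figure is a per-server, per-day probability with servers compromised independently.

```python
def expected_daily_incidents(server_count: int, daily_probability: float) -> float:
    """Expected number of intrusions per day, assuming each server is
    compromised independently with the same daily probability."""
    return server_count * daily_probability


if __name__ == "__main__":
    # 0.01% written as a fraction is 0.0001. Fleet sizes are hypothetical.
    for servers in (10_000, 30_000, 50_000):
        print(servers, "servers ->", expected_daily_incidents(servers, 0.0001), "incidents/day")
```

Under these assumptions, a fleet of about 30,000 servers works out to roughly three incidents per day, which matches the "several daily incidents" described above.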

3. Firefighting Cases

Case 1: Server connections drop every 5 seconds

Thousands of servers in two data centers experienced intermittent connection failures. Initial suspicion was an attack, but limited visibility made diagnosis hard. The issue turned out to be a pulse‑style attack sending millions of packets to the iOS cluster, causing switch congestion.
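One way to make a pulse pattern like this visible is to look for short bursts in per-second packet counters and check whether they recur at a steady interval. The sketch below is an assumption about how such a check could look, not YY's tooling; the function names, the median baseline, and the 10x threshold are all hypothetical.

```python
from statistics import median


def find_bursts(pps_samples, factor=10.0):
    """Return indices of samples whose packet rate exceeds factor x the median.

    pps_samples: packets-per-second readings, one per second (hypothetical input).
    """
    baseline = max(median(pps_samples), 1)
    return [i for i, pps in enumerate(pps_samples) if pps > factor * baseline]


def burst_intervals(burst_indices):
    """Seconds between the starts of consecutive bursts.

    A near-constant interval suggests a pulse-style attack rather than
    an organic traffic spike.
    """
    starts = [
        idx for n, idx in enumerate(burst_indices)
        if n == 0 or idx - burst_indices[n - 1] > 1
    ]
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


if __name__ == "__main__":
    # Synthetic data: quiet traffic with a spike every 5 seconds.
    samples = [2_000_000 if second % 5 == 0 else 1_000 for second in range(20)]
    print(burst_intervals(find_bursts(samples)))
```

The median is used as the baseline because a handful of very large spikes barely moves it, whereas a mean would be dragged upward by the attack itself.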

Case 2: Nginx fork queue blockage

During heavy traffic, Nginx returned many error codes, and the fork queue became blocked, leading to request timeouts. Manual IP‑rate limiting was applied to keep services alive, though it risked affecting legitimate users.
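The manual per-IP limiting described here can be automated with a sliding-window counter. The following is a minimal sketch under stated assumptions: the class name, limits, and in-memory storage are hypothetical, and a production setup would more likely enforce this at the proxy or WAF layer.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter: at most max_requests per IP per window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = PerIpRateLimiter(max_requests=2, window_seconds=1.0)
    for t in (0.0, 0.1, 0.2, 1.05):
        print(t, limiter.allow("203.0.113.7", now=t))
```

The risk to legitimate users mentioned above is visible in the design: many real users behind one NAT address share a single counter, so the limit has to be tuned per service.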

Case 3: Gift‑seizing by bots during promotional events

Automated bots grabbed free tickets and gifts, causing massive loss of promotional resources. User‑behavior profiling and fingerprinting were later used to filter out non‑human participants.
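A very simple form of fingerprint-based filtering is to flag devices that claim rewards on behalf of an implausible number of accounts. The sketch below assumes a hypothetical event format of (device fingerprint, account ID) pairs and an arbitrary threshold; real behavior profiling combines many more signals.

```python
from collections import defaultdict


def suspicious_devices(claims, max_accounts_per_device=3):
    """Return device fingerprints used by more accounts than the threshold.

    claims: iterable of (device_fingerprint, account_id) pairs, one per
    gift or ticket claim (hypothetical event format).
    """
    accounts_by_device = defaultdict(set)
    for device_fingerprint, account_id in claims:
        accounts_by_device[device_fingerprint].add(account_id)
    return {
        device
        for device, accounts in accounts_by_device.items()
        if len(accounts) > max_accounts_per_device
    }


if __name__ == "__main__":
    events = [("dev-A", f"user{i}") for i in range(50)] + [("dev-B", "alice")]
    print(suspicious_devices(events))
```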

Case 4: Rapid patching of critical vulnerabilities (e.g., Struts2, Heartbleed)

With tens of thousands of servers, coordinated patching is challenging. YY evaluated impact, prioritized upgrades, and balanced risk versus operational cost.
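The prioritization step can be pictured as sorting the affected hosts into a patch order. This sketch is illustrative only: the `Host` fields and the ranking rule (internet-facing first, then business-critical) are assumptions, not the criteria YY actually used.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    vulnerable: bool         # runs the affected component version
    internet_facing: bool    # reachable from outside the data center
    business_critical: bool  # outage or compromise hurts core services


def patch_order(hosts):
    """Return vulnerable hosts, most urgent first.

    Sort keys use `not flag` so that True values sort ahead of False ones.
    """
    affected = [h for h in hosts if h.vulnerable]
    return sorted(
        affected,
        key=lambda h: (not h.internet_facing, not h.business_critical, h.name),
    )


if __name__ == "__main__":
    fleet = [
        Host("batch-01", True, False, False),
        Host("web-01", True, True, True),
        Host("db-01", True, False, True),
        Host("cache-01", False, False, True),
    ]
    print([h.name for h in patch_order(fleet)])
```

Hosts that are not vulnerable drop out entirely, which is where most of the operational savings come from on a large fleet.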

Case 5: Massive IP packet storms overwhelming switches

Compromised servers generated abnormal traffic, leading to switch overloads and network anomalies.

Case 6: Persistent malicious processes that cannot be killed

Advanced rootkits used watchdog mechanisms to respawn after termination, making removal difficult.

Case 7: OS session table exhaustion

High connection counts filled session tracking tables, degrading service performance; tuning system parameters raised the attack threshold but did not eliminate the issue.
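Because tuning only raises the threshold, it helps to watch how full the table is. The sketch below assumes the session table in question is the Linux netfilter connection-tracking table, which the original does not state; the 80% alert threshold is likewise arbitrary.

```python
from pathlib import Path


def table_usage(current_entries: int, max_entries: int) -> float:
    """Fraction of the session table currently in use."""
    if max_entries <= 0:
        raise ValueError("max_entries must be positive")
    return current_entries / max_entries


def read_conntrack_usage(proc_dir: str = "/proc/sys/net/netfilter") -> float:
    """Read the conntrack counters on Linux (assumes nf_conntrack is loaded)."""
    base = Path(proc_dir)
    count = int((base / "nf_conntrack_count").read_text())
    limit = int((base / "nf_conntrack_max").read_text())
    return table_usage(count, limit)


def should_alert(usage: float, threshold: float = 0.8) -> bool:
    """True once usage reaches the (hypothetical) alert threshold."""
    return usage >= threshold
```

Calling `should_alert(read_conntrack_usage())` from a periodic job gives an early warning before the table actually fills and new connections start being dropped.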

4. YY Security‑Ops Development Timeline

Starting in December 2012, YY built its security‑ops system step by step:

2012‑2013: Launched distributed WAF across 8‑10 data centers and began vulnerability scanning.

2014: Implemented automatic malware analysis and a security‑incident response portal.

2015: Deployed anti‑abuse mechanisms (against bot "brushing" of rewards and traffic) and cloud‑based DDoS protection (using DPDK).

2016: Introduced user‑behavior profiling, IP fingerprinting, device fingerprinting, and a big‑data platform for analytics.

5. Security‑Ops Achievements

YSRC Portal

A dedicated portal for collecting vulnerability reports and rewarding contributors.

YSEC_WAF

The WAF inspects every request, applies challenges or rate limiting where needed, and returns a standard HTTP 412 response for blocked traffic. Real‑time interception data and visual dashboards were added to make debugging easier.
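The decision flow described for the WAF can be sketched as a single function that maps a request to allow, challenge, or block. Everything here is a hypothetical illustration: the two signature patterns, the function name, and the way rate-limit and challenge state are passed in are assumptions, and only the 412 status for blocked traffic comes from the description above.

```python
import re
from urllib.parse import unquote_plus

# Toy signatures for illustration only; a real rule set is far larger.
BLOCK_PATTERNS = [
    re.compile(r"(?i)union\s+select"),  # naive SQL injection marker
    re.compile(r"\.\./"),               # path traversal
]


def decide(path_and_query: str, over_rate_limit: bool, passed_challenge: bool):
    """Return (action, status): status is 412 for blocked requests, else None."""
    decoded = unquote_plus(path_and_query)
    if any(pattern.search(decoded) for pattern in BLOCK_PATTERNS):
        return ("block", 412)
    if over_rate_limit and not passed_challenge:
        return ("challenge", None)
    return ("allow", None)


if __name__ == "__main__":
    print(decide("/item?id=1%20UNION%20SELECT%20password", False, False))
    print(decide("/live/room/42", True, False))
    print(decide("/live/room/42", True, True))
```

Decoding the URL before matching matters: without it, an encoded payload such as `%20UNION%20SELECT` would slip past the whitespace-based pattern.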

Vulnerability Scanning Architecture

A custom scanning cluster performs port and application‑level checks across all servers, generating daily reports and prioritizing patches.
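The port-level part of such a scan reduces to attempting a TCP connection to each port. This is a minimal single-host sketch, not the scanning cluster itself; the function name and timeout are assumptions, and application-level checks are out of scope here.

```python
import socket


def open_ports(host: str, ports, timeout: float = 0.5):
    """Return the subset of ports on host that accept a TCP connection."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception.
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found


if __name__ == "__main__":
    # Scan only hosts you operate; localhost is used here as a safe default.
    print(open_ports("127.0.0.1", [22, 80, 443, 3306, 6379]))
```

Comparing each day's result with an approved port list per server is one straightforward way to turn raw scan output into the prioritized report described above.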

Intrusion Detection System

Network‑ and application‑layer IDS alerts are sent to the SOC backend and stored for analysis.

Vulnerability Alert Emails

Automated, UI‑enhanced email notifications inform developers of newly discovered vulnerabilities.

Cloud DDoS Technology

Since 2015, YY has used a cloud‑based DDoS mitigation system built on Intel DPDK (Data Plane Development Kit). Traffic is mirrored from the core switches and analyzed; malicious flows are scrubbed, and the cleaned traffic is re‑injected so that legitimate traffic remains unaffected.
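The analyze, scrub, and re-inject cycle can be modeled in miniature. The sketch below is a toy Python model of the decision logic only, assuming a single per-source packet threshold per analysis window; a production scrubber runs in a high-speed packet-processing data plane and uses far richer detection than this.

```python
from collections import Counter


def scrub(packets, max_packets_per_source: int):
    """Split one analysis window of traffic into clean packets and blocked sources.

    packets: list of (source_ip, payload) tuples (hypothetical representation).
    Returns (clean_packets, blocked_sources); clean packets would be re-injected.
    """
    per_source = Counter(source for source, _ in packets)
    blocked = {src for src, count in per_source.items() if count > max_packets_per_source}
    clean = [packet for packet in packets if packet[0] not in blocked]
    return clean, blocked


if __name__ == "__main__":
    window = [("198.51.100.9", b"x")] * 1000 + [("192.0.2.10", b"hello")]
    clean, blocked = scrub(window, max_packets_per_source=100)
    print(len(clean), blocked)
```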

6. Importance of Reporting and Data

Big data is essential for security analytics. YY released a security big‑data report showing 400 TB of cleaned traffic, 2.2 billion attack events, and insights for future system improvements.

7. Factors Considered When Building the Security‑Ops System

Cost‑Benefit Ratio : Evaluate investment versus output, especially in early stages where resources are limited.

Key Contradictions : Identify the most pressing business‑security conflict and address it first.

Not Pursuing 100% : Achieving 80% coverage often solves the majority of problems with reasonable effort.

Phased Implementation : Incremental improvements yield higher ROI than attempting a perfect solution from the start.

Data‑Driven, Visualization, Automation : Quantify security work, provide dashboards, and automate scanning and mitigation to scale with growth.

Balancing Security and Business : Adjust security controls (e.g., phone‑binding for registration) to align with business growth and user experience.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
