How Sogou Built a Scalable Big Data Platform: Lessons from a User Perspective
This article shares Sogou's journey in building a large-scale big data platform, covering a business overview, the evolution of its operations infrastructure, productization practices, security measures, and practical tips for medium-sized teams seeking to extract value from data.
Feng Jin, Platform Technology Director at Sogou, has been responsible for search, operations, cloud, and big data product development. He now leads the team building Sogou's big data foundation platform, data management, and applications.
Preface
The talk "Building Sogou's Big Data Platform from a User Perspective" explores how to find value in big data, especially for medium‑ or small‑scale company data teams, and outlines three parts: an overview of Sogou's big data business, an introduction to the basic operations platform, and productization practices.
Sogou Big Data Business Overview
Sogou is a typical big data company: its search engine processes hundreds of billions of records (now roughly 200 billion), and Sogou Input Method serves over 400 million daily active users, handling massive concurrency on machines with only 4 GB of memory.
The evolution of big data at Sogou follows four stages:
Early Hadoop era (pre‑2010): MapReduce for batch processing.
Hive emergence (around 2010): Lowered entry barrier for BI, data engineers, and SQL users.
Real‑time computing surge (around 2012): Driven by demands such as Alibaba's Double 11 shopping festival.
Cloud‑native AI components (last two years): Machine‑learning and advertising algorithms become easier to deploy.
Sogou's own stages mirror this:
Search‑centric big data era: massive data collection, storage, sorting, and link analysis.
Hadoop‑aligned era: introduction of core products, reporting, and real‑time computation.
AI‑driven era (since 2016): new commercial demands after Sogou's IPO.
The platform consists of data sources, agents, storage, computation, and applications, with extensive optimizations in module connections, service monitoring, and resource scheduling.
Sogou Basic Operations Platform Introduction
Initially (around 2012) Sogou operated a few thousand machines using open‑source tools. As scale grew, the team shifted to a self‑developed platform to achieve:
Flexibility and integration with business, OA, finance, and budgeting systems, avoiding the need to open multiple tools for troubleshooting.
Usability by consolidating business‑view and user‑view functions into a Resource‑Operation‑Center (ROC), which manages clusters, bandwidth, storage, monitoring, security audit, procurement, and ticketing.
Key concepts include:
Transition from machine‑level to cluster‑level management, standardizing leaf nodes as minimal service units.
Implementation of daily‑operation components such as system reinstall, machine guarantee, IP/NAT configuration, and software environment provisioning.
Process and log definitions that enable lightweight integration of any module into the operations platform.
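The talk does not show the descriptor format, but the idea of "lightweight integration of any module" can be sketched as a small validation of a hypothetical service-unit descriptor. The field names (`name`, `start_cmd`, `log_path`, `port`) are assumptions for illustration, not Sogou's actual schema:

```python
def validate_service_unit(unit: dict) -> list:
    """Check a (hypothetical) leaf-node service descriptor.

    Returns a list of problems; an empty list means the module
    can be registered with the operations platform.
    """
    errors = []
    # Required fields: without these, the platform cannot start the
    # process or collect its logs.
    for field in ("name", "start_cmd", "log_path"):
        if not unit.get(field):
            errors.append("missing required field: %s" % field)
    # Optional port, but if given it must be a valid TCP port.
    if "port" in unit and not (0 < unit["port"] < 65536):
        errors.append("port out of range")
    return errors

unit = {
    "name": "search-frontend",
    "start_cmd": "/opt/app/bin/start.sh",
    "log_path": "/data/logs/search-frontend/app.log",
    "port": 8080,
}
print(validate_service_unit(unit))  # → []
```

The point of such a contract is that any team that can fill in a start command and a log path gets monitoring and operations support for free, without custom integration work.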
Monitoring is divided into two categories:
Black‑box monitoring: Simulates user interactions, supports plugins for TCP, MySQL, Redis, etc., with semantic‑based alert policies.
White‑box monitoring: Collects full system logs and metrics, enabling detailed fault analysis and flexible alert composition.
Alert strategies range from lightweight notifications (email, phone) to complex condition‑based rules that trigger when a certain proportion of machines fail.
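A proportion-based alert rule of the kind described above can be sketched in a few lines. This is a minimal illustration of the condition logic, not Sogou's alerting implementation; the threshold value is an assumption:

```python
def should_alert(machine_states: dict, threshold: float = 0.3) -> bool:
    """Fire an alert when the failing fraction of a cluster reaches threshold.

    machine_states maps machine name -> True (healthy) / False (failed).
    A per-machine page would be too noisy at cluster scale; alerting on
    the failed proportion distinguishes a single bad host from an outage.
    """
    if not machine_states:
        return False  # no data is handled by a separate staleness check
    failed = sum(1 for ok in machine_states.values() if not ok)
    return failed / len(machine_states) >= threshold

# Two of three machines down: 66% failed, above the 50% threshold.
print(should_alert({"m1": False, "m2": False, "m3": True}, threshold=0.5))  # → True
```

Lightweight notifications (email) and escalations (phone) can then be attached to different thresholds of the same rule.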
Sogou Big Data Productization Practice
The team focuses on turning the platform into a product that delivers value across the company.
Key initiatives include:
Self‑developed Hadoop permission management for multi‑tenant security, supporting password and IP‑based authentication.
Strict data usage approval and supervision workflows.
Data encryption and masking for cross‑product data exchange.
To improve data discoverability, Sogou built a "data cloud" that automatically discovers datasets via side‑channel monitoring, tags them, and presents a searchable interface similar to a search engine.
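The search-engine-like interface over tagged datasets amounts to an inverted index from tags to dataset paths. A minimal sketch, with invented example paths and tags (the real "data cloud" discovers and tags datasets automatically via side-channel monitoring):

```python
from collections import defaultdict

class DatasetIndex:
    """Toy inverted index: tag -> set of dataset paths."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, path: str, tags):
        # Normalize tags so searches are case-insensitive.
        for tag in tags:
            self._by_tag[tag.lower()].add(path)

    def search(self, *tags):
        """Return datasets carrying ALL of the given tags."""
        sets = [self._by_tag[t.lower()] for t in tags]
        return set.intersection(*sets) if sets else set()

idx = DatasetIndex()
idx.add("/data/search/clicks", ["search", "clicks", "daily"])
idx.add("/data/input/usage", ["input-method", "daily"])
print(idx.search("daily", "search"))  # → {'/data/search/clicks'}
```

Intersecting tag postings is the same core operation a text search engine performs over terms, which is why the interface can feel familiar to users.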
Additional efforts:
Structured metadata (size, file count, update time) to enable quick data lifecycle management.
Path and dependency mapping for data processing tasks.
Data warehouse layer built on Hive/Spark/SQL for ad‑hoc analysis.
Public schema and demo data dictionaries to encourage reuse.
Mobile OA integration for data request approvals.
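The lifecycle-management use of structured metadata can be illustrated with a stale-dataset check. The record shape (`path`, `updated`) and the 90-day cutoff are assumptions for the sketch; the article only states that size, file count, and update time are tracked:

```python
from datetime import datetime, timedelta

def stale_datasets(metadata, now, max_age_days: int = 90):
    """Return paths of datasets not updated within max_age_days.

    metadata is a list of dicts with 'path' and 'updated' (datetime).
    Stale paths are candidates for archival or deletion review.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [m["path"] for m in metadata if m["updated"] < cutoff]

meta = [
    {"path": "/data/a", "updated": datetime(2018, 1, 1)},
    {"path": "/data/b", "updated": datetime(2018, 4, 1)},
]
print(stale_datasets(meta, now=datetime(2018, 4, 20)))  # → ['/data/a']
```

With size and file count in the same records, the same scan can also flag small-file explosions or runaway growth before they become cluster problems.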
Product‑level solutions were created to lower the barrier for non‑technical users, including:
A SaaS‑style data solution with SDKs for easy integration.
One‑click data ingestion into the warehouse after cleaning.
A BigQuery‑like SQL engine for self‑service analytics.
Automated report generation tied to business metrics.
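The self-service analytics flow above (ingest cleaned events, query with SQL, derive a business metric) can be illustrated end to end with SQLite standing in for the actual engine. The table and metric here (daily active users over an `events` table) are invented for the example:

```python
import sqlite3

# In-memory database standing in for the warehouse SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, action TEXT, day TEXT)")

# Cleaned event rows, as one-click ingestion might produce them.
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("u1", "search", "2018-04-20"),
    ("u2", "search", "2018-04-20"),
    ("u1", "click",  "2018-04-20"),
])

# A self-service metric query: daily active users for one day.
dau = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM events WHERE day = '2018-04-20'"
).fetchone()[0]
print(dau)  # → 2
```

An automated report is then just such queries scheduled and rendered, which is why tying reports to business metrics falls out of the same engine.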
Finally, personalized data recommendations and push notifications were introduced to help employees focus on the most relevant data each day.
Note: This article is compiled from Professor Fang Hao's talk at GOPS 2018 Shenzhen. Editor: Huang Xiaoxuan