Rethinking Operations: The “Third Kind” of SRE at Lianjia
The article shares the author’s experience transitioning from private to public and hybrid clouds at Lianjia, introduces a “third kind” of operations that blends traditional and internet‑based practices, and discusses containers, DNS‑based naming, and automation tools to build adaptable, cost‑effective infrastructure.
1. Introduction
The author, formerly a technical director at Sina and Huawei, joined Lianjia as SRE lead, bringing experience from private, public, and hybrid cloud environments to propose new operational ideas.
2. Lianjia’s “Third Kind” of Operations
Lianjia describes a “third kind” of intermediary that combines offline brokerage with online product features, positioning its operations between traditional IT and fully internet‑based models. This approach seeks a balanced, gray‑area operational state rather than an extreme.
3. New Technologies
3.1 Misconceptions About New Tech
Many assume containers or OpenStack are the only solutions for cloud computing, but the author argues that such views are overly simplistic.
3.2 Understanding Containers
Using the analogy of shipping containers, the author explains that containers provide a uniform size for transporting diverse workloads, allowing teams to manage heterogeneous services without handling each application’s internal details.
3.3 When to Use Containers
In large organizations with hundreds of teams using varied languages and protocols, containers help standardize deployment. However, if the environment is homogeneous (e.g., all PHP on Apache), containers may add unnecessary cost.
3.4 Container‑based Tomcat
For small teams, using Tomcat as a container can be simpler than Docker, leveraging CGroup for resource isolation while keeping familiar tooling like Ansible, Puppet, or SaltStack.
4. Choosing the Best Solution
The author emphasizes evaluating the trade‑offs of each technology for the specific team and workload, acknowledging that future advances may shift the optimal choice.
4.1 The “RASH” Solution
To mitigate RPC‑induced blocking, the team built a library preloaded via LD_PRELOAD that intercepts network calls, routes them through a Socks4 proxy, and fails slow responses quickly, preventing cascading delays.
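The actual library is preloaded C code; as a rough illustration of the fail‑fast behavior it provides (not Lianjia's implementation), a Python sketch with an assumed deadline constant:

```python
import socket

# Assumed deadline; in practice this would be tuned per service.
FAIL_FAST_TIMEOUT = 0.2  # seconds

def call_service(host, port, payload):
    """Send payload and return the reply, failing fast on slow peers.

    A connect or read that exceeds the deadline raises immediately
    instead of blocking the caller, so one slow dependency cannot
    stall the whole request chain.
    """
    try:
        with socket.create_connection((host, port),
                                      timeout=FAIL_FAST_TIMEOUT) as s:
            s.settimeout(FAIL_FAST_TIMEOUT)
            s.sendall(payload)
            return s.recv(4096)
    except socket.timeout:
        # Surface the error right away rather than queueing behind it.
        raise TimeoutError(f"{host}:{port} exceeded {FAIL_FAST_TIMEOUT}s")
```

The preloaded library achieves the same effect transparently, without touching application code, by interposing on the libc socket functions.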
4.2 RASH Algorithm
The algorithm tracks average response times and limits concurrent connections based on a configurable timeout, dynamically adjusting queue depth to avoid overload.
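A minimal sketch of that idea, with names and constants that are assumptions rather than Lianjia's actual code: recent response times are averaged over a sliding window, and the permitted concurrency shrinks as the average eats into the timeout budget.

```python
from collections import deque

class AdaptiveLimiter:
    """Illustrative RASH-style limiter: track recent response times and
    reduce the allowed queue depth as the average approaches the
    configured timeout, shedding load before overload sets in."""

    def __init__(self, timeout=0.5, max_concurrency=64, window=100):
        self.timeout = timeout
        self.max_concurrency = max_concurrency
        self.samples = deque(maxlen=window)  # recent response times (s)
        self.in_flight = 0

    def avg(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def limit(self):
        # Scale depth by the remaining timeout headroom; never below 1.
        headroom = max(0.0, 1.0 - self.avg() / self.timeout)
        return max(1, int(self.max_concurrency * headroom))

    def try_acquire(self):
        if self.in_flight >= self.limit():
            return False  # shed load instead of queueing further
        self.in_flight += 1
        return True

    def release(self, elapsed):
        self.in_flight -= 1
        self.samples.append(elapsed)
```

With an empty window the full depth is available; once average latency reaches the timeout, the limiter collapses to serial calls.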
5. Alternative Naming Services
5.1 Naming Service Overview
Traditional DNS acts as a basic naming service; the team also uses etcd and SkyDNS for service discovery.
5.2 DNS‑Based Naming
By mapping services to domain names, code changes are minimized. To handle DNS caching issues, a DNSMasq layer is added, allowing remote cache invalidation and fast updates.
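As an illustration only (not Lianjia's actual configuration), a local DNSMasq layer forwarding an internal zone might look like this; the zone name and upstream address are assumed:

```ini
# /etc/dnsmasq.conf (illustrative values)
server=/lianjia.com/10.0.0.53   # forward the internal zone upstream
cache-size=10000                # cache answers locally on each host
min-cache-ttl=5                 # keep cached answers short-lived
```

Sending dnsmasq a SIGHUP clears its cache, which is what makes remote, near-instant invalidation possible when a service's address changes.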
5.3 Handling Ports
Since DNS cannot directly encode ports, the team adopts SRV records (where supported) or encodes the port in the first sub‑domain (e.g., 3306.mysql.lianjia.com).
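A hypothetical helper for the port‑in‑subdomain convention described above (the function name is an illustration, not part of any stated API): the first DNS label carries the port, the remainder is the service name.

```python
def split_port_name(fqdn):
    """'3306.mysql.lianjia.com' -> ('mysql.lianjia.com', 3306).

    Raises ValueError if the first label is not a numeric port.
    """
    first, rest = fqdn.split(".", 1)
    if not first.isdigit():
        raise ValueError(f"no port label in {fqdn!r}")
    return rest, int(first)
```

Clients resolve the remainder as an ordinary A record and connect on the parsed port, so only the lookup step changes.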
6. Configuration Management and Automation
The author, a translator of “Running Ansible,” compares Puppet, Ansible, and SaltStack, favoring Ansible’s thin abstraction for large‑scale Linux fleets, while noting its memory usage and static configuration limitations.
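To illustrate the "thin abstraction" point, a minimal Ansible playbook (illustrative values, not Lianjia's actual playbooks) reads close to the shell commands an operator would run by hand, pushed over SSH with no agent on the target:

```yaml
# site.yml - illustrative example
- hosts: webservers
  become: yes
  tasks:
    - name: Ensure nginx is installed
      yum:
        name: nginx
        state: present
    - name: Ensure nginx is running
      service:
        name: nginx
        state: started
        enabled: yes
```

The trade-off noted above follows from this design: the controlling node does the work, so memory use grows with fleet size, and the static inventory must be kept in sync with the environment.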
7. Conclusion
Real‑world operations exist on a spectrum between idealized extremes; teams must assess new technologies for cost, benefit, and fit, and ensure solutions are truly runnable in production.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original, widely read technical articles. We focus on operations transformation, accompanying you through your operations career as we grow together.