
Designing Scalable and Reliable Backend Services at English Fluently: Architecture, Service Discovery, Monitoring, and Autoscaling

This article shares the engineering team’s experience of building a high‑growth, reliable backend for English Fluently, covering inter‑service communication with gRPC, service discovery, Docker‑based deployment, health‑checking, monitoring, autoscaling, Kubernetes orchestration, and multi‑cell availability strategies.

Liulishuo Tech Team

English Fluently’s user base has been growing rapidly, and the engineering team had to keep the service stable and reliable throughout that growth. This article outlines the challenges they faced and the solutions they adopted, as a reference for readers facing similar problems.

Interoperability

The internal teams (algorithm, data, backend) use different programming languages. To simplify cross‑team service consumption, the team evaluated Thrift and gRPC, ultimately choosing gRPC largely for its ability to attach extra metadata, such as trace IDs, to each call. They piloted it on low‑traffic services before rolling it out to high‑traffic ones, encountering and quickly fixing memory leaks in Python and Java as well as an incompatibility with Ruby Unicorn’s forking model.
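The trace metadata mentioned above rides along with every cross‑service call: gRPC represents request metadata as a sequence of key–value pairs. Below is a minimal illustrative sketch of that idea in plain Python (the helper name and the `x-trace-id` key are assumptions, not the team’s actual implementation):

```python
# Sketch: inject a trace ID into gRPC-style metadata, i.e. a sequence
# of (key, value) pairs. Reusing the caller's trace ID lets one request
# be followed across services; a new ID is minted at the edge otherwise.
import uuid

def inject_trace_metadata(metadata=None, trace_id=None):
    """Return metadata pairs with a trace ID appended.

    metadata: existing sequence of (key, value) pairs, or None.
    trace_id: upstream trace ID to propagate, or None to mint one.
    """
    pairs = list(metadata or [])
    pairs.append(("x-trace-id", trace_id or uuid.uuid4().hex))
    return tuple(pairs)
```

In a real gRPC client this logic would typically live in a client interceptor so every outgoing call is tagged automatically.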

Service Discovery

When services scale across many machines, callers need a way to locate service instances. Two approaches were considered: a service registry (dynamic address lookup) and HAProxy‑based traffic forwarding. Because implementing registry clients for every language would have been costly, and gRPC’s own load‑balancing support was still on its developers’ roadmap, the team adopted HAProxy driven by configuration changes, later handling dynamic IP changes as the number of services grew.
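With this approach, callers talk to a fixed local address and HAProxy forwards to the current set of instances. An illustrative `haproxy.cfg` fragment (service name, ports, and addresses are made up) might look like:

```
listen grpc-scoring-service
    bind 127.0.0.1:6565
    mode tcp            # gRPC rides on HTTP/2, proxied here at the TCP level
    balance roundrobin
    server scoring-1 10.0.1.11:6565 check
    server scoring-2 10.0.1.12:6565 check
```

Scaling a service out then means editing this backend list and reloading HAProxy, which is exactly the configuration‑change workflow described above.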

Standardization

Different teams relied on varied environments, making deployment scripts complex and upgrades error‑prone. The team moved to Docker for all services, enabling consistent CI pipelines. Images from the development registry are automatically synchronized to the production registry for release branches, while non‑production branches remain unsynced due to volume.
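A minimal Dockerfile illustrates how one image can serve every environment; the base image, file names, and start command below are assumptions for the sketch, not the team’s actual setup:

```
FROM python:3.9-slim

WORKDIR /app

# Install dependencies first so layer caching keeps CI rebuilds fast.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# The same image runs everywhere; configuration comes from the environment.
CMD ["python", "server.py"]
```

Because every service builds an image like this in CI, deployment scripts collapse into “pull the image, run the container,” regardless of the team’s language or toolchain.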

Health Checks

Increasing service count revealed unstable response times that could exhaust caller thread pools. The team introduced circuit breakers, timeout settings, and graceful degradation. Because HAProxy could only perform TCP‑level health checks against the gRPC backends, each gRPC service now exposes an application‑level health‑check endpoint that a monitoring service polls; instances that repeatedly fail the check are automatically restarted via a shared library.
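The core of such a poller is small. This sketch (function and parameter names are illustrative) shows the failure‑counting logic, with the actual probe injected so the transport, whether the gRPC health protocol or plain HTTP, stays pluggable:

```python
# Sketch of a health-checking loop: count consecutive probe failures per
# instance and report the ones that crossed the restart threshold.

def find_unhealthy(instances, probe, failure_threshold=3):
    """Return instances whose probe failed `failure_threshold` times in a row.

    instances: mapping of instance name -> consecutive-failure count so far.
    probe: callable(name) -> bool, True when the instance answers healthily.
    """
    to_restart = []
    for name, failures in instances.items():
        if probe(name):
            instances[name] = 0          # healthy again, reset the streak
        else:
            instances[name] = failures + 1
            if instances[name] >= failure_threshold:
                to_restart.append(name)  # hand off to the restart library
    return to_restart
```

Requiring several consecutive failures before restarting avoids flapping on a single slow response.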

Monitoring, Alerting, and Log Collection

To gain visibility into latency, error rates, and request volume, each service provides a /metrics endpoint collected by Prometheus and visualized in Grafana, with alerts routed to owners. Logs, including alerts and exceptions, are shipped via Fluentd to an Elasticsearch cluster for searchable analysis.
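On the collection side, Prometheus only needs to be told where the `/metrics` endpoints live. An illustrative `prometheus.yml` fragment (job name, interval, and targets are placeholders) might be:

```
scrape_configs:
  - job_name: scoring-service
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - 10.0.1.11:8080
          - 10.0.1.12:8080
```

Each scrape records latency, error, and volume metrics as time series, which Grafana dashboards and alert rules then consume.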

Elastic Scaling

Traffic spikes required dynamic resource allocation. The team leveraged AWS Auto Scaling Groups, which launch or terminate instances when CPU or memory crosses configured thresholds and can run custom startup scripts. The earlier move to Docker made bootstrapping new instances straightforward.
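One way to express such a CPU‑driven policy is a CloudFormation target‑tracking configuration; the fragment below is illustrative (resource names and the 60% target are assumptions), and a threshold‑plus‑CloudWatch‑alarm step policy would be an equivalent alternative:

```
ScalingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref BackendServiceGroup
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 60.0
```

The group launches instances when average CPU runs above the target and terminates them when it runs below, with each new instance pulling its Docker image on boot.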

Cluster Scheduling and Deployment

Auto Scaling introduced two problems: resource waste when each group runs a single service, and the operational overhead of manually specifying CPU, memory, and disk for each new service. Adopting Kubernetes solved both issues with built‑in resource management and scheduling, while Spinnaker handled continuous delivery.
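Kubernetes addresses the capacity guesswork by having each service declare its resource needs once, after which the scheduler packs services onto shared nodes. An illustrative Deployment fragment (names, image, and numbers are placeholders):

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scoring-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scoring-service
  template:
    metadata:
      labels:
        app: scoring-service
    spec:
      containers:
        - name: scoring-service
          image: registry.example.com/scoring-service:1.0.0
          resources:
            requests:
              cpu: 250m       # what the scheduler reserves
              memory: 256Mi
            limits:
              cpu: "1"        # hard ceiling for the container
              memory: 512Mi
```

The `requests` values drive bin‑packing across the cluster, so many services share nodes instead of each occupying its own Auto Scaling group.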

Cellular Architecture

Multiple "cells"—independent Kubernetes clusters deployed in separate VPCs—provide isolation and clean architecture. Each cell contains several autoscaling groups with heterogeneous resource profiles, allowing services to be placed on appropriate machines via selectors.
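Placement onto the right resource profile is done with node labels and selectors. In this illustrative pod‑spec fragment (label key, label value, and service name are assumptions), a memory‑hungry service is steered onto the autoscaling group whose nodes carry a matching label:

```
spec:
  nodeSelector:
    node-pool: high-memory
  containers:
    - name: vocab-index
      image: registry.example.com/vocab-index:2.3.1
```

The scheduler will only place this pod on nodes labeled `node-pool: high-memory`, keeping heterogeneous machine types cleanly partitioned within a cell.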

Availability

Cells are spread across different availability zones within the same city, improving resilience. An internal load balancer distributes traffic among cells, and the design can be extended to multi‑region deployments using intelligent DNS routing. Stateful services (e.g., databases) require careful data‑sync handling, with a primary‑cluster redirect strategy for writes.

Robustness

Process failures trigger automatic restarts, node failures cause automatic replacement, and whole‑cluster issues can be mitigated by traffic shifting to healthy clusters, though stateful services are not yet fully covered.

Conclusion

The article summarizes the problems encountered and the architectural changes made so far. There is still room for improvement, and readers interested in this work are invited to join the team.

Tags: monitoring, Docker, microservices, Kubernetes, service discovery, autoscaling, gRPC