How to Ensure Reliable Push Messaging in Live Online Classrooms
This article examines the challenges of message loss in live‑streamed online classrooms, analyzes why pushes can fail, and proposes a comprehensive reliability strategy—including TCP fundamentals, multi‑channel redundancy, sequence‑based ordering, hole‑pulling, and configurable back‑end mechanisms—to achieve near‑100% delivery.
1. Background
With the rise of live streaming and video applications, messaging scenarios have become richer, especially in education where features such as sign‑in and quizzes demand high reliability; a missed sign‑in could mean a student does not receive the teacher's prompt.
In a live classroom, the message channel is a core module: teachers send interactive elements through it (fish‑cake rewards, sign‑in, quiz cards, red packets, hand‑raise, start/stop commands, mute), and it also carries chat.
Besides interactive messages, the channel handles frequent CGI requests such as heartbeats and member‑list updates, as well as non‑classroom notifications like internal private messages.
1.2 Why Push Messages May Be Lost
A message traverses several modules and network hops before reaching the client. If any module fails to send, it retries up to a maximum count: too many retries can block subsequent messages, while too few cause loss. Because modules process messages asynchronously, ordering can drift, producing duplicate or out‑of‑order delivery.
The main causes of loss are therefore multi‑node forwarding, network jitter, and single‑connection overload, none of which can be fully avoided.
1.3 Scenarios Requiring Reliable Messages
In the Penguin Tutor platform, teachers and students interact via push messages. Reliable pushes (e.g., hand‑raise, quiz, sign‑in) must not be lost, whereas ordinary chat can tolerate occasional loss.
If a reliable push such as “start in‑class exercise” is received but the corresponding “stop” push is missed, the exercise overlay may block the live video, forcing the student to re‑enter the room.
Missing a fish‑cake red packet feels like missing a huge reward.
2. Improving Push Reliability
2.1 Design Thinking
2.1.1 TCP Reliable Transmission Analysis
TCP ensures reliability through:
Acknowledgment and retransmission: the receiver acknowledges packets; the sender retransmits if no ACK is received within a timeout.
Checksums: each segment carries a checksum so corrupted data can be detected and discarded for retransmission.
Reasonable segmentation and ordering: TCP segments data according to the MSS (derived from the path MTU), buffers out‑of‑order segments, and reorders them before delivering to the application layer.
Flow control: the receiver can signal the sender to slow down.
Congestion control: the sender reduces rate when the network is congested.
These mechanisms work for a single TCP connection, but large data volumes can still cause timeouts, increased retransmissions, or complete loss if the connection breaks.
2.1.2 Thought Process
Beyond the single connection, an application‑level reliability layer is needed. Between the ACC (access) service and the client, a lightweight reliable scheme includes acknowledgment, retransmission, intelligent merging, flow control, and congestion control. However, once multiple intermediate nodes are involved (A→B→C), making every hop reliable becomes significantly harder.
Storing messages at the source and letting the destination retry is more practical than making every intermediate node reliable.
2.2 Reliability Design Points
Point 1: Assign a continuously increasing seq to each message when stored. Clients can detect missing messages by gaps in the sequence and request retransmission.
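Point 1 can be sketched as follows. This is a minimal illustration, not the actual SDK API; `findGaps` and the message shape are assumed names.

```javascript
// Detect holes in the received sequence numbers (illustrative sketch).
// Messages carry a continuously increasing seq assigned at store time;
// any seq between maxHandledSeq and the highest received seq that we
// have not seen is a hole to be requested from the server.
function findGaps(receivedSeqs, maxHandledSeq) {
  const have = new Set(receivedSeqs);
  const gaps = [];
  const highest = Math.max(maxHandledSeq, ...receivedSeqs);
  for (let s = maxHandledSeq + 1; s <= highest; s++) {
    if (!have.has(s)) gaps.push(s); // these seqs must be hole-pulled
  }
  return gaps;
}

// Example: we have handled up to seq 2 and then receive 3, 6, 7.
// Seqs 4 and 5 are missing and should be pulled.
findGaps([3, 6, 7], 2); // → [4, 5]
```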
Point 2: Balance loss‑avoidance and ordering. Some messages need ordering, others do not; the channel guarantees delivery, while the business layer handles ordering if required.
Point 3: Use three channels – push channel, hole‑pull (gap‑fill) channel, and suffix‑pull channel – to improve arrival rate.
Point 4: Prefer short‑connection HTTP for pulling messages; long‑connection push handles real‑time delivery.
Point 5: Hole‑pull strategy: wait at least 2 s before pulling a missing message; merge pulls that occur within 300 ms to reduce load.
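The timing in Point 5 might look like the sketch below. `onHoleDetected` and `pullFromServer` are illustrative names, and the extra `waitMs`/`mergeMs` parameters exist only to make the sketch testable; a real implementation would also cancel the wait if the missing seq arrives via push in the meantime.

```javascript
const HOLE_WAIT_MS = 2000;   // wait before pulling a missing seq
const MERGE_WINDOW_MS = 300; // merge nearby pulls into one request

const pending = new Set();
let mergeTimer = null;

function onHoleDetected(seq, pullFromServer,
                        waitMs = HOLE_WAIT_MS, mergeMs = MERGE_WINDOW_MS) {
  // Wait before pulling: a late push may still fill the hole.
  setTimeout(() => {
    pending.add(seq);
    if (mergeTimer === null) {
      // Merge every hole reported within the window into one batched pull.
      mergeTimer = setTimeout(() => {
        const batch = [...pending].sort((a, b) => a - b);
        pending.clear();
        mergeTimer = null;
        pullFromServer(batch); // one request for all merged holes
      }, mergeMs);
    }
  }, waitMs);
}
```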
Point 6: Make configurable parameters (hole‑pull wait time, merge window, batch size) adjustable from the backend with safe local defaults.
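A configuration shape for Point 6 could look like this; the field names and fallback rules are assumptions for illustration, not the real backend schema.

```javascript
// Safe local defaults; the backend may override any field at runtime.
const DEFAULT_CONFIG = {
  holeWaitMs: 2000,   // wait before hole-pulling a missing seq
  mergeWindowMs: 300, // merge pulls arriving within this window
  pullBatchSize: 20,  // max seqs per pull request
};

// Fall back to the local default when a backend value is missing or invalid,
// so a bad remote config can never break the client.
function loadConfig(remote) {
  const cfg = { ...DEFAULT_CONFIG };
  for (const k of Object.keys(DEFAULT_CONFIG)) {
    if (remote && Number.isFinite(remote[k]) && remote[k] > 0) {
      cfg[k] = remote[k];
    }
  }
  return cfg;
}
```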
Point 7: Dispatch logic: maintain maxHandledSeq; trigger delivery when a message with seq = maxHandledSeq + 1 arrives, or when a hole‑pull/suffix‑pull callback returns. Handle three cases: normal delivery, pseudo‑delivery for messages still in FAILED status after the wait, and delayed delivery for other statuses.
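The dispatch logic in Point 7 can be sketched as below. The status names (`RECEIVED`, `FAILED`) and the `queue`/`state` shapes are illustrative, not the real SDK's.

```javascript
// Deliver messages to the business layer strictly in seq order.
// `queue` maps seq → { status, payload }; `state.maxHandledSeq` tracks
// the highest seq already handed to the business layer.
function dispatch(queue, state, deliver) {
  let next = state.maxHandledSeq + 1;
  while (queue[next]) {
    const msg = queue[next];
    if (msg.status === 'RECEIVED') {
      deliver(msg.payload);  // normal delivery
    } else if (msg.status === 'FAILED') {
      deliver(null);         // pseudo-delivery: hole-pull gave up on this seq
    } else {
      break;                 // still being pulled: delay delivery
    }
    state.maxHandledSeq = next;
    delete queue[next];
    next += 1;
  }
}
```

Pseudo‑delivery keeps the sequence moving so one permanently lost message cannot block everything behind it.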
2.3 PUSH SDK Reliable Push Module Overall Design
2.3.1 Message Queue Design
A reliable queue can be represented by a single JavaScript object. Each message is indexed by its seq, enabling fast lookup, de‑duplication, and ordered delivery. The queue retains a limited number of recent messages for deduplication and discards older ones in batches.
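A minimal sketch of such a queue follows; `MAX_KEPT`, `TRIM_BATCH`, and the batch‑trimming strategy are assumptions, and the trim loop assumes seqs are roughly contiguous from the oldest retained entry.

```javascript
const MAX_KEPT = 200;  // retain this many recent messages for dedup
const TRIM_BATCH = 50; // discard old entries in batches of this size

function makeQueue() {
  return { bySeq: {}, minSeq: Infinity };
}

// Index each message by its seq for fast lookup and de-duplication
// (the same message may arrive via push, hole-pull, and suffix-pull).
function enqueue(q, msg) {
  if (q.bySeq[msg.seq]) return false; // duplicate: drop it
  q.bySeq[msg.seq] = msg;
  if (msg.seq < q.minSeq) q.minSeq = msg.seq;
  // Trim in batches so we do not pay a per-message cleanup cost.
  if (Object.keys(q.bySeq).length > MAX_KEPT) {
    for (let i = 0; i < TRIM_BATCH; i++) delete q.bySeq[q.minSeq + i];
    q.minSeq += TRIM_BATCH;
  }
  return true;
}
```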
2.3.2 Message State Transitions
Each queued message carries a status (for example received, pulling, or FAILED) that the dispatch logic in Point 7 uses to choose between normal, pseudo‑, and delayed delivery.
2.4 Front‑End/Back‑End Overall Design
When a message is stored, it is written both to Kafka (feeding the push path) and to Redis (serving the pull path). Push messages are delivered over long‑lived connections; pull messages are fetched over short‑lived HTTP requests.
3. Testing and Results
3.1 Test Cases and Expectations
Various scenarios were simulated to verify hole‑pull, suffix‑pull, and normal pull behaviors, including timing thresholds (2 s wait, 300 ms merge window) and handling of network timeouts.
3.2 Test Methods
Because the logic is complex, automated unit tests are ideal, but semi‑manual testing is also used: logging, simple simulation tools for push and loss, and console inspection.
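A simple loss‑simulation helper of the kind mentioned above might look like this; `simulateLossyPush` and its parameters are illustrative names, with an injectable `rng` so tests stay deterministic.

```javascript
// Simulate an unreliable push channel for semi-manual testing: each
// message is dropped with probability `lossRate`, exercising the
// client's hole-pull path. Returns the seqs that were "lost".
function simulateLossyPush(messages, onPush, lossRate = 0.2, rng = Math.random) {
  const dropped = [];
  for (const msg of messages) {
    if (rng() < lossRate) {
      dropped.push(msg.seq); // lost: never delivered to the client
    } else {
      onPush(msg);
    }
  }
  return dropped; // seqs the client should detect and hole-pull
}
```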
3.3 Production Effect
After deployment, the message arrival rate approaches 100%.
Tencent IMWeb Frontend Team
