Understanding WebSocket Protocol and Its Application in Real‑Time Speech Recognition

The article explains why traditional polling methods fall short for real‑time data, introduces the WebSocket protocol’s full‑duplex handshake and heartbeat mechanisms, and demonstrates how a Java‑based WebSocket service efficiently streams audio to an ASR engine for low‑latency speech recognition.

HelloTech

Because of business requirements, the author explored the WebSocket protocol and its use in real‑time speech recognition. The author found the topic interesting and worth sharing, as many people have never used this protocol.

The article is divided into three parts: 1) Introduce a real‑world scenario to analyze the drawbacks of traditional real‑time data transmission methods, leading to the need for WebSocket; 2) Explain the WebSocket protocol workflow, handshake process, and heartbeat mechanism; 3) Demonstrate how WebSocket is applied in a speech‑recognition business scenario and the value it brings.

1. Real‑World Scenario

When trading stocks, how is the price updated in real time? Two common approaches are WebSocket and gRPC server‑streaming. Before these, the classic method was either short polling or long polling over HTTP, both of which have serious drawbacks such as high latency or resource waste.

Short polling repeatedly requests the server at a fixed interval. If the interval is long, real‑time performance suffers; if it is short, the server is flooded with unnecessary requests.

Long polling lets the request wait on the server until new data is available or a timeout occurs, reducing request frequency but still holding server resources when no data is present.

Both methods can exhaust connection limits when many clients are involved.
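To make the short-polling trade-off concrete, here is a rough back-of-the-envelope sketch. The class name `PollingLoad`, the client count, and the intervals are all made up for illustration; the point is only that shortening the interval to improve freshness multiplies the request load.

```java
/** Illustrative arithmetic for short polling; all numbers are hypothetical. */
final class PollingLoad {

    /** Requests per second the server receives when every client polls at a fixed interval. */
    static double requestsPerSecond(int clients, long intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("intervalMillis must be positive");
        }
        return clients * 1000.0 / intervalMillis;
    }

    /** An update that arrives just after a poll waits roughly one full interval to be seen. */
    static long worstCaseStalenessMillis(long intervalMillis) {
        return intervalMillis;
    }

    public static void main(String[] args) {
        int clients = 10_000; // assumed number of connected clients
        for (long interval : new long[] {5_000, 500}) {
            System.out.println("interval=" + interval + "ms"
                    + " load=" + requestsPerSecond(clients, interval) + " req/s"
                    + " staleness<=" + worstCaseStalenessMillis(interval) + "ms");
        }
    }
}
```

Under these assumed numbers, cutting the interval from 5 s to 0.5 s makes prices ten times fresher and the request rate ten times higher, even when most responses carry no new data.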

Therefore, a full‑duplex communication protocol is needed, one that lets both client and server send messages proactively without resending a full set of HTTP headers on every transmission. WebSocket satisfies these requirements.

WebSocket was created to address two pain points of HTTP: 1) HTTP’s passive request‑response model, which cannot push data without a client request; 2) HTTP’s stateless nature, which requires a new connection context for each request‑response cycle.

Although HTTP/1.1 supports persistent connections, they are “pseudo‑persistent” because the application layer still creates a new request‑response context each time.

2. WebSocket Protocol Introduction

The overall workflow of WebSocket is as follows: a TCP three‑way handshake establishes the connection, then an HTTP handshake upgrades the connection to WebSocket. After the handshake, the connection is no longer HTTP; both server and client can exchange messages via onMessage. Either side can close the connection, triggering the TCP four‑way termination.

Handshake Details
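
As specified in RFC 6455, the client opens the handshake with an HTTP GET carrying `Upgrade: websocket`, `Connection: Upgrade`, `Sec-WebSocket-Version: 13`, and a random `Sec-WebSocket-Key`. The server answers with `101 Switching Protocols` and a `Sec-WebSocket-Accept` header derived from that key. The sketch below shows only that derivation; the class name `HandshakeAccept` is hypothetical, and a real server (Tomcat, Spring) performs this step internally.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** Computes the Sec-WebSocket-Accept value for a given Sec-WebSocket-Key (RFC 6455). */
final class HandshakeAccept {

    // Fixed GUID that RFC 6455 appends to the client's key before hashing.
    private static final String WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

    static String acceptFor(String secWebSocketKey) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(
                    (secWebSocketKey + WS_GUID).getBytes(StandardCharsets.US_ASCII));
            // The header value is the Base64 encoding of the raw 20-byte SHA-1 digest.
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    public static void main(String[] args) {
        // Sample key taken from the example in RFC 6455.
        System.out.println(acceptFor("dGhlIHNhbXBsZSBub25jZQ=="));
    }
}
```

The client checks the returned value against its own computation, which confirms that the server actually understood the upgrade request rather than replaying a cached HTTP response.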

Heartbeat Mechanism

A natural question arises: why is an application‑level heartbeat needed when TCP already has keep‑alive? There are four reasons. First, TCP keep‑alive only checks the transport layer, not the application state. Second, the default TCP keep‑alive idle time is typically 2 hours (for example, on Linux), which is far too long for real‑time services. Third, network interruptions can leave a “dead” (half‑open) TCP connection, where the client is gone but the server still believes the connection is alive. Finally, Java’s abstractions make it inconvenient to manipulate the transport layer directly.

When a client disconnects abnormally, different handling strategies are required, as illustrated in the following table:

| Client | Server handling | Client handling | Approach |
| --- | --- | --- | --- |
| Process killed | Trigger onerror and onclose | / | Port closed, TCP four‑way termination; external TCP closure triggers exception |
| Power/network loss | Check last heartbeat timestamp | Trigger onclose | Without heartbeat, a reconnection creates a new session while the dead TCP continues sending data; with heartbeat, the server can close the old session promptly |
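
The "check last heartbeat timestamp" strategy can be sketched as a small bookkeeping class. This is a minimal illustration rather than the article's actual implementation: `HeartbeatMonitor`, its method names, and the 30-second timeout in the usage note are all assumptions, and the caller is expected to close the real WebSocket session for every id that is returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Tracks the last heartbeat per session and reports sessions that have gone silent. */
final class HeartbeatMonitor {

    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Call whenever a heartbeat (or any message) arrives from the client. */
    void onHeartbeat(String sessionId, long nowMillis) {
        lastSeen.put(sessionId, nowMillis);
    }

    /** Call from the normal close callback so closed sessions are not tracked. */
    void onClose(String sessionId) {
        lastSeen.remove(sessionId);
    }

    /** Removes and returns the ids whose last heartbeat is older than the timeout. */
    List<String> evictExpired(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            boolean stale = nowMillis - entry.getValue() > timeoutMillis;
            // Two-argument remove only succeeds if no newer heartbeat arrived meanwhile.
            if (stale && lastSeen.remove(entry.getKey(), entry.getValue())) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }
}
```

A scheduled task could call `evictExpired(System.currentTimeMillis())` every few seconds with, say, a 30-second timeout, so that a session left behind by a power or network loss is closed long before TCP keep-alive would notice.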

3. WebSocket in Speech Recognition

Business background: when a user opens the page, the microphone starts listening. After the wake‑up word is detected, the system can execute commands (unlock, navigation, etc.). The front‑end slices audio into packets and sends them through a single channel to an ASR engine, which may apply rule‑based filtering or model enhancement before producing the final command.

Early Pitfalls

Initially, the system tried a pure SOA approach. It had two flaws: (1) it ignored the fact that an audio stream is a continuous, context‑dependent data flow, and (2) it incurred high latency. These flaws introduced many further problems, such as guaranteeing ordered delivery of PCM packets and load‑balancing every packet of a PCM stream to the same ASR instance, which made the design costly.

Therefore, the SOA model was deemed unsuitable for this use case.
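
To illustrate the ordering problem mentioned above: when each PCM packet travels as an independent request, packets can arrive out of order, so the receiver would need something like the reorder buffer below. This is a hypothetical sketch (`PcmReorderBuffer` and the per-packet sequence number are assumptions, not part of the original system). A single WebSocket connection avoids this work because its messages ride on one TCP connection and arrive in the order they were sent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Releases PCM packets strictly in sequence order, holding back any that arrive early. */
final class PcmReorderBuffer {

    private final TreeMap<Long, byte[]> pending = new TreeMap<>();
    private long nextSeq = 0; // sequence number of the next packet to deliver

    /** Accepts one packet and returns every packet that is now deliverable in order. */
    List<byte[]> offer(long seq, byte[] pcm) {
        List<byte[]> ready = new ArrayList<>();
        if (seq < nextSeq) {
            return ready; // duplicate of a packet that was already delivered
        }
        pending.put(seq, pcm);
        // Drain the contiguous run starting at nextSeq.
        while (!pending.isEmpty() && pending.firstKey() == nextSeq) {
            ready.add(pending.pollFirstEntry().getValue());
            nextSeq++;
        }
        return ready;
    }

    /** Number of packets still waiting for an earlier packet to arrive. */
    int pendingCount() {
        return pending.size();
    }
}
```

Every such buffer is also per-stream state that must live on one specific server instance, which is the load-balancing difficulty the SOA design ran into.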

Overall System Design

The user first passes a handshake authentication layer, then audio is captured, sliced, and transmitted over a single channel to the ASR engine. As more audio arrives, the recognition result is continuously refined, and later can be corrected based on context. Results are stored for future model improvement. Two users are shown to illustrate that each has an independent channel and task, without interference.
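
The "independent channel and task per user" idea can be sketched as a registry keyed by session id. The names here (`SessionAudioRegistry`, `open`, `append`, `close`) are hypothetical, and the ASR engine is left out entirely; the sketch only shows that audio slices from different users accumulate in separate buffers.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Keeps one audio buffer per WebSocket session so users never share state. */
final class SessionAudioRegistry {

    private final Map<String, ByteArrayOutputStream> audioBySession = new ConcurrentHashMap<>();

    /** Call when a connection is established. */
    void open(String sessionId) {
        audioBySession.put(sessionId, new ByteArrayOutputStream());
    }

    /** Appends one audio slice and returns the total bytes buffered for that session. */
    int append(String sessionId, byte[] slice) {
        ByteArrayOutputStream buffer = audioBySession.get(sessionId);
        if (buffer == null) {
            throw new IllegalStateException("unknown session: " + sessionId);
        }
        synchronized (buffer) { // keep write + size atomic for this session
            buffer.write(slice, 0, slice.length);
            return buffer.size();
        }
    }

    /** Call when the connection closes; returns everything received on that channel. */
    byte[] close(String sessionId) {
        ByteArrayOutputStream buffer = audioBySession.remove(sessionId);
        return buffer == null ? new byte[0] : buffer.toByteArray();
    }
}
```

In a real service the buffer would more likely be replaced by a streaming handle to the ASR engine, but the keying by session stays the same.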

WS‑Java Server Implementations (Two Options)

1) Tomcat WebSocket implementation using the @ServerEndpoint("url") annotation. Important callbacks: onOpen, onError, onClose, onMessage. Drawback: difficult to intercept the handshake for authentication.

2) Spring Boot WebSocket implementation. Important callbacks: afterConnectionEstablished, handleMessage, handleTransportError, afterConnectionClosed. These correspond one‑to‑one to Tomcat’s onOpen, onMessage, onError, and onClose. Advantage: business logic such as authentication can be added before or after the handshake.
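
As a sketch of the kind of check that can run before the handshake, the class below accepts or rejects a connection based on a token in the handshake URI. Everything in it is an assumption for illustration: the class name `HandshakeAuth`, the `token` query parameter, and the in-memory token set. In Spring, logic like this would typically be called from a `HandshakeInterceptor`'s `beforeHandshake` method, which aborts the handshake when it returns false.

```java
import java.net.URI;
import java.util.Set;

/** Decides whether a WebSocket handshake may proceed, based on a query-string token. */
final class HandshakeAuth {

    private final Set<String> validTokens; // stand-in for a real credential store

    HandshakeAuth(Set<String> validTokens) {
        this.validTokens = validTokens;
    }

    /** Returns true only when the URI carries a "token" parameter with a known value. */
    boolean allow(URI handshakeUri) {
        String query = handshakeUri.getQuery();
        if (query == null) {
            return false;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("token")) {
                return validTokens.contains(pair.substring(eq + 1));
            }
        }
        return false; // no token parameter present
    }
}
```

Rejecting at this stage means an unauthenticated client never obtains an open WebSocket session, so no audio channel or recognition task is created for it.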

Method Execution Order

In summary, the article provides a comprehensive overview of WebSocket’s advantages over traditional HTTP polling, details its handshake and heartbeat mechanisms, and demonstrates a practical implementation for real‑time speech recognition using Java‑based back‑end services.
