How to Build a Robust Speech‑to‑Text Feature in React with Tencent ASR
This article walks through the complete front‑end architecture and implementation details for integrating Tencent Cloud speech‑to‑text into a React app, covering token authentication, SDK initialization, event handling, cursor‑aware text insertion, character limits, permission handling, error management, and state management with MobX.
Overall Architecture
The front‑end solution is split into three layers:
TencentASRComponent (speech‑recognition component)
State management (MobX store) with methods such as getSpeechToken(), setRecognitionText(), and a recognitionController
Service layer (evaluateService) that fetches temporary credentials from the back‑end
Core technology stack:
SDK: Tencent Cloud Speech SDK for JavaScript (tencentcloud-speech-sdk-js)
Framework: React + MobX
Browser API: getUserMedia for microphone permission
Encryption: CryptoJS for HMAC‑SHA1 signing
1. Token Authentication Process
The SDK requires a signed request, so a temporary STS credential is generated on the back‑end and returned as JSON:
{
  "Credentials": {
    "TmpSecretId": "",
    "TmpSecretKey": "",
    "Token": ""
  },
  "appId": ""
}
Temporary keys are used because they have an expiration time, avoid leaking permanent secret keys, and allow dynamic permission control.
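As a minimal sketch, the response above can be validated and mapped to the lowercase field names that the recognizer parameters use later in this article. The helper name `toRecognizerCredentials` and the error wording are hypothetical; only the JSON field names come from the response shown above.

```javascript
// Hypothetical helper: checks the back-end payload and maps it to the
// field names used when creating the recognizer (secretid, secretkey, ...).
function toRecognizerCredentials(payload) {
  const creds = (payload && payload.Credentials) || {};
  const mapped = {
    secretid: creds.TmpSecretId,
    secretkey: creds.TmpSecretKey,
    token: creds.Token,
    appid: payload ? payload.appId : undefined,
  };
  // Fail early instead of letting the SDK fail later with a vaguer error.
  for (const [name, value] of Object.entries(mapped)) {
    if (value === undefined || value === null || value === '') {
      throw new Error(`Speech token response is missing "${name}"`);
    }
  }
  return mapped;
}
```

Failing at this point lets the UI show a "service error" before any microphone prompt appears.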
After receiving the token, the front‑end creates a signature with CryptoJS:
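signCallback(signStr) {
  // secretKey is the TmpSecretKey returned by the back‑end
  const hash = CryptoJS.HmacSHA1(signStr, secretKey);
  // Convert the WordArray to raw bytes before Base64‑encoding
  const bytes = Uint8ArrayToString(toUint8Array(hash));
  return btoa(bytes);
}
⚠️ A CryptoJS WordArray must not be passed to btoa directly: it stringifies to hex text, so the Base64 output would encode the hex characters rather than the raw HMAC bytes, and server‑side verification fails. Convert it to raw bytes first (here via a Uint8Array).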
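The signing snippet calls two helpers it does not define. The following is one possible sketch of them, assuming the usual CryptoJS WordArray shape of `{ words, sigBytes }` (big-endian 32-bit words). The function names match the snippet; the bodies are an assumption, not the author's code.

```javascript
// Sketch: unpack a CryptoJS-style WordArray into raw bytes.
// Each entry of `words` is a signed 32-bit integer holding 4 big-endian bytes.
function toUint8Array(wordArray) {
  const { words, sigBytes } = wordArray;
  const bytes = new Uint8Array(sigBytes);
  for (let i = 0; i < sigBytes; i++) {
    bytes[i] = (words[i >>> 2] >>> (24 - (i % 4) * 8)) & 0xff;
  }
  return bytes;
}

// Sketch: build a binary string (one char per byte), the input btoa expects.
function Uint8ArrayToString(bytes) {
  let out = '';
  for (const b of bytes) {
    out += String.fromCharCode(b);
  }
  return out;
}

// Usage with a hand-made WordArray holding the ASCII bytes of "Hello".
const sample = { words: [0x48656c6c, 0x6f000000], sigBytes: 5 };
console.log(btoa(Uint8ArrayToString(toUint8Array(sample)))); // "SGVsbG8="
```

`btoa` is available globally in browsers and in recent Node.js versions.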
2. Recognizer Initialization
Key parameters are passed when creating the recognizer:
const params = {
  secretid,
  secretkey,
  token,
  appid,
  engine_model_type: '16k_zh', // Chinese 16k model
  voice_format: 1, // PCM
  word_info: 2, // Return word‑level timestamps
  signCallback
};
this.recognizer = new WebAudioSpeechRecognizer(params, true);
These settings balance recognition accuracy, speed, and downstream features such as highlighting.
3. Recognition Events – The Heartbeat of the Feature
OnRecognitionStart – start of recognition
OnSentenceBegin – record cursor position at the beginning of a sentence
OnRecognitionResultChange – interim result (continuously updated)
OnSentenceEnd – final result of a sentence
OnError – error handling
Handling interim and final results:
OnRecognitionResultChange(res) {
  handleRecognitionResult(res.voice_text_str, false);
}
OnSentenceEnd(res) {
  handleRecognitionResult(res.voice_text_str, true);
}
The interim result must replace the previous interim text instead of being appended, otherwise duplicate text appears.
4. Text Insertion Algorithm Design
Because users may dictate at the start of the text, into the middle, or over a selection, and may move the cursor during recognition, a small but precise insertion strategy is required.
Step 1 – Record cursor position when a sentence begins:
OnSentenceBegin: () => {
  this.recognitionStart = textarea.selectionStart;
};
Step 2 – Replace the previous interim result with the new one:
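handleRecognitionResult(result, finalize) {
  const fullText = textarea.value; // current content, including any interim text
  const start = this.recognitionStart;
  const previous = this.lastRecognitionResult || '';
  // Cut out the previous interim text and put the new result in its place
  const before = fullText.slice(0, start);
  const after = fullText.slice(start + previous.length);
  const newText = before + result + after;
  // A final result clears the interim buffer so the next sentence is inserted, not replaced
  this.lastRecognitionResult = finalize ? '' : result;
  setRecognitionText(newText);
}
Because the insertion position is captured once per sentence, this avoids duplication, keeps the text at the correct position, and stays stable when the cursor moves.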
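The replace-instead-of-append rule is easier to check in isolation as a pure function. The sketch below is an assumption about how the logic could be factored out; the `state` object and the name `applyRecognitionResult` are hypothetical.

```javascript
// Pure sketch of the replace-not-append rule. `state` holds:
//   text  - full textarea content
//   start - caret position recorded when the sentence began
//   last  - the interim text currently sitting in `text` ('' if none)
function applyRecognitionResult(state, result, finalize) {
  const before = state.text.slice(0, state.start);
  const after = state.text.slice(state.start + state.last.length);
  return {
    text: before + result + after,
    start: state.start,
    // A final result leaves nothing to replace on the next call.
    last: finalize ? '' : result,
  };
}

// Usage: dictating into the middle of "Hello world".
let state = { text: 'Hello world', start: 6, last: '' };
state = applyRecognitionResult(state, 'big', false);      // interim
state = applyRecognitionResult(state, 'big blue ', true); // final
console.log(state.text); // "Hello big blue world"
```

The second call removes the interim "big" before inserting the final text, so nothing is duplicated.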
5. Character Limit & User Experience Design
The product limits input to 500 characters. The implementation includes:
Real‑time remaining character count
Truncating interim results that exceed the limit
Automatically stopping recognition when the limit is reached
if (spaceAvailable <= 0) {
  this.stopRecognition();
  return;
}
Instead of waiting for the sentence to finish, the system stops immediately to avoid the “said but not written” frustration.
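The truncation and auto-stop rules can be sketched as one small helper. The name `fitToLimit` and its parameters are hypothetical, and the sketch counts characters with JavaScript's `length`, which counts UTF-16 code units; that is an assumption about how the product defines a "character".

```javascript
const MAX_CHARS = 500; // limit stated in the article

// Hypothetical helper: trims an incoming result so the total stays within
// the limit and reports whether recognition should stop.
//   textLength     - current length of the whole text, interim included
//   previousLength - length of the interim text that `result` will replace
function fitToLimit(textLength, previousLength, result, limit = MAX_CHARS) {
  const spaceAvailable = limit - (textLength - previousLength);
  if (spaceAvailable <= 0) {
    return { text: '', shouldStop: true };
  }
  const text = result.slice(0, spaceAvailable);
  // Stop as soon as the trimmed result fills the remaining space.
  return { text, shouldStop: text.length >= spaceAvailable };
}

console.log(fitToLimit(498, 0, 'abcdef')); // { text: 'ab', shouldStop: true }
```

Subtracting `previousLength` matters because the interim text is about to be replaced and should not count against the limit twice.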
6. Microphone Permission & Browser Compatibility
Permission request:
navigator.mediaDevices.getUserMedia({ audio: true })
Typical errors:
NotAllowedError – user denied permission
NotFoundError – no microphone detected
NotReadableError – device occupied by another program
Compatibility check:
if (!navigator.mediaDevices?.getUserMedia) {
  throw new Error('Browser does not support recording');
}
Common real‑world issues include disabled microphones on Windows browsers, missing Chrome permission on macOS, and enterprise security policies blocking microphone access.
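The permission request, the compatibility check, and the typical error names can be combined into one sketch. The function names and the message wording below are illustrative assumptions; the error names are the ones listed above. `mediaDevices` is passed in as a parameter (in a browser, `navigator.mediaDevices`) so the logic can be exercised without a real device.

```javascript
// Illustrative messages for the error names listed above.
const MIC_ERROR_MESSAGES = new Map([
  ['NotAllowedError', 'Microphone permission was denied.'],
  ['NotFoundError', 'No microphone was detected.'],
  ['NotReadableError', 'The microphone is in use by another program.'],
]);

// Falls back to a generic message for error names not listed above.
function describeMicError(err) {
  const name = err && err.name;
  return MIC_ERROR_MESSAGES.get(name) || 'Could not access the microphone.';
}

// In a browser, call requestMicrophone(navigator.mediaDevices).
async function requestMicrophone(mediaDevices) {
  if (!mediaDevices || typeof mediaDevices.getUserMedia !== 'function') {
    throw new Error('Browser does not support recording');
  }
  try {
    return await mediaDevices.getUserMedia({ audio: true });
  } catch (err) {
    throw new Error(describeMicError(err));
  }
}
```

Keeping the mapping separate from the request makes it straightforward to show the same message in the UI and in logs.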
7. Cursor Visibility – Ensuring Users See the Inserted Text
A hidden “mirror” DOM is created to compute the caret’s vertical position:
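getCaretTop(textarea, caretPos) {
  const mirror = document.createElement('div');
  // copy styles …
  mirror.style.position = 'absolute'; // makes the mirror the marker's offset parent
  mirror.style.visibility = 'hidden';
  mirror.textContent = textarea.value.slice(0, caretPos);
  const marker = document.createElement('span');
  marker.textContent = '\u200b'; // zero‑width marker at the caret
  mirror.appendChild(marker);
  document.body.appendChild(mirror); // must be in the DOM to be laid out
  const top = marker.offsetTop;
  document.body.removeChild(mirror);
  return top;
}
Automatic scrolling when the caret is out of view: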
if (caretTop > viewBottom) {
  textarea.scrollTop = caretTop - textarea.clientHeight + lineHeight * 2;
}
8. State Management with MobX
MobX handles three main concerns:
Maintain the recognized text (recognitionText)
Expose a controller for external actions (e.g., stop recognition)
Synchronize character count in real time
setRecognitionController({
  stop: this.stopRecognition
});
This decouples the parent component from the recognizer implementation.
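The controller idea can be illustrated with a plain class. This sketch deliberately leaves MobX out so that it runs without dependencies; in the real store these fields would presumably be observables and the setters actions. The class name, the `remainingChars` getter, and the store-level `stopRecognition` method are assumptions, while `recognitionText`, `setRecognitionText`, and `setRecognitionController` follow the names used in this article.

```javascript
// Plain-JavaScript stand-in for the MobX store (no observables here).
class SpeechStoreSketch {
  constructor(limit = 500) {
    this.limit = limit;
    this.recognitionText = '';
    this.recognitionController = null;
  }

  setRecognitionText(text) {
    this.recognitionText = text;
  }

  // The speech component registers its controls here.
  setRecognitionController(controller) {
    this.recognitionController = controller;
  }

  // Derived value for the live character counter.
  get remainingChars() {
    return Math.max(0, this.limit - this.recognitionText.length);
  }

  // Called by a parent that knows nothing about the recognizer itself.
  stopRecognition() {
    if (this.recognitionController) {
      this.recognitionController.stop();
    }
  }
}

// Usage: the component registers `stop`, the parent triggers it via the store.
const store = new SpeechStoreSketch();
let stopped = false;
store.setRecognitionController({ stop: () => { stopped = true; } });
store.setRecognitionText('hello');
store.stopRecognition();
console.log(stopped, store.remainingChars); // true 495
```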
9. Error Handling – Essential Details
Typical error categories:
Token acquisition failure – abort and show “service error”
Browser not supported – suggest switching browsers
Permission denied – prompt user to enable microphone
Recognition runtime error – stop and display a clear message
Character limit exceeded – auto‑stop
Unexpected page exit – pause and destroy recognizer
OnError(error) {
  this.setState({ error: error.message, isRecording: false });
}
10. Use Cases & Future Extensions
Current applications:
Teacher evaluation system
Form input assistant
Meeting notes / voice memo
Potential extensions:
Multi‑language recognition
Offline recognition (when SDK supports it)
Command recognition (e.g., “next question”, “delete previous line”)
Voice waveform visualization for better UX
Conclusion
The real challenge of speech‑to‑text is not merely plugging in an SDK; it lies in robust cursor management, text structure handling, permission flow, graceful failure fallback, and edge‑case handling. Focusing on reliable token mechanisms, precise text‑insertion logic, permission handling, error experience, character limits, and cursor visibility turns a usable feature into a delightful one.
Code Mala Tang
Read source code together, write articles together, and enjoy spicy hot pot together.
