How to Build a Robust Speech‑to‑Text Feature in React with Tencent ASR

This article walks through the complete front‑end architecture and implementation details for integrating Tencent Cloud speech‑to‑text into a React app, covering token authentication, SDK initialization, event handling, cursor‑aware text insertion, character limits, permission handling, error management, and state management with MobX.

Code Mala Tang

Overall Architecture

The front‑end solution is split into three layers:

TencentASRComponent (speech‑recognition component)

State management (MobX store) with methods such as getSpeechToken(), setRecognitionText(), and a recognitionController

Service layer (evaluateService) that fetches temporary credentials from the back‑end

Core technology stack:

SDK: Tencent Cloud Speech SDK for JavaScript (tencentcloud-speech-sdk-js)

Framework: React + MobX

Browser API: getUserMedia for microphone permission

Encryption: CryptoJS for HMAC‑SHA1 signing

1. Token Authentication Process

The SDK requires a signed request, so a temporary STS credential is generated on the back‑end and returned as JSON:

{
  "Credentials": {
    "TmpSecretId": "",
    "TmpSecretKey": "",
    "Token": ""
  },
  "appId": ""
}

Temporary keys are used because they have an expiration time, avoid leaking permanent secret keys, and allow dynamic permission control.
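As an illustrative sketch (the function name and validation rules are my own), the front end can normalize this response into the fields the recognizer needs, failing fast if the token service returned an incomplete payload:

```javascript
// Normalize the back-end response shown above into the fields the SDK expects.
// Throwing early lets a bad or expired token surface as a "service error"
// before any recording starts.
function extractAsrCredentials(response) {
  const creds = response && response.Credentials;
  if (!creds || !creds.TmpSecretId || !creds.TmpSecretKey ||
      !creds.Token || !response.appId) {
    throw new Error('Invalid speech token response');
  }
  return {
    secretId: creds.TmpSecretId,   // temporary SecretId
    secretKey: creds.TmpSecretKey, // temporary SecretKey used for signing
    token: creds.Token,            // STS session token
    appId: response.appId,
  };
}
```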

After receiving the token, the front‑end creates a signature with CryptoJS:

// signStr is supplied by the SDK; secretKey is the temporary TmpSecretKey
signCallback(signStr) {
  const hash = CryptoJS.HmacSHA1(signStr, secretKey);
  // Convert the CryptoJS WordArray to raw bytes before Base64-encoding;
  // toUint8Array / Uint8ArrayToString are small project helpers
  const bytes = Uint8ArrayToString(toUint8Array(hash));
  return btoa(bytes);
}
⚠️ CryptoJS's WordArray cannot be directly converted to Base64; it must be transformed to a Uint8Array first, otherwise server‑side verification fails.
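The conversion the warning refers to can be done by reading the WordArray's big‑endian 32‑bit words byte by byte. A minimal sketch (helper names are my own; a WordArray is an object with `words` and `sigBytes` fields):

```javascript
// A CryptoJS WordArray stores data as big-endian 32-bit words plus a
// significant-byte count (sigBytes). Extract each raw byte first, because
// calling btoa() on the WordArray's default toString() would Base64-encode
// its hex representation, not the raw digest bytes.
function wordArrayToUint8Array(wordArray) {
  const { words, sigBytes } = wordArray;
  const bytes = new Uint8Array(sigBytes);
  for (let i = 0; i < sigBytes; i++) {
    bytes[i] = (words[i >>> 2] >>> (24 - (i % 4) * 8)) & 0xff;
  }
  return bytes;
}

function toBase64(bytes) {
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}
```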

2. Recognizer Initialization

Key parameters are passed when creating the recognizer:

const params = {
  secretid,
  secretkey,
  token,
  appid,
  engine_model_type: '16k_zh', // Chinese 16k model
  voice_format: 1,             // PCM
  word_info: 2,                // Return word-level timestamps
  signCallback
};
this.recognizer = new WebAudioSpeechRecognizer(params, true);

These settings balance recognition accuracy, speed, and downstream features such as highlighting.
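A small lifecycle wrapper keeps the component honest about recording state. This sketch (class name is my own) works with any object exposing start() and stop(), such as the recognizer instance created above:

```javascript
// Guards against double-start and double-stop, so UI event handlers can
// call start()/stop() freely without corrupting the recognizer's state.
class RecognitionSession {
  constructor(recognizer) {
    this.recognizer = recognizer;
    this.isRecording = false;
  }
  start() {
    if (this.isRecording) return false; // already running, ignore
    this.recognizer.start();
    this.isRecording = true;
    return true;
  }
  stop() {
    if (!this.isRecording) return false; // nothing to stop
    this.recognizer.stop();
    this.isRecording = false;
    return true;
  }
}
```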

3. Recognition Events – The Heartbeat of the Feature

OnRecognitionStart – start of recognition

OnSentenceBegin – record cursor position at the beginning of a sentence

OnRecognitionResultChange – interim result (continuously updated)

OnSentenceEnd – final result of a sentence

OnError – error handling

Handling interim and final results:

OnRecognitionResultChange(res) {
  handleRecognitionResult(res.voice_text_str, false);
}

OnSentenceEnd(res) {
  handleRecognitionResult(res.voice_text_str, true);
}

The interim result must replace the previous interim text instead of being appended, otherwise duplicate text appears.

4. Text Insertion Algorithm Design

Because users may insert text at the beginning or in the middle of existing content, replace a selection, or move the cursor while recognition is running, a small but precise insertion strategy is required.

Step 1 – Record cursor position when a sentence begins:

OnSentenceBegin: () => {
  this.recognitionStart = textarea.selectionStart;
};

Step 2 – Replace the previous interim result with the new one:

handleRecognitionResult(result, finalize) {
  const start = this.recognitionStart;
  const previous = this.lastRecognitionResult;
  // Splice the new result over the previous interim text
  const before = fullText.slice(0, start);
  const after = fullText.slice(start + previous.length);
  const newText = before + result + after;
  this.setRecognitionText(newText);
  // Remember what was inserted so the next update replaces it;
  // on a final result, advance the anchor past the finished sentence
  this.lastRecognitionResult = finalize ? '' : result;
  if (finalize) this.recognitionStart = start + result.length;
}

This guarantees no duplication, correct insertion position, and stability when the cursor moves.
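The two steps above reduce to a pure function over the full text, which makes the replace‑not‑append behavior easy to unit‑test (the function name is my own):

```javascript
// Replace the previous interim result with the new one at a fixed anchor.
//   start:    cursor position recorded in OnSentenceBegin
//   previous: the interim text currently sitting in the textarea
//   result:   the new (interim or final) recognition result
function spliceRecognition(fullText, start, previous, result) {
  const before = fullText.slice(0, start);
  const after = fullText.slice(start + previous.length);
  return before + result + after;
}
```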

5. Character Limit & User Experience Design

The product limits input to 500 characters. The implementation includes:

Real‑time remaining character count

Truncating interim results that exceed the limit

Automatically stopping recognition when the limit is reached

if (spaceAvailable <= 0) {
  this.stopRecognition();
  return;
}

Instead of waiting for the sentence to finish, the system stops immediately to avoid the “said but not written” frustration.
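The truncate‑and‑stop decision can be isolated into one pure helper; this is a sketch under the article's 500‑character limit (the function name and return shape are my own):

```javascript
const MAX_CHARS = 500;

// Decide how much of an incoming result fits and whether to stop recording.
//   baseLength: length of the text excluding the current interim result
//   result:     the incoming recognition text
function fitResult(baseLength, result) {
  const spaceAvailable = MAX_CHARS - baseLength;
  if (spaceAvailable <= 0) {
    return { text: '', shouldStop: true }; // limit already reached
  }
  return {
    text: result.slice(0, spaceAvailable),     // truncate interim overflow
    shouldStop: result.length >= spaceAvailable, // stop once the limit is hit
  };
}
```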

6. Microphone Permission & Browser Compatibility

Permission request:

navigator.mediaDevices.getUserMedia({ audio: true })

Typical errors:

NotAllowedError – user denied permission

NotFoundError – no microphone detected

NotReadableError – device occupied by another program

Compatibility check:

if (!navigator.mediaDevices?.getUserMedia) {
  throw new Error('Browser does not support recording');
}

Common real‑world issues include disabled microphones on Windows browsers, missing Chrome permission on macOS, and enterprise security policies blocking microphone access.

7. Cursor Visibility – Ensuring Users See the Inserted Text

A hidden “mirror” DOM is created to compute the caret’s vertical position:

getCaretTop(textarea, caretPos) {
  // Build an off-screen "mirror" div with the same text-layout styles
  const mirror = document.createElement('div');
  // copy font, padding, width, white-space, word-wrap … from the textarea
  const beforeCaret = textarea.value.slice(0, caretPos);
  const marker = document.createElement('span');
  mirror.textContent = beforeCaret;
  mirror.appendChild(marker);
  document.body.appendChild(mirror);
  const top = marker.offsetTop; // vertical offset of the caret
  document.body.removeChild(mirror);
  return top;
}

Automatic scrolling when the caret is out of view:

if (caretTop > viewBottom) {
  textarea.scrollTop = caretTop - textarea.clientHeight + lineHeight * 2;
}

8. State Management with MobX

MobX handles three main concerns:

Maintain the recognized text (recognitionText)

Expose a controller for external actions (e.g., stop recognition)

Synchronize character count in real time

setRecognitionController({
  stop: this.stopRecognition
});

This decouples the parent component from the recognizer implementation.
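A framework‑agnostic sketch of the store's shape (in the real store, MobX's makeAutoObservable(this) would make these fields observable; method names follow the article, the rest is illustrative):

```javascript
class SpeechStore {
  constructor() {
    this.recognitionText = '';
    this.recognitionController = null;
    // With MobX: makeAutoObservable(this) here makes fields observable
    // and remainingChars a computed value.
  }
  setRecognitionText(text) {
    this.recognitionText = text;
  }
  setRecognitionController(controller) {
    this.recognitionController = controller; // e.g. { stop: ... }
  }
  get remainingChars() {
    return 500 - this.recognitionText.length; // real-time character count
  }
  stopRecognition() {
    this.recognitionController?.stop(); // delegate to the recognizer layer
  }
}
```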

9. Error Handling – Essential Details

Typical error categories:

Token acquisition failure – abort and show “service error”

Browser not supported – suggest switching browsers

Permission denied – prompt user to enable microphone

Recognition runtime error – stop and display a clear message

Character limit exceeded – auto‑stop

Unexpected page exit – pause and destroy recognizer

OnError(error) {
  this.setState({ error: error.message, isRecording: false });
}

10. Use Cases & Future Extensions

Current applications:

Teacher evaluation system

Form input assistant

Meeting notes / voice memo

Potential extensions:

Multi‑language recognition

Offline recognition (when SDK supports it)

Command recognition (e.g., “next question”, “delete previous line”)

Voice waveform visualization for better UX

Conclusion

The real challenge of speech‑to‑text is not merely plugging in an SDK; it lies in robust cursor management, text structure handling, permission flow, graceful failure fallback, and edge‑case handling. Focusing on reliable token mechanisms, precise text‑insertion logic, permission handling, error experience, character limits, and cursor visibility turns a usable feature into a delightful one.
