How to Build Real‑Time LLM Streaming in the Browser with Fetch

This article explains the mechanism of HTTP API streaming for large language models and shows step‑by‑step how front‑end developers can use the Fetch API, readable streams, and incremental UI updates to deliver real‑time, progressive results while handling errors and connection interruptions.

Alibaba Cloud Developer

Nov 15, 2024

What Is HTTP API Streaming?

HTTP API streaming sends response data in chunks as soon as the large language model generates it, allowing the front‑end to display partial results without waiting for the complete response.

Basic Streaming Flow

Client Request: The front‑end sends a POST request with the prompt and parameters.

Server Processing and Chunked Response: The server begins generating text and streams each chunk to the client.

Client Receives and Processes Chunks: The client continuously reads each chunk from the stream.

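Connection Close: After generation finishes, the server ends the response and the client's reader reports done. OpenAI‑style APIs send a final data: [DONE] event just before that.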

Implementing LLM HTTP API Streaming

Below is a typical front‑end implementation using fetch to initiate a streaming call.

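const fetchStreamData = async (prompt) => {
  // Calling the API directly from the browser exposes the key; in production, route through your own backend.
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer YOUR_API_KEY`
    },
    body: JSON.stringify({
      model: 'gpt-4',
      messages: [{ role: 'user', content: prompt }], // gpt-4 is a chat model, so it takes messages rather than a bare prompt
      stream: true // ask the server to stream the response
    })
  });

  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder('utf-8');
  while (true) {
    const { value, done } = await reader.read();
    if (done) break; // value is undefined once the stream has ended
    // { stream: true } keeps multi-byte characters intact across chunk boundaries
    const chunk = decoder.decode(value, { stream: true });
    console.log(chunk); // raw server-sent event text, e.g. data: {...}
  }
};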

Request Settings

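Set stream: true in the JSON request body to ask the server to stream. It is a parameter of the API, not an option of fetch.

The request body carries the model ID, the prompt, and any other generation parameters. The API key travels in the Authorization header, not in the body.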

Reading Stream Data

Call response.body.getReader() to obtain a reader that can read the response chunk by chunk.

Use TextDecoder to decode byte data into text.
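The { stream: true } option matters because a network chunk can end in the middle of a multi-byte UTF-8 character. The following is a small sketch of that situation; the sample string and the split position are chosen purely for illustration.

```javascript
// Encode a string whose "é" occupies two bytes in UTF-8 (0xC3 0xA9).
const bytes = new TextEncoder().encode('héllo');

// Simulate a network split that lands in the middle of "é".
const first = bytes.slice(0, 2); // "h" plus the first byte of "é"
const second = bytes.slice(2);   // second byte of "é" plus "llo"

// Without { stream: true } each call is treated as a complete input,
// so the dangling bytes are replaced with U+FFFD.
const naive = new TextDecoder('utf-8');
const naiveText = naive.decode(first) + naive.decode(second);

// With { stream: true } the decoder buffers the incomplete byte sequence
// and finishes the character when the next chunk arrives.
const streaming = new TextDecoder('utf-8');
const streamedText =
  streaming.decode(first, { stream: true }) +
  streaming.decode(second, { stream: true }) +
  streaming.decode(); // a final call with no arguments flushes the decoder

console.log(JSON.stringify(naiveText));    // contains replacement characters
console.log(JSON.stringify(streamedText)); // the original string
```

Using one decoder per response and flushing it once at the end keeps the text intact regardless of where chunk boundaries fall.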

Processing Chunks

Repeatedly call reader.read() to get value (bytes) and done (stream end flag).

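The decoded chunk can be processed immediately. With OpenAI‑style endpoints it is server-sent event text (data: {...} lines), so the generated text has to be extracted from it before display.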
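Decoded chunks from OpenAI-style endpoints are not plain text. They are server-sent events, and a single event can be split across two chunks. The sketch below shows one way to buffer and parse them. The name createSseParser is hypothetical, and the payload shapes (choices[0].delta.content for chat endpoints, choices[0].text for completion endpoints, data: [DONE] as the end marker) are assumptions about the server rather than something every API guarantees.

```javascript
// Hypothetical helper: turns decoded text chunks into text deltas.
// Assumes events of the form "data: <json>" terminated by "data: [DONE]".
function createSseParser() {
  let buffer = '';      // holds a trailing partial line between chunks
  let finished = false; // set once the [DONE] marker has been seen

  return {
    // Feed one decoded chunk; returns the text deltas completed by it.
    push(chunkText) {
      const deltas = [];
      buffer += chunkText;
      const lines = buffer.split('\n');
      buffer = lines.pop(); // last element is an incomplete line (or '')

      for (const rawLine of lines) {
        const line = rawLine.trim();
        // Skip blank separators, comments, and anything after [DONE].
        if (finished || !line.startsWith('data:')) continue;
        const payload = line.slice('data:'.length).trim();
        if (payload === '[DONE]') {
          finished = true;
          continue;
        }
        // JSON.parse throws on malformed payloads; callers may want to catch that.
        const choice = JSON.parse(payload).choices?.[0];
        const text = choice?.delta?.content ?? choice?.text ?? '';
        if (text) deltas.push(text);
      }
      return deltas;
    },
    get done() {
      return finished;
    },
  };
}

// Usage inside a read loop (reader and decoder as in the article's code):
//   const parser = createSseParser();
//   for (const text of parser.push(decoder.decode(value, { stream: true }))) {
//     console.log(text);
//   }
```

Keeping the partial line in a buffer is the important design choice here: parsing each chunk in isolation would fail whenever an event straddles a chunk boundary.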

How the Front‑End Handles Streaming Responses

When the back‑end returns a streamed response, the front‑end can update the UI incrementally, handle interruptions, concatenate chunks, and improve user interaction.

1. Incremental UI Updates

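const chatBox = document.getElementById('chat-box');
const message = document.createElement('p'); // one paragraph for the whole reply
chatBox.appendChild(message);
const updateChat = (text) => {
  message.textContent += text; // textContent avoids injecting model output as HTML
};
let done = false;
while (!done) {
  const { value, done: readerDone } = await reader.read();
  done = readerDone; // without this the loop never ends
  if (value) updateChat(decoder.decode(value, { stream: true }));
}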
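When chunks arrive faster than the screen refreshes, writing to the DOM on every chunk does more work than the user can see. One possible refinement, sketched below, is to batch chunks and flush them once per animation frame. The helper name createChunkBatcher and its injectable schedule parameter are illustrative assumptions, not part of the original code.

```javascript
// Hypothetical helper: collects chunks and flushes them once per scheduled tick.
// `schedule` defaults to requestAnimationFrame in the browser; it is injectable
// so the logic can also be exercised outside a browser.
function createChunkBatcher(flush, schedule = (cb) => requestAnimationFrame(cb)) {
  let pending = '';
  let scheduled = false;

  return (chunk) => {
    pending += chunk;
    if (scheduled) return; // a flush is already queued for this frame
    scheduled = true;
    schedule(() => {
      const text = pending;
      pending = '';
      scheduled = false;
      flush(text); // one DOM write for everything received since the last flush
    });
  };
}

// Usage (assuming `message` is the element that displays the reply):
//   const append = createChunkBatcher((text) => { message.textContent += text; });
//   append(chunk); // call once per decoded chunk
```

For short replies the difference is negligible; batching mainly helps with long outputs or expensive rendering such as Markdown conversion.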

2. Handling Interruptions or Errors

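if (!response.ok) {
  console.error(`Request failed with status ${response.status}`);
  return;
}
// Keep reading until the stream ends; a dropped connection rejects the promise.
const processStream = ({ value, done }) => {
  if (done) return;
  updateChat(decoder.decode(value, { stream: true }));
  return reader.read().then(processStream);
};
reader.read().then(processStream).catch(error => {
  console.error('Error while reading stream:', error);
});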
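A stalled connection does not always raise an error by itself, so it can help to abort the request when no data arrives for a while. The sketch below uses AbortController for that. The function name streamWithTimeout, the idleMs default, and the injectable fetchImpl parameter are assumptions made for illustration.

```javascript
// Hypothetical helper: streams a response as text and aborts the request
// if no data arrives for `idleMs` milliseconds.
async function streamWithTimeout(url, init, onText, { idleMs = 15000, fetchImpl = fetch } = {}) {
  const controller = new AbortController();
  let timer;
  const resetTimer = () => {
    clearTimeout(timer);
    timer = setTimeout(() => controller.abort(), idleMs);
  };

  try {
    resetTimer();
    const response = await fetchImpl(url, { ...init, signal: controller.signal });
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }
    const reader = response.body.getReader();
    const decoder = new TextDecoder('utf-8');
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      resetTimer(); // data arrived, so the connection is still alive
      onText(decoder.decode(value, { stream: true }));
    }
  } finally {
    clearTimeout(timer); // never leave a timer running after success or failure
  }
}

// Usage (the endpoint and showRetryButton are placeholders):
//   streamWithTimeout('/api/stream', { method: 'POST', body }, updateChat)
//     .catch((error) => showRetryButton(error));
```

With the browser's fetch, aborting rejects both a pending request and a pending reader.read(), so the caller sees a single rejected promise and can offer a retry. The same controller pattern can back a "stop generating" button.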

3. Concatenating Stream Data

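let fullResponse = '';
let done = false;
while (!done) {
  const { value, done: readerDone } = await reader.read();
  done = readerDone; // without this the loop never ends
  if (value) fullResponse += decoder.decode(value, { stream: true }); // build complete response
}
fullResponse += decoder.decode(); // flush any bytes still buffered in the decoder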
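The accumulation loop can be wrapped in a reusable function and tried out against a locally simulated stream, with no API key required. The names collectText and fakeStream below are hypothetical, and fakeStream only imitates a response body; it makes no network request.

```javascript
// Hypothetical helper: reads a byte stream to the end and returns the full text.
// `onUpdate` (optional) receives the text accumulated so far after each chunk.
async function collectText(stream, onUpdate = () => {}) {
  const reader = stream.getReader();
  const decoder = new TextDecoder('utf-8');
  let fullResponse = '';
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    fullResponse += decoder.decode(value, { stream: true });
    onUpdate(fullResponse);
  }
  fullResponse += decoder.decode(); // flush bytes still buffered in the decoder
  return fullResponse;
}

// Simulated response body for local experiments (not a real API call).
function fakeStream(parts) {
  const encoder = new TextEncoder();
  return new ReadableStream({
    start(controller) {
      for (const part of parts) controller.enqueue(encoder.encode(part));
      controller.close();
    },
  });
}

// Usage:
//   collectText(fakeStream(['Hel', 'lo'])).then((text) => console.log(text));
//   // with a real request: collectText(response.body, renderSoFar)
```

Passing the accumulated text to onUpdate suits renderers that re-render the whole message, such as a Markdown view, whereas appending individual chunks suits plain text.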

4. Auto‑Scroll and Interaction Optimisation

const scrollToBottom = () => {
  chatBox.scrollTop = chatBox.scrollHeight;
};
updateChat(chunk);
scrollToBottom(); // keep view at latest content
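Scrolling to the bottom on every chunk can fight a user who has scrolled up to reread earlier text. A common refinement, sketched here with hypothetical helper names and an arbitrary 40-pixel threshold, is to follow the output only when the view was already near the bottom.

```javascript
// Hypothetical helper: true when the view is within `threshold` pixels of the bottom.
// Works on any object exposing scrollTop, clientHeight, and scrollHeight.
function isNearBottom(el, threshold = 40) {
  return el.scrollHeight - el.scrollTop - el.clientHeight <= threshold;
}

// Append text and follow the output only if the reader had not scrolled away.
function appendAndFollow(el, appendText, text) {
  const shouldFollow = isNearBottom(el); // measure before the content grows
  appendText(text);
  if (shouldFollow) el.scrollTop = el.scrollHeight;
}

// Usage (chatBox and updateChat as defined in the article):
//   appendAndFollow(chatBox, updateChat, chunk);
```

The position has to be measured before the new text is appended, because appending increases scrollHeight and would make the view look further from the bottom than it was.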

Advantages of Streaming Calls

Improved User Experience: Users see partial results instantly, reducing perceived latency.

Reduced Server Load: The server forwards tokens as they are generated instead of buffering the complete response in memory before sending it. The generation work itself is unchanged.

Enhanced Interactivity: Real‑time feedback enables richer conversational or assistive applications.

Conclusion

HTTP API streaming provides an efficient, real‑time interaction model for large language models. By processing streamed chunks, updating the UI incrementally, handling errors, and concatenating data, front‑end developers can deliver smoother experiences in chatbots, assistants, and other interactive applications.
