Master Claude API: From Model Selection to Streaming Responses

This guide walks you through Claude Code model choices, secure API key handling, Python SDK setup, request construction, multi‑turn conversation management, system prompts, temperature tuning, response streaming, and extracting clean structured data such as JSON, all with practical code examples and diagrams.


Claude Code Model Overview

Claude Code provides three model tiers: Opus, Sonnet, and Haiku, which trade off intelligence, cost, and latency. Opus delivers the highest reasoning capability but is the most expensive and slowest. Haiku is the cheapest and fastest but offers lower intelligence. Sonnet sits between the two and is a balanced choice for most development tasks. Teams often mix models in a single application, e.g., Haiku for real-time user interactions, Sonnet for core business logic, and Opus for complex reasoning, as sketched below.
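A minimal sketch of such a tier-to-model router. The Haiku and Opus identifiers below are illustrative placeholders (check Anthropic's model list for the names current in your account); claude-sonnet-4-6 is the name used throughout this guide:

# Map task types to model tiers; identifiers other than the Sonnet one are placeholders.
MODELS = {
    "realtime": "claude-haiku-placeholder",  # cheap and fast: user-facing chat
    "core": "claude-sonnet-4-6",             # balanced: business logic
    "complex": "claude-opus-placeholder",    # strongest reasoning: hard problems
}

def pick_model(task_type: str) -> str:
    # Fall back to the balanced tier for unknown task types.
    return MODELS.get(task_type, MODELS["core"])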

Model selection diagram

API Request Lifecycle

The Claude request flow consists of five predictable stages:

Client request reaches the server.

Server forwards the request to the Anthropic API.

The model processes the request.

Server receives the model response.

Client receives the response.

Five‑step request flow

Never call the Anthropic API directly from client‑side code because the API key would be exposed. All calls should be routed through a backend server that stores the key securely.
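A minimal sketch of that backend pattern, assuming FastAPI; the /chat route name and request shape are illustrative, not part of the Anthropic SDK:

import os
from anthropic import Anthropic
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])  # key never leaves the server

class ChatRequest(BaseModel):
    messages: list[dict]  # e.g., [{"role": "user", "content": "..."}]

@app.post("/chat")
def chat_endpoint(req: ChatRequest):
    # The browser calls this route; only the server talks to Anthropic.
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1000,
        messages=req.messages,
    )
    return {"text": resp.content[0].text}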

Obtaining an API Key

Log in to the Anthropic console, navigate to the API Keys section, and create a new key. Store the key in a .env file and add that file to .gitignore to keep it out of version control:

ANTHROPIC_API_KEY="your-api-key-here"

Load the environment variable in your code:

from dotenv import load_dotenv
load_dotenv()  # reads .env into the process environment

from anthropic import Anthropic
client = Anthropic()  # picks up ANTHROPIC_API_KEY from the environment
model = "claude-sonnet-4-6"

Installing the SDK and Making a Basic Request

Install the required packages:

pip install anthropic python-dotenv

Define a helper to create a message request:

def create_message(client, model, max_tokens, messages):
    return client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=messages,
    )

Required fields:

model – the Claude model name (e.g., claude-sonnet-4-6).

max_tokens – an upper bound on the number of tokens the model may generate.

messages – a list of role‑annotated messages (user and assistant).

Example request:

message = client.messages.create(
    model=model,
    max_tokens=1000,
    messages=[{"role": "user", "content": "What is quantum computing? Answer in one sentence."}]
)
print(message.content[0].text)
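Besides the text, the returned Message object carries useful metadata, such as the stop reason and token usage:

print(message.stop_reason)  # e.g., "end_turn", or "max_tokens" if the limit was hit
print(message.usage.input_tokens, message.usage.output_tokens)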

Multi‑Turn Conversation

Claude does not retain conversation history, so the client must keep the full message list and resend it with each request.

messages = []

def add_user(messages, text):
    messages.append({"role": "user", "content": text})

def add_assistant(messages, text):
    messages.append({"role": "assistant", "content": text})

def chat(messages, **kwargs):
    resp = client.messages.create(model=model, max_tokens=1000, messages=messages, **kwargs)
    return resp.content[0].text

Typical flow:

Add the initial user question.

Call chat and append the assistant response.

Add a follow‑up question.

Call chat again with the updated messages list.
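Putting those steps together (the prompts are illustrative):

add_user(messages, "What is quantum computing? Answer in one sentence.")
first = chat(messages)
add_assistant(messages, first)  # append the reply so the next turn has context

add_user(messages, "Now explain it to a five-year-old.")
second = chat(messages)         # the full history is resent with this request
add_assistant(messages, second)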

Simple Chatbot Loop

A minimal interactive chatbot can be built with a while True loop:

messages = []
while True:
    user_input = input("> ")
    if user_input.strip().lower() in ("quit", "exit"):  # give the loop a way out
        break
    add_user(messages, user_input)
    answer = chat(messages)
    add_assistant(messages, answer)  # keep history so the next turn has context
    print("---")
    print(answer)
    print("---")

System Prompts

A system prompt lets you steer Claude’s tone, style, or role. Example of a math‑tutor prompt:

system_prompt = """
You are a patient math tutor. Do not give the final answer directly; instead guide the student step by step.
"""
response = client.messages.create(
    model=model,
    max_tokens=1000,
    messages=messages,
    system=system_prompt,
)

Make the chat function accept an optional system argument and forward any extra keyword arguments (such as stop_sequences, used later in this guide) so it can be reused across calls.

def chat(messages, system=None, temperature=1.0, **kwargs):
    params = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature,
        **kwargs,  # pass through extras such as stop_sequences
    }
    if system:
        params["system"] = system
    resp = client.messages.create(**params)
    return resp.content[0].text
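With that in place, the math-tutor prompt from above can be passed per call:

answer = chat(messages, system=system_prompt)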

Temperature Parameter

The temperature (0‑1) controls randomness. Low values (<0.3) produce deterministic, factual output; medium values (0.4‑0.7) balance creativity and reliability; high values (>0.8) encourage diverse, creative responses.

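For example, with the chat helper defined above:

factual = chat(messages, temperature=0.1)   # near-deterministic, good for extraction
creative = chat(messages, temperature=0.9)  # more varied, good for brainstorming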

Response Streaming

Enable streaming to receive partial tokens as they are generated, improving user experience.

# Standard streaming: iterate raw events as they arrive
stream = client.messages.create(
    model=model,
    max_tokens=1000,
    messages=messages,
    stream=True,
)
for event in stream:
    # Each event is typed; text arrives in content_block_delta events
    if event.type == "content_block_delta":
        print(event.delta.text, end="", flush=True)

For a simplified text‑only stream:

with client.messages.stream(model=model, max_tokens=1000, messages=messages) as stream:
    for text in stream.text_stream:
        print(text, end="")
    final_message = stream.get_final_message()  # complete Message, including usage metadata

Structured Data Output

When you need raw JSON (or other structured formats) without surrounding markdown, use a pre‑filled assistant message and a stop sequence.

messages = []
add_user(messages, "Generate a very short EventBridge rule in JSON.")
add_assistant(messages, "```json")  # prefill: Claude continues from here
json_output = chat(messages, stop_sequences=["```"])  # stop before the closing fence
print(json_output)
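Because the reply continues from the prefilled ```json and stops before the closing fence, it can usually be parsed directly:

import json
rule = json.loads(json_output)  # raw JSON, no markdown wrapper to strip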

If the model does not respect the stop sequence, fall back to a system prompt that explicitly requests raw JSON, or post‑process the response with a regular expression:

import re, json

# `result` holds the model's raw reply text
match = re.search(r'```json\s*(\{.*?\})\s*```', result, re.DOTALL)
if match:
    data = json.loads(match.group(1))
else:
    # fall back to grabbing the outermost braces anywhere in the reply
    json_match = re.search(r'\{.*\}', result, re.DOTALL)
    data = json.loads(json_match.group()) if json_match else None

This technique works for any format Claude naturally wraps in a code fence (e.g., Python code, CSV, or plain lists). For endpoints or proxies that ignore stop_sequences, you can instead use a system prompt such as:

system = "Only return JSON, without any explanations or markdown code fences."
result = chat(messages, system=system)
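The same prefill pattern works for other fenced formats; a CSV variant (with an illustrative prompt) looks like this:

messages = []
add_user(messages, "List three EU capitals and their countries as CSV.")
add_assistant(messages, "```csv")  # prefill with a CSV fence
csv_output = chat(messages, stop_sequences=["```"])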

After receiving the raw text, clean up whitespace and parse:

clean_json = json.loads(result.strip())

Key Takeaways

Choose the appropriate Claude model (Opus, Sonnet, Haiku) based on intelligence, cost, and latency requirements.

Never expose the API key in client‑side code; always proxy requests through a secure backend.

Maintain the full message history client‑side to enable multi‑turn conversations.

Use system prompts to control tone, style, or output format.

Adjust temperature to trade off determinism versus creativity.

Enable stream=True for real‑time token delivery, and use the simplified text stream when only the final text is needed.

For structured outputs, combine a pre‑filled assistant message with stop_sequences or a strict system prompt, and apply post‑processing if necessary.
