Master Claude API: From Setup to Advanced RAG, Prompts, and Streaming

This comprehensive guide walks you through Claude Code model selection, API authentication, request construction, multi‑turn conversation handling, system prompts, temperature tuning, streaming responses, and clean JSON extraction, providing practical Python examples for building robust AI‑powered applications.


1. Claude Code Model Overview

Claude Code currently offers three models—Opus, Sonnet, and Haiku—each balancing intelligence, cost, and latency differently. Opus is the most capable but expensive and slower; Haiku is the cheapest and fastest with lower intelligence; Sonnet provides a middle ground.

[Figure: model selection diagram]

Most teams mix models: use Haiku for real‑time UI, Sonnet for core logic, and Opus for complex reasoning.
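
As a sketch, that routing can be a simple lookup table. The task labels below are my own, and the Haiku and Opus model IDs are assumptions, so check Anthropic's model list for current names:

MODEL_BY_TASK = {
    "autocomplete": "claude-haiku-4-5",   # real-time UI: fastest, cheapest (assumed ID)
    "chat":         "claude-sonnet-4-6",  # core logic: balanced
    "analysis":     "claude-opus-4-1",    # complex reasoning: most capable (assumed ID)
}

def pick_model(task: str) -> str:
    # Unknown task labels fall back to the mid-tier model
    return MODEL_BY_TASK.get(task, "claude-sonnet-4-6")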

2. Accessing the API

The request lifecycle follows five predictable stages: client → server → Anthropic API → model processing → server response → client.

1. The client sends a request to your server.
2. The server forwards it to the Anthropic API.
3. The model processes the input.
4. The API returns a structured response.
5. The client displays the result.

[Figure: five-step request flow]

Never expose your API key in client‑side code; always route requests through a secure backend.

In early experiments I assumed the client could call the Anthropic API directly, but in practice you need your own backend server to proxy requests safely; a minimal sketch follows.
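
Here is a minimal sketch of such a proxy, assuming Flask; the /chat route and the "message" field are my own naming, not part of the Anthropic API:

from flask import Flask, request, jsonify
from anthropic import Anthropic

app = Flask(__name__)
client = Anthropic()  # reads ANTHROPIC_API_KEY from the server's environment

@app.route("/chat", methods=["POST"])
def chat_endpoint():
    # Only this server ever sees the API key; the browser just calls /chat
    user_text = request.json["message"]
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1000,
        messages=[{"role": "user", "content": user_text}],
    )
    return jsonify({"reply": message.content[0].text})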

Making an API Call

Install the Anthropic SDK and python-dotenv (pip install anthropic python-dotenv) in a Jupyter notebook or VS Code, and keep your API key in a .env file so it stays out of version control:

ANTHROPIC_API_KEY="your-api-key-here"

Load the key and create a client:

from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()  # reads ANTHROPIC_API_KEY from .env into the environment

client = Anthropic()  # picks the key up from the environment automatically
model = "claude-sonnet-4-6"

Call client.messages.create() with three required parameters:

model – the Claude model name.

max_tokens – an upper bound on the generated token count.

messages – a list of role‑content dictionaries representing the conversation history.

message = client.messages.create(
    model=model,
    max_tokens=1000,
    messages=[{"role": "user", "content": "What is quantum computing? Answer in one sentence"}]
)
print(message.content[0].text)

Add error handling for network failures and rate limits, as sketched below.
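
A minimal sketch of that handling, using exception classes the anthropic SDK provides:

import anthropic

try:
    message = client.messages.create(
        model=model,
        max_tokens=1000,
        messages=[{"role": "user", "content": "What is quantum computing? Answer in one sentence"}],
    )
    print(message.content[0].text)
except anthropic.RateLimitError:
    print("Rate limited; wait and retry with backoff")
except anthropic.APIConnectionError as exc:
    print(f"Network failure: {exc}")
except anthropic.APIStatusError as exc:
    print(f"API error (status {exc.status_code}): {exc}")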

Multi‑turn Conversations

Claude does not retain conversation state, so you must maintain a messages list yourself and resend the full history on each request.

def add_user_message(messages, text):
    messages.append({"role": "user", "content": text})

def add_assistant_message(messages, text):
    messages.append({"role": "assistant", "content": text})

def chat(messages):
    resp = client.messages.create(model=model, max_tokens=1000, messages=messages)
    return resp.content[0].text

Example workflow:

messages = []
add_user_message(messages, "Define quantum computing in one sentence")
answer = chat(messages)
add_assistant_message(messages, answer)
add_user_message(messages, "Write another sentence")
final_answer = chat(messages)

System Prompts

System prompts let you steer Claude’s behavior (tone, style, role). For a math tutor you might use:

system_prompt = """
You are a patient math tutor.
Do not give direct answers; guide the student step by step.
"""
message = client.messages.create(model=model, max_tokens=1000, messages=messages, system=system_prompt)

The system parameter is optional; leave it out to keep Claude's default behavior (the helper below only includes it when set).

Temperature Control

The temperature parameter (0‑1) adjusts randomness. Low values (<0.3) yield deterministic, factual output; high values (>0.8) produce creative, varied responses. Include it in the request dictionary:

def chat(messages, system=None, temperature=1.0, stop_sequences=None):
    params = {"model": model, "max_tokens": 1000, "messages": messages, "temperature": temperature}
    if system:
        params["system"] = system
    if stop_sequences:
        params["stop_sequences"] = stop_sequences  # used later for structured output
    resp = client.messages.create(**params)
    return resp.content[0].text
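
For example, the same prompt at both ends of the range (the prompt text is my own):

messages = []
add_user_message(messages, "Describe quantum computing in one sentence")
precise = chat(messages, temperature=0.1)  # stable, factual wording
varied = chat(messages, temperature=0.9)   # noticeably different phrasing per run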

Streaming Responses

Enable stream=True to receive partial tokens as they are generated, improving UI responsiveness.

stream = client.messages.create(model=model, max_tokens=1000, messages=messages, stream=True)
for event in stream:
    # Raw events: message_start, content_block_start, content_block_delta,
    # content_block_stop, message_delta, message_stop
    print(event)

For a simplified text‑only stream:

with client.messages.stream(model=model, max_tokens=1000, messages=messages) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # flush so text appears immediately
    final_message = stream.get_final_message()  # full Message once the stream ends

Getting Clean Structured Data

When you need raw JSON (or code) without markdown wrappers, pre‑seed the assistant with the opening fence and use a stop sequence to cut off the closing fence.

messages = []
add_user_message(messages, "Generate a very short EventBridge rule in JSON")
add_assistant_message(messages, "```json")  # pre-seed the opening fence
json_output = chat(messages, stop_sequences=["```"])  # stop before the closing fence
print(json_output)

If the model or proxy does not honor stop sequences, you can enforce format with a system prompt or post‑process the response using regex to extract the JSON block.

import re, json

# 'result' holds the raw model response text, fences included
match = re.search(r'```json\s*(\{.*?\})\s*```', result, re.DOTALL)
if match:
    data = json.loads(match.group(1))
else:
    # Fall back to the first JSON object anywhere in the text
    data = json.loads(re.search(r'\{.*\}', result, re.DOTALL).group())

This technique works for any structured output—Python code, CSV, or custom lists—by adjusting the opening fence and stop sequence accordingly.
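
For instance, the same pattern applied to Python code (the prompt is my own):

messages = []
add_user_message(messages, "Write a Python function that reverses a string")
add_assistant_message(messages, "```python")  # pre-seed a Python fence instead of JSON
code_output = chat(messages, stop_sequences=["```"])
print(code_output)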

Tags: Python, prompt engineering, RAG, Streaming, AI development, Anthropic, Claude API
Written by Su San Talks Tech

Su San, formerly on staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
