Master the Claude API: From Setup to System Prompts, Streaming, and Clean JSON Extraction
This comprehensive guide walks you through Claude Code model selection, API authentication, request construction, multi‑turn conversation handling, system prompts, temperature tuning, streaming responses, and clean JSON extraction, providing practical Python examples for building robust AI‑powered applications.
1. Claude Code Model Overview
Claude Code currently offers three models—Opus, Sonnet, and Haiku—each balancing intelligence, cost, and latency differently. Opus is the most capable but expensive and slower; Haiku is the cheapest and fastest with lower intelligence; Sonnet provides a middle ground.
Most teams mix models: use Haiku for real‑time UI, Sonnet for core logic, and Opus for complex reasoning.
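That mixing strategy can be sketched as a small routing helper. The tier names and model IDs below are illustrative placeholders of my own, not official identifiers; check the Anthropic docs for current model names.

```python
# Hypothetical routing table: map each workload tier to a model name.
MODEL_BY_TIER = {
    "realtime": "claude-haiku-x",   # low latency, low cost UI interactions
    "core": "claude-sonnet-x",      # balanced default for application logic
    "deep": "claude-opus-x",        # maximum capability for hard reasoning
}

def pick_model(tier: str) -> str:
    """Return the model for a tier, falling back to the balanced default."""
    return MODEL_BY_TIER.get(tier, MODEL_BY_TIER["core"])
```

Centralizing the choice in one function makes it easy to swap models later without touching call sites.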
2. Accessing the API
The request lifecycle follows five predictable stages: client → server → Anthropic API → model processing → server response → client.
Client sends request to your server.
Server forwards it to the Anthropic API.
The model processes the input.
The API returns a structured response.
The client displays the result.
Never expose your API key in client‑side code; always route requests through a secure backend.
In early experiments I assumed the client could call the API directly, but a backend server is needed to proxy requests safely.
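A minimal sketch of the server side of such a proxy (the function names are my own, not part of any SDK): the browser sends only the user's text, while the API key and the upstream request body live on your server.

```python
import os

def build_upstream_body(user_text, model="claude-sonnet-4-6", max_tokens=1000):
    """JSON body the backend forwards to the Anthropic Messages API."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": user_text}],
    }

def auth_headers():
    """Auth headers stay server-side; the key is read from the environment."""
    return {
        "x-api-key": os.environ.get("ANTHROPIC_API_KEY", ""),
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
```

Whatever web framework you use, the pattern is the same: the client never sees the key, only your own endpoint.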
Making an API Call
Install the Anthropic SDK in a Jupyter notebook (or VS Code) and load your API key from a .env file to keep it out of version control:

ANTHROPIC_API_KEY="your-api-key-here"

Load the key and create a client:
from dotenv import load_dotenv
load_dotenv()
from anthropic import Anthropic
client = Anthropic()
model = "claude-sonnet-4-6"

Call client.messages.create() with three required parameters:
model – the Claude model name.
max_tokens – an upper bound on the generated token count.
messages – a list of role‑content dictionaries representing the conversation history.
message = client.messages.create(
    model=model,
    max_tokens=1000,
    messages=[{"role": "user", "content": "What is quantum computing? Answer in one sentence"}]
)
print(message.content[0].text)

Add error handling for network failures and rate limits.
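One way to add that handling is a generic retry wrapper with exponential backoff. The helper name and backoff policy here are my own choices; with the Anthropic SDK you would typically pass exceptions such as anthropic.RateLimitError and anthropic.APIConnectionError as the retryable tuple.

```python
import time

def call_with_retries(fn, retryable=(ConnectionError,), max_attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on retryable errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

Usage would look like call_with_retries(lambda: client.messages.create(...)), so the API call itself stays unchanged.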
Multi‑turn Conversations
Claude does not retain conversation state, so you must maintain a messages list yourself and resend the full history on each request.
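Because the full history is resent on every call, long conversations keep growing; a hedged sketch of one trimming policy (the helper name and keep-last-N approach are my own, not from the SDK):

```python
def trim_history(messages, max_messages=10):
    """Keep only the most recent messages, starting on a user turn so the
    alternating user/assistant pattern stays valid."""
    trimmed = messages[-max_messages:]
    # Drop a leading assistant turn left over from the cut.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed
```

More sophisticated strategies summarize older turns instead of dropping them, but a hard cap is a reasonable starting point.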
def add_user_message(messages, text):
    messages.append({"role": "user", "content": text})

def add_assistant_message(messages, text):
    messages.append({"role": "assistant", "content": text})

def chat(messages):
    resp = client.messages.create(model=model, max_tokens=1000, messages=messages)
    return resp.content[0].text

Example workflow:
messages = []
add_user_message(messages, "Define quantum computing in one sentence")
answer = chat(messages)
add_assistant_message(messages, answer)
add_user_message(messages, "Write another sentence")
final_answer = chat(messages)

System Prompts
System prompts let you steer Claude’s behavior (tone, style, role). For a math tutor you might use:
system_prompt = """
You are a patient math tutor.
Do not give direct answers; guide the student step by step.
"""
message = client.messages.create(model=model, max_tokens=1000, messages=messages, system=system_prompt)

Omitting the system parameter (or passing system=None) keeps Claude's default behavior.
Temperature Control
The temperature parameter (0‑1) adjusts randomness. Low values (<0.3) yield deterministic, factual output; high values (>0.8) produce creative, varied responses. Include it in the request dictionary:
def chat(messages, system=None, temperature=1.0, stop_sequences=None):
    params = {"model": model, "max_tokens": 1000, "messages": messages, "temperature": temperature}
    if system:
        params["system"] = system
    if stop_sequences:
        params["stop_sequences"] = stop_sequences
    resp = client.messages.create(**params)
    return resp.content[0].text

Streaming Responses
Enable stream=True to receive partial tokens as they are generated, improving UI responsiveness.
stream = client.messages.create(model=model, max_tokens=1000, messages=messages, stream=True)
for event in stream:
    print(event)

For a simplified text-only stream:
with client.messages.stream(model=model, max_tokens=1000, messages=messages) as stream:
    for text in stream.text_stream:
        print(text, end="")
    final_message = stream.get_final_message()

Getting Clean Structured Data
When you need raw JSON (or code) without markdown wrappers, pre‑seed the assistant with the opening fence and use a stop sequence to cut off the closing fence.
messages = []
add_user_message(messages, "Generate a very short EventBridge rule in JSON")
add_assistant_message(messages, "```json")
json_output = chat(messages, stop_sequences=["```"])
print(json_output)

If the model or proxy does not honor stop sequences, you can enforce the format with a system prompt or post-process the response using regex to extract the JSON block.
import re, json

# `result` holds the raw model response text, which may include markdown fences.
match = re.search(r'```json\s*(\{.*?\})\s*```', result, re.DOTALL)
if match:
    data = json.loads(match.group(1))
else:
    # fall back to the first JSON object found anywhere in the text
    data = json.loads(re.search(r'\{.*\}', result, re.DOTALL).group())

This technique works for any structured output—Python code, CSV, or custom lists—by adjusting the opening fence and stop sequence accordingly.
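That generalization can be sketched as a single helper (the function name is mine) that pulls the body out of any fenced block and falls back to the raw text when no fence is present:

```python
import re

def extract_fenced(text, lang=""):
    """Return the body of the first ```lang fenced block in text,
    or the stripped text itself if no such fence is found."""
    pattern = rf"```{lang}\s*(.*?)```"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()
```

Passing lang="json", lang="python", or an empty string covers each of the formats mentioned above.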
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
