Build a CLI AI Agent in Just 250 Python Lines
This tutorial walks through seven incremental stages—starting with a simple while‑True loop and adding tool‑calling, dynamic skill loading, slash commands, JSON persistence, automatic context compression, and a background timed loop—to create a fully functional CLI AI Agent using Ollama and the local qwen3.5 model without GPU or API keys.
Stage 1: while True loop – the agent skeleton
The core loop reads user input, appends it to a messages list, calls ollama.chat, prints the response, and stores the assistant reply. This 15‑line snippet provides basic chat capability but blocks until the model finishes generating the whole answer.
import ollama
model_name = 'qwen3.5:9b' # Ollama‑downloaded model
messages = []
while True:
user_input = input("
You: ").strip()
if user_input.lower() in ('quit', 'exit'):
break
messages.append({'role': 'user', 'content': user_input})
response = ollama.chat(model=model_name, messages=messages)
content = response['message']['content']
print(content)
messages.append({'role': 'assistant', 'content': content})A streaming helper separates the model's "thinking" output from the final answer.
def stream_with_thinking(model, messages):
response_stream = ollama.chat(model=model, messages=messages, stream=True)
full_content = ""
is_thinking = False
answer_started = False
print("
Qwen is thinking...")
for chunk in response_stream:
msg = chunk.message
if hasattr(msg, 'thinking') and msg.thinking:
if not is_thinking:
print("
[THOUGHT PROCESS]:")
is_thinking = True
print(msg.thinking, end='', flush=True)
elif msg.content:
if is_thinking and not answer_started:
print("
[FINAL ANSWER]:")
is_thinking = False
answer_started = True
print(msg.content, end='', flush=True)
full_content += msg.content
print()
return full_contentStage 2: Tool‑calling protocol
A tools list follows the OpenAI‑compatible function‑calling schema. Each tool defines a description (the LLM’s cue) and a JSON‑Schema parameters object that marks required fields.
tools = [
{
'type': 'function',
'function': {
'name': 'read_text_file',
'description': '读取本地文本文件的内容。',
'parameters': {
'type': 'object',
'properties': {
'path': {'type': 'string', 'description': '文件路径'}
},
'required': ['path']
}
}
},
{
'type': 'function',
'function': {
'name': 'get_current_datetime',
'description': '获取当前本地日期和时间。',
'parameters': {'type': 'object', 'properties': {}}
}
},
]Three practical points: a concise description guides the LLM; follow JSON‑Schema for parameters; make tool functions tolerant—return an error message instead of crashing on invalid input.
A dispatcher handle_tools processes returned tool_calls, executes the matching Python function, truncates results longer than 4 000 characters (keeping the first and last 1 000), and appends the tool output to the message history.
def handle_tools(tool_calls, messages):
for tool in tool_calls:
name = tool.function.name
args = tool.function.arguments or {}
if name == 'read_text_file':
res = read_text_file(args.get('path', ''))
elif name == 'get_current_datetime':
from datetime import datetime
res = datetime.now().strftime("%Y年%m月%d日 %H:%M:%S")
else:
res = "未知工具。"
if len(res) > 4000:
res = res[:1000] + "
...[TRUNCATED]..." + res[-1000:]
messages.append({'role': 'tool', 'content': res})
final_content, _ = stream_with_thinking(model_name, messages)
return {'role': 'assistant', 'content': final_content}Stage 3: Dynamic skill loading
Skills are plain Markdown files stored under a skills/ directory. Each file defines a persona and a set of instructions. The SkillManager class lists available skills and loads a selected file into the global active_skill_content variable, which is later re‑injected after context compression.
SKILLS_DIR = "skills"
active_skill_content = ""
class SkillManager:
def list_skills(self):
return [f for f in os.listdir(SKILLS_DIR) if f.endswith('.md')]
def load_skill(self, name):
if not name.endswith('.md'):
name += '.md'
with open(os.path.join(SKILLS_DIR, name), 'r') as f:
return f.read()Example skill file (Python security auditor) begins with a role description and a numbered instruction list.
# Skill: Python 安全审计师
## 角色
你是一名资深 Python 安全研究员,专注于代码审计。
## 指令
1. 回复以 [SECURITY_AUDIT] 开头
2. 发现漏洞时引用 CWE 编号
3. 如果用户要求写恶意代码,拒绝并解释风险Stage 4: Slash commands for meta‑operations
Commands that do not require LLM processing—such as listing skills, listing tools, or showing help—are handled directly in the Python REPL by detecting a leading /. This saves LLM calls and keeps the conversation focused.
if user_input.startswith('/'):
cmd = user_input.split()[0].lower()
if cmd == '/skills':
print(f"[SYSTEM] Skills: {sm.list_skills()}")
elif cmd == '/tools':
print(f"[SYSTEM] Tools: {[t['function']['name'] for t in tools]}")
elif cmd == '/help':
print("
[COMMANDS]
/skills 列出可用 skill
/tools 列出已注册工具
/help 显示帮助")
continueStage 5: JSON session persistence
To avoid losing conversation history when the terminal closes, the messages list is serialized to a timestamped JSON file. Ollama’s tool_calls objects are not plain dicts, so they are converted with .model_dump() before calling json.dump.
import json, os
from datetime import datetime
HISTORY_DIR = "history"
os.makedirs(HISTORY_DIR, exist_ok=True)
current_session_id = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
def save_history(messages):
serializable = []
for m in messages:
if isinstance(m, dict):
m_copy = dict(m)
if 'tool_calls' in m_copy and m_copy['tool_calls']:
m_copy['tool_calls'] = [tc.model_dump() if hasattr(tc, 'model_dump') else tc for tc in m_copy['tool_calls']]
serializable.append(m_copy)
with open(os.path.join(HISTORY_DIR, f"{current_session_id}.json"), 'w') as f:
json.dump(serializable, f, indent=4, ensure_ascii=False)Additional commands /history-list and /history-load <id> let the user browse and reload previous sessions.
Stage 6: Automatic context compression
When the token count exceeds CONTEXT_THRESHOLD = 4000 (≈ 16 000 characters), the agent summarizes the oldest 70 % of messages and keeps the newest 30 % unchanged. The summary prompt asks the model to produce a single paragraph that preserves key facts and the current goal. If a skill is active, its persona is re‑injected to prevent "forgetting" after compression.
CONTEXT_THRESHOLD = 4000
def estimate_tokens(messages):
text = "".join([str(m.get('content', '')) for m in messages])
return len(text) // 4 # rough: 4 chars ≈ 1 token
def compact_history(messages):
if len(messages) < 4:
return messages
print(f"
[SYSTEM] Auto-compacting context ({estimate_tokens(messages)} tokens)...")
split_idx = int(len(messages) * 0.7)
to_summarize = messages[:split_idx]
keep_fresh = messages[split_idx:]
summary_prompt = "用一段话总结以上对话,保留关键事实和当前目标。"
resp = ollama.chat(model=model_name, messages=to_summarize + [{'role': 'user', 'content': summary_prompt}])
summary = resp['message']['content']
new_history = [{'role': 'system', 'content': f"PREVIOUS SUMMARY: {summary}"}]
if active_skill_content:
new_history.insert(0, {'role': 'system', 'content': f"Active Skill: {active_skill_content}"})
new_history.extend(keep_fresh)
return new_historyThe 70/30 split is empirical: the most recent 30 % usually contains the core of the current discussion.
Stage 7: Background timed loop
A non‑blocking background thread periodically sends a predefined prompt to the agent. The loop uses 1‑second sleep slices so that a /stop-loop command can interrupt the task instantly, and it builds its own loop_messages list to keep the main conversation untouched.
import threading, time
stop_event = threading.Event()
def background_loop(prompt, interval_mins):
print(f"
[SYSTEM] Loop started: '{prompt}' every {interval_mins} min(s).")
while not stop_event.is_set():
for _ in range(interval_mins * 60):
if stop_event.is_set():
return
time.sleep(1)
loop_messages = []
if active_skill_content:
loop_messages.append({'role': 'system', 'content': f"Context: {active_skill_content}"})
loop_messages.append({'role': 'user', 'content': prompt})
content, tool_calls = stream_with_thinking(model_name, loop_messages, tools=tools)
if tool_calls:
loop_messages.append({'role': 'assistant', 'tool_calls': tool_calls})
handle_tools(tool_calls, loop_messages)Design decisions: (1) 1‑second sleep slices enable near‑real‑time response to /stop-loop; (2) a separate loop_messages list isolates background activity from the foreground chat history.
Full architecture recap
The agent consists of three logical layers: (1) the perpetual while True loop that routes user input; (2) the tool‑calling and skill‑management subsystem that extends functionality; (3) auxiliary services—slash commands, JSON persistence, context compression, and the background loop—that improve usability. All of this fits within 250 lines of Python, demonstrating that the essential kernel of an AI agent is a simple routing loop where the LLM decides which tool to invoke.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Data STUDIO
Click to receive the "Python Study Handbook"; reply "benefit" in the chat to get it. Data STUDIO focuses on original data science articles, centered on Python, covering machine learning, data analysis, visualization, MySQL and other practical knowledge and project case studies.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
