
Build a CLI AI Agent in Just 250 Python Lines

This tutorial walks through seven incremental stages—starting with a simple while‑True loop and adding tool‑calling, dynamic skill loading, slash commands, JSON persistence, automatic context compression, and a background timed loop—to create a fully functional CLI AI Agent using Ollama and the local qwen3.5 model without GPU or API keys.

Data STUDIO

Jun 1, 2026

Stage 1: while True loop – the agent skeleton

The core loop reads user input, appends it to a messages list, calls ollama.chat, prints the response, and stores the assistant reply. This 15‑line snippet provides basic chat capability but blocks until the model finishes generating the whole answer.

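import ollama

model_name = 'qwen3.5:9b'  # model previously downloaded with Ollama
messages = []              # full conversation history sent on every call

while True:
    user_input = input("\nYou: ").strip()
    if user_input.lower() in ('quit', 'exit'):
        break
    if not user_input:
        continue  # ignore empty lines instead of sending them to the model
    messages.append({'role': 'user', 'content': user_input})
    response = ollama.chat(model=model_name, messages=messages)
    content = response['message']['content']
    print(content)
    # Store the reply so the next turn sees the whole conversation.
    messages.append({'role': 'assistant', 'content': content})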

A streaming helper separates the model's "thinking" output from the final answer.

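def stream_with_thinking(model, messages, tools=None):
    """Stream a reply, printing the thinking text and the answer separately.

    Returns (full_content, tool_calls) so that the later stages, which
    unpack two values and pass tools=..., can reuse this helper.
    """
    response_stream = ollama.chat(model=model, messages=messages, tools=tools, stream=True)
    full_content = ""
    tool_calls = []
    is_thinking = False
    answer_started = False
    print("\nQwen is thinking...")
    for chunk in response_stream:
        msg = chunk.message
        if getattr(msg, 'thinking', None):
            if not is_thinking:
                print("\n[THOUGHT PROCESS]:")
                is_thinking = True
            print(msg.thinking, end='', flush=True)
        elif msg.content:
            # First answer token after the thinking phase: print the header once.
            if is_thinking and not answer_started:
                print("\n\n[FINAL ANSWER]:")
                is_thinking = False
                answer_started = True
            print(msg.content, end='', flush=True)
            full_content += msg.content
        # Collect any tool calls the model requests while streaming.
        if getattr(msg, 'tool_calls', None):
            tool_calls.extend(msg.tool_calls)
    print()
    return full_content, tool_calls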
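
The separation logic can be exercised without a running Ollama server. The sketch below is illustrative only: `split_stream` and `fake_chunk` are hypothetical names, and the stand-in chunks assume, as the helper above does, that each chunk exposes `message.thinking` and `message.content`.

```python
from types import SimpleNamespace

def split_stream(chunks):
    """Collect thinking text and answer text from an iterable of chunks.

    Assumption: each chunk has a .message with optional .thinking and
    .content attributes, mirroring what stream_with_thinking reads.
    """
    thinking_parts, answer_parts = [], []
    for chunk in chunks:
        msg = chunk.message
        thinking = getattr(msg, 'thinking', None)
        if thinking:
            thinking_parts.append(thinking)
        elif getattr(msg, 'content', None):
            answer_parts.append(msg.content)
    return "".join(thinking_parts), "".join(answer_parts)

def fake_chunk(thinking=None, content=None):
    """Build a stand-in chunk (hypothetical, for offline testing only)."""
    return SimpleNamespace(message=SimpleNamespace(thinking=thinking, content=content))

demo = [
    fake_chunk(thinking="User greets me. "),
    fake_chunk(thinking="Reply briefly."),
    fake_chunk(content="Hello"),
    fake_chunk(content=", world!"),
]
thoughts, answer = split_stream(demo)
print(thoughts)  # User greets me. Reply briefly.
print(answer)    # Hello, world!
```

Only the answer text belongs in the message history; the thinking text is for display.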

Stage 2: Tool‑calling protocol

A tools list follows the OpenAI‑compatible function‑calling schema. Each tool defines a description (the LLM’s cue) and a JSON‑Schema parameters object that marks required fields.

tools = [
    {
        'type': 'function',
        'function': {
            'name': 'read_text_file',
            'description': 'Read the contents of a local text file.',
            'parameters': {
                'type': 'object',
                'properties': {
                    'path': {'type': 'string', 'description': 'File path'}
                },
                'required': ['path']
            }
        }
    },
    {
        'type': 'function',
        'function': {
            'name': 'get_current_datetime',
            'description': 'Get the current local date and time.',
            'parameters': {'type': 'object', 'properties': {}}
        }
    },
]
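
The `required` list can also be checked on the Python side before a tool runs, since a model may occasionally omit an argument. The following is a minimal sketch, not part of the original agent; `missing_required` is a hypothetical helper and `read_tool` repeats the first schema entry so the block runs on its own.

```python
def missing_required(tool_schema, args):
    """Return the required parameter names that are absent from args."""
    params = tool_schema['function'].get('parameters', {})
    required = params.get('required', [])
    return [name for name in required if name not in args]

# Copy of the read_text_file schema entry, repeated here for self-containment.
read_tool = {
    'type': 'function',
    'function': {
        'name': 'read_text_file',
        'description': 'Read the contents of a local text file.',
        'parameters': {
            'type': 'object',
            'properties': {'path': {'type': 'string', 'description': 'File path'}},
            'required': ['path'],
        },
    },
}

print(missing_required(read_tool, {}))                 # ['path']
print(missing_required(read_tool, {'path': 'a.txt'}))  # []
```

A non-empty result can be returned to the model as the tool output, so it gets a chance to retry with the missing argument.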

Three practical points: a concise description guides the LLM; follow JSON‑Schema for parameters; make tool functions tolerant—return an error message instead of crashing on invalid input.
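
The article calls `read_text_file` but does not show its body. One possible tolerant implementation is sketched here; it is an assumption rather than the author's code, and the UTF-8 encoding and the error wording are illustrative choices.

```python
import os

def read_text_file(path):
    """Return the file's text, or a readable error message on failure."""
    if not path:
        return "Error: no path was provided."
    if not os.path.isfile(path):
        return f"Error: file not found: {path}"
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()
    except (OSError, UnicodeDecodeError) as exc:
        # Report the problem to the model instead of crashing the agent loop.
        return f"Error: could not read {path}: {exc}"
```

Because every branch returns a string, the dispatcher can append the result to the history without special cases.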

A dispatcher handle_tools processes returned tool_calls, executes the matching Python function, truncates results longer than 4 000 characters (keeping the first and last 1 000), and appends the tool output to the message history.

from datetime import datetime

def handle_tools(tool_calls, messages):
    """Run each requested tool, append its output, then ask the model for the final answer."""
    for tool in tool_calls:
        name = tool.function.name
        args = tool.function.arguments or {}
        if name == 'read_text_file':
            res = read_text_file(args.get('path', ''))
        elif name == 'get_current_datetime':
            res = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        else:
            res = "Unknown tool."
        res = str(res)
        # Keep only the head and tail of very long results.
        if len(res) > 4000:
            res = res[:1000] + "\n...[TRUNCATED]..." + res[-1000:]
        messages.append({'role': 'tool', 'content': res})
    # The tools-aware stream_with_thinking returns (content, tool_calls).
    final_content, _ = stream_with_thinking(model_name, messages)
    return {'role': 'assistant', 'content': final_content}
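
As the number of tools grows, the if/elif chain can be swapped for a lookup table. The sketch below is an alternative design, not the article's code: `TOOL_REGISTRY`, `run_tool`, and `truncate_result` are hypothetical names, and only the truncation limits (4 000 characters, 1 000 kept at each end) come from the text above.

```python
from datetime import datetime

MAX_RESULT_CHARS = 4000  # truncate anything longer than this
KEEP_CHARS = 1000        # characters kept from each end

def truncate_result(text, limit=MAX_RESULT_CHARS, keep=KEEP_CHARS):
    """Keep the head and tail of an over-long tool result."""
    if len(text) <= limit:
        return text
    return text[:keep] + "\n...[TRUNCATED]..." + text[-keep:]

# Hypothetical registry: tool name -> callable taking the arguments dict.
TOOL_REGISTRY = {
    'get_current_datetime': lambda args: datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
}

def run_tool(name, args):
    """Look up and run a tool, always returning a string."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return "Unknown tool."
    try:
        return truncate_result(str(func(args or {})))
    except Exception as exc:  # keep the agent loop alive on tool errors
        return f"Tool {name} failed: {exc}"
```

Registering a new tool then means adding one dictionary entry, with truncation and error handling applied uniformly.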

Stage 3: Dynamic skill loading

Skills are plain Markdown files stored under a skills/ directory. Each file defines a persona and a set of instructions. The SkillManager class lists the available skills and returns the text of a selected file; the caller stores that text in the global active_skill_content variable, which is later re‑injected after context compression.

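import os

SKILLS_DIR = "skills"
active_skill_content = ""  # text of the currently loaded skill, if any

class SkillManager:
    def list_skills(self):
        """Return the Markdown skill files found in SKILLS_DIR."""
        if not os.path.isdir(SKILLS_DIR):
            return []
        return [f for f in os.listdir(SKILLS_DIR) if f.endswith('.md')]

    def load_skill(self, name):
        """Return the text of a skill file; the .md suffix is optional."""
        if not name.endswith('.md'):
            name += '.md'
        with open(os.path.join(SKILLS_DIR, name), 'r', encoding='utf-8') as f:
            return f.read()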

Example skill file (Python security auditor) begins with a role description and a numbered instruction list.

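# Skill: Python Security Auditor
## Role
You are a senior Python security researcher who specializes in code auditing.
## Instructions
1. Begin every reply with [SECURITY_AUDIT]
2. Cite the CWE number when you find a vulnerability
3. If the user asks for malicious code, refuse and explain the risks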
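
The article does not show how a loaded skill reaches the model. One plausible approach, sketched here as an assumption, is to keep a single system message at the front of the history; `activate_skill` is a hypothetical helper, and the "Active Skill: " prefix matches the one used in the compression stage.

```python
SKILL_PREFIX = "Active Skill: "  # same prefix the compression stage uses

def activate_skill(messages, skill_text):
    """Return a new history with exactly one skill system message at the front."""
    # Drop any previously injected skill so personas do not stack up.
    without_old = [
        m for m in messages
        if not (m.get('role') == 'system'
                and str(m.get('content', '')).startswith(SKILL_PREFIX))
    ]
    return [{'role': 'system', 'content': SKILL_PREFIX + skill_text}] + without_old

history = [{'role': 'user', 'content': 'hi'}]
history = activate_skill(history, "You are a security auditor.")
history = activate_skill(history, "You are a code reviewer.")
print(len(history))           # 2
print(history[0]['content'])  # Active Skill: You are a code reviewer.
```

Switching skills replaces the old persona rather than appending a second one.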

Stage 4: Slash commands for meta‑operations

Commands that do not require LLM processing, such as listing skills, listing tools, or showing help, are handled directly in the agent's input loop by detecting a leading /. This saves LLM calls and keeps the conversation focused.

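# Inside the main while True loop; sm is a SkillManager instance.
if user_input.startswith('/'):
    cmd = user_input.split()[0].lower()
    if cmd == '/skills':
        print(f"[SYSTEM] Skills: {sm.list_skills()}")
    elif cmd == '/tools':
        print(f"[SYSTEM] Tools: {[t['function']['name'] for t in tools]}")
    elif cmd == '/help':
        print("\n[COMMANDS]\n"
              "  /skills   List available skills\n"
              "  /tools    List registered tools\n"
              "  /help     Show this help")
    else:
        print(f"[SYSTEM] Unknown command: {cmd}")
    continue  # never forward slash commands to the model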
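
Commands that take an argument, such as /history-load <id> in Stage 5, need the rest of the line as well as the command name. A minimal parsing sketch follows; `parse_command` is a hypothetical helper and the session ID in the example is made up.

```python
def parse_command(line):
    """Split a slash command into (command, arguments).

    Returns (None, []) when the line is not a slash command.
    """
    line = line.strip()
    if not line.startswith('/'):
        return None, []
    parts = line.split()
    return parts[0].lower(), parts[1:]

print(parse_command("/Help"))                              # ('/help', [])
print(parse_command("/history-load 2026-06-01_10-00-00"))  # ('/history-load', ['2026-06-01_10-00-00'])
print(parse_command("what time is it?"))                   # (None, [])
```

Only the command name is lower-cased; arguments keep their case so file names and IDs survive intact.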

Stage 5: JSON session persistence

To avoid losing conversation history when the terminal closes, the messages list is serialized to a timestamped JSON file. Ollama’s tool_calls objects are not plain dicts, so they are converted with .model_dump() before calling json.dump.

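import json, os
from datetime import datetime

HISTORY_DIR = "history"
os.makedirs(HISTORY_DIR, exist_ok=True)
current_session_id = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

def save_history(messages):
    """Write the conversation to history/<session id>.json."""
    serializable = []
    for m in messages:
        if isinstance(m, dict):
            m_copy = dict(m)
            if m_copy.get('tool_calls'):
                # Ollama tool-call objects are not plain dicts; convert them first.
                m_copy['tool_calls'] = [tc.model_dump() if hasattr(tc, 'model_dump') else tc for tc in m_copy['tool_calls']]
            serializable.append(m_copy)
        elif hasattr(m, 'model_dump'):
            # Message objects stored as-is are converted rather than silently dropped.
            serializable.append(m.model_dump())
    # ensure_ascii=False writes non-ASCII text verbatim, so fix the file encoding.
    with open(os.path.join(HISTORY_DIR, f"{current_session_id}.json"), 'w', encoding='utf-8') as f:
        json.dump(serializable, f, indent=4, ensure_ascii=False)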

Additional commands /history-list and /history-load <id> let the user browse and reload previous sessions.
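
The bodies of these two commands are not shown in the article. A minimal sketch of what they might call is given below; `list_sessions` and `load_history` are hypothetical names, and the only assumption about the files is the `<session id>.json` layout described above.

```python
import json
import os

def list_sessions(history_dir="history"):
    """Return saved session IDs (file names without .json), sorted by name.

    Timestamped IDs such as 2026-06-01_10-00-00 sort chronologically.
    """
    if not os.path.isdir(history_dir):
        return []
    return sorted(f[:-5] for f in os.listdir(history_dir) if f.endswith('.json'))

def load_history(session_id, history_dir="history"):
    """Load one session; return [] if the file is missing or not valid JSON."""
    path = os.path.join(history_dir, f"{session_id}.json")
    try:
        with open(path, 'r', encoding='utf-8') as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        return []
    return data if isinstance(data, list) else []
```

Returning an empty list on failure lets the agent fall back to a fresh conversation instead of crashing on a damaged file.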

Stage 6: Automatic context compression

When the estimated token count exceeds CONTEXT_THRESHOLD = 4000 (≈ 16 000 characters at roughly 4 characters per token), the agent summarizes the oldest 70 % of messages and keeps the newest 30 % unchanged. The summary prompt asks the model to produce a single paragraph that preserves key facts and the current goal. If a skill is active, its persona is re‑injected to prevent "forgetting" after compression.

CONTEXT_THRESHOLD = 4000

def estimate_tokens(messages):
    text = "".join([str(m.get('content', '')) for m in messages])
    return len(text) // 4  # rough: 4 chars ≈ 1 token

def compact_history(messages):
    """Summarize the oldest 70 % of the history and keep the newest 30 % verbatim."""
    if len(messages) < 4:
        return messages
    print(f"\n[SYSTEM] Auto-compacting context ({estimate_tokens(messages)} tokens)...")
    split_idx = int(len(messages) * 0.7)
    to_summarize = messages[:split_idx]
    keep_fresh = messages[split_idx:]
    summary_prompt = "Summarize the conversation above in one paragraph, keeping the key facts and the current goal."
    resp = ollama.chat(model=model_name, messages=to_summarize + [{'role': 'user', 'content': summary_prompt}])
    summary = resp['message']['content']
    new_history = [{'role': 'system', 'content': f"PREVIOUS SUMMARY: {summary}"}]
    # Re-inject the active skill first so the persona survives compression.
    if active_skill_content:
        new_history.insert(0, {'role': 'system', 'content': f"Active Skill: {active_skill_content}"})
    new_history.extend(keep_fresh)
    return new_history

The 70/30 split is empirical: the most recent 30 % usually contains the core of the current discussion.
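
The threshold check and the split arithmetic can be tried without a model by passing the summarizer in as a function. This is a sketch under that assumption: `maybe_compact` and its `summarize` parameter are hypothetical, while the threshold, the 4-characters-per-token estimate, and the 70/30 split come from the text above.

```python
CONTEXT_THRESHOLD = 4000  # tokens, as in the article

def estimate_tokens(messages):
    """Rough estimate: about 4 characters per token."""
    text = "".join(str(m.get('content', '')) for m in messages)
    return len(text) // 4

def maybe_compact(messages, summarize, threshold=CONTEXT_THRESHOLD, ratio=0.7):
    """Compact only when the estimate exceeds the threshold.

    summarize is a hypothetical callable: list of messages -> summary string.
    """
    if estimate_tokens(messages) <= threshold or len(messages) < 4:
        return messages
    split_idx = int(len(messages) * ratio)
    summary = summarize(messages[:split_idx])
    head = [{'role': 'system', 'content': f"PREVIOUS SUMMARY: {summary}"}]
    return head + messages[split_idx:]

# 10 messages of 2 000 characters each: 20 000 chars // 4 = 5 000 tokens > 4 000.
history = [{'role': 'user', 'content': 'x' * 2000} for _ in range(10)]
compacted = maybe_compact(history, summarize=lambda msgs: f"{len(msgs)} messages summarized")
print(len(compacted))           # 4  (1 summary + the newest 3 messages)
print(compacted[0]['content'])  # PREVIOUS SUMMARY: 7 messages summarized
```

In the real agent the summarizer would be the `ollama.chat` call; running the check after every turn keeps the history bounded.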

Stage 7: Background timed loop

A non‑blocking background thread periodically sends a predefined prompt to the agent. The loop sleeps in 1‑second slices so that a /stop-loop command takes effect within about a second while the loop is waiting, and it builds its own loop_messages list to keep the main conversation untouched.

import threading, time
stop_event = threading.Event()

def background_loop(prompt, interval_mins):
    """Send prompt to the model every interval_mins minutes until stop_event is set."""
    print(f"\n[SYSTEM] Loop started: '{prompt}' every {interval_mins} min(s).")
    while not stop_event.is_set():
        # Sleep in 1-second slices so /stop-loop takes effect quickly.
        for _ in range(int(interval_mins * 60)):
            if stop_event.is_set():
                return
            time.sleep(1)
        # A private message list keeps the main conversation untouched.
        loop_messages = []
        if active_skill_content:
            loop_messages.append({'role': 'system', 'content': f"Context: {active_skill_content}"})
        loop_messages.append({'role': 'user', 'content': prompt})
        content, tool_calls = stream_with_thinking(model_name, loop_messages, tools=tools)
        if tool_calls:
            loop_messages.append({'role': 'assistant', 'content': content, 'tool_calls': tool_calls})
            handle_tools(tool_calls, loop_messages)

Design decisions: (1) 1‑second sleep slices enable near‑real‑time response to /stop-loop; (2) a separate loop_messages list isolates background activity from the foreground chat history.
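
The article does not show how the thread is started and stopped. The sketch below illustrates the sliced-sleep pattern with a stand-in task in place of the model call; `interruptible_sleep` and `start_loop` are hypothetical names, and the short intervals are chosen only so the demo finishes quickly.

```python
import threading
import time

def interruptible_sleep(stop_event, total_seconds, slice_seconds=1.0):
    """Sleep in slices; return True if stop_event was set before time ran out."""
    deadline = time.monotonic() + total_seconds
    while True:
        if stop_event.is_set():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(slice_seconds, remaining))

def start_loop(task, interval_seconds, slice_seconds=1.0):
    """Run task every interval_seconds in a daemon thread until stopped."""
    stop_event = threading.Event()

    def worker():
        # Leave the loop as soon as a sleep is interrupted by the stop event.
        while not interruptible_sleep(stop_event, interval_seconds, slice_seconds):
            task()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread, stop_event

runs = []
thread, stop = start_loop(lambda: runs.append(time.monotonic()),
                          interval_seconds=0.05, slice_seconds=0.01)
time.sleep(0.3)          # let the stand-in task fire a few times
stop.set()               # what a /stop-loop handler would do
thread.join(timeout=2)
print(thread.is_alive())  # False
```

A daemon thread also ends with the main program, so quitting the agent does not leave the loop running.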

Full architecture recap

The agent consists of three logical layers: (1) the perpetual while True loop that routes user input; (2) the tool‑calling and skill‑management subsystem that extends functionality; (3) auxiliary services—slash commands, JSON persistence, context compression, and the background loop—that improve usability. All of this fits within 250 lines of Python, demonstrating that the essential kernel of an AI agent is a simple routing loop where the LLM decides which tool to invoke.


