Mastering AIOps: Prompt Engineering, Function Calling, RAG, Graph RAG, and Local LLM Deployment
This comprehensive guide explores AIOps techniques such as prompt engineering, chat completions, memory management, function calling, fine‑tuning, retrieval‑augmented generation (RAG), and graph‑based RAG, along with practical steps for deploying open‑source large language models locally, with code examples and best‑practice recommendations for modern DevOps environments.
Prompt Engineering
Prompt Engineering refers to designing and optimizing input prompts for AI models to guide them toward more accurate, useful, or creative outputs. Although large models have strong language understanding, their behavior heavily depends on the prompt, making prompt design crucial.
In large models, input prompts are tokenized. For example, the sentence "Hello, world!" is split into four tokens: "Hello", ",", " world", and "!".
Tips: Each model has a maximum context length (e.g., GPT‑3.5: 4096 tokens, GPT‑4: up to 32768 tokens). The combined length of the prompt and output must stay within this limit.
Token count is affected by text length, punctuation, specialized terms, language, and programming syntax. When designing prompts, avoid assuming that longer input yields better answers.
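To make these token counts concrete, the tiktoken library can inspect how text is tokenized. A minimal sketch, assuming the cl100k_base encoding used by GPT-3.5 and GPT-4:
import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo and gpt-4
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Hello, world!")
print(len(tokens))                        # 4
print([enc.decode([t]) for t in tokens])  # ['Hello', ',', ' world', '!']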
Be specific: Replace vague instructions with clear, detailed requirements.
Prioritize: Order multiple requests by importance.
Use standard formats: Structure prompts with headings or markdown for clearer model parsing.
Decompose tasks: Break complex problems into step‑by‑step instructions.
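Combining these tips, a structured prompt template might look like the following sketch (the task and wording are illustrative):
prompt = """# Task
Summarize the incident report below for an on-call engineer.

# Requirements (in priority order)
1. List the affected services.
2. State the probable root cause.
3. Suggest one next step.

# Incident report
{report}"""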
Prompt Development Modes
Zero‑shot and Few‑shot
Zero‑shot prompts provide only a natural‑language description of the task, while few‑shot prompts include example inputs and outputs to improve model accuracy. Few‑shot is generally preferred for operational tasks.
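For instance, a few-shot prompt for log classification can supply labeled examples ahead of the real input; a sketch in which the examples and labels are illustrative:
messages = [
    {"role": "system", "content": "Classify each log line as ERROR, WARN, or INFO."},
    # Few-shot examples showing the expected input/output format
    {"role": "user", "content": "Disk usage at 85% on /var"},
    {"role": "assistant", "content": "WARN"},
    {"role": "user", "content": "Connection to database lost"},
    {"role": "assistant", "content": "ERROR"},
    # The actual query follows the examples
    {"role": "user", "content": "Service started successfully in 2.3s"},
]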
Chain of Thought (CoT)
CoT prompts encourage the model to reason step‑by‑step before delivering the final answer, enhancing explanation and correctness.
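In its simplest form, a CoT prompt appends an explicit reasoning instruction to the question; the wording below is illustrative:
prompt = (
    "A deployment runs 3 replicas, each handling 120 req/s. "
    "Traffic is expected to double. How many replicas are needed? "
    "Think step by step, then give the final answer on its own line."
)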
Tips: Ensure the model can handle the required reasoning steps; otherwise, combine with function calling.
Prompt Chunking
When prompts exceed the model's context limit, split them into smaller chunks, process each chunk separately, and then combine the results.
Fixed‑length chunking (e.g., 1000 tokens per chunk).
Semantic chunking based on sentences, paragraphs, or sections.
Sliding‑window chunking with overlapping regions to preserve context.
Be aware of information loss, increased token cost, and potential inconsistencies when merging chunk results.
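As a minimal sketch, sliding-window chunking over a token list might look like this (the size and overlap values are illustrative):
def chunk_tokens(tokens, size=1000, overlap=100):
    # Consecutive chunks share `overlap` tokens so context spans chunk boundaries
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

chunks = chunk_tokens(list(range(2500)))  # 3 chunks: 0-999, 900-1899, 1800-2499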
Chat Completions, Memory, JSON Mode
Chat Completions
Chat Completions is an API for multi‑turn conversations, supporting roles: system (sets behavior), user (input), and assistant (model output).
{
  "model": "gpt-3.5-turbo",
  "messages": [
    {"role": "system", "content": "You are a translation assistant."},
    {"role": "user", "content": "Translate 'Artificial intelligence is a key future technology.'"},
    {"role": "assistant", "content": "Artificial intelligence is one of the key technologies of the future."},
    {"role": "user", "content": "Thanks!"}
  ],
  "temperature": 0.7
}
To maintain context across turns, append the previous user and assistant messages to the messages list with each new request.
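A minimal sketch of this pattern, assuming an OpenAI-compatible client and endpoint:
from openai import OpenAI

client = OpenAI(api_key="...")
history = [{"role": "system", "content": "You are a translation assistant."}]
while True:
    history.append({"role": "user", "content": input("You: ")})
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
    reply = response.choices[0].message.content
    # Append the assistant's reply so the next turn sees the full conversation
    history.append({"role": "assistant", "content": reply})
    print(reply)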
LangChain Memory
LangChain provides several memory types for managing conversation history:
ConversationBufferMemory: Stores all messages in order.
ConversationBufferWindowMemory: Keeps only the most recent N messages.
ConversationTokenBufferMemory: Truncates based on total token count.
ConversationSummaryMemory: Summarizes earlier messages to save tokens.
ConversationKGMemory: Builds a knowledge graph from the dialogue.
RedisChatMessageHistory: Persists messages in a Redis database so history survives restarts.
Example using ConversationBufferMemory:
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
# Prompt template with a placeholder that the memory fills with past messages
prompt = ChatPromptTemplate.from_messages([
    MessagesPlaceholder(variable_name="chat_history"),
    HumanMessagePromptTemplate.from_template("{text}"),
])
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
llm = ChatOpenAI(model_name="gpt-4o-mini", openai_api_key="...", base_url="https://api.siliconflow.cn/v1")
chain = LLMChain(llm=llm, prompt=prompt, memory=memory)
while True:
    user_input = input("Enter question: ")
    response = chain.invoke({"text": user_input})
    print(response["text"])
JSON Mode
JSON Mode forces the model to output valid JSON, useful for downstream processing. Example:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract a JSON object with fields 'service_name' and 'action' from the user request."},
        {"role": "user", "content": "Help me restart the pay service"}
    ]
)
print(response.choices[0].message.content)
Output:
{"service_name": "pay", "action": "restart"}
Note: Ensure the target model supports function calling and JSON mode.
Function Calling
Function Calling enables LLMs to invoke predefined tools or APIs based on user intent, allowing integration with external systems such as databases, monitoring tools, or custom scripts.
tools = [{
    "type": "function",
    "function": {
        "name": "get_curr_weather",
        "description": "Get current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "description": "Unit"}
            },
            "required": ["location"]
        }
    }
}]
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Shenzhen?"}],
    tools=tools,
    tool_choice="auto"
)
The model returns a tool_calls object indicating the function name and arguments. The application parses the arguments, executes the corresponding function, and feeds the result back to the model for a final response.
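A minimal sketch of that round trip, continuing the example above and assuming get_curr_weather is implemented locally:
import json

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)  # e.g., {"location": "Shenzhen"}
    result = get_curr_weather(**args)           # execute the local implementation
    # Feed the tool result back so the model can compose the final answer
    followup = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "What's the weather in Shenzhen?"},
            message,
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
        ],
        tools=tools,
    )
    print(followup.choices[0].message.content)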
Tips: Define clear JSON schemas for function parameters, handle errors gracefully, limit recursion depth, and support multi‑step tool usage.
Fine‑tuning
Fine‑tuning adapts a pre‑trained LLM to a specific domain by training on labeled examples. Example: training a log‑alert expert to classify log severity (P0, P1, P2).
Data Preparation
Convert raw logs to a JSONL format where each entry contains a system prompt, user log, and assistant label.
{
  "messages": [
    {"role": "system", "content": "You are a log‑alert expert. Classify the urgency as P0, P1, or P2."},
    {"role": "user", "content": "[2024-08-07 12:00:00] Database connection failed. System is down."},
    {"role": "assistant", "content": "P0"}
  ]
}
Upload the JSONL file to a fine‑tuning platform (e.g., SiliconFlow) and start the training job.
from openai import OpenAI
client = OpenAI()
file = client.files.create(file=open("log.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=file.id, model="gpt-4o-mini-2024-07-18")
print(job.id)
After training completes, use the fine‑tuned model for inference:
completion = client.chat.completions.create(
    model="ft:your_fine_tuned_model",
    messages=[
        {"role": "system", "content": "Classify the urgency of the following log."},
        {"role": "user", "content": "Disk I/O error"}
    ]
)
print(completion.choices[0].message.content)  # Expected output: P0
Retrieval‑Augmented Generation (RAG)
RAG combines information retrieval with LLM generation: first retrieve relevant documents from a vector store, then feed them to the model as context.
Indexing
Load documents, split them (e.g., by Markdown headings), embed with text-embedding-3-small, and store in Chroma vector DB.
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Load and split
with open("data.md", "r", encoding="utf-8") as f:
    docs_text = f.read()
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "title"), ("##", "subtitle")])
chunks = splitter.split_text(docs_text)
# Embed and store
embeddings = OpenAIEmbeddings(openai_api_key="...", openai_api_base="https://vip.apiyi.com/v1", model="text-embedding-3-small")
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory="chroma_db")
vectorstore.persist()
Retrieval and Generation
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain import hub
retriever = vectorstore.as_retriever()
prompt = hub.pull('rlm/rag-prompt')
llm = ChatOpenAI(model="gpt-4o-mini", base_url="https://vip.apiyi.com/v1", api_key="...")
def format_docs(docs):
    # Join retrieved chunks into a single context string
    return "\n".join(d.page_content for d in docs)
rag_chain = ({"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser())
print(rag_chain.invoke("Who maintains the payment service?"))
Result: "The payment service is maintained by Xiao Zhang, contact 18888888888."
Limitations: chunking may break logical flow; RAG struggles with global or summarization queries.
Graph‑based RAG (Graph RAG)
Graph RAG enriches RAG by constructing a knowledge graph of entities and relationships, enabling the model to answer complex, relational queries.
Workflow
Build a knowledge graph from documents (entities, relations).
Perform graph‑aware retrieval that returns relevant entities and paths.
Combine retrieved graph context with original text for LLM generation.
Example using the graphrag CLI:
# Initialize workspace
python -m graphrag init --root ./data
# Place source files in ./data/input
# Build index
graphrag index --root ./data
# Query
graphrag query --root ./data --method basic -q "Who is responsible for the most services?"
Output: "Xiao Zhang is responsible for the most services, handling payment_backend and payment_frontend."
Graph RAG can answer summarization and aggregation questions that traditional RAG cannot.
Local Deployment of Open‑Source LLMs
Running models locally avoids data leakage and reduces API costs. Tools such as Ollama, LM Studio, and FastChat simplify deployment.
Ollama Example
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull llama3.1
# Run the model
ollama run llama3.1
Custom system prompt via a Modelfile:
FROM llama3.1
SYSTEM "You are an AI assistant specialized in DevOps and system administration. Provide concise, accurate, and actionable advice."
Then create and run the custom model:
ollama create devops_assistant -f Modelfile
ollama run devops_assistant
Access via the OpenAI‑compatible API:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Introduce yourself"}]}'
Ollama's OpenAI‑compatible endpoint listens on port 11434 by default.
Optionally, deploy Open WebUI for a web interface:
docker run -d -p 3000:8080 --name open-webui --restart always -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:main
Conclusion
This article covered AIOps techniques that combine large language models with prompt engineering, function calling, memory management, fine‑tuning, RAG, Graph RAG, and local model deployment. Prompt design remains the foundation for guiding model behavior, while function calling and memory give models execution and state capabilities. RAG and Graph RAG enable accurate, knowledge‑grounded responses without full model retraining, and local deployment with tools like Ollama ensures data security and cost efficiency for enterprise DevOps workflows.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.