Mastering AIOps: Prompt Engineering, Function Calling, RAG, Graph RAG, and Local LLM Deployment
This comprehensive guide explores AIOps techniques such as prompt engineering, chat completions, memory management, function calling, fine‑tuning, retrieval‑augmented generation (RAG), and graph‑based RAG, along with practical steps for deploying open‑source large language models locally, with code examples and best‑practice recommendations for modern DevOps environments.
Prompt Engineering
Prompt Engineering refers to designing and optimizing input prompts for AI models to guide them toward more accurate, useful, or creative outputs. Although large models have strong language understanding, their behavior heavily depends on the prompt, making prompt design crucial.
In large models, input prompts are tokenized. For example, the sentence "Hello, world!" is split into four tokens: "Hello", ",", " world", and "!".
Tips: Each model has a maximum context length (e.g., GPT‑3.5: 4096 tokens, GPT‑4: up to 32768 tokens). The combined length of the prompt and output must stay within this limit.
Token count is affected by text length, punctuation, specialized terms, language, and programming syntax. When designing prompts, avoid assuming that longer input yields better answers.
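To make these token counts concrete, the tiktoken library can inspect how text is tokenized. A minimal sketch, assuming the cl100k_base encoding used by GPT-3.5 and GPT-4:
import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo and gpt-4
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Hello, world!")
print(len(tokens))                        # 4
print([enc.decode([t]) for t in tokens])  # ['Hello', ',', ' world', '!']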
Be specific: Replace vague instructions with clear, detailed requirements.
Prioritize: Order multiple requests by importance.
Use standard formats: Structure prompts with headings or markdown for clearer model parsing.
Decompose tasks: Break complex problems into step‑by‑step instructions.
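Combining these tips, a structured prompt template might look like the following sketch (the task and wording are illustrative):
prompt = """# Task
Summarize the incident report below for an on-call engineer.

# Requirements (in priority order)
1. List the affected services.
2. State the probable root cause.
3. Suggest one next step.

# Incident report
{report}"""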
Prompt Development Modes
Zero‑shot and Few‑shot
Zero‑shot prompts provide only a natural‑language description of the task, while few‑shot prompts include example inputs and outputs to improve model accuracy. Few‑shot is generally preferred for operational tasks.
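For instance, a few-shot prompt for log classification can supply labeled examples ahead of the real input; a sketch in which the examples and labels are illustrative:
messages = [
    {"role": "system", "content": "Classify each log line as ERROR, WARN, or INFO."},
    # Few-shot examples showing the expected input/output format
    {"role": "user", "content": "Disk usage at 85% on /var"},
    {"role": "assistant", "content": "WARN"},
    {"role": "user", "content": "Connection to database lost"},
    {"role": "assistant", "content": "ERROR"},
    # The actual query follows the examples
    {"role": "user", "content": "Service started successfully in 2.3s"},
]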
Chain of Thought (CoT)
CoT prompts encourage the model to reason step‑by‑step before delivering the final answer, enhancing explanation and correctness.
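In its simplest form, a CoT prompt appends an explicit reasoning instruction to the question; the wording below is illustrative:
prompt = (
    "A deployment runs 3 replicas, each handling 120 req/s. "
    "Traffic is expected to double. How many replicas are needed? "
    "Think step by step, then give the final answer on its own line."
)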
Tips: Ensure the model can handle the required reasoning steps; otherwise, combine with function calling.
Prompt Chunking
When prompts exceed the model's context limit, split them into smaller chunks, process each chunk separately, and then combine the results.
Fixed‑length chunking (e.g., 1000 tokens per chunk).
Semantic chunking based on sentences, paragraphs, or sections.
Sliding‑window chunking with overlapping regions to preserve context.
Be aware of information loss, increased token cost, and potential inconsistencies when merging chunk results.
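As a minimal sketch, sliding-window chunking over a token list might look like this (the size and overlap values are illustrative):
def chunk_tokens(tokens, size=1000, overlap=100):
    # Consecutive chunks share `overlap` tokens so context spans chunk boundaries
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

chunks = chunk_tokens(list(range(2500)))  # 3 chunks: 0-999, 900-1899, 1800-2499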
Chat Completions, Memory, JSON Mode
Chat Completions
Chat Completions is an API for multi‑turn conversations, supporting roles: system (sets behavior), user (input), and assistant (model output).
{
  "model": "gpt-3.5-turbo",
  "messages": [
    {"role": "system", "content": "You are a translation assistant."},
    {"role": "user", "content": "Translate 'Artificial intelligence is a key future technology.'"},
    {"role": "assistant", "content": "Artificial intelligence is one of the key technologies of the future."},
    {"role": "user", "content": "Thanks!"}
  ],
  "temperature": 0.7
}
To maintain context across turns, append the previous user and assistant messages to the messages list with each new request.
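A minimal sketch of this pattern, assuming an OpenAI-compatible client and endpoint:
from openai import OpenAI

client = OpenAI(api_key="...")
history = [{"role": "system", "content": "You are a translation assistant."}]
while True:
    history.append({"role": "user", "content": input("You: ")})
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
    reply = response.choices[0].message.content
    # Append the assistant's reply so the next turn sees the full conversation
    history.append({"role": "assistant", "content": reply})
    print(reply)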
LangChain Memory
LangChain provides several memory types for managing conversation history:
ConversationBufferMemory: Stores all messages in order.
ConversationBufferWindowMemory: Keeps only the most recent N messages.
ConversationTokenBufferMemory: Truncates based on total token count.
ConversationSummaryMemory: Summarizes earlier messages to save tokens.
ConversationKGMemory: Builds a knowledge graph from the dialogue.
RedisChatMessageHistory: Persists messages in a Redis database so history survives restarts.
Example using ConversationBufferMemory:
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
# Prompt template with a placeholder that the memory fills with past messages
prompt = ChatPromptTemplate.from_messages([
    MessagesPlaceholder(variable_name="chat_history"),
    HumanMessagePromptTemplate.from_template("{text}"),
])
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
llm = ChatOpenAI(model_name="gpt-4o-mini", openai_api_key="...", base_url="https://api.siliconflow.cn/v1")
chain = LLMChain(llm=llm, prompt=prompt, memory=memory)
while True:
    user_input = input("Enter question: ")
    response = chain.invoke({"text": user_input})
    print(response["text"])
JSON Mode
JSON Mode forces the model to output valid JSON, useful for downstream processing. Example:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract a JSON object with fields 'service_name' and 'action' from the user request."},
        {"role": "user", "content": "Help me restart the pay service"}
    ]
)
print(response.choices[0].message.content)
Output:
{"service_name": "pay", "action": "restart"}
Note: Ensure the target model supports function calling and JSON mode.
Function Calling
Function Calling enables LLMs to invoke predefined tools or APIs based on user intent, allowing integration with external systems such as databases, monitoring tools, or custom scripts.
tools = [{
    "type": "function",
    "function": {
        "name": "get_curr_weather",
        "description": "Get current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "description": "Unit"}
            },
            "required": ["location"]
        }
    }
}]
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Shenzhen?"}],
    tools=tools,
    tool_choice="auto"
)
The model returns a tool_calls object indicating the function name and arguments. The application parses the arguments, executes the corresponding function, and feeds the result back to the model for a final response.
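A minimal sketch of that round trip, continuing the example above and assuming get_curr_weather is implemented locally:
import json

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)  # e.g., {"location": "Shenzhen"}
    result = get_curr_weather(**args)           # execute the local implementation
    # Feed the tool result back so the model can compose the final answer
    followup = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "What's the weather in Shenzhen?"},
            message,
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
        ],
        tools=tools,
    )
    print(followup.choices[0].message.content)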
Tips: Define clear JSON schemas for function parameters, handle errors gracefully, limit recursion depth, and support multi‑step tool usage.
Fine‑tuning
Fine‑tuning adapts a pre‑trained LLM to a specific domain by training on labeled examples. Example: training a log‑alert expert to classify log severity (P0, P1, P2).
Data Preparation
Convert raw logs to a JSONL format where each entry contains a system prompt, user log, and assistant label.
{
  "messages": [
    {"role": "system", "content": "You are a log‑alert expert. Classify the urgency as P0, P1, or P2."},
    {"role": "user", "content": "[2024-08-07 12:00:00] Database connection failed. System is down."},
    {"role": "assistant", "content": "P0"}
  ]
}
Upload the JSONL file to a fine‑tuning platform (e.g., SiliconFlow) and start the training job.
from openai import OpenAI
client = OpenAI()
file = client.files.create(file=open("log.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=file.id, model="gpt-4o-mini-2024-07-18")
print(job.id)
After training completes, use the fine‑tuned model for inference:
completion = client.chat.completions.create(
    model="ft:your_fine_tuned_model",
    messages=[
        {"role": "system", "content": "Classify the urgency of the following log."},
        {"role": "user", "content": "Disk I/O error"}
    ]
)
print(completion.choices[0].message.content)  # Expected output: P0
Retrieval‑Augmented Generation (RAG)
RAG combines information retrieval with LLM generation: first retrieve relevant documents from a vector store, then feed them to the model as context.
Indexing
Load documents, split them (e.g., by Markdown headings), embed with text-embedding-3-small, and store in Chroma vector DB.
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Load and split
with open("data.md", "r", encoding="utf-8") as f:
    docs_text = f.read()
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "title"), ("##", "subtitle")])
chunks = splitter.split_text(docs_text)
# Embed and store
embeddings = OpenAIEmbeddings(openai_api_key="...", openai_api_base="https://vip.apiyi.com/v1", model="text-embedding-3-small")
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory="chroma_db")
vectorstore.persist()
Retrieval and Generation
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain import hub
retriever = vectorstore.as_retriever()
prompt = hub.pull('rlm/rag-prompt')
llm = ChatOpenAI(model="gpt-4o-mini", base_url="https://vip.apiyi.com/v1", api_key="...")
def format_docs(docs):
    # Join retrieved chunks into a single context string
    return "\n".join(d.page_content for d in docs)
rag_chain = ({"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser())
print(rag_chain.invoke("Who maintains the payment service?"))
Result: "The payment service is maintained by Xiao Zhang, contact 18888888888."
Limitations: chunking may break logical flow; RAG struggles with global or summarization queries.
Graph‑based RAG (Graph RAG)
Graph RAG enriches RAG by constructing a knowledge graph of entities and relationships, enabling the model to answer complex, relational queries.
Workflow
Build a knowledge graph from documents (entities, relations).
Perform graph‑aware retrieval that returns relevant entities and paths.
Combine retrieved graph context with original text for LLM generation.
Example using the graphrag CLI:
# Initialize workspace
python -m graphrag init --root ./data
# Place source files in ./data/input
# Build index
graphrag index --root ./data
# Query
graphrag query --root ./data --method basic -q "Who is responsible for the most services?"
Output: "Xiao Zhang is responsible for the most services, handling payment_backend and payment_frontend."
Graph RAG can answer summarization and aggregation questions that traditional RAG cannot.
Local Deployment of Open‑Source LLMs
Running models locally avoids data leakage and reduces API costs. Tools such as Ollama, LM Studio, and FastChat simplify deployment.
Ollama Example
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull llama3.1
# Run the model
ollama run llama3.1
Custom system prompt via a Modelfile:
FROM llama3.1
SYSTEM "You are an AI assistant specialized in DevOps and system administration. Provide concise, accurate, and actionable advice."
Then create and run the custom model:
ollama create devops_assistant -f Modelfile
ollama run devops_assistant
Access via the OpenAI‑compatible API:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Introduce yourself"}]}'
Ollama's OpenAI‑compatible endpoint listens on port 11434 by default.
Optionally, deploy Open WebUI for a web interface:
docker run -d -p 3000:8080 --name open-webui --restart always -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:main
Conclusion
This article covered AIOps techniques that combine large language models with prompt engineering, function calling, memory management, fine‑tuning, RAG, Graph RAG, and local model deployment. Prompt design remains the foundation for guiding model behavior, while function calling and memory give models execution and state capabilities. RAG and Graph RAG enable accurate, knowledge‑grounded responses without full model retraining, and local deployment with tools like Ollama ensures data security and cost efficiency for enterprise DevOps workflows.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.