Building an Enterprise Log MCP: A Hands‑On Guide

The article explains why AI alone cannot reliably analyze logs, proposes wrapping an enterprise Loki or Elasticsearch log system with a custom MCP server that separates discovery and query layers, discusses transport and authentication choices, provides a complete Python implementation for Loki, and shares three production lessons for safe, scalable log querying.

Su San Talks Tech

Problem: AI lacks direct log access

When an on‑call engineer copies a few log lines into Claude Code, the AI can only reason about the fragment it sees, missing the root cause that may be hidden elsewhere in the log stream. The core issue is that AI has no built‑in channel to query the log system; engineers act as manual information middlemen.

Why official MCP implementations are insufficient

Tool granularity is too coarse. The official Loki MCP exposes only a single loki_query tool that requires the AI to write a full LogQL query without knowing which labels (e.g., app, env, namespace) exist. This forces guesswork and often returns empty results.

Authentication mechanisms do not match. Internal services may use LDAP SSO, per‑team API keys, internal OAuth, or cookie‑based sessions, while the official MCP only supports cloud‑oriented auth.

Query permissions are uncontrolled. Exposing a full‑access MCP to AI is equivalent to giving the AI DBA credentials, risking unrestricted reads, writes, or deletions.

Architecture of an enterprise log MCP

Before coding, the design separates the tool into two layers:

Discovery layer – tells the AI what labels, possible values, and active streams exist. Calls are lightweight and have negligible performance impact.

Query layer – performs actual log retrieval with time‑range limits, result caps, and direction control. This layer includes safety checks to avoid heavy scans.
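
To make the hand-off between the two layers concrete, here is a minimal sketch of how discovery output could constrain the query layer. The function name build_selector and the shape of the discovered dictionary are assumptions made for illustration; they are not part of the implementation shown later.

```python
def build_selector(filters: dict[str, str], discovered: dict[str, list[str]]) -> str:
    """Build a LogQL stream selector from labels and values seen during discovery.

    `discovered` is assumed to map each label name to its known values, i.e. the
    combined output of the discovery-layer tools.
    """
    parts = []
    for label, value in sorted(filters.items()):
        if label not in discovered:
            raise ValueError(f"unknown label {label!r}; known labels: {sorted(discovered)}")
        if value not in discovered[label]:
            raise ValueError(f"unknown value {value!r} for label {label!r}")
        # Escape backslashes and double quotes so the value stays inside its string literal
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{label}="{escaped}"')
    if not parts:
        raise ValueError("at least one label filter is required")
    return "{" + ", ".join(parts) + "}"


discovered = {
    "app": ["order-service", "payment-service"],
    "env": ["production", "staging"],
}
selector = build_selector({"env": "production", "app": "order-service"}, discovered)
print(selector)  # {app="order-service", env="production"}
```

A selector built this way cannot reference a label that discovery never reported, which is the failure mode the single-tool design runs into.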

Tool list for Loki

list_labels – List all label names (GET /loki/api/v1/labels)

list_label_values – List values for a specific label (GET /loki/api/v1/label/{name}/values)

list_streams – List active log streams (GET /loki/api/v1/series)

query_logs – Query logs over a time range (GET /loki/api/v1/query_range)

instant_query – Execute LogQL at a single timestamp (GET /loki/api/v1/query)

Transport protocol choice

The MCP supports two transports: stdio (standard input/output) and Streamable HTTP. For internal deployments, stdio is preferred because the MCP server runs as a subprocess of Claude Code, credentials are passed via environment variables, no network ports are opened, and the attack surface is minimal.

Stdio mode avoids exposing an HTTP server, eliminates port conflicts, and lets the client pass credentials through its MCP configuration file (claude_desktop_config.json for Claude Desktop, or .mcp.json / settings.json for Claude Code).

HTTP mode is only needed when multiple engineers must share a single MCP instance or when the server must run on a bastion host behind network isolation, which introduces OAuth 2.1 complexity.
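
One way to keep both options open is to pick the transport from an environment variable and default to stdio. The sketch below is an assumption-laden illustration: the variable name LOG_MCP_TRANSPORT and the helper choose_transport are hypothetical, while "stdio" and "streamable-http" are the transport names FastMCP's run() accepts.

```python
VALID_TRANSPORTS = {"stdio", "streamable-http"}


def choose_transport(env: dict[str, str]) -> str:
    """Return the transport to run with, defaulting to stdio.

    LOG_MCP_TRANSPORT is a hypothetical variable name used only for this sketch.
    """
    value = env.get("LOG_MCP_TRANSPORT", "stdio").strip().lower()
    if value not in VALID_TRANSPORTS:
        raise ValueError(f"unsupported transport {value!r}; expected one of {sorted(VALID_TRANSPORTS)}")
    return value


print(choose_transport({}))                                        # stdio
print(choose_transport({"LOG_MCP_TRANSPORT": "Streamable-HTTP"}))  # streamable-http
```

In the server entry point this could be wired in as mcp.run(transport=choose_transport(dict(os.environ))), so switching to HTTP mode becomes a deployment decision rather than a code change.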

Authentication options

Typical internal Loki auth methods:

Basic auth – Authorization: Basic <base64>

Bearer token – Authorization: Bearer <token>

Multi‑tenant Grafana Loki – additional X-Scope-OrgID header

Typical internal Elasticsearch auth methods:

API key – Authorization: ApiKey <key>

Basic auth – same header as Loki

Some deployments disable security entirely (not recommended).

All credentials are injected via environment variables, never hard‑coded.
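
The Loki header builder appears in the implementation below; for Elasticsearch, an equivalent could look like the following sketch. The function name build_es_headers and its parameters are assumptions for illustration, and the API key is assumed to be the already-encoded value that Elasticsearch issues.

```python
import base64


def build_es_headers(api_key: str = "", username: str = "", password: str = "") -> dict[str, str]:
    """Construct Elasticsearch auth headers, priority: API key > Basic > none."""
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumes api_key is the encoded key exactly as issued by Elasticsearch
        headers["Authorization"] = f"ApiKey {api_key}"
    elif username and password:
        token = base64.b64encode(f"{username}:{password}".encode()).decode()
        headers["Authorization"] = f"Basic {token}"
    return headers


print(build_es_headers(username="u", password="p"))
```

In a real server the three arguments would come from os.getenv, mirroring how the Loki credentials are loaded.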

Hands‑On: Implementing a Loki MCP

Environment preparation

Python >= 3.10
mcp[cli] == 1.27.2 (2026‑05‑29 release)
httpx == 0.28.x

Install dependencies:

pip install "mcp[cli]" httpx

Step 1 – Create the MCP server skeleton

# loki_mcp/server.py
import os, base64
from datetime import datetime, timedelta, timezone
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("loki-log-server")

# Load auth config from environment
LOKI_URL = os.getenv("LOKI_URL", "http://localhost:3100")
LOKI_USERNAME = os.getenv("LOKI_USERNAME", "")
LOKI_PASSWORD = os.getenv("LOKI_PASSWORD", "")
LOKI_BEARER_TOKEN = os.getenv("LOKI_BEARER_TOKEN", "")
LOKI_ORG_ID = os.getenv("LOKI_ORG_ID", "")

# Safety limits (configurable via env)
MAX_LIMIT = int(os.getenv("LOKI_MAX_LIMIT", "1000"))
MAX_HOURS = int(os.getenv("LOKI_MAX_HOURS", "24"))

def _build_headers() -> dict[str, str]:
    """Construct auth headers, priority: Bearer > Basic > none"""
    headers = {"Content-Type": "application/json"}
    if LOKI_BEARER_TOKEN:
        headers["Authorization"] = f"Bearer {LOKI_BEARER_TOKEN}"
    elif LOKI_USERNAME and LOKI_PASSWORD:
        credentials = base64.b64encode(f"{LOKI_USERNAME}:{LOKI_PASSWORD}".encode()).decode()
        headers["Authorization"] = f"Basic {credentials}"
    if LOKI_ORG_ID:
        headers["X-Scope-OrgID"] = LOKI_ORG_ID
    return headers

def _validate_time_range(start_hours: float, end_hours: float) -> tuple[str, str]:
    """Convert "N hours ago" offsets to Loki nanosecond timestamps and enforce MAX_HOURS."""
    if start_hours < 0 or end_hours < 0:
        raise ValueError("Time offsets must not be negative")
    if start_hours > MAX_HOURS:
        raise ValueError(f"Logs can be queried at most {MAX_HOURS} hours back; requested {start_hours} hours")
    if start_hours <= end_hours:
        # start_hours is the older boundary, so it must be the larger offset
        raise ValueError("start_hours must be greater than end_hours")
    now = datetime.now(timezone.utc)
    start_dt = now - timedelta(hours=start_hours)
    end_dt = now - timedelta(hours=end_hours)
    # Loki expects Unix epoch timestamps in nanoseconds
    start_ns = str(int(start_dt.timestamp() * 1e9))
    end_ns = str(int(end_dt.timestamp() * 1e9))
    return start_ns, end_ns
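
Multiplying a float timestamp by 1e9 cannot represent every nanosecond exactly, because a double carries only about 16 significant digits. For hour-scale ranges this hardly matters, but if exact boundaries are wanted, the conversion can be done with integer arithmetic. The helper names to_unix_ns and hours_ago_to_ns below are hypothetical and not part of the server code above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_ns(dt: datetime) -> int:
    """Convert a timezone-aware datetime to integer nanoseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    delta = dt - EPOCH
    # timedelta stores days, seconds and microseconds as integers, so no float rounding occurs
    return (delta.days * 86_400 + delta.seconds) * 1_000_000_000 + delta.microseconds * 1_000


def hours_ago_to_ns(hours: float, now: datetime) -> str:
    """Return the timestamp `hours` before `now` as a nanosecond string."""
    return str(to_unix_ns(now - timedelta(hours=hours)))


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(hours_ago_to_ns(1, now))  # 1704063600000000000
```

Passing now explicitly also makes the function easy to unit test, since the result no longer depends on the wall clock.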

Step 2 – Discovery tools

@mcp.tool()
async def list_labels() -> dict:
    """Return all label names in Loki. Example: {"labels": ["app", "env", "namespace", "pod", "level"]}"""
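    # In httpx, verify is a client-level option, so it is set on AsyncClient rather than on get().
    # verify=False disables TLS certificate checks; use it only for trusted internal endpoints.
    async with httpx.AsyncClient(timeout=10.0, verify=False) as client:
        resp = await client.get(f"{LOKI_URL}/loki/api/v1/labels", headers=_build_headers())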
        resp.raise_for_status()
        data = resp.json()
    return {"labels": data.get("data", [])}

@mcp.tool()
async def list_label_values(label_name: str) -> dict:
    """List all possible values for a given label."""
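    # Reject anything that is not a plain identifier so the name cannot alter the URL path
    if not label_name or not label_name.replace("-", "").replace("_", "").isalnum():
        raise ValueError(f"Invalid label name: {label_name!r}")
    # verify is a client-level httpx option; verify=False disables TLS certificate checks
    async with httpx.AsyncClient(timeout=10.0, verify=False) as client:
        resp = await client.get(f"{LOKI_URL}/loki/api/v1/label/{label_name}/values", headers=_build_headers())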
        resp.raise_for_status()
        data = resp.json()
    return {"label": label_name, "values": data.get("data", [])}

@mcp.tool()
async def list_streams(match: str = "", start_hours: float = 1.0) -> dict:
    """List active streams with optional selector."""
    start_ns, end_ns = _validate_time_range(start_hours, 0)
    params = {"start": start_ns, "end": end_ns, "limit": "100"}
    if match:
        params["match[]"] = match
    # verify is a client-level httpx option; verify=False disables TLS certificate checks
    async with httpx.AsyncClient(timeout=15.0, verify=False) as client:
        resp = await client.get(f"{LOKI_URL}/loki/api/v1/series", headers=_build_headers(), params=params)
        resp.raise_for_status()
        data = resp.json()
    return {"streams": data.get("data", []), "count": len(data.get("data", []))}
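
Because list_label_values interpolates the label name into the URL path, a stricter check than isalnum may be worthwhile. Loki follows the Prometheus convention that label names match [a-zA-Z_][a-zA-Z0-9_]*, so a regular expression can express the rule directly. The helper below is a sketch; is_valid_label_name is a hypothetical name, not something the server above defines.

```python
import re

# Prometheus-style label names: a letter or underscore, then letters, digits, or underscores
LABEL_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_label_name(name: str) -> bool:
    """Return True only when the whole string is a well-formed label name."""
    return bool(LABEL_NAME_RE.fullmatch(name))


for candidate in ["app", "pod_name", "app-name", "9lives", "../admin", ""]:
    print(repr(candidate), is_valid_label_name(candidate))
```

Unlike the isalnum check, this rejects hyphens, leading digits, and non-ASCII characters, none of which can appear in a real label name.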

Step 3 – Query tools

@mcp.tool()
async def query_logs(logql: str, start_hours: float = 1.0, end_hours: float = 0.0, limit: int = 100, direction: str = "backward") -> dict:
    """Query logs over a time range."""
    if direction not in ("backward", "forward"):
        raise ValueError("direction must be 'backward' or 'forward'")
    # Clamp the requested limit to the range 1..MAX_LIMIT
    limit = max(1, min(limit, MAX_LIMIT))
    start_ns, end_ns = _validate_time_range(start_hours, end_hours)
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": str(limit), "direction": direction}
    # verify is a client-level httpx option; verify=False disables TLS certificate checks
    async with httpx.AsyncClient(timeout=30.0, verify=False) as client:
        resp = await client.get(f"{LOKI_URL}/loki/api/v1/query_range", headers=_build_headers(), params=params)
        resp.raise_for_status()
        data = resp.json()
    result_streams = data.get("data", {}).get("result", [])
    logs = []
    for stream in result_streams:
        labels = stream.get("stream", {})
        for ts_ns, line in stream.get("values", []):
            ts_sec = int(ts_ns) / 1e9
            readable_ts = datetime.fromtimestamp(ts_sec, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f UTC")
            logs.append({"timestamp": readable_ts, "labels": labels, "line": line})
    return {"query": logql, "total": len(logs), "logs": logs}

@mcp.tool()
async def instant_query(logql: str, time_hours_ago: float = 0.0) -> dict:
    """Execute LogQL at a single timestamp."""
    if time_hours_ago < 0:
        raise ValueError("time_hours_ago must not be negative")
    if time_hours_ago > MAX_HOURS:
        raise ValueError(f"The query time must not be more than {MAX_HOURS} hours ago")
    now = datetime.now(timezone.utc)
    query_time = now - timedelta(hours=time_hours_ago)
    query_time_ns = str(int(query_time.timestamp() * 1e9))
    # verify is a client-level httpx option; verify=False disables TLS certificate checks
    async with httpx.AsyncClient(timeout=15.0, verify=False) as client:
        resp = await client.get(f"{LOKI_URL}/loki/api/v1/query", headers=_build_headers(), params={"query": logql, "time": query_time_ns})
        resp.raise_for_status()
        data = resp.json()
    return {"query": logql, "result_type": data.get("data", {}).get("resultType"), "result": data.get("data", {}).get("result", [])}

if __name__ == "__main__":
    # stdio mode – Claude Code launches this process as a subprocess
    mcp.run(transport="stdio")
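
query_logs appends entries stream by stream, so lines from different streams are not interleaved by time in its output. If the AI is expected to read events in order, the result can be merged and sorted before it is returned. The following is a sketch under that assumption; flatten_streams is a hypothetical helper operating on the data.result list of a query_range response.

```python
def flatten_streams(result: list[dict], direction: str = "backward", limit: int = 100) -> list[dict]:
    """Merge all streams into one list ordered by timestamp.

    `result` is assumed to have the shape of data.result in a query_range
    response: each item has a "stream" label dict and "values" pairs of
    [nanosecond timestamp string, log line].
    """
    entries = []
    for stream in result:
        labels = stream.get("stream", {})
        for ts_ns, line in stream.get("values", []):
            entries.append({"ts_ns": int(ts_ns), "labels": labels, "line": line})
    # backward = newest first, forward = oldest first (same meaning as Loki's direction)
    entries.sort(key=lambda entry: entry["ts_ns"], reverse=(direction == "backward"))
    return entries[:limit]


sample = [
    {"stream": {"app": "a"}, "values": [["3", "third"], ["1", "first"]]},
    {"stream": {"app": "b"}, "values": [["2", "second"]]},
]
merged = flatten_streams(sample, direction="forward")
print([entry["line"] for entry in merged])  # ['first', 'second', 'third']
```

Applying the limit after the merge also keeps the total returned to the AI bounded, regardless of how many streams matched.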

Configuring Claude Code to load the MCP

{
  "mcpServers": {
    "loki-logs": {
      "command": "python",
      "args": ["-m", "loki_mcp.server"],
      "env": {
        "LOKI_URL": "http://your-loki-server:3100",
        "LOKI_BEARER_TOKEN": "your-token-here",
        "LOKI_ORG_ID": "your-org-id",
        "LOKI_MAX_LIMIT": "500",
        "LOKI_MAX_HOURS": "48"
      }
    }
  }
}

Using the MCP from Claude Code

After the configuration, the workflow inside Claude Code is:

/mcp            # list available MCP tools, verify "loki-logs" is loaded
What ERROR logs has the order service produced in the last hour? Help me analyze the root cause.

Claude Code will automatically call list_labels → list_label_values("app") → query_logs, eliminating manual LogQL composition and Grafana copy‑paste.

Advanced lessons from production

Lesson 1 – AI may guess wrong LogQL syntax

When AI generated {app="payment-service", error="true"}, Loki returned no results because the log schema stores error inside the JSON payload, not as a label. The fix is to include schema hints in the tool docstrings or add a get_log_schema helper that returns label definitions and example queries.

@mcp.tool()
async def get_log_schema() -> dict:
    """Return label structure and common query examples for the current log system."""
    return {
        "label_structure": {
            "app": "Service name, e.g. order-service, payment-service, gateway",
            "env": "Environment, either production or staging",
            "namespace": "K8s namespace",
            "level": "Note: level is NOT a label; extract it with the json parser: | json | level = \"error\""
        },
        "common_queries": [
            '{app="order-service"} | json | level = "error"',
            '{app="payment-service", env="production"} |= "timeout" | json',
            'count_over_time({app="order-service"} | json | level = "error" [5m])'
        ],
        "gotchas": [
            "Log lines are JSON; fields such as level and trace_id are only available after | json",
            "Loki timestamps have nanosecond precision; the API returns them as nanosecond strings"
        ]
    }
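
Schema hints reduce wrong guesses but do not prevent them. A complementary guard is to compare the labels in an AI-written selector against what discovery reported and return a helpful error before Loki is queried. The sketch below is a heuristic, not a LogQL parser: unknown_labels is a hypothetical helper, it inspects only the first {...} selector, and values containing a closing brace would confuse it.

```python
import re

# A label matcher looks like name="..." with one of the operators =, !=, =~, !~
MATCHER_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)\s*(=~|!~|!=|=)\s*"')


def unknown_labels(logql: str, known: set[str]) -> list[str]:
    """Return labels used in the first stream selector that are not in `known`."""
    start = logql.find("{")
    end = logql.find("}", start)
    if start == -1 or end == -1:
        return []
    selector = logql[start + 1:end]
    used = [match.group(1) for match in MATCHER_RE.finditer(selector)]
    return sorted({name for name in used if name not in known})


known = {"app", "env", "namespace"}
print(unknown_labels('{app="payment-service", error="true"}', known))           # ['error']
print(unknown_labels('{app="order-service"} | json | level = "error"', known))  # []
```

For the failing query from this lesson, the check reports error as an unknown label, which gives the AI something actionable instead of an empty result.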

Lesson 2 – Unbounded queries can exhaust resources

Allowing AI to request "all ERROR logs for the past 24 h" caused Loki to scan tens of gigabytes, leading to timeouts and memory spikes. The solution is to enforce MAX_LIMIT and MAX_HOURS inside the MCP server, where they are set by the operator through environment variables and cannot be overridden by tool arguments the AI supplies. Additionally, a pre‑check using count_over_time can warn the AI when the expected result set exceeds a safe threshold.

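# Pseudo‑code inside query_logs before execution.
# _loki_instant_query (returns the summed count as an int) and WARN_THRESHOLD are helpers you would define.
range_seconds = max(1, int((start_hours - end_hours) * 3600))
# sum(...) collapses the per-stream counts into a single number for the whole window
count_query = f'sum(count_over_time(({logql})[{range_seconds}s]))'
count_result = await _loki_instant_query(count_query, time=end_ns)
if count_result > WARN_THRESHOLD:
    return {"warning": f"About {count_result} log lines would be returned; narrow the time range or add filters", "suggestion": "Narrow the time range to within 1 hour, or add a level='error' filter"}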
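
The pseudo-code leaves open how the count is read from Loki's response. An instant metric query returns a vector whose samples each carry a [timestamp, "value"] pair, so the total can be obtained by summing the second element of every sample. The helpers below are a sketch; total_from_vector, precheck, and the default threshold of 5000 are assumptions for illustration.

```python
from typing import Optional


def total_from_vector(result: list[dict]) -> int:
    """Sum the sample values of an instant-query vector result.

    Each sample is assumed to look like {"metric": {...}, "value": [ts, "123"]}.
    """
    return sum(int(float(sample["value"][1])) for sample in result)


def precheck(total: int, threshold: int = 5000) -> Optional[dict]:
    """Return a warning payload when the expected result is too large, else None."""
    if total <= threshold:
        return None
    return {
        "warning": f"About {total} log lines would be returned",
        "suggestion": "Narrow the time range or add more filters",
    }


vector = [
    {"metric": {}, "value": [1700000000, "1200"]},
    {"metric": {}, "value": [1700000000, "34"]},
]
total = total_from_vector(vector)
print(total)                            # 1234
print(precheck(total, threshold=1000))  # warning payload
```

Returning the warning as a normal tool result, rather than raising, lets the AI read the suggestion and retry with a narrower query.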

Lesson 3 – Stdio mode offers unexpected benefits

Deploying the MCP as an HTTP service introduced OAuth management, firewall rules, and a single point of failure that affected all engineers. Switching back to stdio mode, where each engineer runs a local MCP subprocess, eliminated operational overhead, isolated failures, and simplified credential handling.

No extra service to maintain.

Queries are isolated per engineer; one bad query does not affect others.

Each engineer manages their own token, avoiding shared secrets.

Stdio mode only works where the engineer's machine can reach the Loki or Elasticsearch HTTP endpoint directly; otherwise HTTP mode on a bastion host is required.

Conclusion

Tool layering (Discovery → Query) is more important than the sheer number of tools.

For small teams, stdio transport provides zero‑ops, isolation, and simple auth.

The MCP layer is the final security guard – enforce query caps, time‑range limits, and label whitelists in code.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.


Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
