Open‑Source OCManager: A Smart Manager that Handles 7 Million Daily Alerts

OCManager, an open‑source integrated platform from OpenCloudOS, unifies cluster management, whole‑machine monitoring, and AI‑driven operations in a single web console, supporting millions of daily alerts, thousands of incidents, and multi‑OS environments with a four‑layer architecture and Docker‑based deployment.

Tencent Architect

OCManager is an open‑source integrated platform that addresses the "cannot see, cannot manage, cannot diagnose, cannot fix" challenges of large‑scale Linux clusters by combining cluster management, whole‑machine monitoring, and AI‑powered operations into a single web console.

What is OCManager

When managing hundreds or thousands of servers, administrators often cobble together Prometheus/Grafana for monitoring, ELK for logs, a custom CMDB, and Ansible/SaltStack for batch execution. This approach works for small fleets but breaks down at the scale of tens of thousands of machines, where each day brings 7 million CVE and package risk alerts, more than 2,000 crash events, hundreds of software versions, and dozens of OS releases to track. OCManager, built by the OpenCloudOS team, replaces this patchwork with a single platform.

Four‑Layer Architecture

Each managed host runs an Agent plugin that reports data over an mTLS channel. The backend consists of tRPC-Go microservices built on the Footstone foundation, which provides machine management, permission verification, and related services. Data is stored in MySQL, ClickHouse, Kafka, and Redis. The frontend uses Vue 2 and TDesign to deliver all functions through a single web console.

Four Core Features

Integration and out-of-the-box use: All modules share a common machine view, permission system, and agent channel, eliminating the need for ad-hoc stitching.

Full-Web Management: Host onboarding, batch operations, dashboards, anomaly diagnosis, command execution, and AI dialogue are all performed in the web UI, improving efficiency, auditability, and traceability.

Multi-OS Support: Phase 1 supports OpenCloudOS 9, OpenCloudOS 8, and the commercial TencentOS series; Phase 2 will add major third-party open-source OSes.

Large-Scale Validation: Before open-sourcing, OCManager ran for three years inside Tencent, serving over 3 million heterogeneous servers, processing 7 million daily alerts, and automatically analyzing more than 2,000 crash events.

Five Core Modules

2.1 Cluster Management

The foundation module, powered by Footstone, provides unified host onboarding, tag‑based grouping, bulk import/export, RBAC, and full‑lifecycle audit logs. For fleets exceeding one million machines, Footstone scales horizontally via cloud‑native mechanisms and maintains high‑concurrency long‑lived connections through mutual TLS.

2.2 Whole‑Machine Monitoring

The monitoring module is designed for system‑level troubleshooting, offering deep visibility beyond traditional metrics. It collects 26 core parameters across four dimensions:

CPU Load Decomposition: Shows CPU usage and average load for 1/5/15-minute intervals, with TOP rankings to spot uneven multi-core load.

Memory Fine-Grained View: Tracks physical memory usage and "active" memory allocation at GB granularity to detect hidden leaks.

Device-Level Disk I/O: Drills down to physical devices and logical partitions (e.g., vda, vdb1), reporting usage and read/write ratios.

Network-Card Traffic Mapping: Provides per-NIC (e.g., eth1, br0) inbound/outbound traffic, quickly identifying bandwidth spikes.
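
To make two of these dimensions concrete, here is a minimal sketch of how load averages and per-NIC byte counters can be read on a Linux host. It is not OCManager's agent code: the function names load_averages and nic_counters are hypothetical, and the only assumption about the data source is the standard /proc/loadavg and /proc/net/dev file formats.

```shell
#!/usr/bin/env bash
# Sketch only: read load averages and per-NIC byte counters from procfs-style files.
# Function names are hypothetical; file arguments default to the real Linux paths.

# Print the 1/5/15-minute load averages from a loadavg-format file.
load_averages() {
  local file="${1:-/proc/loadavg}" one five fifteen _rest
  read -r one five fifteen _rest < "$file"
  printf '1m=%s 5m=%s 15m=%s\n' "$one" "$five" "$fifteen"
}

# Print "<nic> <rx_bytes> <tx_bytes>" for every interface in a net/dev-format file.
nic_counters() {
  local file="${1:-/proc/net/dev}"
  # Skip the two header lines, then turn the colon after the interface name
  # into a space so "eth1:123" and "eth1: 123" split into the same fields.
  awk 'NR > 2 { sub(/:/, " "); print $1, $2, $10 }' "$file"
}
```

The byte counters are cumulative since boot, so a traffic rate would come from sampling nic_counters twice and dividing the difference by the sampling interval.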

2.3 Command Assistant

Batch command execution is re‑engineered into standardized, reusable web‑based job templates, solving the "how many machines can a single command reach" problem.

Batch Target Selection: Select hosts directly from the host panel and dispatch commands to all selected machines in one step.

Parameterized Command Templates: Commands are presented as structured cards with name, type, scope, description, and annotated examples; placeholders like {arch} reduce input errors.

Secure Closed-Loop: Commands pass a whitelist review, are sandboxed, and execution results are streamed back to the console and archived for full traceability.
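
The template and whitelist ideas above can be illustrated with a small sketch. This is an assumption-laden illustration, not OCManager's implementation: ALLOWED_COMMANDS, render_template, and check_command are hypothetical names, and the allow-list contents are invented for the example.

```shell
#!/usr/bin/env bash
# Sketch only: fill a {placeholder} in a command template, then apply an allow-list.
# ALLOWED_COMMANDS, render_template and check_command are hypothetical names.

ALLOWED_COMMANDS="uname rpm df uptime"

# Replace every {key} placeholder in the template with the given value.
# Values are limited to a conservative character set to avoid shell injection.
render_template() {
  local template="$1" key="$2" value="$3"
  local pattern="{$key}"
  case "$value" in
    "" | *[!A-Za-z0-9_.-]*)
      echo "invalid value for {$key}: $value" >&2
      return 1
      ;;
  esac
  printf '%s\n' "${template//"$pattern"/$value}"
}

# Accept a command only if no placeholder is left unfilled
# and its first word is on the allow-list.
check_command() {
  local cmd="$1" first allowed
  local placeholder_re='\{[A-Za-z_]+\}'
  if [[ "$cmd" =~ $placeholder_re ]]; then
    echo "unfilled placeholder in: $cmd" >&2
    return 1
  fi
  first="${cmd%% *}"
  for allowed in $ALLOWED_COMMANDS; do
    [ "$first" = "$allowed" ] && return 0
  done
  echo "command not in allow-list: $first" >&2
  return 1
}
```

For example, render_template 'rpm -q glibc.{arch}' arch x86_64 yields rpm -q glibc.x86_64, which check_command then accepts. Checking only the first word is deliberately simplistic; a production review step would need to parse the whole command line.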

2.4 OCAI (AI Ops)

OCAI is the most upgraded module in this release. It adds a web‑based interaction layer and merges a generic Q&A assistant with an intelligent diagnosis engine, all powered by a dedicated OpenCloudOS knowledge base.

Diagnosis + Execution Suggestions: Users describe an anomaly (e.g., "memory usage 98% urgent"). OCAI then performs a multi-dimensional analysis and outputs a structured report containing a baseline assessment, root-cause attribution (e.g., "CPU spike 76-99% from 02:00-07:00 due to Merge/Mutation tasks"), aggregated alerts, and prioritized remediation commands.

Explainable Reasoning: The full analysis chain is visible, from agent status verification to step-by-step validation of configuration, package paths, and service endpoints, so users can follow how each conclusion was reached.

Unified Q&A & Deep Diagnosis: Routine technical queries (version features, compatibility) are handled by conversational Q&A; complex fault scenarios trigger automatic diagnosis tasks driven by LangGraph + MCP, which pull logs, metrics, and configs.

Multi-Turn Context & Knowledge Accumulation: Dialogue history is archived; new troubleshooting experiences enrich the OS-specific knowledge base for future reuse.

Deployment

The recommended deployment method is a one‑click Docker setup.

Docker 20.10+ (including Compose v2) and Go 1.24+ are required.

The executing user must have Docker privileges (root or docker group) and sudo rights.

For functional verification, a server with at least 4 CPU / 8 GB RAM is sufficient; for full stack (MySQL, Redis, Kafka, ZooKeeper, ClickHouse, manager, data channels) 8 CPU / 16 GB is advised.

Avoid running scripts in the Docker data directory, as temporary binaries are written to manager/backend/bin/ and later cleaned.
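
The version requirements above can be checked before deploying with a short preflight script. The following is a sketch under stated assumptions, not part of the OCManager repository: version_ge, require_version, and preflight are hypothetical helper names, and the comparison relies on sort -V being available.

```shell
#!/usr/bin/env bash
# Sketch only: preflight check for the stated Docker, Compose and Go versions.
# version_ge, require_version and preflight are hypothetical helper names.

# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report whether a detected version meets the minimum.
require_version() {
  local name="$1" found="$2" min="$3"
  if [ -n "$found" ] && version_ge "$found" "$min"; then
    echo "ok: $name $found (>= $min)"
  else
    echo "FAIL: $name ${found:-not found} (need >= $min)"
    return 1
  fi
}

# Query the local tools and compare against the minimums from the article.
preflight() {
  local status=0 docker_v compose_v go_v
  # Querying the server version also fails when the user lacks Docker privileges.
  docker_v="$(docker version --format '{{.Server.Version}}' 2>/dev/null)"
  compose_v="$(docker compose version --short 2>/dev/null)"
  go_v="$(go env GOVERSION 2>/dev/null)"
  require_version docker "$docker_v" 20.10 || status=1
  require_version "docker compose" "${compose_v#v}" 2.0 || status=1
  require_version go "${go_v#go}" 1.24 || status=1
  return "$status"
}
```

Calling preflight prints one line per tool and returns a non-zero status if any requirement is not met, so it can gate the deployment step.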

# clone repository
git clone https://gitee.com/OpenCloudOS/ocmanager.git
cd ocmanager
# copy example env file
cp config/env.example config/.env
# edit configuration (at minimum set SERVER_HOST)
vi config/.env
# one‑click build and start all services
bash scripts/deploy.sh

After the services start, open the console at http://127.0.0.1:13070, sign in with the default credentials admin / Admin123456@, and change the password immediately.

Optional OCAI‑Service Deployment

To enable the AI backend, set ENABLE_OCAI_DEPLOY=true in config/.env and provide values for OCAI_JWT_SECRET and OCAI_LLM_DEFAULT_API_KEY. Then run the command bash scripts/deploy.sh up, which launches the OCAI service and an Nginx reverse proxy.
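
Because a missing key is easy to overlook, the env file can be checked before the deploy script runs. The sketch below is illustrative only: env_value and check_ocai_env are hypothetical helpers, only the three variable names come from the article, and it assumes plain unquoted KEY=value lines.

```shell
#!/usr/bin/env bash
# Sketch only: verify the OCAI-related keys in an env file before deploying.
# env_value and check_ocai_env are hypothetical helpers; the file is assumed
# to contain plain, unquoted KEY=value lines.

# Print the value of KEY from the env file (the last assignment wins).
env_value() {
  local file="$1" key="$2"
  sed -n "s/^${key}=//p" "$file" | tail -n 1
}

# When OCAI deployment is enabled, require both secrets to be non-empty.
check_ocai_env() {
  local file="${1:-config/.env}" key missing=0
  if [ "$(env_value "$file" ENABLE_OCAI_DEPLOY)" != "true" ]; then
    echo "OCAI deployment disabled; nothing to check"
    return 0
  fi
  for key in OCAI_JWT_SECRET OCAI_LLM_DEFAULT_API_KEY; do
    if [ -z "$(env_value "$file" "$key")" ]; then
      echo "missing: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

Running check_ocai_env with no argument reads config/.env. A random value for the JWT secret could be generated with a tool such as openssl rand -hex 32, although the article does not prescribe a format.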

Phase 2 Roadmap (Coming Soon)

Finer‑grained instance profiling and asset view.

Proactive health checks and baseline management.

Performance platform with baselines, regression testing, and tuning tools.

Intelligent collection and analysis of dmesg logs.

CrashBuddy – automated kernel crash analysis, bulk remediation, and retry mechanisms.

Extended MCP toolset exposing more OS capabilities to the agent ecosystem.

Broader adaptation to third‑party open‑source operating systems.

Community contributions are welcomed via PRs and Issues on the OCManager repository and its related sub‑projects.
