Automate Cluster Health Checks with Koalas: Cutting Big Data Downtime

This article introduces Koalas, an automated distributed diagnostic tool for TDH clusters. It helps identify computing environment issues (network, platform, and system problems) through one-click checks and detailed reports, and it serves both preventive and diagnostic use cases.

StarRing Big Data Open Lab

Sep 26, 2016

Introduction

The after‑sales team at Xinghuan Technology collects weekly statistics on customer support tickets and finds that, on average, 35% of issues stem from network problems, 35% from platform misconfigurations, 20% from system hardware or software faults, and the remaining 10% from user misuse.

The first three categories are classified as computing environment issues: they are unrelated to product quality but can severely affect TDH services. Together they account for 90% of tickets, so diagnosing and eliminating them quickly can dramatically reduce user-facing incidents.

TDH is a complex system with stringent environment requirements and large clusters, making manual checks time‑consuming. To address this, TDH 4.6 introduces Koalas, a distributed automated inspection tool accessible via the Transwarp Manager monitoring interface.

Koalas Usage

Koalas can run 13 checks, including network throughput, hardware environment, and jar-package redundancy detection. Users click the "Run" button for a specific check, or "Run All" to execute every check. Results are displayed per server with red (critical), yellow (warning), or green (normal) indicators.
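
The red/yellow/green scheme can be modeled as an ordered severity scale. The sketch below is not Koalas code: the `Status` enum, the `overall_status` helper, and the "worst server wins" roll-up rule are all assumptions made for illustration, and the node names are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    """Severity levels matching the indicator colors described above."""
    GREEN = 0   # normal
    YELLOW = 1  # warning
    RED = 2     # critical


def overall_status(per_server):
    """Roll per-server results up to one status by taking the worst one.

    The roll-up rule is an assumption; the article only says results
    are shown per server.
    """
    return max(per_server.values(), default=Status.GREEN)


# Hypothetical per-server results for a single check.
results = {
    "node-1": Status.GREEN,
    "node-2": Status.YELLOW,
    "node-3": Status.GREEN,
}
print(overall_status(results).name)  # YELLOW
```

Using an ordered enum keeps the comparison explicit: any rule that ranks red above yellow above green reduces to a `max` over the per-server values.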

Configuration options such as timeout and target nodes are adjustable via the "Configure" button.
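
As a rough illustration of what such per-check settings might look like, the sketch below models a timeout and a target-node list. The class name, field names, default timeout, and the "empty list means all nodes" convention are hypothetical; the article does not describe the actual Koalas configuration format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CheckConfig:
    """Hypothetical settings for one check; all names are illustrative."""
    check_name: str
    timeout_s: int = 300  # assumed default, not taken from the article
    target_nodes: List[str] = field(default_factory=list)  # empty = all nodes

    def resolve_targets(self, cluster_nodes: List[str]) -> List[str]:
        """Validate the settings and return the nodes the check should run on."""
        if self.timeout_s <= 0:
            raise ValueError("timeout_s must be positive")
        if not self.target_nodes:
            return list(cluster_nodes)
        unknown = sorted(set(self.target_nodes) - set(cluster_nodes))
        if unknown:
            raise ValueError(f"unknown target nodes: {unknown}")
        return list(self.target_nodes)


cluster = ["node-1", "node-2", "node-3", "node-4"]
cfg = CheckConfig("network_throughput", timeout_s=120,
                  target_nodes=["node-2", "node-4"])
print(cfg.resolve_targets(cluster))  # ['node-2', 'node-4']
```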

Detailed results can be reviewed on the "History" page, which provides logs, summary reports, execution statistics, parameters, and links to detailed logs for each node.

Koalas Application

Koalas is used in two scenarios: preventive checks before running cluster jobs to ensure a green status, and diagnostic checks after a job fails to quickly pinpoint environment‑related causes.
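
The preventive scenario amounts to a gate in front of job submission. The following is a minimal sketch of that idea, assuming check results have already been collected as color strings; the function name, the check names, and the `allow_warnings` option are hypothetical and not part of Koalas.

```python
def safe_to_submit(check_results, allow_warnings=False):
    """Decide whether a job should be submitted.

    check_results maps a check name to "green", "yellow", or "red".
    Returns (ok, blocking_checks). By default only an all-green result
    passes, matching the "ensure a green status" rule described above.
    """
    blocking_colors = {"red"} if allow_warnings else {"red", "yellow"}
    blocking = sorted(
        name for name, color in check_results.items()
        if color in blocking_colors
    )
    return (not blocking, blocking)


# Hypothetical results from a pre-job inspection run.
pre_job = {
    "network_throughput": "yellow",
    "hardware_environment": "green",
    "jar_redundancy": "green",
}
ok, blocking = safe_to_submit(pre_job)
print(ok, blocking)  # False ['network_throughput']
```

In the diagnostic scenario the same list of blocking checks becomes the starting point for investigation after a failure.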

Problem description: A customer reported poor Transwarp Inceptor performance. Counting the rows of a 15-million-record table took over 100 seconds, even after switching to an ORC table. Data ingestion into HDFS was also slow.

Root cause hypothesis: Slow HDFS writes suggest low network throughput between DataNodes, and the Inceptor aggregation (a Spark job) requires extensive shuffle, further implicating network bandwidth.

Diagnosis: Koalas' network throughput test had every node send and receive data simultaneously. The test returned a yellow status, indicating a problem. Detailed logs showed that three nodes achieved 380-500 MB/s, while node 172.16.2.101 reached only about 30 MB/s and became the bottleneck, even though all nodes had 10 GbE NICs.
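
Spotting the slow node in such a log can be expressed as a simple outlier test. The sketch below flags any node whose throughput falls below half the cluster median. That threshold is an assumption for illustration, not the rule Koalas applies. The three healthy node names are hypothetical, and their values are merely picked from the reported 380-500 MB/s range.

```python
from statistics import median

# 10 Gbit/s expressed in MB/s (decimal units), the nominal NIC line rate.
LINE_RATE_MB_S = 10_000 / 8


def slow_nodes(throughput_mb_s, ratio=0.5):
    """Return nodes whose throughput is below ratio * cluster median."""
    med = median(throughput_mb_s.values())
    return {
        node: value
        for node, value in throughput_mb_s.items()
        if value < ratio * med
    }


measured = {
    "node-a": 380.0,        # hypothetical name, illustrative value
    "node-b": 450.0,        # hypothetical name, illustrative value
    "node-c": 500.0,        # hypothetical name, illustrative value
    "172.16.2.101": 30.0,   # the slow node reported in the article
}
print(slow_nodes(measured))  # {'172.16.2.101': 30.0}
```

Comparing against the median rather than the mean keeps a single very slow node from dragging the baseline down. `LINE_RATE_MB_S` is included only as a reference point: about 30 MB/s is a small fraction of the 1250 MB/s a 10 GbE link can nominally carry.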

The diagnosis confirmed the hypothesis: the performance issue was due to one server's low network throughput, not the query or storage format.

Afterword

Even in a small four-node cluster, checking pairwise data transfer required 24 tasks. Because every node must be tested against every other node, the number of tasks grows roughly with the square of the cluster size, which makes manual testing impractical. Koalas automates both execution and result visualization with a single click.
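
One way to arrive at 24 tasks for four nodes is to count each ordered (sender, receiver) pair and assume one send task plus one receive task per pair. That breakdown is an assumption that happens to reproduce the figure; the article does not spell out how the tasks are counted. The node names below are hypothetical.

```python
from itertools import permutations


def transfer_tasks(nodes):
    """Enumerate send/receive tasks for every ordered pair of distinct nodes."""
    tasks = []
    for src, dst in permutations(nodes, 2):
        tasks.append(("send", src, dst))  # src pushes data to dst
        tasks.append(("recv", dst, src))  # dst accepts data from src
    return tasks


nodes = ["n1", "n2", "n3", "n4"]
print(len(transfer_tasks(nodes)))  # 24

# Under this counting, the closed form is 2 * n * (n - 1).
for n in (4, 10, 50):
    print(n, 2 * n * (n - 1))
# 4 24
# 10 180
# 50 4900
```

Under this assumption a 50-node cluster already needs 4900 tasks, which illustrates why the check is only practical when automated.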
