
How We Rescued a ClickHouse Logging Cluster After a ZooKeeper‑Induced Read‑Only Failure

A production log query system became unavailable and Kafka backlog alerts fired. The investigation found ClickHouse tables stuck in read‑only mode because ZooKeeper metadata no longer matched the local ClickHouse metadata after a TTL policy change. Recovery proceeded step by step: ZooKeeper restarts, metadata fixes, and table reconstruction.

dbaplus Community

Mar 7, 2023


Symptoms

Developers reported that the log query system was unavailable, and the operations team received Kafka message backlog alerts.

Background

The logging pipeline had been migrated from an ELK stack to a ClickHouse‑based architecture. It consists of the following components:

Flume: Forwards Alibaba Cloud SLS logs to a Kafka topic.

Fluent‑bit/Fluentd: Early clusters used Fluentd; later ones switched to the more efficient Fluent‑bit.

Python split topic: Distributes Kubernetes‑collected logs to sub‑topics for downstream processing by Gohangout or Flink.

Gohangout: A lightweight Golang implementation of Logstash that consumes from Kafka and writes to ES, ClickHouse, and other stores.

Flink: Jobs that read logs from Kafka and write them to ClickHouse, supporting batched and timed writes with custom logic.

ClickHouse: A column‑oriented OLAP DBMS known for fast queries and high compression.

ClickVisual: A lightweight visual platform for log query, analysis, and alerting, with a front end largely inspired by Kibana.
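The article does not show the topic-splitting script, so the following is only a minimal sketch of what such routing logic might look like. The JSON field names (kubernetes, namespace_name), the namespace-to-topic mapping, and all topic names are assumptions for illustration; the Kafka consumer and producer calls are left out so the routing decision stays self-contained.

```python
import json

# Hypothetical mapping from Kubernetes namespace to Kafka sub-topic.
# Neither the namespaces nor the topic names come from the article.
SUB_TOPICS = {
    "payments": "k8s-log-payments",
    "search": "k8s-log-search",
}
DEFAULT_TOPIC = "k8s-log-other"


def pick_sub_topic(raw_message: bytes) -> str:
    """Return the sub-topic that a raw log message should be forwarded to."""
    try:
        record = json.loads(raw_message)
    except ValueError:
        # Covers both invalid JSON and undecodable bytes.
        return DEFAULT_TOPIC
    if not isinstance(record, dict):
        return DEFAULT_TOPIC
    kubernetes = record.get("kubernetes")
    if not isinstance(kubernetes, dict):
        return DEFAULT_TOPIC
    namespace = kubernetes.get("namespace_name")
    if not isinstance(namespace, str):
        return DEFAULT_TOPIC
    return SUB_TOPICS.get(namespace, DEFAULT_TOPIC)
```

Messages that cannot be parsed fall through to a default topic rather than being dropped, so a malformed record does not stall the splitter.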

Analysis and Recovery

The investigation started with the Kafka topics consumed by the Flink jobs and found a consumer that had stopped processing messages, which explained the Kafka backlog.

Flink logs showed that ClickHouse tables had become read‑only, preventing writes. The same error appeared in Gohangout jobs.

ClickHouse uses ZooKeeper for metadata storage and cluster coordination. The ClickHouse and ZooKeeper clusters were both three‑node deployments colocated on the same hosts. The ZooKeeper logs showed that snapshot files could not be loaded because a single serialized record exceeded the 1 MiB size limit (most likely the default value of ZooKeeper's jute.maxbuffer setting).

The root cause was traced to a TTL policy change performed the previous day: the retention period was altered from six months to sixty days. The DDL executed was:

ALTER TABLE k8s_log.data_pipeline_k8s_log ON CLUSTER sre_ck_cluster MODIFY TTL timestamp + toIntervalDay(60);

This change caused a mismatch between ZooKeeper metadata (still reflecting the six‑month policy) and ClickHouse local metadata (reflecting the new sixty‑day policy), leading to the read‑only state.
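A mismatch of this kind can be checked mechanically before restarting anything. The sketch below is a hedged illustration, not the procedure the team used: it pulls the table-level TTL expression out of the text of a local metadata .sql file and compares it with a TTL expression the caller has already read from ZooKeeper. How that expression is fetched from ZooKeeper is deliberately left out, and the function names are hypothetical.

```python
import re
from typing import Optional

# Matches a table-level TTL clause and stops at a following SETTINGS clause, if any.
# A simplification: column-level TTLs or identifiers named "ttl" are not handled.
_TTL_RE = re.compile(r"\bTTL\s+(.+?)(?=\s+SETTINGS\b|\s*$)", re.IGNORECASE | re.DOTALL)


def normalize(expression: str) -> str:
    """Collapse runs of whitespace so formatting differences do not count as a mismatch."""
    return " ".join(expression.split())


def extract_local_ttl(sql_text: str) -> Optional[str]:
    """Return the TTL expression found in a local metadata .sql file, or None."""
    match = _TTL_RE.search(sql_text)
    if match is None:
        return None
    return normalize(match.group(1))


def ttl_matches(local_sql_text: str, zk_ttl_expression: str) -> bool:
    """True when the local TTL equals the expression the caller read from ZooKeeper."""
    local_ttl = extract_local_ttl(local_sql_text)
    return local_ttl is not None and local_ttl == normalize(zk_ttl_expression)
```

The comparison is purely textual, so two expressions that are equivalent but written differently (for example 60 days versus two months) would still be reported as a mismatch.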

Recovery steps included:

Restarting the ZooKeeper nodes one by one. Once the recent snapshot and log files had been moved to a backup directory, the ZooKeeper cluster returned to normal.

Manually aligning ClickHouse local metadata with ZooKeeper by editing the *.sql files in the metadata directory to restore the original TTL.

Restarting ClickHouse nodes sequentially. The first node started successfully; the second failed repeatedly, and the third showed a “Too many parts (300)” error.

Deleting the problematic table’s data and metadata on the affected node, then rebuilding the table after removing its ZooKeeper entry.

After cleaning up the ZooKeeper metadata (using deleteall or delete commands) and recreating the table, the cluster returned to normal operation, and both Gohangout and Flink jobs resumed consuming and storing logs.
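One way to confirm that a recovery like this has actually cleared the read‑only state is to query the system.replicas table on each node. The sketch below assumes the ClickHouse HTTP interface is reachable on its default port 8123 and that the default user needs no password; the host names and function names are placeholders, not details from the incident.

```python
import urllib.parse
import urllib.request
from typing import List, Tuple

# system.replicas exposes is_readonly and zookeeper_path for replicated tables.
READONLY_QUERY = (
    "SELECT database, table, zookeeper_path "
    "FROM system.replicas WHERE is_readonly FORMAT TabSeparated"
)


def parse_readonly_rows(tsv_text: str) -> List[Tuple[str, str, str]]:
    """Turn TabSeparated output into (database, table, zookeeper_path) tuples."""
    rows = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        database, table, zk_path = line.split("\t")
        rows.append((database, table, zk_path))
    return rows


def fetch_readonly_replicas(host: str, port: int = 8123,
                            timeout: float = 5.0) -> List[Tuple[str, str, str]]:
    """Ask one ClickHouse node which replicated tables are currently read-only."""
    query_string = urllib.parse.urlencode({"query": READONLY_QUERY})
    url = f"http://{host}:{port}/?{query_string}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_readonly_rows(response.read().decode("utf-8"))
```

Calling fetch_readonly_replicas against each of the three nodes and getting an empty list from all of them would indicate that no replicated table is still read‑only.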

Remaining Issues

Validate the exact cause of the “Too many parts (300)” error.

Establish a safe procedure for modifying TTL policies.

Implement monitoring and alerting for ClickHouse and ZooKeeper cluster health.

Deepen understanding of ClickHouse and ZooKeeper metadata storage mechanisms and ClickHouse startup sequence.
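As a starting point for the monitoring item above, the following sketch combines two basic checks: a ZooKeeper liveness probe using the ruok four‑letter command, and a filter for partitions whose active part count is approaching the level at which inserts were rejected. It assumes ruok is enabled on the ZooKeeper servers (newer versions require it to be whitelisted), and the warning threshold of 200 is an arbitrary illustrative value, not a recommendation from the article.

```python
import socket
from typing import Iterable, List, Tuple

# Query whose result rows feed crowded_partitions(); run it with any ClickHouse client.
PARTS_QUERY = (
    "SELECT database, table, partition, count() AS parts "
    "FROM system.parts WHERE active "
    "GROUP BY database, table, partition"
)

PartRow = Tuple[str, str, str, int]


def zookeeper_ruok(host: str, port: int = 2181, timeout: float = 3.0) -> bool:
    """Send the 'ruok' four-letter command and report whether the reply is 'imok'."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"ruok")
        while True:
            data = conn.recv(64)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).strip() == b"imok"


def crowded_partitions(rows: Iterable[PartRow], warn_at: int = 200) -> List[PartRow]:
    """Return the rows whose active part count has reached the warning threshold."""
    return [row for row in rows if row[3] >= warn_at]
```

A reply of imok only shows that the ZooKeeper process is answering, not that the ensemble has a quorum, so a fuller check would also look at the output of commands such as mntr.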
