Databases 7 min read

Resolving ClickHouse “too many mutations” Errors by Cleaning Mutations and Switching to ReplacingMergeTree

The article describes a real‑world ClickHouse incident where excessive UPDATE‑style mutations caused a “too many mutations(1036)” error, explains the cluster’s configuration, and details a step‑by‑step recovery process that clears pending mutations and migrates tables to the ReplacingMergeTree engine to restore service.

Aikesheng Open Source Community
Aikesheng Open Source Community
Aikesheng Open Source Community
Resolving ClickHouse “too many mutations” Errors by Cleaning Mutations and Switching to ReplacingMergeTree

During a routine afternoon, a large telecom operator reported that a ClickHouse cluster failed to insert data because the system returned the error too many mutations(1036) . Investigation revealed thousands of unfinished mutations in system.mutations , each generated by frequent ALTER TABLE … UPDATE statements.

The cluster runs ClickHouse version 23.9.1, with a 2‑replica, 11‑shard architecture storing roughly 4 TB of data. Business logic performs about 50 update operations per day, retrying until success, which quickly exhausts the default mutation limit of 1000.

Because the business workflow and ClickHouse version could not be changed, the team decided to adapt the schema by switching the affected tables to engines that handle updates more efficiently, such as ReplacingMergeTree , CollapsingMergeTree, or VersionedCollapsingMergeTree.

The recovery plan consists of four steps:

Clear pending mutations using KILL MUTATION commands.

Change the table engine to a more suitable one (the example uses ReplacingMergeTree).

Restart the service.

Verify that the issue is resolved.

1. Clear mutations

All 23,805 pending mutations were removed with a single KILL MUTATION statement:

-- kill 未完成的 mutation
KILL MUTATION WHERE database = 'default' AND table = 'table'

2. Update table engine

The article demonstrates creating a ReplacingMergeTree table and inserting data to show how the engine keeps only the latest version of rows with the same primary key.

-- 创建 ReplacingMergeTree 表
CREATE TABLE hackernews_rmt (
    id UInt32,
    author String,
    comment String,
    views UInt64
) ENGINE = ReplacingMergeTree
PRIMARY KEY (author, id);

-- 插入两行数据
INSERT INTO hackernews_rmt VALUES
   (1, 'ricardo', 'This is post #1', 0),
   (2, 'ch_fan', 'This is post #2', 0);

-- 再插入两行更新后的数据
INSERT INTO hackernews_rmt VALUES
   (1, 'ricardo', 'This is post #1', 100),
   (2, 'ch_fan', 'This is post #2', 200);

Queries using final automatically return the latest rows based on the id,author ordering key. Running OPTIMIZE merges partitions, discarding old versions.

3. Verify mutation count

SELECT count(*)
FROM clusterAllReplicas('default', system.mutations)
WHERE is_done = 0;

After the engine change, the mutation count dropped to zero and the business resumed normal operation, with a total downtime of about 30 minutes.

Summary

Monitor mutations continuously and analyze their root causes.

If frequent UPDATE/DELETE operations generate many mutations, consider switching to a more suitable table engine.

For schema‑change‑driven mutations, the engine switch may be unnecessary.

When designing update and delete strategies, keep the system’s capacity in mind and use lightweight deletes for occasional removals. The solution has remained stable for four months.

DatabaseClickHouseTroubleshootingReplacingMergeTreemutationsTable Engine
Aikesheng Open Source Community
Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.