Why ClickHouse Outperforms Elasticsearch in Real‑World Queries
This article compares Elasticsearch and ClickHouse across architecture, query capabilities, and performance, using docker-compose stacks and Python SDK tests. The results show that ClickHouse often delivers superior speed, especially for aggregation and regex queries, while also highlighting each system's design trade-offs.
Elasticsearch is a real‑time distributed search and analytics engine built on Lucene, often used together with Logstash and Kibana (the ELK stack) for end‑to‑end log and search analysis. ClickHouse, developed by Yandex, is a column‑oriented relational database for OLAP workloads, open‑sourced in 2016.
While Elasticsearch remains the most popular big‑data log and search solution, many companies (e.g., Ctrip, Kuaishou) have begun migrating to ClickHouse.
Architecture and Design Comparison
Elasticsearch relies on Lucene and focuses on search, using inverted indexes and Bloom filters. It achieves distributed search performance and high availability through sharding and replica mechanisms.
Client Node – handles API and data access, does not store or process data.
Data Node – stores data and builds indexes.
Master Node – coordinates the cluster, does not store data.
ClickHouse follows an MPP architecture for distributed ROLAP: every node is a peer and processes a portion of the data. It stores data column-wise with compression, uses sparse indexes and log-structured merge trees, and leverages CPU features such as SIMD. Cluster coordination is handled by ZooKeeper.
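The sparse-index idea can be sketched in a few lines of Python. This is a simplification: real ClickHouse marks point at compressed column offsets rather than row numbers, but the principle is the same — record only every Nth key of the sorted primary column, then binary-search the marks to narrow a lookup to one granule.

```python
import bisect

GRANULARITY = 8192  # rows per granule; ClickHouse's default index_granularity

def build_sparse_index(sorted_keys):
    """Keep only every GRANULARITY-th key, as a sparse primary index does."""
    return sorted_keys[::GRANULARITY]

def granule_range(marks, key):
    """Return the (start_row, end_row) slice of rows that may contain `key`."""
    i = bisect.bisect_right(marks, key)
    start = max(i - 1, 0) * GRANULARITY
    return start, start + GRANULARITY
```

Instead of one index entry per row (as in a B-tree), the index holds one entry per 8192 rows, so it stays small enough to live in memory even for billions of rows; the cost is reading one full granule per lookup.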
Both systems support Bloom filters for search acceleration.
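To make that concrete, here is a minimal Bloom filter sketch in Python — a toy with three hash positions, not the implementation either engine actually uses. The useful property is one-sided: a lookup that returns False guarantees the item is absent, so whole blocks of data can be skipped without reading them.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False = definitely absent; True = possibly present (false positives allowed).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

False positives are possible (all k bits set by other items), so a True answer still requires checking the underlying data; only the False answer is authoritative.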
Query Comparison: A Practical Test
The test setup includes four parts: an Elasticsearch stack, a ClickHouse stack, a data‑import stack using Vector.dev, and a test‑control stack using Jupyter notebooks.
1. Elasticsearch stack
<code>version: '3.7'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.4.0
    container_name: elasticsearch
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    cap_add:
      - IPC_LOCK
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M
  kibana:
    image: docker.elastic.co/kibana/kibana:7.4.0
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - 5601:5601
    depends_on:
      - elasticsearch
volumes:
  elasticsearch-data:
    driver: local</code>
2. ClickHouse stack
<code>version: "3.7"
services:
  clickhouse:
    container_name: clickhouse
    image: yandex/clickhouse-server
    volumes:
      - ./data/config:/var/lib/clickhouse
    ports:
      - "8123:8123"
      - "9000:9000"
      - "9009:9009"
      - "9004:9004"
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "localhost:8123/ping"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M
  tabixui:
    container_name: tabixui
    image: spoonest/clickhouse-tabix-web-client
    environment:
      - CH_NAME=dev
      - CH_HOST=127.0.0.1:8123
      - CH_LOGIN=default
    ports:
      - "18080:80"
    depends_on:
      - clickhouse
    deploy:
      resources:
        limits:
          cpus: '0.1'
          memory: 128M
        reservations:
          memory: 128M</code>
3. Data import stack (Vector.dev)
First, the destination table in ClickHouse:
<code>CREATE TABLE default.syslog (
  application String,
  hostname String,
  message String,
  mid String,
  pid String,
  priority Int16,
  raw String,
  timestamp DateTime('UTC'),
  version Int16
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY timestamp
TTL timestamp + toIntervalMonth(1);
</code>
Then the Vector configuration:
<code>[sources.in]
type = "generator"
format = "syslog"
interval = 0.01
count = 100000
[transforms.clone_message]
type = "add_fields"
inputs = ["in"]
fields.raw = "{{ message }}"
[transforms.parser]
type = "regex_parser"
inputs = ["clone_message"]
field = "message"
patterns = ['^<(?P<priority>\d*)>(?P<version>\d) (?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) (?P<hostname>\w+\.\w+) (?P<application>\w+) (?P<pid>\d+) (?P<mid>ID\d+) - (?P<message>.*)$']
[transforms.coercer]
type = "coercer"
inputs = ["parser"]
types.timestamp = "timestamp"
types.version = "int"
types.priority = "int"
[sinks.out_console]
type = "console"
inputs = ["coercer"]
target = "stdout"
encoding.codec = "json"
[sinks.out_clickhouse]
host = "http://host.docker.internal:8123"
inputs = ["coercer"]
table = "syslog"
type = "clickhouse"
encoding.only_fields = ["application", "hostname", "message", "mid", "pid", "priority", "raw", "timestamp", "version"]
encoding.timestamp_format = "unix"
[sinks.out_es]
type = "elasticsearch"
inputs = ["coercer"]
compression = "none"
endpoint = "http://host.docker.internal:9200"
index = "syslog-%F"
healthcheck.enabled = true
</code>
Running the Docker Compose files brings up both stacks, after which Vector generates synthetic syslog data (100,000 records) and writes it simultaneously to Elasticsearch and ClickHouse.
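Vector's regex_parser applies the pattern from the config above to every generated line. A quick way to sanity-check such a pattern is to run it in Python against a hand-written sample (the sample line below is illustrative, not actual generator output):

```python
import re

# Same pattern as in the Vector regex_parser transform above.
SYSLOG_RE = re.compile(
    r'^<(?P<priority>\d*)>(?P<version>\d) '
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) '
    r'(?P<hostname>\w+\.\w+) (?P<application>\w+) (?P<pid>\d+) '
    r'(?P<mid>ID\d+) - (?P<message>.*)$'
)

def parse_syslog(line):
    """Return the named capture groups as a dict, or None if the line doesn't match."""
    m = SYSLOG_RE.match(line)
    return m.groupdict() if m else None

sample = "<13>1 2021-01-01T00:00:00.000Z for.org ahmadajmi 1234 ID567 - hello world"
print(parse_syslog(sample))
```

Every named group in the pattern becomes a field in the parsed event, which is exactly the field list the coercer and the two sinks operate on.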
Query examples
All queries were executed ten times through the respective Python SDKs, and the response-time distributions were recorded.
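The harness itself is short. Here is a sketch of the timing loop, assuming the `elasticsearch` and `clickhouse-driver` Python packages; the function names and structure are my reconstruction, not the article's original test code:

```python
import statistics
import time

def time_query(run_query, runs=10):
    """Execute run_query `runs` times; return per-run wall-clock seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations

def summarize(durations):
    """Reduce a list of durations to the stats worth comparing."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "mean": statistics.mean(durations),
        "max": max(durations),
    }

# With real clients this would look roughly like:
#   from elasticsearch import Elasticsearch
#   from clickhouse_driver import Client
#   es, ch = Elasticsearch("http://localhost:9200"), Client("localhost")
#   es_times = time_query(lambda: es.search(index="syslog-*",
#                                           body={"query": {"match_all": {}}}))
#   ch_times = time_query(lambda: ch.execute("SELECT * FROM syslog"))
```

`time.perf_counter()` is used rather than `time.time()` because it is monotonic and has the highest available resolution for interval timing.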
Return all records
<code># ES
{ "query": { "match_all": {} } }
# ClickHouse
SELECT * FROM syslog</code>
Match a single field
<code># ES
{ "query": { "match": { "hostname": "for.org" } } }
# ClickHouse
SELECT * FROM syslog WHERE hostname='for.org'</code>
Match multiple fields
<code># ES
{ "query": { "multi_match": { "query": "up.com ahmadajmi", "fields": ["hostname", "application"] } } }
# ClickHouse
SELECT * FROM syslog WHERE hostname='up.com' OR application='ahmadajmi'</code>
Term (word) query
<code># ES
{ "query": { "term": { "message": "pretty" } } }
# ClickHouse
SELECT * FROM syslog WHERE lowerUTF8(raw) LIKE '%pretty%'</code>
Range query (version >= 2)
<code># ES
{ "query": { "range": { "version": { "gte": 2 } } } }
# ClickHouse
SELECT * FROM syslog WHERE version >= 2</code>
Exists query
<code># ES
{ "query": { "exists": { "field": "application" } } }
# ClickHouse
SELECT * FROM syslog WHERE application IS NOT NULL</code>
Regex query
<code># ES
{ "query": { "regexp": { "hostname": { "value": "up.*", "flags": "ALL" } } } }
# ClickHouse
SELECT * FROM syslog WHERE match(hostname, 'up.*')</code>
Aggregation count
<code># ES
{ "aggs": { "version_count": { "value_count": { "field": "version" } } } }
# ClickHouse
SELECT count(version) FROM syslog</code>
Cardinality (distinct count)
<code># ES
{ "aggs": { "my-agg-name": { "cardinality": { "field": "priority" } } } }
# ClickHouse
SELECT count(DISTINCT priority) FROM syslog</code>
Performance results show that ClickHouse outperforms Elasticsearch in most query types, including regex and term queries, and excels dramatically in aggregation scenarios thanks to its columnar engine.
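The two aggregation semantics are easy to mirror in plain Python, which also hints at why a columnar engine wins here: each aggregate touches a single contiguous column and never reads the rest of the row. This is a toy model of columnar storage, not ClickHouse's actual execution:

```python
# Toy columnar store: each field is its own contiguous array.
columns = {
    "version":  [1, 1, 2, 2, 2],
    "priority": [13, 13, 34, 13, 34],
}

# value_count / count(version): number of non-null values in one column.
value_count = sum(v is not None for v in columns["version"])

# cardinality / count(DISTINCT priority): distinct values in one column.
cardinality = len(set(columns["priority"]))

print(value_count, cardinality)  # 5 2
```

A row store would have to deserialize all nine fields of every record to answer either query; the columnar layout reads one field's array, which compresses well and streams through the CPU cache.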
The total query time comparison further confirms ClickHouse’s speed advantage.
These tests were run without any specific optimizations or Bloom filter activation for ClickHouse, indicating its strong out‑of‑the‑box performance for many search‑related workloads. While Elasticsearch offers a richer query DSL and features not covered here, ClickHouse proves to be a highly capable alternative for log and analytics scenarios.
Conclusion
The comparative tests demonstrate that ClickHouse delivers superior performance for basic queries and aggregations, explaining why many enterprises are migrating from Elasticsearch to ClickHouse for large‑scale log and analytics use cases.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on the transformation of operations work and hope to accompany you through your operations career as we grow together.