
Comparative Analysis of Elasticsearch and ClickHouse: Architecture, Query Performance, and Practical Benchmarks

This article compares Elasticsearch and ClickHouse by outlining their architectures, detailing deployment configurations, presenting benchmark queries and performance results, and concluding that ClickHouse generally outperforms Elasticsearch in many basic search and aggregation scenarios, while also noting each system's strengths and limitations.

Code Ape Tech Column

Elasticsearch is a real‑time distributed search and analytics engine built on Lucene, often used together with Logstash and Kibana (the ELK stack). ClickHouse, developed by Yandex, is a column‑oriented relational database for OLAP workloads that has become very popular in recent years.

Many companies, such as Ctrip and Kuaishou, are migrating their log‑analysis pipelines from Elasticsearch to ClickHouse due to performance and cost considerations.

Architecture and Design Comparison

Elasticsearch relies on Lucene’s inverted index and Bloom filters to solve search problems at scale. It uses a distributed architecture with shards and replicas, and nodes can assume different roles: client node (API access), data node (stores and indexes data), and master node (cluster coordination).

ClickHouse follows an MPP (Massively Parallel Processing) architecture for distributed ROLAP: every node has equal responsibility and processes a portion of the data. Data is stored column‑wise, and the engine combines vectorized execution, log‑structured merge (MergeTree) storage, sparse primary indexes, and SIMD optimizations; ZooKeeper coordinates replication in distributed deployments. ClickHouse also supports Bloom‑filter skip indexes for search.
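As a toy illustration (plain Python, not ClickHouse code, with made-up values) of why column-wise storage favors aggregation-heavy OLAP queries, compare a row-oriented layout, where every query touches whole records, with a columnar layout, where a query scans only the fields it needs:

```python
# Toy illustration of row-oriented vs column-oriented layouts.
# A column store only touches the values a query needs, which is
# what makes aggregations over wide tables cheap.

rows = [  # row-oriented: each record stores every field together
    {"hostname": "for.org", "priority": 34, "version": 1},
    {"hostname": "up.com", "priority": 13, "version": 2},
    {"hostname": "for.org", "priority": 34, "version": 2},
]

# column-oriented: one contiguous array per field
columns = {
    "hostname": ["for.org", "up.com", "for.org"],
    "priority": [34, 13, 34],
    "version": [1, 2, 2],
}

# SELECT count(version) -- the row store reads whole records...
row_count = sum(1 for r in rows if r["version"] is not None)
# ...while the column store scans only the "version" array.
col_count = sum(1 for v in columns["version"] if v is not None)

assert row_count == col_count == 3
```

The same asymmetry is why ClickHouse can also apply SIMD and vectorized execution: each column is a homogeneous, contiguous array.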

Query Comparison – Practical Test

To compare basic query capabilities, a Docker‑Compose test environment was built. The Elasticsearch stack consists of a single Elasticsearch container and a Kibana container:

version: '3.7'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.4.0
    container_name: elasticsearch
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    cap_add:
      - IPC_LOCK
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M

  kibana:
    container_name: kibana
    image: docker.elastic.co/kibana/kibana:7.4.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - 5601:5601
    depends_on:
      - elasticsearch

volumes:
  elasticsearch-data:
    driver: local

The ClickHouse stack includes a single ClickHouse container and TabixUI as a client:

version: "3.7"
services:
  clickhouse:
    container_name: clickhouse
    image: yandex/clickhouse-server
    volumes:
      - ./data/config:/var/lib/clickhouse
    ports:
      - "8123:8123"
      - "9000:9000"
      - "9009:9009"
      - "9004:9004"
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "localhost:8123/ping"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M

  tabixui:
    container_name: tabixui
    image: spoonest/clickhouse-tabix-web-client
    environment:
      - CH_NAME=dev
      - CH_HOST=127.0.0.1:8123
      - CH_LOGIN=default
    ports:
      - "18080:80"
    depends_on:
      - clickhouse
    deploy:
      resources:
        limits:
          cpus: '0.1'
          memory: 128M
        reservations:
          memory: 128M

Data ingestion uses Vector.dev (similar to Fluentd) to generate synthetic syslog data and feed both stacks. The ClickHouse table is created with:

CREATE TABLE default.syslog(
    application String,
    hostname String,
    message String,
    mid String,
    pid String,
    priority Int16,
    raw String,
    timestamp DateTime('UTC'),
    version Int16
) ENGINE = MergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY timestamp
    TTL timestamp + toIntervalMonth(1);
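Assuming this schema, a minimal sanity check (a sketch with an illustrative row, run from clickhouse-client) can confirm the table accepts data and that time-range scans work as intended:

```sql
-- Hypothetical sample row for smoke-testing the table definition.
INSERT INTO default.syslog
    (application, hostname, message, mid, pid, priority, raw, timestamp, version)
VALUES
    ('vector', 'for.org', 'test message', 'ID1', '42', 34, 'raw line', now(), 1);

-- Rows are partitioned by day and ordered by timestamp,
-- so time-bounded scans only touch the relevant parts.
SELECT count() FROM default.syslog
WHERE timestamp >= now() - INTERVAL 1 DAY;
```

Note that `ORDER BY timestamp` also defines the sparse primary index, and the `TTL` clause drops data older than one month automatically during background merges.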

The Vector pipeline (vector.toml) defines sources, transforms, and sinks for both Elasticsearch and ClickHouse:

[sources.in]
  type = "generator"
  format = "syslog"
  interval = 0.01
  count = 100000

[transforms.clone_message]
  type = "add_fields"
  inputs = ["in"]
  fields.raw = "{{ message }}"

[transforms.parser]
  # General
  type = "regex_parser"
  inputs = ["clone_message"]
  field = "message"
  patterns = ['^<(?P
\d*)>(?P
\d) (?P
\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) (?P
\w+\.\w+) (?P
\w+) (?P
\d+) (?P
ID\d+) - (?P
.*)$']
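The named groups in this pattern map onto the columns of the `syslog` table. A quick check (a sketch in Python, using an illustrative RFC 5424‑style log line, not actual generator output) verifies the pattern extracts the expected fields:

```python
import re

# Same pattern as in vector.toml; group names follow the syslog table columns.
PATTERN = re.compile(
    r'^<(?P<priority>\d*)>(?P<version>\d) '
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) '
    r'(?P<hostname>\w+\.\w+) (?P<application>\w+) (?P<pid>\d+) '
    r'(?P<mid>ID\d+) - (?P<message>.*)$'
)

# Illustrative line shaped like Vector's syslog generator output.
line = '<34>1 2021-01-01T00:00:00.000Z for.org app 1234 ID123 - hello world'
m = PATTERN.match(line)
assert m is not None
print(m.group('hostname'), m.group('priority'), m.group('message'))
```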

[transforms.coercer]
  type = "coercer"
  inputs = ["parser"]
  types.timestamp = "timestamp"
  types.version = "int"
  types.priority = "int"

[sinks.out_console]
  type = "console"
  inputs = ["coercer"]
  target = "stdout"
  encoding.codec = "json"

[sinks.out_clickhouse]
  host = "http://host.docker.internal:8123"
  inputs = ["coercer"]
  table = "syslog"
  type = "clickhouse"
  encoding.only_fields = ["application", "hostname", "message", "mid", "pid", "priority", "raw", "timestamp", "version"]
  encoding.timestamp_format = "unix"

[sinks.out_es]
  type = "elasticsearch"
  inputs = ["coercer"]
  compression = "none"
  endpoint = "http://host.docker.internal:9200"
  index = "syslog-%F"
  healthcheck.enabled = true

Benchmark queries were executed on both stacks (10 runs each) covering match_all, single‑field match, multi‑field match, term, range, exists, regex, and aggregation scenarios. Example queries:

Match all: ES {"query":{"match_all":{}}} vs ClickHouse SELECT * FROM syslog

Single‑field match: ES {"query":{"match":{"hostname":"for.org"}}} vs ClickHouse SELECT * FROM syslog WHERE hostname='for.org'

Range query: ES {"query":{"range":{"version":{"gte":2}}}} vs ClickHouse SELECT * FROM syslog WHERE version >= 2

Aggregation count: ES {"aggs":{"version_count":{"value_count":{"field":"version"}}}} vs ClickHouse SELECT count(version) FROM syslog

Distinct count: ES {"aggs":{"my-agg-name":{"cardinality":{"field":"priority"}}}} vs ClickHouse SELECT count(distinct(priority)) FROM syslog
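The last two aggregations answer different questions: `value_count`/`count()` tallies non‑null values, while `cardinality`/`count(distinct ...)` tallies unique ones (and Elasticsearch's `cardinality` is approximate, based on HyperLogLog++, at high cardinalities). A toy Python sketch on made-up priority values makes the distinction concrete:

```python
# Illustrative priority values as they might appear in the syslog table.
priorities = [34, 13, 34, 165, 13, 34]

# ES value_count / ClickHouse count(priority): every non-null value counts.
value_count = sum(1 for p in priorities if p is not None)

# ES cardinality / ClickHouse count(distinct priority): unique values only.
distinct_count = len(set(priorities))

print(value_count, distinct_count)
```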

Performance results (charted in the original article) indicate that ClickHouse consistently delivers lower latency than Elasticsearch for most queries, especially aggregation‑heavy workloads, while remaining competitive on regex and term queries.

The author notes that the tests were run without tuning either system or enabling ClickHouse's Bloom‑filter skip indexes, yet ClickHouse still outperformed Elasticsearch, demonstrating its suitability for many search‑oriented use cases.

Conclusion

The comparative tests show that ClickHouse excels in basic query and aggregation performance compared to Elasticsearch, explaining why many organizations are migrating their log‑analysis pipelines to ClickHouse. Elasticsearch still offers richer query features, but for the scenarios covered, ClickHouse provides superior speed.


Big Data · Elasticsearch · Performance Benchmark · ClickHouse · Database Comparison
Written by

Code Ape Tech Column

Former Ant Group P8 engineer and dedicated technologist, sharing full‑stack Java, interview, and career advice through this column. Site: java-family.cn
