Why ClickHouse Outperforms Elasticsearch in Log Analytics: A Practical Comparison

This article compares Elasticsearch and ClickHouse for log analytics by detailing their architectures, setting up Docker‑Compose stacks, ingesting synthetic syslog data with Vector, running equivalent queries, and measuring performance, revealing ClickHouse’s superior speed in most scenarios.

ITPUB

May 15, 2023

Introduction

Elasticsearch (ES) is a real‑time distributed search and analytics engine built on Lucene. ClickHouse, developed by Yandex, is a column‑oriented relational OLAP database that has become popular for large‑scale analytical workloads.

Architecture Comparison

ES builds on Lucene's inverted indexes to provide fast full‑text search, and uses shards and replicas for scalability and high availability. ClickHouse follows an MPP architecture in which each node processes its own portion of the data independently. It stores data column‑wise and relies on vectorized execution with SIMD instructions, MergeTree storage (similar in spirit to log‑structured merge trees), and sparse indexes. Both systems support Bloom filters.
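To make the contrast concrete, here is a minimal Python sketch of the two index styles. It is an illustration only, not either engine's implementation: an inverted index maps each token to the rows that contain it, while a sparse index keeps one key per block of sorted rows so that a range query can skip whole blocks. The names `build_inverted_index`, `build_sparse_index`, `blocks_for_range`, and `GRANULE` are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

GRANULE = 4  # hypothetical block size; real engines use far larger blocks


def build_inverted_index(docs):
    """Map each lowercase token to the sorted list of document ids containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for token in set(text.lower().split()):
            index[token].append(doc_id)
    return dict(index)


def build_sparse_index(sorted_keys, granule=GRANULE):
    """Keep only the first key of every block of `granule` sorted rows."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule)]


def blocks_for_range(sparse_index, low, high):
    """Return the block numbers that may contain keys in [low, high]."""
    # bisect_left keeps the previous block too, in case it ends with `low`.
    first = max(bisect_left(sparse_index, low) - 1, 0)
    last = bisect_right(sparse_index, high) - 1
    return list(range(first, last + 1))


docs = ["Pretty good uptime", "disk is full", "pretty slow disk"]
inverted = build_inverted_index(docs)
print(inverted["pretty"])  # ids of the documents containing the token

sparse = build_sparse_index(list(range(16)))
print(blocks_for_range(sparse, 5, 9))  # only these blocks need to be read
```

The inverted index answers "which rows contain this word" directly, whereas the sparse index only narrows a scan to candidate blocks. That difference is why the former suits text search and the latter suits large range scans and aggregations.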

Test Environment

Two Docker‑Compose stacks were used:

ES stack: a single‑node Elasticsearch container and a Kibana container.

ClickHouse stack: a ClickHouse server container and a TabixUI client container.

Docker‑Compose files (simplified), first for the Elasticsearch stack and then for the ClickHouse stack:

version: '3.7'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.4.0
    container_name: elasticsearch
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    cap_add:
      - IPC_LOCK
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M

  kibana:
    image: docker.elastic.co/kibana/kibana:7.4.0
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - 5601:5601
    depends_on:
      - elasticsearch

volumes:
  elasticsearch-data:
    driver: local

version: "3.7"
services:
  clickhouse:
    image: yandex/clickhouse-server
    container_name: clickhouse
    volumes:
      - ./data/config:/var/lib/clickhouse
    ports:
      - "8123:8123"
      - "9000:9000"
      - "9009:9009"
      - "9004:9004"
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "localhost:8123/ping"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M

  tabixui:
    image: spoonest/clickhouse-tabix-web-client
    container_name: tabixui
    environment:
      - CH_NAME=dev
      - CH_HOST=127.0.0.1:8123
      - CH_LOGIN=default
    ports:
      - "18080:80"
    depends_on:
      - clickhouse
    deploy:
      resources:
        limits:
          cpus: '0.1'
          memory: 128M
        reservations:
          memory: 128M

Data Ingestion Pipeline

Synthetic syslog data was generated and shipped with Vector (vector.dev), a log pipeline tool similar to Fluentd. The Vector configuration defines a syslog generator source, transforms for parsing and type coercion, and sinks for the console, ClickHouse, and Elasticsearch.

[sources.in]
  type = "generator"
  format = "syslog"
  interval = 0.01
  count = 100000

[transforms.clone_message]
  type = "add_fields"
  inputs = ["in"]
  fields.raw = "{{ message }}"

[transforms.parser]
  type = "regex_parser"
  inputs = ["clone_message"]
  field = "message"
  patterns = ['^<(?P<priority>\d*)>(?P<version>\d) (?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) (?P<hostname>\w+\.\w+) (?P<application>\w+) (?P<pid>\d+) (?P<mid>ID\d+) - (?P<message>.*)$']

[transforms.coercer]
  type = "coercer"
  inputs = ["parser"]
  types.timestamp = "timestamp"
  types.version = "int"
  types.priority = "int"

[sinks.out_console]
  type = "console"
  inputs = ["coercer"]
  target = "stdout"
  encoding.codec = "json"

[sinks.out_clickhouse]
  type = "clickhouse"
  inputs = ["coercer"]
  host = "http://host.docker.internal:8123"
  table = "syslog"
  encoding.only_fields = ["application", "hostname", "message", "mid", "pid", "priority", "raw", "timestamp", "version"]
  encoding.timestamp_format = "unix"

[sinks.out_es]
  type = "elasticsearch"
  inputs = ["coercer"]
  compression = "none"
  endpoint = "http://host.docker.internal:9200"
  index = "syslog-%F"
  healthcheck.enabled = true

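The pipeline is started with the following command. Note that `$(mkfile_path)` is Makefile variable syntax, so outside a Makefile it must be replaced with the directory that holds vector.toml: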

docker run \
  -v $(mkfile_path)/vector.toml:/etc/vector/vector.toml:ro \
  -p 18383:8383 \
  timberio/vector:nightly-alpine
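To check what the parser and coercer transforms produce, the same steps can be approximated in Python, whose `re` module accepts the same `(?P<name>...)` named‑group syntax. This is a sketch under assumptions: the function name `parse_syslog` and the sample line are invented for illustration, and Vector's own behaviour may differ in edge cases.

```python
import re
from datetime import datetime, timezone

# Same pattern as the regex_parser transform in the Vector configuration.
SYSLOG_PATTERN = re.compile(
    r"^<(?P<priority>\d*)>(?P<version>\d) "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) "
    r"(?P<hostname>\w+\.\w+) (?P<application>\w+) (?P<pid>\d+) "
    r"(?P<mid>ID\d+) - (?P<message>.*)$"
)


def parse_syslog(line):
    """Parse one syslog line and coerce types; return None if it does not match."""
    match = SYSLOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["raw"] = line  # mirrors the add_fields transform that copies the message
    # The pattern allows an empty priority (\d*), so guard the int conversion.
    event["priority"] = int(event["priority"]) if event["priority"] else None
    event["version"] = int(event["version"])
    event["timestamp"] = datetime.strptime(
        event["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return event


# Hypothetical sample line in the shape the pattern expects.
sample = "<34>1 2023-05-15T10:00:00.123Z up.com ahmadajmi 1234 ID567 - pretty quiet here"
event = parse_syslog(sample)
print(event["hostname"], event["application"], event["priority"], event["version"])
```

Note that `pid` stays a string here, which matches the `pid String` column in the ClickHouse table.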

ClickHouse Table Definition

CREATE TABLE default.syslog(
    application String,
    hostname String,
    message String,
    mid String,
    pid String,
    priority Int16,
    raw String,
    timestamp DateTime('UTC'),
    version Int16
) ENGINE = MergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY timestamp
    TTL timestamp + toIntervalMonth(1);
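The effect of `PARTITION BY toYYYYMMDD(timestamp)` and `ORDER BY timestamp` can be pictured with a small Python model. It is a simplified sketch, not ClickHouse's storage logic: rows are grouped into one bucket per day and kept sorted by timestamp, so a query restricted to a time range only needs to look at the matching buckets. All function names here are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def to_yyyymmdd(ts):
    """Python analogue of toYYYYMMDD: a datetime becomes the integer YYYYMMDD."""
    return ts.year * 10000 + ts.month * 100 + ts.day


def partition_rows(rows):
    """Group rows by partition key and keep each partition sorted by timestamp."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[to_yyyymmdd(row["timestamp"])].append(row)
    for part in partitions.values():
        part.sort(key=lambda r: r["timestamp"])
    return dict(partitions)


def partitions_to_scan(partitions, start, end):
    """Return the partition keys that can hold rows with start <= timestamp <= end."""
    low, high = to_yyyymmdd(start), to_yyyymmdd(end)
    return sorted(key for key in partitions if low <= key <= high)


def utc(day, hour):
    """Helper for building sample timestamps in May 2023 (illustrative data only)."""
    return datetime(2023, 5, day, hour, tzinfo=timezone.utc)


rows = [{"timestamp": utc(d, h)} for d in (14, 15, 16) for h in (23, 1)]
parts = partition_rows(rows)
print(sorted(parts))                                     # one key per day
print(partitions_to_scan(parts, utc(15, 0), utc(15, 23)))  # a single day is scanned
```

In this model the `TTL` clause would correspond to dropping rows (or whole day buckets) once their timestamp is more than a month old.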

Query Equivalence

After data ingestion, equivalent queries were executed on both stacks.

Return all records

# Elasticsearch
{ "query": { "match_all": {} } }

# ClickHouse
SELECT * FROM syslog;

Match a single field

# Elasticsearch
{ "query": { "match": { "hostname": "for.org" } } }

# ClickHouse
SELECT * FROM syslog WHERE hostname='for.org';

Multi‑field match

# Elasticsearch
{ "query": { "multi_match": { "query": "up.com ahmadajmi", "fields": ["hostname", "application"] } } }

# ClickHouse
SELECT * FROM syslog WHERE hostname='up.com' OR application='ahmadajmi';

Term (word) search

# Elasticsearch
{ "query": { "term": { "message": "pretty" } } }

# ClickHouse
SELECT * FROM syslog WHERE lowerUTF8(raw) LIKE '%pretty%';
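These two queries are close but not strictly identical: a term query looks for an exact token in the index, whereas `LIKE '%pretty%'` matches any substring. The Python sketch below illustrates the difference with a deliberately simple tokenizer, which is an assumption and far cruder than Elasticsearch's analyzers. Both function names are hypothetical.

```python
import re


def term_match(text, term):
    """Exact-token match: true only if `term` equals one of the lowercased tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return term in tokens


def substring_match(text, needle):
    """Analogue of lowerUTF8(col) LIKE '%needle%': true for any substring hit."""
    return needle in text.lower()


for line in ["A Pretty quiet day", "prettyprint enabled"]:
    print(line, term_match(line, "pretty"), substring_match(line, "pretty"))
```

The second line contains "pretty" only as part of a longer word, so the substring check accepts it while the token check does not. Result counts from the two systems can therefore differ slightly for this query.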

Range query (version >= 2)

# Elasticsearch
{ "query": { "range": { "version": { "gte": 2 } } } }

# ClickHouse
SELECT * FROM syslog WHERE version >= 2;

Exists query

# Elasticsearch
{ "query": { "exists": { "field": "application" } } }

# ClickHouse
SELECT * FROM syslog WHERE application IS NOT NULL;

Regex query

# Elasticsearch
{ "query": { "regexp": { "hostname": { "value": "up.*", "flags": "ALL", "max_determinized_states": 10000, "rewrite": "constant_score" } } } }

# ClickHouse
SELECT * FROM syslog WHERE match(hostname, 'up.*');
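One subtlety when comparing these two: Elasticsearch `regexp` patterns must match the whole term, while ClickHouse `match` succeeds if the pattern is found anywhere in the string. The following Python sketch mimics the two behaviours with `re.fullmatch` and `re.search`. Python's regex dialect is not identical to Lucene's or RE2's, so treat this as an approximation that holds for simple patterns such as `up.*`.

```python
import re


def anchored_match(value, pattern):
    """Whole-value match, in the style of an Elasticsearch regexp query."""
    return re.fullmatch(pattern, value) is not None


def unanchored_match(value, pattern):
    """Match anywhere in the value, in the style of ClickHouse match()."""
    return re.search(pattern, value) is not None


for hostname in ["up.com", "setup.com"]:
    print(hostname, anchored_match(hostname, "up.*"), unanchored_match(hostname, "up.*"))
```

Writing the ClickHouse pattern as `^up.*` would make the two queries select the same hostnames.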

Aggregation – count of a field

# Elasticsearch
{ "aggs": { "version_count": { "value_count": { "field": "version" } } } }

# ClickHouse
SELECT count(version) FROM syslog;

Distinct count

# Elasticsearch
{ "aggs": { "my-agg-name": { "cardinality": { "field": "priority" } } } }

# ClickHouse
SELECT count(DISTINCT priority) FROM syslog;

Performance Results

Each query was executed ten times against both stacks through their respective Python SDKs. ClickHouse consistently showed lower latency, especially for aggregation queries, where its columnar engine excels.

The overall query‑time comparison confirms ClickHouse’s advantage.
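A ten‑run measurement like the one described can be reproduced with a small timing harness. The sketch below is an assumption about how such a loop might look, not the script used for the article: `benchmark` is a hypothetical helper that times any zero‑argument callable, and in a real run that callable would wrap an Elasticsearch or ClickHouse client call.

```python
import statistics
import time


def benchmark(run_query, repeats=10):
    """Call `run_query` `repeats` times and summarise wall-clock latency in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": len(samples),
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


# Stand-in workload; replace the lambda with a real client call when benchmarking.
result = benchmark(lambda: sum(range(100_000)))
print(result)
```

Reporting the median alongside the minimum and maximum makes it easier to spot warm‑up effects, such as the first run being slower because caches are cold.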

Conclusion

The benchmark demonstrates that ClickHouse outperforms Elasticsearch in most basic query scenarios, with particularly strong performance for aggregations due to its columnar storage and vectorized execution. Elasticsearch offers a richer DSL and flexible schema, but for log‑analysis workloads that fit the tested patterns, ClickHouse provides faster execution.
