Introduction to ClickHouse: Features, Architecture, Installation, Data Types, and Cluster Deployment

This article provides a comprehensive overview of ClickHouse, an open‑source column‑oriented MPP analytical database, covering its advantages and drawbacks, key features, typical use cases, data access flow, installation steps, core directories, indexes, data types, database and table engines, as well as detailed cluster architecture and deployment patterns.

Cloud Native Technology Community

Apr 13, 2022

ClickHouse Overview

ClickHouse is an open‑source, column‑oriented MPP analytical database created by Yandex for OLAP and big‑data scenarios. It offers sub‑second query latency, a SQL‑like dialect, and strong real‑time query capabilities.

Advantages and Disadvantages

Extremely fast query speed – billions of rows per second per server.

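Efficient multi-threaded and distributed execution; the often-quoted figure of more than 2 TB/s for a single query refers to a large cluster (measured after decompression, counting only the columns read), not to one server.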

Compressed storage reduces I/O.

Powerful storage engine with full DBMS features.

High fault tolerance and availability via distributed clusters, asynchronous multi‑master replication, and no single‑point‑of‑failure design.

Disadvantages: does not support transactions and is poorly suited to row-level queries, updates, or deletions.

Key Features

Designed as an analytical (OLAP) database, not a strict relational DB.

Complete DBMS functionality: databases, tables, DDL/DML, users, permissions, backup, recovery, and distributed management.

Column‑store architecture – only required columns are read, achieving better compression and avoiding full‑table scans.

Online real‑time queries without any preprocessing.

Supports batch updates and provides a rich set of SQL functions.

High‑availability support and operates out‑of‑the‑box without a Hadoop ecosystem.
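The column-store point above can be illustrated with a toy comparison. This is only a sketch in plain Python, not how ClickHouse stores data on disk; the table contents and the column names (user_id, url, duration_ms) are made up for the example.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The table contents and column names are hypothetical.
rows = [
    {"user_id": 1, "url": "/home", "duration_ms": 120},
    {"user_id": 2, "url": "/cart", "duration_ms": 340},
    {"user_id": 1, "url": "/pay", "duration_ms": 95},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def avg_duration_row_store(table):
    """Every full row is touched even though only one field is needed."""
    values_touched = sum(len(row) for row in table)
    total = sum(row["duration_ms"] for row in table)
    return total / len(table), values_touched


def avg_duration_column_store(cols):
    """Only the single column named in the query is read."""
    col = cols["duration_ms"]
    return sum(col) / len(col), len(col)


row_avg, row_touched = avg_duration_row_store(rows)
col_avg, col_touched = avg_duration_column_store(columns)
print(row_avg, row_touched)
print(col_avg, col_touched)
```

Both functions return the same average, but the column layout touches one value per row instead of every field, which is where the I/O saving for wide analytical tables comes from.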

Typical Use Cases

ClickHouse is well suited for advertising traffic, web and app click streams, finance, e‑commerce, security logs, telecom, online gaming, IoT, and any scenario requiring massive analytical processing of event data.

Data Access Flow

Server : Provides HTTP, native TCP, and inter-server (data copy between replicas) interfaces.

Parser : Converts SQL statements into an abstract syntax tree (AST).

Interpreter : Interprets the AST and creates an execution pipeline.

IStorage : Returns raw column data according to the AST, implementing DDL and read/write methods.

Block : Core data container holding columns, data types, and column names.

Column & Field : Column provides read capability; a Field is a single value within a column.

DataType : Handles serialization/deserialization for columns and fields.

Function : Supports ordinary functions and aggregate functions (e.g., COUNT).
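The flow through the components listed above can be sketched roughly as follows. The classes here (Block, ToyStorage) and the one-statement parser are hypothetical stand-ins written for illustration; they mirror the roles described in the list, not the actual C++ interfaces.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Toy stand-in for a Block: column names, type names, and column data."""
    names: list
    types: list
    columns: list


class ToyStorage:
    """Toy stand-in for IStorage: returns only the requested columns."""

    def __init__(self, data, types):
        self.data = data    # column name -> list of values
        self.types = types  # column name -> type name

    def read(self, column_names):
        return Block(
            names=list(column_names),
            types=[self.types[c] for c in column_names],
            columns=[self.data[c] for c in column_names],
        )


def parse(sql):
    """Parser step: handles only 'SELECT a, b FROM t' and returns a dict AST."""
    select_part, table = sql.strip().rstrip(";").split(" FROM ")
    cols = [c.strip() for c in select_part[len("SELECT "):].split(",")]
    return {"select": cols, "from": table.strip()}


def interpret(ast, storages):
    """Interpreter step: turns the AST into a read against the storage."""
    return storages[ast["from"]].read(ast["select"])


storages = {
    "hits": ToyStorage(
        {"id": [1, 2], "url": ["/a", "/b"]},
        {"id": "UInt32", "url": "String"},
    )
}
block = interpret(parse("SELECT url FROM hits;"), storages)
print(block)
```

The point of the sketch is the hand-off: text becomes an AST, the AST becomes a read request, and the storage answers with a block that carries names, types, and column data together.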

Single‑Node Installation

Supported platforms: x86_64, AArch64, Power9.

# Official pre‑built binaries are compiled for x86_64 with SSE 4.2 support
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
# DEB package installation
sudo apt-get install -y apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754

echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo apt-get install -y clickhouse-server clickhouse-client
# Start the server (native TCP port 9000 and HTTP port 8123 by default)
sudo service clickhouse-server start
# Connect with the client
clickhouse-client
# Install a specific version
sudo apt-get install clickhouse-server=21.8.5.7 clickhouse-client=21.8.5.7 clickhouse-common-static=21.8.5.7
# ARM (AArch64) installation: download the packages with wget, then install them with dpkg
wget --progress=bar:force:noscroll "https://builds.dev.altinity.cloud/apt-repo/pool/main/clickhouse-client_21.8.5.7.dev_all.deb" -P /tmp/clickhouse_debs
wget --progress=bar:force:noscroll "https://builds.dev.altinity.cloud/apt-repo/pool/main/clickhouse-common-static_21.8.5.7.dev_arm64.deb" -P /tmp/clickhouse_debs
wget --progress=bar:force:noscroll "https://builds.dev.altinity.cloud/apt-repo/pool/main/clickhouse-server_21.8.5.7.dev_all.deb" -P /tmp/clickhouse_debs
sudo dpkg -i /tmp/clickhouse_debs/*.deb

Additional configuration may be required, such as increasing the file-handle limit (/etc/security/limits.d/clickhouse.conf) and setting up cron jobs (/etc/cron.d/clickhouse-server).
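Once the server is running, a quick way to confirm it answers queries is to call its HTTP interface. The sketch below assumes a default local installation listening on HTTP port 8123; the helper names query_url and run_query are made up for this example.

```python
import urllib.parse
import urllib.request


def query_url(sql, host="localhost", port=8123):
    """Build a URL for the HTTP interface (8123 is the default HTTP port)."""
    params = urllib.parse.urlencode({"query": sql}, quote_via=urllib.parse.quote)
    return f"http://{host}:{port}/?{params}"


def run_query(sql, host="localhost", port=8123, timeout=5):
    """Send the query over HTTP and return the response body as text."""
    with urllib.request.urlopen(query_url(sql, host, port), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    try:
        print(run_query("SELECT version()"))
    except OSError as exc:  # URLError is a subclass of OSError
        print(f"server not reachable: {exc}")
```

If the server is not running, or the host and port differ from the assumed defaults, the script reports that the server is not reachable instead of raising.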

Common Commands

# Basic MySQL-like commands (db_name and table_name are placeholders)
SHOW DATABASES;
SHOW TABLES;
USE db_name;
DESC table_name;
SELECT * FROM db_name.table_name;
CREATE TABLE insert_select_testtable (a Int8, b String, c Int8) ENGINE = MergeTree() ORDER BY a;
INSERT INTO insert_select_testtable (*) VALUES (1, 'a', 1);

Core Directories

Server configuration: /etc/clickhouse-server

Data directory (default): /var/lib/clickhouse

Table data: /var/lib/clickhouse/data/[database]/[table]

Metadata: /var/lib/clickhouse/metadata

Logs: /var/log/clickhouse-server (files clickhouse-server.err.log and clickhouse-server.log)

Executables: /usr/bin (clickhouse, clickhouse-client, clickhouse-server, clickhouse-compressor)

Indexes

ClickHouse provides two built-in index types: the sparse primary index and data-skipping (skip) indexes. Neither indexes individual rows; each entry describes a block of rows, so the index stays small.

Sparse Index

Built over data sorted by the primary key (by default the same as the sorting key). One mark is stored per granule, which is 8192 rows by default (the index_granularity setting), so a small number of marks covers large data intervals and the index file stays tiny and memory-resident.
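The idea of a sparse index can be sketched in a few lines of Python. This is a simplified model, not ClickHouse's on-disk format: the granule size is shrunk to 4 rows for readability, and the key values are made up.

```python
from bisect import bisect_left, bisect_right

GRANULARITY = 4  # rows per granule; ClickHouse defaults to 8192

# Rows are stored sorted by the key, as the sparse index requires.
keys = [1, 3, 3, 5, 8, 9, 12, 15, 15, 15, 20, 22, 30, 31, 40]

# One mark per granule: the key of the first row in that granule.
marks = [keys[i] for i in range(0, len(keys), GRANULARITY)]


def granules_for_range(lo, hi):
    """Return indexes of granules that may hold keys in [lo, hi]."""
    # The granule just before the first mark >= lo may still contain lo.
    first = max(bisect_left(marks, lo) - 1, 0)
    last = bisect_right(marks, hi)  # exclusive upper bound
    return list(range(first, last))


def lookup(lo, hi):
    """Scan only the candidate granules and return the matching keys."""
    found = []
    for g in granules_for_range(lo, hi):
        start = g * GRANULARITY
        found.extend(k for k in keys[start:start + GRANULARITY] if lo <= k <= hi)
    return found


print(marks)
print(granules_for_range(9, 12), lookup(9, 12))
```

Fifteen rows need only four marks here, and a range query for keys 9 to 12 reads a single granule instead of the whole column.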

Skip Index

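Built from per-block summaries of a column or expression, such as its minimum and maximum (the minmax type). Each entry summarizes a configurable number of primary-index granules (the GRANULARITY clause), so blocks whose summary cannot match the filter are skipped without being read.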
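A min/max skip index can be modeled as follows. This is a minimal sketch under simplifying assumptions: granules hold 4 rows, one index entry covers 2 granules, and the column values are invented. It shows the skipping logic only, not the real storage layout.

```python
GRANULE_ROWS = 4       # rows per granule (8192 by default in ClickHouse)
INDEX_GRANULARITY = 2  # granules summarized by one skip-index entry

# A column that is NOT the sorting key, so the sparse index cannot help.
values = [7, 3, 9, 4, 52, 60, 55, 58, 5, 8, 2, 6, 90, 95, 91, 99]

block_rows = GRANULE_ROWS * INDEX_GRANULARITY
blocks = [values[i:i + block_rows] for i in range(0, len(values), block_rows)]

# minmax skip index: one (min, max) pair per block of granules.
minmax_index = [(min(b), max(b)) for b in blocks]


def scan_greater_than(threshold):
    """Return matching values and how many blocks were actually read."""
    matches, blocks_read = [], 0
    for (_lo, hi), block in zip(minmax_index, blocks):
        if hi <= threshold:
            continue  # no value in this block can match, so skip it
        blocks_read += 1
        matches.extend(v for v in block if v > threshold)
    return matches, blocks_read


print(minmax_index)
print(scan_greater_than(80))
```

For the filter "value > 80" the first block is skipped because its maximum is 60, so only half of the data is read.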

Data Types

ClickHouse supports a wide range of types. Basic types include the integers UInt8-UInt256 and Int8-Int256, Float32/Float64, Decimal, String, UUID, Date/Date32, DateTime/DateTime64, etc. It also offers specialized types such as FixedString(N), LowCardinality(T), Array(T), Enum, Tuple, Nested, AggregateFunction, Nullable, and the IPv4/IPv6 domain types.

# Example: creating a table with LowCardinality(String)
CREATE TABLE lc_t (
    id UInt16,
    strings LowCardinality(String)
) ENGINE = MergeTree() ORDER BY id;

# Example: a complex nested array
CREATE TABLE t_arr (
    arr Array(Array(Array(UInt32)))
) ENGINE = MergeTree ORDER BY tuple();
INSERT INTO t_arr VALUES ([[[12,13,0,1],[12]]]);
SELECT arr.size0, arr.size1, arr.size2 FROM t_arr;

# Example: Enum8
CREATE TABLE t_enum (x Enum8('hello' = 1, 'world' = 2)) ENGINE = TinyLog;

# Example: Tuple
SELECT tuple(1, 'a') AS x, toTypeName(x);

# Example: Nested structure
CREATE TABLE dept (
    name String,
    people Nested(
        id UInt8,
        name String
    )
) ENGINE = Memory;
INSERT INTO dept VALUES ('R&D', [1,2,3], ['Li','Zhang','Liu']);
SELECT name, people.id, people.name FROM dept;
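LowCardinality(T), used in the first example above, is based on dictionary encoding: repeated values are stored once and rows hold small integer positions. The sketch below shows that idea in plain Python; the function names and sample data are illustrative, and the real column format is more involved.

```python
def dictionary_encode(values):
    """Replace each value with its integer position in a dictionary."""
    dictionary, positions, seen = [], [], {}
    for value in values:
        if value not in seen:
            seen[value] = len(dictionary)
            dictionary.append(value)
        positions.append(seen[value])
    return dictionary, positions


def dictionary_decode(dictionary, positions):
    """Rebuild the original values from the dictionary and positions."""
    return [dictionary[p] for p in positions]


countries = ["DE", "US", "US", "FR", "DE", "US"]
dictionary, positions = dictionary_encode(countries)
print(dictionary)
print(positions)
```

The encoding pays off when a column has few distinct values relative to its row count, which is why it is typically applied to strings such as country codes or status names.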

Database Engines

Lazy : Keeps tables in RAM only for a configurable interval (expiration_time_in_seconds) after the last access; intended for *Log tables.

Atomic (default): Provides non‑blocking DROP/RENAME, atomic EXCHANGE, and stores tables under a UUID‑based directory.

MySQL : Maps remote MySQL tables into ClickHouse for SELECT/INSERT, but does not support CREATE, ALTER, or RENAME.

Table Engines

Table engines define how data is stored, indexed, and accessed. They are grouped into four families:

Log family (TinyLog, Log, StripeLog): Simple on‑disk storage, no indexes, suitable for small or write‑once tables.

MergeTree family (MergeTree, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, etc.): Columnar storage with primary-key ordering, supports most ClickHouse features.

Special family : Engines for specific scenarios, such as Memory, File, Buffer, and Distributed.

Integration family : Engines that integrate external data sources (e.g., MySQL, Kafka).

Cluster Architecture

ClickHouse clusters provide high availability and load balancing. A cluster consists of multiple nodes, each hosting local tables (shards) and a distributed table that acts as a logical view.

Partitions

Partitions split a table horizontally, by the value of a partition key, into separate directories; data parts within the same partition are merged together, while parts in different partitions are never merged.
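The partition rule can be sketched as follows, assuming a monthly partition key in the style of toYYYYMM(event_date). The part contents are made up, and real merges also re-sort data, which this toy model ignores.

```python
from collections import defaultdict
from datetime import date


def partition_id(day):
    """Mimic a monthly partition key such as toYYYYMM(event_date)."""
    return day.year * 100 + day.month


# Each insert produces a new data part inside the partition of its rows.
parts = [
    (partition_id(date(2022, 3, 30)), ["a", "b"]),
    (partition_id(date(2022, 4, 1)), ["c"]),
    (partition_id(date(2022, 4, 13)), ["d", "e"]),
]


def merge_parts(all_parts):
    """Merge parts that share a partition; never merge across partitions."""
    merged = defaultdict(list)
    for pid, rows in all_parts:
        merged[pid].extend(rows)
    return dict(merged)


partitions = merge_parts(parts)
print(partitions)
```

Because partitions stay independent, a whole month can be dropped or detached by removing one directory, without touching the others.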

Distributed Clusters

Local tables store the actual data (one shard per node). Distributed tables do not store data; they forward queries to all local tables, aggregate results, and return them to the client. Data is sharded using a key (e.g., rand()).
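The routing and scatter-gather behaviour described above might look like this in miniature. It is a sketch only: three in-memory lists stand in for local tables, and a stable CRC32 hash of a hypothetical user_id column is used as the sharding key instead of rand() so that the same user always lands on the same shard.

```python
import zlib

NUM_SHARDS = 3
shards = [[] for _ in range(NUM_SHARDS)]  # each list plays a local table


def shard_for(user_id):
    """Pick a shard from the sharding key (a stable hash, not rand())."""
    return zlib.crc32(str(user_id).encode()) % NUM_SHARDS


def distributed_insert(row):
    """The distributed table stores nothing; it routes the row to a shard."""
    shards[shard_for(row["user_id"])].append(row)


def distributed_count_by_user():
    """Send the query to every shard, then merge the partial results."""
    totals = {}
    for local_table in shards:
        partial = {}
        for row in local_table:  # this part would run on the shard
            partial[row["user_id"]] = partial.get(row["user_id"], 0) + 1
        for user, count in partial.items():  # merged on the initiating node
            totals[user] = totals.get(user, 0) + count
    return totals


for uid in [1, 2, 3, 1, 2, 1]:
    distributed_insert({"user_id": uid})
print(distributed_count_by_user())
```

Each shard computes a partial aggregate over its own rows, and the node that received the query only merges those partial results, which keeps the data movement small.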

Replication

Each shard can have one or more replicas. Replicas are synchronized via the ReplicatedMergeTree engine and ZooKeeper, ensuring data safety and automatic recovery when a node fails.
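How a failed replica catches up can be sketched with a shared log. This is a deliberately simplified model: a Python list stands in for the replication log coordinated through ZooKeeper, and the Replica class and part names are hypothetical. In the real system the log holds metadata and replicas fetch the data parts from each other.

```python
class Replica:
    def __init__(self, name, log):
        self.name = name
        self.log = log      # shared list, stand-in for the ZooKeeper log
        self.parts = []     # data parts this replica holds
        self.position = 0   # next log entry to apply
        self.online = True

    def insert(self, part):
        """Any online replica can accept a write (multi-master)."""
        self.log.append(part)
        self.sync()

    def sync(self):
        """Apply every log entry not seen yet, in order."""
        if not self.online:
            return
        while self.position < len(self.log):
            self.parts.append(self.log[self.position])
            self.position += 1


log = []
r1, r2 = Replica("r1", log), Replica("r2", log)

r1.insert("part_1")
r2.sync()
r2.online = False   # r2 goes down
r1.insert("part_2")
r1.insert("part_3")
r2.online = True    # r2 comes back and catches up from the log
r2.sync()
print(r1.parts == r2.parts)
```

Because every replica tracks its own position in the log, a node that was offline only needs to replay the entries it missed to converge with the others.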

Common Deployment Patterns

Sharding only : Simple Distributed + MergeTree setup. Fast queries but no fault tolerance; a single node failure leads to data loss.

Sharding + Replication : Distributed + MergeTree, with the Distributed table writing each insert to every replica of a shard. Improves availability, but the replicas are not reconciled with each other, so they can drift apart after a failed write, and a shard is still lost if all of its replicas fail.

Sharding + Replication + HA (recommended for production): Uses ReplicatedMergeTree + Distributed + ZooKeeper. Provides automatic replica synchronization, high availability, and consistent reads/writes.

Production recommendations include at least two replicas per shard, external load balancing, 100+ concurrent connections, batch writes rather than tiny inserts, avoiding multiple ClickHouse instances on a single host, and a 5-node ZooKeeper ensemble separate from the ClickHouse nodes.
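The advice to batch writes can be applied on the client side with a small buffer. The sketch below is one possible shape, not a ClickHouse client API: BatchWriter is a hypothetical helper, and the flush callback is whatever function actually sends a list of rows to the server in one INSERT.

```python
class BatchWriter:
    """Buffer rows and hand them to a flush callback in large batches."""

    def __init__(self, flush, batch_size=1000):
        self.flush_fn = flush          # called with a list of rows
        self.batch_size = batch_size
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered; call once more before shutdown."""
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []


sent_batches = []  # stand-in for the real INSERT call
writer = BatchWriter(flush=sent_batches.append, batch_size=3)
for i in range(7):
    writer.write({"id": i})
writer.flush()
print([len(batch) for batch in sent_batches])
```

Seven single-row writes become three inserts here. In practice the batch size would be in the thousands of rows or more, often combined with a time-based flush so that slow producers do not hold data back indefinitely.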
