Columnar Storage vs Row Storage: Overview, Write/Read Comparison, Pros, Cons, and Use Cases
This article explains the differences between row-based and column-based storage. It compares their write and read performance, outlines the advantages and disadvantages of each, and describes the scenarios where columnar storage fits best (OLAP queries, column families, compression, and indexing) to help you choose the appropriate storage model.
01. Overview
Currently there are two major big‑data storage approaches: row‑based storage and column‑based storage.
02. What is Columnar Storage?
Column‑based storage is the opposite of traditional row‑based storage in relational databases. The difference lies in how tables are organized.
• Row‑based storage stores a table as a sequence of rows.
• Column‑based storage stores a table as a sequence of columns.
The figure shows that in row storage the data of a whole row is kept together, whereas in column storage each column is stored separately, leading to distinct trade‑offs.
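As a minimal sketch of the two layouts, the snippet below flattens the same small table in both orders. The table contents and the function names `to_row_layout` and `to_column_layout` are made up for illustration and do not come from any particular database.

```python
# Hypothetical three-row table (id, item, price) used only for illustration.
rows = [
    (1, "apple", 3.50),
    (2, "pear", 2.25),
    (3, "plum", 4.00),
]


def to_row_layout(table):
    """Row storage: the values of one row sit next to each other."""
    return [value for row in table for value in row]


def to_column_layout(table):
    """Column storage: the values of one column sit next to each other."""
    return [value for column in zip(*table) for value in column]


row_layout = to_row_layout(rows)
column_layout = to_column_layout(rows)

print(row_layout)     # [1, 'apple', 3.5, 2, 'pear', 2.25, 3, 'plum', 4.0]
print(column_layout)  # [1, 2, 3, 'apple', 'pear', 'plum', 3.5, 2.25, 4.0]
```

Both lists hold the same nine values; only the order on disk differs, and that order is what drives the trade-offs discussed in the following sections.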
03. Write‑Side Comparison
1) Row storage writes a whole row in a single operation. Because the row lands in one contiguous location, the write either succeeds or fails as a unit, which makes data integrity easier to guarantee.
2) Column storage must split each row into its individual columns and write each one separately, so the number of write operations grows by a factor equal to the number of columns. On spinning disks this means more head movement, and each seek typically costs about 1 ms to 10 ms, so write performance is worse than with row storage.
3) Data modification follows the same pattern: row storage updates a single location, while column storage updates multiple column locations, again favoring row storage.
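The write-side argument can be sketched with two toy in-memory stores that simply count logical write operations. `RowStore` and `ColumnStore` are hypothetical classes written for this example. Real engines usually buffer and batch writes, so the counts below model the reasoning in this section, not the behavior of any specific product.

```python
class RowStore:
    """Toy row store: one logical write per inserted row."""

    def __init__(self):
        self.rows = []
        self.write_ops = 0

    def insert(self, row):
        self.rows.append(tuple(row))
        self.write_ops += 1


class ColumnStore:
    """Toy column store: one logical write per column of each inserted row."""

    def __init__(self, num_columns):
        self.columns = [[] for _ in range(num_columns)]
        self.write_ops = 0

    def insert(self, row):
        if len(row) != len(self.columns):
            raise ValueError("row width does not match the number of columns")
        for column, value in zip(self.columns, row):
            column.append(value)
            self.write_ops += 1


# Hypothetical four-column records.
records = [
    (1, "2024-01-01", "apple", 3.50),
    (2, "2024-01-01", "pear", 2.25),
]

row_store = RowStore()
column_store = ColumnStore(num_columns=4)
for record in records:
    row_store.insert(record)
    column_store.insert(record)

print(row_store.write_ops)     # 2
print(column_store.write_ops)  # 8
```

With two rows of four columns, the row store performs two writes and the column store eight, which is the "number of columns times more" factor described above.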
04. Read‑Side Comparison
1) Row storage reads an entire row even if only a few columns are needed, causing redundant data to be transferred and later filtered in memory.
2) Column storage reads only the required columns or column blocks, eliminating redundancy.
3) Because each column contains homogeneous data types, parsing is straightforward. Row storage mixes types within a row, requiring frequent type conversions that consume CPU cycles.
4) Compression and performance advantages of column storage are illustrated in the following figures.
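To make the read-side difference concrete, here is a small sketch that estimates how many bytes each layout has to scan to sum one column. The sales table, the column names, and the two `bytes_scanned_*` helpers are assumptions made for this example. It relies only on Python's standard `array` module and ignores real-world details such as page sizes and caching.

```python
from array import array

# Hypothetical sales table stored column by column as typed arrays.
columns = {
    "date": array("q", [20240101, 20240101, 20240102, 20240102]),
    "item": array("q", [1, 2, 1, 3]),
    "amount": array("d", [9.5, 3.0, 4.5, 8.0]),
}


def bytes_scanned_columnar(table, needed):
    """A column store touches only the columns named in the query."""
    return sum(len(table[name]) * table[name].itemsize for name in needed)


def bytes_scanned_row(table):
    """A row store reads every field of every row, needed or not."""
    return sum(len(column) * column.itemsize for column in table.values())


# Query: total sales amount. Only the "amount" column is required.
total_amount = sum(columns["amount"])
columnar_bytes = bytes_scanned_columnar(columns, ["amount"])
row_bytes = bytes_scanned_row(columns)

print(total_amount)                # 25.0
print(columnar_bytes < row_bytes)  # True
```

Because every value in a typed array has the same type and width, the sum runs over the column without any per-row type dispatch, which mirrors point 3) above.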
06. Advantages and Disadvantages
Both storage formats have clear pros and cons.
1) Row storage writes quickly and ensures data integrity, but reading can produce redundant data, which may affect performance on large datasets.
2) Column storage has slower writes and weaker integrity guarantees, yet it excels at read‑heavy workloads where only a subset of columns is needed, making it ideal for big‑data analytics.
The characteristics of each format dictate their appropriate use cases.
07. Suitable Scenarios for Columnar Storage
1) OLAP queries often scan millions or billions of rows but only need a few columns (e.g., date, item, sales amount). Columnar databases can read just those columns, dramatically improving query efficiency compared to row‑based systems.
2) Many columnar databases support column families (or locality groups). Storing frequently accessed columns together allows a single read to retrieve multiple columns, reducing I/O.
3) Columns with highly repetitive values compress very well; for example, Google Bigtable is reported to achieve more than 15× compression on web‑page data.
4) Bitmap indexes can be built on low‑cardinality columns (e.g., gender) to enable fast count queries and further compression.
However, if queries frequently need whole rows or involve small data volumes, columnar storage may not be appropriate.
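The bitmap index idea from point 4) can be sketched as follows, using Python integers as bit sets. The `region` column, the sample data, and the helper names are hypothetical and added only for illustration. Production systems use compressed bitmap formats instead of plain integers.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] matches."""
    index = {}
    for position, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << position)
    return index


def count_bits(bitmap):
    """Number of rows selected by a bitmap."""
    return bin(bitmap).count("1")


# Two low-cardinality columns of a hypothetical six-row table.
gender = ["F", "M", "F", "F", "M", "F"]
region = ["north", "north", "south", "north", "south", "south"]

gender_index = build_bitmap_index(gender)
region_index = build_bitmap_index(region)

# COUNT(*) WHERE gender = 'F'
female_count = count_bits(gender_index["F"])

# COUNT(*) WHERE gender = 'F' AND region = 'north' is a single bitwise AND.
female_north_count = count_bits(gender_index["F"] & region_index["north"])

print(female_count)        # 4
print(female_north_count)  # 2
```

The count queries never touch the row data itself; they only combine and count bits, which is why bitmap indexes suit columns with few distinct values.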
08. Final Summary
Key characteristics of row‑based databases:
① Data is stored row by row.
② Without indexes, queries cause massive I/O; indexes accelerate queries.
③ Building indexes and materialized views consumes significant time and resources.
④ To satisfy query demands, databases often need to be heavily scaled.
Key characteristics of columnar databases:
① Data is stored column by column, with each column kept separately.
② The data itself acts as an index.
③ Only columns involved in a query are accessed, greatly reducing I/O.
④ Each column can be processed by a separate thread, which allows a high degree of parallelism.
⑤ Uniform data types enable efficient compression algorithms (e.g., delta, prefix compression), improving storage and network bandwidth usage.
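Delta compression, mentioned in the last point, can be sketched in a few lines. The timestamp values and the function names are assumptions made for this example. Real columnar formats combine delta encoding with bit packing or a general-purpose compressor, which is not shown here.

```python
def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for previous, current in zip(values, values[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    values = []
    running = 0
    for delta in deltas:
        running += delta
        values.append(running)
    return values


# Hypothetical column of Unix timestamps, roughly one minute apart.
timestamps = [1700000000, 1700000060, 1700000120, 1700000185]

encoded = delta_encode(timestamps)
decoded = delta_decode(encoded)

print(encoded)                # [1700000000, 60, 60, 65]
print(decoded == timestamps)  # True
```

After encoding, most entries are small numbers that need far fewer bits than the original timestamps. This only works well because the column holds values of a single type that change gradually, which is the property point ⑤ relies on.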
Source: blog.csdn.nept/Xingxinxinxin/article/details/80939277
Architects' Tech Alliance