MariaDB ColumnStore: Future‑Ready OLAP Architecture, Limits & Migration Tips
This article shares practical experience deploying MariaDB ColumnStore for offline OLAP, covering its MPP‑based architecture, user‑experience benefits, high‑availability options, numerous syntax and data‑type limitations, and detailed scripts for schema conversion and incremental data extraction using pt‑archiver.
Background and Motivation
This article documents production‑grade usage of MariaDB ColumnStore for offline OLAP analysis, focusing on the reasons for selecting a Massively Parallel Processing (MPP) architecture over MapReduce‑style solutions.
OLAP Technology Streams
MapReduce – batch‑oriented processing exemplified by the Hadoop ecosystem.
MPP – column‑store systems such as Infobright, MariaDB ColumnStore (derived from InfiniDB), and Greenplum that store data column‑wise and execute queries in parallel across many nodes.
Why Choose MPP / ColumnStore
Supports the standard MySQL protocol, allowing existing MySQL clients (e.g., Navicat, SQLyog, WebSQL) to connect without learning a new language.
DBAs do not need to create indexes or rewrite complex SQL; the engine handles joins, aggregations, stored procedures, and UDFs automatically.
Optimized for large‑scale data warehousing and fast analytical queries while remaining easy to maintain.
Architecture Overview
UM (User Module) – receives client connections via the MySQL protocol, forwards SQL requests to PM modules, aggregates results, and returns the final result set.
PM (Performance Module) – stores data in a distributed columnar format, performs parallel scans, hash joins, and aggregations, and extracts data into memory for computation.
UM Processes
mysqld : connection authentication, SQL parsing, and plan generation.
ExeMgr : converts the query tree into a job list (filter, join, aggregation, sort).
DMLProc / DDLProc : dispatches DML/DDL to the appropriate PM; cpimport distributes source files to PM nodes.
PM Processes
PrimProc : executes queries, retrieves data blocks, and returns results to UM.
WriteEngineServer : coordinates DML, DDL, and bulk‑load operations on each PM.
ProcMgr : starts, monitors, and restarts all ColumnStore processes.
ProcMon : heartbeat monitoring between nodes.
High‑Availability Modes
External SAN fiber storage – DBRoot is mounted on a shared SAN; if a PM fails, the DBRoot can be remounted on a healthy node for automatic failover.
Stand‑by master‑slave – two identical MPP clusters ingest data in parallel; if one cluster fails, the other continues serving queries.
Observed Limitations (MariaDB 1.0 series, partially resolved in 1.1)
Unsupported CHARACTER SET and COLLATE clauses.
TEXT/BLOB types not supported before 1.1. timestamp columns cannot use DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP.
Maximum DECIMAL precision is 18. ROW_FORMAT=COMPACT not supported.
Maximum VARCHAR length is 8000. BIT type not supported.
Comments cannot contain single quotes.
Row size limit of 65535 bytes (UTF‑8 characters count as three bytes).
Reserved keywords (e.g., user, comment, match, key, update, status) cannot be used as identifiers.
Workarounds and Migration Scripts
A convert_schema.sh script automates schema conversion:
Export original table definitions.
Transform them by removing unsupported clauses and adjusting data types.
Apply the transformed DDL to ColumnStore.
# bash convert_schema.sh source_schema.sql columnstore_schema.sqlData extraction was performed with a modified Percona pt‑archiver workflow:
Full load uses SELECT * FROM table WHERE id>=1 to dump all rows.
Incremental load relies on an update_time column defined as
timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMPto fetch rows newer than the last extracted timestamp.
Because ColumnStore does not support REPLACE INTO or LOAD DATA … REPLACE, duplicate rows must be de‑duplicated after import.
Key modifications to the original pt‑archiver script:
Rename DELETE to DELETE111 to avoid accidental full deletions.
Append a suffix (e.g., desc1) to columns that clash with reserved keywords.
Change the field delimiter from a comma to an ampersand ( &) to avoid conflicts with data containing commas.
Add FIELDS TERMINATED BY for bulk LOAD DATA imports.
Remove FORCE INDEX and explicit ORDER BY so the optimizer can choose the best index (typically the update_time index for incremental extracts).
Performance Observations
DML operations (INSERT/UPDATE/DELETE) are slower in ColumnStore because the engine is optimized for bulk modifications rather than row‑level changes. Rewriting query plans to rely on the update_time index rather than forcing primary‑key scans yields faster incremental extracts.
References
ColumnStore naming conventions: https://mariadb.com/kb/en/library/columnstore-naming-conventions/
JIRA issues documenting limitations and bugs:
MCOL‑1022: https://jira.mariadb.org/browse/MCOL-1022
MCOL‑1036 (ORDER BY not in DISTINCT): https://jira.mariadb.org/browse/MCOL-1036
MCOL‑1056 (Memory overflow on large TEXT queries): https://jira.mariadb.org/browse/MCOL-1056
MCOL‑1080 (REPLACE INTO not supported): https://jira.mariadb.org/browse/MCOL-1080
Data ingestion guide: https://mariadb.com/kb/en/library/columnstore-data-ingestion/
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
