Development Trends and Challenges of Large‑Scale Parallel Databases
Since the 1970s, databases have been essential middleware. Modern large‑scale parallel databases, built for extreme parallelism on clustered hardware, face trade‑offs among performance, scalability, and fault tolerance, and those trade‑offs are driving a shift toward cloud‑native, micro‑service architectures and toward new hardware such as SSDs and memory‑centric designs.
Since the 1970s, databases have been the most important middleware in computer software: they separate data‑operation primitives from application code and let the DBMS manage data structures, access paths, and consistency, which greatly improves both development and runtime efficiency. In the 21st century, with advances in software, hardware, and the Internet, database technology has entered a flourishing new stage. This article focuses on one niche: the development trends of large‑scale parallel databases.
Definition and Architectural Characteristics of Parallel Databases
According to Wikipedia, a parallel database parallelizes operations such as data loading, index building, and query execution by using multiple CPUs and disks simultaneously to improve performance; the key word is parallelism.
When building large‑scale clusters, two characteristics are usually considered: parallelism and distribution. Parallelism emphasizes multiple nodes working together on a single large problem under strict performance and latency requirements, whereas distribution emphasizes data or computation spread across nodes with transparency. Because their goals differ, parallel databases aim for extreme parallelism (even a single‑row query is dispatched to all nodes), while systems like HDFS distribute file blocks without requiring every node to participate.
Parallel databases are designed for specific workloads and therefore have a limited range of applicability. Built on relational theory, they suit structured data, complex multi‑table joins, pipelined execution, and analytical queries, but they are not optimal for simple key‑value access, where NoSQL solutions excel. Loading data is labor‑intensive, so they are best used for write‑once, read‑many workloads.
The main issues of current parallel databases stem from their design goal of perfect parallelism, which tightly couples computation and storage. This tight coupling leads to poor robustness (a single node failure can severely degrade performance or halt the whole cluster) and limited scalability (requiring equally powerful nodes and costly data re‑hashing).
Key Technical Points of Parallel Databases
Parallel databases consist mainly of an execution engine, a storage engine, and management modules, each influencing product characteristics.
Because they run on large clusters, they must define node roles. Master nodes handle cluster entry, metadata, SQL parsing, plan generation, scheduling, and two‑phase commit. Two approaches exist: a dedicated master, or no dedicated master.
Parallel databases derived from open‑source PostgreSQL usually have a dedicated master, which keeps the data nodes uniform and simple but introduces a master‑backup failover problem and limits dynamic scaling.
Parallel databases designed from scratch (e.g., Gbase 8a, Vertica) often have no dedicated master; whichever node a client connects to takes on the master role. This offers better scalability and high availability, but the master workload can become a bottleneck and slow the data‑node work running on the same machine.
Both approaches have trade‑offs, but in large clusters the master‑less (or multi‑master) architecture shows clearer advantages, and I believe multi‑master will be the future direction.
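Among the master duties listed above, two‑phase commit is what keeps a distributed write atomic across data nodes: the master commits only if every node votes yes in the prepare phase. A minimal in‑memory sketch of the protocol (the `DataNode` class here is a hypothetical stand‑in, not any product's real API):

```python
# Minimal two-phase commit sketch: the master asks every data node to
# prepare (phase 1); it commits everywhere only if all vote yes,
# otherwise it aborts everywhere. DataNode is a hypothetical stand-in.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.staged = None      # write staged during prepare
        self.committed = []     # durably "committed" rows

    def prepare(self, rows):
        self.staged = rows
        return True             # vote yes (a real node may vote no)

    def commit(self):
        self.committed.extend(self.staged)
        self.staged = None

    def abort(self):
        self.staged = None      # discard the staged write

def two_phase_commit(batch, nodes):
    votes = [n.prepare(batch) for n in nodes]   # phase 1
    if all(votes):
        for n in nodes:                          # phase 2a
            n.commit()
        return "committed"
    for n in nodes:                              # phase 2b
        n.abort()
    return "aborted"

nodes = [DataNode(f"dn{i}") for i in range(3)]
print(two_phase_commit([("k1", "v1")], nodes))   # committed
```

A real implementation must additionally survive master and node crashes between the two phases (via logging), which is exactly why the master role is so central to cluster design.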
In the storage engine, data distribution is critical. Hash‑based row distribution is a hallmark of parallel databases, providing precise control over data placement and rich statistics for query optimization.
This tight, non‑transparent coupling yields great benefits (efficient joins on co‑located tables) but also drawbacks (limited scalability and weaker high availability). Some SQL‑on‑Hadoop solutions borrow the idea (e.g., HDFS colocation, Pivotal HAWQ, Vertica VIVE). Solutions that do not manage hash distribution struggle with large‑table joins and resort to pre‑joins or shuffle mechanisms (e.g., Hive, Spark SQL).
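The co‑location benefit is easy to see in miniature: if two tables are hash‑sharded on the join key with the same function and node count, every joining pair of rows already lives on the same node, so the join runs node‑locally with no data movement. A toy sketch (illustrative only; table contents and node count are invented):

```python
# Toy hash distribution: rows of two tables are sharded to N nodes by
# hashing the join key. Matching rows always land on the same node,
# so each node can join its own shards with no shuffle.

N_NODES = 4

def shard(rows, key_index):
    nodes = [[] for _ in range(N_NODES)]
    for row in rows:
        nodes[hash(row[key_index]) % N_NODES].append(row)
    return nodes

orders = [(1, "o-100"), (2, "o-101"), (3, "o-102")]   # (cust_id, order_id)
custs  = [(1, "alice"), (2, "bob"),   (3, "carol")]   # (cust_id, name)

order_nodes = shard(orders, 0)
cust_nodes  = shard(custs, 0)

# Co-located join: every node joins only its local shards.
result = []
for o_shard, c_shard in zip(order_nodes, cust_nodes):
    names = {cid: name for cid, name in c_shard}
    result += [(oid, names[cid]) for cid, oid in o_shard if cid in names]

print(sorted(result))   # [('o-100', 'alice'), ('o-101', 'bob'), ('o-102', 'carol')]
```

A system without this placement control must either broadcast one table to all nodes or shuffle both tables by the join key at query time, which is the pre‑join/shuffle cost mentioned above.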
Storage devices have also evolved. Today most parallel databases use SAS disks, while HDFS uses cheaper SATA disks; the slower SATA disks become the biggest bottleneck, making it impossible to achieve efficiency, scalability, and high availability at once. As SATA‑interface SSDs replace SAS disks, with SSD cost falling and performance rising, parallel databases will undergo another major transformation.
Many high‑end parallel database appliances can already run on all‑SSD configurations without code changes, but to fully exploit SSD characteristics a redesign is needed because existing systems were built for rotating disks and contain complex mechanisms to turn random I/O into sequential I/O.
Looking ahead, memory‑centric architectures will dominate. Products like Rapids DB and SAP HANA store data directly in memory, using SSDs only for snapshots and logs, which resolves the "fish or bear's paw" dilemma of parallel databases (the Chinese idiom for two desirable things that cannot be had at once). Although not yet universal, hardware advances will continue to drive software redesigns.
Using parallel databases requires awareness that a single node failure can cause a performance drop far beyond expectations, potentially leading to an “avalanche” where the whole cluster becomes overloaded and crashes.
One test result illustrates this: in a 24‑node cluster, the performance degradation when a single node fails varies widely across products, for both read and write workloads.
Figure 1: Test results of 24‑node failure scenario
Note that this is not a fully loaded (CPU‑ or I/O‑bound) scenario, and the degradation is not proportional to the lost capacity. Larger clusters (100 or 1,000 nodes) face the same issue, and, unlike in Hadoop, adding nodes does not solve the problem.
The root cause is hash distribution: it provides extreme parallelism (the "fish") but destroys the transparency between storage and execution (the "bear's paw"), so a failed node's tasks cannot be offloaded to others.
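The avalanche math is simple: because every node holds an exclusive slice of the data, a query finishes only when its slowest node finishes, so latency is the maximum over per‑node latencies and one degraded node sets the pace for the whole cluster. A back‑of‑the‑envelope sketch (the numbers are illustrative, not taken from the test above):

```python
# With hash-sharded storage and no replica serving reads, query latency
# is the max over per-node latencies: one slow node drags down every
# query, no matter how many healthy nodes remain.

def query_latency(node_latencies_ms):
    return max(node_latencies_ms)

healthy = [100.0] * 24                     # 24 nodes, 100 ms each
print(query_latency(healthy))              # 100.0

degraded = [100.0] * 23 + [1500.0]         # one node limping at 1.5 s
print(query_latency(degraded))             # 1500.0 -> 15x worse overall

bigger = [100.0] * 99 + [1500.0]           # adding healthy nodes: no help
print(query_latency(bigger))               # still 1500.0
```

Under full load this compounds: queries queue up behind the slow node, raising its load further, which is the avalanche the text describes.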
Linear scalability is also limited: the largest MPP production cluster in the world reportedly has only about 300 nodes, and many operators prefer fewer, more powerful "fat" nodes. Data movement during scale‑out remains costly.
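The rehashing cost is easy to quantify. Assuming the simplest placement rule, `hash(key) % N`, growing a cluster from N to N+1 nodes changes the placement of almost every row, which is why scale‑out events are so expensive:

```python
# With modulo hash placement, hash(k) % N and hash(k) % (N+1) rarely
# agree, so nearly every row must move when the cluster grows.

def moved_fraction(n_keys, old_nodes, new_nodes):
    moved = sum(1 for k in range(n_keys)
                if hash(k) % old_nodes != hash(k) % new_nodes)
    return moved / n_keys

print(moved_fraction(100_000, 16, 17))   # ~0.94: nearly every row moves
```

Schemes such as consistent hashing reduce the moved fraction to roughly 1/(N+1), but the classic MPP systems described here hash‑place rows directly, so rebalancing stays proportional to the full data volume.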
In summary, three typical parallel‑database architecture styles can be observed:
On the left, represented by Gbase 8a, Vertica, and Greenplum, are classic MPP databases. Data is hash‑sharded, the storage and execution engines are tightly coupled, data is fully localized, full SQL is supported, and cost‑based optimization is used.
On the right, represented by Impala, is the typical SQL‑on‑HDFS model. The storage engine (HDFS) is completely transparent to the query engine, which mainly supports read‑only queries and lacks detailed statistics, relying on rule‑based optimization.
Figure 2: Three typical parallel‑database architecture styles
Between them lies a hybrid, which I call "MPP over HDFS". Examples include Pivotal HAWQ (derived from Greenplum) and Vertica VIVE. These systems still write data through their own storage engine, performing hash distribution and maintaining their own file and compression formats, while leveraging HDFS data locality. The hybrid offers intermediate performance and scalability.
For example, HAWQ moves Greenplum's storage from local disks to HDFS via a custom HDFS interface (gphdfs), while Vertica's VIVE uses the webhdfs interface. Their performance is typically about twice that of pure HDFS‑based solutions.
Future Outlook of Parallel Databases
How can we fully exploit parallel‑database strengths while avoiding their drawbacks? The new generation of cloud‑computing + micro‑service architectures, together with evolving hardware, offers a promising direction.
1. Cloud‑service model enables larger scale and better economics
In practice, cloud computing has quietly reshaped IT. Databases delivered as cloud services (private or public) such as AWS Redshift and OpenStack Trove (DBaaS) must support ever‑larger clusters, increasing technical difficulty but improving cost‑effectiveness and allowing specialized operations teams to leverage new technology benefits.
The 1970s introduced databases as middleware, enabling division of labor and efficiency. Today, cloud services further separate responsibilities: applications rent database services instead of owning dedicated clusters, allowing fine‑grained optimization and rapid problem resolution.
Alibaba Cloud emphasizes cloud‑native databases like OceanBase and ADS, providing multi‑tenant, load‑balancing, resource isolation, and quota management—features that were once peripheral but are now standard.
2. Component granularity refined through micro‑services
Software modules of parallel databases will be split into finer components. Beyond master and data nodes, future systems may have access nodes, coordinator nodes, metadata nodes, log nodes, security nodes, SQL parsing/optimizing nodes, load/export nodes, and pure data nodes, each deployable as independent containers.
These components can be packaged as Docker containers and orchestrated by Kubernetes or Mesos, providing high availability, loose coupling, and DevOps‑friendly rapid iteration.
Data‑node granularity can also be refined. Currently, a data node may manage terabytes of data, making it hard to treat as a micro‑service and requiring equally powerful servers. If a data node only managed a few megabytes, it could reside entirely in memory, be migrated easily, and recover quickly after failure.
Docker’s container model is ideal for such tiny data‑node services: an image can be pulled quickly, a container started, and the small data slice can be fetched from peers or persistent storage, with logs sent to a dedicated log server. After initialization, the node registers with service discovery and begins serving requests.
This micro‑data‑node design enables in‑memory caching and efficient resource management.
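The startup sequence described above can be sketched as a plain bootstrap routine. Everything here is a hypothetical stand‑in: `PEER_SLICES` plays the role of peer nodes or persistent storage, and the `REGISTRY` dict stands in for an etcd/Consul‑style service‑discovery endpoint:

```python
# Hypothetical bootstrap of a micro data node: recover a small data
# slice (from a peer or persistent storage), register with service
# discovery, then serve. All external systems are in-memory stand-ins.

REGISTRY = {}                                   # service-discovery stand-in
PEER_SLICES = {"slice-42": [("k1", "v1"), ("k2", "v2")]}  # peer stand-in

class MicroDataNode:
    def __init__(self, node_id, slice_id):
        self.node_id = node_id
        self.slice_id = slice_id
        self.data = {}

    def fetch_slice(self):
        # A slice of a few MB fits in memory: recovery is a bulk copy,
        # not a long disk rebuild.
        self.data = dict(PEER_SLICES[self.slice_id])

    def register(self):
        REGISTRY[self.slice_id] = self.node_id   # announce readiness

    def serve(self, key):
        return self.data.get(key)

node = MicroDataNode("dn-7", "slice-42")
node.fetch_slice()
node.register()
print(node.serve("k1"))                          # v1
```

Because the slice is tiny, killing and rescheduling such a node (as Kubernetes or Mesos would) costs seconds rather than the hours a terabyte‑scale data node needs to recover.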
3. Fully leverage new hardware
Next‑generation architectures can be optimized for emerging hardware such as 3D‑stacked memory, flash cards, hardware compression cards, and InfiniBand. Targeted development can exploit their performance advantages and extend hardware lifecycles.
Figure 3: Conceptual diagram of next‑generation architecture
Key Takeaways:
Parallel databases have specific applicable scenarios.
Every database component needs to be delivered as a cloud service to achieve professional division of labor.
Micro‑service granularity for both components and data is a core characteristic of the next generation.
Hardware evolution will dramatically reshape software architectures.