
Data Warehouse vs Data Lake: Definitions, Differences, and Architectural Considerations

Data warehouses store structured data centrally for reporting and analysis, while data lakes retain raw data in various formats, offering flexible, low‑cost, schema‑on‑read processing; the article explains their definitions, key differences, common misconceptions, and why many organizations now combine both to enable self‑service big‑data analytics.

Architects Research Society

May 15, 2021

Chapter 1 and Chapter 2 introduced the concept of data‑driven organizations and defined data operations in the context of a big‑data program. Now it is time to step back and explore some other fundamental concepts. One of the most important tasks at this point is to describe clearly the difference between a data warehouse and a data lake.

When I talk about self‑service data, the question inevitably arises: what is the difference between a data lake and a data warehouse? Do I need to choose one over the other or do I need both? What are the current best practices for establishing the relationship between a data warehouse and a data lake? This chapter answers these questions and more, explaining why, given the current maturity of various technologies, using a data lake to augment an existing data warehouse is the best choice.

Data Warehouse: A Basic Definition

A data warehouse is a central repository of all data collected from an organization’s business systems. Data is extracted, transformed, and loaded (ETL) into the data warehouse, which supports reporting, analytics, and data‑mining applications on the extracted and managed data set (Figure 3‑1). The previous generation of data infrastructure was warehouse‑centric, based on technologies such as Teradata, Oracle, Netezza, Greenplum, and Vertica.

Figure 3‑1. A typical data warehouse

In the past, enterprises took raw data, ran ETL on it with engines such as Informatica, and loaded the result into the data warehouse for business analysts and other users. As data volumes grew, this approach created two problems. First, analysts could not access the raw data and saw only the subset that had been extracted into the warehouse. Second, the warehouse could process only structured data, so deep‑learning applications and other analytics that rely on unstructured information were not feasible. Both problems severely limited the breadth of data and processing.
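The two limitations described here can be illustrated with a minimal sketch. It assumes an in‑memory SQLite database as a stand‑in for the warehouse; the event fields, the `sales` table, and the `etl_into_warehouse` helper are hypothetical names chosen for illustration, not part of any product mentioned in this chapter.

```python
import sqlite3

# Hypothetical raw events from a business system; field names are illustrative.
RAW_EVENTS = [
    {"user_id": 1, "amount": "19.99", "country": "US", "note": "gift wrap requested"},
    {"user_id": 2, "amount": "5.00", "country": "DE", "device": {"os": "android"}},
    {"user_id": 3, "amount": "not-a-number", "country": "FR"},
]


def etl_into_warehouse(events):
    """Extract, transform, and load events into a fixed, predefined schema."""
    conn = sqlite3.connect(":memory:")  # in-memory SQLite stands in for the warehouse
    conn.execute(
        "CREATE TABLE sales ("
        "user_id INTEGER NOT NULL, amount REAL NOT NULL, country TEXT NOT NULL)"
    )
    rejected = 0
    for event in events:
        try:
            # Transform: keep only the columns the schema defines.
            row = (int(event["user_id"]), float(event["amount"]), str(event["country"]))
        except (KeyError, ValueError):
            rejected += 1  # records that do not fit the schema never reach analysts
            continue
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)
    conn.commit()
    return conn, rejected


conn, rejected = etl_into_warehouse(RAW_EVENTS)
columns = [d[0] for d in conn.execute("SELECT * FROM sales").description]
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(columns, rejected, round(total, 2))
```

After the load, the free‑text `note` and the nested `device` field are no longer available to anyone querying the table, and the record that did not fit the schema is gone entirely. That is the subset problem and the structured‑only problem in miniature.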

In a warehouse‑centric world, if the data structures defined in the warehouse are not suitable for your analysis, or you want to analyze unstructured content, you are out of luck. You need to submit a data request to the data team, wait for them to obtain the raw data set, process it, and derive the information and structure you are interested in, then wait for the data team to load it back into the warehouse. This is a very slow process.

For all these reasons and more, having only a data warehouse to support a data‑driven enterprise in a modern data architecture is not optimal.

What is a Data Lake?

In 2010, James Dixon, co‑founder and CTO of Pentaho, first coined the term data lake. Dixon described a data lake as a repository that stores massive amounts of raw data in its original format until needed.

Data lakes address two shortcomings of data warehouses. First, data can be stored in structured, semi‑structured, or unstructured formats. Second, the schema is applied at read time rather than at load or write time. Therefore, if you need additional information or structure from the raw data, you can always change the schema, increasing organizational flexibility. This also means data is quickly available because it does not need to be processed before a processing engine can use it.
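Schema‑on‑read can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw records, the two schemas, and the `read_with_schema` helper are hypothetical, and a real lake would keep the raw data in files or object storage rather than in a Python list.

```python
import json

# Raw records kept exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    '{"user_id": 1, "amount": "19.99", "device": {"os": "ios"}}',
    '{"user_id": 2, "amount": "5.00", "note": "gift wrap requested"}',
    '{"user_id": 3, "device": {"os": "android"}}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    `schema` maps an output column name to a function that derives the
    column's value from the parsed raw record.
    """
    for line in lines:
        record = json.loads(line)
        yield {column: extract(record) for column, extract in schema.items()}


# Two different schemas over the same untouched raw data.
sales_schema = {
    "user_id": lambda r: r["user_id"],
    "amount": lambda r: float(r.get("amount", 0.0)),
}
device_schema = {
    "user_id": lambda r: r["user_id"],
    "os": lambda r: r.get("device", {}).get("os"),
}

sales = list(read_with_schema(RAW_LINES, sales_schema))
devices = list(read_with_schema(RAW_LINES, device_schema))
print(sales)
print(devices)
```

Because nothing was discarded at load time, adding the device view later required only a new schema, not a new ingestion job.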

Because data lakes are cost‑effective, there is no need to discard or archive raw data; it should always be available for any user who wants to revisit it.

All these points amount to: economical storage of everything, processing capabilities for both structured and unstructured data, rapid data availability, and flexibility—essential for building a self‑service data infrastructure (see Figure 3‑2).

Figure 3‑2. Advantages of a Data Lake

Differences Between Data Lakes and Data Warehouses

More and more enterprises are extending their data warehouses with data lakes to achieve true self‑service big‑data. There are eight basic differences between a data lake and a data warehouse.

The most important ones are:

Type of data entering

How much processing is done during ingestion

How many different types of processing can be performed on the data

Figure 3‑3 shows the main differences between data lakes and data warehouses.

Data warehouses remain popular because they are a very mature technology that has existed since the 1990s. They also integrate well with the tools business analysts and users are accustomed to when using dashboards or other mechanisms to gain insights from resident data. In fact, for some use cases data warehouses perform very well because the data is fully managed and structured, allowing fast answers to certain query patterns.

Figure 3‑3. Differences Between Data Lakes and Data Warehouses

Therefore, data warehouses continue to dominate in most organizations. However, many enterprises run into the obstacles mentioned earlier. To overcome them, they are extending their warehouses with data lakes so that their big data truly becomes self‑service. In many cases, the lake can serve as a staging area for the warehouse, which then holds the more curated data to be analyzed.

When Facebook’s Data Warehouse Ran Out of Steam

When I first joined Facebook in 2007, we had a single data warehouse managed and maintained by a central data team. This proved to limit the amount of actionable insight Facebook could derive from all the data it collected.

Although the data team performed ETL and loaded data into the warehouse, the data volume grew so quickly that, once the processed version had been loaded, the team had to discard the raw data.

By creating a Hadoop‑based data lake, Facebook could load all its data into a repository that was still centralized. Thanks to the schema‑on‑read characteristic, Facebook could then add different processing engines to the lake. This supported various types of analysis, delivered data faster, and gave all employees a flexible self‑service model, without deleting or archiving any raw data.

Is Either/Or a Possible Strategy?

After defining the differences between data lakes and data warehouses, another question arises: do you really need both, or can you carry out your self‑service data plan with just one? For now, one alone is not enough. Given the current maturity of the technologies involved, you need both.

Indeed, data warehouses are slowly accumulating many of the features first found in data lakes. However, economies of scale mean that even if you could do everything you need with a data warehouse for big data, it would become so expensive that you would soon wish you had also created a separate data lake. Moreover, because data warehouse technologies are architecturally coupled to schema‑defined storage, transforming them into a data lake is more difficult. They are also based on proprietary storage formats tightly coupled with their processing engines, whereas a data lake supports a more decoupled architecture, which is hard to achieve with traditional warehouses.

On the other hand, as Hadoop architectures mature in the open‑source world, they are also borrowing concepts popular in data warehouses, such as columnar formats and specialized processing engines for speed. In fact, over time these architectures have begun to converge, allowing a single architecture to provide the high performance, low cost, and high integration needed for modern big‑data analytics. Engines such as Presto and Impala were created specifically to address the performance issues traditionally faced by Hadoop and Hive data lakes.

Common Misunderstandings

When I talk about data lakes and data warehouses, I encounter some common misconceptions even among seasoned designers, architects, and engineers. Here are a few.

Data warehouses are dead

A common misconception is that once a data lake is in place, data warehouses are no longer needed; i.e., the data warehouse can be shut down after the lake reaches its final goal. While the industry is moving in that direction, today data lakes do not support everything that data warehouses do. Data warehouses are more mature in integration with BI, ETL, and other SQL‑based tools, whereas data lakes are still maturing in that regard.

Data warehouses will become data lakes

Others think that as data warehouses start adding data‑lake features, a full data lake is unnecessary. This is also a fallacy because of the inherent architectural coupling of warehouse technologies. Organizations that follow this path eventually realize that a separate data lake would have been more cost‑effective and flexible.

As Chapter 2 tells us, there are four types of analytics: descriptive, diagnostic, predictive, and prescriptive (see Figure 3‑4).

Figure 3‑4. Analytics Value Pyramid

Although SQL queries can cover some of these analytics, especially descriptive and diagnostic, they cannot cover all types. For companies that want to get the most out of a big‑data program, this is a serious limitation.

In a data‑lake architecture, different processing engines and different types of analytics are possible because data format considerations are separated from the processing engine.

Conversely, for a data warehouse, once data is stored in a proprietary format, it is not possible to use engines other than the proprietary SQL engine to process it.

For example, consider putting data into a Vertica data warehouse. After it is stored in Vertica’s columnar format, only Vertica’s processing engine can understand it. You cannot apply deep‑learning or machine‑learning toolkits directly to the data; you must extract it from Vertica and feed it into those libraries, which is a time‑consuming process requiring the data team’s help.

In a data lake, because the processing engine is decoupled from the data, you can choose which engine to analyze the data with. You can use Spark for machine learning, or Google’s TensorFlow on the same open‑format dataset, or apply natural‑language‑processing libraries.
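The decoupling of data from engine can be illustrated with a small sketch. It assumes a CSV file in a temporary directory as a stand‑in for an open‑format dataset in the lake, and two plain Python functions as stand‑ins for separate processing engines; the file name, column names, and function names are all hypothetical.

```python
import csv
import os
import tempfile
from collections import defaultdict

# Write a small dataset in an open format (CSV) to stand in for a file in the lake.
lake_dir = tempfile.mkdtemp()
path = os.path.join(lake_dir, "orders.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["region", "ad_spend", "revenue"])
    writer.writerows([["eu", 1, 3], ["eu", 2, 5], ["us", 3, 7], ["us", 4, 9]])


def revenue_by_region(csv_path):
    """'Engine' 1: a descriptive aggregation, the kind of query SQL handles well."""
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += float(row["revenue"])
    return dict(totals)


def fit_revenue_model(csv_path):
    """'Engine' 2: least-squares fit of revenue = slope * ad_spend + intercept."""
    with open(csv_path, newline="") as f:
        points = [(float(r["ad_spend"]), float(r["revenue"])) for r in csv.DictReader(f)]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Both "engines" read the same file directly; neither needs an export step.
totals = revenue_by_region(path)
slope, intercept = fit_revenue_model(path)
print(totals, slope, intercept)
```

Neither function owns the data or its format, so a third kind of analysis could be added later without touching the file or the existing code.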

Hard to Find Qualified Personnel

One of the biggest challenges an organization faces when building a data lake is finding qualified personnel. Your existing data team is likely very familiar with data warehouses, which have existed for a long time, are mature, and have strong integration with the data‑tool ecosystem.

But data‑lake architectures are still immature. The technology is continuously evolving, making it difficult to find specialized expertise. Typically, data‑lake projects are initiated by data‑engineering teams, which creates challenges for all use‑case owners who want to use a self‑service infrastructure.

Thus, building a data lake requires more people and expertise precisely because it is immature, and it has only been in public awareness for a little over six years. Operational tools for managing data lakes are also evolving, although they have made significant progress in recent years. Managing access control, monitoring costs, allocating billing to different teams based on usage, managing storage consumption, and other aspects have all improved. However, supporting a data lake from an operational perspective remains much more difficult for companies.
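As one small example of the operational work involved, allocating a shared bill to teams based on usage can be sketched as a proportional split. The team names, the usage unit, and the `allocate_cost` helper are assumptions made for illustration; real chargeback schemes are usually more involved.

```python
def allocate_cost(total_cost, usage_by_team):
    """Split a shared bill across teams in proportion to their measured usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage == 0:
        # No recorded usage: nobody is charged anything.
        return {team: 0.0 for team in usage_by_team}
    return {
        team: total_cost * usage / total_usage
        for team, usage in usage_by_team.items()
    }


# Hypothetical monthly bill and usage figures (for example, compute hours).
bill = allocate_cost(1200.0, {"marketing": 30, "finance": 10, "ml": 60})
print(bill)
```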

Summary

In summary, data lakes give organizations greater flexibility and agility, attributes essential for building data‑driven enterprises, at a relatively low cost compared with using only a data warehouse. In many ways, data warehouses are becoming the data marts of yesteryear. Data‑lake architectures, however, are less mature and can involve significant operational complexity. The cloud plays an important role in reducing that complexity, so every enterprise can expect to include a data‑lake architecture in its data‑infrastructure strategy. With cloud‑based data‑lake SaaS platforms, organizations can eliminate operational complexity while enjoying substantial total‑cost‑of‑ownership benefits when building a data‑lake architecture. We will discuss this in more detail in Chapter 6.

