
Key Factors to Consider When Building Your Own Data Warehouse

This article examines essential considerations for selecting and designing a data warehouse—including data volume, scalability, on‑premises versus cloud options, pricing models, and ETL/ELT approaches—to help organizations choose the most suitable solution for their needs.

Architects Research Society

When clients ask which data warehouse is best for a growing company, we recommend modern solutions such as Redshift, BigQuery, or Snowflake that provide near‑real‑time data, low cost, and minimal infrastructure maintenance.

Data Volume

Estimate the amount of data you will need to process. For hundreds of terabytes or petabytes, a distributed solution (an MPP warehouse or a non-relational store) is advisable; for smaller datasets that fit on a single node, a relational database with a strong query optimizer works well.
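A rough way to make that estimate is to project storage from ingest rate and retention. The function and all the figures below are an illustrative sketch, not numbers from this article:

```python
def projected_size_tb(rows_per_day: int, avg_row_bytes: int,
                      retention_days: int, replication_factor: float = 1.0) -> float:
    """Rough upper bound on stored volume in terabytes.

    Every parameter is an assumption you supply; replication_factor
    accounts for copies kept by the storage engine.
    """
    total_bytes = rows_per_day * avg_row_bytes * retention_days * replication_factor
    return total_bytes / 1024 ** 4  # bytes -> TiB

# e.g. 50M rows/day at ~500 bytes/row, kept for two years: ~16.6 TB
size = projected_size_tb(50_000_000, 500, 730)
```

If the projection lands well under a terabyte even after a few years of growth, a single-node RDBMS is usually the simpler choice.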


Traditional RDBMSs (Postgres, MySQL, MSSQL) handle up to about 1 TB efficiently; beyond that performance may degrade.

Amazon Redshift, Google BigQuery, Snowflake, and Hadoop‑based solutions can support multiple petabytes.

On‑Premises vs. Cloud

Assess whether you have dedicated staff for database maintenance and support; this greatly influences the choice.

If you have dedicated resources, you have more options for self‑hosted solutions.

Without dedicated staff, modern cloud data warehouses (Redshift, BigQuery, Snowflake) are recommended because they handle deployment, scaling, replication, and encryption automatically.

Scalability

Check how a database scales: horizontally (adding machines) or vertically (adding resources to a single node). Redshift scales simply by adding nodes; BigQuery allocates compute in slots, with a default cap of 2,000 that multi-tenant scaling can exceed; Snowflake separates storage from compute so each can scale independently.

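The difference between the two scaling styles can be sketched with a toy capacity model; the core counts and the hardware ceiling below are made-up illustrations:

```python
def vertical_capacity(base_cores: int, upgrades: int, max_cores: int = 128) -> int:
    """Vertical scaling: keep doubling one node until the hardware ceiling."""
    return min(base_cores * 2 ** upgrades, max_cores)

def horizontal_capacity(cores_per_node: int, nodes: int) -> int:
    """Horizontal scaling: add identical nodes; no single-machine ceiling."""
    return cores_per_node * nodes

# Vertical scaling stalls at the ceiling; horizontal keeps growing:
# vertical_capacity(8, 10) -> 128, horizontal_capacity(8, 100) -> 800
```

This is why the large warehouses above are all built around horizontal scaling: a single machine has a hard upper bound, while a cluster does not.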

ETL vs. ELT

With ELT, raw data is loaded into the warehouse first and transformed there, using the warehouse's own compute, rather than being transformed in a separate ETL layer beforehand. Snowflake, built on Amazon S3, stores data separately from compute, so transformation workloads can run on dedicated virtual warehouses that operate concurrently while maintaining global transactional integrity, which makes it a natural fit for the ELT pattern.
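The ELT pattern itself is simple to sketch: load raw rows first, then transform them with SQL inside the engine. The example below uses Python's built-in sqlite3 purely as a stand-in for a warehouse; the table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# L: load raw data as-is, with no upfront transformation.
conn.execute("CREATE TABLE raw_events (user_id INTEGER, amount_cents INTEGER)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?)",
                 [(1, 250), (1, 750), (2, 100)])

# T: transform inside the engine, where elastic compute does the heavy lifting.
conn.execute("""
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(amount_cents) / 100.0 AS total_dollars
    FROM raw_events
    GROUP BY user_id
""")
totals = dict(conn.execute("SELECT user_id, total_dollars FROM user_totals"))
# totals == {1: 10.0, 2: 1.0}
```

In a real deployment the load step would be a bulk copy into the warehouse and the transform step a scheduled SQL job, but the ordering is the same.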

Pricing

Self‑hosted Hadoop solutions incur VM or hardware costs. Cloud warehouses offer on‑demand pricing with distinct models:

Redshift: on-demand (hourly per node), Redshift Spectrum (pay per byte scanned), and reserved instances (up to 75% savings).

BigQuery: charges for storage, streaming inserts, and bytes scanned by queries; loading and exporting data are free; offers daily cost caps and discounted long-term storage pricing.

Snowflake: compute billed per second and storage billed per TB per month, each metered separately.
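To compare these models for your own workload, the two dominant pricing shapes can be sketched as simple cost functions; the rates used below are placeholders, not current vendor prices:

```python
def per_node_hourly_cost(node_hourly_usd: float, nodes: int, hours: float) -> float:
    """Redshift-style on-demand: pay per node-hour whether or not queries run."""
    return node_hourly_usd * nodes * hours

def per_scan_cost(usd_per_tb_scanned: float, tb_scanned: float) -> float:
    """BigQuery-style: pay for the bytes your queries scan."""
    return usd_per_tb_scanned * tb_scanned

# A steadily busy cluster favors node pricing; sporadic queries favor scan pricing.
steady = per_node_hourly_cost(0.25, 4, 730)   # 4 nodes for a month -> 730.0
sporadic = per_scan_cost(5.0, 3.0)            # 3 TB scanned -> 15.0
```

The crossover depends entirely on query volume, so it is worth plugging in your own numbers from each vendor's current price list before deciding.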

Conclusion

If total data is < 1 TB, rows per table < 500 M, and a single node suffices, use an indexed RDBMS (Postgres, MySQL, MSSQL).

If data is between 1 TB and 100 TB, consider modern warehouses (Redshift, BigQuery, Snowflake) or Hadoop/Hive/Spark/Impala if you have expertise and dedicated staff.

If data exceeds 100 TB, use BigQuery, Snowflake, Redshift Spectrum, or self‑hosted Hadoop equivalents.
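These three rules can be captured as one decision function. The thresholds come straight from the conclusion above and should be treated as rules of thumb rather than hard limits:

```python
def recommend_warehouse(total_tb: float, max_rows_per_table: int) -> str:
    """Map data size to the recommended class of solution."""
    if total_tb < 1 and max_rows_per_table < 500_000_000:
        return "indexed RDBMS (Postgres, MySQL, MSSQL)"
    if total_tb <= 100:
        return "modern warehouse (Redshift, BigQuery, Snowflake) or Hadoop/Hive/Spark/Impala"
    return "BigQuery, Snowflake, Redshift Spectrum, or self-hosted Hadoop equivalents"
```

Note that a dataset under 1 TB with a table exceeding 500 M rows falls through to the middle branch, since a single indexed node would struggle with it.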

Tags: Big Data, Scalability, Data Warehouse, Cloud, Pricing
Written by

Architects Research Society

A daily treasure trove for architects, expanding your view and depth. We share enterprise, business, application, data, technology, and security architecture, discuss frameworks, planning, governance, standards, and implementation, and explore emerging styles such as microservices, event‑driven, micro‑frontend, big data, data warehousing, IoT, and AI architecture.
