Key Factors to Consider When Building Your Own Data Warehouse
This article examines essential considerations for selecting and designing a data warehouse—including data volume, scalability, on‑premises versus cloud options, pricing models, and ETL/ELT approaches—to help organizations choose the most suitable solution for their needs.
When clients ask which data warehouse is best for a growing company, we recommend modern solutions such as Redshift, BigQuery, or Snowflake that provide near‑real‑time data, low cost, and minimal infrastructure maintenance.
Data Volume
Estimate the amount of data you will process. For hundreds of terabytes or petabytes, a distributed warehouse (typically non‑relational or MPP) is advisable, while relational databases with strong query optimizers work well for smaller datasets that fit on a single node.
Traditional RDBMSs (Postgres, MySQL, MSSQL) handle up to about 1 TB efficiently; beyond that performance may degrade.
Amazon Redshift, Google BigQuery, Snowflake, and Hadoop‑based solutions can support multiple petabytes.
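To see which side of these thresholds you fall on, a back‑of‑the‑envelope estimate from row counts, average row size, and retention is usually enough. A minimal sketch, with purely illustrative numbers (not figures from any real workload):

```python
# Back-of-the-envelope raw data volume estimate.
# All inputs are illustrative assumptions, not real workload figures.

def estimated_size_tb(rows_per_day: int, avg_row_bytes: int, retention_days: int) -> float:
    """Raw size of retained data in terabytes (before compression or indexes)."""
    total_bytes = rows_per_day * avg_row_bytes * retention_days
    return total_bytes / 1024**4

# e.g. 50M events/day, ~500 bytes each, kept for 3 years:
size = estimated_size_tb(50_000_000, 500, 3 * 365)
print(f"{size:.1f} TB")  # 24.9 TB -> well past the single-node RDBMS comfort zone
```

Compression and indexes shift the raw number in opposite directions, so treat the result as an order-of-magnitude guide rather than a capacity plan.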
On‑Premises vs. Cloud
Assess whether you have dedicated staff for database maintenance and support; this greatly influences the choice.
If you have dedicated resources, you have more options for self‑hosted solutions.
Without dedicated staff, modern cloud data warehouses (Redshift, BigQuery, Snowflake) are recommended because they handle deployment, scaling, replication, and encryption automatically.
Scalability
Databases scale horizontally (adding machines to a cluster) or vertically (adding CPU, memory, or storage to a single node). Redshift scales simply by adding nodes to the cluster; BigQuery provides up to 2,000 slots per project and can scale beyond that limit; Snowflake separates storage from compute so each can scale independently.
ETL vs. ELT
Snowflake, built on Amazon S3, stores data separately from compute, enabling seamless scaling for large data warehouses and allowing multiple virtual warehouses to operate on the same data concurrently while maintaining global transaction integrity. This separation is also what makes the ELT pattern practical: instead of transforming data before loading it (classic ETL), you load raw data first and transform it with SQL inside the warehouse, where compute scales on demand.
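The contrast between the two patterns can be sketched in a few lines. This is a minimal illustration using sqlite3 as a stand‑in for the warehouse; the table and column names are hypothetical:

```python
import sqlite3

raw_rows = [("2024-01-01", "  Alice "), ("2024-01-02", "BOB")]

conn = sqlite3.connect(":memory:")  # stand-in for the cloud warehouse
conn.execute("CREATE TABLE raw_users (signup_date TEXT, name TEXT)")

# ETL: transform in application code *before* loading.
etl_rows = [(d, n.strip().lower()) for d, n in raw_rows]

# ELT: load the raw data as-is, then transform with SQL inside the
# warehouse, where compute scales independently of the loading pipeline.
conn.executemany("INSERT INTO raw_users VALUES (?, ?)", raw_rows)
conn.execute("""
    CREATE TABLE users AS
    SELECT signup_date, lower(trim(name)) AS name FROM raw_users
""")

print(conn.execute("SELECT name FROM users ORDER BY name").fetchall())
# -> [('alice',), ('bob',)]
```

With ELT, the raw table remains available for reprocessing when transformation logic changes, which is harder to do once data has been transformed in flight.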
Pricing
Self‑hosted Hadoop solutions incur VM or hardware costs. Cloud warehouses offer on‑demand pricing with distinct models:
Redshift: on‑demand (hourly per node), Redshift Spectrum (pay per terabyte of data scanned in S3), and reserved instances (up to 75% savings over on‑demand).
BigQuery: charges for storage, streaming inserts, and bytes scanned by queries; loading and exporting data are free; daily cost caps and long‑term storage discounts are available.
Snowflake: on‑demand compute billed per second and storage billed per TB per month, each priced independently.
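Because on‑demand query pricing is a direct function of bytes scanned, partitioning and column pruning translate straight into cost savings. A rough sketch, assuming an illustrative $5 per TB scanned rate (actual vendor rates vary by region and change over time, so check current pricing):

```python
# Rough on-demand query cost estimate. The $5/TB rate is an illustrative
# assumption; real BigQuery / Redshift Spectrum rates vary and change.
PRICE_PER_TB_SCANNED = 5.00  # USD, assumed

def query_cost_usd(bytes_scanned: int) -> float:
    """Cost of a single on-demand query, given bytes actually scanned."""
    return bytes_scanned / 1024**4 * PRICE_PER_TB_SCANNED

full_scan = query_cost_usd(10 * 1024**4)   # SELECT * over a 10 TB table
pruned    = query_cost_usd(200 * 1024**3)  # 200 GB after partition/column pruning
print(f"full scan: ${full_scan:.2f}, pruned: ${pruned:.2f}")
# full scan: $50.00, pruned: $0.98
```

The same query can differ in cost by orders of magnitude depending on how much data the engine can skip, which is why table layout matters as much as the pricing model itself.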
Conclusion
If total data is < 1 TB, rows per table < 500 M, and a single node suffices, use an indexed RDBMS (Postgres, MySQL, MSSQL).
If data is between 1 TB and 100 TB, consider modern warehouses (Redshift, BigQuery, Snowflake) or Hadoop/Hive/Spark/Impala if you have expertise and dedicated staff.
If data exceeds 100 TB, use BigQuery, Snowflake, Redshift Spectrum, or self‑hosted Hadoop equivalents.
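The three rules above can be encoded as a simple decision helper. The thresholds and labels follow this article's rules of thumb only, not any vendor guidance:

```python
def recommend_warehouse(total_tb: float, has_dedicated_staff: bool = False) -> str:
    """Rule-of-thumb recommendation following this article's thresholds."""
    if total_tb < 1:
        return "indexed RDBMS (Postgres, MySQL, MSSQL)"
    if total_tb <= 100:
        if has_dedicated_staff:
            return "Redshift/BigQuery/Snowflake, or self-hosted Hadoop/Hive/Spark/Impala"
        return "modern cloud warehouse (Redshift, BigQuery, Snowflake)"
    return "BigQuery, Snowflake, Redshift Spectrum, or self-hosted Hadoop equivalents"

print(recommend_warehouse(0.2))  # -> indexed RDBMS (Postgres, MySQL, MSSQL)
print(recommend_warehouse(40))   # -> modern cloud warehouse (Redshift, BigQuery, Snowflake)
```

In practice the staffing question from the on‑premises vs. cloud section weighs as heavily as raw volume, which is why it appears as a parameter here.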