Key Factors to Consider When Building Your Own Data Warehouse
This article examines the essential considerations for selecting a modern data warehouse—including data volume, staffing, scalability, and pricing models—while comparing on‑premise and cloud solutions such as Redshift, BigQuery, and Snowflake to help organizations make informed decisions.
When clients ask which data warehouse is best for a growing company, we evaluate their specific needs: near‑real‑time data, low cost, and minimal infrastructure maintenance. Modern cloud warehouses like Amazon Redshift, Google BigQuery, and Snowflake often fit these requirements.
Factors to Evaluate
Data volume
Dedicated personnel for support and maintenance
Scalability (horizontal vs. vertical)
Pricing model
Data Volume
Estimate the amount of data you will process. For datasets in the hundreds of terabytes or petabytes, a non‑relational database designed for massive scale is recommended. Relational databases with strong query optimizers work well for smaller datasets that fit on a single node.
Loading TB‑scale Data
Traditional RDBMSs (Postgres, MySQL, MSSQL) handle up to ~1 TB efficiently; beyond that performance may degrade.
Amazon Redshift, Google BigQuery, Snowflake, and Hadoop‑based solutions can comfortably support multiple petabytes.
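When estimating data volume, it helps to project growth, not just current size. As a rough sketch (assuming simple compound monthly growth, a hypothetical model rather than anything from the article), you can estimate when a dataset will outgrow the ~1 TB single-node RDBMS ceiling:

```python
def months_until(limit_tb: float, current_tb: float, monthly_growth: float) -> int:
    """Return how many months until data volume exceeds limit_tb,
    assuming compound growth of `monthly_growth` per month (0.10 = 10%)."""
    months = 0
    size = current_tb
    while size <= limit_tb:
        size *= 1 + monthly_growth
        months += 1
    return months

# Example: 200 GB today, growing 10% per month, against a ~1 TB ceiling
print(months_until(1.0, 0.2, 0.10))  # → 17
```

If the answer is only a year or two out, it may be worth starting on a scalable warehouse now rather than migrating later.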
On‑Premise vs. Cloud
Assess whether you have dedicated resources for database maintenance and support. If you do, you can consider self‑hosted options like Hadoop, Greenplum, or Spark SQL, which require significant engineering effort.
If you lack dedicated staff, cloud data warehouses (Redshift, BigQuery, Snowflake) remove the need to manage VMs, replication, or encryption; you can start using them with simple SQL commands.
Scalability
Scalability can be achieved horizontally (adding more machines) or vertically (adding resources to a single node). Redshift offers easy node scaling; BigQuery provides up to 2,000 slots and can scale beyond that with its multi‑tenant architecture. Snowflake separates storage and compute, allowing independent scaling of each.
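Horizontal scaling rarely yields linear speedups, because part of any query plan stays serial. A simplified Amdahl's-law sketch (a general illustration, not vendor-specific behavior) shows why adding nodes has diminishing returns:

```python
def amdahl_speedup(nodes: int, parallel_fraction: float) -> float:
    """Ideal speedup from horizontal scaling when only `parallel_fraction`
    of the work parallelizes (Amdahl's law)."""
    serial = 1 - parallel_fraction
    return 1 / (serial + parallel_fraction / nodes)

# A 95%-parallel query on 10 nodes speeds up ~6.9x, not 10x
print(round(amdahl_speedup(10, 0.95), 1))  # → 6.9
```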
ETL vs. ELT
Snowflake stores data on Amazon S3, decoupling storage from compute. That separation favors an ELT approach: load raw data into the warehouse first, then transform it in place using the warehouse's independently scalable compute, rather than transforming it before loading (ETL). This decoupling is what enables massive scalability for large data warehouses and analytics.
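To make the ELT pattern concrete, here is a minimal sketch using SQLite as a stand-in for a warehouse (the table and column names are invented for illustration): raw data lands as-is, and the cleanup runs inside the database with SQL. With a cloud warehouse the pattern is the same; only the connection and SQL dialect differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# Load: land the raw data untouched (the "L" happens before the "T")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "us"), (2, 340, "US"), (3, 990, "de")],
)

# Transform: clean and reshape inside the warehouse, using its compute
conn.execute("""
    CREATE TABLE orders AS
    SELECT id, amount_cents / 100.0 AS amount_usd, UPPER(country) AS country
    FROM raw_orders
""")

rows = conn.execute(
    "SELECT country, SUM(amount_usd) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(rows)
```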
Pricing
Self‑hosted Hadoop solutions incur VM or hardware costs. Cloud warehouses use on‑demand pricing with distinct models:
Amazon Redshift: on‑demand hourly pricing, Redshift Spectrum (pay per byte scanned), and reserved instances (up to 75% savings).
Google BigQuery: charges for storage, streaming inserts, and bytes scanned by queries; loading and exporting data are free; offers cost‑control caps and long‑term storage pricing.
Snowflake: per‑second compute billing (60‑second minimum) and separate storage pricing (starting at $40/TB/month for the standard tier).
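Because the three models are so different, comparing them usually means modeling your own workload. The sketch below encodes the billing shapes described above; the rates are illustrative placeholders, not current list prices, so check each vendor's pricing page before relying on the numbers:

```python
def redshift_monthly(node_hourly_usd: float, nodes: int, hours: int = 730) -> float:
    """On-demand Redshift model: nodes x hourly rate x hours in a month."""
    return node_hourly_usd * nodes * hours

def bigquery_monthly(storage_tb: float, scanned_tb: float,
                     storage_per_tb: float = 20.0, scan_per_tb: float = 5.0) -> float:
    """BigQuery on-demand model: storage plus bytes scanned by queries
    (default rates are illustrative, not current list prices)."""
    return storage_tb * storage_per_tb + scanned_tb * scan_per_tb

def snowflake_compute(seconds: float, credits_per_hour: float,
                      usd_per_credit: float) -> float:
    """Snowflake model: per-second billing with a 60-second minimum."""
    billed = max(seconds, 60)
    return billed / 3600 * credits_per_hour * usd_per_credit

# A 10-second query on a 1-credit/hour warehouse at $2/credit bills the 60s minimum
print(round(snowflake_compute(10, 1, 2.0), 4))  # → 0.0333
```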
Conclusion
General recommendations:
If total data < 1 TB, rows per table < 500 M, and a single node can hold the database, use an indexed RDBMS such as Postgres, MySQL, or MSSQL.
If data is between 1 TB and 100 TB, choose a modern cloud warehouse (Redshift, BigQuery, Snowflake) or, if you have expertise, a Hadoop‑based solution with dedicated staff.
If data exceeds 100 TB, adopt BigQuery, Snowflake, Redshift Spectrum, or a self‑hosted Hadoop‑compatible platform.
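The three rules of thumb above can be sketched as a small decision helper (function and parameter names are invented for illustration):

```python
def recommend(total_tb: float, max_rows_per_table_m: float = 0.0,
              has_dedicated_staff: bool = False) -> str:
    """Map the article's sizing rules of thumb to a recommendation."""
    if total_tb < 1 and max_rows_per_table_m < 500:
        return "indexed RDBMS (Postgres, MySQL, MSSQL)"
    if total_tb <= 100:
        if has_dedicated_staff:
            return "cloud warehouse (Redshift, BigQuery, Snowflake) or Hadoop-based"
        return "cloud warehouse (Redshift, BigQuery, Snowflake)"
    return "BigQuery, Snowflake, Redshift Spectrum, or self-hosted Hadoop-compatible"

print(recommend(0.3, 120))  # small dataset → RDBMS
print(recommend(40))        # mid-range → cloud warehouse
print(recommend(250))       # very large → petabyte-scale options
```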
For more details, see the original article at architect.pub/choosing-between-modern-data-warehouses .
Architects Research Society
A daily treasure trove for architects, expanding your view and depth. We share enterprise, business, application, data, technology, and security architecture, discuss frameworks, planning, governance, standards, and implementation, and explore emerging styles such as microservices, event‑driven, micro‑frontend, big data, data warehousing, IoT, and AI architecture.