Key Factors to Consider When Building Your Own Data Warehouse
This article examines the essential considerations for selecting a modern data warehouse—including data volume, staffing, scalability, and pricing models—while comparing on‑premise and cloud solutions such as Redshift, BigQuery, and Snowflake to help organizations make informed decisions.
When clients ask which data warehouse is best for a growing company, we evaluate their specific needs: near‑real‑time data, low cost, and minimal infrastructure maintenance. Modern cloud warehouses like Amazon Redshift, Google BigQuery, and Snowflake often fit these requirements.
Factors to Evaluate
Data volume
Dedicated personnel for support and maintenance
Scalability (horizontal vs. vertical)
Pricing model
Data Volume
Estimate the amount of data you will process. For datasets in the hundreds of terabytes or petabytes, a non‑relational database designed for massive scale is recommended. Relational databases with strong query optimizers work well for smaller datasets that fit on a single node.
Loading TB‑scale Data
Traditional RDBMSs (Postgres, MySQL, MSSQL) handle up to ~1 TB efficiently; beyond that performance may degrade.
Amazon Redshift, Google BigQuery, Snowflake, and Hadoop‑based solutions can comfortably support multiple petabytes.
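When estimating data volume, it helps to project growth, not just current size. As a rough sketch (assuming simple compound monthly growth, a hypothetical model rather than anything from the article), you can estimate when a dataset will outgrow the ~1 TB single-node RDBMS ceiling:

```python
def months_until(limit_tb: float, current_tb: float, monthly_growth: float) -> int:
    """Return how many months until data volume exceeds limit_tb,
    assuming compound growth of `monthly_growth` per month (0.10 = 10%)."""
    months = 0
    size = current_tb
    while size <= limit_tb:
        size *= 1 + monthly_growth
        months += 1
    return months

# Example: 200 GB today, growing 10% per month, against a ~1 TB ceiling
print(months_until(1.0, 0.2, 0.10))  # → 17
```

If the answer is only a year or two out, it may be worth starting on a scalable warehouse now rather than migrating later.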
On‑Premise vs. Cloud
Assess whether you have dedicated resources for database maintenance and support. If you do, you can consider self‑hosted options like Hadoop, Greenplum, or Spark SQL, which require significant engineering effort.
If you lack dedicated staff, cloud data warehouses (Redshift, BigQuery, Snowflake) remove the need to manage VMs, replication, or encryption; you can start using them with simple SQL commands.
Scalability
Scalability can be achieved horizontally (adding more machines) or vertically (adding resources to a single node). Redshift offers easy node scaling; BigQuery provides up to 2,000 slots and can scale beyond that with its multi‑tenant architecture. Snowflake separates storage and compute, allowing independent scaling of each.
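Horizontal scaling rarely yields linear speedups, because part of any query plan stays serial. A simplified Amdahl's-law sketch (a general illustration, not vendor-specific behavior) shows why adding nodes has diminishing returns:

```python
def amdahl_speedup(nodes: int, parallel_fraction: float) -> float:
    """Ideal speedup from horizontal scaling when only `parallel_fraction`
    of the work parallelizes (Amdahl's law)."""
    serial = 1 - parallel_fraction
    return 1 / (serial + parallel_fraction / nodes)

# A 95%-parallel query on 10 nodes speeds up ~6.9x, not 10x
print(round(amdahl_speedup(10, 0.95), 1))  # → 6.9
```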
ETL vs. ELT
Snowflake stores data on Amazon S3, decoupling storage from compute. That separation favors an ELT approach: load raw data into the warehouse first, then transform it in place using the warehouse's independently scalable compute, rather than transforming it before loading (ETL). This decoupling is what enables massive scalability for large data warehouses and analytics.
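To make the ELT pattern concrete, here is a minimal sketch using SQLite as a stand-in for a warehouse (the table and column names are invented for illustration): raw data lands as-is, and the cleanup runs inside the database with SQL. With a cloud warehouse the pattern is the same; only the connection and SQL dialect differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# Load: land the raw data untouched (the "L" happens before the "T")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "us"), (2, 340, "US"), (3, 990, "de")],
)

# Transform: clean and reshape inside the warehouse, using its compute
conn.execute("""
    CREATE TABLE orders AS
    SELECT id, amount_cents / 100.0 AS amount_usd, UPPER(country) AS country
    FROM raw_orders
""")

rows = conn.execute(
    "SELECT country, SUM(amount_usd) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(rows)
```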
Pricing
Self‑hosted Hadoop solutions incur VM or hardware costs. Cloud warehouses use on‑demand pricing with distinct models:
Amazon Redshift: on‑demand hourly pricing, Redshift Spectrum (pay per byte scanned), and reserved instances (up to 75% savings).
Google BigQuery: charges for storage, streaming inserts, and bytes scanned by queries; loading and exporting data are free; offers cost‑control caps and long‑term storage pricing.
Snowflake: per‑second compute billing (60‑second minimum) and separate storage pricing (starting at $40/TB/month for the standard tier).
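Because the three models are so different, comparing them usually means modeling your own workload. The sketch below encodes the billing shapes described above; the rates are illustrative placeholders, not current list prices, so check each vendor's pricing page before relying on the numbers:

```python
def redshift_monthly(node_hourly_usd: float, nodes: int, hours: int = 730) -> float:
    """On-demand Redshift model: nodes x hourly rate x hours in a month."""
    return node_hourly_usd * nodes * hours

def bigquery_monthly(storage_tb: float, scanned_tb: float,
                     storage_per_tb: float = 20.0, scan_per_tb: float = 5.0) -> float:
    """BigQuery on-demand model: storage plus bytes scanned by queries
    (default rates are illustrative, not current list prices)."""
    return storage_tb * storage_per_tb + scanned_tb * scan_per_tb

def snowflake_compute(seconds: float, credits_per_hour: float,
                      usd_per_credit: float) -> float:
    """Snowflake model: per-second billing with a 60-second minimum."""
    billed = max(seconds, 60)
    return billed / 3600 * credits_per_hour * usd_per_credit

# A 10-second query on a 1-credit/hour warehouse at $2/credit bills the 60s minimum
print(round(snowflake_compute(10, 1, 2.0), 4))  # → 0.0333
```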
Conclusion
General recommendations:
If total data < 1 TB, rows per table < 500 M, and a single node can hold the database, use an indexed RDBMS such as Postgres, MySQL, or MSSQL.
If data is between 1 TB and 100 TB, choose a modern cloud warehouse (Redshift, BigQuery, Snowflake) or, if you have expertise, a Hadoop‑based solution with dedicated staff.
If data exceeds 100 TB, adopt BigQuery, Snowflake, Redshift Spectrum, or a self‑hosted Hadoop‑compatible platform.
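The three rules of thumb above can be sketched as a small decision helper (function and parameter names are invented for illustration):

```python
def recommend(total_tb: float, max_rows_per_table_m: float = 0.0,
              has_dedicated_staff: bool = False) -> str:
    """Map the article's sizing rules of thumb to a recommendation."""
    if total_tb < 1 and max_rows_per_table_m < 500:
        return "indexed RDBMS (Postgres, MySQL, MSSQL)"
    if total_tb <= 100:
        if has_dedicated_staff:
            return "cloud warehouse (Redshift, BigQuery, Snowflake) or Hadoop-based"
        return "cloud warehouse (Redshift, BigQuery, Snowflake)"
    return "BigQuery, Snowflake, Redshift Spectrum, or self-hosted Hadoop-compatible"

print(recommend(0.3, 120))  # small dataset → RDBMS
print(recommend(40))        # mid-range → cloud warehouse
print(recommend(250))       # very large → petabyte-scale options
```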
For more details, see the original article at architect.pub/choosing-between-modern-data-warehouses .
Architects Research Society
A daily treasure trove for architects, expanding your view and depth. We share enterprise, business, application, data, technology, and security architecture, discuss frameworks, planning, governance, standards, and implementation, and explore emerging styles such as microservices, event‑driven, micro‑frontend, big data, data warehousing, IoT, and AI architecture.