
Key Factors to Consider When Building Your Own Data Warehouse

This article examines essential considerations such as data volume, personnel support, scalability, and pricing models when selecting a data warehouse solution, comparing on‑premise options with modern cloud services like Redshift, BigQuery, and Snowflake for various workload sizes.

Architects Research Society

We have worked with many data warehouses. When clients ask which is best for a growing company, we tailor the recommendation to their needs. Most want near‑real‑time data, low cost, and minimal maintenance, and for them we usually suggest a modern cloud warehouse such as Redshift, BigQuery, or Snowflake.

Most modern data warehouse solutions are designed to work directly with raw data, allowing dynamic re‑transformation without re‑ingesting stored data.
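To make the idea of re‑transforming raw data concrete, here is a minimal sketch. It assumes an in‑memory SQLite database as a stand‑in for a cloud warehouse, and the `raw_orders` table, its columns, and the `revenue` view are hypothetical names chosen for illustration. The raw rows are loaded once; when the business rule changes, only the view is redefined.

```python
import sqlite3

# In-memory SQLite stands in for a cloud warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")

# Load the raw data once, untransformed. Table and column names are hypothetical.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 800, "refunded"), (3, 4300, "paid")],
)

# First transformation: total of all orders, converted to dollars.
conn.execute(
    "CREATE VIEW revenue AS SELECT SUM(amount_cents) / 100.0 AS usd FROM raw_orders"
)
first_total = conn.execute("SELECT usd FROM revenue").fetchone()[0]

# The rule changes: refunds must be excluded. Only the view is redefined;
# the rows in raw_orders are neither modified nor loaded again.
conn.execute("DROP VIEW revenue")
conn.execute(
    "CREATE VIEW revenue AS "
    "SELECT SUM(amount_cents) / 100.0 AS usd FROM raw_orders WHERE status = 'paid'"
)
second_total = conn.execute("SELECT usd FROM revenue").fetchone()[0]

raw_row_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(first_total, second_total, raw_row_count)
```

The same pattern applies in a real warehouse, where the transformation would typically be a view or a scheduled query over the raw tables.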

In this article we explore the factors to consider when choosing a data warehouse:

Data volume

Dedicated personnel for support and maintenance

Scalability: horizontal and vertical

Pricing model

Data Volume

You need an estimate of the amount of data you will process. If the dataset runs to hundreds of terabytes or petabytes, a non‑relational or distributed (massively parallel) system is strongly recommended because its architecture is built for that scale.

Conversely, many relational databases have proven query optimizers; if your dataset fits on a single node, they remain a viable analytical warehouse option.


Traditional RDBMSs such as Postgres, MySQL, and MSSQL perform well up to about 1 TB of analytical data; beyond that, performance may degrade.

Amazon Redshift, Google BigQuery, Snowflake, and Hadoop‑based solutions can efficiently handle multiple petabytes.
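A rough volume estimate can be a back‑of‑the‑envelope calculation. The sketch below assumes you know the approximate number of rows per day, the average row size, and the retention period; the function name and the workload figures are hypothetical, and the result ignores compression, indexes, and replication.

```python
def estimate_volume_tb(rows_per_day: int, bytes_per_row: int, retention_days: int) -> float:
    """Rough uncompressed data volume in terabytes (1 TB = 10**12 bytes)."""
    return rows_per_day * bytes_per_row * retention_days / 1e12


# Hypothetical workload: 50 million events per day, about 200 bytes each,
# retained for three years.
volume = estimate_volume_tb(
    rows_per_day=50_000_000,
    bytes_per_row=200,
    retention_days=3 * 365,
)
print(f"{volume:.2f} TB")  # prints "10.95 TB"
```

A workload like this one lands well above the roughly 1 TB mark where a single‑node RDBMS may start to struggle.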

On‑Premise vs. Cloud

Another important question is whether you have dedicated staff for database maintenance, support, and troubleshooting; the answer largely determines which options are realistic.

If you have dedicated resources for support and maintenance, you have more options when choosing a database.

You can build your own big‑data warehouse using Hadoop or Greenplum, but these require substantial installation, maintenance engineering, and skilled personnel.

If you lack dedicated resources, we recommend modern cloud data warehouses such as Redshift, BigQuery, or Snowflake, which eliminate the need to manage deployment, hosting, VM sizing, replication, or encryption; you can start with simple SQL commands.

Scalability

When adopting a database, you want enough scalability to support future growth. You can scale horizontally, by adding more machines, or vertically, by adding resources to a single node to improve its performance.

Redshift offers simple scaling: a few clicks let you add nodes and configure them to meet demand, handling up to ~100 TB per query.

BigQuery provides a multi‑tenant model with up to 2,000 slots (units of query compute, roughly comparable to Redshift nodes) and can seamlessly scale beyond that when needed.

BigQuery relies on Google’s next‑generation distributed file system Colossus, allowing seamless scaling to dozens of petabytes without extra compute costs.


On AWS, Snowflake is built on Amazon S3 storage; its storage layer holds all data, tables, and query results and is completely independent of compute, so storage and compute can scale separately for large data warehouses and analytics.

Snowflake also offers multiple virtual warehouses of virtually any size and concurrency, allowing simultaneous operations on the same data while enforcing global transactional integrity.

Pricing

Self‑hosted options like Hadoop primarily incur VM or hardware costs; AWS EMR is a managed Hadoop offering to consider.

Redshift, BigQuery, and Snowflake all provide on‑demand pricing, each with its own model.

Amazon Redshift offers three pricing modes:

On‑demand: pay hourly based on node type and count, with rates varying by region and covering compute and storage.

Spectrum: pay only for the bytes scanned from Amazon S3 during queries.

Reserved instances: commit to a one‑ or three‑year term to save up to 75% versus on‑demand.
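The three modes can be compared with simple arithmetic. The sketch below is illustrative only: the hourly node rate, the per‑TB scan rate, and the cluster size are hypothetical placeholders, not actual AWS prices, and the 75% discount is the upper bound mentioned above rather than a rate you will necessarily get.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def on_demand_monthly(node_count: int, usd_per_node_hour: float) -> float:
    """On-demand: pay per node for every hour the cluster runs."""
    return node_count * usd_per_node_hour * HOURS_PER_MONTH


def reserved_monthly(node_count: int, usd_per_node_hour: float, discount: float) -> float:
    """Reserved: the on-demand cost reduced by the committed-use discount (0.0 to 1.0)."""
    return on_demand_monthly(node_count, usd_per_node_hour) * (1 - discount)


def spectrum_query_cost(tb_scanned: float, usd_per_tb_scanned: float) -> float:
    """Spectrum: pay only for the data scanned from S3 by a query."""
    return tb_scanned * usd_per_tb_scanned


# Hypothetical figures: 4 nodes at $1.00 per node-hour, $5.00 per TB scanned.
on_demand = on_demand_monthly(node_count=4, usd_per_node_hour=1.00)
reserved = reserved_monthly(node_count=4, usd_per_node_hour=1.00, discount=0.75)
spectrum = spectrum_query_cost(tb_scanned=2.0, usd_per_tb_scanned=5.00)
print(on_demand, reserved, spectrum)  # prints "2920.0 730.0 10.0"
```

Plugging in the current rates for your region and node type shows quickly whether a commitment is worth it for a cluster that runs around the clock.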

Google BigQuery charges for storage, streaming inserts, and query data scanned (loading and exporting are free). Pricing is based on per‑GB storage and per‑TB query bytes, with daily cost caps and long‑term storage options.

Snowflake’s on‑demand pricing is similar to BigQuery and Redshift Spectrum, but compute is billed per second (minimum 60 seconds). Storage and compute are billed separately.

Standard storage starts at $40 per TB per month; compute costs start at $2.00 per hour for the standard edition and $4.00 per hour for the enterprise edition.
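Per‑second billing with a 60‑second minimum is easiest to see in numbers. The sketch below uses the starting prices quoted above ($40 per TB per month, $2.00 per hour for the standard edition); the function names are hypothetical, and it makes the simplifying assumption that each run is billed on its own.

```python
def billed_seconds(runtime_seconds: int, minimum_seconds: int = 60) -> int:
    """Per-second billing, but never less than the minimum charge."""
    return max(runtime_seconds, minimum_seconds)


def compute_cost_usd(runtime_seconds: int, usd_per_hour: float) -> float:
    """Compute cost for one run at the given hourly rate."""
    return billed_seconds(runtime_seconds) * usd_per_hour / 3600


def storage_cost_usd(stored_tb: float, usd_per_tb_month: float = 40.0) -> float:
    """Monthly storage cost, billed separately from compute."""
    return stored_tb * usd_per_tb_month


# A 20-second query is still billed as 60 seconds.
print(billed_seconds(20))                   # prints 60
# A 90-minute job at the standard-edition rate.
print(compute_cost_usd(90 * 60, 2.00))      # prints 3.0
# 10 TB of standard storage for one month.
print(storage_cost_usd(10))                 # prints 400.0
```

Because of the minimum charge, many very short runs can cost more than their actual runtime suggests, which is worth keeping in mind for workloads made of frequent small queries.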

Conclusion

General guidance for choosing a data warehouse:

If total data is well below 1 TB, each analytical table has fewer than 500 M rows, and the entire database fits on a single node, use an index‑optimized RDBMS such as Postgres, MySQL, or MSSQL.

If data ranges from 1 TB to 100 TB, adopt a modern data warehouse like Redshift, BigQuery, or Snowflake; alternatively, consider Hadoop/Hive, Spark SQL, or Impala if you have the expertise and dedicated staff.

If data exceeds 100 TB, choose BigQuery, Snowflake, Redshift Spectrum, or a self‑hosted Hadoop‑equivalent solution.
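The guidance above can be written down as a small decision function. This is only a sketch of our rules of thumb: the function name and parameters are hypothetical, "well below 1 TB" is approximated as strictly less than 1 TB, and a small dataset that does not fit on a single node is assumed to fall into the middle tier.

```python
def recommend_warehouse(
    total_tb: float,
    max_table_rows: int,
    fits_single_node: bool,
    has_dedicated_staff: bool,
) -> str:
    """Map data size and staffing to the article's rule-of-thumb recommendation."""
    # Small data on a single node: an index-optimized RDBMS is enough.
    if total_tb < 1 and max_table_rows < 500_000_000 and fits_single_node:
        return "RDBMS (Postgres, MySQL, MSSQL)"
    # 1 TB to 100 TB: a modern cloud warehouse, or Hadoop-style tools with staff.
    if total_tb <= 100:
        if has_dedicated_staff:
            return "Redshift, BigQuery, Snowflake, or Hadoop/Hive, Spark SQL, Impala"
        return "Redshift, BigQuery, or Snowflake"
    # Beyond 100 TB.
    return "BigQuery, Snowflake, Redshift Spectrum, or self-hosted Hadoop-equivalent"


print(recommend_warehouse(0.2, 50_000_000, True, False))
print(recommend_warehouse(20, 2_000_000_000, False, False))
print(recommend_warehouse(500, 10_000_000_000, False, True))
```

Treat the thresholds as starting points; real decisions also depend on query patterns, concurrency, and budget.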
