
Selection and Comparison of Big Data Benchmark Standards with a Focus on TPC‑DS

This article reviews the evolution of big‑data management technologies, discusses the criteria for choosing appropriate big‑data benchmarks, compares existing benchmarks such as MapReduce tests, YCSB, BigBench and BigFrame, and provides an in‑depth analysis of the TPC‑DS benchmark and its certification status.


With the commercialization of open‑source technologies such as Hadoop, Map/Reduce, Spark, HDFS, and HBase, big‑data management techniques have advanced rapidly. Big data is commonly described by the 3V characteristics—Volume, Velocity, and Variety—while experts also emphasize Value and Veracity as additional challenges, making the selection of an objective benchmark a critical research topic.

The Transaction Processing Performance Council (TPC) is the most recognized organization for standardizing database benchmark tests. Over the past two decades it has released benchmarks such as TPC‑A, TPC‑D, TPC‑H, and TPC‑DS, which are widely adopted in industry. Extensions like BigBench and BigFrame expand TPC‑DS, and the Apache community provides Map/Reduce‑specific tests such as TestDFSIO and TeraSort. In China, a national big‑data benchmark is being developed by the China Academy of Information and Communications Technology together with the Chinese Academy of Sciences and other partners.

To help enterprises choose suitable big‑data benchmarks, this paper first analyses existing work, then outlines the essential elements a benchmark should possess, compares current benchmarks, and finally focuses on the TPC‑DS benchmark.

Selection of Big‑Data Benchmarks

When selecting a big‑data benchmark, enterprises should first consider the relevance of the benchmark to their own business scenarios.

1. Relevance to Business The benchmark’s application domain should resemble the company’s real workload; for example, a benchmark designed for social‑network workloads is not appropriate for banking systems. The data model used by the benchmark should also reflect current data‑warehouse trends, such as star‑schema models.

Beyond business relevance, seven further factors matter: realistic synthetic data, scalable workload definition, understandable metrics, objectivity and fairness, robustness, SQL‑standard compatibility, and generality/portability.

2. Realistic Synthetic Data The generated data should closely mimic real‑world data characteristics.

3. Scalable Workload Definition The benchmark should support different system scales, typically by adjusting a scale factor that changes data size and workload intensity.

4. Understandable Metrics The metrics should be easy for users to interpret, enhancing the benchmark’s credibility.

5. Objectivity and Fairness A neutral third‑party organization should design the benchmark, and independent audits should verify results, similar to how TPC benchmarks are governed.

6. Robustness The benchmark must resist cheating (for example, by prohibiting precomputed materialized views) to ensure fair comparison.

7. SQL‑Standard Compatibility Broad support for ANSI SQL standards (SQL‑86, SQL‑92, SQL‑99, SQL‑2003) is essential for portability across commercial and open‑source DBMSs.

8. Generality/Portability The benchmark should define only the specification, allowing implementations on various platforms such as Map/Reduce, Spark, HDFS, or HBase.

Big‑Data Benchmark Comparison

After decades of research, traditional database benchmarks are mature, while big‑data benchmarks have emerged only recently, often extending or adapting existing standards.

1. Map/Reduce Performance Tests Benchmarks like MRBench, HiBench, TestDFSIO, and TeraSort evaluate the performance of Map/Reduce clusters but are too simple to model complex applications.

2. YCSB / YCSB++ / LinkBench These target cloud services and social‑network workloads, measuring latency, scalability, and parallelism, but are highly specialized and not widely applicable.

3. BigBench An extension of TPC‑DS for retail, adding semi‑structured Web‑Log data and unstructured Reviews; it contains 30 queries and a customized data model.

4. BigFrame A benchmark generator that lets users create custom benchmarks; its relational model resembles BigBench and adds semi‑structured Tweets and graph data (Followee/Follower).
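To make the comparison concrete, the first two families above are driven from the command line. The session below is an illustrative sketch, not a canonical recipe: the jar paths, HDFS directories, file sizes, and the YCSB binding are placeholders that vary by installation, and some flag spellings (e.g., TestDFSIO's file-size flag) differ across Hadoop versions.

```shell
# Map/Reduce micro-benchmarks shipped with Hadoop (jar paths vary by distribution)
EXAMPLES_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar
TEST_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-tests.jar

# TestDFSIO: HDFS write/read throughput over 10 files of 1 GB each
hadoop jar "$TEST_JAR" TestDFSIO -write -nrFiles 10 -fileSize 1GB
hadoop jar "$TEST_JAR" TestDFSIO -read  -nrFiles 10 -fileSize 1GB

# TeraSort: generate, sort, and validate 10^7 100-byte rows (about 1 GB)
hadoop jar "$EXAMPLES_JAR" teragen 10000000 /bench/tera-in
hadoop jar "$EXAMPLES_JAR" terasort /bench/tera-in /bench/tera-out
hadoop jar "$EXAMPLES_JAR" teravalidate /bench/tera-out /bench/tera-report

# YCSB: load, then run, a core workload (here against the built-in "basic" binding)
./bin/ycsb load basic -P workloads/workloada
./bin/ycsb run  basic -P workloads/workloada
```

TestDFSIO reports aggregate throughput, TeraSort's elapsed wall-clock time is the headline number, and YCSB prints overall throughput plus per-operation latency statistics, which illustrates why these tests are easy to run but too narrow to model a full analytics application.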

TPC‑DS

TPC‑DS is the next‑generation decision‑support benchmark introduced by TPC to replace TPC‑H.

1. TPC‑H Designed for retail decision support, it defines 8 tables and 22 queries written in SQL‑92. Its normalized schema limits its ability to test modern data‑warehouse features such as ETL and complex data models.

2. TPC‑DS Uses star and snowflake schemas with 7 fact tables and 17 dimension tables (average 18 columns each). It includes 99 SQL queries covering SQL‑99/2003 features, OLAP, reporting, and data‑mining workloads, with realistic data skew.

Key characteristics of TPC‑DS:

99 test cases following SQL‑99 and SQL‑2003 syntax, with complex queries.

Large data volumes that answer real business questions.

Workloads span analytical reporting, iterative OLAP, and data‑mining.

High I/O and CPU demands across almost all cases.

Researchers have summarized the distribution of these queries (see Table 1 in the original source). Detailed information is available at http://www.tpc.org/tpcds/.
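The TPC‑DS kit itself follows the "specification plus tooling" approach: it ships a data generator (dsdgen) and a query generator (dsqgen) that instantiate the 99 query templates for a target SQL dialect at a chosen scale factor. The invocation below is an illustrative sketch; the directories are placeholders, and flag spellings should be checked against the kit version in use.

```shell
# Generate 100 GB of raw data, split across 4 parallel child processes
# (run once per child, with -child 1 .. 4)
./dsdgen -scale 100 -dir /bench/tpcds-data -parallel 4 -child 1

# Instantiate the 99 query templates for a given SQL dialect at the same scale
./dsqgen -directory ../query_templates \
         -input ../query_templates/templates.lst \
         -dialect netezza \
         -scale 100 \
         -output_dir /bench/tpcds-queries
```

Each child process emits its share of the flat files, which are then loaded into the system under test; because the generated queries are plain SQL text, they can be submitted through whatever front end the platform provides.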

3. TPC‑DS Certification Status Despite its high standards, no vendor has yet received official TPC certification for TPC‑DS. Traditional DBMS vendors lack sufficient distributed processing capabilities, while newer platforms (Map/Reduce, Spark) have limited SQL compatibility, preventing official certification.

Conclusion

Big‑data benchmarks are essential for fairly and objectively evaluating the functionality and performance of various big‑data platforms, guiding the selection of suitable analytics solutions. While TPC‑DS is becoming the de‑facto standard, benchmarks must evolve with emerging applications, requiring close collaboration among governments, academia, and industry.

Disclaimer: The content originates from public internet sources; the author remains neutral and provides it for reference and discussion only. Copyright belongs to the original authors or institutions. Please contact us for removal if any infringement occurs.
Tags: Big Data, SQL, Performance Testing, Benchmark, Data Management, TPC‑DS
Written by

Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
