Big Data 23 min read

Overview and Design of Google’s F1 Query: A Scalable Enterprise Data Processing System

The article reviews Google’s F1 Query paper, describing its architecture, three execution modes, data source handling, extensibility features such as UDF servers and TVFs, and performance optimizations that enable a unified, enterprise‑wide SQL engine for heterogeneous big‑data workloads.

Big Data Technology & Architecture

Jan 17, 2020

Overview

Google recently published the paper "F1 Query: Declarative Querying at Scale" which introduces a large‑scale data‑processing system called F1 Query. Unlike Presto, its focus is not merely on supporting many heterogeneous sources but on covering the full spectrum of enterprise data‑processing requirements.

Design Intent

F1 Query was driven by internal Google business needs where data is fragmented across relational databases, KV stores, and log files. The design emphasizes a cross‑data‑center cluster rather than a single machine, separating compute from storage to exploit high intra‑datacenter bandwidth.

Overall Architecture

The system consists of three vertical layers: F1 Client, F1 Cluster, and heterogeneous data sources. Inside the cluster the main components are: F1 Master: monitors and manages all F1 Workers. F1 Server: the front‑end that parses, optimizes, and compiles queries before handing execution to workers. Catalog Service: a metadata hub that presents all data sources as a unified view. Batch Metadata: stores metadata for batch‑execution tasks. UDF Server: a remote RPC service hosting user‑defined functions written in any language.

Because storage and compute are decoupled, both F1 Server and F1 Worker are stateless and can be horizontally scaled without redistributing data.

Query Execution

When a query arrives, the first F1 Server parses it, discovers the involved data sources, and may redirect the request to a nearer server. The optimizer then chooses one of three execution modes.

Interactive Execution Modes

Two interactive modes exist:

Centralized Execution : the receiving server executes the plan itself using a single‑threaded pull‑based model, suitable for small, low‑latency queries.

Distributed Execution : the server acts as a scheduler while a set of F1 Workers execute the plan in parallel, similar to Presto’s coordinator/worker architecture.

Fragments are created by the optimizer; each fragment runs on a worker and may exchange data via an Exchange operator that repartitions rows based on a hash of the join or aggregation key.

Batch Execution

For long‑running or high‑throughput workloads, queries are translated into MapReduce‑style stages. Each stage writes its output to Google’s Colossus file system, providing fault tolerance and allowing asynchronous client interaction.

Data Sources

F1 Query can query any registered source (Bigtable, Spanner, Google Spreadsheets, etc.) by abstracting them as relational tables. Users can define external tables with a DEFINE TABLE statement, which creates a temporary table visible only for the current session.

Extensibility

F1 Query supports custom data sources, user‑defined functions (UDF), aggregates (UDAF), and table‑valued functions (TVF). UDFs can be written in Lua, SQL, or any language via the UDF Server. TVFs accept a table as input and return a new table, enabling use cases such as machine‑learning pipelines.

DEFINE TABLE People(
   format = 'csv',
   path = '/path/to/peoplefile',
   columns = 'name:STRING, DateOfBirth:DATE'
);
SELECT Name, DateOfBirth FROM People WHERE Name = 'John Doe';

Performance Optimizations

Key optimizations include batch‑, asynchronous‑, and pipeline‑based processing to hide RPC latency, dynamic key‑range partitioning to avoid data skew, and broadcast hash joins for small inputs. Lookup joins are batched and deduplicated to reduce remote lookups.

Summary

F1 Query demonstrates an innovative approach by combining a single engine with three execution modes to satisfy all enterprise data‑processing needs, offering unified SQL access to heterogeneous sources, extensible UDF/TVF support, and sophisticated performance techniques such as dynamic key‑range partitioning.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

SQL UDF Data Partitioning distributed query engine execution modes F1 Query

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.