
Unified UDF Implementation on Cloud Platform: Architecture, Features, and Open‑Source Contributions

This article introduces a unified User‑Defined Function (UDF) solution on a cloud data platform, detailing its remote execution architecture, compatibility with Hive UDFs, resource isolation, hot‑update capabilities, internal platform implementation, open‑source contributions to PrestoDB, and future work plans.

DataFunSummit

Lakehouse Analytics Service is a multi‑engine, multi‑tenant lakehouse service on Volcano Engine that allows users to run their own UDFs on the cloud platform.

Key user concerns include Hive UDF compatibility, hot‑update of UDF JAR packages, and isolation of jobs and resources to ensure security and performance.

To address these concerns, a Remote UDF solution was built on a FaaS (Function as a Service) serverless platform. It provides elastically scalable compute, manages multiple versions of UDF JARs through container images, and offers kernel and network isolation.

Clients invoke UDFs through a gRPC interface, allowing a single UDF implementation to be used across multiple engines such as Presto, Spark, or native C++ engines.
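The idea of one engine-neutral call contract can be sketched as follows. This is only an illustration: the article does not publish the actual gRPC schema, so `UdfRequest`, `UdfResponse`, `UdfServer`, and the `mask_email` function are hypothetical names, and an in-process object stands in for the gRPC transport.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class UdfRequest:
    """Hypothetical engine-neutral request: function name, version, and a batch of argument rows."""
    function: str
    version: str
    rows: List[List[Any]]


@dataclass
class UdfResponse:
    """One result per input row, in the same order as the request."""
    results: List[Any]


class UdfServer:
    """In-process stand-in for the remote service; a real deployment would expose this over gRPC."""

    def __init__(self) -> None:
        self._functions: Dict[Tuple[str, str], Callable[..., Any]] = {}

    def register(self, name: str, version: str, fn: Callable[..., Any]) -> None:
        self._functions[(name, version)] = fn

    def invoke(self, request: UdfRequest) -> UdfResponse:
        fn = self._functions[(request.function, request.version)]
        return UdfResponse([fn(*row) for row in request.rows])


server = UdfServer()
server.register("mask_email", "v1", lambda s: s[0] + "***" + s[s.index("@"):])

# Any engine (Presto, Spark, a C++ engine) would build the same request shape.
request = UdfRequest("mask_email", "v1", [["alice@example.com"], ["bob@example.com"]])
response = server.invoke(request)
```

Because the caller only depends on the request and response shapes, the engine never needs to load the UDF's own code or runtime.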

The architecture provides four main advantages: (1) strong security through sandbox isolation, (2) excellent scalability via horizontal resource expansion, (3) hot‑update of UDF JARs by mapping versions to container images, and (4) a unified interface description that supports multiple engines without requiring JNI wrappers.
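Advantage (3) can be pictured as a small registry that maps each UDF version to a container image. The sketch below is an assumption about how such a mapping might look, not the platform's implementation; `UdfImageRegistry` and the image names are hypothetical.

```python
from typing import Dict, Optional, Tuple


class UdfImageRegistry:
    """Hypothetical mapping from (UDF name, version) to a container image reference."""

    def __init__(self) -> None:
        self._images: Dict[Tuple[str, str], str] = {}
        self._current: Dict[str, str] = {}

    def publish(self, udf: str, version: str, image: str) -> None:
        # Publishing a new version only adds a mapping; running containers are untouched.
        self._images[(udf, version)] = image
        self._current[udf] = version

    def resolve(self, udf: str, version: Optional[str] = None) -> str:
        # Without an explicit version, callers get the most recently published one.
        chosen = version if version is not None else self._current[udf]
        return self._images[(udf, chosen)]


registry = UdfImageRegistry()
registry.publish("geo_distance", "1.0.0", "registry.example.com/udf/geo_distance:1.0.0")
registry.publish("geo_distance", "1.1.0", "registry.example.com/udf/geo_distance:1.1.0")

latest_image = registry.resolve("geo_distance")
pinned_image = registry.resolve("geo_distance", "1.0.0")
```

New requests pick up the latest image while jobs pinned to an older version keep resolving to the old one, which is what makes the update "hot".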

Compared with local execution, Remote UDF offers remote resource scalability, independent environments, and hot‑update, while incurring network overhead and container start‑up latency, which can be mitigated by request merging and image pre‑warming.
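Request merging amortizes the network round trip over many rows. A minimal sketch, assuming the remote endpoint accepts a list of inputs and returns one result per input; `call_in_batches` and `fake_remote` are hypothetical helpers, not part of any engine's API.

```python
from typing import Any, Callable, List, Sequence


def call_in_batches(
    rows: Sequence[Any],
    remote_call: Callable[[Sequence[Any]], List[Any]],
    batch_size: int,
) -> List[Any]:
    """Send rows to the remote UDF in groups of batch_size instead of one call per row."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    results: List[Any] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        results.extend(remote_call(batch))
    return results


calls: List[int] = []


def fake_remote(batch: Sequence[int]) -> List[int]:
    # Stand-in for the network call; records the size of each batch it receives.
    calls.append(len(batch))
    return [x * 2 for x in batch]


doubled = call_in_batches(list(range(10)), fake_remote, batch_size=4)
```

Here ten rows cost three round trips instead of ten; the right batch size depends on payload size and latency, which the article does not specify.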

Internally, ByteDance’s platform supports Hive UDF/UDAF in Presto both in local mode and via the Remote UDF framework, ensuring semantic consistency across engines and reducing user effort.

Open‑source contributions include adding Hive UDF/UDAF support to PrestoDB, implementing RemoteScalarFunctionImplementation, and exposing a FunctionNamespaceManager that loads Hive UDF classes and maps their data types to Presto types.
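The type-mapping step can be illustrated with a small lookup table. This is not the PrestoDB code; it is a simplified sketch in which `HIVE_TO_PRESTO` and `to_presto_type` are hypothetical names, covering only common primitive types and arrays.

```python
# Common Hive primitive types and the Presto types they correspond to.
HIVE_TO_PRESTO = {
    "boolean": "boolean",
    "tinyint": "tinyint",
    "smallint": "smallint",
    "int": "integer",
    "bigint": "bigint",
    "float": "real",
    "double": "double",
    "string": "varchar",
    "binary": "varbinary",
    "date": "date",
    "timestamp": "timestamp",
}


def to_presto_type(hive_type: str) -> str:
    """Translate a Hive type name into a Presto type name (primitives and arrays only)."""
    normalized = hive_type.strip().lower()
    if normalized.startswith("array<") and normalized.endswith(">"):
        # Recurse into the element type, e.g. array<int> becomes array(integer).
        return f"array({to_presto_type(normalized[6:-1])})"
    try:
        return HIVE_TO_PRESTO[normalized]
    except KeyError:
        raise ValueError(f"unsupported Hive type in this sketch: {hive_type}") from None


signature = [to_presto_type(t) for t in ["string", "int", "array<double>"]]
```

A real implementation would also have to handle maps, structs, decimals, and parameterized types such as `varchar(n)`.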

Future work aims to support Metastore‑based UDF/UDAF with dynamic metadata loading and hot‑loading, optimize Remote UDF performance in Presto (e.g., merge small pages, dictionary encoding, result caching), and add Remote UDAF support.
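Of the planned optimizations, result caching is the simplest to sketch. The example below assumes the UDF is deterministic and its arguments are hashable; `CachedRemoteUdf` is a hypothetical wrapper, and the article does not describe how the planned cache will actually work.

```python
from functools import lru_cache
from typing import Any, Callable


class CachedRemoteUdf:
    """Wraps a remote call so repeated argument tuples are answered from a local LRU cache."""

    def __init__(self, remote_call: Callable[..., Any], max_entries: int = 1024) -> None:
        self.remote_calls = 0

        def _invoke(args: tuple) -> Any:
            # Only cache misses reach this point, so this counts real remote calls.
            self.remote_calls += 1
            return remote_call(*args)

        self._cached = lru_cache(maxsize=max_entries)(_invoke)

    def __call__(self, *args: Any) -> Any:
        return self._cached(args)


# str.upper stands in for a deterministic remote UDF.
upper = CachedRemoteUdf(lambda s: s.upper())
outputs = [upper(v) for v in ["cn", "us", "cn", "cn", "us"]]
```

Caching pays off on low-cardinality columns, where the same input value appears many times; it must be skipped for non-deterministic functions.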

The article concludes with a Q&A covering how Hive UDFs are registered, the lack of an open‑source Spark Remote UDF implementation, and usage scenarios for Remote UDFs within the company.
