
Optimizing SparkSQL: ByteDance EMR’s Data Lake Integration and Multi‑Tenant Server

ByteDance’s EMR team details how they integrated data‑lake engines such as Hudi and Iceberg into SparkSQL, streamlined jar management, built a custom Spark SQL Server with Hive compatibility, multi‑tenant support, engine pre‑warming, and transaction capabilities, dramatically improving performance and resource efficiency for enterprise workloads.

ByteDance Data Platform

Data Lake Engine Integration

Data-lake engines such as Hudi and Iceberg are increasingly widely used. ByteDance EMR needed to integrate them into SparkSQL and ran into several issues along the way.

Iceberg integration: Users previously had to pass many extra options by hand and locate the matching spark-iceberg dependency themselves. The solution was to pre-install the required Iceberg JARs in Spark’s jars directory, so that users only need to specify the catalog.
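As a rough illustration of what "only specifying the catalog" can look like, the sketch below wraps `spark-sql` with the catalog settings described in the Iceberg documentation. The catalog name `lake`, the Hive-metastore catalog type, the function name `iceberg_sql`, and the `SPARK_SQL` override variable are assumptions made for this example, not ByteDance EMR's actual defaults.

```shell
# Sketch: start spark-sql against an Iceberg catalog when the Iceberg runtime
# JAR is already in Spark's jars directory (so no --jars/--packages is needed).
# "lake" is a hypothetical catalog name; SPARK_SQL can point at another binary.
iceberg_sql() {
  "${SPARK_SQL:-spark-sql}" \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.lake=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.lake.type=hive \
    "$@"
}
```

With the function defined, a call such as `iceberg_sql -e "SHOW NAMESPACES IN lake"` passes any remaining arguments straight through to `spark-sql`.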

Hive compatibility: With Iceberg, Spark can create tables that Presto/Trino can read, but Hive cannot read them unless the following configuration is set:

<code>iceberg.engine.hive.enabled=true</code>

The property can be set in either the Spark or the Hive configuration.
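A sketch of two ways to turn this on from the Spark side, using the property names as they appear in the Iceberg documentation (`iceberg.engine.hive.enabled` globally, `engine.hive.enabled` per table). Passing the global setting through Spark's `spark.hadoop.` prefix is an assumption about how the cluster is configured; the function names, the `SPARK_SQL` override variable, and the one-column table schema are hypothetical.

```shell
# Option 1 (global): Spark copies spark.hadoop.* settings into the Hadoop
# configuration, which is where Iceberg looks for this property.
hive_readable_sql() {
  "${SPARK_SQL:-spark-sql}" --conf spark.hadoop.iceberg.engine.hive.enabled=true "$@"
}

# Option 2 (per table): set the table-level property when creating the table.
# $1 is the fully qualified table name; the schema is a placeholder.
create_hive_readable_table() {
  "${SPARK_SQL:-spark-sql}" -e \
    "CREATE TABLE $1 (id BIGINT) USING iceberg TBLPROPERTIES ('engine.hive.enabled'='true')"
}
```

The global option suits clusters where every Iceberg table should be visible to Hive; the table-level option limits the change to individual tables.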

Because Iceberg 0.12 does not support Spark 3.2, the team compiled a master‑snapshot version of Iceberg to work with Spark 3.2.

Spark SQL Server

Existing tools such as Spark Thrift Server or Kyuubi do not meet certain requirements of enterprise (B2B) customers, so ByteDance EMR built a custom Spark SQL Server.

Hive-semantic compatibility: A SQL parser chain injects Hive SQL parsing into Spark, achieving full Hive-SQL compatibility.

Pre-initialization of Spark engines: Spark jobs are submitted to YARN in advance, keeping engines ready and reducing task-submission latency.

Cross-YARN-queue submission: Users can specify the YARN queue for their jobs.

The server implements Thrift/JDBC interfaces, supports OpenLDAP and Kerberos authentication, and offers session-level and user-level isolation.

<code>./bin/beeline -u "jdbc:hive2://emr-5fqkwudj144d2gc1k8hi-master-1/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=midas/ha;auth=LDAP" -n emr_dev -pEMR123456emr</code>
<code>./bin/beeline -u "jdbc:hive2://emr-master-2:10005/default;auth=LDAP" -n test_sub -pEMR123456emr</code>

Multi‑Tenant Isolation

Three isolation levels are provided:

Session: A new Spark engine is launched per connection and destroyed after disconnection.

User: Multiple engines are shared by the same user; engines persist until idle timeout.

Open: Engines are shared across all users, suitable for large accounts or clusters without strict permission management.
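One way to picture the three levels is as a rule for deriving the key under which an engine is cached: connections that map to the same key reuse the same engine. The sketch below is a simplification under that assumption (it maps each user to a single key, whereas the server may keep several engines per user); the function name `engine_key` and the key format are hypothetical and not taken from ByteDance's implementation.

```shell
# Sketch: derive the cache key that decides which connections share an engine.
# Arguments: isolation level, user name, session id.
engine_key() {
  local level="$1" user="$2" session_id="$3"
  case "$level" in
    session) printf 'session/%s/%s\n' "$user" "$session_id" ;;  # one engine per connection
    user)    printf 'user/%s\n' "$user" ;;                      # shared by one user's connections
    open)    printf 'open/shared\n' ;;                          # shared by all users
    *)       printf 'unknown isolation level: %s\n' "$level" >&2; return 1 ;;
  esac
}

engine_key session alice s-001   # prints session/alice/s-001
engine_key user    alice s-002   # prints user/alice
engine_key open    bob   s-003   # prints open/shared
```

Under this model, the lifetime rules follow from the key: a session key becomes unreachable once its connection closes, while user and open keys stay valid until an idle timeout evicts them.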

Transaction Support

Hive implements transactions via HiveServer2. SparkSQL, when integrated with Iceberg or Hudi, gains transactional capabilities without a separate transaction manager. Example Iceberg MERGE statement:

<code>MERGE INTO prod.nyc.taxis pt USING (SELECT * FROM staging.nyc.taxis) st ON pt.id = st.id WHEN NOT MATCHED THEN INSERT *</code>

The team also avoided driver out-of-memory (OOM) errors by merging small broadcasts.

Conclusion

As enterprise data‑warehouse needs become more complex, SparkSQL‑based architectures offer flexible, efficient alternatives to Hive. ByteDance EMR’s enhancements—data‑lake integration, custom SQL server, multi‑tenant isolation, and transaction support—demonstrate a scalable path forward.

Written by

ByteDance Data Platform

The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.
