Optimizing SparkSQL: ByteDance EMR’s Data Lake Integration and Multi‑Tenant Server
ByteDance’s EMR team details how they integrated data‑lake engines such as Hudi and Iceberg into SparkSQL, streamlined jar management, built a custom Spark SQL Server with Hive compatibility, multi‑tenant support, engine pre‑warming, and transaction capabilities, dramatically improving performance and resource efficiency for enterprise workloads.
Data Lake Engine Integration
Data‑lake engines such as Hudi and Iceberg are increasingly widely used. Integrating them into SparkSQL exposed a number of issues for ByteDance EMR.
Iceberg integration: Users previously had to add many commands by hand and locate the spark‑iceberg dependency themselves. The solution was to pre‑install the required Iceberg JARs in Spark’s jars directory, so that users only need to specify the catalog.
Hive compatibility: Tables that Spark creates with Iceberg are readable by Presto/Trino, but Hive cannot read them unless iceberg.engine.hive.enabled=true is set. Setting this property in the Spark or Hive configuration resolves the problem.
Because Iceberg 0.12 does not support Spark 3.2, the team compiled a master‑snapshot version of Iceberg to work with Spark 3.2.
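With the Iceberg JARs already on Spark's classpath, "specifying the catalog" comes down to a handful of Spark properties. The sketch below assembles them as `--conf` arguments. The catalog name `iceberg`, the Hive-metastore catalog type, and the helper function names are assumptions for illustration, not the EMR team's actual settings; the property keys and class names are the ones Iceberg documents for Spark 3.

```python
def iceberg_catalog_confs(catalog="iceberg", catalog_type="hive"):
    """Spark properties that register an Iceberg catalog (catalog name is illustrative)."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg's SQL extensions, which MERGE INTO relies on.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Binds the catalog name to Iceberg's Spark catalog implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Assumption: table metadata is tracked in a Hive metastore.
        f"{prefix}.type": catalog_type,
    }


def to_cli_args(confs):
    """Flatten properties into `--conf key=value` pairs for spark-sql or spark-submit."""
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Prints a spark-sql command line carrying the three catalog properties.
    print(" ".join(["spark-sql"] + to_cli_args(iceberg_catalog_confs())))
```

Placing the same properties in `spark-defaults.conf` would presumably remove even this step for end users, which matches the goal described above.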
Spark SQL Server
Existing tools such as Spark Thrift Server and Kyuubi do not meet certain requirements of enterprise (B2B) customers, so ByteDance EMR built a custom Spark SQL Server.
Hive‑semantic compatibility: A SQL parser chain injects Hive SQL parsing into Spark, achieving full Hive‑SQL compatibility.
Pre‑initialization of Spark engines: Spark jobs are submitted to YARN in advance, keeping engines ready and reducing task‑submission latency.
Cross‑YARN‑queue submission: Users can specify the YARN queue for their jobs.
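The pre-initialization idea can be pictured as a small pool of engines that are started before any query arrives. This is a minimal sketch under stated assumptions, not the server's actual implementation: `WarmEnginePool`, the `launch` callable, and the `target` size are all hypothetical names, and the launch function stands in for submitting a Spark application to a given YARN queue.

```python
import itertools
from collections import deque


class WarmEnginePool:
    """Keeps `target` engines per queue started ahead of time so queries skip the launch wait."""

    def __init__(self, launch, target=2):
        self._launch = launch   # hypothetical callable: YARN queue name -> engine handle
        self._target = target
        self._ready = {}        # YARN queue name -> deque of idle, already-started engines

    def warm(self, queue):
        """Top the given queue up to the target number of idle engines."""
        ready = self._ready.setdefault(queue, deque())
        while len(ready) < self._target:
            ready.append(self._launch(queue))

    def acquire(self, queue):
        """Hand out a pre-started engine if one is idle, otherwise launch on demand."""
        ready = self._ready.setdefault(queue, deque())
        engine = ready.popleft() if ready else self._launch(queue)
        # Replace what was taken; a real server would do this asynchronously.
        self.warm(queue)
        return engine


if __name__ == "__main__":
    ids = itertools.count(1)
    pool = WarmEnginePool(lambda queue: f"{queue}-engine-{next(ids)}", target=2)
    pool.warm("etl")            # engines are started before any user connects
    print(pool.acquire("etl"))  # served from the warm pool, no launch latency
```

Keying the pool by queue name is one way to combine pre-warming with per-user queue selection: a request for a queue with no warm engine simply falls back to an on-demand launch.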
The server implements Thrift/JDBC interfaces, supports OpenLDAP, Kerberos, and offers session‑level and user‑level isolation.
<code>./bin/beeline -u "jdbc:hive2://emr-5fqkwudj144d2gc1k8hi-master-1/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=midas/ha;auth=LDAP" -n emr_dev -pEMR123456emr</code>
<code>./bin/beeline -u "jdbc:hive2://emr-master-2:10005/default;auth=LDAP" -n test_sub -pEMR123456emr</code>
Multi‑Tenant Isolation
Three isolation levels are provided:
Session: A new Spark engine is launched per connection and destroyed after disconnection.
User: Multiple engines are shared by the same user; engines persist until idle timeout.
Open: Engines are shared across all users, suitable for large accounts or clusters without strict permission management.
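One way to read the three levels is as different cache keys for engines: connections that map to the same key share an engine. The sketch below is an illustration only. `Isolation`, `engine_key`, and `EngineRegistry` are hypothetical names, the engine is a placeholder object rather than a Spark application, and the user level is simplified to a single engine per user even though the description above allows several.

```python
from enum import Enum


class Isolation(Enum):
    SESSION = "session"
    USER = "user"
    OPEN = "open"


def engine_key(level, user, session_id):
    """Return the key an engine is cached under; equal keys share an engine."""
    if level is Isolation.SESSION:
        return ("session", session_id)  # one engine per connection
    if level is Isolation.USER:
        return ("user", user)           # shared by all of one user's connections
    return ("open",)                    # shared by everyone


class EngineRegistry:
    def __init__(self, level):
        self.level = level
        self.engines = {}

    def engine_for(self, user, session_id):
        key = engine_key(self.level, user, session_id)
        # Placeholder engine; a real server would submit a Spark application here.
        return self.engines.setdefault(key, {"key": key})

    def close_session(self, user, session_id):
        # Only session-level engines end with the connection;
        # user and open engines would instead wait for an idle timeout.
        if self.level is Isolation.SESSION:
            self.engines.pop(engine_key(self.level, user, session_id), None)
```

For example, with `EngineRegistry(Isolation.USER)`, two connections from the same user resolve to the same engine object, while a connection from a different user gets its own.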
Transaction Support
Hive implements transactions via HiveServer2. SparkSQL, when integrated with Iceberg or Hudi, gains transactional capabilities without a separate transaction manager. Example Iceberg MERGE statement:
<code>MERGE INTO prod.nyc.taxis pt USING (SELECT * FROM staging.nyc.taxis) st ON pt.id = st.id WHEN NOT MATCHED THEN INSERT *</code>
By merging small broadcasts, the team avoided driver OOM issues.
Conclusion
As enterprise data‑warehouse needs become more complex, SparkSQL‑based architectures offer flexible, efficient alternatives to Hive. ByteDance EMR’s enhancements—data‑lake integration, custom SQL server, multi‑tenant isolation, and transaction support—demonstrate a scalable path forward.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.