Boosting Data Sharing Architecture: JDBC Limits, DistCp Speed & Kerberos Trust
This article examines the evolution of a data‑sharing exchange platform—moving from slow JDBC‑based transfers to storage‑level copying, introducing a two‑stage DistCp workflow, and securing cross‑cluster access with Kerberos‑based trust managed by the Guardian component.
In the previous article we explored the basic architecture of a data‑sharing exchange platform, which met fundamental requirements but suffered from insufficient speed when using JDBC for data transfer.
(1) Level‑One Advancement
Because Inceptor stores data on HDFS, we considered bypassing the slow JDBC layer and copying data directly at the storage level, then recreating tables from the schema. Two new namespaces were added: tdc‑jobs for extraction tasks and dataplatform as a data‑transfer zone.
The metadata component records table schemas, and tenant data‑request descriptions include the required schema. The data flow changes from a simple JDBC operation to three steps:
Step 1: The workflow uses a data connector to TDH and runs an INSERT OVERWRITE SQL to export data to a specific HDFS location.
Step 2: A pod created in the tdc‑jobs namespace pulls the data from TDH and puts it into the tenant’s HDFS.
Step 3: In the tenant’s database, an external table is created based on the obtained schema, completing the task and sending a notification.
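The three steps above can be sketched as the SQL statements the workflow would assemble. This is a minimal illustration, not the platform's actual implementation: the table name, column types, and HDFS paths are hypothetical, and in practice the workflow engine and data connector drive these statements.

```python
# Hypothetical sketch of the export flow's generated SQL.
# Paths, table names, and schemas are placeholders.

def export_sql(table, export_path):
    # Step 1: run inside TDH via the data connector to dump the
    # table's data to a specific HDFS location.
    return (f"INSERT OVERWRITE DIRECTORY '{export_path}' "
            f"SELECT * FROM {table}")

def external_table_ddl(table, columns, location):
    # Step 3: recreate the table in the tenant database from the
    # schema recorded by the metadata component.
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return (f"CREATE EXTERNAL TABLE {table} ({cols}) "
            f"LOCATION '{location}'")

sql = export_sql("orders", "/tmp/export/orders")
ddl = external_table_ddl("orders",
                         [("id", "INT"), ("amount", "DOUBLE")],
                         "/user/tenant1/orders")
```

Step 2, the copy between clusters, is handled by the pod in the tdc-jobs namespace rather than by SQL.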
Although this approach is faster than JDBC, large-file transfers across clusters are still limited by network bandwidth and disk I/O.
(2) Level‑Two Advancement
The speed issue can be addressed with Hadoop's distcp, which leverages the cluster's distributed capabilities to copy data directly between DataNodes. We implemented a third solution: distcp tasks are launched in both the platform-level YARN and the tenant-level YARN, with YARN managing their lifecycle, and data is pulled into the tenant's HDFS in two stages.
A two-stage pull is necessary because Kerberos authentication prevents direct access between TDH and the tenant. Distcp only supports Kerberos trust at the storage layer, so we configure mutual trust between the platform and TDH, and between the tenant and the platform, enabling secure data movement end to end.
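The two-stage pull can be sketched as two distcp invocations, one per YARN. This is a hedged illustration: the namespace names, paths, and queue names are placeholders, not the platform's actual configuration.

```python
# Hypothetical sketch of the two-stage distcp pull.
# Cluster namespaces, paths, and queue names are placeholders.

def distcp_cmd(src, dst, queue):
    # hadoop distcp submitted to the given YARN queue;
    # -update skips files already present at the destination,
    # -p preserves permissions and ownership.
    return ["hadoop", "distcp",
            f"-Dmapreduce.job.queuename={queue}",
            "-update", "-p", src, dst]

# Stage 1: platform-level YARN pulls from TDH into the
# dataplatform transfer zone.
stage1 = distcp_cmd("hdfs://tdh-ns/warehouse/export/orders",
                    "hdfs://dataplatform/transfer/orders",
                    "platform")

# Stage 2: tenant-level YARN pulls from the transfer zone
# into the tenant's own HDFS.
stage2 = distcp_cmd("hdfs://dataplatform/transfer/orders",
                    "hdfs://tenant-ns/user/tenant1/orders",
                    "tenant1")
```

Each stage crosses exactly one trust boundary (TDH↔platform, then platform↔tenant), which is why storage-layer Kerberos trust at those two seams is sufficient.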
(3) Full‑Cloud Platform
For scenarios where data originates in the cloud rather than on‑premises, the architecture simplifies: the platform‑level data platform serves as the central data hub, as shown in the diagram below.
Authentication and Permissions
The Guardian security manager, provided by Xinghuan, handles user authentication and fine‑grained permission control, supporting Kerberos, multi‑level permissions, and domain trust.
All services run with Kerberos enabled for encrypted communication. Guardian’s plug‑in model lets each service define its own permission rules, such as table‑, row‑, or column‑level controls for Inceptor, with full audit capability.
Cross‑cluster trust (two‑way, outgoing, incoming) enables Kerberos authentication between clusters, allowing data to flow securely while keeping permission information within each domain.
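At the Kerberos level, cross-realm trust of this kind is typically established by creating matching krbtgt principals in both KDCs (e.g. krbtgt/TDH.REALM@PLATFORM.REALM) and declaring the trust path in krb5.conf. The fragment below is a generic sketch with placeholder realm names, not the platform's actual configuration:

```ini
# Placeholder realm names; direct trust paths ("." means no
# intermediate realm) between the platform and its neighbors.
[capaths]
    PLATFORM.REALM = {
        TDH.REALM = .
    }
    TENANT1.REALM = {
        PLATFORM.REALM = .
    }
```

Because the trust path runs through the platform realm, the tenant never authenticates directly against TDH, matching the two-stage data flow.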
Each cluster’s Guardian includes a preset dataadmin user that acts as a cross‑domain identity, proxies tenant access to TDH, and launches distcp tasks. This simplifies user management and reduces the number of security configuration files.
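Before launching a cross-cluster distcp task, a process acting as dataadmin would first obtain a Kerberos ticket for that principal. The keytab path and realm below are hypothetical placeholders:

```python
# Hypothetical sketch: obtain a ticket for the preset dataadmin
# principal before launching distcp. Keytab path and realm are
# placeholders, not the platform's real values.

def dataadmin_kinit(realm,
                    keytab="/etc/security/keytabs/dataadmin.keytab"):
    # kinit -kt <keytab> <principal> authenticates non-interactively
    # from a keytab, suitable for automated jobs.
    return ["kinit", "-kt", keytab, f"dataadmin@{realm}"]

cmd = dataadmin_kinit("PLATFORM.REALM")
```

Centralizing on one such principal per cluster is what lets the platform avoid distributing per-tenant credentials and security configuration files.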
Summary
The article presented key design choices for a data‑sharing exchange architecture, including resource‑controlled namespaces, YARN queues, and high‑availability services. While the current solution is robust, further optimizations such as advanced scheduling and support for additional data types are possible.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
StarRing Big Data Open Lab
Focused on big data technology research, exploring the Big Data era | [email protected]
