Ctrip's Experience with Alluxio in Its Big Data Platform: Architecture, Transparent Access, Custom Authentication, CallerContext, and Dynamic Configuration
This article details how Ctrip, a leading travel company, leverages Alluxio as a distributed cache within its extensive big‑data infrastructure to improve data access speed, implement transparent storage access, support custom authentication and multi‑tenant features, enhance audit logging with CallerContext, and dynamically distribute client configurations via Kyuubi.
Ctrip continuously optimizes its big‑data platform and uses Alluxio as a distributed caching layer to accelerate data reads and improve overall processing efficiency.
The platform comprises scheduling, reporting, real‑time query, metadata management, data quality, and data transfer services, supporting compute engines such as Spark, Hive, Presto/Trino, Kyuubi, and StarRocks; Alluxio caches data so that subsequent reads can be served directly from memory.
To achieve transparent access to underlying storage, Ctrip built a custom TripCustomFileSystem that replaces the default client, configures fs.hdfs.impl to point to the custom implementation, introduces an alluxio.use.alluxio.for.read switch, and provides a fallback to native HDFS. Challenges such as Spark‑Yarn delegation tokens, inconsistent modification times, and worker‑failure handling were resolved by rewriting token methods, preferring HDFS writes for consistency, and checking worker health before reads.
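The read‑path logic described above can be sketched as follows. This is a minimal illustrative model, not Ctrip's actual Java implementation: the real TripCustomFileSystem is a Hadoop FileSystem subclass registered via fs.hdfs.impl, and the client/method names here (`has_healthy_workers`, the injected clients) are hypothetical stand‑ins.

```python
class TripCustomFileSystem:
    """Illustrative sketch of the transparent read path: route reads
    through Alluxio when the switch is on and workers are healthy,
    otherwise fall back to native HDFS."""

    def __init__(self, conf, alluxio_client, hdfs_client):
        self.conf = conf            # e.g. a dict standing in for Hadoop Configuration
        self.alluxio = alluxio_client
        self.hdfs = hdfs_client

    def open(self, path):
        # Honor the alluxio.use.alluxio.for.read switch described in the article.
        use_alluxio = self.conf.get("alluxio.use.alluxio.for.read", "false") == "true"
        # Check worker health before reading, and fall back to HDFS on failure,
        # so callers never see the cache layer.
        if use_alluxio and self.alluxio.has_healthy_workers():
            try:
                return self.alluxio.open(path)
            except IOError:
                pass  # transparent fallback when the Alluxio read fails
        return self.hdfs.open(path)
```

The key design point is that fallback is decided per read, so a degraded Alluxio cluster slows nothing down beyond one failed attempt.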
Performance tests on representative SQL workloads showed a 40‑50% speedup for read‑intensive queries, while compute‑heavy or small‑data queries saw limited gains, highlighting the importance of workload characteristics when using Alluxio caching.
For security, Ctrip implemented custom authentication (beyond SIMPLE and NOSASL) by adding password configuration, an AuthProvider with caching, and a custom authentication flow. Multi‑tenant support was enhanced by recording user names in read handlers to ensure correct identity propagation.
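A password‑based provider with result caching, in the spirit of the custom authentication flow above, might look like the following sketch. The class name, hashing scheme, and cache layout are illustrative assumptions; Ctrip's actual provider plugs into Alluxio's Java authentication SPI.

```python
import hashlib
import time

class CachedAuthProvider:
    """Hypothetical password-checking AuthProvider with a TTL cache,
    so repeated logins by the same user skip re-verification."""

    def __init__(self, password_db, cache_ttl_secs=300):
        self._db = password_db      # user -> sha256 hex digest of password
        self._ttl = cache_ttl_secs
        self._cache = {}            # (user, digest) -> expiry timestamp

    @staticmethod
    def _digest(password):
        return hashlib.sha256(password.encode()).hexdigest()

    def authenticate(self, user, password):
        key = (user, self._digest(password))
        now = time.time()
        # Serve recent successful logins from the cache.
        if self._cache.get(key, 0) > now:
            return True
        ok = self._db.get(user) == self._digest(password)
        if ok:
            self._cache[key] = now + self._ttl
        return ok
```

Only successful results are cached, so a revoked or mistyped password is always re‑checked against the store.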
Kerberos ticket expiration issues were addressed by adding expiration policies (expireAfterWrite) to the cached FileSystem tickets, preventing authentication failures.
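The expire‑after‑write idea can be shown with a small sketch, analogous to Guava's `expireAfterWrite` policy in Java: entries are evicted based on when they were created, not when they were last read, so a cached FileSystem whose Kerberos ticket may have lapsed is rebuilt rather than reused. The loader and clock parameters are illustrative.

```python
import time

class ExpireAfterWriteCache:
    """Minimal expire-after-write cache sketch: values older than the TTL
    are reloaded, mirroring how stale-ticket FileSystem instances are
    dropped and re-created with fresh credentials."""

    def __init__(self, ttl_secs, loader, clock=time.time):
        self._ttl = ttl_secs
        self._loader = loader     # e.g. a function creating a fresh FileSystem
        self._clock = clock       # injectable for testing
        self._entries = {}        # key -> (value, written_at)

    def get(self, key):
        entry = self._entries.get(key)
        now = self._clock()
        if entry is None or now - entry[1] >= self._ttl:
            # Missing or older than the TTL: reload and restamp the write time.
            value = self._loader(key)
            self._entries[key] = (value, now)
            return value
        return entry[0]
```

Expiring on write time rather than access time is the crucial choice here: a hot entry under access‑based expiry could live forever and never pick up a renewed ticket.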
The team extended Alluxio’s audit logging (master_audit.log) with a customized full‑link CallerContext, propagating context from the client through the Alluxio master to HDFS, and integrated these logs into Hive tables for easy querying and analysis.
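Full‑link propagation can be pictured as each hop appending its identity to the context it received, so the audit line records the whole chain. The field names and separators below are illustrative assumptions, not Alluxio's exact `master_audit.log` format.

```python
def build_caller_context(user, app, upstream_context=None):
    """Hypothetical sketch: each hop (client -> Alluxio master -> HDFS)
    appends its own identity to the context it received upstream."""
    hop = f"user={user},app={app}"
    return f"{upstream_context}|{hop}" if upstream_context else hop

def audit_log_line(allowed, user, cmd, src, caller_context):
    # Approximate the shape of an audit-log entry, with the full-link
    # caller context appended for downstream analysis in Hive tables.
    return (f"succeeded={allowed}\tugi={user}\tcmd={cmd}\tsrc={src}"
            f"\tcallerContext={caller_context}")
```

Because the context is a flat delimited string, loading these log lines into a Hive table and splitting on the delimiter is enough to answer "which user and application triggered this HDFS access".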
Recognizing the limits of global configuration, Ctrip used Kyuubi to dynamically distribute Alluxio client settings per user, leveraging Spark’s spark.hadoop. prefix, Alluxio’s configuration merge, and Kyuubi’s per‑user property syntax, managed via the Qconfig service.
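Putting those three mechanisms together, a per‑user entry in Kyuubi's defaults file might look like the sketch below. The user names and property values are invented for illustration; the pattern relies on Kyuubi's `___username___.` per‑user scoping, Spark forwarding `spark.hadoop.*` keys into the Hadoop Configuration, and the Alluxio client merging `alluxio.*` keys from that configuration.

```properties
# kyuubi-defaults.conf -- illustrative values, not Ctrip's actual settings,
# which are managed centrally via the Qconfig service.
# Scope: only sessions of etl_user read through the Alluxio cache.
___etl_user___.spark.hadoop.alluxio.use.alluxio.for.read=true
___etl_user___.spark.hadoop.alluxio.user.file.readtype.default=CACHE
# Other users keep the global default and read HDFS directly.
___adhoc_user___.spark.hadoop.alluxio.use.alluxio.for.read=false
```

This keeps cache rollout incremental: one tenant at a time can be switched over without touching cluster‑wide configuration.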
Future plans include broader Alluxio cache adoption, exploring Alluxio Fuse for AI workloads, scaling clusters and user coverage, and continuously improving the data‑ecosystem’s performance, stability, and security.
The article concludes with a Q&A discussing community interaction, PR review latency, and the need for faster feedback loops.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.