Full-Stack Machine Learning Platform: Architecture, Key Factors, and Implementation Details
This article examines the evolution of user data, computing power, and models, and presents the design principles, key architectural factors, and practical implementation techniques for building a full‑stack machine learning platform that supports large‑scale data processing, distributed training, and low‑latency online serving.
The rapid accumulation of user data, rising user‑experience expectations, and breakthroughs in computing power and model architectures have made machine learning platforms the critical infrastructure that connects data, computation, and user interaction.
Full‑stack Machine Learning Platform – A platform must support the entire data‑science lifecycle, from data ingestion and feature engineering to model training, optimization, and online serving. Key design considerations include:
Seamless full‑stack user experience.
Decoupling engineering concerns from data‑science workflows.
Open, extensible architecture to accommodate diverse teams and tasks.
Key Architectural Factors
Legacy System Integration: Leverage existing storage, resource‑scheduling, and task‑scheduling systems to reduce implementation cost and improve user experience.
Data and Model Scale: Data volume, model complexity, and training‑time tolerance dictate technology choices and core capabilities.
Resource Management & Scheduling: Use Kubernetes or YARN to create resource queues and manage heterogeneous resources efficiently.
Pipeline: Organize data ingestion, feature processing, training, and serving into pipelines that share memory and avoid inefficient I/O.
Serving System: Provide stateless micro‑service‑based serving with low end‑to‑end latency, elastic scaling, model versioning, monitoring, and A/B testing.
Data Storage & Representation: Support dense and sparse formats, single‑node or distributed storage, and appropriate file formats (e.g., CSR, TFRecord).
Parallelism & Compute Acceleration: Utilize GPU CUDA, CPU libraries, and high‑speed networks (InfiniBand, RDMA) to accelerate training.
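To make the storage factor concrete, here is a minimal sketch of the CSR (Compressed Sparse Row) layout mentioned above: a sparse matrix stored as three flat arrays (values, column indices, row pointers). This is illustrative only, not a production format.

```python
def to_csr(dense):
    """Convert a dense 2-D list into the CSR triple (values, col_indices, row_ptr)."""
    values, col_indices, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_indices.append(j)
        # row_ptr[i+1] - row_ptr[i] = number of non-zeros in row i
        row_ptr.append(len(values))
    return values, col_indices, row_ptr


def csr_row(csr, i, ncols):
    """Reconstruct one dense row of width ncols from the CSR triple."""
    values, col_indices, row_ptr = csr
    row = [0] * ncols
    for k in range(row_ptr[i], row_ptr[i + 1]):
        row[col_indices[k]] = values[k]
    return row
```

Because only non-zero entries are stored, memory scales with the number of non-zeros rather than rows × columns, which is why CSR suits the highly sparse feature matrices typical of CTR workloads.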
Architecture Design Example
For a CTR scenario with TB‑scale user/item data stored in HDFS, the platform integrates Hadoop and Spark:
Submit a Spark job via YARN.
Within Spark, call foreachPartition so each node holds an RDD partition in memory.
On each worker, invoke TensorFlow or Torch through JNI or a native process for model computation.
This effectively turns Spark into a container scheduler while reusing Hadoop's data‑locality and storage efficiencies. Projects such as TensorFlowOnSpark and Angel follow the same approach.
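The per-partition pattern above can be sketched without a Spark cluster. In this hedged simulation, `score` stands in for the native TensorFlow/Torch call that each worker would make over its in-memory partition; the point is that model state is set up once per partition, not once per row.

```python
def score(batch):
    # Placeholder for the native model call (TF/Torch via JNI or a
    # subprocess); here we just "predict" the sum of each row's features.
    return [sum(row) for row in batch]


def foreach_partition(partitions, fn):
    """Apply fn once per partition, mimicking Spark's foreachPartition:
    fn sees the whole partition, so expensive model initialization can
    be amortized across all rows in it."""
    results = []
    for part in partitions:
        results.extend(fn(part))
    return results
```

In real Spark code the loop is replaced by `rdd.foreachPartition(fn)` running on the executors, with each partition already resident in executor memory.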
Custom data sources are supported by implementing FileReader and OutputWriter and registering them via org.apache.spark.sql.sources.DataSourceRegister. Custom serialization of the exchanged records can be supplied through Kryo's write/read hooks:

def write(kryo: Kryo, out: Output): Unit = {
  // custom serialization logic
}

def read(kryo: Kryo, in: Input): Unit = {
  // custom deserialization logic
}

Feature Processing Pipeline
Features are bucketed, standardized, and one‑hot encoded using Spark ML Pipelines, which provide:
In‑memory computation to accelerate feature processing.
Consistent feature maps across fit, transform, and model save phases.
Support for both batch and streaming updates of feature maps.
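A hedged pure-Python sketch of the bucketize-then-one-hot step described above: Spark ML's Bucketizer and OneHotEncoder do this at scale, while this version just shows the transformation and the fixed feature map (the split points and bucket count) that must stay consistent across fit, transform, and model save.

```python
import bisect


def bucketize(value, splits):
    """Return the bucket index of value given sorted split points.
    len(splits) split points define len(splits) + 1 buckets."""
    return bisect.bisect_right(splits, value)


def one_hot(index, size):
    """Encode a bucket index as a one-hot vector of the given width."""
    vec = [0] * size
    vec[index] = 1
    return vec
```

Keeping `splits` (the feature map) saved alongside the model is what guarantees that online serving buckets a raw value exactly as training did.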
Distributed Training
Two main parallelism strategies are discussed:
Data Parallelism : Replicate the model on each worker, partition the data, compute local gradients, and aggregate them using Parameter Server or Ring All‑Reduce.
Model Parallelism : Split large models across workers. Linear models can be partitioned by feature dimension, while deep neural networks are partitioned by layers or cross‑layer partitions, requiring high‑bandwidth network communication.
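The data-parallel strategy above can be sketched in a few lines. This is an illustrative toy, not a real Parameter Server: plain Python floats stand in for tensors, and a mean-squared-error linear model keeps the gradient math trivial. Each worker computes a local gradient on its shard, and the aggregation step averages them, which is exactly what a parameter server or a ring all-reduce computes.

```python
def local_gradient(w, shard):
    """d/dw of mean (w*x - y)^2 over this worker's data shard."""
    g = 0.0
    for x, y in shard:
        g += 2 * (w * x - y) * x
    return g / len(shard)


def all_reduce_mean(grads):
    """Aggregation step: average the per-worker gradients."""
    return sum(grads) / len(grads)


def step(w, shards, lr=0.1):
    """One synchronous data-parallel SGD step across all workers."""
    grads = [local_gradient(w, s) for s in shards]
    return w - lr * all_reduce_mean(grads)
```

The difference between Parameter Server and Ring All-Reduce is only in *how* `all_reduce_mean` is realized over the network, not in the numerical result of a synchronous step.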
For CTR models with billions of sparse features, two deployment schemes are highlighted:
Host‑Device : Store embeddings in host memory and copy them to the device for computation, reducing network traffic.
Embedding Hash Table (e.g., NVIDIA HugeCTR): Distribute embeddings across devices using open‑addressing hash tables, with reduce_scatter for model transfer and all_gather for aggregation.
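A minimal sketch of the embedding-hash-table idea: sparse feature IDs are mapped to embedding slots through an open-addressing (linear-probing) hash table, so billions of IDs need no dense index. The capacity, embedding dimension, and zero-initialization here are illustrative assumptions, not HugeCTR's actual implementation.

```python
class EmbeddingHashTable:
    def __init__(self, capacity=8, dim=4):
        self.capacity = capacity
        self.dim = dim
        self.keys = [None] * capacity     # feature IDs
        self.vectors = [None] * capacity  # embedding rows

    def _slot(self, feature_id):
        """Linear probing: start at hash(id) % capacity and walk to the
        first slot that is empty or already holds this id.
        (A real table would also handle the table-full case.)"""
        i = hash(feature_id) % self.capacity
        while self.keys[i] is not None and self.keys[i] != feature_id:
            i = (i + 1) % self.capacity
        return i

    def lookup(self, feature_id):
        """Return the embedding for feature_id, inserting a fresh
        zero vector on first access."""
        i = self._slot(feature_id)
        if self.keys[i] is None:
            self.keys[i] = feature_id
            self.vectors[i] = [0.0] * self.dim
        return self.vectors[i]
```

In a multi-device deployment, each device would own one such table for its shard of the key space, with reduce_scatter/all_gather moving the looked-up rows between devices.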
Online Serving
Serving demands high concurrency, low latency, high throughput, and elastic scaling. Kubernetes provides the elastic scaling, while model quantization and multi‑level caching reduce latency. Proper cache design balances cost, capacity, speed, and update strategy.
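One level of the multi-level caching mentioned above can be sketched as a small in-process LRU layer in front of a slower store. The `loader` callback here is an illustrative stand-in for the fallback to the next tier (e.g., a remote feature or model store), not a real serving API.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader        # fallback to the slower tier on a miss
        self.data = OrderedDict()   # insertion order tracks recency
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)  # mark as most recently used
            return self.data[key]
        self.misses += 1
        value = self.loader(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
        return value
```

Tracking hit/miss counters like this is also how the cost/capacity/speed trade-off gets tuned in practice: capacity is grown until the hit rate justifies the memory spend.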
References
Hidden Technical Debt in Machine Learning Systems
Data locality in Hadoop
TensorFlowOnSpark, Angel, Mindspore, HugeCTR
Spark ML Pipelines and data types
Performance comparison of storage formats for sparse matrices
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.