Full-Stack Machine Learning Platform: Architecture, Key Factors, and Implementation Details
This article examines the evolution of user data, computing power, and models, and presents the design principles, key architectural factors, and practical implementation techniques for building a full‑stack machine learning platform that supports large‑scale data processing, distributed training, and low‑latency online serving.
The rapid accumulation of user data, rising user‑experience expectations, and breakthroughs in computing power and model architectures have made machine learning platforms the critical infrastructure that connects data, computation, and user interaction.
Full‑stack Machine Learning Platform – A platform must support the entire data‑science lifecycle, from data ingestion and feature engineering to model training, optimization, and online serving. Key design considerations include:
Seamless full‑stack user experience.
Decoupling engineering concerns from data‑science workflows.
Open, extensible architecture to accommodate diverse teams and tasks.
Key Architectural Factors
Legacy System Integration: Leverage existing storage, resource‑scheduling, and task‑scheduling systems to reduce implementation cost and improve user experience.
Data and Model Scale: Data volume, model complexity, and training‑time tolerance dictate technology choices and core capabilities.
Resource Management & Scheduling: Use Kubernetes or YARN to create resource queues and manage heterogeneous resources efficiently.
Pipeline: Organize data ingestion, feature processing, training, and serving into pipelines that share memory and avoid inefficient I/O.
Serving System: Provide stateless micro‑service‑based serving with low end‑to‑end latency, elastic scaling, model versioning, monitoring, and A/B testing.
Data Storage & Representation: Support dense and sparse formats, single‑node or distributed storage, and appropriate file formats (e.g., CSR, TFRecord).
Parallelism & Compute Acceleration: Utilize GPU CUDA, CPU libraries, and high‑speed networks (InfiniBand, RDMA) to accelerate training.
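To make the storage factor concrete, here is a minimal sketch of the CSR (Compressed Sparse Row) layout mentioned above: a sparse matrix stored as three flat arrays (values, column indices, row pointers). This is illustrative only, not a production format.

```python
def to_csr(dense):
    """Convert a dense 2-D list into the CSR triple (values, col_indices, row_ptr)."""
    values, col_indices, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_indices.append(j)
        # row_ptr[i+1] - row_ptr[i] = number of non-zeros in row i
        row_ptr.append(len(values))
    return values, col_indices, row_ptr


def csr_row(csr, i, ncols):
    """Reconstruct one dense row of width ncols from the CSR triple."""
    values, col_indices, row_ptr = csr
    row = [0] * ncols
    for k in range(row_ptr[i], row_ptr[i + 1]):
        row[col_indices[k]] = values[k]
    return row
```

Because only non-zero entries are stored, memory scales with the number of non-zeros rather than rows × columns, which is why CSR suits the highly sparse feature matrices typical of CTR workloads.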
Architecture Design Example
For a CTR scenario with TB‑scale user/item data stored in HDFS, the platform integrates Hadoop and Spark:
Submit a Spark job via YARN.
Within Spark, call foreachPartition so each node holds an RDD partition in memory.
On each worker, invoke TensorFlow or Torch through JNI or a native process for model computation.
This effectively turns Spark into a container scheduler while reusing Hadoop's data‑locality and storage efficiencies. Projects such as TensorFlowOnSpark and Angel follow the same approach.
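The per-partition pattern above can be sketched without a Spark cluster. In this hedged simulation, `score` stands in for the native TensorFlow/Torch call that each worker would make over its in-memory partition; the point is that model state is set up once per partition, not once per row.

```python
def score(batch):
    # Placeholder for the native model call (TF/Torch via JNI or a
    # subprocess); here we just "predict" the sum of each row's features.
    return [sum(row) for row in batch]


def foreach_partition(partitions, fn):
    """Apply fn once per partition, mimicking Spark's foreachPartition:
    fn sees the whole partition, so expensive model initialization can
    be amortized across all rows in it."""
    results = []
    for part in partitions:
        results.extend(fn(part))
    return results
```

In real Spark code the loop is replaced by `rdd.foreachPartition(fn)` running on the executors, with each partition already resident in executor memory.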
Custom data sources are supported by implementing FileReader and OutputWriter and registering them via org.apache.spark.sql.sources.DataSourceRegister. Custom serialization of the exchanged records can be supplied through Kryo's write/read hooks:

def write(kryo: Kryo, out: Output): Unit = {
  // custom serialization logic
}

def read(kryo: Kryo, in: Input): Unit = {
  // custom deserialization logic
}

Feature Processing Pipeline
Features are bucketed, standardized, and one‑hot encoded using Spark ML Pipelines, which provide:
In‑memory computation to accelerate feature processing.
Consistent feature maps across fit, transform, and model save phases.
Support for both batch and streaming updates of feature maps.
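A hedged pure-Python sketch of the bucketize-then-one-hot step described above: Spark ML's Bucketizer and OneHotEncoder do this at scale, while this version just shows the transformation and the fixed feature map (the split points and bucket count) that must stay consistent across fit, transform, and model save.

```python
import bisect


def bucketize(value, splits):
    """Return the bucket index of value given sorted split points.
    len(splits) split points define len(splits) + 1 buckets."""
    return bisect.bisect_right(splits, value)


def one_hot(index, size):
    """Encode a bucket index as a one-hot vector of the given width."""
    vec = [0] * size
    vec[index] = 1
    return vec
```

Keeping `splits` (the feature map) saved alongside the model is what guarantees that online serving buckets a raw value exactly as training did.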
Distributed Training
Two main parallelism strategies are discussed:
Data Parallelism : Replicate the model on each worker, partition the data, compute local gradients, and aggregate them using Parameter Server or Ring All‑Reduce.
Model Parallelism : Split large models across workers. Linear models can be partitioned by feature dimension, while deep neural networks are partitioned by layers or cross‑layer partitions, requiring high‑bandwidth network communication.
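The data-parallel strategy above can be sketched in a few lines. This is an illustrative toy, not a real Parameter Server: plain Python floats stand in for tensors, and a mean-squared-error linear model keeps the gradient math trivial. Each worker computes a local gradient on its shard, and the aggregation step averages them, which is exactly what a parameter server or a ring all-reduce computes.

```python
def local_gradient(w, shard):
    """d/dw of mean (w*x - y)^2 over this worker's data shard."""
    g = 0.0
    for x, y in shard:
        g += 2 * (w * x - y) * x
    return g / len(shard)


def all_reduce_mean(grads):
    """Aggregation step: average the per-worker gradients."""
    return sum(grads) / len(grads)


def step(w, shards, lr=0.1):
    """One synchronous data-parallel SGD step across all workers."""
    grads = [local_gradient(w, s) for s in shards]
    return w - lr * all_reduce_mean(grads)
```

The difference between Parameter Server and Ring All-Reduce is only in *how* `all_reduce_mean` is realized over the network, not in the numerical result of a synchronous step.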
For CTR models with billions of sparse features, two deployment schemes are highlighted:
Host‑Device : Store embeddings in host memory and copy them to the device for computation, reducing network traffic.
Embedding Hash Table (e.g., NVIDIA HugeCTR): Distribute embeddings across devices using open‑addressing hash tables, with reduce_scatter for model transfer and all_gather for aggregation.
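A minimal sketch of the embedding-hash-table idea: sparse feature IDs are mapped to embedding slots through an open-addressing (linear-probing) hash table, so billions of IDs need no dense index. The capacity, embedding dimension, and zero-initialization here are illustrative assumptions, not HugeCTR's actual implementation.

```python
class EmbeddingHashTable:
    def __init__(self, capacity=8, dim=4):
        self.capacity = capacity
        self.dim = dim
        self.keys = [None] * capacity     # feature IDs
        self.vectors = [None] * capacity  # embedding rows

    def _slot(self, feature_id):
        """Linear probing: start at hash(id) % capacity and walk to the
        first slot that is empty or already holds this id.
        (A real table would also handle the table-full case.)"""
        i = hash(feature_id) % self.capacity
        while self.keys[i] is not None and self.keys[i] != feature_id:
            i = (i + 1) % self.capacity
        return i

    def lookup(self, feature_id):
        """Return the embedding for feature_id, inserting a fresh
        zero vector on first access."""
        i = self._slot(feature_id)
        if self.keys[i] is None:
            self.keys[i] = feature_id
            self.vectors[i] = [0.0] * self.dim
        return self.vectors[i]
```

In a multi-device deployment, each device would own one such table for its shard of the key space, with reduce_scatter/all_gather moving the looked-up rows between devices.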
Online Serving
Serving demands high concurrency, low latency, high throughput, and elastic scaling. Kubernetes provides the elastic scaling, while model quantization and multi‑level caching reduce latency. Proper cache design balances cost, capacity, speed, and update strategy.
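One level of the multi-level caching mentioned above can be sketched as a small in-process LRU layer in front of a slower store. The `loader` callback here is an illustrative stand-in for the fallback to the next tier (e.g., a remote feature or model store), not a real serving API.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader        # fallback to the slower tier on a miss
        self.data = OrderedDict()   # insertion order tracks recency
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)  # mark as most recently used
            return self.data[key]
        self.misses += 1
        value = self.loader(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
        return value
```

Tracking hit/miss counters like this is also how the cost/capacity/speed trade-off gets tuned in practice: capacity is grown until the hit rate justifies the memory spend.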
References
Hidden Technical Debt in Machine Learning Systems
Data locality in Hadoop
TensorFlowOnSpark, Angel, Mindspore, HugeCTR
Spark ML Pipelines and data types
Performance comparison of storage formats for sparse matrices
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.