How TensorNet Supercharges Sparse Feature Training on TensorFlow
TensorNet is a TensorFlow‑based distributed training framework optimized for models with massive sparse features in advertising and recommendation. It dramatically reduces parameter‑synchronization overhead, supports near‑infinite feature dimensions, cut training time from hours to minutes in a production workload, and boosts inference performance by up to 35%.
What is TensorNet?
TensorNet is a distributed training framework built on TensorFlow, optimized for large‑scale sparse‑feature scenarios such as advertising recommendation. Its goal is to let TensorFlow users quickly train models with billions of sparse parameters.
Challenges of Training Large Sparse Feature Models
In advertising, search, and recommendation, deep models contain massive high‑dimensional discrete sparse features, leading to two main problems:
Huge training data (e.g., over 100 TB in a 360 advertising scenario).
Enormous model parameters (e.g., over 100 billion parameters).
Single‑machine training is slow; distributed training has become the industry standard.
Problems Using TensorFlow for Sparse Feature Models
TensorFlow, while popular, is not friendly to large sparse models because:
The supported feature dimension is limited by single‑machine memory.
Distributed training synchronizes all parameters, causing huge communication overhead for sparse models.
TensorNet Overview
TensorNet reuses all TensorFlow capabilities while adding specific support for massive sparse features.
Key improvements:
Enables near‑infinite sparse feature dimensions.
Reduces the volume of synchronized parameters to between one ten‑thousandth and one hundred‑thousandth of the original, cutting training time from 3.5 hours to 25 minutes in a real 360 ad workload.
When combined with split‑graph inference, yields about 35 % online performance gain.
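A quick back‑of‑envelope calculation shows why the reduction is so large. The numbers below are illustrative assumptions (embedding width, unique IDs per batch), not measurements from the 360 workload; only the ~100 billion total parameter count comes from the article:

```python
# Estimate the sync-traffic reduction from synchronizing only the
# sparse features that appear in a batch, instead of the full table.
# EMBEDDING_DIM and UNIQUE_IDS_PER_BATCH are assumed values.

EMBEDDING_DIM = 8              # floats per sparse feature (assumed)
TOTAL_SPARSE_PARAMS = 100e9    # ~100 billion parameters (from the article)
UNIQUE_IDS_PER_BATCH = 100_000 # a batch touches a tiny slice (assumed)

full_sync = TOTAL_SPARSE_PARAMS                    # naive: sync everything
batch_sync = UNIQUE_IDS_PER_BATCH * EMBEDDING_DIM  # only touched rows

reduction = full_sync / batch_sync
print(f"synced parameters shrink by ~1/{reduction:,.0f}")
```

With these assumptions the traffic shrinks by roughly a factor of 10⁵, consistent with the range quoted above.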
TensorNet Distributed Training Architecture
Supports both asynchronous and synchronous modes.
Asynchronous Architecture
In CPU‑only clusters, TensorNet uses separate parameter servers for sparse and dense parameters. A sparse parameter server is embedded within each worker, and sparse parameters are spread across workers via a distributed hash table. Dense parameters are merged into a single distributed array, which reduces the number of network requests.
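The distributed‑hash‑table idea can be sketched as follows. This is an illustration of the concept, not TensorNet's actual code; the shard count, hashing scheme, and lazy initialization are all assumptions:

```python
# Sketch: shard embedding rows across workers by hashing the feature ID.
import numpy as np

NUM_SHARDS = 4
EMBEDDING_DIM = 8

# Each worker hosts one shard: a plain dict from feature ID -> vector.
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_of(feature_id: int) -> int:
    """Route a feature ID to its owning shard by hash."""
    return feature_id % NUM_SHARDS

def lookup(feature_id: int) -> np.ndarray:
    """Fetch (or lazily create) the embedding row for an ID."""
    table = shards[shard_of(feature_id)]
    if feature_id not in table:
        # New IDs are initialized on first touch, so the feature
        # space is effectively unbounded ("near-infinite").
        table[feature_id] = np.random.default_rng(feature_id).normal(
            scale=0.01, size=EMBEDDING_DIM)
    return table[feature_id]

vec = lookup(123456789)
print(vec.shape)  # (8,)
```

Because rows are created on demand, no single machine ever has to hold a full embedding matrix over the entire ID space.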
Synchronous Architecture
Similar to TensorFlow’s MultiWorkerMirroredStrategy, but with a dedicated sparse parameter server and synchronization only for the sparse features present in the current batch, reducing communication to between one ten‑thousandth and one hundred‑thousandth of the original.
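Batch‑restricted synchronization can be illustrated with a small pure‑Python aggregation: workers exchange gradients only for the feature IDs their batches actually touched. The shapes and values below are assumptions; TensorNet's real implementation runs the equivalent inside its sparse parameter server:

```python
# Sketch: merge sparse gradients per feature ID across workers,
# instead of allreducing the full embedding table.
import numpy as np

EMBEDDING_DIM = 4

# Per-worker sparse gradients: {feature_id: gradient_vector}.
worker_grads = [
    {7: np.ones(EMBEDDING_DIM), 42: np.full(EMBEDDING_DIM, 2.0)},
    {42: np.full(EMBEDDING_DIM, 3.0), 99: np.ones(EMBEDDING_DIM)},
]

def sparse_allreduce(grads_per_worker):
    """Average gradients per feature ID over all workers."""
    merged = {}
    for grads in grads_per_worker:
        for fid, g in grads.items():
            merged[fid] = merged.get(fid, 0) + g
    n = len(grads_per_worker)
    return {fid: g / n for fid, g in merged.items()}

reduced = sparse_allreduce(worker_grads)
print(sorted(reduced))  # only the 3 IDs seen this step: [7, 42, 99]
print(reduced[42][0])   # (2.0 + 3.0) / 2 workers = 2.5
```

Only three rows are exchanged here, regardless of how many billions of rows the full table contains.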
Core Optimizations
The main optimization is minimizing the embedding tensor size. Instead of a gigantic embedding matrix covering all possible IDs, TensorNet builds a small embedding matrix sized to the batch, using a virtual sparse feature to map IDs to indices.
During training, the embedding_lookup workflow is:
Define the embedding matrix dimension as the maximum number of unique IDs in a batch.
Collect all IDs in the current batch.
Sort IDs and assign a continuous index (virtual sparse feature).
Fetch embedding vectors from the parameter server and place them into the batch‑sized matrix.
Use the virtual sparse feature as model input.
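The steps above can be sketched with NumPy as a stand‑in for the real lookup path. The per‑ID initializer mocks the parameter‑server fetch; everything else follows the workflow as described:

```python
# Sketch of the "virtual sparse feature" trick: size the embedding
# matrix to the batch and remap raw feature IDs to compact row indices.
import numpy as np

EMBEDDING_DIM = 4
raw_ids = np.array([900000007, 12, 900000007, 55])  # IDs in this batch

# Steps 1-3: collect, sort, and deduplicate IDs. np.unique also
# returns, for every raw ID, its index into the sorted unique array --
# that index IS the virtual sparse feature.
unique_ids, virtual_ids = np.unique(raw_ids, return_inverse=True)

# Step 4: fetch only the touched rows from the parameter server
# (mocked by a deterministic per-ID initializer) into a small matrix.
batch_embeddings = np.stack([
    np.random.default_rng(int(fid)).normal(scale=0.01, size=EMBEDDING_DIM)
    for fid in unique_ids
])

# Step 5: the model consumes virtual_ids, so a standard embedding
# lookup over the small batch-sized matrix works unchanged.
vectors = batch_embeddings[virtual_ids]
print(batch_embeddings.shape)  # (3, 4): 3 unique IDs in the batch
print(vectors.shape)           # (4, 4): one row per input position
```

The matrix holds only three rows here, no matter how large the raw ID (900000007) is, which is exactly why the feature dimension becomes effectively unbounded.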
Inference Optimization
TensorNet changes only the first layer, so inference remains simple. The model is split into an offline training part (embedding_lookup_graph) and an online inference part (inference_graph) that consumes a pre‑exported sparse embedding dictionary.
Using split‑graph together with XLA AOT can improve online performance by about 35 %.
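The online half of the split can be sketched as a dictionary read plus the dense layers. The export format, the zero‑vector fallback for unseen IDs, and the tiny dense head are all assumptions for illustration, not TensorNet's actual serving API:

```python
# Sketch of split-graph serving: the offline embedding_lookup_graph
# exports a {feature_id: vector} dictionary; the online inference
# path only does dict lookups plus the dense computation.
import numpy as np

EMBEDDING_DIM = 4

# Pre-exported by the offline graph (mocked with fixed vectors).
exported_embeddings = {
    12: np.arange(EMBEDDING_DIM, dtype=float),
    55: np.ones(EMBEDDING_DIM),
}
DEFAULT = np.zeros(EMBEDDING_DIM)  # unseen IDs fall back to zeros (assumed)

def serve(feature_ids, weights, bias):
    """Online path: dictionary lookups, then a tiny dense head."""
    x = np.concatenate([exported_embeddings.get(fid, DEFAULT)
                        for fid in feature_ids])
    return float(x @ weights + bias)

w = np.ones(2 * EMBEDDING_DIM)
score = serve([12, 55], w, bias=0.5)
print(score)  # (0+1+2+3) + (1+1+1+1) + 0.5 = 10.5
```

Keeping the online graph free of the huge embedding table is what makes it small enough to compile with XLA AOT.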
Open Source and Getting Started
TensorNet is open‑source and has been deployed in 360’s ad CTR prediction pipelines with significant results. The code, documentation, and tutorials are available at:
GitHub repository: https://github.com/Qihoo360/TensorNet
Quick start tutorial: https://github.com/Qihoo360/TensorNet/doc/tutorial/01-begin-with-wide-deep.ipynb
Additional docs: https://github.com/Qihoo360/TensorNet/README.md
Contact: Zhang Yansheng ([email protected]), Yao Lei ([email protected]).
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.