LeAP: Precise Model Slimming with Learnable Adaptive Permutation for Feature Selection

LeAP replaces costly permutation‑based feature importance with an end‑to‑end learnable module that shuffles features and applies adaptive gradient‑bias regularization. It matches or exceeds state‑of‑the‑art feature pruning on public benchmarks and reduces online inference cost on a large‑scale production recommender model.

Bilibili Tech

Current Challenges in Deep‑Learning Recommender Feature Selection

Industrial recommendation models combine 1‑D dense statistics and high‑dimensional user‑behavior embeddings (up to 256‑D). Five challenges are identified:

Dimensional heterogeneity: Uniform penalties cause high‑dimensional features to generate larger gradients that resist sparsification, while useful 1‑D features are over‑sparsified.

Extreme sparsity: Long‑tail features appear with default values in >99% of samples; many selectors treat low‑frequency activation as noise and discard them.

Permutation cost: Permutation importance requires shuffling each feature and re‑forwarding the model, which is infeasible on billions of samples.

Mask‑gate compensation: In joint training, shrinking mask values triggers downstream weight amplification, preventing mask scores from polarizing to 0 or 1.

Hyper‑parameter burden: Methods such as LPFS introduce multiple polarizing functions and more than five hyper‑parameters, complicating deployment.
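To make the permutation-cost challenge concrete, here is a minimal NumPy sketch of classical permutation importance. The toy model, the squared-error loss, and the names `permutation_importance`, `predict_fn`, and `calls` are illustrative assumptions, not code from the article. The point is only that every feature needs its own extra forward pass over the data.

```python
import numpy as np

def permutation_importance(predict_fn, X, y, rng):
    """Classical permutation importance: loss increase after shuffling one column."""
    def mse(pred):
        return float(np.mean((pred - y) ** 2))

    base_loss = mse(predict_fn(X))           # one forward pass for the baseline
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):              # one extra forward pass per feature
        X_perm = X.copy()
        X_perm[:, j] = X[rng.permutation(X.shape[0]), j]
        importances[j] = mse(predict_fn(X_perm)) - base_loss
    return importances

# Hypothetical toy model: the prediction depends only on column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 2.0 * X[:, 0]
calls = {"n": 0}                             # counts forward passes

def predict_fn(batch):
    calls["n"] += 1
    return 2.0 * batch[:, 0]

scores = permutation_importance(predict_fn, X, y, rng)
print(scores, calls["n"])
```

With F features the loop costs F + 1 full passes over the evaluation data, which is what becomes impractical at the scale described above.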

LeAP – Learnable Adaptive Permutation

LeAP replaces discrete shuffling with a differentiable, end‑to‑end gate inserted after the feature‑concatenation layer.

Core Network Architecture

For each training batch, LeAP independently shuffles each feature to generate a noise replica that follows the original marginal distribution. The TensorFlow implementation:

import tensorflow as tf

# Shuffle all features
# hidden is the concatenated feature representation, shape [B, D]
def shuffle_all_features(hidden, fea_dim_range_dict):
    """Shuffle each feature across samples. Different features use different random permutations."""
    # Iterate in column order (by dim_start, not by key name) so the output keeps
    # the original layout; assumes the ranges are contiguous and cover all D columns
    sorted_keys = sorted(fea_dim_range_dict, key=lambda f: fea_dim_range_dict[f]['dim_start'])
    batch_size = tf.shape(hidden)[0]
    shuffled_blocks = []
    for f in sorted_keys:
        d_start = fea_dim_range_dict[f]['dim_start']
        d_end   = fea_dim_range_dict[f]['dim_end']
        # Slice the column range for this feature
        mid_part = hidden[:, d_start:d_end]  # [B, (d_end - d_start)]
        # Generate a fresh random row permutation for this feature
        perm_f = tf.random.shuffle(tf.range(batch_size))  # [B]
        # Apply the row shuffle; all columns of the feature move together
        mid_part_shuffled = tf.gather(mid_part, perm_f, axis=0)
        shuffled_blocks.append(mid_part_shuffled)
    # Concatenate shuffled blocks back to [B, D]
    hidden_shuffled = tf.concat(shuffled_blocks, axis=1)
    hidden_shuffled.set_shape([None, hidden.shape[1]])
    return hidden_shuffled

A learnable sigmoid gate blends the original and shuffled representations; a stop‑gradient operation blocks gradients from the noise branch.
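The blending step can be sketched as follows. The article does not print the exact formula, so the convex combination g·x + (1 − g)·shuffle(x) with one scalar gate per feature is an assumption, as are the names `leap_gate_forward` and `gate_logits`. The sketch is forward-only NumPy; in TensorFlow the noise branch would be wrapped in `tf.stop_gradient`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def leap_gate_forward(hidden, hidden_shuffled, gate_logits, fea_dim_range_dict):
    """Blend original and shuffled blocks with one sigmoid gate per feature."""
    out = hidden.copy()  # columns not covered by any range pass through unchanged
    for f, dims in fea_dim_range_dict.items():
        s, e = dims['dim_start'], dims['dim_end']
        g = sigmoid(gate_logits[f])  # scalar gate shared by all dims of feature f
        # Assumed blend; the shuffled branch is treated as a constant (stop-gradient)
        out[:, s:e] = g * hidden[:, s:e] + (1.0 - g) * hidden_shuffled[:, s:e]
    return out

# Hypothetical example: a 1-D feature that is kept and a 2-D feature that is replaced.
hidden = np.arange(12, dtype=float).reshape(4, 3)
hidden_shuffled = hidden[::-1].copy()        # stand-in for a row shuffle
ranges = {"age": {"dim_start": 0, "dim_end": 1},
          "emb": {"dim_start": 1, "dim_end": 3}}
logits = {"age": 20.0, "emb": -20.0}         # gate near 1 and gate near 0
blended = leap_gate_forward(hidden, hidden_shuffled, logits, ranges)
```

A gate near 1 passes the original feature through, while a gate near 0 feeds the model pure shuffled noise for that feature.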

Adaptive Gradient‑Bias Regularization

LeAP adds a regularization term based on the L2 distance between a feature and its shuffled version (Shuffle Divergence). For constant features, whose shuffled version is identical to the original, the divergence is clipped at a lower bound. The per‑feature adaptive weight w_i is the exponential moving average (EMA) of the divergence; it is scaled by a global sparsity hyper‑parameter λ, which is the only scalar that needs tuning.

Total loss:

TotalLoss = TaskLoss + λ·∑_i w_i·||x_i – shuffle(x_i)||_2
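The quantities in this formula can be sketched in NumPy as below. The batch-mean reduction, the floor value `1e-3`, the EMA decay `0.9`, the zero initialization, and the function names are all assumptions made for illustration. The sketch evaluates the penalty literally as printed above and does not model how the gate interacts with it.

```python
import numpy as np

def shuffle_divergence(x, x_shuffled, floor=1e-3):
    """Batch-mean L2 distance between a feature block and its shuffled copy,
    clipped from below so constant features do not get a zero value."""
    dist = float(np.linalg.norm(x - x_shuffled, axis=1).mean())
    return max(dist, floor)

def ema_update(prev, value, decay=0.9):
    """Exponential moving average of the per-feature divergence."""
    return decay * prev + (1.0 - decay) * value

rng = np.random.default_rng(0)
batch = {
    "dense_1d": rng.normal(size=(512, 1)),
    "emb_64d": rng.normal(size=(512, 64)),
    "constant": np.ones((512, 1)),
}
lam = 0.01                                 # global sparsity hyper-parameter λ
weights = {name: 0.0 for name in batch}    # EMA state, hypothetical zero init

divergence = {}
for name, x in batch.items():
    x_shuffled = x[rng.permutation(x.shape[0])]   # independent row shuffle per feature
    divergence[name] = shuffle_divergence(x, x_shuffled)
    weights[name] = ema_update(weights[name], divergence[name])

penalty = lam * sum(weights[n] * divergence[n] for n in batch)
print(divergence, penalty)
```

On this toy batch the 64‑D embedding has a much larger divergence than the 1‑D statistic, which illustrates the dimensional heterogeneity the adaptive weight is meant to compensate for.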

Gradient decomposition yields two components:

Sensitivity: task‑driven importance of the feature.

Shuffle Divergence: magnitude of change after shuffling, which can bias gradients for high‑dimensional or dense features.

The regularization gradient is positively correlated with Shuffle Divergence, counteracting the bias from the task gradient. Consequently, useful features are driven toward gate value 1, redundant ones toward 0.
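The two-factor structure of the task gradient can be checked numerically. The sketch below assumes the gated output is h = g·x + (1 − g)·shuffle(x) with g = sigmoid(a) and the shuffled branch held constant. Under that assumption the chain rule gives dL/da = g·(1 − g)·Σ (dL/dh)·(x − shuffle(x)), a product of a sensitivity term and a shuffle-difference term. The linear model, the squared-error loss, and all function names are illustrative choices, not the article's.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def task_loss(h, w, y):
    return 0.5 * np.mean((h @ w - y) ** 2)

def blended(a, x, x_shuf):
    g = sigmoid(a)
    return g * x + (1.0 - g) * x_shuf        # x_shuf acts as a constant (stop-gradient)

def gate_logit_grad(a, x, x_shuf, w, y):
    """Analytic dL/da, written as sensitivity times shuffle difference."""
    h = blended(a, x, x_shuf)
    sensitivity = ((h @ w - y)[:, None] * w[None, :]) / x.shape[0]   # dL/dh, [B, d]
    shuffle_diff = x - x_shuf                                         # [B, d]
    g = sigmoid(a)
    return g * (1.0 - g) * np.sum(sensitivity * shuffle_diff)

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
x_shuf = x[rng.permutation(256)]
w = rng.normal(size=8)
y = x @ w                                    # the feature is genuinely useful in this toy task
a = 0.0

analytic = gate_logit_grad(a, x, x_shuf, w, y)
eps = 1e-5                                   # central finite difference as a cross-check
numeric = (task_loss(blended(a + eps, x, x_shuf), w, y)
           - task_loss(blended(a - eps, x, x_shuf), w, y)) / (2 * eps)
print(analytic, numeric)
```

In this toy setup the gradient is negative, so gradient descent raises the logit and pushes the gate of the useful feature toward 1. Its magnitude grows with the shuffle difference, which is the bias the regularizer is designed to offset.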

Theoretical Benefits

High interpretability: after training, the gate score can be read as the probability that the feature must be kept, so one minus the score is the probability that the feature can be safely replaced by random noise.

Natural polarization: features whose sensitivity exceeds the regularization pressure converge to 1; others collapse to 0 without extra tricks.
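Because the scores polarize, turning them into a pruning decision can be as simple as a threshold. The following sketch assumes a cut-off of 0.5 and reuses the `dim_start`/`dim_end` layout of `fea_dim_range_dict`; the function name `select_features` and the example scores are hypothetical.

```python
def select_features(gate_scores, fea_dim_range_dict, threshold=0.5):
    """Keep features whose trained gate score exceeds the threshold."""
    kept = [f for f in fea_dim_range_dict if gate_scores[f] > threshold]
    kept.sort(key=lambda f: fea_dim_range_dict[f]['dim_start'])   # preserve column order
    width = lambda f: fea_dim_range_dict[f]['dim_end'] - fea_dim_range_dict[f]['dim_start']
    kept_dims = sum(width(f) for f in kept)
    total_dims = sum(width(f) for f in fea_dim_range_dict)
    return kept, kept_dims, total_dims

# Hypothetical trained gate scores for three features.
ranges = {
    "user_age":  {"dim_start": 0,  "dim_end": 1},
    "item_emb":  {"dim_start": 1,  "dim_end": 65},
    "rare_flag": {"dim_start": 65, "dim_end": 66},
}
scores = {"user_age": 0.98, "item_emb": 0.03, "rare_flag": 0.91}
kept, kept_dims, total_dims = select_features(scores, ranges)
print(kept, kept_dims, total_dims)
```

When scores are well polarized the exact threshold matters little, since few features sit near the middle of the range.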

Empirical Validation

On public benchmarks (Criteo [9], Avazu [7], MovieLens‑1M [6], AliCCP [8]) LeAP matches or exceeds state‑of‑the‑art pruning results across multiple pruning ratios.

In a Bilibili ranking model with >12,000 dimensions and a 2 TB checkpoint, LeAP identified >3,600 redundant dimensions (≈30 %) while preserving offline metrics.

Operational gains:

Inference efficiency: removal of 15%–50% redundant features reduced online inference resource consumption by 5%–20% without degrading core metrics such as play count or interaction volume.

Accelerated feature iteration: new features can be evaluated after a single LeAP run once the model has briefly converged, shortening experiment cycles.

Conclusion

LeAP transforms permutation‑based importance estimation into a differentiable gate that learns to replace features with noise, coupled with an adaptive gradient‑bias regularizer. This addresses dimensional heterogeneity and long‑tail sparsity, provides clear interpretability, and has been validated on both public datasets and large‑scale production.
