ICML Tutorial Highlights: Deep Residual Nets, Stochastic Gradient, Deep RL

At the ICML pre‑conference tutorial, experts presented deep residual networks, stochastic gradient methods for large‑scale learning, and deep reinforcement learning, highlighting architectural innovations, optimization theory, noise‑reduction techniques, and practical considerations for building scalable, high‑performance AI models.

Alibaba Cloud Developer

Deep Residual Network

Kaiming He introduced the deep residual network (ResNet), a framework that makes very deep convolutional models trainable by adding identity skip connections, allowing networks to grow from the roughly 22 layers of earlier architectures to more than 100 layers.

The techniques needed depend on depth. For shallow networks (fewer than 5 layers) standard back‑propagation suffices; beyond about 10 layers, careful initialization and batch normalization become important; beyond about 30 layers, shortcut connections help; beyond 100 layers, the shortcuts need to be identity mappings; networks deeper than 1000 layers remain an open research topic.

Initialization and batch normalization build on the principles in LeCun et al. (1998): choose the weight variance so that activation variance stays stable from layer to layer, adjust that variance for ReLU activations, and normalize each layer's inputs to zero mean and unit variance. Together these mitigate vanishing gradients and accelerate training.
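The two ideas above can be illustrated with a minimal sketch. It assumes the commonly used ReLU‑aware rule of drawing weights with standard deviation sqrt(2 / fan_in), which the summary does not state explicitly, and it omits the learnable scale and shift that batch normalization layers normally carry. The function names are hypothetical.

```python
import math


def he_std(fan_in):
    """Standard deviation for ReLU-aware weight initialization: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


def batch_normalize(values, eps=1e-5):
    """Normalize one feature across a batch to zero mean and (nearly) unit variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # eps guards against division by zero when the batch has no variance.
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# One feature observed over a batch of four examples (made-up numbers).
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_normalize(activations)
```

Because of the small eps term, the normalized variance is very slightly below 1 rather than exactly 1.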

Rather than fitting the desired mapping H(x) directly, a residual block writes it as H(x) = F(x) + x and lets the stacked layers learn only the residual function F(x). If the best mapping is close to the identity, F(x) only has to be pushed toward zero, which simplifies optimization and lets gradients flow through the shortcut in deep stacks.
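A minimal sketch of H(x) = F(x) + x follows, assuming a two‑layer residual branch F(x) = W2 · relu(W1 · x) on plain Python lists. The layer shapes and helper names are illustrative choices, not the architecture presented in the tutorial.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """Compute H(x) = F(x) + x, where F(x) = W2 · relu(W1 · x)."""
    f = matvec(W2, relu(matvec(W1, x)))
    return [fi + xi for fi, xi in zip(f, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights the residual branch vanishes and the block is the identity.
identity_out = residual_block(x, zeros, zeros)
```

With zero weights the block returns its input unchanged, which is why adding such blocks to a working network should not make it harder to fit.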

Empirical analysis of models with more than 100 layers shows that residual connections do not add expressive power per se. Instead, they make deeper architectures practical, ease optimization by alleviating vanishing gradients, and often generalize better thanks to the “deep‑and‑thin” structure.

Extending residual nets beyond 1000 layers requires careful design of the identity mapping. If the shortcut is not a pure additive identity, the signal on the shortcut path is multiplied by a factor at every layer, and that multiplicative behavior causes severe optimization difficulties.
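The multiplicative effect can be seen in a deliberately simplified sketch. It assumes each block computes x_next = scale * x + F(x) and looks only at the shortcut path, ignoring the residual branches. The scale values are made up for illustration.

```python
def shortcut_gradient_scale(scale, depth):
    """Gradient factor contributed by the shortcut path alone after `depth` blocks.

    Assumes every block multiplies its shortcut by the same constant `scale`.
    """
    return scale ** depth


# Pure identity shortcut: the factor stays at 1 regardless of depth.
identity_path = shortcut_gradient_scale(1.0, 1000)

# Slightly scaled shortcuts: the factor vanishes or explodes over 1000 blocks.
shrinking_path = shortcut_gradient_scale(0.9, 1000)
growing_path = shortcut_gradient_scale(1.1, 1000)
```

Only the identity case keeps the shortcut factor at exactly 1, which is the property the additive formulation relies on.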

Stochastic Gradient Methods for Large‑Scale Machine Learning

Léon Bottou, Frank E. Curtis, and Jorge Nocedal presented stochastic gradient (SG) methods, which update model parameters using a gradient computed on a single sample (or a small mini‑batch) rather than the full dataset. Each iteration is far cheaper than a full‑batch step, at the cost of a sublinear convergence rate.
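As a minimal sketch of the single‑sample update, the example below fits a one‑parameter model y ≈ w · x with squared loss. The data, step size, and function name are assumptions made for illustration. The data is noiseless, so the iterate settles on the true slope; with noisy data a fixed step size would instead hover around the solution.

```python
import random


def sgd_linear_fit(samples, lr=0.1, steps=1000, seed=0):
    """Fit y ≈ w * x, updating w from one randomly drawn sample per step."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x, y = rng.choice(samples)   # one sample, not the full dataset
        grad = (w * x - y) * x       # gradient of 0.5 * (w*x - y)**2 w.r.t. w
        w -= lr * grad
    return w


# Noiseless samples on the line y = 3x, with x in [-1, 1].
samples = [(k / 10.0, 3.0 * k / 10.0) for k in range(-10, 11)]
w_hat = sgd_linear_fit(samples)
```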

Theoretical analysis shows convergence bounds for convex objectives under fixed or diminishing step sizes, and for non‑convex deep networks the expected gradient norm after k iterations is bounded when step sizes decay appropriately.

Practical improvements focus on noise reduction and incorporating second‑order information:

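Variance‑reduced algorithms such as SVRG and SAGA correct the stochastic direction with stored gradient information: SAGA maintains a table of the most recent per‑sample gradients, while SVRG periodically computes a full gradient at a snapshot of the parameters. Both reduce the variance of the update and improve convergence.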

Gradient aggregation and iterate averaging further stabilize updates.

Second‑order approximations (e.g., L‑BFGS) can be integrated by approximating the Hessian on mini‑batches.
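To make the variance‑reduction idea from the list above concrete, here is a minimal SVRG‑style sketch on a one‑parameter least‑squares problem. The data, step size, inner‑loop length, and function names are assumptions chosen for illustration, not values from the tutorial.

```python
import random


def grad_i(w, x, y):
    """Gradient of the per-sample loss 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x


def svrg(xs, ys, lr=0.05, epochs=50, seed=0):
    """SVRG-style loop: full gradient at a snapshot, corrected stochastic steps."""
    rng = random.Random(seed)
    n = len(xs)
    w = 0.0
    for _ in range(epochs):
        w_snap = w
        # Full gradient, computed once per outer loop at the snapshot.
        full_grad = sum(grad_i(w_snap, x, y) for x, y in zip(xs, ys)) / n
        for _ in range(2 * n):
            i = rng.randrange(n)
            # The sample gradient at w is corrected by its value at the snapshot.
            corrected = grad_i(w, xs[i], ys[i]) - grad_i(w_snap, xs[i], ys[i]) + full_grad
            w -= lr * corrected
    return w


xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
w_svrg = svrg(xs, ys)

# Closed-form least-squares solution, used only for comparison.
w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

Unlike plain SG with a fixed step size, the corrected direction has variance that shrinks as the iterate approaches the solution, so the fixed step still converges on this noisy data.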

These techniques are especially relevant for large‑scale deep learning where communication overhead in distributed settings can make pure SG less attractive.

Deep Reinforcement Learning

David Silver described deep reinforcement learning (RL) as a framework where an agent selects actions in states to maximize cumulative reward, with deep networks providing powerful state representations.

RL methods are categorized into:

Value‑based RL (e.g., Q‑learning) that learns optimal Q‑functions.

Policy‑based RL that directly optimizes the policy.

Model‑based RL that builds a model of the environment.
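For the value‑based category, a tabular Q‑learning update can be sketched as follows. The two‑state table, learning rate alpha, and discount gamma are made‑up values for illustration.

```python
def q_learning_update(Q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Bootstrap from the best action value available in the next state.
    target = reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])
    return Q[state][action]


# Q[state][action] for a toy problem with two states and two actions.
Q = [[0.0, 0.0], [0.0, 1.0]]
updated = q_learning_update(Q, state=0, action=1, reward=1.0, next_state=1)
```

Here the target is 1.0 + 0.9 * 1.0 = 1.9, and moving halfway from 0.0 toward it gives 0.95.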

Deep Q‑Network (DQN) applies a convolutional network to map raw game frames to Q‑values, with extensions such as Double DQN (separating action selection from evaluation), Prioritized Replay, and Dueling Networks (splitting value and advantage streams) improving stability and performance.
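Two of these extensions can be sketched with plain lists standing in for network outputs. The function names and numbers are hypothetical. The mean‑subtraction used to combine the dueling streams is the commonly used aggregation and is an assumption here, since the summary does not specify one.

```python
def double_dqn_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


def dueling_q_values(state_value, advantages):
    """Combine value and advantage streams as Q(s, a) = V(s) + A(s, a) - mean(A)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


# The online net prefers action 1, so the target net's value for action 1 is used.
target = double_dqn_target(1.0, 0.9, [0.2, 0.8, 0.5], [1.0, 2.0, 3.0])
q_values = dueling_q_values(1.0, [1.0, 2.0, 3.0])
```

A standard DQN target would take the maximum of the target network's values (3.0 here), whereas the Double DQN target uses 2.0, which is how separating selection from evaluation curbs overestimation.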

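Policy‑based approaches include Actor‑Critic methods, where a critic estimates Q‑values while the actor updates the policy. A3C (Asynchronous Advantage Actor‑Critic) parallelizes this by running multiple actor‑learners asynchronously, each interacting with its own copy of the environment and updating shared parameters.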

For continuous action spaces, deterministic policy gradient (DPG) methods, in their deep variant (DDPG), combine experience replay, a DQN‑style critic, and a deterministic actor that is updated along the gradient of the critic with respect to the action.
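The actor update can be illustrated on a toy problem, a heavily simplified sketch rather than the full algorithm. It assumes a known critic Q(s, a) = -(a - 2s)^2 in place of a learned one, a linear actor a = theta * s, and no experience replay. All names and constants are hypothetical.

```python
def dpg_toy(steps=200, lr=0.05):
    """Move a deterministic actor along the critic's action gradient (chain rule)."""
    theta = 0.0
    states = [0.5, 1.0, 1.5]
    for t in range(steps):
        s = states[t % len(states)]
        a = theta * s                  # deterministic actor: a = theta * s
        dq_da = -2.0 * (a - 2.0 * s)   # gradient of Q(s, a) = -(a - 2s)**2 w.r.t. a
        da_dtheta = s                  # gradient of the actor output w.r.t. theta
        theta += lr * dq_da * da_dtheta  # gradient ascent on Q
    return theta


theta_hat = dpg_toy()
```

The critic is maximized at a = 2s, so the actor parameter should approach 2.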

Multi‑agent scenarios can be addressed with methods such as Fictitious Self‑Play (FSP), in which each agent learns a best response to the average policy of the other agents.

Overall, the integration of deep learning with reinforcement learning is poised to become a major research focus, offering solutions to sequential decision problems beyond traditional supervised tasks.
