Towards Understanding the Inductive Bias of Normalization Methods in Deep Learning

Date: 8th Jun 2021

Time: 02:00 PM

Venue: meet.google.com/bnu-iuvt-oph

PAST EVENT

Details

Abstract:

Multiple recent works have shown that sufficiently overparameterized neural networks have enough capacity to fit even randomly labeled data (Jacot et al., 2018; Zhang et al., 2017). Nevertheless, these networks are known to generalize well on real-world data. The prevailing hypothesis to explain this phenomenon is that training/optimization algorithms such as gradient descent are biased towards simple solutions. This property, often called inductive bias, has been an active research area over the past few years (Gunasekar et al., 2018; Lyu and Li, 2020; Neyshabur et al., 2015). In this work, we pose the question of the inductive bias of gradient descent for normalization methods such as Batch Normalization and Weight Normalization, which are prevalent in modern deep learning architectures.

In particular, we analyze the inductive bias of gradient descent for weight-normalized smooth homogeneous neural nets trained on the exponential or cross-entropy loss. We study both standard weight normalization (SWN) and exponential weight normalization (EWN), and show that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate. We extend these results to gradient descent and establish asymptotic relations between weights and gradients for both SWN and EWN. We further show that EWN updates the weights in a way that prefers asymptotic relative sparsity. For EWN, we provide a finite-time convergence rate of the loss under gradient flow and a tight asymptotic convergence rate under gradient descent. We demonstrate our results for SWN and EWN on synthetic datasets, and experimental results on MNIST and CIFAR-10 support our claim that EWN finds sparse solutions, even with SGD. This suggests potential applications in learning neural networks amenable to pruning.
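As a concrete illustration (a minimal sketch, not the authors' code), the two reparameterizations can be written as single linear layers in PyTorch: SWN represents each weight vector as w = (a / ||v||) v with a learnable scalar scale a, while EWN uses w = (e^a / ||v||) v. The per-output-row scaling and the class names SWNLinear and EWNLinear are assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SWNLinear(nn.Module):
    # Standard weight normalization: w = (a / ||v||) * v,
    # with one learnable scalar scale a per output row of v.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.v = nn.Parameter(torch.randn(out_features, in_features))
        self.a = nn.Parameter(torch.ones(out_features))

    def forward(self, x):
        w = self.a.unsqueeze(1) * self.v / self.v.norm(dim=1, keepdim=True)
        return F.linear(x, w)

class EWNLinear(nn.Module):
    # Exponential weight normalization: w = (exp(a) / ||v||) * v,
    # so the scale is always positive and gradient updates act
    # multiplicatively on it.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.v = nn.Parameter(torch.randn(out_features, in_features))
        self.a = nn.Parameter(torch.zeros(out_features))  # exp(0) = 1 at init

    def forward(self, x):
        w = torch.exp(self.a).unsqueeze(1) * self.v / self.v.norm(dim=1, keepdim=True)
        return F.linear(x, w)

Running the same gradient-descent loop on both layers yields different trajectories in the effective weights w; the talk's claim is that the EWN trajectory asymptotically prefers relatively sparse weights.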

Speakers

Depen Morwani (CS19S002)

Computer Science & Engineering