Batch Normalization in Neural Network Simply Explained

6 min readDec 3, 2023

Introduction

The Batch Normalization layer was a game-changer in deep learning when it was just introduced. It’s not just about stabilizing training but also a key tool that makes neural networks better, faster, and more effective.

Batch Normalization tackles challenges in training, enabling quicker convergence and facilitating the training of deeper networks. This article is your guide to understanding Batch Normalization — how it works, why it matters, and the impact it has on optimizing the performance of deep learning models. In this article, we will go through two major research papers about Batch Normalization.

Reference Papers:
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift — https://arxiv.org/pdf/1502.03167.pdf
How Does Batch Normalization Help Optimization? — https://arxiv.org/pdf/1805.11604.pdf

What is Covariate Shift?

Covariate shift means a change in the distribution of features (input variables) between the training and testing dataset in Machine Learning or Deep Learning models. In other words, it exists when the statistical properties of the input data change between the training and testing phases.

When a Machine Learning / Deep Learning model is trained on a dataset and then applied to another dataset with different statistical properties, the model might not perform well due to this Covariate Shift.

Covariate shift can be misleading because the models assume that the training and testing data come from the same distribution. When this assumption is violated, the model performance might deteriorate because the training process is trying to generalize patterns from the training data while a different distribution is present in the test data.

What is Internal Covariate Shift?

Internal covariate shift is a phenomenon specific to neural networks, where the distribution of each layer’s inputs changes as the parameters of the preceding layers are updated. In simpler terms, it refers to the changing distribution of the inputs to each layer within a neural network during the training process.

This phenomenon is termed “internal” covariate shift to distinguish it from the standard covariate shift that refers to changes in the input distribution between training and testing datasets.

As a neural network learns and updates its weights through backpropagation, the distribution of inputs to each layer can shift. This shift can make subsequent layers’ learning more challenging because they have to continually adapt to these changing distributions, slowing down the overall training process. Here is an illustration of Internal Covariate Shift:

As in the illustration, the distribution of the input feature remains unchanged, while the distribution of the input of subsequent layers will change after each batch (feed-forward & backpropagation).

As the above illustration, in batch 2, the distribution of the input data of each layer may change again and differ from that in batch 1. For example, in hidden layer 1, the distribution of input data changes from Distribution Y to Distribution B in different batches.

This is the first hidden layer in the networks, it may not be difficult for the layer to adapt to the change in distribution of input data. As the layer goes deeper (i.e. layer 20 / layer 50), it would be difficult for these layers to adapt to the changes in distribution and learn from the data. As a result, this may slow down the training process and hinder model performance.

Batch Normalization

Batch Normalization was introduced in 2015 to mitigate the effects of Internal Covariate Shifts by normalizing the inputs to each layer within batches in the training stage. This normalization stabilizes the distributions of inputs, resulting in more stable and faster training of neural networks as well as accelerated convergence.

How it works?

General Normalization

Normalization refers to a technique aimed at standardizing or scaling the input data. To be specific, it adjusts the inputs of a dataset to a scale with a mean of 0 and a standard deviation of 1.

Within-Batch Normalization

In the deep-learning approach, datasets are divided into batches for training steps. Therefore, this is called Batch Normalization. Batch Normalization is applied before inputting the dataset in each hidden layer in Fig 1(a). For all k features, normalization is applied as Fig 1(b).

Figure 5. Batch Normalization Steps (a) & Batch Normalization Formula (b)

Notation (From left to right):
x_(k): k-th feature (column) of the data
x_hat_(k): Normalized data in x_(k)
E[x_(k)]: Expectation / Mean of x_(k)
Var[x_(k)]: Variance of x_(k)

Batch Normalization operates by subtracting the mean/expectation (Shift) and dividing by the standard deviation (Scale). While this normalization step aids in stabilizing and accelerating training, this could result in information loss (Shift & Scale).

De-Normalization

To avoid information loss, the researchers introduce two parameters (Gamma and Beta) in the step of de-normalization as follows:

Notation (From left to right):
y_(k): k-th feature (column) of the data
gamma_(k): model parameters to learn about the Scale
x_hat_(k): Normalized data in normalization steps
beta_(k): model parameters to learn about the Shift

Gamma_(k) and Beta_(k) are part of the model parameters to be learned during the training process. With extra parameters, we can capture the information loss in the normalization steps while the normalization step would stabilize the distribution of the input data in each layer.

Combining Both Steps

Figure 7. Batch Normalization and de-Normalization from the research paper

Remarks: What we do not include in the article:

Statistical Inference of Batch Normalization in Neural Network and Convolutional Neural Network.

Re-thinking

In 2019, another research team published another paper to re-think the relationship between Batch Normalization and Optimization.

How Does Batch Normalization Help Optimization? — https://arxiv.org/pdf/1805.11604.pdf

Figure 8. Charts from the research paper.

In this paper, they compared VGG networks trained without BatchNorm (Standard), with BatchNorm (Standard + BatchNorm) and with explicit “covariate shift” added to BatchNorm layers (Standard + “Noisy” BatchNorm).

“In the latter case, the research team induced distributional instability by adding time-varying, non-zero mean and non-unit variance noise independently to each batch normalized layer. The chart above showed that internal covariate shift is not directly affecting the training process.”
— From original research paper.

With explicit covariate shifts, the model still performs better. This disproved that batch normalization is solving the issue of internal covariate shift.

Why is Batch Normalization helpful for Optimization?

The research team found out that Batch Normalization is smoothing the loss function. From the chart below, we see that the loss landscape (loss surface) is smoothened (less spike), which makes the optimization much easier and faster. Similar smoothening happens in gradient as well.

Figure 9. Charts from the research paper. Smoothened Loss Landscape (left), Gradient (Middle) and Smoothing Effectiveness (Right)

When the loss landscape is smoothened, we can apply a higher learning rate and less regularization on the model because the optimal point is now easier to reach. This explains the phenomena in sections 3.3 (Batch Normalization enables higher learning rates) and 3.4 (Batch Normalization regularizes the model) in the research paper that introduced Batch Normalization.

To conclude

We have gone through how Batch Normalization works and unveiled a profound understanding of why Batch Normalization catalyzes optimizing neural networks by comparing the two major research papers about Batch Normalization.

The answer is that Batch Normalization smoothes the loss landscape which brings us several major advantages:

Stable Training Process
Accelerate Learning Process
Regularize Model to Avoid Overfitting

Finally, as we recap the usefulness of Batch Normalization, its impact spans beyond mere training stability. It empowers neural networks to stabilize, learn efficiently, converge faster and generalize better with enhanced accuracy.

If you like this article, give me a like and share it with your friends. You may also leave me a comment/message too.