GroupNorm
After several experiments, I discovered that GroupNorm works surprisingly well on detection models. Simply enabling GroupNorm in FPN yielded a significant improvement. Inspired by this, I wanted to replace BatchNorm in the backbone with GroupNorm and explore how this layer might benefit other networks.
Paper: overview and comments
Criticism about BatchNorm
BatchNorm struggles with models trained on small batches. Several papers have shown that BatchNorm primarily keeps the activation distribution in check to aid training convergence. Consequently, with proper initialization, BatchNorm becomes dispensable.
Related works
- Local Response Normalization.
- Batch Normalization (or Spatial Batch Norm in some frameworks).
- Layer Normalization.
- Weight Normalization.
- Batch Renormalization.
- Synchronized Batchnorm: [Bag of Freebies for Training Object Detection Neural Networks] uses this technique instead of GroupNorm for Yolov3—I wonder why.
- Instance Normalization.
Group-wise computation
- ResNext.
- MobileNet. Note to self: after several weeks working on detection, we found that MobileNet does not perform well for detection. One paper supports this observation [Light-Weight RetinaNet for Object Detection]. Their findings align with my experiments, particularly regarding the low confidence scores of the model.
- Xception.
Normalization Revisiting and Formulation
The authors did a commendable job unifying the formulas of popular normalization techniques. Examine the figure below; to some extent, GroupNorm can be viewed as a variant of LayerNorm and InstanceNorm.

Interestingly, Layer Norm bears a striking resemblance to the pooling method in Triangulation Embedding and other higher-level features. From the figure, we can also deduce why Batch Norm falters with small batch sizes: N is small, leaving insufficient samples to compute reliable estimates of the two moments (mean and variance). How do other methods overcome this limitation? They compute statistics on the channels themselves. Most common CNN models use 64, 128, or 256 channels in convolutional layers, providing enough values to compensate for the limited samples per batch.
Regarding computation, the family of normalization layers consists of two steps:
- Compute the statistics and normalize the input:
where
- For each channel, learn a linear transformation to compensate for the possible loss of representational ability:
where
Implementation
The paper also mentions the TensorFlow implementation, which I won’t dwell on here. However, the C++ implementation from Caffe2 is worth examining. Why? Because it computes both moments at inference time, which disappointed me—I had hoped to fuse the layer into the penultimate Conv layer to optimize mobile inference.
Interestingly, I discovered that different implementations exist for BatchNorm, with no consensus across popular deep learning frameworks on whether Bessel’s correction should be applied.
Surprisingly, using running standard deviations helps avoid serious numerical errors and yields better approximations compared to the textbook formula. Discussions of this problem and computational approaches can be found in The Art of Computer Programming, Volume 2, Section 4.2.2, or the Wikipedia page on calculating variance.
Experiments
| Setting | Label Recall | Label Precision |
|---|---|---|
| Dataset 1 (AffineChannel) | 0.2464 | 0.3310 |
| Dataset 1 (GroupNorm) | 0.2676 | 0.3615 |
| Dataset 2 (AffineChannel) | 0.2492 | 0.3400 |
| Dataset 2 (GroupNorm) | 0.2620 | 0.3761 |
From my experiments using RetinaNet from the Detectron library, GroupNorm indeed improves detection model performance (+8% on recall and precision after tuning thresholds) on two COCO-esque datasets. However, adapting GroupNorm to mobile frameworks can be challenging—some don’t support this layer, requiring custom CPU/CUDA implementations. One workaround: stick with BatchNorm, use smaller image sizes during training, and increase batch size. AffineChannel offers another option—it’s relatively effective and easy to fuse into the conv layer to conserve memory.
Nonetheless, GroupNorm is only used in the FPN layers in all settings. I wonder what happens if I replace all AffineChannel or Spatial BatchNorm by GroupNorm, even on the backbone. I will put the results soon (If I have time to do such experiments).
In conclusion, GroupNorm is simple yet effective normalization method to use in case you have to train models with small batch size.