GroupNorm

After several experiments, I discovered that GroupNorm works surprisingly well on detection models. Simply enabling GroupNorm in FPN yielded a significant improvement. Inspired by this, I wanted to replace BatchNorm in the backbone with GroupNorm and explore how this layer might benefit other networks.

Paper: overview and comments

Criticism of BatchNorm

BatchNorm struggles with models trained on small batches. Several papers have shown that BatchNorm primarily keeps the activation distribution in check to aid training convergence. Consequently, with proper initialization, BatchNorm becomes dispensable.

Related works

  • Local Response Normalization.
  • Batch Normalization (or Spatial Batch Norm in some frameworks).
  • Layer Normalization.
  • Weight Normalization.
  • Batch Renormalization.
  • Synchronized BatchNorm: [Bag of Freebies for Training Object Detection Neural Networks] uses this technique instead of GroupNorm for YOLOv3; I wonder why.
  • Instance Normalization.

Group-wise computation

  • ResNext.
  • MobileNet. Note to self: after several weeks working on detection, we found that MobileNet does not perform well for detection. One paper supports this observation [Light-Weight RetinaNet for Object Detection]. Their findings align with my experiments, particularly regarding the low confidence scores of the model.
  • Xception.

Normalization Revisiting and Formulation

The authors did a commendable job unifying the formulas of popular normalization techniques. Examine the figure below; to some extent, GroupNorm can be viewed as a variant of LayerNorm and InstanceNorm.

(Figure: Normalization Methods)

Interestingly, Layer Norm bears a striking resemblance to the pooling method in Triangulation Embedding and other higher-level features. From the figure, we can also deduce why Batch Norm falters with small batch sizes: N is small, leaving insufficient samples to compute reliable estimates of the two moments (mean and variance). How do other methods overcome this limitation? They compute statistics on the channels themselves. Most common CNN models use 64, 128, or 256 channels in convolutional layers, providing enough values to compensate for the limited samples per batch.
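To make the figure concrete, here is a small sketch (mine, not from the paper) of the axes each method averages over on a toy NCHW tensor. The only difference between the four layers is which subset of elements shares one mean and variance:

```python
import numpy as np

# Toy activation tensor: N samples, C channels, H x W spatial positions.
N, C, H, W = 2, 8, 4, 4
x = np.random.randn(N, C, H, W)

# BatchNorm: one statistic per channel, averaged over (N, H, W).
# The N axis is in the reduction, so small batches mean noisy estimates.
bn_mean = x.mean(axis=(0, 2, 3))   # shape (C,)

# LayerNorm: one statistic per sample, averaged over (C, H, W).
ln_mean = x.mean(axis=(1, 2, 3))   # shape (N,)

# InstanceNorm: one statistic per sample and channel, averaged over (H, W).
in_mean = x.mean(axis=(2, 3))      # shape (N, C)

# GroupNorm: split C into G groups; one statistic per sample and group,
# averaged over (C // G, H, W) -- independent of the batch size.
G = 4
gn_mean = x.reshape(N, G, C // G, H, W).mean(axis=(2, 3, 4))  # shape (N, G)
```

With G = C this reduces to InstanceNorm, and with G = 1 to LayerNorm, which is the sense in which GroupNorm interpolates between the two.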

Regarding computation, the family of normalization layers consists of two steps:

  1. Compute the statistics and normalize the input:
x̂i = (1/σi)(xi − μi)

where μi and σi are computed over a subset Si of the feature maps from a batch of inputs. The art of designing a new normalization layer lies in crafting a subset that overcomes the shortcomings of previous methods.

  2. For each channel, learn a linear transformation to compensate for the possible loss of representational ability:
yi = γ·x̂i + β

where γ and β are trainable scale and shift parameters.
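The two steps above can be sketched as a plain NumPy forward pass for GroupNorm; the function below is my own illustration under the paper's formulation, not reference code from any framework:

```python
import numpy as np

def group_norm(x, gamma, beta, num_groups, eps=1e-5):
    """GroupNorm forward pass following the two steps above.

    x: (N, C, H, W) input; gamma, beta: (C,) trainable scale and shift.
    """
    N, C, H, W = x.shape
    g = x.reshape(N, num_groups, C // num_groups, H, W)
    # Step 1: compute the moments over each subset Si -- here, one
    # (sample, group) slab -- and normalize: x_hat = (x - mu) / sigma.
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((g - mu) / np.sqrt(var + eps)).reshape(N, C, H, W)
    # Step 2: per-channel linear transformation y = gamma * x_hat + beta.
    return gamma.reshape(1, C, 1, 1) * x_hat + beta.reshape(1, C, 1, 1)
```

With γ = 1 and β = 0, every (sample, group) slab of the output has roughly zero mean and unit variance, and nothing in the computation touches the batch axis.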

Implementation

The paper also mentions the TensorFlow implementation, which I won't dwell on here. However, the C++ implementation from Caffe2 is worth examining. Why? Because it computes both moments at inference time, which disappointed me: I had hoped to fuse the layer into the preceding Conv layer to optimize mobile inference.

Interestingly, I discovered that different implementations exist for BatchNorm, with no consensus across popular deep learning frameworks on whether Bessel's correction should be applied.
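A quick NumPy sketch of the two conventions in question; NumPy exposes the choice through the `ddof` parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)

biased = np.var(x)             # divides by N (ddof=0, NumPy's default)
unbiased = np.var(x, ddof=1)   # divides by N - 1 (Bessel's correction)

# The two differ by a factor of N / (N - 1), which matters for the
# small per-batch sample counts BatchNorm sees.
print(biased, unbiased)
```

PyTorch's BatchNorm, for instance, normalizes activations with the biased variance but updates its running variance with the unbiased estimate, so even a single framework mixes the two conventions.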

Surprisingly, using running standard deviations helps avoid serious numerical errors and yields better approximations compared to the textbook formula. Discussions of this problem and computational approaches can be found in The Art of Computer Programming, Volume 2, Section 4.2.2, or the Wikipedia page on algorithms for calculating variance.
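As a sketch of why the running computation is more robust, here is Welford's online algorithm (one of the methods discussed in the references above) against the textbook E[x²] − E[x]² formula, which subtracts two nearly equal large numbers and suffers catastrophic cancellation:

```python
import numpy as np

def welford_var(xs):
    # Welford's running algorithm: update the mean and the sum of
    # squared deviations (m2) one sample at a time, so no large
    # intermediate quantities are ever subtracted from each other.
    mean, m2 = 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / len(xs)  # population (biased) variance

# A large offset makes E[x^2] and E[x]^2 both ~1e16; their difference
# (the true variance, 22.5) is smaller than a float64 ulp at that scale.
xs = [1e8 + v for v in (4.0, 7.0, 13.0, 16.0)]
naive = np.mean(np.square(xs)) - np.mean(xs) ** 2
print(naive, welford_var(xs))  # the running version stays near 22.5
```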

Experiments

Setting                       Label Recall   Label Precision
Dataset 1 (AffineChannel)     0.2464         0.3310
Dataset 1 (GroupNorm)         0.2676         0.3615
Dataset 2 (AffineChannel)     0.2492         0.3400
Dataset 2 (GroupNorm)         0.2620         0.3761

From my experiments using RetinaNet from the Detectron library, GroupNorm indeed improves detection model performance (+8% on recall and precision after tuning thresholds) on two COCO-esque datasets. However, adapting GroupNorm to mobile frameworks can be challenging: some don't support this layer, requiring custom CPU/CUDA implementations. One workaround is to stick with BatchNorm, use smaller image sizes during training, and increase the batch size. AffineChannel offers another option; it is relatively effective and easy to fuse into the conv layer to conserve memory.
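As a sketch of why AffineChannel (or a frozen BatchNorm, whose statistics are constants at inference) fuses so easily while GroupNorm, which needs input statistics at inference, does not: the per-channel scale and shift can be folded into the preceding conv's parameters offline. The helper name below is mine, not from any framework:

```python
import numpy as np

def fuse_affine_into_conv(w, b, scale, shift):
    """Fold a per-channel affine y = scale * z + shift, applied to the
    output z of a conv, into the conv itself:

        scale * (w * x + b) + shift == (scale * w) * x + (scale * b + shift)

    w: (C_out, C_in, kH, kW) conv weights; b: (C_out,) conv bias;
    scale, shift: (C_out,) affine parameters.
    """
    w_fused = w * scale.reshape(-1, 1, 1, 1)
    b_fused = b * scale + shift
    return w_fused, b_fused
```

After fusing, inference runs a single conv per layer, saving memory and one elementwise pass; nothing comparable is possible when the normalizer must first see the activations.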

Nonetheless, GroupNorm is used only in the FPN layers in all these settings. I wonder what would happen if I replaced every AffineChannel or Spatial BatchNorm with GroupNorm, even in the backbone. I will post the results soon (if I find time to run such experiments).

In conclusion, GroupNorm is a simple yet effective normalization method to use when you have to train models with a small batch size.