GroupNorm

After several tries, I have found that GroupNorm works surprisingly well on detection models. Just turning on GroupNorm in the FPN already gives an improvement by a large margin. Going further, I want to replace the BatchNorm in the backbone with GroupNorm and see how I can use this layer in other networks.

Paper: overview and comments

Criticism of BatchNorm

It does not work well on models trained with small batches. Some papers have shown that BatchNorm mainly serves to keep the activation distributions under control and thereby help training converge. Therefore, if we already have a good initialization, we may not even need BatchNorm.
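To make the small-batch problem concrete, here is a toy NumPy sketch (my own illustration, not from any paper) that estimates the mean of a single channel from mini-batches of size 2 versus 32; the spread of those estimates is the noise that BatchNorm injects into training when the batch is small.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical activations of one channel: 10,000 values drawn from N(0, 1).
activations = rng.normal(0.0, 1.0, size=10_000)

for batch_size in (2, 32):
    # Estimate the channel mean from many random mini-batches.
    batch_means = [
        rng.choice(activations, size=batch_size, replace=False).mean()
        for _ in range(1_000)
    ]
    # The standard deviation of these estimates is the noise added by BatchNorm.
    print(batch_size, np.std(batch_means))
```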

Related work

  • Local Response Normalization.
  • Batch Normalization (or Spatial Batch Norm in some frameworks).
  • Layer Normalization.
  • Weight Normalization.
  • Batch Renormalization.
  • Synchronized BatchNorm: [Bag of Freebies for Training Object Detection Neural Networks] uses this technique instead of GroupNorm for YOLOv3; I wonder why.
  • Instance Normalization.

Group-wise computation

  • ResNext.
  • MobileNet. Note to self: after several weeks of working on detection, we found that MobileNet does not work well for detection. One paper supports this observation [Light-Weight RetinaNet for Object Detection]. Their observation is also consistent with my experimental results, especially regarding the low confidence scores of the model.
  • Xception.

Normalization Revisiting and Formulation

The authors did a nice job of unifying the formulas of the popular normalization techniques. Looking at the figure, to some extent we can say that GroupNorm is a variant of LayerNorm and InstanceNorm.

[Figure: Normalization Methods]

Interestingly, LayerNorm looks oddly like the pooling method in Triangulation Embedding, or other higher-level features. Based on the figure, we can also see why BatchNorm is not effective at small batch sizes: N is small, so there are not enough samples to compute a good approximation of the two moments (mean and variance). So how do the other methods overcome that problem? They compute the statistics over the channels instead. Most common CNN models use 64, 128, or 256 channels in a conv layer, so we have relatively enough values to compensate for the lack of samples in each batch.

Regarding the computation, the family of normalization layers is composed of two steps:

  1. Compute the statistics and normalize the input:
$$\hat{x}_i = \frac{1}{\sigma_i}(x_i - \mu_i)$$

where $\mu_i$ and $\sigma_i$ are computed over a subset $S_i$ of the feature values from a batch of inputs. The art of creating a new normalization layer is in designing a subset that overcomes the shortcomings of previous methods.

  2. For each channel, learn a linear transformation to compensate for the possible loss of representational ability:
$$y_i = \gamma \hat{x}_i + \beta$$

where $\gamma$ and $\beta$ are a trainable scale and shift (both steps are sketched below).
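To make the two steps concrete, here is a minimal NumPy sketch of a GroupNorm forward pass, assuming NCHW input and a group count that divides the number of channels; the function name, the epsilon, and the shapes are my own choices for illustration, not the paper's reference code.

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Minimal GroupNorm forward pass for an NCHW tensor (sketch only)."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must be divisible by num_groups"

    # Step 1: compute mean/variance over each group of channels, per sample,
    # and normalize: x_hat = (x - mu) / sigma.
    x = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((x - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)

    # Step 2: per-channel trainable scale and shift: y = gamma * x_hat + beta.
    return gamma.reshape(1, c, 1, 1) * x_hat + beta.reshape(1, c, 1, 1)

# Usage: 2 images, 8 channels split into 4 groups.
x = np.random.randn(2, 8, 16, 16).astype(np.float32)
y = group_norm(x, num_groups=4, gamma=np.ones(8), beta=np.zeros(8))
```

Setting the number of groups equal to the number of channels recovers InstanceNorm, and using a single group recovers LayerNorm, which matches the unified view in the figure.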

Implementation

The paper also mentions the TensorFlow implementation, which I won't discuss much. However, the C++ implementation from Caffe2 is worth reading. Why? Because it computes the two moments at inference time, so I thought there is no way to fold the layer into the preceding conv layer, which is disappointing since I want to optimize the inference model for mobile.

Interestingly, I’ve also found that there are dif­fer­ent im­ple­men­ta­tions for BatchNorm and there is no agree­ment across all pop­u­lar deep learn­ing frame­works about whether or not Bessel’s cor­rec­tions are ap­plied.

Surprisingly enough, by using running standard deviations we can avoid some serious numerical errors and get a better approximation compared to the textbook formula for the standard deviation. Some discussion of the problem and of how to compute the standard deviation can be found in The Art of Computer Programming, Volume 2, Section 4.2.2, or the Wikipedia page on algorithms for calculating variance.
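The issue with the textbook one-pass formula, $\operatorname{Var}[x] = E[x^2] - E[x]^2$, is catastrophic cancellation when the mean is large compared to the spread; a running (online) update in the spirit of Welford's algorithm, which is what those references discuss, avoids it. A small sketch with made-up numbers:

```python
import numpy as np

def textbook_var(xs):
    """Naive one-pass formula E[x^2] - E[x]^2 (numerically fragile)."""
    n = len(xs)
    s = sq = 0.0
    for x in xs:
        s += x
        sq += x * x
    return sq / n - (s / n) ** 2

def welford_var(xs):
    """Welford's running update: numerically stable one-pass variance."""
    mean = m2 = 0.0
    for i, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / i
        m2 += delta * (x - mean)
    return m2 / len(xs)

# A large offset with a tiny spread triggers cancellation in the textbook formula.
xs = (1e8 + np.array([0.1, 0.2, 0.3, 0.4])).tolist()
print(textbook_var(xs))  # can come out wildly wrong, even negative
print(welford_var(xs))   # ~0.0125, the correct population variance
```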

Experiments

| Setting                   | Label Recall | Label Precision |
|---------------------------|--------------|-----------------|
| Dataset 1 (AffineChannel) | 0.2464       | 0.3310          |
| Dataset 1 (GroupNorm)     | 0.2676       | 0.3615          |
| Dataset 2 (AffineChannel) | 0.2492       | 0.3400          |
| Dataset 2 (GroupNorm)     | 0.2620       | 0.3761          |

From my own experiments, where the models are essentially RetinaNet from the Detectron library, GroupNorm indeed helps to improve the performance of detection models (+8% on recall and precision, of course after setting the corresponding thresholds) on two COCO-like datasets. However, adapting GroupNorm to mobile frameworks may be difficult: some do not support this layer, and we would have to write our own CPU/CUDA code. Another workaround is to stick with BatchNorm, train with a smaller image size, and increase the batch size. AffineChannel is another choice: it is relatively good and can easily be merged into the conv layer to save memory (see the sketch below).
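For reference, folding a frozen per-channel affine (which is what AffineChannel is) into the preceding convolution looks roughly like the sketch below; the function name and the (out_c, in_c, kH, kW) weight layout are my own assumptions, not Detectron code.

```python
import numpy as np

def fold_affine_into_conv(weight, bias, scale, shift):
    """Fold y = scale * conv(x) + shift into the conv's own weight and bias.

    weight: (out_c, in_c, kh, kw) kernel, bias: (out_c,) or None,
    scale/shift: (out_c,) per-channel affine parameters (AffineChannel).
    """
    if bias is None:
        bias = np.zeros(weight.shape[0], dtype=weight.dtype)
    folded_weight = weight * scale[:, None, None, None]
    folded_bias = bias * scale + shift
    return folded_weight, folded_bias

# Usage with made-up shapes: a 3x3 conv with 32 input and 64 output channels.
w = np.random.randn(64, 32, 3, 3).astype(np.float32)
b = np.zeros(64, dtype=np.float32)
scale = np.random.rand(64).astype(np.float32)
shift = np.random.randn(64).astype(np.float32)
w_folded, b_folded = fold_affine_into_conv(w, b, scale, shift)
```

This is also why BatchNorm with frozen statistics can be merged away at inference time, while GroupNorm, which recomputes its statistics per input, cannot.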

Nonetheless, GroupNorm is only used in the FPN layers in all of these settings. I wonder what happens if I replace every AffineChannel or Spatial BatchNorm with GroupNorm, even in the backbone. I will post the results soon (if I have time to run such experiments).

In conclusion, GroupNorm is a simple yet effective normalization method to use when you have to train models with a small batch size.