Object Detection on Mobile Devices

Introduction

I have been working on object detection on mobile devices for the last 6 weeks, which is not long enough to delve into research papers on model compression or pruning techniques, but adequate to get into and run some experiments on a few detection models.

First, let us look at the detection model and establish some terms. A detector (following some state-of-the-art models such as RetinaNet) is usually composed of 3 components, from bottom to top (a minimal sketch of how they fit together follows the list):

  1. Backbone: feature extraction.
  2. Region Proposal Module: produces proposals based on the feature maps of the backbone and feeds them into the last component.
  3. Head predictor: predicts the bounding boxes and labels.
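To make the terms concrete, here is a minimal PyTorch sketch of how the three components fit together. The class and argument names (TinyDetector, proposal_module, and so on) are hypothetical and only illustrate the data flow, not any particular implementation.

```python
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Hypothetical skeleton of the three components described above."""

    def __init__(self, backbone: nn.Module, proposal_module: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone                # 1. feature extraction
        self.proposal_module = proposal_module  # 2. e.g. FPN/RPN producing proposals
        self.head = head                        # 3. predicts boxes and labels

    def forward(self, images: torch.Tensor):
        features = self.backbone(images)
        proposals = self.proposal_module(features)
        boxes, labels = self.head(proposals)
        return boxes, labels
```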

In this note, I will talk about all 3 components when developing a detection model for mobile devices. Let us assume that we are working on Android devices and that the deep learning frameworks on mobile support some common layers, but are not identical to the frameworks we train with, such as TensorFlow or PyTorch. Therefore, we also have to deal with one more step, namely converting the model trained on the workstation/servers into a model for mobile devices.
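To make that conversion step concrete, here is a rough sketch of exporting a PyTorch model for a mobile runtime via TorchScript. The model here is only a stand-in (a plain torchvision MobileNetV2 instead of a trained detector), the file name is a placeholder, and the exact export path depends on which mobile framework you target (PyTorch Mobile, TFLite, NCNN, and so on).

```python
import torch
import torchvision
from torch.utils.mobile_optimizer import optimize_for_mobile

# Stand-in for the trained detector from the workstation; substitute your own model.
model = torchvision.models.mobilenet_v2()
model.eval()

# Dummy input at the target mobile resolution.
example_input = torch.rand(1, 3, 320, 320)

# Trace the model into TorchScript so inference no longer depends on Python.
traced = torch.jit.trace(model, example_input)

# Optional mobile-specific optimizations (operator fusion, conv/bn folding, ...).
mobile_model = optimize_for_mobile(traced)
mobile_model._save_for_lite_interpreter("detector_mobile.ptl")
```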

Backbones

Here is the list of backbones I have been trying (a short sketch of reusing one of them as a feature extractor follows the list):

  • ResNet50
  • ResNet36
  • MobileNetV1
  • MobileNetV2
  • SqueezeNet
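Most of these backbones ship with torchvision, and their convolutional trunks can be pulled out directly for detection. A rough sketch with MobileNetV2 (the input size and resulting shape are only illustrative):

```python
import torch
import torchvision

# Keep MobileNetV2's convolutional trunk and drop the classifier head.
mobilenet = torchvision.models.mobilenet_v2()
backbone = mobilenet.features          # ends with a 1280-channel feature map

x = torch.rand(1, 3, 320, 320)
feature_map = backbone(x)
print(feature_map.shape)               # torch.Size([1, 1280, 10, 10]) for a 320x320 input
```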

Conclusions and Takeaways

  1. If you have to deal with the training dataset, make sure that the annotations are consistent and correct, because small networks do not have the capacity to absorb label noise.
  2. Lightweight networks behave differently between classification and detection tasks. Make sure that you have proper experiments to justify the model choice.
  3. GroupNorm is good for detection; use it if it is available.
  4. Sometimes the FPN and the head predictor are pretty heavy compared to the backbone. Some tricks to reduce memory and improve the inference time are:
    • Remove certain octave scales, especially FPN 3.
    • Replace normal Conv layers in the FPN branches with Depthwise layers.
    • Reduce anchor scales, aspect ratios, or even the number of classes if necessary.
  5. In general, a Depthwise conv is a good choice to replace a normal conv layer. However, the performance depends on the underlying implementation, so you may not gain any improvement from it (a small sketch of such a block, combined with GroupNorm from point 3, follows this list).
  6. Different frameworks use different settings and implementations, even for some de facto standard layers. While converting the model, make sure that the logic is correct. You may have to retrain the model with the final settings based on the mobile framework.
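As an illustration of points 3-5, here is a rough PyTorch sketch of a depthwise-separable block with GroupNorm that could stand in for a normal Conv-BN-ReLU block in an FPN branch. The function name and channel sizes are only illustrative, and as noted above, the actual speedup depends on the mobile framework's depthwise implementation.

```python
import torch
import torch.nn as nn

def depthwise_separable_block(in_ch: int, out_ch: int, norm_groups: int = 8) -> nn.Sequential:
    """Lighter replacement for a 3x3 Conv-Norm-ReLU block (illustrative only)."""
    return nn.Sequential(
        # Depthwise 3x3: one filter per input channel.
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
        # Pointwise 1x1: mixes channels and sets the output width.
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        # GroupNorm avoids BatchNorm's small-batch issues in detection training.
        nn.GroupNorm(norm_groups, out_ch),
        nn.ReLU(inplace=True),
    )

# Example: a 256-channel FPN branch rewritten with the lighter block.
block = depthwise_separable_block(256, 256)
out = block(torch.rand(1, 256, 40, 40))
print(out.shape)  # torch.Size([1, 256, 40, 40])
```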

References

  • MobileNetV2: Pretty good post about MobileNetV2. The author also mentions the implementation on iOS.