Object Detection on Mobile Devices

Introduction

I spent six weeks working on object detection for mobile devices: not quite enough time to dive deep into model compression or pruning techniques, but sufficient to experiment with several detection models and discover what actually works in practice.

Let’s start by examining the anatomy of a detection model. A detector like RetinaNet typically comprises three components, stacked from bottom to top:

  1. Backbone: extracts features from the input image.
  2. Neck: aggregates backbone feature maps across scales and feeds them to the final stage. In RetinaNet this is a feature pyramid network (FPN); one-stage detectors like RetinaNet have no separate region-proposal module.
  3. Head predictor: outputs bounding boxes and class labels.

In this note, I’ll discuss all three components in the context of mobile deployment. We’ll assume Android as our target platform, where the deep learning frameworks support common layers but differ subtly from desktop frameworks like TensorFlow and PyTorch. This means we face an additional challenge: converting models trained on workstations into formats that mobile devices can actually run.

Backbones

Here are the backbones I experimented with:

  • ResNet-50
  • ResNet-36
  • MobileNet v1
  • MobileNet v2
  • SqueezeNet

Conclusions and Key Takeaways

  1. Annotation quality matters. Small networks lack the capacity to learn from noisy labels. If you control the training dataset, ensure annotations are consistent and accurate.

  2. Classification performance doesn’t translate directly to detection. Lightweight networks behave differently across tasks. Run proper experiments before committing to a particular backbone.

  3. GroupNorm works well for detection. Use it if your framework supports it.

  4. The FPN and head predictor can dominate compute costs. Some tricks to reduce memory and improve inference time:

    • Remove certain octave scales, especially FPN level 3.
    • Replace standard Conv layers in FPN branches with depthwise separable convolutions.
    • Reduce anchor scales, aspect ratios, or even the number of classes if your application permits.

  5. Depthwise convolution is a double-edged sword. In theory, it’s great for replacing standard convolutions. In practice, performance depends heavily on the underlying implementation; you may see no speedup at all.

  6. Framework differences will bite you. Even ostensibly standard layers can have subtly different implementations across frameworks. When converting models, verify the logic carefully. You may need to retrain with the mobile framework’s exact settings.
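
A small PyTorch sketch of takeaway 3: swapping BatchNorm for GroupNorm in a conv block. GroupNorm's statistics do not depend on batch size, which suits the small batches typical when training detection models (the channel and group counts here are illustrative):

```python
# Conv block using GroupNorm in place of BatchNorm (PyTorch sketch).
# GroupNorm normalizes over channel groups, so it behaves the same
# regardless of batch size.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1, bias=False),
    nn.GroupNorm(num_groups=32, num_channels=64),  # instead of nn.BatchNorm2d(64)
    nn.ReLU(inplace=True),
)
out = block(torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```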
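
The savings from the tricks in takeaway 4 are easy to quantify with anchor-count arithmetic. Assuming the common RetinaNet-style setup of FPN levels P3 to P7 (strides 8 to 128) with 9 anchors per location (3 scales × 3 aspect ratios) on a 640×640 input, your configuration may differ:

```python
# Rough anchor counts for a RetinaNet-style head on a 640x640 input,
# assuming strides 8-128 and 9 anchors per feature-map location.

def anchors(strides, size=640, per_loc=9):
    return sum((size // s) ** 2 for s in strides) * per_loc

full = anchors([8, 16, 32, 64, 128])  # all FPN levels
no_p3 = anchors([16, 32, 64, 128])    # drop the stride-8 level (P3)
print(full, no_p3)                    # 76725 19125
```

Dropping the finest level alone removes roughly three quarters of the anchors the head must score, which is why it is singled out above.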
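
The theoretical upside in takeaway 5 is also easy to put numbers on. Back-of-the-envelope parameter counts for a standard 3×3 convolution versus a depthwise separable replacement (depthwise 3×3 followed by pointwise 1×1), ignoring bias; the channel counts are illustrative:

```python
# Parameter counts: standard conv vs. depthwise separable conv.
# The ~9x theoretical reduction often fails to materialize as a speedup
# because it depends on the runtime's depthwise implementation.

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out  # depthwise + pointwise

std = conv_params(3, 256, 256)       # 589824
sep = separable_params(3, 256, 256)  # 67840
print(std, sep, round(std / sep, 1))  # 589824 67840 8.7
```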