MobileOne

Right now I’m working on lightweight detection models for mobile devices. Apple has just released a new backbone called MobileOne (paper, code), and it looks promising. Although adopting a completely new backbone at the very end of the project is a bit tricky, its performance (at least on paper) convinces me to give it a try.

Main Ideas

  • Decoupling train-time and inference-time architectures: use a linearly over-parameterized model at train time and re-parameterize the linear structures at inference.
    • Introduce trivial over-parameterization branches.
  • Relaxing regularization throughout training to prevent the small-capacity lightweight models from being over-regularized.
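The re-parameterization trick hinges on the linearity of these branches: parallel linear maps applied to the same input can be collapsed into a single map by summing their weights. A minimal numpy sketch of the idea (shapes and variable names are my own, not from the paper):

```python
import numpy as np

# Two parallel 1x1-conv-like branches acting on the same input are linear maps,
# so their sum equals a single branch whose weights are the sum of both.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))    # batch of 8 feature vectors, 16 channels
W1 = rng.standard_normal((16, 16))  # branch 1 weights
W2 = rng.standard_normal((16, 16))  # branch 2 weights (the "trivial" extra branch)

train_time = x @ W1 + x @ W2        # multi-branch forward pass at train time
W_merged = W1 + W2                  # re-parameterized single branch
inference_time = x @ W_merged       # single-branch forward pass at inference

assert np.allclose(train_time, inference_time)
```

The same argument extends to convolutions (which are linear) once batch-norm statistics are folded into the weights, which is why the inference graph can be branch-free without changing the function the network computes.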

Observations & Proposed methods

Model Design

Info

Latency is moderately correlated with FLOPs and weakly correlated with parameter counts.

  • No surprise to me at all.
  • Good old ReLU is still a good choice for designing lightweight models.
Info

Two key factors that affect runtime performance are memory access cost and degree of parallelism.

  • Memory access cost increases in multi-branch architectures because the activations from each branch have to be stored to compute the next tensor in the graph.

  • Avoid using Squeeze-and-Excite blocks because they force synchronization.

  • The MobileOne block is a residual block consisting of:

    • k branches of depthwise convolutional layers.
    • A pointwise layer.
  • The main point is that during inference, the model is re-parameterized so that there are no branches left in the model. Details of the method are described in the MobileOne Block section.
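To make the branch-folding concrete for the depthwise case, here is a sketch of collapsing k parallel depthwise 3×3 branches plus the residual skip into a single depthwise kernel. This is illustrative only: the naive conv loop, shapes, and the omission of batch-norm folding are my simplifications, not the paper's implementation.

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Naive depthwise 3x3 conv with zero padding. x: (C, H, W), kernels: (C, 3, 3)."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + 3, j:j + 3] * kernels[c])
    return out

rng = np.random.default_rng(0)
C, H, W, k = 4, 6, 6, 3
x = rng.standard_normal((C, H, W))

# k parallel depthwise 3x3 branches plus an identity skip connection.
branches = rng.standard_normal((k, C, 3, 3))
identity = np.zeros((C, 3, 3))
identity[:, 1, 1] = 1.0  # identity map expressed as a centered 3x3 kernel

train_time = sum(depthwise_conv(x, b) for b in branches) + x

# Re-parameterize: fold all branches (and the skip) into one kernel per channel.
merged = branches.sum(axis=0) + identity
inference_time = depthwise_conv(x, merged)

assert np.allclose(train_time, inference_time)
```

The skip connection folds in cleanly because the identity is itself a depthwise conv with a centered delta kernel; the 1×1 branch of the real block folds in the same way after zero-padding it to 3×3.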

Training

  • Apply a cosine annealing schedule to the weight decay coefficient.
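Annealing weight decay means the model is regularized strongly early on and only lightly toward the end of training. A sketch of the standard cosine schedule applied to the decay coefficient (the start/end values here are my own placeholders, not the paper's):

```python
import math

def cosine_annealed_wd(step, total_steps, wd_max=1e-4, wd_min=1e-5):
    """Cosine-anneal weight decay from wd_max down to wd_min over training.
    wd_max/wd_min are illustrative values, not taken from the paper."""
    t = step / total_steps
    return wd_min + 0.5 * (wd_max - wd_min) * (1 + math.cos(math.pi * t))

# Starts at wd_max, ends at wd_min.
assert abs(cosine_annealed_wd(0, 100) - 1e-4) < 1e-12
assert abs(cosine_annealed_wd(100, 100) - 1e-5) < 1e-12
```

This mirrors the usual cosine learning-rate schedule, just targeting the regularization strength instead of the step size.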

Other techniques mentioned:

  • Progressive learning curriculum.
  • AutoAugment.
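For the progressive curriculum, one common realization (e.g. in EfficientNetV2-style training) is to ramp up the training image resolution over epochs; whether MobileOne uses exactly this schedule is not stated here, so treat the values below as my own illustration:

```python
def progressive_image_size(epoch, total_epochs, size_min=128, size_max=224):
    """Linearly ramp training resolution from size_min to size_max.
    Illustrative schedule; the specific sizes are my own assumptions."""
    frac = epoch / max(total_epochs - 1, 1)
    size = size_min + frac * (size_max - size_min)
    return int(round(size / 32) * 32)  # snap to a multiple of 32 for conv strides

assert progressive_image_size(0, 10) == 128
assert progressive_image_size(9, 10) == 224
```

Early low-resolution epochs are cheap and act as a weak regularizer; the final high-resolution epochs match the deployment input size.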