Object Detection on Mobile Devices
Introduction
I spent six weeks working on object detection for mobile devices—not quite enough time to dive deep into model compression or pruning techniques, but sufficient to experiment with several detection models and discover what actually works in practice.
Let’s start by examining the anatomy of a detection model. A one-stage detector like RetinaNet typically comprises three components, stacked from bottom to top:
- Backbone: extracts features from the input image.
- Neck (a Feature Pyramid Network in RetinaNet): combines backbone feature maps across scales and feeds them to the final stage.
- Head Predictor: outputs bounding boxes and class labels.
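To make the stacking concrete, here is a minimal shape-bookkeeping sketch of the three stages. The strides, channel count, anchor count, and class count are illustrative assumptions, not the exact RetinaNet configuration:

```python
# Schematic shape flow for a RetinaNet-style detector (illustrative numbers).

def backbone(image_shape):
    """Downsample the input by 8/16/32 to produce three feature maps."""
    h, w = image_shape
    return [(h // s, w // s) for s in (8, 16, 32)]

def neck(feature_shapes):
    """An FPN keeps spatial sizes but unifies channel counts across levels."""
    return [(256, h, w) for (h, w) in feature_shapes]

def head(fpn_shapes, num_anchors=9, num_classes=80):
    """Per level, predict num_anchors boxes (4 coords) and class scores per cell."""
    return [
        {"boxes": (num_anchors * 4, h, w),
         "scores": (num_anchors * num_classes, h, w)}
        for (_, h, w) in fpn_shapes
    ]

outputs = head(neck(backbone((512, 512))))
for level in outputs:
    print(level)
```

Running this prints one dict per pyramid level; note how the head’s output size scales directly with the number of anchors and classes, which matters later when we discuss trimming the head.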
In this note, I’ll discuss all three components in the context of mobile deployment. We’ll assume Android as our target platform, where the deep learning frameworks support common layers but differ subtly from desktop frameworks like TensorFlow and PyTorch. This means we face an additional challenge: converting models trained on workstations into formats that mobile devices can actually run.
Backbones
Here are the backbones I experimented with:
- ResNet-50
- ResNet-34
- MobileNet v1
- MobileNet v2
- SqueezeNet
Conclusions and Key Takeaways
- Annotation quality matters. Small networks lack the capacity to learn from noisy labels. If you control the training dataset, ensure annotations are consistent and accurate.
- Classification performance doesn’t translate directly to detection. Lightweight networks behave differently across tasks. Run proper experiments before committing to a particular backbone.
- GroupNorm works well for detection. Use it if your framework supports it.
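One reason GroupNorm suits detection is that its statistics are computed per sample rather than per batch, so it stays stable at the small batch sizes detection training forces on you. A minimal sketch of the normalization step, over a `(channels, positions)` nested list (no learned affine parameters, which real implementations add):

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Minimal GroupNorm over a (channels, positions) nested list.

    Channels are split into num_groups groups; each group is normalized
    by the mean/variance of all its own values, independent of batch size.
    """
    channels = len(x)
    assert channels % num_groups == 0
    group_size = channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * group_size:(g + 1) * group_size]
        flat = [v for channel in group for v in channel]
        mean = sum(flat) / len(flat)
        var = sum((v - mean) ** 2 for v in flat) / len(flat)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in channel] for channel in group])
    return out
```

After normalization, each group has (approximately) zero mean and unit variance regardless of how many images are in the batch.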
- The FPN and head predictor can dominate compute costs. Some tricks to reduce memory and improve inference time:
  - Remove certain octave scales, especially FPN level 3.
  - Replace standard Conv layers in FPN branches with depthwise separable convolutions.
  - Reduce anchor scales, aspect ratios, or even the number of classes if your application permits.
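A bit of arithmetic shows why these trims pay off. The sketch below counts head predictions under illustrative, assumed settings (square input, 5 pyramid levels, 9 anchors, 80 classes for the baseline); dropping the finest level and cutting anchors and classes shrinks the output tensor by over an order of magnitude:

```python
def head_outputs(image_size, fpn_strides, num_anchors, num_classes):
    """Count box + class predictions a RetinaNet-style head emits.

    Assumes a square input and one shared anchor set per cell.
    """
    total = 0
    for s in fpn_strides:
        cells = (image_size // s) ** 2  # grid cells at this pyramid level
        total += cells * num_anchors * (4 + num_classes)
    return total

# Baseline: 5 pyramid levels, 9 anchors, 80 classes.
base = head_outputs(512, [8, 16, 32, 64, 128], 9, 80)
# Drop the finest level (stride 8) and trim anchors/classes.
slim = head_outputs(512, [16, 32, 64, 128], 6, 20)
print(base, slim, base / slim)
```

The finest level dominates because its grid has 4x the cells of the next level, which is why removing it is the single biggest win.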
- Depthwise convolution is a double-edged sword. In theory, it’s great for replacing standard convolutions. In practice, performance depends heavily on the underlying implementation—you may see no speedup at all.
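The “in theory” part is easy to quantify. Counting multiply-accumulates for an assumed 3x3 layer at 256 channels shows roughly a 9x reduction on paper; the catch is that depthwise kernels are often memory-bound, so wall-clock speedups depend on how well the mobile runtime implements them:

```python
def conv_flops(h, w, c_in, c_out, k=3):
    """Multiply-accumulates for a standard k x k conv (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k

def dw_separable_flops(h, w, c_in, c_out, k=3):
    """Depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

std = conv_flops(64, 64, 256, 256)
sep = dw_separable_flops(64, 64, 256, 256)
print(std / sep)  # the theoretical ratio is 1 / (1/c_out + 1/k^2), about 8.7 here
```

Benchmark on the actual target device before committing; a layer with 9x fewer MACs can still run at the same speed if the kernel launch and memory traffic dominate.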
- Framework differences will bite you. Even ostensibly standard layers can have subtly different implementations across frameworks. When converting models, verify the logic carefully. You may need to retrain with the mobile framework’s exact settings.