Libra R-CNN

Pang, Jiangmiao, et al. "Libra R-CNN: Towards Balanced Learning for Object Detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

Introduction

In the object detection community, training pipelines often take a back seat to network architecture and inference optimization. This paper investigates an overlooked aspect of CNN-based detection models: the imbalance phenomenon. The authors decompose this issue into three distinct levels:

  1. Sample level.
  2. Feature level.
  3. Objective level.

These correspond to three major components of a detection model: region proposal sampling, feature extraction, and the training objective. Building on this categorization, the authors propose the following improvements:

  1. IoU-balanced sampling.
  2. Balanced feature pyramid.
  3. Balanced L1 loss.

So, everything is balanced now.

The authors draw a pretty nice figure to demonstrate their points.

Review and Analysis

Let’s examine each aspect of training imbalance through the lens of this paper.

Sample-level imbalance

Several factors can cause training data to become imbalanced:

  • Data distribution: Bias becomes severe when the training data favors certain viewpoints, poses, or object shapes. The model must focus on hard positive samples to generate meaningful gradients and thus learn to generalize; otherwise, easy samples dominate and drive gradients toward zero. These challenging cases, known as hard positives, are not examined in this paper.

  • Data sampling: Two-stage detectors rely on sampling strategies during training. Despite small batch sizes (2 or 4 images) and relatively few ground truth boxes (even with 80 COCO categories, not all images contain many objects), the sampler typically generates thousands of regions. Consequently, easy negative samples overwhelm the training set.

  • Existing solutions: This well-known problem has spawned two noteworthy approaches:

    • OHEM (Online Hard Example Mining): Rather than freezing the network, computing hard negatives, augmenting the training set, and resuming, OHEM computes all ROIs in a batch and selects hard negatives on the fly.

    • Focal loss takes a fundamentally different approach by reshaping the standard cross-entropy loss to down-weight well-classified examples. The authors note that Focal Loss shows only modest improvement in R-CNN settings. I have only tested it on one-stage models. Interestingly, the YOLOv3 authors also reported limited success with Focal Loss on their architecture.
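The down-weighting idea mentioned above is compact enough to sketch. This is the standard binary form with the usual α and γ knobs (defaults taken from the RetinaNet paper), not the exact code of any of the papers discussed here:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross entropy scaled by (1 - p_t)^gamma,
    so well-classified examples contribute almost nothing."""
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With gamma=0 this reduces to alpha-weighted cross entropy; raising gamma pushes the loss of easy examples toward zero, which is exactly the reshaping the paper refers to.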

Feature-level imbalance

This observation about FPN and PANet, region proposal methods that employ multi-scale feature maps, is intriguing:

The methods inspire us that the low-level and high-level information are complementary for object detection. The approach that how they are utilized to integrate the pyramidal representations determines the detection performance. … Our study reveals that the integrated features should possess balanced information from each resolution. But the sequential manner in the aforementioned methods will make integrated feature focus more on adjacent resolution but less on others. The semantic information contained in non-adjacent levels would be diluted once per fusion during the information flow.

Frankly, this section feels underdeveloped. The authors present no experiments to substantiate their claims. Moreover, while they credit inspiration from prior methods, they neglect to explain how these insights led to their proposed approach.

Objective-level imbalance

Modern object detection models tackle two tasks simultaneously: label classification and bounding box regression. Because the two tasks differ in difficulty and data distribution, the combined objective can end up poorly balanced. For instance, when box regression dominates, the model achieves strong localization but poor class prediction.

Easy-versus-hard sample imbalance also influences gradient dynamics. When easy samples dominate a batch, gradients become saturated with uninformative signals. Curiously, "easy" does not mean devoid of learning potential. Rather, once the model identifies the "easy" discriminative feature, it tends to ignore other visual cues, essentially fixating on particular positions or features that simplify the task.


In my view, sample and feature imbalance represent the most critical challenges. CNNs can learn diverse viewpoints given adequate training data, but we cannot feasibly capture every object from every conceivable angle. Furthermore, annotation quality varies dramatically with image characteristics. Stock photography and product images yield crisp, accurate bounding boxes; smartphone snapshots and random internet images tell an entirely different story. One glance at COCO annotations confirms this reality.

Proposed Methods

Balanced Feature Pyramid

The algorithm for this method proceeds as follows:

  1. Rescaling: Resize all feature maps to a single intermediate size using interpolation and max-pooling.
  2. Integrating: Sum the rescaled features and normalize (i.e., average them).
  3. Refining: Apply convolutions directly, or use a non-local module such as Gaussian non-local attention.
  4. Strengthening: Rescale the refined feature back to the original resolutions to strengthen the original features.

We can interpret these steps as applying a pooling layer to form a high-level feature, which resembles the final pooling stage in image retrieval. In other words, the goal is to raise the abstraction level of the features.
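The four steps can be sketched in a few lines. This is a minimal reading of the procedure, not the authors' code: I use nearest-neighbor resizing for both directions (the paper uses interpolation upward and max-pooling downward), leave the refining step as an identity placeholder, and add the result back to each level residually:

```python
import numpy as np

def _resize(feat, size):
    """Nearest-neighbor resize of a (C, H, W) map; stand-in for
    interpolation / max-pooling."""
    c, h, w = feat.shape
    ys = np.arange(size[0]) * h // size[0]
    xs = np.arange(size[1]) * w // size[1]
    return feat[:, ys][:, :, xs]

def balanced_feature_pyramid(feats, mid_level=1):
    """feats: list of (C, H_l, W_l) maps, ordered high-res -> low-res."""
    mid_size = feats[mid_level].shape[1:]
    # 1. Rescaling: bring every level to the intermediate size
    rescaled = [_resize(f, mid_size) for f in feats]
    # 2. Integrating: average the rescaled maps
    integrated = np.mean(rescaled, axis=0)
    # 3. Refining: identity placeholder; the paper uses convs or non-local attention
    refined = integrated
    # 4. Strengthening: rescale back and add to each original level
    return [f + _resize(refined, f.shape[1:]) for f in feats]
```

The key property is that every pyramid level contributes equally to the integrated feature, regardless of its distance from the chosen intermediate level.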

Balanced L1 Loss

The full formulation of the loss is given in the paper. In summary, the authors want to: (1) cap the gradient of the box regression to keep it balanced against the classification gradient, and (2) promote the gradient of the easy samples.
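My reading of the formulation, as a sketch: for |x| < 1 the loss is (α/b)(b|x| + 1)ln(b|x| + 1) − α|x|, so the gradient α·ln(b|x| + 1) grows for small errors, and beyond |x| = 1 it is linear with slope γ, capping the regression gradient. The defaults α = 0.5, γ = 1.5 follow the paper; the continuity constant C is my own derivation from matching the two branches at |x| = 1:

```python
import numpy as np

def balanced_l1_loss(x, alpha=0.5, gamma=1.5):
    """Balanced L1 loss on regression residuals x (my reading of the paper)."""
    x = np.abs(x)
    b = np.exp(gamma / alpha) - 1.0  # so that alpha * ln(b + 1) == gamma
    # constant making the two branches meet at |x| = 1
    C = alpha / b * (b + 1) * np.log(b + 1) - alpha - gamma
    loss_small = alpha / b * (b * x + 1) * np.log(b * x + 1) - alpha * x
    loss_large = gamma * x + C
    return np.where(x < 1.0, loss_small, loss_large)
```

Compared with smooth L1, the inlier branch contributes a noticeably larger gradient, which is how the loss "promotes" accurate samples without letting outliers dominate.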

Experiment results

From the ablation experiments in Table 2, there are some interesting observations:

  • In general, combining the three methods dramatically improves the average precision on large objects, while the effect on small objects is minor. In my opinion, small objects remain the most difficult aspect of improving detection models.
  • IoU-balanced sampling and Balanced L1 Loss clearly improve the average precision at IoU=0.75, which means they produce boxes closer to the ground truth.
  • The same trend can also be seen on RetinaNet, where the authors used only two of the methods (Balanced Feature Pyramid and Balanced L1 Loss). Again, the proposed methods improve the overall performance by quite a large margin (+5%), especially on large objects.

Implementation

In the next two weeks, I will implement the balanced feature pyramid and the balanced L1 loss. I am not sure whether I will have time for the IoU sampler, since my focus is on RetinaNet, which naturally does not use a sampler (though we can trick it a bit and still utilize the component). Even though the authors have already released source code in PyTorch, I have to rewrite the whole thing in Caffe2 and Detectron. It may take a while.
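For the record, my understanding of the IoU-balanced sampling I may or may not get to: split the negatives' IoU range into K bins and sample uniformly from each, so hard negatives (higher IoU) are no longer drowned out by the abundant near-zero-IoU ones. A sketch under that reading, where the bin count and function names are mine:

```python
import numpy as np

def iou_balanced_sample(ious, n_samples, n_bins=3, seed=0):
    """Sample negative candidates roughly uniformly over K IoU bins
    instead of uniformly over all candidates."""
    rng = np.random.default_rng(seed)
    ious = np.asarray(ious)
    edges = np.linspace(0.0, ious.max() + 1e-6, n_bins + 1)
    per_bin = n_samples // n_bins
    picked = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.flatnonzero((ious >= lo) & (ious < hi))
        if len(idx) > 0:
            take = min(per_bin, len(idx))
            picked.extend(rng.choice(idx, size=take, replace=False))
    return np.array(picked)
```

For RetinaNet this would have to be grafted onto the anchor-matching step rather than a proposal sampler, which is the "trick" I alluded to above.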


Weighted Component Loss

In a couple of experiments, I have found a big gap between box precision and concept precision. My hypothesis is that the box regression loss dominates the overall loss of the model. To test this, I halved the weight of the box regression term and retrained the model. Here are the results:

| Model | Concept Recall | Concept Precision |
| --- | --- | --- |
| resnet36_tiny12_v0800 (baseline, size 256) | 0.4000 | 0.4445 |
| resnet36_tiny15_v0900 | 0.4122 (+3.05%) | 0.4685 (+5.40%) |
| resnet36_tiny14_v0800 (baseline, size 320) | 0.4194 | 0.4651 |
| resnet36_tiny16_v0800 | 0.3906 (-6.8%) | 0.4984 (+7.16%) |

tiny12 and tiny15 are basically identical, except that tiny15 uses a smaller loss weight for box regression (0.5 instead of 1.0); the same relationship holds between tiny14 and tiny16. The results show that balancing the loss components, even with this naive approach, does help overall performance. In the second setting, however, the gain is hard to judge from recall and precision alone; I had better use mAP instead.