Libra R-CNN

Pang, Jiangmiao, et al. "Libra R-CNN: Towards Balanced Learning for Object Detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

Introduction

Regarding the object detection problem, it seems like the community pays less attention to the training pipeline than to other aspects such as network design and inference improvement. This paper investigates a current problem of CNN detection models: the imbalance phenomenon, which is composed of 3 levels:

  1. Sample level.
  2. Feature level.
  3. Objective level.

These basically correspond to the 3 major components of a detection model: region proposal sampling, feature extraction, and the predictors. Based on this categorization, they propose the following improvements:

  1. IoU-balanced sampling.
  2. Balanced feature pyramid.
  3. Balanced L1 loss.

So, everything is balanced now.

The authors draw a pretty nice figure to demonstrate their points.

Review and Analysis

Now, let's take a look at the 3 aspects of imbalanced training from the point of view of this paper.

Sample level imbalance

There are many circumstances in which the training data, especially the easy samples, becomes imbalanced:

  • Data distribution: the imbalance becomes severe if the training data is biased toward certain viewpoints, poses, or object shapes. The model needs to focus on the hard positive samples in order to gain more gradient, and thus be able to learn and generalize. Otherwise, it keeps learning from easy samples, whose gradients are almost 0. These cases are considered hard positives, which are not examined in this paper.

  • Data sampling: two-stage detectors rely on region sampling to train the model. The number of images per batch is small (2 or 4), and the number of ground-truth boxes is also relatively small; considering the COCO dataset, even though we have 80 labels, not all of the images have a large number of boxes. Meanwhile, the sampler usually generates thousands of regions, hence the easy negative samples dominate the whole set.

  • This is a well-known problem in object detection, hence many papers have tried to tackle it. There are 2 methods worth mentioning (a sketch of both follows this list):

    • OHEM: instead of freezing the network, computing the hard negative examples, adding them to the training set, and continuing to train the model, OHEM directly computes the loss for all the RoIs in a batch and selects the hard examples from them.

    • Focal loss takes a whole different approach. It reshapes the standard cross-entropy loss such that it down-weights the loss assigned to well-classified examples. The authors argue that focal loss shows little improvement on R-CNN. It may be true; I have only tested this loss on one-stage detection models. However, YOLOv3's authors also mentioned that focal loss does not work well on their model.
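To make both ideas concrete, here is a minimal PyTorch sketch, assuming per-RoI classification logits and integer labels; the function names, the constant-alpha weighting in the focal term, and the hard-example budget are my own simplifications, not code from either paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Per-sample cross entropy, down-weighted by (1 - p_t)^gamma so that
    # well-classified samples (high p_t) contribute little to the loss.
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # probability assigned to the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

def ohem_loss(logits, targets, num_hard=128):
    # OHEM-style selection: score every RoI by its loss and backpropagate
    # only through the top-k hardest ones.
    ce = F.cross_entropy(logits, targets, reduction="none")
    return ce.topk(min(num_hard, ce.numel())).values.mean()
```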

Feature level imbalance

This quote, where they talk about FPN and PANet (feature pyramid methods that fuse multi-scale feature maps), is interesting:

The methods inspire us that the low-level and high-level information are complementary for object detection. The approach that how they are utilized to integrate the pyramidal representations determines the detection performance. … Our study reveals that the integrated features should possess balanced information from each resolution. But the sequential manner in the aforementioned methods will make integrated feature focus more on adjacent resolution but less on others. The semantic information contained in non-adjacent levels would be diluted once per fusion during the information flow.

This section of the paper is quite weak and not convincing. They do not mention any experiments to justify these arguments. In addition, they say they were inspired by the aforementioned methods but do not elaborate on how they came up with the proposed method.

Objective level imbalance

Nowadays, a typical object detection model carries two tasks: label classification and box regression. Depending on the difficulty and the distribution of the training data, the ultimate objective may not be integrated well from the two separate losses. For example, if the box regression loss dominates, the classification is compromised, which leads to high performance on box results but poor performance on concept results.

Imbalance between easy and hard samples also affects the gradient of the model. If the easy samples make up the majority of the batch, the gradient is still dominated by the hard examples, and thus the model learns nothing from the easy samples. Despite being called "easy samples", that does not mean there is nothing to learn from them. The reason is that once the model is able to spot the "easy" feature in the image, it discards the remaining visual features in the image. In other words, it only looks at some particular position/feature in the image which makes the sample easy to learn.


In my opinion, sample imbalance and feature imbalance are the most important aspects we have to deal with. It seems like a CNN can learn different viewpoints as long as we provide proper training data. Nonetheless, we cannot feed it a huge amount of data covering every object from every single angle. Secondly, the annotation quality changes dramatically with the quality and characteristics of the image. For example, stock images and product images get clear, accurate bounding-box labels. Photos taken with smartphones or random images from the Internet, on the other hand, are a completely different story. By looking at the annotations of the COCO dataset, you know what I mean.

Proposed Methods

Balanced Feature Pyramid

The algorithm for this method is described as follows:

  1. Rescaling: resize all feature maps to one intermediate size using interpolation and max-pooling.
  2. Integrating: sum all the rescaled features and normalize (average over levels).
  3. Refining: refine the integrated feature directly with convolutions or with a non-local module such as Gaussian non-local attention.
  4. Strengthening: rescale the obtained feature back to the original resolutions and add it to the original features in a residual fashion.

We can interpret those steps as applying a pooling layer to form a high-level feature, which resembles the final pooling stage in image retrieval. Hence, it is meant to improve the abstraction level of the feature.
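A minimal PyTorch sketch of these four steps, assuming FPN levels with 256 channels each, ordered from highest to lowest resolution; I use a plain 3x3 convolution for the refining step, whereas the paper also reports a non-local block:

```python
import torch.nn.functional as F
from torch import nn

class BalancedFeaturePyramid(nn.Module):
    def __init__(self, channels=256, refine_level=1):
        super().__init__()
        self.refine_level = refine_level  # level giving the intermediate size
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats):  # feats: list of [N, C, Hi, Wi], high -> low res
        size = feats[self.refine_level].shape[2:]
        # 1. Rescaling: max-pool the finer levels, interpolate the coarser ones.
        gathered = [
            F.adaptive_max_pool2d(f, size) if i < self.refine_level
            else F.interpolate(f, size=size, mode="nearest")
            for i, f in enumerate(feats)
        ]
        # 2. Integrating: sum and normalize by the number of levels.
        fused = sum(gathered) / len(gathered)
        # 3. Refining: a 3x3 conv here (non-local attention in the paper).
        fused = self.refine(fused)
        # 4. Strengthening: rescale back and add residually to each level.
        return [
            f + (F.interpolate(fused, size=f.shape[2:], mode="nearest")
                 if i < self.refine_level
                 else F.adaptive_max_pool2d(fused, f.shape[2:]))
            for i, f in enumerate(feats)
        ]
```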

Balanced L1 Loss

The whole formulation of the loss can be seen in the paper. In summary, the authors want to: (1) cap the gradient of the box regression in order to balance it with the classification gradient, and (2) increase the gradient contribution of the easy samples.
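A small sketch of the piecewise formulation as I read it from the paper (defaults alpha = 0.5, gamma = 1.5): the gradient is alpha * ln(b|x| + 1) for inliers (|x| < 1) and capped at gamma for outliers, and b is fixed by alpha * ln(b + 1) = gamma so that both the gradient and the loss are continuous at the boundary:

```python
import math
import torch

def balanced_l1_loss(pred, target, alpha=0.5, gamma=1.5):
    b = math.expm1(gamma / alpha)  # e^(gamma/alpha) - 1
    x = (pred - target).abs()
    inlier = alpha / b * (b * x + 1) * torch.log(b * x + 1) - alpha * x
    outlier = gamma * x + gamma / b - alpha
    return torch.where(x < 1, inlier, outlier).sum()
```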

Experiment results

From the results of the ablation experiments in Table 2, there are some interesting observations:

  • In general, combining the 3 methods dramatically improves the average precision of large objects. However, there is not much effect on the small objects. In my opinion, small objects are still the most difficult aspect of improving detection models.
  • IoU-balanced sampling and Balanced L1 Loss clearly help to improve the average precision at IoU=0.75, which means they produce boxes closer to the ground truth.
  • The same trend can also be seen on RetinaNet, to which the authors apply only two methods (Balanced Feature Pyramid and Balanced L1 Loss). Again, the proposed methods improve the overall performance by quite a large margin (+5%), especially on the large objects.

Implementation

In the next two weeks, I will implement the balanced feature pyramid and the balanced L1 loss. I am not sure if I will have time for the IoU sampler, since my focus is on RetinaNet, which naturally does not use a sampler (but we can trick it a bit and utilize the component). Even though the authors have already released source code using PyTorch, I have to rewrite the whole thing with Caffe2 and Detectron. It may take a while.
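In case I do get to the sampler, here is a rough sketch of how I understand the paper's IoU-balanced sampling of negatives (the bin count and helper name are my own choices):

```python
import numpy as np

def iou_balanced_negatives(ious, num_expected, num_bins=3, max_iou=0.5):
    # Split the negative IoU range [0, max_iou) into equal bins and draw an
    # equal share from each, so the harder negatives (higher IoU) are not
    # drowned out by the huge pool of easy ones near IoU = 0.
    edges = np.linspace(0.0, max_iou, num_bins + 1)
    per_bin = num_expected // num_bins
    chosen = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.flatnonzero((ious >= lo) & (ious < hi))
        take = min(per_bin, idx.size)
        if take > 0:
            chosen.append(np.random.choice(idx, take, replace=False))
    return np.concatenate(chosen) if chosen else np.empty(0, dtype=np.int64)
```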


Weighted Component Loss

In a couple of experiments, I have found that there is a big gap between box precision and concept precision. So, my hypothesis is that the box regression loss actually dominates the whole loss of the model. From that, I halve the weight of the box regression and train the model. Here are the results:

| Model | Concept Recall | Concept Precision |
| --- | --- | --- |
| resnet36_tiny12_v0800 (baseline, size 256) | 0.4000 | 0.4445 |
| resnet36_tiny15_v0900 | 0.4122 (+3.05%) | 0.4685 (+5.40%) |
| resnet36_tiny14_v0800 (baseline, size 320) | 0.4194 | 0.4651 |
| resnet36_tiny16_v0800 | 0.3906 (-6.8%) | 0.4984 (+7.16%) |

tiny12 and tiny15 are basically the same, except tiny15 uses a smaller loss weight for box regression (0.5 instead of 1.0). The same settings are applied to tiny14 and tiny16, respectively. From the results, we can see that balancing the loss components, even with this naive approach, indeed helps the overall performance. However, with the second setting it is difficult to observe the performance gain; I had better use mAP instead.
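For the record, the change itself is just a scalar on one loss component; a trivial sketch of what tiny15 and tiny16 do differently from their baselines (the function name is mine):

```python
def detection_loss(cls_loss, box_loss, box_weight=0.5):
    # The baselines use box_weight = 1.0; halving it keeps the box
    # regression gradient from dominating the classification gradient.
    return cls_loss + box_weight * box_loss
```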