Partial Annotation in Object Detection

In this post, I discuss two papers that tackle the challenge of partially annotated datasets. But first, why should we care about missing annotations in detection? For starters, labeling bounding boxes is tedious and error-prone. Expanding the taxonomy only amplifies this burden. Consider this scenario: you have a training dataset with 20 categories, and later want to incorporate 10 new ones. Must you re-annotate the entire dataset? Or can some clever technique handle this automatically? With the emergence of the Open Images Dataset, containing a staggering number of images and annotations, the community has grown increasingly interested in this problem. Here are two papers I found particularly illuminating:

  1. Wu, Zhe, et al. “Soft Sampling for Robust Object Detection.” arXiv preprint arXiv:1806.06986 (2018).
  2. Niitani, Yusuke, et al. “Sampling Techniques for Large-Scale Object Detection from Sparsely Annotated Objects.” arXiv preprint arXiv:1811.10862 (2018).

In the first paper, the authors investigate how robust object detection systems are when annotations go missing. I have explored this phenomenon myself with COCO-like datasets, though the authors take a far more systematic approach to their experiments. Their conclusion is intriguing:

“we observe that after dropping 30% of the annotations (and labeling them as background), the performance of CNN-based object detectors like Faster-RCNN only drops by 5% on the PASCAL VOC dataset.”

Here is the catch: this conclusion holds when the detection threshold is set to 0, which is hardly practical for real-world applications. Any production system must use higher thresholds to achieve acceptable precision/recall trade-offs. Indeed, at thresholds above 0.4, we observe a significant mAP drop, which makes perfect sense. To their credit, the authors acknowledge in Section 4 that “it is important for practitioners to tune the detection threshold per class when using detectors trained on missing labels.”
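Per-class threshold tuning can be as simple as filtering a model's raw detections against a dictionary of class-specific cutoffs. A minimal sketch (the function and argument names are mine, not from the paper):

```python
import numpy as np

def filter_detections(boxes, scores, labels, thresholds):
    """Keep only detections whose score clears their class-specific threshold.

    boxes:      (N, 4) array of detected boxes
    scores:     (N,) confidence scores
    labels:     (N,) integer class ids
    thresholds: dict mapping class id -> detection threshold for that class
    """
    keep = np.array([s >= thresholds[l] for s, l in zip(scores, labels)],
                    dtype=bool)
    return boxes[keep], scores[keep], labels[keep]
```

In practice the per-class thresholds would be chosen on a validation set, e.g. by sweeping each class's threshold against its precision/recall curve.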

A telling illustration appears in Figure 2, which charts performance changes on the trainval and test sets of VOC2007 across various detection thresholds.

One important experimental detail: they drop ground-truth boxes uniformly across all classes, which is quite different from the scenario of adding new categories to an existing model. The former preserves the taxonomy; the latter fundamentally reshapes the entire label structure.

Now for the proposed method. First, they advocate hard example mining to address missing annotations, the rationale being that hard example mining naturally steers away from randomly sampling unannotated regions. They then introduce a gradient weighting function based on IoU overlap. At this point, astute readers will recognize this as essentially another incarnation of the Balanced IoU Sampler.
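The intuition behind the IoU-based weighting can be sketched as a function that keeps (near) full gradient weight for background ROIs overlapping an annotation and decays toward a floor for ROIs far from every ground-truth box, since those may contain unannotated objects. The Gompertz shape and the constants below are illustrative stand-ins, not the paper's exact parameterization:

```python
import numpy as np

def soft_sampling_weight(max_iou, floor=0.25, b=50.0, c=20.0):
    """Illustrative gradient weight for a background ROI, as a function of
    the ROI's maximum IoU with any annotated ground-truth box.

    ROIs overlapping an annotation keep near-full weight (they are reliable
    hard negatives); ROIs far from all annotations are down-weighted toward
    `floor`, since they may contain an unannotated object.
    """
    gompertz = np.exp(-b * np.exp(-c * max_iou))
    return floor + (1.0 - floor) * gompertz
```

This weight would multiply the classification loss of each background ROI during training.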

So, there is nothing surprising here.

The authors also propose a second approach (otherwise, the paper would be rather slim for conference publication). This time, they weight gradients for ROIs that are neither positives nor hard negatives. The weighting function resembles the previous one. Essentially, they place faith in the model’s predictions: if an ambiguous ROI (neither positive nor hard negative) receives a high confidence score, treat it as positive and amplify its gradient; otherwise, dampen it. The authors candidly admit that the trained model remains weak, so this bootstrapping approach falls short. The experimental results confirm this observation.
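The core idea of this second scheme reduces to scaling the gradient of an ambiguous ROI by the model's own confidence. A minimal sketch; the linear form and the floor value are my illustrative choices, not the paper's exact function:

```python
def ambiguous_roi_weight(confidence, floor=0.25):
    """Illustrative gradient weight for an ambiguous ROI (neither a positive
    nor a hard negative), driven by the current model's confidence that the
    ROI contains an object.

    High confidence amplifies the gradient (the ROI is treated as a likely
    unannotated positive); low confidence dampens it toward `floor`.
    """
    return floor + (1.0 - floor) * confidence
```

The weakness the authors admit is visible here: the weight is only as trustworthy as the confidence estimate, and a weak model feeds its own errors back into training.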

On to the second paper. Their method, dubbed pseudo label-guided sampling, rests on a simple observation: when an object appears in an image, its constituent parts likely appear as well. Spot a car? Expect to find tires nearby. This reasoning, of course, only applies to hierarchical taxonomies where part-whole relationships exist.

The proposed method has two components:

  1. part-aware sampling: they ignore the classification loss for part categories whenever an instance of a part lies inside an instance of its subject category.

  2. pseudo labels: used to exclude regions that are likely missing annotations.

In essence, they strategically ignore regions suspected of harboring missing annotations.

Table 1 in the paper is interesting. There are two notions here:

  1. Included: the ratio of part-category boxes that lie inside an instance of their subject category to the total number of part-category boxes.
  2. Co-occur: the ratio of images containing both the part and subject categories to the total number of images containing the subject category.

These numbers paint a sobering picture: missing annotations pervade the Open Images Dataset.
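The two statistics can be made concrete with a short sketch. The per-image data structure below (lists of (x1, y1, x2, y2) boxes for one part/subject pair) is my assumption for illustration, not the paper's format:

```python
def contains(outer, inner):
    """True if box `inner` lies entirely inside box `outer` (x1, y1, x2, y2)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def included_and_cooccur(images):
    """Compute the Included and Co-occur ratios for one part/subject pair.

    `images` is a list of dicts with keys "part" and "subject", each holding
    the annotated boxes of that category in one image.
    """
    part_total = part_inside = 0
    subject_images = both_images = 0
    for img in images:
        parts, subjects = img["part"], img["subject"]
        part_total += len(parts)
        # Part boxes co-located with (inside) some subject instance.
        part_inside += sum(any(contains(s, p) for s in subjects) for p in parts)
        if subjects:
            subject_images += 1
            if parts:
                both_images += 1
    included = part_inside / part_total if part_total else 0.0
    cooccur = both_images / subject_images if subject_images else 0.0
    return included, cooccur
```

A low Co-occur value for a pair like car/tire is exactly the red flag: subject instances appear in many images where their parts were simply never annotated.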

The paper summarizes both algorithms formally, but a plain-English translation proves helpful:

  • Part-aware sampling: For each RoI proposal (Line 1), check whether the associated ground truth (Line 3) contains part categories (Line 4). If so, ignore labels (Line 6) that have not been verified (Line 5).

  • Pseudo label-guided sampling: For each output from a trained model (Line 2), discard entries whose score falls below the threshold or whose label belongs to the verified set (Line 3); also discard any that lie too close to existing ground truth (Line 6). Then, for each RoI proposal (Line 8), add boxes from the filtered output to the ignored set (Line 11) if their IoU with the RoI exceeds the threshold (Line 10).
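Assuming axis-aligned (x1, y1, x2, y2) boxes, the pseudo label-guided steps can be sketched as follows. All names are mine, and the thresholds are placeholders, not the paper's values:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def pseudo_label_boxes(detections, gt_boxes, verified_labels,
                       score_thresh=0.5, iou_thresh=0.5):
    """Filter a trained model's outputs down to likely-unannotated objects.

    detections:      list of (box, score, label) from a previously trained model
    gt_boxes:        annotated ground-truth boxes in the image
    verified_labels: labels known to be exhaustively annotated in this image
    """
    pseudo = []
    for box, score, label in detections:
        if score < score_thresh or label in verified_labels:
            continue  # too uncertain, or the class is fully annotated anyway
        if any(iou(box, g) >= iou_thresh for g in gt_boxes):
            continue  # already covered by an existing annotation
        pseudo.append(box)
    return pseudo

def rois_to_ignore(proposals, pseudo_boxes, iou_thresh=0.5):
    """RoIs overlapping a pseudo label are excluded from the training loss."""
    return [r for r in proposals
            if any(iou(r, p) >= iou_thresh for p in pseudo_boxes)]
```

The surviving pseudo boxes mark regions the model believes hold unannotated objects, and any RoI landing on them contributes nothing to the loss instead of being punished as background.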

Experiments

Nothing groundbreaking here, just standard experimental validation. That said, I plan to implement soft sampling and pseudo label-guided sampling in the coming weeks. Time will tell whether these methods genuinely improve my own work.