Object localization is one of the fundamental problems in computer vision, enabling a richer understanding of visual scenes. Learning strong predictive models from imprecise labels has recently gained traction, since it requires cheaper annotations and reduces the human effort needed for manual labeling. Moreover, the massive availability of weakly labeled data in the form of videos and images on the internet makes it possible to explore various real-world deep learning problems under a Weakly Supervised Learning (WSL) paradigm.
Humans possess an innate capability to recognize objects and their constituent parts, and to confine their attention to the location in a visual scene where an object is spatially present, all with minimal supervision. Recently, efforts to train machines to mimic this ability through weakly supervised object localization, using training labels only at the image level, have garnered considerable attention. The WSL approach thus makes it possible to localize objects without full supervision in the form of bounding boxes. When an object is partially occluded, it poses an additional challenge for localization and for scene understanding in general. As a fundamental component of many real-world pipelines, object localization supports tasks such as image and video captioning, scene graph generation, and visual grounding.
A well-known shortcoming of most existing weakly supervised methods is that they localize only the most discriminative part of an object, paying little or no attention to its other pertinent regions. For example, given an image of a dog, existing methods tend to generate an implicit attention map only over the dog's face, leaving other body parts such as the legs and tail unattended. This often leads to sub-optimal localization performance. Hence, our main focus is to design an architecture that covers the entire spatial extent of an object and localizes it integrally.
In this work, we propose a novel way of localizing objects precisely using only image-level labels, by mining information from complementary regions of an image. First, we apply regional dropout at complementary spatial locations to create two intermediate images. With the help of a novel Channel-wise Assisted Attention Module (CAAM) coupled with a Spatial Self-Attention Module (SSAM), we train our model in parallel to leverage the information from complementary image regions for accurate localization. Finally, we fuse the attention maps generated by the two classifiers using our Attention-based Fusion Loss. Extensive experiments demonstrate the superior performance of the proposed approach: our method achieves a significant improvement in object localization over existing state-of-the-art methods on the CUB-200-2011 and ILSVRC 2016 datasets. Future work will attempt to localize objects by learning to mine complementary information from both discriminative and non-discriminative object parts in the feature space.
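To make the first step concrete, the sketch below shows one plausible reading of complementary regional dropout: a rectangular region is sampled, and two intermediate images are formed by zeroing that region in one copy and zeroing its complement in the other. The function name `complementary_dropout`, the rectangular patch shape, and the zero-fill value are illustrative assumptions; the paper's exact dropout scheme, as well as the CAAM/SSAM modules that consume these images, may differ.

```python
import numpy as np

def complementary_dropout(image, patch=(8, 8), rng=None):
    """Hypothetical sketch: split an image into two complementary views.

    A random rectangular region of size `patch` is sampled; view A drops
    (zeroes) that region, while view B drops everything outside it, so
    the two views cover complementary spatial locations.
    """
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]

    # Sample the top-left corner of the rectangular dropout region.
    y = rng.integers(0, h - patch[0] + 1)
    x = rng.integers(0, w - patch[1] + 1)
    mask = np.zeros((h, w), dtype=bool)
    mask[y:y + patch[0], x:x + patch[1]] = True

    img_a = image.copy()
    img_a[mask] = 0        # region dropped: forces attention elsewhere
    img_b = image.copy()
    img_b[~mask] = 0       # complement dropped: only the region survives
    return img_a, img_b
```

In a training loop, the two views would be fed to the two parallel classifier branches, whose attention maps are later fused; because the dropped regions are complementary, the two views together cover the full spatial extent of the input.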