Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7% of test instances do not require reasoning on linguistic structure. To measure the true progress of existing models, we split the test set into two subsets: one that requires reasoning on linguistic structure and one that does not. Additionally, we create an adversarial dataset that tests a model's ability to generalize to an unseen distribution of target referring objects. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure and perform 12% to 23% lower than the established progress for this task. We also propose two methods, one based on negative sampling and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task.
Acknowledgements: ARO W911NF1810296, DARPA XAI N66001-17-2-4029, ONR MURI N00014-16-1-2007