Towards General Purpose Vision Systems

This post covers the paper “Towards General Purpose Vision Systems”.


  • GPV-1 is a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text.
  • Supports: classification, localization, question answering, captioning
  • Model is trained on four tasks using COCO images and annotations
    • Visual Question Answering, Captioning, Localization, and Classification
  • Demo
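To make the shared input/output contract concrete, here is a minimal sketch of how a single output structure could serve all four tasks. The names `GPVOutput` and `read_output`, the box coordinate convention, and the score threshold are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# (x1, y1, x2, y2); the coordinate convention is an assumption for this sketch.
Box = Tuple[float, float, float, float]


@dataclass
class GPVOutput:
    """Hypothetical container for the three output modalities."""
    boxes: List[Box]        # candidate bounding boxes
    relevance: List[float]  # one confidence score per box
    text: str               # generated answer, label, or caption


def read_output(task: str, out: GPVOutput,
                score_threshold: float = 0.5) -> Union[str, List[Tuple[Box, float]]]:
    """Pick the modality a task cares about (hypothetical helper)."""
    if task == "localization":
        # Localization reads the boxes, ranked by relevance score.
        ranked = sorted(zip(out.boxes, out.relevance),
                        key=lambda pair: pair[1], reverse=True)
        return [(box, score) for box, score in ranked if score >= score_threshold]
    if task in {"classification", "vqa", "captioning"}:
        # The other three tasks read the text output.
        return out.text
    raise ValueError(f"unsupported task: {task}")
```

The point of the design is that the model itself does not change per task: the task description steers the outputs, and the caller only decides which modality to read.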

The GPV-1 model


  • The model consists of a visual encoder, a language encoder, a vision-language co-attention module, and output heads for the supported output modalities: boxes, relevance scores, and text
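The data flow between these components can be sketched with toy stand-ins. Everything below is an assumption made for illustration: the "image" is a list of pre-extracted (box, feature) regions, the embeddings are hand-written 2-d vectors, the co-attention runs in one direction only (regions attend to task tokens), and the text head just picks a single word. None of this reflects the real GPV-1 layers or weights.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Box = Tuple[float, float, float, float]

# Hypothetical 2-d word embeddings, chosen by hand for the example.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "find": [0.0, 0.0], "the": [0.0, 0.0],
    "dog": [1.0, 0.0], "cat": [0.0, 1.0],
}


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visual_encoder(image: List[Tuple[Box, Vector]]) -> Tuple[List[Box], List[Vector]]:
    # A real encoder computes regions from pixels; here they are given.
    return [box for box, _ in image], [feat for _, feat in image]


def language_encoder(task: str) -> List[Vector]:
    # Unknown words map to the zero vector.
    return [TOY_EMBEDDINGS.get(w, [0.0, 0.0]) for w in task.lower().split()]


def co_attention(regions: List[Vector], tokens: List[Vector]) -> List[Vector]:
    # Each region attends over the task tokens, then is gated by that context.
    fused = []
    for r in regions:
        weights = softmax([dot(r, t) for t in tokens])
        context = [sum(w * t[i] for w, t in zip(weights, tokens))
                   for i in range(len(r))]
        fused.append([ri * ci for ri, ci in zip(r, context)])
    return fused


def relevance_head(fused: Vector) -> float:
    # Sigmoid over the summed fused feature gives a score in (0, 1).
    return 1.0 / (1.0 + math.exp(-sum(fused)))


def text_head(regions: List[Vector], relevance: List[float]) -> str:
    # Pool regions by relevance, then return the closest vocabulary word.
    pooled = [sum(s * r[i] for s, r in zip(relevance, regions))
              for i in range(len(regions[0]))]
    return max(TOY_EMBEDDINGS, key=lambda w: dot(TOY_EMBEDDINGS[w], pooled))


def gpv(image: List[Tuple[Box, Vector]], task: str) -> Tuple[List[Box], List[float], str]:
    boxes, regions = visual_encoder(image)
    tokens = language_encoder(task)
    fused = co_attention(regions, tokens)
    relevance = [relevance_head(f) for f in fused]
    return boxes, relevance, text_head(regions, relevance)
```

Calling `gpv(image, "find the dog")` on an image with one dog-like and one cat-like region returns all three modalities at once, with the dog-like region scored higher. That is the structural idea the bullet describes: one forward pass, conditioned on the task text, feeding every output head.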