This post covers the paper “Towards General Purpose Vision Systems”.
- GPV-1 is a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text.
- Supported tasks: classification, localization, question answering, and captioning
- The model is trained on four tasks using COCO images and annotations: visual question answering, captioning, localization, and classification
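To make the shared input/output interface concrete, here is a minimal sketch in Python. It is not the paper's code: `GPVOutput`, `read_output`, the box format, and the 0.5 threshold are all hypothetical names and values chosen for illustration. The model itself only receives a free-form task description; the `task` argument below stands for what a downstream consumer knows about which part of the output it wants to read.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


@dataclass
class GPVOutput:
    """Hypothetical container for the three output modalities."""
    boxes: List[Box]
    relevance: List[float]  # one confidence score per box
    text: str


def read_output(task: str, out: GPVOutput,
                threshold: float = 0.5) -> Union[List[Box], str]:
    """Pick the part of the shared output that a given task cares about."""
    if task == "localization":
        # Keep only boxes whose score clears the (assumed) threshold.
        return [b for b, s in zip(out.boxes, out.relevance) if s >= threshold]
    if task in {"classification", "question answering", "captioning"}:
        # These tasks are answered in text.
        return out.text
    raise ValueError(f"unknown task: {task}")


# Made-up output values, used only to exercise the helper.
example = GPVOutput(
    boxes=[(10.0, 20.0, 110.0, 220.0), (5.0, 5.0, 50.0, 50.0)],
    relevance=[0.9, 0.2],
    text="dog",
)

print(read_output("localization", example))
print(read_output("classification", example))
```

The point of the sketch is that every task shares one output structure, and only the way that structure is read differs from task to task.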
The GPV-1 model
- Consists of a visual encoder, a language encoder, a vision-language co-attention module, and output heads for the supported output modalities: boxes, relevance scores, and text
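The component list above can be read as a data flow. The sketch below composes toy stand-ins for each component in plain Python. Everything here is an assumption made for illustration: the class name `GPVSketch`, the stub functions, the vector shapes, and in particular the exact wiring between modules, which may differ from the paper's implementation.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Box = Tuple[float, float, float, float]


class GPVSketch:
    """Toy composition of the listed components; not the paper's code."""

    def __init__(
        self,
        visual_encoder: Callable[[object], List[Vector]],
        language_encoder: Callable[[str], List[Vector]],
        co_attention: Callable[[List[Vector], List[Vector]], List[Vector]],
        box_head: Callable[[Vector], Box],
        relevance_head: Callable[[Vector], float],
        text_head: Callable[[List[Vector]], str],
    ) -> None:
        self.visual_encoder = visual_encoder
        self.language_encoder = language_encoder
        self.co_attention = co_attention
        self.box_head = box_head
        self.relevance_head = relevance_head
        self.text_head = text_head

    def __call__(self, image: object, task: str) -> Tuple[List[Box], List[float], str]:
        regions = self.visual_encoder(image)       # one vector per image region
        words = self.language_encoder(task)        # one vector per task word
        fused = self.co_attention(regions, words)  # region vectors conditioned on the task
        boxes = [self.box_head(v) for v in fused]
        scores = [self.relevance_head(v) for v in fused]
        text = self.text_head(fused)
        return boxes, scores, text


# Stub components with made-up behavior, only so the sketch runs end to end.
def toy_visual_encoder(image: object) -> List[Vector]:
    return [[0.1, 0.2], [0.8, 0.9]]


def toy_language_encoder(task: str) -> List[Vector]:
    return [[float(len(word))] for word in task.split()]


def toy_co_attention(regions: List[Vector], words: List[Vector]) -> List[Vector]:
    # Append a single task summary value to every region vector.
    summary = sum(w[0] for w in words) / len(words)
    return [r + [summary] for r in regions]


model = GPVSketch(
    visual_encoder=toy_visual_encoder,
    language_encoder=toy_language_encoder,
    co_attention=toy_co_attention,
    box_head=lambda v: (0.0, 0.0, v[0], v[1]),
    relevance_head=lambda v: min(1.0, max(0.0, v[1])),
    text_head=lambda fused: "placeholder answer",
)

boxes, scores, text = model(None, "What is in the image?")
print(boxes, scores, text)
```

Because every call returns the same three outputs regardless of the task description, supporting a new task does not require a new head in this sketch, only a different way of reading the outputs.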