OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in "Simple Open-Vocabulary Object Detection with Vision Transformers" by Matthias Minderer and colleagues. It is a recent model and has not yet been widely cited. Given an image and one or more text queries, it searches for and detects the target objects described in the text.
Its practical applications include:
This has significant implications for downstream edge devices and applications.
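To make the text-query workflow from the introduction concrete, here is a minimal sketch in Python. It assumes the Hugging Face `transformers` implementation of OWL-ViT (`OwlViTProcessor`, `OwlViTForObjectDetection`) and the public `google/owlvit-base-patch32` checkpoint, none of which are named in this article. The helper names `keep_confident` and `detect` are made up for illustration, and the post-processing call may be named differently depending on the installed `transformers` version.

```python
def keep_confident(scores, labels, boxes, queries, threshold=0.1):
    """Pair each detection at or above `threshold` with the text query it matched."""
    detections = []
    for score, label, box in zip(scores, labels, boxes):
        if score >= threshold:
            detections.append({
                "query": queries[label],  # label is an index into the query list
                "score": round(float(score), 3),
                "box": [round(float(v), 1) for v in box],  # [x_min, y_min, x_max, y_max]
            })
    # Most confident detections first.
    return sorted(detections, key=lambda d: d["score"], reverse=True)


def detect(image_path, queries, threshold=0.1,
           checkpoint="google/owlvit-base-patch32"):
    """Run OWL-ViT on one image with a list of free-text queries (assumed API)."""
    # Heavy imports stay local so keep_confident can be used without them.
    import torch
    from PIL import Image
    from transformers import OwlViTForObjectDetection, OwlViTProcessor

    processor = OwlViTProcessor.from_pretrained(checkpoint)
    model = OwlViTForObjectDetection.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    # One image, so the queries are wrapped in a single-element batch.
    inputs = processor(text=[queries], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Rescale boxes to pixel coordinates; PIL reports size as (width, height).
    target_sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs=outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return keep_confident(
        result["scores"].tolist(),
        result["labels"].tolist(),
        result["boxes"].tolist(),
        queries,
        threshold,
    )
```

A call such as `detect("street.jpg", ["a photo of a bicycle", "a photo of a dog"])`, where `street.jpg` is a hypothetical local file, would return a list of dictionaries, each naming the matched query, its confidence score, and a bounding box in pixels. The threshold of 0.1 is only a starting point and usually needs tuning per use case.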
This article was written by Danson Waweru of the AI Class of 2022. Learn more about Artificial Intelligence at HURU School by enrolling here.