I’ve chosen to explore the concept presented in the article "Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language". This work introduces LENS, a modular framework that equips Large Language Models (LLMs) with vision capabilities without any additional multimodal pretraining. Instead of aligning the vision and language modalities through expensive pretraining, LENS uses pretrained vision modules to extract textual descriptions of an image (e.g., tags, attributes, captions), which are then fed into a frozen LLM for reasoning. This method bridges computer vision and natural language processing at no extra training cost, offering a scalable approach to tasks such as object recognition, visual reasoning, and multimodal question answering. The framework emphasizes practicality, building on existing state-of-the-art tools to address challenges in robotics, accessibility, and cross-domain applications.
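The pipeline described above can be sketched in a few lines of Python. This is only an illustrative mock-up of the idea, not the actual LENS implementation: the vision-module functions below return hard-coded placeholder strings where real systems would call pretrained taggers, attribute extractors, and captioners, and the final prompt would be passed to a frozen LLM.

```python
# Minimal sketch of the LENS idea: pretrained vision modules describe an
# image in text, and a frozen LLM reasons over those descriptions.
# All three "vision modules" here are hard-coded stand-ins, not real models.

def get_tags(image_path):
    # Stand-in for a pretrained image-tagging module.
    return ["dog", "frisbee", "park"]

def get_attributes(image_path):
    # Stand-in for a pretrained attribute-extraction module.
    return ["brown dog", "red frisbee"]

def get_captions(image_path):
    # Stand-in for a pretrained image-captioning module.
    return ["A brown dog catching a red frisbee in a park."]

def build_prompt(image_path, question):
    # Assemble the textual "view" of the image that a frozen LLM would
    # receive as its prompt for reasoning.
    return (
        "Tags: " + ", ".join(get_tags(image_path)) + "\n"
        + "Attributes: " + ", ".join(get_attributes(image_path)) + "\n"
        + "Captions: " + " ".join(get_captions(image_path)) + "\n"
        + "Question: " + question + "\n"
        + "Short Answer:"
    )

prompt = build_prompt("photo.jpg", "What is the dog catching?")
print(prompt)
```

Because everything the LLM sees is plain text, the language model itself never needs to be retrained: swapping in a better captioner or tagger improves the prompt without touching the LLM.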
Source Material:
Title: Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
Published: 2023
GitHub Link: https://github.com/ContextualAI/lens