Vision language models (VLMs)
What are vision language models (VLMs)?
Vision language models (VLMs) combine machine vision and semantic processing techniques to make sense of the relationships within and between objects in images. In practice, this means combining various visual machine learning (ML) algorithms with transformer-based large language models (LLMs). Current VLMs include OpenAI's GPT-4, Google Gemini and the open-source LLaVA.
VLMs, sometimes called large vision language models (LVLMs), are among the earliest multimodal AI systems, meaning they are trained across modalities such as text, images, audio and other data formats. The multimodal distinction is primarily made against early single-modality, text-only LLMs such as OpenAI's original GPT series, Google's PaLM and Meta's open-source Llama, but it also applies to most other ML techniques.
For decades, machine vision models have been used for simple counting, classification and categorization tasks in areas like quality control, facial recognition and estimating damages for insurance claims. VLMs help connect these techniques to people's questions about images.
VLMs promise to help automate, simplify and democratize various business, scientific, medical, artistic and consumer use cases. Common applications include answering questions about images, writing image captions and generating new images from text prompts.
Early work focused on photographic and artistic images owing to an abundance of pictures and captions readily available for training. However, VLMs also show promise in interpreting other kinds of graphical data such as electrocardiogram (ECG) graphs, machine performance data, org charts, business process models and virtually any other data type that experts can label.
Brief history of VLMs
The history of VLMs is rooted in developments in machine vision and LLMs and the relatively recent integration of these disciplines.
Machine vision research dates to the early 1970s when researchers began exploring various ways to extract edges, label lines, identify objects or classify conditions. Throughout the 1980s and 1990s, researchers investigated how scale space -- a way of representing images at different levels of detail -- could help align views of things across various scales. This led to the development of algorithms that could connect multiple levels of abstraction. For example, this might help join up imagery of cells, organs and body parts in medicine. Similar relationships exist across business processes, biology, physics and the built environment.
Later, researchers started developing feature-based methods that helped characterize defects based on images of products passing down an assembly line. However, this was an expensive and manual process. Innovations in convolutional neural networks (CNNs) and their training process helped automate much of this type of work.
On the language side, work in the early 1960s focused on techniques to analyze and automate logic and the semantic relationships between words. Innovations in recurrent neural networks (RNNs) in the mid-1980s helped automate much of the training and development of linguistic algorithms. However, these networks struggled with long sentences and long sequences. Later innovations, such as long short-term memory (LSTM), a special type of RNN, extended these limits.
The next wave of innovation came with the development of generative AI (GenAI) algorithms in the 2010s, which addressed various ways to represent reality in intermediate forms that could be adjusted to create new content or discern subtle patterns. Diffusion models were good at physics and image problems that involved adding and removing noise. Generative adversarial networks (GANs) were good at creating realistic images by setting up a competition between generating and discriminating algorithms. Variational autoencoders learned to represent data as compact probability distributions that could be sampled to generate new content.
The big breakthrough was the introduction of the transformer model by a team of Google researchers in 2017. This let a new generation of algorithms simultaneously consider the relationship between multiple elements in longer sentences, paragraphs and, later on, books. More importantly, it opened the gates for automating the training of algorithms that could learn the connections between semantic elements that reflect how humans interpret the world and the raw sights, sounds and text we are presented with.
It took another four years for multimodal algorithms to take hold in the research community, as not many computer scientists were thinking about capturing the world in terms of noise, neural networks or probability fields.
In 2021, OpenAI introduced its foundation model known as Contrastive Language-Image Pre-training (CLIP), which suggested how LLM innovations might be combined with other processing techniques.
Stability AI -- in conjunction with researchers from Ludwig Maximilian University of Munich and Runway AI -- introduced Stable Diffusion in 2022, combining transformer-based text encoders with diffusion models to generate creative imagery. Since then, various other research and commercial projects have explored how transformer models could be combined with GenAI techniques and traditional ML approaches to connect visual and language domains.
Why are VLMs important?
Traditional ML and AI models focused on one task. They could often be trained to perform well on simple visual tasks such as recognizing characters in printed documents, identifying faulty products or recognizing faces. However, they frequently struggled with the larger context.
VLMs help bridge the gap between visual representations and how humans are used to thinking about the world. This is where the scale space concept comes into play. Humans are experts at jumping across different levels of abstraction. We see a small pattern and can quickly understand how it might connect to the larger context of which it forms a part. VLMs represent an important step toward automating some of this process.
For example, when we see a car with a dent sitting in the middle of the road and an ambulance nearby, we instantly know that a crash probably occurred even though we did not see it. We start to imagine stories about how it could have happened, look for evidence that supports our hypothesis and think about how we might avoid a similar fate. Sometimes, people write stories about these experiences that can help train an LLM. A VLM can help connect the dots between stories humans write about car crashes and ambulances with images of them.
Similarly, when humans look at an image of a person holding a ball with their arm raised and a nearby dog looking up, we know they are playing fetch. We also write many stories involving humans, balls and dogs. VLMs trained on these stories, along with images of people holding round objects, connect the dots among balls, humans, dogs and the game of fetch to reach a similar interpretation. They can also interpret many other things that people commonly describe in image captions, connecting what is visible in an image to its larger context.
Analytics tools can commonly transform raw data into various charts, graphs and maps that help people see patterns and interpret the meaning of data. VLMs can take advantage of this significant analytics and presentation infrastructure to automate aspects of interpreting these graphics. They can also connect the visual patterns to the way experts might talk about the data, helping democratize and simplify the understanding of complex data streams through a conversational interface.
How VLMs work: Core concepts
Various preprocessing and intermediate training steps are commonly involved before visual and language elements are joined in training the VLM. This requires more work than training LLMs, which can start with a large text collection. Vision Transformers (ViTs) are sometimes used in this preprocessing step to learn the relationships between visual elements, but not yet the words that describe them. VLMs use ViTs and other preprocessing techniques to connect visual elements such as lines, shapes and objects to linguistic elements. This lets them make sense of, generate and translate between visual and textual data.
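To make this concrete, the following is a minimal, illustrative sketch in PyTorch of the ViT-style preprocessing described above: an image is cut into patches, each patch is projected into an embedding, and a transformer encoder then models the relationships between those patches. The layer sizes and patch counts are common defaults chosen for illustration, not the settings of any particular VLM.

```python
# Minimal ViT-style patch embedding and encoder (illustrative sizes, not a real model).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution slices the image into patches and projects each one to embed_dim.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, images):                  # images: (batch, 3, 224, 224)
        x = self.proj(images)                   # (batch, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (batch, 196, 768) -- one token per patch
        return x + self.pos_embed               # add positional information

# A standard transformer encoder then learns relationships between the visual tokens.
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
vision_encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

visual_tokens = vision_encoder(PatchEmbedding()(torch.randn(1, 3, 224, 224)))
print(visual_tokens.shape)                      # torch.Size([1, 196, 768])
```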
The following three elements are essential in a VLM (a toy sketch of how they fit together appears after the list):
- Machine vision. Translates raw pixels into representations of the lines, shapes and forms of objects in visual imagery, such as determining if an image has a cat or a dog.
- LLMs. Connects the dots between concepts expressed across many different contexts such as all the ways we might interact with dogs versus cats.
- Fusing aspects. Automates the process of labeling parts of an image and connecting them to words in an LLM, such as describing a dog as sitting, eating, chasing squirrels or walking with its owner.
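The toy sketch below shows, under simplifying assumptions, how these three elements might be wired together: visual tokens from a vision encoder are projected into the language model's embedding space and prepended to the text prompt, and the language model then predicts text over the fused sequence. The module names and sizes are illustrative placeholders; a real VLM would plug in a pretrained vision encoder and a pretrained LLM.

```python
# Toy wiring of the three VLM elements (placeholder modules, illustrative sizes).
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_dim=768, text_dim=512, vocab_size=32000):
        super().__init__()
        # 1. Machine vision: stands in for a pretrained ViT/CLIP image encoder.
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        # 3. Fusing aspect: projects visual tokens into the LLM's embedding space.
        self.projector = nn.Linear(vision_dim, text_dim)
        # 2. LLM: stands in for a pretrained transformer language model.
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True)
        self.language_model = nn.TransformerEncoder(layer, num_layers=2)
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, visual_tokens, text_token_ids):
        image_part = self.projector(self.vision_encoder(visual_tokens))  # (B, Nv, text_dim)
        text_part = self.text_embed(text_token_ids)                      # (B, Nt, text_dim)
        fused = torch.cat([image_part, text_part], dim=1)                # image tokens come first
        return self.lm_head(self.language_model(fused))                  # token logits

model = TinyVLM()
logits = model(torch.randn(1, 196, 768), torch.randint(0, 32000, (1, 12)))
print(logits.shape)                                                      # torch.Size([1, 208, 32000])
```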
Training and evaluation of VLMs
VLMs need a lot of data to learn about the world, starting with pairs of images and text descriptions. Popular open-source data sets for various VLM tasks include LAION-5B, the Public Multimodal Dataset (PMD), Visual Question Answering (VQA) and ImageNet.
Finding data for novel tasks can be tricky, and data scientists need to be cautious about bias, particularly when the samples are imbalanced relative to the real world. This can be critical when enterprise developers seek data for a particular use case that varies across different contexts.
For example, there are many images, captions and stories about rabbits on the Internet. But these might vary by geography. In the U.S., the stories might be about pets and the Easter Bunny, while in Australia, they might be about wild rabbits as pests and the endangered Easter Bilby. A similar problem could show up in the way different business units or types of experts talk about images. A VLM trained on only one set of stories and captions might struggle when used across various regions, business units, disciplines or teams.
It's also important to incorporate different image descriptions within the training data. For example, one description might label the animals in the image, another might mention a ball and raised arm, while a third description might include the game fetch. More nuanced descriptions might talk about the types of grass, clouds in the sky, the outfit a person is wearing or the brand of leash used on the dog. The challenge is curating a selection of training data that might discuss these various image aspects.
The training process also needs to help the model distinguish between similar-looking but different things. For example, a pug has a somewhat cat-like face, but humans can quickly identify a picture of one as a dog. Models therefore need to be trained on many images of dogs and cats with unusual shapes, sizes and forms.
There are many approaches to training a VLM once a team has curated a relevant data set. The process sometimes starts with a pretraining step that maps visual data to text descriptions, which is later fused to an existing LLM. In other cases, the VLM is trained directly on paired image and text data, which can require more time and computing resources but might deliver better results. In both approaches, the base model is adjusted through an iterative process of answering questions or writing captions. Based on how well the model performs, the strengths of the connections within the model -- known as neural weights -- are adjusted to reduce mistakes. These approaches can sometimes be combined to enhance accuracy, improve performance or reduce model size. The following are a few popular examples:
- Contrastive learning. Models like CLIP learn a shared embedding space for images and text by pulling matching image-caption pairs together and pushing mismatched pairs apart, so an image and the words that describe it end up close to each other (a simplified sketch of this objective appears after the list). The open-source LLaVA uses CLIP's image encoder as part of a pretraining step, which is then connected to a version of the Llama LLM.
- PrefixLM. Models like SimVLM and VirTex train a transformer directly on image regions paired with the beginning of a caption (the prefix), so the model learns to predict the words that should complete the caption.
- Multimodal fusing with cross-attention. Models like VisualGPT, VC-GPT and Flamingo fuse visual features into an existing LLM through cross-attention layers.
- Masked-language modeling and image-text matching. Models like BridgeTower, FLAVA, LXMERT and VisualBERT combine algorithms that predict masked words with algorithms that learn whether an image and a caption belong together.
- Knowledge distillation. Models like ViLD distill a larger teacher model with high accuracy into a more compact student model with fewer parameters that runs faster and cheaper but retains similar performance.
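As an illustration of the contrastive approach listed first above, the snippet below sketches a CLIP-style training objective: matching image-caption pairs in a batch are pulled together while mismatched pairs are pushed apart. It assumes the image and text embeddings have already been produced by their respective encoders; the temperature value is a commonly used illustration, not a prescribed setting.

```python
# Sketch of a CLIP-style contrastive objective over a batch of image and caption embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize so the dot product becomes a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Similarity between every image and every caption in the batch; matches lie on the diagonal.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0))
    # Classify each image against all captions and each caption against all images.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```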
Evaluating the performance of a VLM can be highly subjective, varying with the individual, their expertise and the type of task. Researchers have developed a variety of metrics to help automate and standardize the evaluation of VLM performance across different kinds of tasks. Popular metrics include the following:
- Bilingual Evaluation Understudy. BLEU measures how many of the words and short phrases (n-grams) in a machine-generated translation or caption also appear in a human-curated reference, with a penalty for overly short output (a simplified version of this calculation is sketched after the list).
- Consensus-based Image Description Evaluation. CIDEr compares a machine-generated description to a set of human-written reference descriptions, weighting matching phrases by how informative they are, so the score reflects the consensus of human annotators.
- Metric for Evaluation of Translation with Explicit Ordering. METEOR scores descriptions or translations on word-level precision and recall, with a penalty for differences in word order, and was designed to correlate closely with human quality judgments.
- Recall-Oriented Understudy for Gisting Evaluation. ROUGE measures how much of the content in human-written reference summaries is captured by a VLM-generated description.
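To make the flavor of these word-overlap metrics concrete, the following is a deliberately simplified BLEU-style calculation: modified n-gram precision combined with a brevity penalty. A real evaluation should rely on a maintained implementation (sacrebleu, for example) rather than this sketch.

```python
# Deliberately simplified BLEU-style score: modified n-gram precision plus a brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Count n-grams in the candidate that also appear in the reference (clipped).
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        precisions.append(max(overlap, 1e-9) / max(sum(cand_counts.values()), 1))
    # Penalize candidates that are much shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

generated = "a dog chases a ball in the park"
reference = "a dog is chasing a ball in the park"
print(round(simple_bleu(generated, reference), 3))
```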
VLM applications
Some of the many applications of VLMs include the following (a short usage sketch for the first two items appears after the list):
- Captioning images. Automatically creates descriptions of images, aiding efforts to index and search through a large library of imagery.
- Visual question answering. Answers users' natural-language questions about an image, helping people with different levels of expertise understand and analyze what they are seeing.
- Visual summarization. Writes a brief summary about visual information in an org chart, medical image or equipment repair process.
- Image-text retrieval. Lets users find images of things that relate to their query, such as finding a product using a different set of words than its official product description.
- Image generation. Helps someone create a new image in a particular style based on a text prompt.
- Image annotation. Highlights sections of an image with colors and labels to indicate areas relating to a query.
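The snippet below sketches what the first two applications can look like in practice, assuming the Hugging Face transformers library (with a backend such as PyTorch) is installed and the referenced public checkpoints are available; "photo.jpg" is a placeholder path for a real image file.

```python
# Sketch of image captioning and visual question answering with off-the-shelf checkpoints.
from transformers import pipeline

# Captioning: generate a short text description of an image.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("photo.jpg"))  # e.g., [{"generated_text": "a dog running across a grassy field"}]

# Visual question answering: ask a natural-language question about the same image.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
print(vqa(image="photo.jpg", question="What is the dog doing?"))  # candidate answers with scores
```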
A rapidly evolving field
VLMs, like LLMs, are prone to hallucination, so guardrails are required for high-stakes decisions. A VLM might tell a trained radiologist where to look in a radiological image, but it should not be trusted to make a diagnosis independently. Similarly, a VLM might help frame important questions about the business, but subject matter experts need to be consulted before rushing to any major decision. Conversely, creative hallucinations might be considered a feature by an artist experimenting with new imagery ideas.
Additionally, VLMs generate their results based on complex relationships captured in the weights of billions of parameters. This can make it difficult to decipher how they arrived at a particular decision or to adjust them when mistakes are made.
Ongoing research is also looking at how to combine different metrics to better measure performance across multiple types of tasks. For example, the newer Attribution, Relation and Order (ARO) benchmark measures visual reasoning skills better than traditional metrics developed for machine translation. More work is also required to develop better metrics for various use cases in medicine, industrial automation, warehouse management and robotics.
Despite the many challenges, VLMs represent an exciting opportunity to apply GenAI techniques to visual information. Researchers, vendors and enterprise data scientists will continue to find ways to improve their performance, apply them to existing business workflows, and improve the user experience for employees and customers.