
Multimodal AI: How Vision-Language Models Work

LDS Team · Let's Data Science

Take a photo of your broken code and ask an AI to fix it. Point your phone at a restaurant menu written in Japanese and ask for a translation with calorie counts. Upload a hand-drawn architecture diagram and ask the model to generate the corresponding infrastructure-as-code. None of this requires any special tricks — it just works, because vision-language models have gotten genuinely good.

The multimodal AI market was valued at $1.6 billion in 2024 and is projected to reach $27 billion by 2034, according to multiple market research firms. That growth reflects one thing: the technology crossed a practical deployment threshold. Models today can read charts, interpret medical scans, extract tables from photos, and write code from screenshots with a level of accuracy that makes them actually useful in production.

This article explains exactly how vision-language models (VLMs) work — from the foundational CLIP architecture that made image-text alignment possible, to how modern models like GPT-4o, Gemini 2.5 Pro, and InternVL3.5 connect visual perception to language reasoning. Our running example throughout is a receipt scanner: a system that takes a photo of a restaurant bill and extracts each line item, quantity, price, and the final total into structured JSON.

What Makes a Model Multimodal

A multimodal model handles more than one type of input or output within a single architecture. The modalities in play for modern AI systems are text, images, audio, and video. A model that can only process text is unimodal. A model that accepts both images and text — and reasons about them together — is multimodal.

The key word is "together." Early approaches processed each modality separately and merged the results at the end. A true multimodal model uses shared representations, letting the model reason about the relationship between what it sees and what it reads in the same forward pass.

For vision-language models specifically, the goal is bidirectional: the model should understand images given text context ("which part of this image shows the error mentioned above?"), and generate text grounded in image content ("what is wrong with this Python code in the screenshot?").

CLIP: The Architecture That Changed Everything

Vision-language models trace their origins to CLIP (Contrastive Language-Image Pre-training), released by OpenAI in 2021. Before CLIP, image recognition required labeling thousands of examples per category. CLIP learned to connect images and text by training on 400 million image-text pairs scraped from the internet — without any manual labels.

The insight was deceptively simple: train two separate encoders (one for images, one for text) so that the embedding of an image and the embedding of its caption end up close together in a shared vector space. This is contrastive learning.

Figure: CLIP dual-encoder architecture, with the image encoder and text encoder aligned in a shared embedding space.

The Contrastive Loss Function

Given a batch of $N$ image-text pairs, CLIP computes similarity scores for all $N \times N$ possible combinations. Correct pairs should have high similarity; incorrect pairs should have low similarity. The loss function that enforces this is:

$$
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log \frac{\exp(\text{sim}(v_i, t_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(v_i, t_j) / \tau)} + \log \frac{\exp(\text{sim}(t_i, v_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(t_j, v_i) / \tau)} \right]
$$

Where:

  • $v_i$ is the normalized image embedding for the $i$-th example
  • $t_i$ is the normalized text embedding for the $i$-th example
  • $\text{sim}(v_i, t_j)$ is the cosine similarity between image $i$ and text $j$
  • $\tau$ is a learned temperature parameter that scales the similarity scores
  • $N$ is the batch size (larger batches = more negative examples = stronger training signal)

In Plain English: Imagine laying out 256 receipts and 256 descriptions on a table. CLIP scores how well each receipt matches each of the 256 descriptions — that's 65,536 comparisons per batch. The loss pushes the correct pairs (receipt 1 with its description, receipt 2 with its description) toward the diagonal of this grid, and pushes all incorrect pairs away from it. The temperature $\tau$ controls how sharp these boundaries are.
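The symmetric loss can be sketched in a few lines of NumPy (an illustration of the math above, not CLIP's production code; the default temperature here is a common choice, not CLIP's learned value):

```python
import numpy as np

def softmax_xent_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(logits))
    return -log_probs[idx, idx].mean()

def clip_loss(v, t, tau=0.07):
    """Symmetric contrastive loss over L2-normalized embeddings v, t of shape (N, d)."""
    sim = v @ t.T / tau                                   # N x N scaled cosine similarities
    return softmax_xent_diag(sim) + softmax_xent_diag(sim.T)  # image->text + text->image

# Toy batch: 8 matched pairs of 4-d normalized embeddings
rng = np.random.default_rng(0)
v = rng.normal(size=(8, 4))
v /= np.linalg.norm(v, axis=1, keepdims=True)
print(clip_loss(v, v) < clip_loss(v, np.roll(v, 1, axis=0)))  # True: matched pairs give lower loss
```

Shifting the captions by one position breaks the diagonal alignment, and the loss rises accordingly, which is exactly the training signal CLIP exploits.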

The result: after training on 400 million pairs, CLIP learns a universal embedding space where "a photo of a cat" and an actual cat photo end up as neighbors, even if the model never saw that specific image during training. Zero-shot image classification became possible — no task-specific fine-tuning required.
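Zero-shot classification then falls out of the shared space: embed each candidate caption, embed the image, and pick the nearest caption. A toy sketch, with hand-made 2-d embeddings standing in for CLIP's encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, class_embs, class_names):
    """Return the caption whose embedding is most cosine-similar to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_embs = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return class_names[int(np.argmax(class_embs @ image_emb))]

# Hand-made embeddings for illustration; real ones come from CLIP's two encoders
classes = ["a photo of a cat", "a photo of a receipt"]
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
image_emb = np.array([0.1, 0.9])   # pretend this came from the image encoder
print(zero_shot_classify(image_emb, text_embs, classes))  # a photo of a receipt
```

In practice you embed prompts like "a photo of a {class}" for every label and classify with one matrix multiply, with no fine-tuning step at all.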

SigLIP: CLIP's More Efficient Successor

Google's SigLIP (Sigmoid Loss for Language-Image Pre-training, 2023) replaced CLIP's softmax contrastive objective with a pairwise sigmoid loss. The difference matters in practice: CLIP's loss requires computing all pairwise similarities within a batch globally, which means you need very large batches (32K+) to see enough negatives. SigLIP treats each image-text pair as an independent binary classification problem — does this pair match or not? This decouples the loss from batch size, making training far more memory-efficient.
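A toy NumPy version of the pairwise sigmoid objective shows the structural difference (a sketch only: the temperature and bias values here are illustrative, and real SigLIP learns both):

```python
import numpy as np

def siglip_loss(v, t, temp=10.0, bias=-10.0):
    """Pairwise sigmoid loss over L2-normalized embeddings v, t of shape (N, d).

    Each (i, j) cell is scored independently: label +1 on the diagonal
    (matching pair), -1 everywhere else. No batch-wide softmax is needed.
    """
    logits = (v @ t.T) * temp + bias
    labels = 2.0 * np.eye(len(v)) - 1.0
    # -log sigmoid(label * logit), computed stably as logaddexp(0, -x)
    return np.logaddexp(0.0, -labels * logits).mean()

rng = np.random.default_rng(0)
v = rng.normal(size=(8, 64))
v /= np.linalg.norm(v, axis=1, keepdims=True)
print(siglip_loss(v, v) < siglip_loss(v, np.roll(v, 1, axis=0)))  # True
```

Because each cell is an independent binary problem, the loss value for one pair never depends on how many other pairs are in the batch, which is the property that frees SigLIP from CLIP's huge-batch requirement.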

SigLIP 2 (2025) extends this with an improved contrastive objective that achieves higher precision text-image alignment, especially for retrieval tasks. Most modern open-source VLMs — including LLaVA-NeXT and InternVL3.5 — now use SigLIP as their vision encoder rather than the original CLIP ViT, because SigLIP produces higher-quality visual features with less training compute.

How the Vision Encoder Works: ViT and Image Tokens

Modern VLMs use a Vision Transformer (ViT) as their image encoder, introduced by Dosovitskiy et al. in 2020. The approach is elegant: rather than treating pixels individually, ViT divides images into patches and processes each patch as a single token — exactly like words in a sentence.

Here's the complete pipeline:

Figure: ViT image tokenization pipeline, from image patches to visual tokens with positional encoding.

Step 1 — Split into patches. A 224×224 image with a patch size of 16×16 yields (224/16)² = 196 patches. Each patch contains 16 × 16 × 3 = 768 raw pixel values (for RGB).

Step 2 — Flatten and project. Each 768-dimensional patch vector is linearly projected into the model's embedding dimension (e.g., 768 or 1024). This learned projection is where visual features begin to take shape.

Step 3 — Add positional encodings. Unlike text, images are 2D. ViT adds learned positional embeddings so the transformer knows that patch (3, 7) is in the upper-right region of the image. Without this, the model sees an unordered bag of patches with no spatial structure.

Step 4 — Prepend [CLS] token. A special classification token is prepended to the sequence. After processing, this token's output captures a global summary of the entire image — all 196 patches distilled into one vector.

Step 5 — Process through transformer encoder. The sequence of 197 tokens (196 patches + 1 CLS) flows through multi-head self-attention layers. Attention heads specialize: some focus on edges, others on textures, others on semantic regions.

```text
Image size: 224x224 pixels
Patch size: 16x16 pixels
Number of patches: 196
Patch embedding dim (flattened): 768
Patches array shape: (196, 16, 16, 3)

With [CLS] token: 197 total tokens fed to transformer
```

For our receipt scanner, a photo of a dinner bill might be 1024×1024 pixels. With a 16×16 patch size, that's 4,096 visual tokens — each one capturing roughly a 1/4-inch square of the receipt. The model's attention mechanism can then link the token covering the word "Pasta" with the token covering "14.50" several columns to the right.
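The patch arithmetic above can be reproduced directly with NumPy (a minimal sketch of Step 1, with a zero image standing in for a real photo):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x 3 image into non-overlapping patch x patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)
    return patches  # (num_patches, patch, patch, 3)

img = np.zeros((224, 224, 3))            # stand-in for a real photo
p = patchify(img)
print(p.shape)                            # (196, 16, 16, 3)
print(p.reshape(len(p), -1).shape)        # flattened patch vectors: (196, 768)
```

The flattened (196, 768) array is exactly what the learned linear projection in Step 2 consumes; a 1024x1024 receipt photo run through the same function yields the 4,096 tokens mentioned above.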

Key Insight: The jump from CNNs to ViT as the vision encoder in VLMs wasn't arbitrary. Transformers use self-attention, which can directly model long-range relationships across the entire image. A CNN would struggle to connect "Pasta" (left side of receipt) with its price (right side) in one operation. The transformer does it in every attention layer.

If you're building on transformer foundations, the transformer architecture article covers how multi-head self-attention works — which is the same mechanism VLMs extend to visual tokens.

Connecting Vision to Language: The Projection Layer

Having a vision encoder that produces visual tokens is only half the problem. The language model speaks a different language — its embedding space is tuned for text, not image patches. Something has to translate between them. This connector is where the major architectural families diverge.

LLaVA-style MLP projection. The original LLaVA (Large Language and Vision Assistant, Haotian Liu et al., 2023) used the simplest possible connector: a two-layer MLP. Visual tokens from CLIP ViT are passed through the MLP, which projects them into the language model's embedding dimension. Those projected vectors are then concatenated with text token embeddings and fed into the LLM together. The simplicity works surprisingly well — the LLM effectively learns to "read" the projected visual tokens during instruction tuning. LLaVA 1.5 demonstrated that a well-trained MLP connector rivals far more complex designs.
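To make the connector concrete, here is a toy version of that two-layer projection in NumPy (a sketch with made-up dimensions and random weights, not LLaVA's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny stand-in dimensions; LLaVA-scale models use e.g. 1024-d vision -> 4096-d LLM
d_vision, d_llm, n_patches, n_text = 64, 128, 196, 32

# Two-layer MLP connector in the style of LLaVA-1.5 (random weights for illustration)
W1 = rng.normal(scale=0.02, size=(d_vision, d_llm))
W2 = rng.normal(scale=0.02, size=(d_llm, d_llm))
gelu = lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

visual_tokens = rng.normal(size=(n_patches, d_vision))    # frozen ViT output
projected = gelu(visual_tokens @ W1) @ W2                 # now in the LLM embedding space

text_embeddings = rng.normal(size=(n_text, d_llm))        # embedded prompt tokens
llm_input = np.concatenate([projected, text_embeddings])  # one sequence for the LLM
print(llm_input.shape)  # (228, 128)
```

After projection, the 196 visual tokens and 32 text tokens form a single sequence, and the LLM attends across both with no architectural distinction between them.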

Flamingo-style cross-attention. DeepMind's Flamingo (2022) injected visual information into the language model via dedicated cross-attention layers. Frozen visual features are passed as keys and values; the LLM's text representations are queries. This lets the LLM "look up" relevant visual features at each layer without flooding the token sequence with hundreds of image tokens — a real advantage when you're processing high-resolution inputs.

Unified token approach (GPT-4o, Gemini). The most recent approach treats visual tokens no differently from text tokens from the start. GPT-4o was trained end-to-end across text, vision, and audio as a single unified model. Gemini was designed from the ground up as natively multimodal, trained on interleaved image-text data at scale. There's no clean seam between "visual encoder" and "language model" — they're one thing.

Common Pitfall: Many tutorials describe GPT-4V and GPT-4o as architecturally equivalent. They're not. GPT-4V bolted vision onto an existing language model via an adapter. GPT-4o was trained from scratch as a multimodal model. The architectural difference explains GPT-4o's significantly better performance on tasks requiring tightly integrated visual and linguistic reasoning.

Modern VLMs: The 2026 Landscape

The field moved fast between 2024 and early 2026. Here's where things stand today:

| Model | Architecture | Context Window | Video Support | Open Source | Strength |
|---|---|---|---|---|---|
| GPT-4o | Unified native multimodal | 128K tokens | Frame sampling | No | Complex reasoning, code from screenshots |
| Gemini 2.5 Pro | Native multimodal | 1M tokens | Yes (84.8% VideoMME) | No | Long documents, video, thinking |
| Claude 3.5 Sonnet | Adapter-based vision | 200K tokens | No | No | Document extraction, faithful OCR |
| Llama 3.2 Vision (90B) | Cross-attention adapter | 128K tokens | No | Yes | Open-weight general vision |
| Qwen2.5-VL (72B) | Native multimodal | 32K tokens | Yes (1hr+ video) | Yes | Multilingual documents, charts |
| InternVL3.5 (78B) | SigLIP + InternLM | 128K tokens | Yes | Yes | Competitive with GPT-4o on benchmarks |
| LLaVA-NeXT (34B) | SigLIP ViT + MLP + Llama | 4K tokens | No | Yes | Lightweight on-premise deployment |

GPT-4o remains the strongest general-purpose option for tasks requiring tight integration between seeing and reasoning. On MMMU (college-level multimodal reasoning), it scores 69.1, and on DocVQA it reaches 92.8. OpenAI trained GPT-4o as a single model end-to-end, which gives it an edge on tasks like receipt scanning where the model must simultaneously recognize text (OCR), understand table structure, and apply arithmetic verification. Note that OpenAI has since released GPT-5.4 as of early 2026, which extends these capabilities further — but GPT-4o remains widely deployed and the standard reference point.

Gemini 2.5 Pro (released mid-2025) scores 84.8% on VideoMME, making it the leading production model for video understanding. The one-million-token context window means you can feed an hour-long video as sampled frames, or a 500-page document, and reason over the entire thing in a single call. Google's architecture processes images as native tokens alongside text, with no separate adapter step, and its thinking mode adds a reasoning layer on top of multimodal perception.

InternVL3.5 from the OpenGVLab (August 2025) is the open-source story to watch. The 78B model scores 71.4 on WildVision and matches GPT-4o on several benchmarks including MMIU (55.8 vs GPT-4o's 55.7) and surpasses it on MMT-Bench (70.8 vs 65.4). It uses SigLIP as its vision encoder and InternLM as the language backbone. For production deployments that can't send data to a cloud API, InternVL3.5 is now a genuine alternative to the proprietary models.

Qwen2.5-VL (72B) from Alibaba handles video clips over one hour in length and excels on document and chart understanding across multiple languages. Released February 2025, it costs $0.20 per million input tokens via API — over 10x cheaper than GPT-4o — while staying competitive on document-focused benchmarks.

Llama 3.2 Vision (90B) scores 60.3 on MMMU compared to GPT-4o's 69.1, but reaches 90.1 on DocVQA, only slightly behind GPT-4o's 92.8 on document visual question answering. At roughly 16x lower API cost than GPT-4o, it's a strong choice for high-volume document pipelines where you can tolerate slightly lower general reasoning quality.

Three Architectural Approaches Side by Side

Figure: VLM architectural approaches, comparing the MLP adapter, cross-attention, and unified native multimodal designs.

The two-stage adapter approach (LLaVA) is the most interpretable. You can swap vision encoders, swap the LLM, and swap the connector independently. The limitation is that the vision encoder and language model were never trained together, so there's a fundamental mismatch in their representations that the connector has to overcome.

The cross-attention approach (Flamingo, Llama 3.2) avoids flooding the context with hundreds of image tokens. When you have a 1024×1024 image producing 4,096 visual tokens, injecting all of them into a 128K-token context window uses over 3% of available context for just one image. Cross-attention lets the LLM query visual features on demand rather than receiving everything upfront.
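The context-budget arithmetic above is easy to reproduce (assuming the 16-pixel patch size used throughout this article):

```python
def visual_token_count(width, height, patch=16):
    """Number of ViT tokens produced for an image at the given resolution."""
    return (width // patch) * (height // patch)

tokens = visual_token_count(1024, 1024)
print(tokens)                                        # 4096
print(f"{tokens / 128_000:.1%} of a 128K context")   # 3.2% of a 128K context
```

Doubling the resolution quadruples the token count, which is why high-resolution multi-image inputs make the cross-attention design attractive.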

The unified approach (GPT-4o, Gemini) removes the seam entirely. The model never "switches modes" between looking at an image and reading text — everything flows through the same attention heads. This produces the best performance but requires enormous training compute, since you're training the visual and linguistic representations jointly from scratch.

Real Applications of Vision-Language Models

Document understanding. This is the killer app for VLMs today. Receipts, invoices, medical forms, legal contracts, financial statements — documents that have always required manual data entry. A VLM can extract structured data from a receipt photo in under a second, with accuracy that rivals a trained human on clean images. The model understands both the content (what the items are) and the structure (which price corresponds to which item). For multi-page PDFs, Gemini 2.5 Pro's million-token context means you can pass entire contracts — not just pages — and ask cross-document questions.

Visual question answering. The model takes an image plus a natural language question and produces an answer grounded in the image content. "Is there a fire extinguisher visible in this factory floor photo?" "What does the error message on this server rack say?" "Which bar in this chart represents Q3 revenue?" These questions were impossible for traditional ML systems without task-specific training data.

Code from screenshots. Hand GPT-4o a screenshot of a UI and ask it to generate the HTML/CSS. Give it a photo of a whiteboard architecture diagram and ask for Terraform code. Show it a broken terminal output and ask it to diagnose the issue. This workflow is one of the most frequently cited uses of vision models by software engineers.

Video understanding. Gemini 2.5 Pro and Qwen2.5-VL can process extended video input — Qwen2.5-VL handles clips over one hour. Practical applications include analyzing security footage for events, generating meeting summaries from screen recordings, and automating quality inspection by reviewing assembly line video. The Allen Institute's Molmo2 (2026) pushed video grounding further: it can pinpoint the exact timestamp where a specific event occurs in a long video, achieving 38.4 F1 on video pointing versus Gemini 3 Pro's 20.0 — a striking result from a fully open-weight model.

Medical imaging analysis. Foundation models like MedGemini apply vision-language reasoning to radiology reports, pathology slides, and clinical photographs. This is an area where VLMs are moving toward clinical deployment, though regulatory frameworks are still catching up. A radiologist might ask a VLM to highlight regions of concern in a chest X-ray and draft an initial report — with the physician reviewing and signing off.

Fine-Tuning VLMs with LoRA

Pre-trained VLMs work well out of the box for general tasks. But when you need a model specialized for your domain — custom invoice formats, a proprietary chart style, or handwritten forms from a specific institution — fine-tuning on your own labeled examples changes the accuracy picture completely.

The challenge: full fine-tuning of a 34B VLM requires more than 80GB of VRAM. That locks most teams out. LoRA (Low-Rank Adaptation) solves this by adding small trainable rank-decomposition matrices to existing attention layers, freezing the original weights entirely. Fine-tuning LLaVA-1.5 with LoRA at rank 16 reduces trainable parameters by 99.2% compared to full fine-tuning, while achieving comparable benchmark performance.

The typical fine-tuning pipeline for a VLM has three stages:

Figure: VLM fine-tuning pipeline with LoRA, showing a frozen vision encoder, trainable MLP projection, and LoRA on the language model.

  1. Keep the vision encoder frozen. The ViT or SigLIP encoder is already excellent at extracting visual features. Updating it requires far more data than you likely have, and risks degrading the features that make the model work. Freeze it.

  2. Fine-tune the projection layer. The MLP connector between vision and language is relatively small and fast to train. It's also the layer most likely to need adaptation for domain-specific visual formats.

  3. Apply LoRA to the language model's attention layers. This is where the model learns to interpret your domain's visual features correctly. With rank=16 and alpha=32, you can fine-tune an entire 13B LLM component in under 4 hours on a single A100.

```python
# Example: LoRA fine-tuning config for a LLaVA-style VLM (display-only — requires transformers + peft)
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # rank of the decomposition matrices
    lora_alpha=32,                        # scaling factor (typically 2x rank)
    target_modules=["q_proj", "v_proj"],  # which attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Freeze vision encoder and projection, apply LoRA to the LLM
model = get_peft_model(base_llava_model, lora_config)  # base_llava_model loaded elsewhere
model.print_trainable_parameters()
# Trainable params: 3,538,944 || All params: 13,017,300,000 || Trainable%: 0.027%
```

Pro Tip: For receipt scanning fine-tuning, prepare a dataset of 500 to 2,000 receipt photos with verified JSON ground truth. That's enough for the model to learn your specific format — receipt layouts, currency symbols, tax line variations. Use Llama 3.2 Vision (11B) as the base for lower VRAM requirements. The 11B model fine-tuned on domain data will outperform the 90B base model on your specific task.

The fine-tuning LLMs with LoRA and QLoRA article covers the LoRA mechanics in depth — the same principles apply when fine-tuning the language component of a VLM.

Where Vision-Language Models Still Fail

VLMs have real, documented failure modes that any production deployment needs to account for:

Counting objects. Ask a VLM how many cars are in a parking lot photo and you'll get unreliable answers above roughly seven objects. The model doesn't count systematically — it estimates based on density. Counting remains one of the most consistently cited limitations across GPT-4o, Gemini, and Claude.

Spatial reasoning. "Is the blue cup to the left or right of the red plate?" Models frequently confuse spatial relationships, especially when objects overlap or are viewed from unusual angles. The token-based representation loses precise spatial coordinates — the model knows that patches depicting the blue cup and the red plate exist, but their relative positions are encoded only implicitly.

Fine-grained text in images. Small text (license plates, nutrition labels, fine print) degrades VLM accuracy significantly. High-resolution input helps, but even at 1024×1024 resolution, tiny text is often misread or hallucinated. For our receipt scanner, this is a genuine concern — a crumpled or photographed-at-an-angle receipt can fool the OCR layer.

Hallucinating image content. VLMs will sometimes describe objects that aren't present in the image, particularly when the prompt implies what should be there. If you ask "what type of car is parked in the driveway?" and there's no car in the image, some models will invent one rather than say the driveway is empty. This is the visual analog of text hallucination.

Common Pitfall: Treating VLM vision output as ground truth in automated pipelines without confidence checks. In production, always add a verification step for high-stakes extractions — have the model return a confidence field and flag low-confidence outputs for human review. For receipt scanning specifically, validate that extracted line items sum to the extracted subtotal before accepting the result.
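A minimal version of that arithmetic check might look like this (the JSON schema matches the receipt format used in this article; the tolerance value is an assumption to absorb rounding):

```python
def validate_receipt(extraction, tolerance=0.01):
    """Flag extractions where line items don't reconcile with the stated totals."""
    items_sum = sum(item["qty"] * item["price"] for item in extraction["items"])
    subtotal_ok = abs(items_sum - extraction["subtotal"]) <= tolerance
    total_ok = abs(extraction["subtotal"] + extraction["tax"] - extraction["total"]) <= tolerance
    return subtotal_ok and total_ok

receipt = {
    "items": [{"name": "Pasta", "qty": 1, "price": 14.50},
              {"name": "Wine", "qty": 2, "price": 9.00}],
    "subtotal": 32.50, "tax": 2.60, "total": 35.10,
}
print(validate_receipt(receipt))  # True
```

Any extraction that fails this check gets routed to human review (or a second model pass) rather than accepted silently.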

Running VLMs Locally with Ollama

For applications where data privacy matters — a hospital scanning patient intake forms, a law firm processing contracts — sending images to a cloud API isn't viable. Ollama makes local VLM deployment practical on consumer hardware.

```bash
# Pull and run LLaVA locally
ollama pull llava:13b
ollama run llava:13b
```

```python
import ollama

# Receipt scanner using local LLaVA
response = ollama.chat(
    model='llava:13b',
    messages=[
        {
            'role': 'user',
            'content': """Extract all line items from this receipt as JSON.
Format: {"items": [{"name": "...", "qty": N, "price": 0.00}], "subtotal": 0.00, "tax": 0.00, "total": 0.00}""",
            'images': ['receipt.jpg']
        }
    ]
)

print(response['message']['content'])
```

LLaVA-13B runs comfortably with 16GB of GPU or unified memory (for example, an M2 Pro or a 16GB-class NVIDIA card). The 34B variant requires a 48GB GPU but produces results competitive with GPT-4V on document extraction tasks. Quantized versions (Q4_K_M) reduce VRAM requirements by roughly 60% with minimal quality loss.

Pro Tip: For receipt scanning in production, use a two-pass approach. First run a lightweight model (LLaVA-7B or GPT-4o-mini with vision) for fast extraction, then validate the sum of line items against the extracted total. Mismatches trigger a second pass with the full model. This keeps cost and latency low while catching extraction errors automatically.

When to Use VLMs, When Not To

Use VLMs when:

  • Input is inherently visual (photos, screenshots, scanned documents)
  • The task requires understanding structure and content together (reading a table in an image)
  • You need zero-shot generalization across many document types without per-type training
  • The alternative is manual data entry or a brittle task-specific OCR pipeline
  • You need video understanding at the minute-to-hour scale (Gemini 2.5 Pro, Qwen2.5-VL)

Don't use VLMs when:

  • You have clean digital text — just use a text LLM. Adding vision adds latency and cost with zero benefit
  • Counting, precise spatial measurements, or pixel-accurate tasks are required — traditional computer vision (YOLO, SAM, depth estimation models) still wins here
  • Images contain very small text and accuracy is critical — use dedicated OCR tools like Tesseract or cloud vision APIs for the text extraction step, then pass the text to an LLM
  • Latency must be below 500ms — even GPU-accelerated VLMs are slower than text-only models
  • You need object detection in real time — specialized detection models like YOLOv8 run at 60+ fps; VLMs don't

Conclusion

Vision-language models work by solving three sub-problems in sequence: encode the image into meaningful tokens (ViT or SigLIP), align those tokens with text representations (CLIP/SigLIP contrastive learning or an MLP connector), and let a language model reason jointly over both (via concatenation, cross-attention, or a fully unified architecture).

The 2021 CLIP breakthrough is the foundation — it proved that image and text representations can be aligned in a shared space using contrastive learning on internet-scale data. Every modern VLM builds on that insight, whether it's LLaVA-NeXT using SigLIP as a frozen encoder with a learned MLP projection, or GPT-4o training the full visual-language system end-to-end. By March 2026, open-source models like InternVL3.5-78B have closed the gap with proprietary models to within 5 to 10 percentage points on standard benchmarks — making on-premise VLM deployment genuinely viable.

For practical deployment: cloud VLMs (GPT-4o, Gemini 2.5 Pro, Claude 3.5 Sonnet) win on raw accuracy, long context, and zero maintenance; open-source VLMs (InternVL3.5, Llama 3.2 Vision, Qwen2.5-VL) win on data privacy, cost at scale, and fine-tuning flexibility. The failure modes — counting, spatial reasoning, small text, hallucination — are real and architectural, not implementation bugs. Build verification layers around them.

If you're building on top of language models and want to understand the full LLM stack, the GPT architecture article explains the decoder-only transformer that most VLMs use as their language backbone. For the attention mechanism specifically, the transformer revolution article covers how multi-head self-attention works — the same mechanism VLMs extend to visual tokens.

Interview Questions

What is contrastive learning and how does CLIP use it?

Contrastive learning trains a model by pushing similar examples together and dissimilar examples apart in embedding space. CLIP applies this to image-text pairs: given a batch of N pairs, it maximizes cosine similarity between each image and its matching caption while minimizing similarity to all other captions in the batch. The temperature parameter controls the sharpness of this separation. The result is a shared embedding space where semantically related images and text cluster together, enabling zero-shot image classification without any task-specific labeled data.

How does SigLIP differ from CLIP, and why do most modern open-source VLMs use it?

CLIP uses a softmax contrastive loss that requires computing all pairwise similarities across the entire batch globally, which means you need very large batches (32K+) to see enough negatives during training. SigLIP replaces this with a sigmoid loss that treats each image-text pair as an independent binary classification — does this pair match, yes or no? This decouples the loss from batch size, making training more memory-efficient and enabling better performance with smaller batches. SigLIP 2 (2025) further improved text-image alignment precision, and most 2025-2026 open-source VLMs including LLaVA-NeXT and InternVL3.5 now use SigLIP rather than the original CLIP.

Explain the ViT architecture and why it replaced CNNs in vision-language models.

ViT (Vision Transformer) divides an image into fixed-size patches (typically 16×16 pixels), flattens each patch into a vector, applies a learned linear projection, adds positional encodings, and processes the resulting sequence through a standard transformer encoder. The advantage over CNNs is self-attention: ViT can model long-range dependencies across the entire image in every layer, whereas CNNs build up receptive field size gradually through stacking layers. For document understanding — where a price in one column must be linked to a label in a distant column — this global attention is critical and CNN-based encoders struggle.

What is the role of the projection layer in LLaVA-style VLMs?

The vision encoder and language model have different embedding spaces — one tuned for visual features, the other for text semantics. The projection layer (a two-layer MLP in LLaVA) bridges this gap by mapping visual tokens from the vision encoder's embedding dimension into the language model's embedding dimension. After projection, visual tokens and text tokens are concatenated and fed into the LLM together, allowing it to attend across both modalities simultaneously. The key insight from the LLaVA paper is that this simple MLP connector, when trained properly, rivals far more complex architectural designs.

How does GPT-4o differ architecturally from GPT-4V?

GPT-4V bolted vision onto an existing language model using an adapter: a vision encoder processes images separately, and an adapter aligns the visual features with the LLM's embedding space. GPT-4o was trained from scratch as a unified model across text, vision, and audio — there's no separate encoder/adapter distinction. This end-to-end joint training gives GPT-4o tighter integration between visual perception and language reasoning, explaining its higher benchmark scores on tasks that require both modalities simultaneously rather than just one at a time.

Why do VLMs struggle with counting objects accurately?

VLMs represent images as patches processed through attention layers — they have no explicit counting mechanism. Attention heads learn to respond to density patterns and categorical presence, not to enumerate instances one by one. When asked to count, the model generates a plausible number based on visual density cues rather than actually iterating over each object. Performance degrades above roughly seven objects because the model can no longer reliably distinguish "seven" from "eight" based on visual density alone. For reliable counting in production, a specialized detection model (YOLO, DETR) is a better choice.

How would you fine-tune a VLM for a domain-specific document extraction task?

Start by freezing the vision encoder — it's already well-trained and you likely have too little data to improve it. Fine-tune the MLP projection layer and apply LoRA (rank 16, alpha 32) to the language model's attention layers. This reduces trainable parameters by roughly 99% compared to full fine-tuning, making it feasible on a single A100. Collect 500 to 2,000 labeled examples in your document format, structured as image-instruction-answer pairs. For receipt extraction, verify ground truth by having a human confirm the extracted JSON matches the actual receipt. Domain-specific fine-tuning on this amount of data consistently outperforms a much larger base model on the target task.

What is the token budget for a high-resolution image in a VLM context window?

With a 16×16 patch size, a 1024×1024 image produces (1024/16)² = 4,096 visual tokens. A 2048×2048 image would produce 16,384 tokens. This matters for context window budgeting: a model with a 128K token context that receives five high-resolution images uses over 15% of its context on visual tokens before any text is added. Many production VLMs use dynamic resolution scaling — reducing patch size or downsampling the image — to control token cost. Gemini's approach processes images as native tokens within its 1M token context, which is why it handles multi-image and video tasks where other models run out of context.

Explore all career paths