VLM Parameters

Model: PaliGemma-3B
LLM Backbone: Gemma-2B
Vision Encoder: SigLIP-So400M
Embed Dim: 2048
Decoder Layers: 18
Attention Heads: 16
Weights: Frozen

Legend

Text Prompt / Language
PaliGemma Decoder
Visual Tokens (SigLIP)
Cross-Attention Beams
Data Flow
Fused Embeddings
This visualization shows how PaliGemma combines visual tokens from SigLIP with language tokens to create fused vision-language embeddings. This is Step 2 of the pi0 architecture.
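
As a rough illustration of the fusion step, the sketch below projects visual features to the decoder's embedding width and prepends them to the text embeddings, so the decoder sees one combined token sequence. This is a minimal sketch under stated assumptions, not PaliGemma's actual code: the function names `linear` and `fuse_tokens`, the toy sizes, and the random weights are all hypothetical, and the real model uses an embedding dimension of 2048 rather than the small width used here.

```python
import random

# Toy sizes for illustration only (assumptions, not the real model's dimensions).
VISION_DIM = 6   # width of one visual feature vector from the vision encoder
EMBED_DIM = 4    # width of one decoder token embedding (2048 in the real model)


def linear(x, weight, bias):
    """Compute y = x @ weight + bias for one feature vector stored as a list."""
    return [
        sum(xi * w_row[j] for xi, w_row in zip(x, weight)) + bias[j]
        for j in range(len(bias))
    ]


def fuse_tokens(visual_feats, text_embeds, weight, bias):
    """Project visual features to EMBED_DIM and prepend them to the text embeddings."""
    visual_tokens = [linear(v, weight, bias) for v in visual_feats]
    return visual_tokens + text_embeds  # visual tokens first, then language tokens


random.seed(0)
n_vis, n_txt = 4, 3  # hypothetical token counts

visual_feats = [[random.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(n_vis)]
text_embeds = [[random.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(n_txt)]
weight = [[random.gauss(0, 0.02) for _ in range(EMBED_DIM)] for _ in range(VISION_DIM)]
bias = [0.0] * EMBED_DIM

fused = fuse_tokens(visual_feats, text_embeds, weight, bias)
print(len(fused), len(fused[0]))
```

Every row of `fused` has the same width, so the decoder layers can attend across visual and language positions in a single sequence. The text embeddings pass through unchanged; only the visual features are projected.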