Interactive 3D Exploration
pi0 Architecture
Interactive 3D Visual Guide
by Vizuara
Architecture Pipeline
Explore Each Step
1. Visual Encoding with SigLIP
Each camera image is split into patches and passed through a SigLIP vision encoder, producing a grid of visual tokens, one per patch, that capture the spatial features of the scene.
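To make the "grid of visual tokens" concrete, here is a minimal sketch of the patch-splitting step only; the encoder that turns each patch into an embedding is not shown. The 224-pixel input size and 14-pixel patch size are assumed values for illustration, and `patch_grid` and `patchify` are hypothetical helper names, not part of any library.

```python
def patch_grid(height, width, patch):
    """Return (rows, cols) of non-overlapping patches covering the image."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch, width // patch


def patchify(image, patch):
    """Split an H x W image (a list of pixel rows) into flattened patches,
    ordered row by row across the grid."""
    rows, cols = patch_grid(len(image), len(image[0]), patch)
    patches = []
    for r in range(rows):
        for c in range(cols):
            patches.append([
                image[r * patch + i][c * patch + j]
                for i in range(patch)
                for j in range(patch)
            ])
    return patches


# Assumed sizes: a 224 x 224 single-channel image and 14-pixel patches.
image = [[(y * 224 + x) % 256 for x in range(224)] for y in range(224)]
patches = patchify(image, 14)
print(patch_grid(224, 224, 14), len(patches), len(patches[0]))
```

Under these assumptions the image yields a 16 x 16 grid, so one image contributes 256 tokens to the sequence before any language or action tokens are added.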
2. VLM Backbone: PaliGemma
Visual tokens are combined with the language instruction tokens inside PaliGemma, producing fused multimodal embeddings that encode both the scene and the task.
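A toy sketch of how the two token streams might be placed in one sequence before the transformer layers fuse them: visual tokens are linearly projected to the language model's width and concatenated with embedded instruction tokens. Everything here is an assumption for illustration: the tiny widths, the random weights, the whitespace "tokenizer", and the names `linear` and `build_prefix`. A real model uses a subword tokenizer and trained weights.

```python
import random


def linear(vec, weight):
    """Multiply a length-d_in vector by a d_in x d_out matrix (list of rows)."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


random.seed(0)
D_VISION, D_MODEL = 6, 4  # toy widths, not the real model sizes

# Stand-ins for encoder outputs and trained parameters.
visual_tokens = [[random.random() for _ in range(D_VISION)] for _ in range(5)]
projection = [[random.gauss(0, 0.1) for _ in range(D_MODEL)]
              for _ in range(D_VISION)]
vocab = {"pick": 0, "up": 1, "the": 2, "cup": 3}
embedding_table = [[random.gauss(0, 0.1) for _ in range(D_MODEL)]
                   for _ in vocab]


def build_prefix(visual_tokens, instruction):
    """Project visual tokens to the model width, then append text embeddings."""
    image_part = [linear(tok, projection) for tok in visual_tokens]
    text_part = [embedding_table[vocab[word]] for word in instruction.split()]
    return image_part + text_part


prefix = build_prefix(visual_tokens, "pick up the cup")
print(len(prefix), len(prefix[0]))
```

The sketch stops at sequence construction. The fusion itself happens in the attention layers, where every token in this sequence can attend to the others.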
3. Action Expert: Flow Matching
Noisy action samples are iteratively denoised through learned velocity fields, producing smooth robot trajectories via flow matching.
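The iterative denoising can be sketched as Euler integration of a velocity field from noise (tau = 0) to an action (tau = 1). This is a minimal illustration under stated assumptions: the learned network is replaced by a hypothetical `oracle_velocity` that already knows the target, the time convention and the choice of ten steps are assumptions, and a single 3-dimensional action stands in for a full trajectory.

```python
import random


def oracle_velocity(x, tau, target):
    """Stand-in for the learned network: on the straight path
    x_tau = (1 - tau) * noise + tau * target, the velocity is
    (target - x_tau) / (1 - tau)."""
    return [(t - xi) / (1.0 - tau) for xi, t in zip(x, target)]


def denoise(noise, velocity_fn, n_steps=10):
    """Euler-integrate the velocity field from tau = 0 to tau = 1."""
    x = list(noise)
    for k in range(n_steps):
        tau = k / n_steps  # never reaches 1, so the oracle never divides by 0
        v = velocity_fn(x, tau)
        x = [xi + vi / n_steps for xi, vi in zip(x, v)]
    return x


random.seed(0)
target_action = [0.5, -0.2, 0.1]  # hypothetical action values
noise = [random.gauss(0, 1) for _ in target_action]
action = denoise(noise, lambda x, tau: oracle_velocity(x, tau, target_action))
print([round(a, 6) for a in action])
```

Because the oracle's velocity is constant along the straight path, the Euler steps land on the target up to floating-point error. A trained network only approximates this field, conditioned on the observation rather than on the answer.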
4. Input Assembly & Attention
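Three token blocks (the image and language prefix, the robot state, and the noisy actions) are assembled into a single sequence and processed with block-wise causal attention inside the transformer: tokens attend freely within their own block and to earlier blocks, but not to later ones.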
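A block-wise causal mask can be sketched in a few lines: a query token may attend to a key token when the key's block is the same as, or earlier than, the query's block. The block sizes below are toy values chosen for readability, and `blockwise_causal_mask` is a hypothetical helper name.

```python
def blockwise_causal_mask(block_sizes):
    """Return mask[q][k] == True when query token q may attend to key token k.
    Attention is full within a block and causal across blocks."""
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    return [[key_block <= query_block for key_block in block_of]
            for query_block in block_of]


# Toy sizes: 3 tokens in the first block, 1 in the second, 2 in the third.
mask = blockwise_causal_mask([3, 1, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxx...
# xxx...
# xxx...
# xxxx..
# xxxxxx
# xxxxxx
```

Each row is a query and each column a key. The first block never sees later tokens, while the last block sees the whole sequence, which is what lets the earlier blocks be computed once and reused across denoising steps.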