
pi0 Architecture

Interactive 3D Visual Guide

by Vizuara

Architecture Pipeline

Image + Text → Robot Actions

Explore Each Step

Visual Encoding with SigLIP

Each camera image is split into patches and encoded by a SigLIP vision encoder, producing a grid of visual tokens that capture the spatial features of the scene.
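To make the "grid of visual tokens" concrete, here is a minimal NumPy sketch of the patch-splitting step that comes before a vision transformer such as SigLIP. The 224-pixel image, 14-pixel patch size, 64-dimensional embedding, and the names `patchify` and `embed_patches` are illustrative assumptions, and the random projection stands in for the encoder's learned weights.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into a row-major grid of flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(patches, weight, bias):
    """Linear patch embedding: one token vector per patch."""
    return patches @ weight + bias

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # hypothetical camera frame
patches = patchify(image, patch_size=14)   # 16 x 16 grid of patches

# Random weights as a placeholder for the learned embedding (assumption).
weight = rng.normal(scale=0.02, size=(patches.shape[1], 64))
bias = np.zeros(64)
tokens = embed_patches(patches, weight, bias)
print(tokens.shape)
```

In a real encoder these patch embeddings would then pass through the transformer layers; the sketch stops at the point where the image has become a token grid.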


VLM Backbone — PaliGemma

Visual tokens are combined with the tokens of the language instruction inside PaliGemma, producing fused multimodal embeddings that encode both the scene and the task.
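One way to picture the merge is as a projection followed by a concatenation: visual tokens are mapped to the language model's embedding width and placed in the same sequence as the embedded instruction tokens. The sketch below assumes this layout; the dimensions, the `build_prefix` helper, and the random weights are hypothetical and not taken from the PaliGemma code.

```python
import numpy as np

def build_prefix(visual_tokens, text_ids, projection, embedding_table):
    """Project visual tokens to the LM width and prepend them to the text embeddings."""
    image_part = visual_tokens @ projection   # (num_patches, lm_dim)
    text_part = embedding_table[text_ids]     # (num_text_tokens, lm_dim)
    return np.concatenate([image_part, text_part], axis=0)

rng = np.random.default_rng(0)
vision_dim, lm_dim, vocab_size = 64, 128, 1000   # illustrative sizes (assumptions)

visual_tokens = rng.normal(size=(256, vision_dim))   # output of the vision encoder
text_ids = np.array([12, 407, 33, 9])                # hypothetical tokenized instruction
projection = rng.normal(scale=0.02, size=(vision_dim, lm_dim))
embedding_table = rng.normal(scale=0.02, size=(vocab_size, lm_dim))

prefix = build_prefix(visual_tokens, text_ids, projection, embedding_table)
print(prefix.shape)
```

The fusion itself happens afterwards, when the transformer's attention layers let image and text positions exchange information.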


Action Expert — Flow Matching

Starting from random noise, action samples are iteratively denoised by following a learned velocity field. This flow matching procedure produces smooth robot trajectories.
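The denoising loop can be sketched as Euler integration of a velocity field from noise (tau = 0) to actions (tau = 1). In this minimal example, `toy_velocity` is a hand-written stand-in for the learned network: it is the exact conditional velocity for a straight-line path toward one known target trajectory. The horizon, action dimension, and step count are illustrative assumptions.

```python
import numpy as np

def toy_velocity(actions, tau, target):
    """Stand-in for the learned velocity field (assumption, not a trained model).

    For the straight-line path x_tau = tau * target + (1 - tau) * noise,
    the velocity that carries x_tau to the target is (target - x_tau) / (1 - tau).
    """
    return (target - actions) / (1.0 - tau)

def sample_actions(velocity_fn, noise, num_steps):
    """Euler integration from tau = 0 (noise) to tau = 1 (denoised actions)."""
    actions = noise.copy()
    delta = 1.0 / num_steps
    for step in range(num_steps):
        tau = step * delta
        actions = actions + delta * velocity_fn(actions, tau)
    return actions

rng = np.random.default_rng(0)
horizon, action_dim = 50, 7   # hypothetical action chunk size

# A smooth target trajectory shared across all action dimensions.
target = np.sin(np.linspace(0.0, np.pi, horizon))[:, None] * np.ones((1, action_dim))
noise = rng.normal(size=(horizon, action_dim))

actions = sample_actions(lambda a, t: toy_velocity(a, t, target), noise, num_steps=10)
print(np.abs(actions - target).max())
```

In the real model the velocity would come from the action expert, conditioned on the observation, rather than from a known target.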


Input Assembly & Attention

Three token blocks are assembled into a single sequence and processed by the transformer with block-wise causal attention: tokens attend freely within their own block and to earlier blocks, but not to later ones.
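A block-wise causal mask can be built from the block sizes alone. The sketch below assumes three blocks in the order image and language prefix, robot state, and noisy action tokens. That ordering and the tiny block sizes are assumptions for illustration, and `blockwise_causal_mask` is a hypothetical helper name.

```python
import numpy as np

def blockwise_causal_mask(block_sizes):
    """Boolean mask where entry [q, k] is True if query token q may attend to key token k.

    Attention is bidirectional inside a block and causal across blocks:
    a token sees its own block and every earlier block, never a later one.
    """
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return block_ids[:, None] >= block_ids[None, :]

# Assumed layout: [image + language prefix], [robot state], [noisy actions].
mask = blockwise_causal_mask([3, 1, 2])
print(mask.astype(int))
```

With this layout the prefix never sees the state or action tokens, so its keys and values do not change between denoising steps and could be computed once and reused.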
