ML Architecture
Vision Transformer (ViT) Architecture
Patch embedding, position encoding, transformer encoder stack and a classification head.
Prompt
A Vision Transformer (ViT) architecture, left-to-right horizontal flow.

Step 1 - Patch Embedding (left):
- An input image is split into a 4x4 grid of non-overlapping 16x16 patches.
- Each patch is flattened and linearly projected to a D=768 embedding.
- Show the patch grid clearly with a small example image inside.

Step 2 - Position + [CLS] token:
- Prepend a learnable [CLS] token to the patch sequence.
- Add learned positional embeddings element-wise.

Step 3 - Transformer Encoder (center):
- A stack of L=12 standard encoder layers (multi-head self-attention + MLP + LayerNorm).
- Show the stack as a tall vertical column with one expanded layer to the side.

Step 4 - Output (right):
- The [CLS] token output is projected by an MLP head to class logits.

Style: clean academic vector, navy and teal palette, thin connectors, white background. Annotate tensor shapes (N+1, D).
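To check the shape annotations before generating the figure, the numbers in the prompt can be worked through in a few lines. This is a minimal sketch, not part of the generator: `ViTConfig` and `vit_shapes` are hypothetical names, and the 3-channel (RGB) input is an assumption the prompt does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    grid: int = 4      # patches per side (from the prompt)
    patch: int = 16    # patch side in pixels (from the prompt)
    dim: int = 768     # embedding width D (from the prompt)
    layers: int = 12   # encoder depth L (from the prompt)
    channels: int = 3  # ASSUMPTION: RGB input, not stated in the prompt


def vit_shapes(cfg):
    """Return the tensor shape at each step of the prompt's flow."""
    image_side = cfg.grid * cfg.patch              # 4 * 16 pixels per side
    n = cfg.grid ** 2                              # number of patches N
    flat = cfg.patch * cfg.patch * cfg.channels    # flattened patch length
    return {
        "image": (image_side, image_side, cfg.channels),
        "patches": (n, flat),            # Step 1: flattened patches
        "embedded": (n, cfg.dim),        # Step 1: after linear projection
        "with_cls": (n + 1, cfg.dim),    # Step 2: [CLS] prepended, positions added
        "encoder_out": (n + 1, cfg.dim), # Step 3: shape is unchanged by each layer
        "cls_out": (cfg.dim,),           # Step 4: [CLS] vector fed to the MLP head
    }


shapes = vit_shapes(ViTConfig())
for name, shape in shapes.items():
    print(f"{name:12s} {shape}")
```

With the prompt's values, N = 16, so the (N+1, D) annotation should read (17, 768), and the implied input image is 64x64 pixels.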
When to use
For computer-vision papers using transformer encoders for classification, segmentation or detection.
Variations
Hierarchical Swin variant
Same flow but show a hierarchical Swin-style ViT with 4 stages of decreasing spatial resolution and increasing channel dim. Add window-attention boxes inside each stage.
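For the per-stage labels, it can help to compute the resolution and channel width of each stage up front. The sketch below assumes the commonly used Swin-T settings (224x224 input, 4x4 patches, base width 96, 7x7 windows, each later stage halving resolution and doubling channels); none of these values come from the prompt above, and `swin_stage_shapes` is a hypothetical helper.

```python
def swin_stage_shapes(image_side=224, patch=4, base_dim=96, stages=4, window=7):
    """List (height, width, channels, windows) for each stage.

    ASSUMPTION: Swin-T style defaults. Every stage after the first
    merges 2x2 neighbouring patches, halving the side and doubling channels.
    """
    side = image_side // patch
    dim = base_dim
    result = []
    for stage in range(stages):
        if stage > 0:
            side //= 2  # patch merging halves spatial resolution
            dim *= 2    # and doubles the channel dimension
        windows = (side // window) ** 2  # non-overlapping attention windows
        result.append((side, side, dim, windows))
    return result


stage_shapes = swin_stage_shapes()
for i, (h, w, c, n_win) in enumerate(stage_shapes, start=1):
    print(f"Stage {i}: {h}x{w}x{c}, {n_win} windows")
```

Substitute your own image size and base width if the paper uses a different variant; the figure labels should match the values you report.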
Tips
- State patch size and image size; figures default to a generic grid otherwise.
- Mention "[CLS] token" by name. Models reproduce the literal label and arrow.
- For a paper figure, fix L (layer count) to your value to avoid ambiguity.
FAQ
How do I extend this to dense prediction (segmentation)?
Replace the MLP head with "a lightweight decoder that reshapes patch tokens back to 2D and produces a per-pixel mask." The encoder structure stays the same.
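The reshaping step is the part most worth getting right in the figure: the [CLS] token is dropped and the remaining N tokens are laid back out on the patch grid. The toy sketch below illustrates only that data flow. It assumes row-major patch order, and it uses a simple threshold plus nearest-neighbour upsampling as a stand-in for a learned decoder; `tokens_to_grid` and `upsample_labels` are hypothetical names.

```python
def tokens_to_grid(tokens, grid):
    """Drop the leading [CLS] token and arrange patch tokens row-major on a grid."""
    patch_tokens = tokens[1:]
    if len(patch_tokens) != grid * grid:
        raise ValueError("expected grid*grid patch tokens after [CLS]")
    return [patch_tokens[r * grid:(r + 1) * grid] for r in range(grid)]


def upsample_labels(label_grid, patch):
    """Nearest-neighbour upsampling: repeat each patch label over a patch x patch block."""
    mask = []
    for row in label_grid:
        pixel_row = [label for label in row for _ in range(patch)]
        mask.extend(list(pixel_row) for _ in range(patch))
    return mask


# Toy input: [CLS] plus a 2x2 grid of 1-dimensional patch tokens.
tokens = [[0.0], [0.5], [-0.2], [-0.1], [0.9]]
token_grid = tokens_to_grid(tokens, grid=2)

# Stand-in for a learned decoder: label a patch 1 if its feature is positive.
labels = [[1 if tok[0] > 0 else 0 for tok in row] for row in token_grid]
mask = upsample_labels(labels, patch=2)

for row in mask:
    print(row)
```

In the figure, this corresponds to an arrow from the encoder output (minus [CLS]) into a 2D grid, followed by the decoder block and a mask at the input resolution.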
