Vision Transformer (ViT) Architecture

Patch embedding, position encoding, transformer encoder stack and a classification head.

Prompt

A Vision Transformer (ViT) architecture, left-to-right horizontal flow.

Step 1 — Patch Embedding (left):
- An input image (64x64 in this example) is split into a 4x4 grid of non-overlapping 16x16 patches, giving N=16 patches.
- Each patch is flattened and linearly projected to a D=768 embedding.
- Show the patch grid clearly with a small example image inside.

Step 2 — Position + [CLS] token:
- Prepend a learnable [CLS] token to the patch sequence.
- Add learned positional embeddings element-wise.

Step 3 — Transformer Encoder (center):
- A stack of L=12 standard encoder layers (multi-head self-attention + MLP + LayerNorm).
- Show the stack as a tall vertical column with one expanded layer to the side.

Step 4 — Output (right):
- The [CLS] token output is projected by an MLP head to class logits.

Style: clean academic vector, navy and teal palette, thin connectors, white background. Annotate tensor shapes, e.g. (N+1, D) for the token sequence once the [CLS] token is prepended, where N is the number of patches.
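
The shape annotations requested in the prompt can be checked with simple arithmetic before generating the figure. The sketch below uses a hypothetical helper, vit_shapes, which is not part of any library. The 64x64 image size follows from a 4x4 grid of 16x16 patches; the 3 input channels and the 1000 output classes are assumptions that the prompt does not state.

```python
def vit_shapes(image_size=64, patch_size=16, channels=3, dim=768, num_classes=1000):
    """Return the tensor shape at each stage of the figure (batch axis omitted).

    channels and num_classes are illustrative assumptions, not values
    taken from the prompt.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size   # patches per side (4 in the prompt)
    n = grid * grid                   # N, the number of patch tokens
    return {
        "image": (channels, image_size, image_size),
        "flattened_patches": (n, patch_size * patch_size * channels),
        "patch_embeddings": (n, dim),            # Step 1: linear projection to D
        "with_cls_and_positions": (n + 1, dim),  # Step 2: [CLS] prepended
        "encoder_output": (n + 1, dim),          # Step 3: shape is preserved
        "cls_output": (dim,),                    # Step 4: [CLS] row only
        "logits": (num_classes,),                # Step 4: after the MLP head
    }


shapes = vit_shapes()
for stage, shape in shapes.items():
    print(f"{stage:24s} {shape}")
```

Calling vit_shapes(image_size=224) gives the corresponding labels for a 14x14 grid, which may be a closer match for a paper figure than the 4x4 illustration.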

When to use

For computer-vision papers using transformer encoders for classification, segmentation or detection.

Variations

Hierarchical Swin variant

Same flow, but show a hierarchical Swin-style ViT with 4 stages of decreasing spatial resolution and increasing channel dimension. Add window-attention boxes inside each stage.
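
To label each stage of the hierarchical variant, it can help to tabulate the token grid and channel width per stage. This is a minimal sketch under assumed defaults (224x224 input, 4x4 patches, base width 96, 7x7 attention windows, resolution halved and width doubled between stages). These values resemble a common small Swin configuration but are not given in the prompt, and swin_stage_shapes is a hypothetical helper name.

```python
def swin_stage_shapes(image_size=224, patch_size=4, base_dim=96,
                      num_stages=4, window=7):
    """List the token grid, channel width and window count for each stage.

    All default values are illustrative assumptions.
    """
    side = image_size // patch_size   # tokens per side at stage 1
    dim = base_dim
    stages = []
    for stage in range(1, num_stages + 1):
        stages.append({
            "stage": stage,
            "grid": (side, side),               # spatial resolution in tokens
            "dim": dim,                         # channel dimension
            "windows": (side // window) ** 2,   # non-overlapping attention windows
        })
        side //= 2   # patch merging halves the resolution
        dim *= 2     # and doubles the channel dimension
    return stages


stages = swin_stage_shapes()
for s in stages:
    print(s)
```

The resulting grid and dim values can be written into the prompt as per-stage annotations, so the generator does not have to invent them.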

Tips

- State the patch size and image size; otherwise figures default to a generic grid.
- Mention "[CLS] token" by name so that models reproduce the literal label and its arrow.
- For a paper figure, set L (the layer count) to your model's value to avoid ambiguity.

FAQ

How do I extend this to dense prediction (segmentation)?

Replace the MLP head with "a lightweight decoder that reshapes the patch tokens back to a 2D grid and produces a per-pixel mask." The encoder structure stays the same.
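
The reshaping step the decoder performs can be illustrated without any tensor library. The sketch below assumes the patch tokens are stored in row-major order, that the grid is square, and that the [CLS] token sits at index 0; tokens_to_grid is a hypothetical helper, and the string tokens stand in for D-dimensional vectors.

```python
import math


def tokens_to_grid(tokens, has_cls=True):
    """Arrange a flat token sequence into a square 2D grid (row-major order)."""
    patch_tokens = tokens[1:] if has_cls else tokens   # drop [CLS] if present
    side = math.isqrt(len(patch_tokens))
    if side * side != len(patch_tokens):
        raise ValueError("patch tokens do not form a square grid")
    return [patch_tokens[r * side:(r + 1) * side] for r in range(side)]


# 16 patch tokens plus [CLS], matching the 4x4 grid in the prompt.
tokens = ["cls"] + [f"p{i}" for i in range(16)]
grid = tokens_to_grid(tokens)
for row in grid:
    print(row)

# With 16x16 patches, a per-pixel mask spans 4 * 16 = 64 pixels per side.
mask_side = len(grid) * 16
print(mask_side)
```

In the figure, this corresponds to an arrow that takes the N patch tokens (not the [CLS] token) from the encoder output into the decoder.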