Transformer Encoder-Decoder Architecture

Publication-quality transformer block diagram with self-attention, cross-attention, and residual connections.

Prompt

A transformer encoder-decoder architecture for sequence-to-sequence modeling.

Layout: vertical stack, encoder on the left column, decoder on the right column, connected by horizontal cross-attention arrows in the middle.

Encoder (6 stacked layers):
- Input embedding + positional encoding at the bottom
- Each layer contains: multi-head self-attention -> Add & LayerNorm -> feed-forward -> Add & LayerNorm
- Show residual (skip) arrows around each sub-layer with curved dashed lines

Decoder (6 stacked layers):
- Output embedding (shifted right) + positional encoding at the bottom
- Each layer contains: masked multi-head self-attention -> Add & LayerNorm -> cross-attention to encoder output -> Add & LayerNorm -> feed-forward -> Add & LayerNorm
- Linear + softmax head at the top

Style: clean academic vector style, minimal palette (navy blue, teal, light gray), thin borders on rounded boxes, monospace font for tensor shape annotations, white background. Follow NeurIPS figure conventions.
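
The sub-layer order the prompt asks the figure to show can be sketched in a few lines of Python. This is a minimal illustration, not a real transformer: the attention and feed-forward functions are toy stand-ins that act on a single vector, LayerNorm has no learned gain or bias, and all function names are hypothetical. It only shows where the residual addition and the normalization sit relative to each sub-layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual (skip) connection around a sub-layer, followed by LayerNorm."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def encoder_layer(x, self_attention, feed_forward):
    """self-attention -> Add & LayerNorm -> feed-forward -> Add & LayerNorm."""
    x = add_and_norm(x, self_attention)
    x = add_and_norm(x, feed_forward)
    return x

def decoder_layer(x, memory, masked_self_attention, cross_attention, feed_forward):
    """Same pattern, with cross-attention to the encoder output in the middle."""
    x = add_and_norm(x, masked_self_attention)
    x = add_and_norm(x, lambda v: cross_attention(v, memory))
    x = add_and_norm(x, feed_forward)
    return x

# Toy stand-ins for the real sub-layers (hypothetical, for illustration only).
toy_attention = lambda v: [0.5 * a for a in v]
toy_cross_attention = lambda v, m: [0.5 * (a + b) for a, b in zip(v, m)]
toy_ffn = lambda v: [max(0.0, a) for a in v]

enc_out = encoder_layer([1.0, 2.0, 3.0, 4.0], toy_attention, toy_ffn)
dec_out = decoder_layer([0.0, 1.0, 0.0, 1.0], enc_out,
                        toy_attention, toy_cross_attention, toy_ffn)
```

Each `add_and_norm` call corresponds to one curved residual arrow plus one "Add & LayerNorm" box in the figure.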

When to use

For NeurIPS / ICML / ICLR papers introducing or extending transformer-based architectures. Works well as Figure 1 of a methods section.

Variations

Decoder-only (GPT style)

A decoder-only transformer architecture in the style of GPT. 12 stacked layers, each with masked multi-head self-attention -> Add & LayerNorm -> feed-forward -> Add & LayerNorm. Token embedding + positional encoding at the bottom; linear + softmax language modeling head on top. Show residual connections as curved arrows. Annotate hidden dim 768, num heads 12. Clean vector style, navy/teal palette.
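
If you also want a per-head dimension in the annotations, it follows from the two values above, assuming the usual convention that the hidden dimension is split evenly across heads. The helper name below is hypothetical.

```python
def head_dim(hidden_dim, num_heads):
    """Per-head dimension, assuming the hidden dim is split evenly across heads."""
    if hidden_dim % num_heads != 0:
        raise ValueError("hidden_dim must be divisible by num_heads")
    return hidden_dim // num_heads

# Values taken from the decoder-only prompt above.
per_head = head_dim(768, 12)
print(per_head)  # 64
```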

Vision Transformer (ViT)

A Vision Transformer architecture. The input image is split into 16x16 patches. Each patch is linearly projected, a positional embedding is added, and a learnable [CLS] token is prepended to the sequence. The sequence is fed through 12 transformer encoder layers (multi-head self-attention + MLP + LayerNorm). The [CLS] token output goes to a classification head. Show the patch grid clearly on the left, the encoder stack in the center, and the MLP head on the right. Academic style, white background.
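
To annotate the sequence length, you need the patch count. The sketch below assumes that "16x16" is the patch size in pixels and that the input is a square 224x224 image. Neither is stated in the prompt, so substitute your own values. The function name is hypothetical.

```python
def vit_sequence_length(image_size, patch_size, cls_token=True):
    """Number of tokens entering the encoder for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    # The learnable [CLS] token adds one position to the sequence.
    return num_patches + (1 if cls_token else 0)

# Assumed input: 224x224 pixels, 16x16-pixel patches -> 14x14 grid + [CLS].
print(vit_sequence_length(224, 16))  # 197
```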

Tips

State the number of layers, the number of heads, and the hidden dim explicitly; generators reproduce these as labels.
Use the words "residual", "Add & LayerNorm", "cross-attention" exactly — they map to standard visual primitives.
Avoid mixing top-down and left-right flow in one prompt. Pick one and stay consistent.
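
One way to apply these tips consistently is to build the prompt from parameters. The sketch below is an assumption about workflow, not part of any generator's API. `build_prompt` is a hypothetical helper, and the hidden dim and head count in the usage line are example values only. It states the numbers explicitly, reuses the exact primitive names, and accepts only one flow direction per prompt.

```python
LAYOUTS = {
    # Wording taken from the main prompt and the FAQ on this page.
    "vertical": "Layout: vertical stack, encoder on the left column, "
                "decoder on the right column.",
    "horizontal": "Layout: horizontal left-to-right, single row, max 7in width.",
}

def build_prompt(num_layers, num_heads, hidden_dim, flow="vertical"):
    """Assemble an encoder-decoder figure prompt with a single flow direction."""
    if flow not in LAYOUTS:
        raise ValueError("flow must be 'vertical' or 'horizontal'")
    lines = [
        "A transformer encoder-decoder architecture for sequence-to-sequence modeling.",
        LAYOUTS[flow],
        f"Encoder ({num_layers} stacked layers): multi-head self-attention -> "
        "Add & LayerNorm -> feed-forward -> Add & LayerNorm.",
        f"Decoder ({num_layers} stacked layers): masked multi-head self-attention -> "
        "Add & LayerNorm -> cross-attention to encoder output -> Add & LayerNorm -> "
        "feed-forward -> Add & LayerNorm.",
        "Show residual (skip) arrows around each sub-layer.",
        f"Annotate hidden dim {hidden_dim}, num heads {num_heads}.",
    ]
    return "\n".join(lines)

# Example values only; use the numbers from your own model.
prompt = build_prompt(num_layers=6, num_heads=8, hidden_dim=512)
print(prompt)
```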

FAQ

Can I add my own block names like "Mixture of Experts"?

Yes. Replace any sub-layer description with your own block name and a one-sentence behavior description; the model will draw a labeled box and route arrows around it.

How do I get a wider, single-row layout for a 2-column paper?

Add "Layout: horizontal left-to-right, single row, max 7in width" to the prompt and remove the vertical-stack instruction.

Why are the tensor shape annotations sometimes wrong?

Generators do not reproduce numeric labels reliably. Either omit specific shapes, or list them once at the bottom as a separate annotation block with explicit values.
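
A separate annotation block could be generated like this. The sketch reads "at the bottom" as the end of the prompt, which is an assumption. The helper name, the header wording, and the example shapes are all hypothetical.

```python
def shape_annotation_block(shapes):
    """Format tensor shapes as one explicit block to append to a prompt."""
    lines = ["Tensor shape annotations (monospace font, listed once):"]
    for name, dims in shapes.items():
        dims_text = ", ".join(str(d) for d in dims)
        lines.append(f"- {name}: [{dims_text}]")
    return "\n".join(lines)

# Example shapes only; replace with the values from your own model.
example_shapes = {
    "encoder input": ("batch", "src_len", 512),
    "decoder output": ("batch", "tgt_len", 512),
}
block = shape_annotation_block(example_shapes)
print(block)
```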