Decoder-Only LLM Architecture (GPT-Style)

Stacked decoder blocks with masked self-attention and a language modeling head.

Prompt

A decoder-only transformer architecture in the style of GPT, drawn as a vertical stack with input at the bottom and output at the top.

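Bottom: token embedding + learned positional embedding (GPT models learn their position embeddings; sinusoidal encodings belong to the original encoder-decoder Transformer).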

Middle: a stack of N decoder layers (N=12 for the figure). Each layer contains:
- Masked multi-head self-attention (12 heads)
- Add & LayerNorm
- Feed-forward MLP (hidden dim 3072)
- Add & LayerNorm

Show residual (skip) connections as curved dashed arcs around each sub-layer.

Top:
- Final LayerNorm
- Linear projection to vocab size
- Softmax to next-token probability distribution

Right margin: tensor shape annotations beside each block (B = batch size, T = sequence length, D = hidden dim = 768).
Style: clean academic vector, navy / teal accent, white background, sans-serif labels. Suitable for ICLR or NeurIPS.
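
To check the shape annotations in the generated figure against the numbers in the prompt above, a small helper can list the expected shape at each block. This is a minimal sketch: the function name and row labels are illustrative assumptions, the per-head size assumes D is split evenly across the 12 heads, and the vocabulary size stays symbolic as V because the prompt does not fix it.

```python
def shape_annotations(n_layers=12, n_heads=12, d_model=768, d_ff=3072):
    """Return (block label, shape string) pairs, listed bottom to top."""
    head_dim = d_model // n_heads  # assumes d_model is divisible by n_heads
    rows = [
        ("token ids", "(B, T)"),
        ("token + positional embedding", f"(B, T, {d_model})"),
    ]
    for i in range(1, n_layers + 1):
        rows.append((f"layer {i}: per-head Q, K, V", f"(B, {n_heads}, T, {head_dim})"))
        rows.append((f"layer {i}: attention output", f"(B, T, {d_model})"))
        rows.append((f"layer {i}: MLP hidden", f"(B, T, {d_ff})"))
        rows.append((f"layer {i}: MLP output", f"(B, T, {d_model})"))
    rows.append(("final LayerNorm", f"(B, T, {d_model})"))
    # V is the vocabulary size; it is left symbolic here.
    rows.append(("logits / softmax", "(B, T, V)"))
    return rows


if __name__ == "__main__":
    for label, shape in shape_annotations()[:6]:
        print(f"{label:32s} {shape}")
```

Add and LayerNorm steps keep the shape unchanged, so they are omitted from the listing.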

When to use

Use for LLM, fine-tuning, or instruction-tuning papers that introduce or modify a decoder-only model.

Variations

With KV-cache annotation

Same architecture, but annotate the KV-cache flow: highlight where keys and values are cached at each layer during autoregressive decoding. Add a side note showing that, at each new decoding step, the cached keys and values for earlier positions are reused rather than recomputed.
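
For the side note, it can help to have a concrete picture of what the cache saves. The following is a toy sketch, not a real model: it uses a single attention head with scalar inputs, and the weights `w_q`, `w_k`, `w_v` and both function names are made up for illustration. It shows that decoding with a cache gives the same outputs while computing far fewer key/value projections.

```python
import math


def attend(query, keys, values):
    """Scalar single-head attention: softmax over query*key, then a weighted sum of values."""
    scores = [query * k for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def decode_with_cache(inputs, w_q=1.0, w_k=0.5, w_v=2.0):
    """Project each new position once and append it to the cache."""
    k_cache, v_cache, outputs = [], [], []
    projections = 0
    for x in inputs:
        k_cache.append(w_k * x)
        v_cache.append(w_v * x)
        projections += 1
        outputs.append(attend(w_q * x, k_cache, v_cache))
    return outputs, projections


def decode_without_cache(inputs, w_q=1.0, w_k=0.5, w_v=2.0):
    """Re-project every earlier position at every step."""
    outputs, projections = [], 0
    for t in range(1, len(inputs) + 1):
        prefix = inputs[:t]
        keys = [w_k * x for x in prefix]
        values = [w_v * x for x in prefix]
        projections += t
        outputs.append(attend(w_q * prefix[-1], keys, values))
    return outputs, projections


if __name__ == "__main__":
    xs = [0.5, -1.0, 2.0, 0.25]
    print(decode_with_cache(xs)[1], decode_without_cache(xs)[1])
```

In this toy, T steps cost T key/value projections with the cache and T(T+1)/2 without it, which is the contrast the side note should convey.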

Tips

- Specify N (number of layers) and D (hidden dim) explicitly; otherwise the generator tends to pick arbitrary layer counts and dimensions.
- Say "masked" self-attention explicitly. Without it, the figure may omit the causal (lower-triangular) mask.
- To make the causal mask visible, ask for a small triangular mask icon next to the attention block.
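
As a reference for what the triangular mask icon should depict, here is a minimal sketch of a causal mask. The function name is illustrative; 1 marks a position the token may attend to and 0 marks a masked one.

```python
def causal_mask(seq_len):
    """Row i lists which positions token i may attend to (1 = visible, 0 = masked)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    for row in causal_mask(4):
        print(" ".join(str(v) for v in row))
    # 1 0 0 0
    # 1 1 0 0
    # 1 1 1 0
    # 1 1 1 1
```

The visible entries form a lower triangle, so the icon should be filled below and on the diagonal and empty above it.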

FAQ

Can I show LoRA adapters on top of this architecture?

Yes. Add this to the prompt: "Overlay small LoRA adapter modules on each attention and MLP block as orange tabs labeled 'LoRA r=8'."
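
If the caption needs to say what r=8 means in practice, the adapter size can be worked out from the dimensions in the base prompt. This is a rough sketch under two assumptions: each adapter is the usual pair of low-rank matrices (rank x d_in and d_out x rank), and one adapter is attached to a D x D attention projection and one to the D x 3072 MLP projection. Which matrices actually carry adapters depends on your setup.

```python
def lora_params(d_in, d_out, rank=8):
    """Trainable parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


D, D_FF = 768, 3072  # hidden dim and MLP hidden dim from the base prompt

attn_adapter = lora_params(D, D)     # 12288
mlp_adapter = lora_params(D, D_FF)   # 30720
frozen_attn_weight = D * D           # 589824, shown for comparison

if __name__ == "__main__":
    print(attn_adapter, mlp_adapter, frozen_attn_weight)
```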

How do I emphasize the autoregressive generation?

Add "Show three arrows on the right side from output back to input, each labeled with a generation step (t, t+1, t+2)."
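
The loop those arrows represent can be sketched in a few lines. This is an illustrative toy: `toy_next_token` is a hypothetical stand-in for the model, and `generate` is not any library's API. Each iteration feeds the newly produced token back in as input for the next step.

```python
def generate(prompt_ids, next_token, steps=3):
    """Run `steps` decoding steps, feeding each output token back as input."""
    ids = list(prompt_ids)
    trace = []
    for _ in range(steps):
        new_id = next_token(ids)          # model output for the current sequence
        trace.append((len(ids), new_id))  # (position of the new token, token id)
        ids.append(new_id)                # the arrow from output back to input
    return ids, trace


def toy_next_token(ids):
    """Hypothetical stand-in for a model: returns the last id plus one, modulo 10."""
    return (ids[-1] + 1) % 10


if __name__ == "__main__":
    ids, trace = generate([3, 4], toy_next_token)
    print(ids)    # [3, 4, 5, 6, 7]
    print(trace)  # [(2, 5), (3, 6), (4, 7)]
```

The three entries in `trace` correspond to the three arrows labeled t, t+1, and t+2.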