ML Architecture
Decoder-Only LLM Architecture (GPT-Style)
Stacked decoder blocks with masked self-attention and a language modeling head.
Prompt
A decoder-only transformer architecture in the style of GPT, drawn as a vertical stack with input at the bottom and output at the top.
Bottom: token embedding + sinusoidal positional encoding.
Middle: a stack of N decoder layers (N=12 for the figure). Each layer contains:
- Masked multi-head self-attention (12 heads)
- Add & LayerNorm
- Feed-forward MLP (hidden dim 3072)
- Add & LayerNorm
Show residual (skip) connections as curved dashed arcs around each sub-layer.
Top:
- Final LayerNorm
- Linear projection to vocab size
- Softmax to next-token probability distribution
Right margin: tensor shape annotations beside each block (B = batch, T = seq length, D = 768).
Style: clean academic vector, navy / teal accent, white background, sans-serif labels. Suitable for ICLR or NeurIPS.
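To check the right-margin annotations before generating the figure, the shapes can be derived from the numbers in the prompt. The following is a minimal sketch, not part of the prompt itself: the function name `shape_annotations` is hypothetical, B and T are kept symbolic, and the vocabulary size is left as the placeholder V because the prompt does not fix one.

```python
def shape_annotations(d_model=768, n_heads=12, d_ff=3072, vocab="V"):
    """Return (block, shape) pairs for one pass through the stack.

    B and T stay symbolic; vocab is a placeholder (assumption), since the
    prompt does not state a vocabulary size.
    """
    assert d_model % n_heads == 0, "heads must divide the model width"
    d_head = d_model // n_heads  # 768 / 12 = 64 for the figure
    return [
        ("token ids", "(B, T)"),
        ("token embedding + positional encoding", f"(B, T, {d_model})"),
        ("Q, K, V per head", f"(B, {n_heads}, T, {d_head})"),
        ("attention scores per head", f"(B, {n_heads}, T, T)"),
        ("attention output after head merge", f"(B, T, {d_model})"),
        ("MLP hidden", f"(B, T, {d_ff})"),
        ("MLP output / layer output", f"(B, T, {d_model})"),
        ("final LayerNorm", f"(B, T, {d_model})"),
        ("logits and softmax", f"(B, T, {vocab})"),
    ]

for block, shape in shape_annotations():
    print(f"{block:40s} {shape}")
```

Pasting the printed pairs into the prompt is one way to keep the figure's annotations consistent with the stated D, head count, and MLP width.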
When to use
For LLM / fine-tuning / instruction-tuning papers introducing or modifying a decoder-only model.
Variations
With KV-cache annotation
Same architecture but annotate the KV-cache flow: highlight where keys and values are cached at each layer during autoregressive decoding. Add a side note showing how cache reuse skips re-computation across positions.
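For the side note on cache reuse, it may help to see what the cache actually saves. The sketch below is a toy under stated assumptions: a single attention head, hypothetical diagonal projections `W_K` and `W_V`, and the raw input used as the query. It compares decoding with and without a cache and counts the key/value projections each one performs.

```python
import math

W_K = [0.5, -1.0, 2.0]   # hypothetical diagonal key projection
W_V = [1.0, 0.25, -0.5]  # hypothetical diagonal value projection

def project(x, w, counter):
    """Apply a toy diagonal projection and count the call."""
    counter[0] += 1
    return [xi * wi for xi, wi in zip(x, w)]

def attend(q, keys, values):
    """Softmax attention of one query over all visible positions."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def decode_cached(xs):
    counter, k_cache, v_cache, outs = [0], [], [], []
    for x in xs:
        # Only the newest token is projected; earlier K/V come from the cache.
        k_cache.append(project(x, W_K, counter))
        v_cache.append(project(x, W_V, counter))
        outs.append(attend(x, k_cache, v_cache))
    return outs, counter[0]

def decode_uncached(xs):
    counter, outs = [0], []
    for t in range(len(xs)):
        # Every step re-projects the whole prefix up to position t.
        keys = [project(x, W_K, counter) for x in xs[: t + 1]]
        values = [project(x, W_V, counter) for x in xs[: t + 1]]
        outs.append(attend(xs[t], keys, values))
    return outs, counter[0]

tokens = [[1.0, 0.0, 2.0], [0.5, 1.0, -1.0], [0.0, 2.0, 1.0], [1.5, -0.5, 0.0]]
cached_out, cached_calls = decode_cached(tokens)
plain_out, plain_calls = decode_uncached(tokens)
print(cached_calls, plain_calls)  # 8 20
```

Both paths produce the same outputs; the cached path does 2T projections for T tokens, while the uncached path does T(T+1), which is the saving the side note should convey.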
Tips
- Specify N (number of layers) and D (hidden dim) explicitly; generic prompts produce generic counts.
- Mention "masked" self-attention explicitly. Without it, the figure may not show the causal (lower-triangular) attention pattern.
- For causal attention masks, ask for a small triangular mask icon next to the attention block.
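As a reference for what the triangular mask icon should look like, here is a small sketch that prints a causal mask for T = 4. The helper name `causal_mask` is illustrative; `#` marks a position the query may attend to and `.` marks a masked one.

```python
def causal_mask(T):
    """Row i marks the positions that query i may attend to (1 = visible)."""
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]

for row in causal_mask(4):
    print(" ".join("#" if v else "." for v in row))
# # . . .
# # # . .
# # # # .
# # # # #
```

The filled region is the lower triangle including the diagonal: each position sees itself and everything before it, never anything after.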
FAQ
Can I show LoRA adapters on top of this architecture?
Yes. Say "Overlay small LoRA adapter modules on each attention and MLP block as orange tabs labeled 'LoRA r=8'."
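If the figure should also convey why the adapters are drawn small, a quick parameter count can back the label. This is a rough sketch assuming the adapter sits on a single D x D projection with D = 768 (from the main prompt) and rank r = 8 (from the label above); `lora_param_counts` is a hypothetical helper, and real configurations may target different matrices.

```python
def lora_param_counts(d_in, d_out, r):
    """Compare a frozen dense weight with its rank-r adapter.

    The adapter is two factors: A with shape (r, d_in) and B with shape (d_out, r).
    """
    full = d_in * d_out
    adapter = r * d_in + d_out * r
    return full, adapter

full, adapter = lora_param_counts(768, 768, 8)
print(full, adapter, f"{adapter / full:.2%}")  # 589824 12288 2.08%
```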
How do I emphasize the autoregressive generation?
Add "Show three arrows on the right side from output back to input, each labeled with a generation step (t, t+1, t+2)."
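The three feedback arrows correspond to a loop in which each predicted token is appended to the input before the next step. A minimal sketch of that loop follows; `generate` and `toy_model` are hypothetical names, and `toy_model` is only a stand-in for a real model's next-token prediction.

```python
def generate(prompt_ids, next_token, steps=3):
    """Greedy loop: each new token is appended and fed back as input."""
    ids = list(prompt_ids)
    trace = []
    for step in range(steps):
        new_id = next_token(ids)   # the model sees the full sequence so far
        trace.append((f"t+{step}" if step else "t", new_id))
        ids.append(new_id)         # output loops back to the input
    return ids, trace

# Hypothetical stand-in for a model: predicts (last id + 1) mod 10.
toy_model = lambda ids: (ids[-1] + 1) % 10

ids, trace = generate([3, 4], toy_model)
print(ids)    # [3, 4, 5, 6, 7]
print(trace)  # [('t', 5), ('t+1', 6), ('t+2', 7)]
```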
