Mixture-of-Experts (MoE) Layer

Sparse routing of tokens through a gating network into top-k experts.

Prompt

A Mixture-of-Experts (MoE) layer drawn as a single transformer-style block.

Left — Input tokens:
- A sequence of N input token vectors enters the layer (drawn as a row of small rectangles).

Center — Gating Network:
- A small router (gating network) takes each token and outputs scores over E=8 expert slots.
- Top-2 experts per token are selected (sparse routing).
- Show routing decisions as colored arrows from each token to its chosen experts.

Right — Experts:
- Eight expert MLPs labeled E1 ... E8, drawn as parallel boxes.
- Only 2 of the 8 are active per token; inactive experts are shown faded/grayed.

Output:
- Token outputs are formed by a weighted sum of the selected expert outputs (weights from the gating softmax).
- Show the residual + LayerNorm wrapper around the entire MoE block.

Style: clean publication vector, navy / teal / amber palette, white background, suitable for NeurIPS / ICLR.
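
For checking a generated figure against the computation it depicts (this is reference material, not part of the prompt text), the routing described above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the gate weights are taken as a softmax over the two selected router scores only, the experts are toy one-layer ReLU maps rather than full MLPs, the residual + LayerNorm wrapper is omitted, and all names (`moe_forward`, `make_expert`, `D`) are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [v / total for v in exps]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rows, cols):
    """Random Gaussian matrix used for the toy router and experts."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def make_expert(w):
    """Toy expert: one linear map followed by ReLU (stand-in for an MLP)."""
    return lambda x: [max(0.0, v) for v in matvec(w, x)]


def moe_forward(token, router_w, experts, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    logits = matvec(router_w, token)  # one score per expert
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Assumption: gate weights are a softmax over the k selected logits only.
    gates = softmax([logits[e] for e in top])
    out = [0.0] * len(token)
    for gate, e in zip(gates, top):
        expert_out = experts[e](token)  # only the selected experts are evaluated
        out = [o + gate * y for o, y in zip(out, expert_out)]
    return out, top, gates


random.seed(0)
D, E = 4, 8  # D is a hypothetical model width; E=8 experts as in the prompt
router_w = rand_matrix(E, D)
experts = [make_expert(rand_matrix(D, D)) for _ in range(E)]
tokens = rand_matrix(5, D)  # N=5 toy input token vectors

for i, token in enumerate(tokens):
    _, top, gates = moe_forward(token, router_w, experts, k=2)
    labels = [f"E{e + 1}" for e in top]
    print(f"token {i}: experts {labels}, weights {[round(g, 3) for g in gates]}")
```

Each printed line corresponds to one pair of colored arrows in the figure: the two chosen experts and the weights used in the weighted sum. The six experts not listed for a token are the ones to draw faded.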

When to use

For sparse / efficient LLM papers (Switch Transformer, GShard, Mixtral-style).

Variations

Load-balancing loss callout

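Add an inset showing the auxiliary load-balancing loss next to a histogram of expert utilization across a batch. Include a short equation: L_aux = alpha * sum_e f_e * P_e, where f_e is the fraction of tokens routed to expert e and P_e is the mean router probability assigned to expert e.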
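
To produce real numbers for that inset, the loss and the utilization histogram can be computed from a batch of router scores. The sketch below follows the equation as written in this variation; note that the Switch Transformer formulation additionally multiplies the sum by the number of experts. It assumes f_e counts every top-k assignment (so the f_e sum to 1), and the function name, the default `alpha`, and the random router scores are hypothetical placeholders.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [v / total for v in exps]


def load_balancing_loss(router_logits, k=2, alpha=0.01):
    """Return (L_aux, f, P) for a batch of per-token router logits.

    f[e]: fraction of top-k assignments that went to expert e (assumption).
    P[e]: mean router probability given to expert e over the batch.
    """
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in router_logits:
        for e, p in enumerate(softmax(logits)):
            prob_sums[e] += p
        top = sorted(range(n_experts), key=lambda e: logits[e], reverse=True)[:k]
        for e in top:
            counts[e] += 1
    f = [c / (n_tokens * k) for c in counts]
    P = [s / n_tokens for s in prob_sums]
    loss = alpha * sum(fe * pe for fe, pe in zip(f, P))
    return loss, f, P


random.seed(1)
# Placeholder batch: 64 tokens, 8 experts, random router scores.
batch = [[random.gauss(0, 1) for _ in range(8)] for _ in range(64)]
loss, f, P = load_balancing_loss(batch, k=2, alpha=0.01)

print(f"L_aux = {loss:.6f}")
for e, fe in enumerate(f):
    # Text histogram of expert utilization, one bar per expert.
    print(f"E{e + 1} {'#' * round(fe * 80)} {fe:.3f}")
```

The `f` values are what the histogram bars should show. A perfectly balanced router gives every expert the same bar height of 1/8.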

Tips

Fade inactive experts — it visually communicates the "sparse" property at a glance.
Specify top-k explicitly (top-1 and top-2 are common). Without it, the figure may default to dense routing, with every token connected to every expert.
Mention residual + LayerNorm so the block stays consistent with surrounding transformer layers.

FAQ

How do I show expert specialization?

Add a small panel below the main block that labels each expert with its specialization (e.g., "code", "math", "language X"). Pair it with a heatmap of routing frequency, with token classes on one axis and experts on the other.
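
The matrix behind such a heatmap is a row-normalized count of how often each token class is routed to each expert. A minimal sketch, assuming routing decisions have already been logged as (token class, chosen experts) pairs; the class names and assignments below are made-up toy values, not measurements, and `routing_frequency` is a hypothetical helper.

```python
def routing_frequency(assignments, n_experts):
    """Build a class -> per-expert routing frequency table.

    assignments: iterable of (token_class, expert_indices) pairs.
    Each row is normalized so its entries sum to 1.
    """
    counts = {}
    for token_class, expert_ids in assignments:
        row = counts.setdefault(token_class, [0] * n_experts)
        for e in expert_ids:
            row[e] += 1
    freq = {}
    for token_class, row in counts.items():
        total = sum(row)
        freq[token_class] = [c / total if total else 0.0 for c in row]
    return freq


# Toy routing log: each token lists its top-2 experts (0-based indices).
toy_log = [
    ("code", (0, 3)),
    ("code", (0, 1)),
    ("math", (2, 3)),
    ("math", (2, 5)),
]

heatmap = routing_frequency(toy_log, n_experts=8)
for token_class, row in heatmap.items():
    cells = " ".join(f"{v:.2f}" for v in row)
    print(f"{token_class:>5}: {cells}")
```

Each row of `heatmap` becomes one row of the figure's heatmap. Clear specialization shows up as a few dark cells per row rather than a uniform band.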