ML Architecture
Mixture-of-Experts (MoE) Layer
Sparse routing of tokens through a gating network into top-k experts.
Prompt
A Mixture-of-Experts (MoE) layer drawn as a single transformer-style block.
Left - Input tokens:
- A sequence of N input token vectors enters the layer (drawn as a row of small rectangles).
Center - Gating Network:
- A small router (gating network) takes each token and outputs scores over E=8 expert slots.
- Top-2 experts per token are selected (sparse routing).
- Show routing decisions as colored arrows from each token to its chosen experts.
Right - Experts:
- Eight expert MLPs labeled E1 ... E8, drawn as parallel boxes.
- Only 2 of the 8 are active per token; inactive experts are shown faded/grayed.
Output:
- Token outputs are formed by a weighted sum of the selected expert outputs (weights from the gating softmax).
- Show the residual + LayerNorm wrapper around the entire MoE block.
Style: clean publication vector, navy / teal / amber palette, white background, suitable for NeurIPS / ICLR.
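To check that the figure matches the mechanism the prompt describes, the routing can be sketched in a few lines of plain Python. This is a minimal sketch under assumptions, not a reference implementation: the names `route_top_k`, `moe_output`, and `make_expert` are hypothetical, the toy experts simply scale the token, and the softmax is taken over the selected top-k scores only (a Mixtral-style choice; other designs apply the softmax over all experts first). The residual + LayerNorm wrapper is left out.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a short list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(scores, k=2):
    # Pick the k highest-scoring experts for one token.
    # Assumption: weights come from a softmax over the selected scores only.
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    weights = softmax([scores[e] for e in top])
    return top, weights

def moe_output(token, scores, experts, k=2):
    # Weighted sum of the selected experts' outputs; inactive experts never run.
    top, weights = route_top_k(scores, k)
    out = [0.0] * len(token)
    for e, w in zip(top, weights):
        y = experts[e](token)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

def make_expert(scale):
    # Hypothetical stand-in for an expert MLP: multiplies the token by a constant.
    return lambda x: [scale * xi for xi in x]

experts = [make_expert(e + 1) for e in range(8)]          # E1 ... E8
scores = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]       # router scores for one token
out = moe_output([1.0, 2.0], scores, experts, k=2)
print(out)  # [3.5, 7.0]
```

Here the second and fifth experts (E2 and E5) tie for the highest score, so each receives weight 0.5; these are the two boxes that would be drawn in full color for this token, with the other six faded.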
When to use
For sparse / efficient LLM papers (Switch Transformer, GShard, Mixtral-style).
Variations
Load-balancing loss callout
Add an inset showing the auxiliary load-balancing loss with a histogram of expert utilization across a batch. Include a short equation: L_aux = alpha * sum_e f_e * P_e.
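The quantities in that equation can be computed from routing data as sketched below. The prompt does not define f_e and P_e, so this sketch assumes the Switch Transformer convention: f_e is the fraction of routing slots assigned to expert e, and P_e is the mean router probability for expert e over the batch. The function name `load_balancing_loss` and the default `alpha` are hypothetical, and the formula follows the equation as written above (some papers also multiply by the number of experts).

```python
def load_balancing_loss(router_probs, assignments, num_experts, alpha=0.01):
    # router_probs: one list of probabilities over experts per token.
    # assignments:  one list of chosen expert indices per token (length k).
    num_tokens = len(router_probs)
    total_slots = sum(len(chosen) for chosen in assignments)

    # f[e]: fraction of routing slots that went to expert e (the histogram bars).
    f = [0.0] * num_experts
    for chosen in assignments:
        for e in chosen:
            f[e] += 1.0 / total_slots

    # P[e]: mean router probability assigned to expert e across the batch.
    P = [sum(p[e] for p in router_probs) / num_tokens for e in range(num_experts)]

    loss = alpha * sum(fe * pe for fe, pe in zip(f, P))
    return loss, f

# Balanced toy batch: four tokens spread evenly over four experts.
balanced_loss, balanced_f = load_balancing_loss(
    [[0.25, 0.25, 0.25, 0.25]] * 4, [[0], [1], [2], [3]], num_experts=4, alpha=1.0)

# Collapsed toy batch: every token goes to expert 0.
collapsed_loss, collapsed_f = load_balancing_loss(
    [[1.0, 0.0, 0.0, 0.0]] * 4, [[0], [0], [0], [0]], num_experts=4, alpha=1.0)
```

The returned `f` is what the utilization histogram in the inset would plot; the collapsed batch yields a larger loss than the balanced one, which is the behavior the callout is meant to illustrate.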
Tips
- Fade inactive experts: it visually communicates the "sparse" property at a glance.
- Specify top-k explicitly (top-1 and top-2 are common). Without it, the figure may show dense routing, with every token connected to every expert.
- Mention residual + LayerNorm so the block stays consistent with surrounding transformer layers.
FAQ
How do I show expert specialization?
Add a small panel below labeling each expert with what it specializes in (e.g., "code", "math", "language X"). Use a heatmap of token-class -> expert routing frequency.
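If you want the heatmap to reflect real routing data rather than illustrative values, the frequencies can be tallied as in the sketch below. It is a minimal example under assumptions: `routing_frequency` is a hypothetical helper, the class labels and assignments are made-up toy data, and each row is normalized so that it sums to 1.

```python
def routing_frequency(token_classes, assignments, num_experts):
    # token_classes: one class label per token (e.g. "code", "math").
    # assignments:   one list of chosen expert indices per token.
    counts = {c: [0] * num_experts for c in sorted(set(token_classes))}
    for c, chosen in zip(token_classes, assignments):
        for e in chosen:
            counts[c][e] += 1

    # Normalize each row so it gives the share of that class's routing per expert.
    freq = {}
    for c, row in counts.items():
        total = sum(row)
        freq[c] = [n / total if total else 0.0 for n in row]
    return freq

# Toy data: three tokens, top-2 routing over four experts.
freq = routing_frequency(
    ["code", "code", "math"], [[0, 1], [0, 2], [3, 1]], num_experts=4)
print(freq["code"])  # [0.5, 0.25, 0.25, 0.0]
print(freq["math"])  # [0.0, 0.5, 0.0, 0.5]
```

Each key becomes a heatmap row and each list position a column (one per expert), so a strongly specialized expert shows up as a single bright cell in its column.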
