Multimodal Fusion Pipeline (Image + Text)
Per-modality encoders, projection to a shared space, fusion module and a downstream classifier.
Prompt
A multimodal fusion pipeline for image + text classification, left-to-right.

Top branch: Image
- Input image fed into a frozen vision encoder (CLIP-class ViT) producing a sequence of patch embeddings.
- A small projection MLP maps these to a shared embedding dimension D.

Bottom branch: Text
- Input text fed into a frozen language encoder (BERT-class) producing token embeddings.
- A small projection MLP maps these to the same shared embedding dimension D.

Center: Fusion Module
- Cross-attention block where text tokens attend to image patches and vice versa.
- Output: a joint multimodal representation h_mm.

Right: Classifier Head
- A small MLP on top of h_mm produces class logits.
- Loss: cross-entropy.

Style: flat-design publication schematic, white background, no gradients, navy / teal / amber palette, thin arrows, sans-serif. Suitable for ACL / EMNLP / WACV.
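To check a generated diagram against the computation it is meant to depict, it can help to have the forward pass written out. The following is a minimal NumPy sketch, not a reference implementation: the encoder outputs are random stand-ins, all sizes are hypothetical, the cross-attention omits learned query/key/value weights, and pooling h_mm by mean-and-concatenate is an assumption (the prompt only says "joint multimodal representation").

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init(shape):
    """Small random weights standing in for trained parameters."""
    return rng.normal(scale=0.02, size=shape)

def project(x, w1, w2):
    """Small projection MLP: linear -> ReLU -> linear."""
    return np.maximum(x @ w1, 0.0) @ w2

def cross_attend(queries, context):
    """Single-head cross-attention; learned Q/K/V weights omitted for brevity."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

# Hypothetical sizes: 49 patches, 16 tokens, shared dimension D = 256, 3 classes.
N_PATCH, D_IMG, N_TOK, D_TXT, D, N_CLASSES = 49, 768, 16, 768, 256, 3

# Random stand-ins for the outputs of the frozen vision and language encoders.
patches = rng.normal(size=(N_PATCH, D_IMG))
tokens = rng.normal(size=(N_TOK, D_TXT))

# Per-modality projection to the shared dimension D.
img = project(patches, init((D_IMG, D)), init((D, D)))
txt = project(tokens, init((D_TXT, D)), init((D, D)))

# Fusion: text attends to image patches and vice versa, with residual connections.
txt_fused = txt + cross_attend(txt, img)
img_fused = img + cross_attend(img, txt)

# Assumed pooling: mean over each sequence, then concatenate into h_mm.
h_mm = np.concatenate([txt_fused.mean(axis=0), img_fused.mean(axis=0)])

# Classifier head: small MLP producing class logits, then cross-entropy loss.
logits = np.maximum(h_mm @ init((2 * D, D)), 0.0) @ init((D, N_CLASSES))
probs = softmax(logits)
label = 1  # hypothetical ground-truth class index
loss = -np.log(probs[label])
```

Each stage in the sketch corresponds to one box in the schematic, so the arrows in the figure should follow the same order: encoder output, projection, fusion, classifier, loss.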
When to use
For multimodal classification papers (hate speech, medical, retrieval, etc.).
Variations
Late-fusion variant
Replace the cross-attention fusion with simple concatenation of the image and text embeddings followed by an MLP. Label this variant as a "late-fusion" baseline for comparison.
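For comparison with the cross-attention version, the late-fusion baseline reduces to a concatenation and an MLP. The sketch below assumes each modality has already been pooled to a single D-dimensional vector per example; the function name, sizes, and random weights are illustrative only.

```python
import numpy as np

def late_fusion_logits(img_emb, txt_emb, w1, b1, w2, b2):
    """Concatenate pooled image and text embeddings, then apply a two-layer MLP."""
    fused = np.concatenate([img_emb, txt_emb], axis=-1)  # (batch, 2 * D)
    hidden = np.maximum(fused @ w1 + b1, 0.0)            # ReLU
    return hidden @ w2 + b2                              # (batch, n_classes)

rng = np.random.default_rng(0)
BATCH, D, HIDDEN, N_CLASSES = 4, 256, 128, 3  # hypothetical sizes

img_emb = rng.normal(size=(BATCH, D))  # stand-in for pooled image embeddings
txt_emb = rng.normal(size=(BATCH, D))  # stand-in for pooled text embeddings

w1 = rng.normal(scale=0.02, size=(2 * D, HIDDEN))
b1 = np.zeros(HIDDEN)
w2 = rng.normal(scale=0.02, size=(HIDDEN, N_CLASSES))
b2 = np.zeros(N_CLASSES)

logits = late_fusion_logits(img_emb, txt_emb, w1, b1, w2, b2)
```

Because the two modalities only meet at the concatenation, the diagram for this variant needs no arrows between the branches before the MLP.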
With contrastive alignment objective
Add a contrastive alignment loss between image and text embeddings before fusion (CLIP-style InfoNCE). Show this as an auxiliary loss arrow alongside the classification loss.
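A symmetric InfoNCE loss of the kind used in CLIP-style training can be sketched as follows. This is an illustrative NumPy version under stated assumptions: embeddings are pooled to one vector per example, matching image-text pairs share a batch index, and the temperature of 0.07 is a common but arbitrary default.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of img_emb is the positive for row i of txt_emb."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) cosine similarities
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(8, 32))
txt_aligned = img_emb + 0.01 * rng.normal(size=(8, 32))  # near-identical pairs
txt_shuffled = np.roll(txt_aligned, 1, axis=0)           # every pair mismatched

loss_aligned = info_nce(img_emb, txt_aligned)
loss_shuffled = info_nce(img_emb, txt_shuffled)
```

In training, this term would typically be added to the classification loss with a weighting coefficient, which is why the figure shows it as an auxiliary arrow rather than a separate pipeline.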
Tips
- Show each modality's encoder explicitly. Generic "encoder" boxes do not communicate the architecture.
- Mark frozen encoders with a small lock icon so they are easy to tell apart from trainable modules.
- Use cross-attention rather than concatenation when the task depends on fine-grained interactions between image regions and text tokens.
FAQ
How do I extend to three modalities?
Replicate the encoder + projection branch for the third modality. The fusion module then performs three-way cross-attention or a hierarchical pairwise fusion.
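The hierarchical pairwise option can be sketched as two applications of the same pairwise fusion step. The example below is one possible reading, not a prescribed design: the third modality is assumed to be audio, all inputs are assumed to be already projected to the shared dimension D, and the cross-attention omits learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Single-head cross-attention; learned Q/K/V weights omitted for brevity."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context

def fuse_pair(a, b):
    """Bidirectional cross-attention with residuals; returns the joined sequence."""
    a_fused = a + cross_attend(a, b)
    b_fused = b + cross_attend(b, a)
    return np.concatenate([a_fused, b_fused], axis=0)

rng = np.random.default_rng(0)
D = 64  # hypothetical shared embedding dimension

img = rng.normal(size=(49, D))  # projected image patches
txt = rng.normal(size=(16, D))  # projected text tokens
aud = rng.normal(size=(20, D))  # projected audio frames (assumed third modality)

img_txt = fuse_pair(img, txt)     # stage 1: fuse image and text
all_three = fuse_pair(img_txt, aud)  # stage 2: fuse the result with audio
h_mm = all_three.mean(axis=0)     # pooled joint representation
```

In the diagram, this corresponds to two stacked fusion boxes, with the third branch entering at the second one.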
