Multimodal Fusion Pipeline (Image + Text)

Per-modality encoders, projection to a shared space, a fusion module, and a downstream classifier.

Prompt

A multimodal fusion pipeline for image + text classification, left-to-right.

Top branch — Image
- Input image fed into a frozen vision encoder (CLIP-class ViT) producing a sequence of patch embeddings.
- A small projection MLP maps these to a shared embedding dimension D.

Bottom branch — Text
- Input text fed into a frozen language encoder (BERT-class) producing token embeddings.
- A small projection MLP maps these to the same shared embedding dimension D.

Center — Fusion Module
- Cross-attention block where text tokens attend to image patches and vice versa.
- Output: a joint multimodal representation h_mm.

Right — Classifier Head
- A small MLP on top of h_mm produces class logits.
- Loss: cross-entropy.

Style: flat-design publication schematic, white background, no gradients, navy / teal / amber palette, thin arrows, sans-serif. Suitable for ACL / EMNLP / WACV.

When to use

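Use this for papers on multimodal (image + text) classification, for example hate-speech detection, medical imaging paired with reports, or retrieval framed as image-text matching.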

Variations

Late-fusion variant

Replace the cross-attention fusion with simple concatenation of the image and text embeddings followed by an MLP. Label this block as a "late-fusion" baseline for comparison.
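
If it helps to check the figure against the computation it depicts, the late-fusion baseline can be sketched as below. This is a minimal, dependency-free illustration, not a reference implementation: the mean pooling step, the layer sizes, and all names (`mean_pool`, `late_fusion_logits`, `params`) are assumptions made for the example, and a real model would use a tensor library with trained weights.

```python
import random

def mean_pool(seq):
    """Average a sequence of D-dim vectors into a single D-dim vector."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]

def linear(x, weight, bias):
    """Compute W x + b, with weight stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_linear(rng, n_in, n_out):
    """Random weights and zero bias (hypothetical initialisation for the demo)."""
    weight = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def late_fusion_logits(image_patches, text_tokens, params):
    """Pool each modality, concatenate, then apply a two-layer MLP."""
    img_vec = mean_pool(image_patches)       # shape: D
    txt_vec = mean_pool(text_tokens)         # shape: D
    joint = img_vec + txt_vec                # list concatenation, shape: 2 * D
    hidden = relu(linear(joint, *params["hidden"]))
    return linear(hidden, *params["out"])    # class logits

rng = random.Random(0)
D, HIDDEN, NUM_CLASSES = 8, 16, 3            # assumed sizes
params = {
    "hidden": init_linear(rng, 2 * D, HIDDEN),
    "out": init_linear(rng, HIDDEN, NUM_CLASSES),
}
image_patches = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(4)]
text_tokens = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(6)]
logits = late_fusion_logits(image_patches, text_tokens, params)
```

The two modalities never interact before the MLP, which is what the figure should make visible: two parallel branches meeting at a single concatenation node.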

With contrastive alignment objective

Add a contrastive alignment loss between image and text embeddings before fusion (CLIP-style InfoNCE). Show this as an auxiliary loss arrow alongside the classification loss.
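
For reference, a symmetric InfoNCE loss over a batch of paired embeddings can be sketched as follows. This is a plain-Python sketch under assumptions: one pooled embedding per image and per text, matching pairs at the same batch index, and a temperature of 0.07 chosen only for illustration. The function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair k in one modality should match pair k in the other."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity matrix, rows = images, columns = texts.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    sim_t = [list(col) for col in zip(*sim)]
    image_to_text = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    text_to_image = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

pairs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(pairs, pairs)           # matching pairs on the diagonal
shuffled_loss = info_nce(pairs, pairs[::-1])    # every pair mismatched
```

In the figure, this term would sit on the auxiliary arrow and be added to the cross-entropy loss with some weighting factor, which the prompt leaves unspecified.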

Tips

- Show each modality's encoder explicitly. Generic "encoder" boxes do not communicate the architecture.
- Mark frozen encoders with a small lock icon so readers can tell frozen from trainable components.
- Use cross-attention rather than concatenation when the task depends on fine-grained interactions between modalities, such as words that refer to specific image regions.
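
To make the cross-attention option concrete, here is a minimal sketch of bidirectional cross-attention between the two branches. It is deliberately simplified and rests on assumptions not stated in the prompt: a single head, no learned query/key/value projections, a residual connection, and mean pooling to form h_mm. All function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, context):
    """Each query vector attends over all context vectors (scaled dot-product)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in context])
        mixed = [sum(w * v[d] for w, v in zip(weights, context)) for d in range(len(q))]
        attended.append([a + b for a, b in zip(q, mixed)])  # residual connection
    return attended

def mean_pool(seq):
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]

def fuse(image_patches, text_tokens):
    """Text attends to image and image attends to text; pooled halves form h_mm."""
    text_ctx = cross_attend(text_tokens, image_patches)
    image_ctx = cross_attend(image_patches, text_tokens)
    return mean_pool(text_ctx) + mean_pool(image_ctx)  # concatenated, length 2 * D

image_patches = [[0.2, 0.1, 0.0, 0.5], [0.9, 0.3, 0.4, 0.1], [0.0, 0.7, 0.2, 0.6]]
text_tokens = [[0.1, 0.8, 0.3, 0.0], [0.5, 0.5, 0.1, 0.9]]
h_mm = fuse(image_patches, text_tokens)
```

Unlike concatenation, every text token here receives a mixture of image patches weighted by similarity, and vice versa, which is the interaction the two crossing arrows in the fusion block stand for.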

FAQ

How do I extend to three modalities?

Replicate the encoder + projection branch for the third modality. The fusion module then performs either three-way cross-attention or hierarchical pairwise fusion, in which two modalities are fused first and the result is then fused with the third.
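
The hierarchical pairwise option amounts to a left fold over the modality branches, which the sketch below illustrates. The pairwise step here is a stand-in (sequence concatenation) chosen only to keep the example short; in the pipeline it would be the cross-attention fusion. The `audio` branch and all function names are hypothetical.

```python
from functools import reduce

def concat_sequences(seq_a, seq_b):
    """Stand-in pairwise fusion: join two token sequences that share dimension D."""
    return seq_a + seq_b

def hierarchical_fusion(branches, fuse_pair):
    """Fuse the first two branches, then fuse that result with each later branch."""
    return reduce(fuse_pair, branches)

image = [[0.1, 0.2], [0.3, 0.4]]
text = [[0.5, 0.6]]
audio = [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]   # hypothetical third modality

fused = hierarchical_fusion([image, text, audio], concat_sequences)

# Tracing the fusion order with labels instead of embeddings:
order = hierarchical_fusion(["image", "text", "audio"], lambda a, b: f"({a}+{b})")
```

In the figure, this corresponds to two stacked fusion blocks: the first takes image and text, the second takes that output and the third modality.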