Pipeline & Workflow

Multimodal Fusion Pipeline (Image + Text)

Per-modality encoders, projection to a shared space, fusion module and a downstream classifier.

Prompt

A multimodal fusion pipeline for image + text classification, left-to-right.

Top branch: Image
- Input image fed into a frozen vision encoder (CLIP-class ViT) producing a sequence of patch embeddings.
- A small projection MLP maps these to a shared embedding dimension D.

Bottom branch: Text
- Input text fed into a frozen language encoder (BERT-class) producing token embeddings.
- A small projection MLP maps these to the same shared embedding dimension D.

Center: Fusion Module
- Cross-attention block where text tokens attend to image patches and vice versa.
- Output: a joint multimodal representation h_mm.

Right: Classifier Head
- A small MLP on top of h_mm produces class logits.
- Loss: cross-entropy.

Style: flat-design publication schematic, white background, no gradients, navy / teal / amber palette, thin arrows, sans-serif. Suitable for ACL / EMNLP / WACV.
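
For reference when checking the generated figure against a model, the architecture this prompt describes can be sketched in PyTorch roughly as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the class name `CrossAttentionFusion`, the dimensions, the mean pooling, and the concatenation used to form h_mm are all illustrative choices, and random tensors stand in for the outputs of the frozen encoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionFusion(nn.Module):
    """Project both modalities to dimension D, fuse with bidirectional
    cross-attention, and classify. All sizes are illustrative assumptions."""

    def __init__(self, img_dim=768, txt_dim=768, d=256, heads=4, num_classes=2):
        super().__init__()
        # Small projection MLPs into the shared embedding dimension D.
        self.img_proj = nn.Sequential(nn.Linear(img_dim, d), nn.GELU(), nn.Linear(d, d))
        self.txt_proj = nn.Sequential(nn.Linear(txt_dim, d), nn.GELU(), nn.Linear(d, d))
        # One attention block per direction: text -> image and image -> text.
        self.txt_to_img = nn.MultiheadAttention(d, heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(d, heads, batch_first=True)
        # Classifier head on the joint representation h_mm.
        self.classifier = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, num_classes))

    def forward(self, patch_emb, token_emb):
        img = self.img_proj(patch_emb)   # (B, P, D)
        txt = self.txt_proj(token_emb)   # (B, T, D)
        # Text tokens attend to image patches, and vice versa.
        txt_ctx, _ = self.txt_to_img(query=txt, key=img, value=img)  # (B, T, D)
        img_ctx, _ = self.img_to_txt(query=img, key=txt, value=txt)  # (B, P, D)
        # Assumption: mean-pool each stream and concatenate to form h_mm.
        h_mm = torch.cat([txt_ctx.mean(dim=1), img_ctx.mean(dim=1)], dim=-1)  # (B, 2D)
        return self.classifier(h_mm)


torch.manual_seed(0)
# Stand-ins for frozen encoder outputs (no pretrained weights are loaded here).
patch_emb = torch.randn(8, 50, 768)   # e.g. ViT patch embeddings
token_emb = torch.randn(8, 16, 768)   # e.g. BERT token embeddings
labels = torch.randint(0, 2, (8,))

model = CrossAttentionFusion()
logits = model(patch_emb, token_emb)
loss = F.cross_entropy(logits, labels)
```

In a real model the encoders would be loaded from pretrained checkpoints with their parameters frozen, and padding masks would be passed to the attention blocks; both are omitted here to keep the sketch short.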

When to use

For multimodal classification papers (hate speech, medical, retrieval, etc.).

Variations

Late-fusion variant

Replace the cross-attention fusion with simple concatenation of the image and text embeddings followed by an MLP. Label this variant as a "late-fusion" baseline for comparison.
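
A minimal PyTorch sketch of such a baseline, assuming mean-pooled embeddings and illustrative dimensions (the class name `LateFusionBaseline` and all sizes are hypothetical):

```python
import torch
import torch.nn as nn


class LateFusionBaseline(nn.Module):
    """Concatenate pooled image and text embeddings, then classify with an MLP."""

    def __init__(self, img_dim=768, txt_dim=768, d=256, num_classes=2):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d)
        self.txt_proj = nn.Linear(txt_dim, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, num_classes))

    def forward(self, patch_emb, token_emb):
        # Assumption: mean-pool over patches / tokens; the modalities do not
        # interact until the concatenation below.
        img = self.img_proj(patch_emb).mean(dim=1)   # (B, D)
        txt = self.txt_proj(token_emb).mean(dim=1)   # (B, D)
        return self.mlp(torch.cat([img, txt], dim=-1))


torch.manual_seed(0)
baseline = LateFusionBaseline()
baseline_logits = baseline(torch.randn(4, 50, 768), torch.randn(4, 16, 768))
```

Because the two branches only meet at the concatenation, the figure for this variant needs no arrows between the image and text streams before the MLP.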

With contrastive alignment objective

Add a contrastive alignment loss between image and text embeddings before fusion (CLIP-style InfoNCE). Show this as an auxiliary loss arrow alongside the classification loss.
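
The auxiliary loss could look roughly like the symmetric InfoNCE below. It is a sketch that assumes one pooled embedding per image and per text, with matching pairs at the same batch index; the function name `info_nce`, the temperature of 0.07, and the weighting factor mentioned afterwards are illustrative.

```python
import torch
import torch.nn.functional as F


def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th image and i-th text form the positive pair."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


torch.manual_seed(0)
img_emb = torch.randn(8, 256)
aligned_loss = info_nce(img_emb, img_emb.clone())       # perfectly aligned pairs
mismatched_loss = info_nce(img_emb, torch.randn(8, 256))  # unrelated pairs
```

The total training objective would then be the classification loss plus a weighted alignment term, for example `cls_loss + 0.1 * info_nce(img_pooled, txt_pooled)`, where the weight 0.1 is a hypothetical value to be tuned.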

Tips

  • Show each modality's encoder explicitly. Generic "encoder" boxes do not communicate the architecture.
  • Mark which encoders are frozen vs trainable with a small lock icon.
  • Use cross-attention rather than concatenation when the task depends on fine-grained interactions between modalities, such as text that refers to specific image regions.

FAQ

How do I extend to three modalities?

Replicate the encoder + projection branch for the third modality. The fusion module then performs three-way cross-attention or a hierarchical pairwise fusion.
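
As one possible reading of the hierarchical pairwise option, the sketch below fuses two modalities first and then lets the fused sequence attend to the third. The class name `HierarchicalFusion`, the fusion order, and the mean pooling are assumptions; all inputs are taken to be already projected to the shared dimension D.

```python
import torch
import torch.nn as nn


class HierarchicalFusion(nn.Module):
    """Two-stage pairwise fusion: (a with b) first, then the result with c."""

    def __init__(self, d=256, heads=4):
        super().__init__()
        self.pair_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.third_attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, a, b, c):
        ab, _ = self.pair_attn(query=a, key=b, value=b)      # stage 1: a attends to b
        abc, _ = self.third_attn(query=ab, key=c, value=c)   # stage 2: fused stream attends to c
        return abc.mean(dim=1)                               # pooled joint representation (B, D)


torch.manual_seed(0)
fusion = HierarchicalFusion()
h_mm3 = fusion(torch.randn(4, 16, 256), torch.randn(4, 50, 256), torch.randn(4, 30, 256))
```

In the figure, this corresponds to two stacked fusion boxes, with the third modality's branch joining at the second box.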