ML Architecture

Retrieval-Augmented Generation (RAG) Pipeline

Query embedding, vector retrieval, prompt augmentation and LLM response generation.

Prompt

A Retrieval-Augmented Generation (RAG) pipeline, left-to-right horizontal layout.

Stage 1 - User Query:
- A short text query enters the system.

Stage 2 - Query Embedding:
- The query is encoded by a sentence-embedding model into a dense vector q.

Stage 3 - Vector Retrieval:
- q is matched against a vector database (drawn as a stack of vectors with a "Vector Store" label).
- Top-k nearest neighbors (k=4) are retrieved as context chunks.

Stage 4 - Prompt Construction:
- The original query and the retrieved chunks are concatenated into an augmented prompt template.

Stage 5 - LLM Generation:
- The augmented prompt is fed to an LLM (e.g., a GPT-class model), which produces the final grounded response.

Outside the main flow, on top: an offline indexing pipeline showing documents -> chunker -> embedder -> vector store. Connect the vector store to Stage 3 with a dashed arrow.
Style: clean academic vector, navy and amber palette, white background, sans-serif labels.
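
To check the figure against something concrete, the stages the prompt describes can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, the in-memory list stands in for the vector store, and all function names are hypothetical. Stage 5 (the LLM call) is left out because it depends on the model provider.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence-embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Offline indexing branch: embed every chunk once and store the pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, index, k: int = 4):
    """Stages 2 and 3: embed the query, then return the top-k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, chunks) -> str:
    """Stage 4: concatenate the retrieved chunks and the query into one prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Illustrative data only; a real index would hold chunked documents.
example_chunks = [
    "vector retrieval finds nearest neighbors",
    "bananas are yellow",
    "the chunker splits documents",
    "an embedder maps text to a vector",
    "amber and navy palette",
]
example_index = build_index(example_chunks)
example_prompt = build_prompt(
    "vector retrieval", retrieve("vector retrieval", example_index)
)
```

The string in `example_prompt` is what Stage 5 would send to the LLM, which is also the text worth annotating in the figure.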
When to use

For RAG / question-answering / knowledge-grounded generation papers and engineering blog posts.

Variations

With re-ranker stage

Insert a re-ranking stage between vector retrieval and prompt construction. The re-ranker (a cross-encoder) scores each retrieved chunk against the query and reorders them, keeping the top-k'.
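
The re-ranking step described above might look roughly like the sketch below. The scoring function here is a hypothetical token-overlap heuristic standing in for a real cross-encoder, and `rerank`, `overlap_score`, and `top_k_prime` are illustrative names rather than part of any library.

```python
def overlap_score(query: str, chunk: str) -> float:
    """Hypothetical stand-in for a cross-encoder: fraction of query tokens in the chunk."""
    q_tokens = set(query.lower().split())
    c_tokens = set(chunk.lower().split())
    return len(q_tokens & c_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(query: str, chunks, top_k_prime: int = 2, score_fn=overlap_score):
    """Score each retrieved chunk against the query and keep the best top_k_prime."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    # Python's sort is stable, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k_prime]]
```

In a real system `score_fn` would wrap a cross-encoder model that reads the query and the chunk together. The point for the figure is that the re-ranker takes k chunks in and passes k' chunks on.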

Hybrid (sparse + dense) retrieval

Replace the single vector retrieval with two parallel retrievers: BM25 sparse retrieval and dense embedding retrieval. Their results are merged via reciprocal rank fusion before prompt construction.
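
As a rough sketch of the merge step, reciprocal rank fusion gives each document a score of 1 / (c + rank) in every ranked list where it appears, and sums those scores across lists. The function name and the document IDs below are illustrative; the constant 60 is a commonly used default, not something this prompt prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, rrf_constant: int = 60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_constant + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Illustrative rankings from the two parallel retrievers.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# fused == ["b", "c", "a", "d"]
```

Documents that both retrievers return ("b" and "c" here) rise above documents found by only one of them, which is the behaviour the merge node in the figure stands for.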

Tips

  • Always include the offline indexing branch; without it, readers can't see how the vector store was built.
  • Use k=4 or k=5 in the figure. Larger k crowds the layout; smaller k makes the example look like a toy.
  • Annotate the prompt template inline if space allows; it shows readers what the LLM actually sees.

FAQ

Can I show citation generation in the output?

Yes. Add the following sentence to the prompt: "The LLM output includes inline citation markers [1], [2] referring back to retrieved chunks."
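
For readers implementing the citation behaviour rather than just drawing it, one possible approach is to number the chunks in the prompt and then map the markers in the answer back to them. This is a minimal sketch under that assumption; `number_chunks` and `cited_chunks` are hypothetical helper names, and the answer string is made up for illustration.

```python
import re


def number_chunks(chunks) -> str:
    """Prefix each retrieved chunk with a [n] marker the LLM can cite."""
    return "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))


def cited_chunks(answer: str, chunks) -> dict:
    """Map each valid [n] marker found in the answer back to its chunk."""
    ids = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    # Ignore markers that do not correspond to a retrieved chunk.
    return {i: chunks[i - 1] for i in ids if 1 <= i <= len(chunks)}


example_chunks = ["RAG retrieves context.", "BM25 is sparse."]
example_answer = "RAG grounds answers [1]. Sparse methods exist [2][7]."
citations = cited_chunks(example_answer, example_chunks)
```

Here `citations` holds entries for markers 1 and 2 only; the out-of-range marker [7] is dropped rather than raising an error.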