Retrieval Augmented Generation (RAG)

Case: NYC Taxi Trips

NYC TLC Trip Record Data — 50K trips, Jan 2024

tpep_pickup_datetime  RatecodeID  payment_type
2024-01-15 08:23:11   1.0         1.0
2024-01-15 08:31:44   2.0         1.0
2024-01-15 09:02:58   5.0         2.0
2024-01-15 09:17:03   1.0         2.0
Column        Value  Meaning
RatecodeID    1      Standard rate
              2      JFK flat rate ($70)
              3      Newark airport
              5      Negotiated fare
payment_type  1      Credit card
              2      Cash — tip not recorded

LLM sees column names and numeric values. The codebook is external.

The problem

Option 1: Put the entire codebook in the system prompt

❌ 20+ paragraphs, every query pays the token cost, hard to keep updated

. . .

Option 2: Find the relevant part and inject it per query

✅ Only the chunks that match this question — retrieval cost < 1ms

. . .

How do you find the right chunks automatically? → RAG

RAG applies whenever…

The LLM needs domain knowledge that's too large for the system prompt:

  • 🚕 Codebook / data dictionary — numeric codes that need human-readable labels (our demo): a codebook too large to fit in the system prompt
  • 🎓 University rulebook, 200-page academic calendar — program requirements, major electives, accommodation procedures, concession rules, course prerequisites. A student asks "Can I take DSCI 532 as a CS elective?" — the LLM needs the current program rules, not its training data
  • 🔒 Privacy + a local model — patient records or HR data you can't send to a cloud API. Run a local Llama / Mistral with a local knowledge base — RAG works the same way, fully offline

RAG pipeline

Without RAG

flowchart TD
    Q(["❓ JFK trips?"]) --> LLM[LLM]
    LLM --> Ans(["RatecodeID = 99 ❌"])

With RAG

flowchart LR
    KB[("Knowledge Base<br/>(built once)")] --> R
    Q([❓ question]) --> R["Retrieve<br/>top-k chunks"]
    R --> A["Augmented message =<br/>❓ question<br/>+ context from KB"]
    A --> LLM[LLM]
    LLM --> Ans([Answer ✓])

The knowledge base is built once — but can be rapidly updated (add/edit text files, re-index).

. . .

Key questions:

  • 📦 Knowledge base: what format? how do you build it?
  • 🔍 Retrieve: how does it find the right chunks? TF-IDF? embeddings?
  • 💬 Augmented message: what does it actually look like?
  • 🤔 When to use RAG at all? vs. system prompt, vs. tool

📦 Knowledge base

Can be just plain text chunks:

RatecodeID defines the fare rate type applied for the trip:
1 = Standard rate (most trips)
2 = JFK airport flat rate ($70 + tolls)
3 = Newark airport
4 = Nassau or Westchester
5 = Negotiated fare (pre-arranged price, not metered)
6 = Group ride

payment_type codes (how the passenger paid):
1 = Credit card  2 = Cash  3 = No charge
4 = Dispute      5 = Unknown  6 = Voided trip

One topic per paragraph. No special format needed — plain .txt is enough to start.

. . .

When I want to scale storage beyond flat files:

  • When I have 1K+ docs and need fast similarity search → ChromaDB: local, no server, pip install chromadb
  • When I need a production-grade vector DB → Qdrant or Pinecone: hosted, scalable, filter by metadata
  • When my data is already in MongoDB → MongoDB Atlas Vector Search: no separate DB needed

πŸ” Retrieve

TF-IDF (Term Frequency–Inverse Document Frequency):

  • Represents each chunk as a sparse vector: one dimension per vocabulary word
  • Non-zero only where the word appears; weighted by how rare it is across all chunks
  • Query is vectorized the same way → cosine similarity ranks chunks by overlap
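To make the sparse vectors concrete, here is a toy sketch (made-up three-chunk corpus, not the real glossary): a rare word like "jfk" gets a non-zero weight only in the chunk where it appears, which is exactly what lets the query find it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy chunks for illustration only
chunks = [
    "RatecodeID 2 means JFK airport flat rate",
    "payment_type 2 means cash payment",
    "RatecodeID 1 means standard rate",
]
vec = TfidfVectorizer()
X = vec.fit_transform(chunks)  # one row per chunk, one column per vocabulary word

# "jfk" appears in only the first chunk, so its column is non-zero only in row 0
col = X[:, vec.vocabulary_["jfk"]].toarray().ravel()
print(col)
```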

. . .

Good enough for structured glossaries and codebooks where exact keywords appear in queries.

No API key. No model download. Runs in < 1ms.

πŸ” Retrieve: how it works

flowchart LR
    Q([User query]) --> EQ["Vectorize query<br/>(same method as KB)"]
    DOCS[("KB vectors<br/>(pre-built)")] --> COS
    EQ --> COS["Cosine similarity<br/>query ↔ each chunk"]
    COS --> RANK[Rank by score]
    RANK --> THR{"score ≥<br/>threshold?"}
    THR -->|"yes → top-k"| CTX[Selected chunks]
    THR -->|no match| SKIP[No injection]
    CTX --> AUG[Inject into<br/>user prompt]

πŸ” Retrieve: plain text + word frequency vector (TF-IDF)

from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Build index once
chunks = Path("taxi_glossary.txt").read_text().split("\n\n")
vectorizer = TfidfVectorizer()
kb_vectors = vectorizer.fit_transform(chunks)

# Retrieve on every query
def retrieve(query: str, top_k: int = 3) -> list[str]:
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, kb_vectors).flatten()
    top_idx = np.argsort(scores)[::-1][:top_k]
    # score > 0 acts as the threshold: drop chunks with zero keyword overlap
    return [chunks[i] for i in top_idx if scores[i] > 0]

No framework. No API key. ~10 lines.
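The augmented message itself is just string assembly around retrieval. A minimal end-to-end sketch with a toy two-chunk KB standing in for taxi_glossary.txt (the template wording here is a choice, not a fixed format):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Toy knowledge base: stand-in for the glossary chunks
chunks = [
    "RatecodeID: 1 = Standard rate, 2 = JFK airport flat rate ($70 + tolls)",
    "payment_type: 1 = Credit card, 2 = Cash (tip not recorded)",
]
vectorizer = TfidfVectorizer()
kb_vectors = vectorizer.fit_transform(chunks)

def retrieve(query: str, top_k: int = 3) -> list[str]:
    scores = cosine_similarity(vectorizer.transform([query]), kb_vectors).flatten()
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in top_idx if scores[i] > 0]

def augment(question: str) -> str:
    context = retrieve(question)
    if not context:
        return question  # retrieval miss → no injection, send the question as-is
    return "Context:\n" + "\n\n".join(context) + f"\n\nQuestion: {question}"

print(augment("How are JFK trips coded?"))
```

The augmented string replaces the raw user message before the LLM call; the system prompt stays small and static.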

πŸ” Retrieve: TF-IDF vs. dense embeddings

TF-IDF represents text as word counts. Dense embeddings represent text as meaning: a neural network maps each chunk to a point in high-dimensional space where semantically similar text lands nearby.

                    TF-IDF                  Dense embeddings
Vector type         Sparse (word counts)    Dense (all dims non-zero)
Matches             Exact keyword overlap   Semantic meaning
"JFK" ↔ "airport"   ❌ different words       ✅ similar meaning
Setup               sklearn, instant        model download or API
Speed               < 1ms                   5–50ms
Best for            Structured glossaries   Natural language docs

. . .

Start with TF-IDF. Switch to embeddings when vocabulary mismatch causes misses.

For codebooks, exact match is actually the point: RatecodeID should only match chunks about RatecodeID, not semantically similar fare-related paragraphs.
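The vocabulary-mismatch failure mode is easy to reproduce: a query that shares no words with a chunk scores exactly zero, however related it is (toy example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = ["RatecodeID 2 = JFK flat rate ($70 + tolls)"]
vec = TfidfVectorizer()
kb = vec.fit_transform(chunks)

# "airport" never appears in the chunk, so TF-IDF sees zero overlap
score = cosine_similarity(vec.transform(["airport trips"]), kb)[0, 0]
print(score)  # 0.0, even though the chunk is clearly about an airport
```

Dense embeddings would score this pair highly because "JFK" and "airport" are close in embedding space.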

πŸ” Retrieve: scale up with dense embeddings

When exact keyword overlap isn’t enough, switch to dense embeddings:

  • When I want free local embeddings, no API key, no internet → sentence-transformers: download once, run forever

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")  # ~90MB, 384-dim
    kb_vectors = model.encode(chunks)
  • When I want a free API with no model download → Jina AI: 1M tokens free, no credit card, 1024-dim vectors

  • When I need production-quality embeddings → OpenAI text-embedding-3-small or Cohere Embed

  • When I want retrieval + storage + indexing in one framework → llama_index: loads docs, builds vector index, persists to disk

    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
    index = VectorStoreIndex.from_documents(SimpleDirectoryReader("data").load_data())
    index.storage_context.persist(persist_dir="./storage")

RAG is a spectrum

Level             Approach                                When
0: No RAG         All context in system prompt            < 10 short paragraphs
1: Simple RAG     TF-IDF + flat files                     Structured glossary, < 1ms
2: Embedding RAG  Sentence-transformers + vector DB       1K+ docs, semantic search
3: Agentic RAG    LLM decides when and what to retrieve   Multi-step reasoning

. . .

Start at 0. Move up only when you hit the wall at the current level.

When to use RAG

✅ Use RAG when:

  • Domain has codes, acronyms, or jargon the LLM doesn't know
  • Knowledge base changes — update the KB, not the model
  • Large glossary / rulebook — inject only the relevant slice per query

. . .

❌ Skip RAG when:

  • LLM already knows the domain (standard Python, SQL, common APIs)
  • Tiny KB (< 10 paragraphs) — just put it all in the system prompt
  • Guaranteed accuracy required — retrieval can fail silently

. . .

⚠️ Failure modes: retrieval miss · hallucination despite context · stale KB · context overflow

Where to go next

Connect RAG to your Shiny app:

  • Hook into @reactive.event(input.querychat_user_input): retrieve context, prepend to the user message before calling the model
  • Register retrieve() as a tool: LLM decides when to call it (→ tools lecture)

→ Walkthrough and Demo with QueryChat

. . .

Scale up when ready:

Storage:

  • When flat files stop scaling → ChromaDB, Qdrant / Pinecone, or MongoDB Atlas Vector Search (see the knowledge base options above)

Retrieval:

  • When TF-IDF misses on vocabulary → sentence-transformers or an embedding API (see dense embeddings above)

Eval:

  • When you want to measure retrieval quality → RAGAS