flowchart TD
Q(["❓ JFK trips?"]) --> LLM[LLM]
LLM --> Ans(["RatecodeID = 99 ❌"])
Retrieval Augmented Generation (RAG)
Case: NYC Taxi Trips
NYC TLC Trip Record Data – 50K trips, Jan 2024
| tpep_pickup_datetime | RatecodeID | payment_type |
|---|---|---|
| 2024-01-15 08:23:11 | 1.0 | 1.0 |
| 2024-01-15 08:31:44 | 2.0 | 1.0 |
| 2024-01-15 09:02:58 | 5.0 | 2.0 |
| 2024-01-15 09:17:03 | 1.0 | 2.0 |
| Column | Value | Meaning |
|---|---|---|
| RatecodeID | 1 | Standard rate |
| | 2 | JFK flat rate ($70) |
| | 3 | Newark airport |
| | 5 | Negotiated fare |
| payment_type | 1 | Credit card |
| | 2 | Cash → tip not recorded |
LLM sees column names and numeric values. The codebook is external.
The problem
Option 1: Put the entire codebook in the system prompt
→ 20+ paragraphs, every query pays the token cost, hard to keep updated
. . .
Option 2: Find the relevant part and inject it per query
→ Only the chunks that match this question – retrieval cost < 1ms
. . .
How do you find the right chunks automatically? → RAG
RAG applies whenever…
The LLM needs domain knowledge that's too large for the system prompt:
- 📖 Codebook / data dictionary – numeric codes that need human-readable labels (our demo): a codebook too large to keep in the system prompt
- 🎓 University rulebook, 200-page calendar: program requirements, major electives, accommodation procedures, concession rules, course prerequisites. Student asks "Can I take DSCI 532 as a CS elective?" → the LLM needs current program rules, not its training data
- 🔒 Privacy + local model: patient records or HR data you can't send to a cloud API. Run a local Llama / Mistral with a local knowledge base → RAG works the same way, fully offline
RAG pipeline
Without RAG
With RAG
flowchart LR
KB[("Knowledge Base<br/>(built once)")] --> R
Q(["❓ question"]) --> R["Retrieve<br/>top-k chunks"]
R --> A["Augmented message =<br/>❓ question<br/>+ context from KB"]
A --> LLM[LLM]
LLM --> Ans(["Answer ✅"])
The knowledge base is built once – but can be rapidly updated (add/edit text files, re-index).
. . .
Key questions:
- 📦 Knowledge base: what format? how do you build it?
- 🔍 Retrieve: how does it find the right chunks? TF-IDF? embeddings?
- 💬 Augmented message: what does it actually look like?
- 🤔 When to use RAG at all? vs. system prompt, vs. tool
📦 Knowledge base
Can be just plain text chunks:
RatecodeID defines the fare rate type applied for the trip:
1 = Standard rate (most trips)
2 = JFK airport flat rate ($70 + tolls)
3 = Newark airport
4 = Nassau or Westchester
5 = Negotiated fare (pre-arranged price, not metered)
6 = Group ride
payment_type codes (how the passenger paid):
1 = Credit card 2 = Cash 3 = No charge
4 = Dispute 5 = Unknown 6 = Voided trip
One topic per paragraph. No special format needed – plain .txt is enough to start.
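"One topic per paragraph" maps directly onto how the index is built: split the file on blank lines so each paragraph becomes one retrievable chunk. A minimal sketch, using an illustrative inline glossary (the demo reads the same kind of text from a .txt file instead):

```python
# Each blank-line-separated paragraph becomes one retrievable chunk.
glossary = """RatecodeID defines the fare rate type:
1 = Standard rate
2 = JFK airport flat rate

payment_type codes:
1 = Credit card
2 = Cash"""

chunks = [c.strip() for c in glossary.split("\n\n") if c.strip()]
print(len(chunks))  # 2 chunks, one per topic
```

Splitting on blank lines keeps each code table intact, so a retrieval hit brings back the whole topic, not a fragment of it.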
. . .
When I want to scale storage beyond flat files:
- When I have 1K+ docs and need fast similarity search → `ChromaDB`: local, no server, `pip install chromadb`
- When I need a production-grade vector DB → `Qdrant` or `Pinecone`: hosted, scalable, filter by metadata
- When my data is already in MongoDB → MongoDB Atlas Vector Search: no separate DB needed
🔍 Retrieve
TF-IDF (Term FrequencyβInverse Document Frequency):
- Represents each chunk as a sparse vector: one dimension per vocabulary word
- Non-zero only where the word appears; weighted by how rare it is across all chunks
- Query is vectorized the same way → cosine similarity ranks chunks by overlap
. . .
Good enough for structured glossaries and codebooks where exact keywords appear in queries.
No API key. No model download. Runs in < 1ms.
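The weighting claim above is easy to verify: with scikit-learn's `TfidfVectorizer`, a word that appears in every chunk gets a low IDF weight, while a word unique to one chunk gets a high one. A toy sketch (the three mini-chunks are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

chunks = [
    "rate standard trip",
    "rate jfk flat",      # "jfk" appears in only this chunk
    "rate negotiated",
]
vec = TfidfVectorizer()
X = vec.fit_transform(chunks)

vocab = vec.vocabulary_
row = X[1].toarray().flatten()
# "rate" appears in every chunk -> low IDF weight;
# "jfk" appears in one chunk   -> high IDF weight.
print(row[vocab["jfk"]] > row[vocab["rate"]])  # True
```

This is why a query containing a distinctive token like "RatecodeID" snaps to the right chunk: the rare word dominates the cosine score.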
🔍 Retrieve: how it works
flowchart LR
Q([User query]) --> EQ["Vectorize query<br/>(same method as KB)"]
DOCS[("KB vectors<br/>(pre-built)")] --> COS
EQ --> COS["Cosine similarity<br/>query ↔ each chunk"]
COS --> RANK[Rank by score]
RANK --> THR{"score ≥<br/>threshold?"}
THR -->|"yes → top-k"| CTX[Selected chunks]
THR -->|no match| SKIP[No injection]
CTX --> AUG[Inject into<br/>user prompt]
🔍 Retrieve: plain text + word frequency vector (TF-IDF)
```python
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Build index once
chunks = Path("taxi_glossary.txt").read_text().split("\n\n")
vectorizer = TfidfVectorizer()
kb_vectors = vectorizer.fit_transform(chunks)

# Retrieve on every query
def retrieve(query: str, top_k: int = 3) -> list[str]:
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, kb_vectors).flatten()
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in top_idx if scores[i] > 0]
```

No framework. No API key. ~15 lines.
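One open question from the pipeline diagram is what the augmented message actually looks like. A minimal sketch of assembling it, assuming a tiny inline glossary and an illustrative `CONTEXT:`/`QUESTION:` framing (the demo reads `taxi_glossary.txt` instead):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two inline chunks standing in for taxi_glossary.txt
chunks = [
    "RatecodeID 2 = JFK airport flat rate ($70 + tolls).",
    "payment_type 2 = Cash; tip not recorded.",
]
vectorizer = TfidfVectorizer()
kb_vectors = vectorizer.fit_transform(chunks)

def augment(question: str, top_k: int = 1) -> str:
    """Prepend the best-matching chunks to the user's question."""
    scores = cosine_similarity(
        vectorizer.transform([question]), kb_vectors
    ).flatten()
    top = [chunks[i] for i in scores.argsort()[::-1][:top_k] if scores[i] > 0]
    return "CONTEXT:\n" + "\n".join(top) + f"\n\nQUESTION: {question}"

print(augment("How many trips used RatecodeID 2?"))
```

The returned string is what gets sent as the user message; the system prompt stays small and static.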
🔍 Retrieve: TF-IDF vs. dense embeddings
TF-IDF represents text as word counts. Dense embeddings represent text as meaning: a neural network maps each chunk to a point in high-dimensional space where semantically similar text lands nearby.
| | TF-IDF | Dense embeddings |
|---|---|---|
| Vector type | Sparse (word counts) | Dense (all dims non-zero) |
| Matches | Exact keyword overlap | Semantic meaning |
| “JFK” ↔️ “airport” | ❌ different words | ✅ similar meaning |
| Setup | sklearn, instant | model download or API |
| Speed | < 1ms | 5–50ms |
| Best for | Structured glossaries | Natural language docs |
. . .
Start with TF-IDF. Switch to embeddings when vocabulary mismatch causes misses.
For codebooks, exact match is actually the point: RatecodeID should only match chunks about RatecodeID, not semantically similar fare-related paragraphs.
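The vocabulary-mismatch failure mode is easy to demonstrate: a query that says "airport" scores exactly zero against a chunk that only says "JFK". A toy sketch with two made-up chunks:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "RatecodeID 2 = JFK flat rate",        # never mentions "airport"
    "RatecodeID 3 = Newark airport trips",
]
vec = TfidfVectorizer()
kb = vec.fit_transform(chunks)

scores = cosine_similarity(vec.transform(["airport pickups"]), kb).flatten()
# Chunk 1 shares no tokens with the query -> similarity exactly 0.0,
# even though JFK is an airport: a miss that dense embeddings would catch.
```

When misses like this start showing up in real queries, that is the signal to move up a level.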
🔍 Retrieve: scale up with dense embeddings
When exact keyword overlap isn't enough, switch to dense embeddings:
- When I want free local embeddings, no API key, no internet → `sentence-transformers`: download once, run forever

  ```python
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")  # ~90MB, 384-dim
  kb_vectors = model.encode(chunks)
  ```

- When I want a free API with no model download → Jina AI: 1M tokens free, no credit card, 1024-dim vectors
- When I need production-quality embeddings → OpenAI `text-embedding-3-small` or Cohere Embed
- When I want retrieval + storage + indexing in one framework → `llama_index`: loads docs, builds a vector index, persists to disk

  ```python
  from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

  index = VectorStoreIndex.from_documents(
      SimpleDirectoryReader("data").load_data()
  )
  index.storage_context.persist(persist_dir="./storage")
  ```
RAG is a spectrum
| Level | Approach | When |
|---|---|---|
| 0: No RAG | All context in system prompt | < 10 short paragraphs |
| 1: Simple RAG | TF-IDF + flat files | Structured glossary, < 1ms |
| 2: Embedding RAG | Sentence-transformers + vector DB | 1K+ docs, semantic search |
| 3: Agentic RAG | LLM decides when and what to retrieve | Multi-step reasoning |
. . .
Start at 0. Move up only when you hit the wall at the current level.
When to use RAG
✅ Use RAG when:
- Domain has codes, acronyms, or jargon the LLM doesn't know
- Knowledge base changes → update the KB, not the model
- Large glossary / rulebook → inject only the relevant slice per query
. . .
❌ Skip RAG when:
- LLM already knows the domain (standard Python, SQL, common APIs)
- Tiny KB (< 10 paragraphs) → just put it all in the system prompt
- Guaranteed accuracy required – retrieval can fail silently
. . .
⚠️ Failure modes: retrieval miss · hallucination despite context · stale KB · context overflow
Where to go next
Connect RAG to your Shiny app:
- Hook into `@reactive.event(input.querychat_user_input)`: retrieve context, prepend to the user message before calling the model
- Register `retrieve()` as a tool: the LLM decides when to call it (→ tools lecture)
→ Walkthrough and Demo with QueryChat
. . .
Scale up when ready:
Storage:
- When KB grows beyond flat files → `ChromaDB`: local, no server, `pip install chromadb`
- When you need hosted + metadata filtering → `Qdrant` or `Pinecone`
- When data is already in MongoDB → MongoDB Atlas Vector Search
Retrieval:
- When TF-IDF misses semantic matches → `sentence-transformers`: local, ~90MB, no API key
- When you want a free embedding API → Jina AI: 1M tokens, no credit card
- When you need production-quality embeddings → OpenAI `text-embedding-3-small` or Cohere Embed
Eval:
- When you want to measure retrieval quality β RAGAS