NYC TLC Trip Record Data — 50K trips, Jan 2024

| tpep_pickup_datetime | RatecodeID | payment_type |
|---|---|---|
| 2024-01-15 08:23:11 | 1.0 | 1.0 |
| 2024-01-15 08:31:44 | 2.0 | 1.0 |
| 2024-01-15 09:02:58 | 5.0 | 2.0 |
| 2024-01-15 09:17:03 | 1.0 | 2.0 |
| Column | Value | Meaning |
|---|---|---|
| RatecodeID | 1 | Standard rate |
| | 2 | JFK flat rate ($70) |
| | 3 | Newark airport |
| | 5 | Negotiated fare |
| payment_type | 1 | Credit card |
| | 2 | Cash — tip not recorded |
The LLM sees only column names and numeric values. The codebook is external.
Option 1: Put the entire codebook in the system prompt
❌ 20+ paragraphs, every query pays the token cost, hard to keep updated
Option 2: Find the relevant part and inject it per query
✅ Only the chunks that match this question — retrieval cost < 1ms
How do you find the right chunks automatically? → RAG
The LLM needs domain knowledge that’s too large for the system prompt:
Without RAG

```mermaid
flowchart TD
    Q(["❓ JFK trips?"]) --> LLM[LLM]
    LLM --> Ans(["RatecodeID = 99 ❌"])
```

With RAG

```mermaid
flowchart LR
    KB[("Knowledge Base<br/>(built once)")] --> R
    Q([❓ question]) --> R["Retrieve<br/>top-k chunks"]
    R --> A["Augmented message =<br/>❓ question<br/>+ context from KB"]
    A --> LLM[LLM]
    LLM --> Ans([Answer ✓])
```
The knowledge base is built once — but can be rapidly updated (add/edit text files, re-index).
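The "augmented message" step is plain string assembly. A minimal sketch (the prompt wording and the `build_augmented_message` name are illustrative, not from any library):

```python
def build_augmented_message(question: str, context_chunks: list[str]) -> str:
    """Prepend retrieved KB chunks to the user's question."""
    if not context_chunks:
        # Nothing relevant retrieved: fall back to the plain question
        return question
    context = "\n\n".join(context_chunks)
    return (
        "Use the following reference material when answering.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

msg = build_augmented_message(
    "How many JFK trips?",
    ["RatecodeID 2 = JFK airport flat rate ($70 + tolls)"],
)
```

The empty-context branch matters: when retrieval finds nothing, the model gets the bare question instead of irrelevant context.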
Can be just plain text chunks (e.g. `taxi_glossary.txt`):

```text
RatecodeID defines the fare rate type applied for the trip:
1 = Standard rate (most trips)
2 = JFK airport flat rate ($70 + tolls)
3 = Newark airport
4 = Nassau or Westchester
5 = Negotiated fare (pre-arranged price, not metered)
6 = Group ride

payment_type codes (how the passenger paid):
1 = Credit card   2 = Cash   3 = No charge
4 = Dispute   5 = Unknown   6 = Voided trip
```
One topic per paragraph. No special format needed — plain .txt is enough to start.
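With one topic per paragraph, chunking is just splitting on blank lines. A sketch, with an inline string standing in for the glossary file:

```python
glossary = """RatecodeID defines the fare rate type applied for the trip:
1 = Standard rate (most trips)
2 = JFK airport flat rate ($70 + tolls)

payment_type codes (how the passenger paid):
1 = Credit card  2 = Cash"""

# One blank-line-separated paragraph = one chunk = one topic
chunks = [c.strip() for c in glossary.split("\n\n") if c.strip()]
print(len(chunks))  # → 2
```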
When I want to scale storage beyond flat files:

- ChromaDB: local, no server, `pip install chromadb`
- Qdrant or Pinecone: hosted, scalable, filter by metadata

TF-IDF (Term Frequency–Inverse Document Frequency):
Good enough for structured glossaries and codebooks where exact keywords appear in queries.
No API key. No model download. Runs in < 1ms.
```mermaid
flowchart LR
    Q([User query]) --> EQ["Vectorize query<br/>(same method as KB)"]
    DOCS[("KB vectors<br/>(pre-built)")] --> COS
    EQ --> COS["Cosine similarity<br/>query ↔ each chunk"]
    COS --> RANK[Rank by score]
    RANK --> THR{"score ≥<br/>threshold?"}
    THR -->|"yes → top-k"| CTX[Selected chunks]
    THR -->|no match| SKIP[No injection]
    CTX --> AUG[Inject into<br/>user prompt]
```
```python
from pathlib import Path

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Build index once
chunks = Path("taxi_glossary.txt").read_text().split("\n\n")
vectorizer = TfidfVectorizer()
kb_vectors = vectorizer.fit_transform(chunks)

# Retrieve on every query
def retrieve(query: str, top_k: int = 3) -> list[str]:
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, kb_vectors).flatten()
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in top_idx if scores[i] > 0]
```

No framework. No API key. ~10 lines.
TF-IDF represents text as word counts. Dense embeddings represent text as meaning: a neural network maps each chunk to a point in high-dimensional space where semantically similar text lands nearby.
| | TF-IDF | Dense embeddings |
|---|---|---|
| Vector type | Sparse (word counts) | Dense (all dims non-zero) |
| Matches | Exact keyword overlap | Semantic meaning |
| “JFK” ↔ “airport” | ❌ different words | ✅ similar meaning |
| Setup | sklearn, instant | model download or API |
| Speed | < 1ms | 5–50ms |
| Best for | Structured glossaries | Natural language docs |
Start with TF-IDF. Switch to embeddings when vocabulary mismatch causes misses.
For codebooks, exact match is actually the point: RatecodeID should only match chunks about RatecodeID, not semantically similar fare-related paragraphs.
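The “JFK ↔ airport” mismatch is easy to demonstrate: under TF-IDF, two texts that share no tokens score exactly zero, however related they are. A toy two-document check:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["JFK flat rate", "airport trips"]
vec = TfidfVectorizer()
m = vec.fit_transform(docs)

# "JFK" and "airport" share no tokens, so TF-IDF sees them as unrelated
sim = cosine_similarity(m[0], m[1])[0, 0]
print(sim)  # → 0.0
```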
When exact keyword overlap isn’t enough, switch to dense embeddings:
When I want free local embeddings, no API key, no internet: → sentence-transformers: download once, run forever
When I want a free API with no model download: → Jina AI: 1M tokens free, no credit card, 1024-dim vectors
When I need production-quality embeddings: → OpenAI text-embedding-3-small or Cohere Embed
When I want retrieval + storage + indexing in one framework: → llama_index: loads docs, builds vector index, persists to disk
| Level | Approach | When |
|---|---|---|
| 0: No RAG | All context in system prompt | < 10 short paragraphs |
| 1: Simple RAG | TF-IDF + flat files | Structured glossary, < 1ms |
| 2: Embedding RAG | Sentence-transformers + vector DB | 1K+ docs, semantic search |
| 3: Agentic RAG | LLM decides when and what to retrieve | Multi-step reasoning |
Start at 0. Move up only when you hit the wall at the current level.
✅ Use RAG when:
❌ Skip RAG when:
⚠️ Failure modes: retrieval miss · hallucination despite context · stale KB · context overflow
Connect RAG to your Shiny app:
- `@reactive.event(input.querychat_user_input)`: retrieve context, prepend to the user message before calling the model
- `retrieve()` as a tool: LLM decides when to call it (→ tools lecture)

→ Walkthrough and Demo with QueryChat
Scale up when ready:
Storage:

- ChromaDB: local, no server, `pip install chromadb`
- Qdrant or Pinecone

Retrieval:

- sentence-transformers: local, ~90MB, no API key
- text-embedding-3-small or Cohere Embed

Eval:
DSCI 532: Data Visualization 2 https://github.com/UBC-MDS/DSCI_532_vis-2_book