flowchart LR
subgraph disk["💾 On disk"]
D[(parquet<br>3M trips)]
end
subgraph mem["⚡ In RAM"]
direction LR
A[/payment = cash/]
B[/distance < 5 mi/]
F{{filtered}}
P1([n_trips])
P2([table])
A --> F
B --> F
F --> P1
F --> P2
end
F -- ".execute()" --> D
D -- "~190k rows" --> F
Databases in Shiny
Today
Part 1: Reading (large) data into your dashboard app
- Why not (always) pandas: Eager Loading Trap
- Lazy Loading in Shiny: DBs, parquet
- Why not (always) SQL: ibis + DuckDB-lazy loading
- Running example: NYC Yellow Taxi (~3M trips)
app-01, app-02a/b
Part 2: Writing data from your app
- Where problems come from: local, concurrent users, cloud deployments (Posit Connect)
- External backend
- Simple framework (writing form data)
- Using external database to log LLM interaction
Why Not (always) Pandas?: Eager vs Lazy Loading
The Eager Loading Trap
Everything so far:
import pandas as pd
df = pd.read_csv("trips.csv")  # entire file -> RAM
. . .
That works fine for penguins (344 rows).
. . .
NYC taxi 2024: ~40 million rows × 20 columns -> 5+ GB
. . .
Your laptop may survive (16–32 GB RAM)…
Posit Connect Cloud (and most similar services) free tier: 4 GB RAM. The app crashes on startup.
| Plan | RAM |
|---|---|
| Free | 4 GB |
| Basic | 8 GB |
| Enhanced | 16 GB |
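The 5+ GB figure is easy to sanity-check with back-of-envelope arithmetic (a sketch assuming 8-byte numeric columns; real string columns cost considerably more):

```python
rows, cols = 40_000_000, 20
bytes_per_value = 8  # float64 / int64
ram_gb = rows * cols * bytes_per_value / 1e9
print(f"~{ram_gb:.1f} GB")  # ~6.4 GB before any string columns
```

Already over the 4 GB free tier before pandas adds its own index and object overhead.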
Lazy Loading
Key idea: describe what you want (maybe in multiple steps), let the database fetch only the rows you need.
- The query shape lives in Python
- DuckDB executes it against the file on disk (in 525 we will learn more)
- Only the result (e.g. 500 matching rows) enters RAM
. . .
DuckDB scans the parquet file on disk and returns only the matching rows - the 3M-row file never enters RAM.
The Stack: Lazy Data Loading
parquet + DuckDB + ibis (works in many analytical settings, not only shiny)
| Layer | Tool | Role |
|---|---|---|
| Data | Parquet file | Columnar: only needed columns are read |
| Query engine | DuckDB | Executes (SQL) queries, reads parquet natively |
| Query language | ibis | Python API for building queries programmatically |
. . .
pip install 'ibis-framework[duckdb]'
DuckDB runs in-process - no server, no config, just a file with data (or nothing at all).
Connect to Your Data
. . .
taxi is a table reference, not a DataFrame. Nothing has been read from disk.
ibis Expressions
# this part defines the plan (the query); nothing happens here
taxi.filter([_.payment_type == 1]) # filter rows
taxi.select(["trip_distance", "fare_amount"]) # select columns
taxi.limit(500) # first N rows
taxi.count() # row count (still lazy! nothing happened yet)
# this part executes the plan (query), stuff finally happens
taxi.filter([_.payment_type == 1]).execute()  # <- runs the query
. . .
_ is a column reference shorthand: _.col_name reads as “the column named col_name”.
Lazy Until .execute()
# Step 1: build the expression (instant: no query sent to DuckDB yet, we're just planning)
expr = (
    taxi
    .filter([_.payment_type == 1, _.trip_distance < 5])
    .select(["trip_distance", "fare_amount", "total_amount"])
    .limit(500)
)
# Step 2: execute (we run the query: DuckDB scans the parquet file NOW)
df = expr.execute()  # returns a DataFrame
. . .
You can build and reshape expressions freely. DuckDB only does work at .execute().
ETL: One-Time Prep
Your raw data arrives as CSV (or JSON, …) - large, row-oriented, slow to query. Convert it to Parquet once; the app reads the fast columnar file from then on:
. . .
read_csv_auto infers headers and column types automatically.
Parquet is ~ 5–10× faster for filtered queries than CSV - DuckDB only reads the columns it needs.
(we will cover Parquet in depth in DSCI 525 next week)
Part 1 Summary: Lazy Data Loading
| Step | What | Why |
|---|---|---|
| One-time prep | CSV → Parquet (prep_data.py) | Columnar format: 5–10× faster filtered reads |
| Query engine | DuckDB (in-process) | No server, reads Parquet natively, pushes filters to disk |
| Query layer | ibis or raw SQL | ibis: composable Python; SQL: familiar, explicit |
| Shiny integration | @reactive.calc + .execute() | Query runs once per input change, only matching rows enter RAM |
. . .
This pattern works beyond Shiny — same stack in notebooks, ETL scripts, or any Python pipeline.
. . .
Dealing with Data in Shiny
SQL vs ibis: Building Filters
Three checkbox filters - user can check any combination.
SQL
@reactive.calc
def filtered():
    conditions = []
    if input.payment():
        ids = ", ".join(str(p) for p in input.payment())
        conditions.append(f"payment_type IN ({ids})")
    if input.vendor():
        ids = ", ".join(str(v) for v in input.vendor())
        conditions.append(f"VendorID IN ({ids})")
    if input.rate_code():
        ids = ", ".join(str(r) for r in input.rate_code())
        conditions.append(f"RatecodeID IN ({ids})")
    where = ("WHERE " + " AND ".join(conditions)
             if conditions else "")
    print(where)
    return con.execute(
        f"SELECT * FROM taxi {where} LIMIT 500"
    ).df()
ibis mirror
@reactive.calc
def filtered():
    expr = taxi
    if input.payment():
        expr = expr.filter(
            _.payment_type.isin(
                [int(p) for p in input.payment()]
            )
        )
    if input.vendor():
        expr = expr.filter(
            _.VendorID.isin(
                [int(v) for v in input.vendor()]
            )
        )
    if input.rate_code():
        expr = expr.filter(
            _.RatecodeID.isin(
                [int(r) for r in input.rate_code()]
            )
        )
    return expr.limit(500).execute()
. . .
Same structure per filter - but ibis builds a query object, not a string. No injection risk. Nothing to assemble and check. Fewer problems with ordering expressions.
ibis Goes Further
When all filters share the same shape, ibis expressions compose in a loop:
ibis mirror (from previous slide)
@reactive.calc
def filtered():
    expr = taxi
    if input.payment():
        expr = expr.filter(
            _.payment_type.isin(
                [int(p) for p in input.payment()]
            )
        )
    if input.vendor():
        expr = expr.filter(
            _.VendorID.isin(
                [int(v) for v in input.vendor()]
            )
        )
    if input.rate_code():
        expr = expr.filter(
            _.RatecodeID.isin(
                [int(r) for r in input.rate_code()]
            )
        )
    return expr.limit(500).execute()
ibis: filters as a list
@reactive.calc
def filtered():
    # col  = ibis column expression (e.g. _.payment_type)
    # fn   = Shiny input callable (e.g. input.payment)
    # fn() = current checked values (e.g. ("1", "2"))
    COLS = {
        "payment": (_.payment_type, input.payment),
        "vendor": (_.VendorID, input.vendor),
        "rate_code": (_.RatecodeID, input.rate_code),
    }
    filters = [
        col.isin([int(v) for v in fn()])
        for col, fn in COLS.values()
        if fn()  # skip if nothing checked
    ]
    expr = taxi.filter(filters) if filters else taxi
    return expr.limit(500).execute()
. . .
ibis expressions are first-class Python objects - store them, loop over them, compose them. Works well when filters are homogeneous (all .isin()). For mixed types (sliders, text), the explicit per-filter style on the previous slide is clearer.
ibis expression API: ibis-project.org/reference
So Why Not Just (Always) Write SQL?
Raw DuckDB SQL
@reactive.calc
def filtered():
    conditions = []
    if input.payment():
        ids = ",".join(str(p) for p in input.payment())
        conditions.append(
            f"payment_type IN ({ids})"
        )
    lo, hi = input.distance()
    conditions.append(
        f"trip_distance BETWEEN {lo} AND {hi}"
    )
    if input.passengers() != "All":
        conditions.append(
            f"passenger_count = {input.passengers()}"
        )
    if input.vendor() != "Both":
        conditions.append(
            f"VendorID = {input.vendor()}"
        )
    where = ("WHERE " + " AND ".join(conditions)
             if conditions else "")
    print(where)
    return con.execute(
        f"SELECT * FROM taxi {where} LIMIT 500"
    ).df()
ibis expressions
@reactive.calc
def filtered():
    expr = taxi
    if input.payment():
        expr = expr.filter(
            _.payment_type.isin(
                [int(p) for p in input.payment()]
            )
        )
    lo, hi = input.distance()
    expr = expr.filter(
        _.trip_distance.between(lo, hi)
    )
    if input.passengers() != "All":
        expr = expr.filter(
            _.passenger_count == int(input.passengers())
        )
    if input.vendor() != "Both":
        expr = expr.filter(
            _.VendorID == int(input.vendor())
        )
    return expr.limit(500).execute()
. . .
SQL: manual WHERE assembly, f-string injection risk, “All” guards scattered everywhere. ibis: each filter is optional-skip it, and it simply isn’t in the query.
ibis + DuckDB in Shiny
Together with DuckDB and lazy loading:
flowchart LR
A[/"input.payment()"/] --> C
B[/"input.distance()"/] --> C
C{{filtered}} -- ".execute()" --> DB[(parquet<br>file)]
DB -- "filtered<br>rows" --> C
C --> D([table])
C --> E([value_box])
con = ibis.duckdb.connect()
taxi = con.read_parquet("data/taxi_2024-01.parquet")

def server(input, output, session):
    session.on_ended(con.disconnect)  # clean up when user leaves

    @reactive.calc
    def filtered():  # ibis expression - inputs flow in naturally
        lo, hi = input.distance()
        return taxi.filter(
            [
                _.payment_type.isin([int(p) for p in input.payment()]),
                _.trip_distance.between(lo, hi),
            ]
        )

    @render.data_frame
    def table():
        return filtered().limit(500).execute()  # query runs here
. . .
One @reactive.calc, multiple outputs-DuckDB queries only once per input change.
app-01: ibis Intro (penguins)
code/lecture07/app-01-ibis-intro.py
- Loads palmerpenguins into an in-process DuckDB
- Species + island filters (selectize)
- ibis expression → row count + DataGrid
con = ibis.duckdb.connect()
penguins = con.create_table(
    "penguins", load_penguins(), overwrite=True
)

@reactive.calc
def filtered():
    return penguins.filter([
        _.species.isin(input.species()),
        _.island.isin(input.island()),
    ])

@render.data_frame
def table():
    return filtered().execute()
app-02: Taxi Dashboard
#| '!! shinylive warning !!': |
#| shinylive does not work in self-contained HTML documents.
#| Please set `embed-resources: false` in your metadata.
#| standalone: true
#| components: [editor, viewer]
#| viewerHeight: 520
## file: app.py
## app-02a-taxi-inmem.py — ibis + DuckDB taxi dashboard (shinylive / in-memory version)
## Uses an embedded 400-row sample so it runs in the browser.
## For the full 3M-row version with lazy parquet loading: see app-02b-taxi-parquet.py
import random
import ibis
import pandas as pd
from ibis import _
from shiny import App, reactive, render, ui
# ── Embedded sample data (400 rows, realistic distributions) ─────────────────
random.seed(42)
_n = 400
_pw = [60, 30, 5, 5]
_vw = [45, 55]
_xw = [2, 70, 15, 6, 3, 2, 2]
_rows = []
for i in range(_n):
    d = round(random.uniform(0.1, 22), 1)
    f = round(2.5 + d * random.uniform(2.2, 3.5), 2)
    _rows.append(
        dict(
            VendorID=random.choices([1, 2], weights=_vw)[0],
            passenger_count=random.choices([0, 1, 2, 3, 4, 5, 6], weights=_xw)[0],
            trip_distance=d,
            payment_type=random.choices([1, 2, 3, 4], weights=_pw)[0],
            fare_amount=f,
            total_amount=round(f * random.uniform(1.05, 1.3), 2),
        )
    )
data = pd.DataFrame(_rows)
# In production this would be:
# con = ibis.duckdb.connect()
# taxi = con.read_parquet("data/taxi_2024-01.parquet")
taxi = ibis.memtable(data)
# ── UI ────────────────────────────────────────────────────────────────────────
PAYMENT_LABELS = {"1": "Credit card", "2": "Cash", "3": "No charge", "4": "Dispute"}
VENDOR_CHOICES = {"Both": "Both", "1": "Creative Mobile", "2": "VeriFone"}
PASSENGER_CHOICES = ["All", "0", "1", "2", "3"]
app_ui = ui.page_sidebar(
    ui.sidebar(
        ui.input_checkbox_group(
            "payment",
            "Payment type",
            choices=PAYMENT_LABELS,
            selected=list(PAYMENT_LABELS.keys()),
        ),
        ui.input_slider(
            "distance", "Trip distance (miles)", min=0, max=25, value=[0, 15], step=0.5
        ),
        ui.input_select(
            "passengers", "Passenger count", choices=PASSENGER_CHOICES, selected="All"
        ),
        ui.input_radio_buttons(
            "vendor", "Vendor", choices=VENDOR_CHOICES, selected="Both"
        ),
        width=250,
    ),
    ui.layout_columns(
        ui.value_box("Matching trips", ui.output_text("n_trips")),
        ui.value_box("Avg fare", ui.output_text("avg_fare")),
        ui.value_box("Avg distance", ui.output_text("avg_dist")),
        col_widths=[4, 4, 4],
    ),
    ui.output_data_frame("table"),
    title="NYC Taxi Sample — ibis + DuckDB",
)
# ── Server ────────────────────────────────────────────────────────────────────
def server(input, output, session):
    @reactive.calc
    def filtered():
        expr = taxi
        if input.payment():
            expr = expr.filter(
                _.payment_type.isin([int(p) for p in input.payment()])
            )
        lo, hi = input.distance()
        expr = expr.filter(_.trip_distance.between(lo, hi))
        if input.passengers() != "All":
            expr = expr.filter(_.passenger_count == int(input.passengers()))
        if input.vendor() != "Both":
            expr = expr.filter(_.VendorID == int(input.vendor()))
        return expr

    @render.text
    def n_trips():
        return f"{filtered().count().execute():,}"

    @render.text
    def avg_fare():
        result = filtered().agg(avg=_.fare_amount.mean()).execute()
        return f"${result['avg'].iloc[0]:.2f}"

    @render.text
    def avg_dist():
        result = filtered().agg(avg=_.trip_distance.mean()).execute()
        return f"{result['avg'].iloc[0]:.1f} mi"

    @render.data_frame
    def table():
        return (
            filtered()
            .select(["passenger_count", "trip_distance", "payment_type",
                     "fare_amount", "total_amount"])
            .execute()
        )
app = App(app_ui, server)
## file: requirements.txt
ibis-framework
toolz
pyarrow
pyarrow-hotfix
duckdb
Part 2: Writing & Persistent Storage
Why Local Files Break
This Works Locally
sequenceDiagram
participant U as User
participant A as Shiny App
participant F as logs.csv
U->>A: submit query
A->>F: append row
F-->>A: ok
A->>U: query logged ✓
. . .
logs.csv lives on your laptop - it survives restarts, and only one process writes to it.
No problem.
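The local setup in the diagram is only a few lines (a sketch; log_row and the field names are illustrative, not from the course apps):

```python
import csv
import os
from datetime import datetime

def log_row(path, row):
    """Append one row; write the header only when creating the file."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=row)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

log_row("logs.csv", {"timestamp": datetime.now().isoformat(), "query": "taxi fares"})
```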
Where Does This Break?
| Deployment | Works? | Why |
|---|---|---|
| Local machine (terminal) | ✅ | Real filesystem, single process |
| Dedicated server (yours) | ⚠️ | Single worker: fine. Multiple workers: race conditions |
| Shinylive | ❌ | In-memory filesystem is ephemeral (-> next slide) |
| Posit Connect | ❌ | Container storage is ephemeral; same single-file problems (-> next two slides) |
. . .
Shinylive: Silent Failure
Shinylive runs entirely in the browser-there is no server.
. . .
df.to_csv("logs.csv", mode="a")  # <- this runs… and succeeds
. . .
Pyodide has an in-memory filesystem (Emscripten VFS).
The write succeeds-no error, no warning.
. . .
But:
- Data lives in browser memory-gone on page reload
- Every browser tab gets its own isolated instance
- The host machine never sees the file
. . .
From the Shiny docs:
“No secrets” - code runs in the browser, credentials are visible to users.
“Restricted network” - browser blocks TCP sockets, so pymongo and requests cannot connect to external services.
Posit Connect: Container Restarts
sequenceDiagram
participant U as User
participant A as App Container
participant F as logs.csv
U->>A: submit query
A->>F: append row ✓
Note over A,F: redeploy or idle timeout-container recreated
U->>A: submit query
A->>F: ❌ FileNotFoundError
Note over F: file never existed<br>in the new container
. . .
Posit Connect runs your app in an ephemeral container - the filesystem resets on every restart or redeploy.
Concurrent Users or Multiple Workers
sequenceDiagram
participant U1 as User 1
participant F as logs.csv
participant U2 as User 2
U1->>F: read → 3 rows
U2->>F: read → 3 rows
Note over U1: LLM call (await-event loop switches)
U2->>F: write → 4 rows ✓
U1->>F: write → 4 rows ✓
Note over F: race condition-U1 overwrites U2's new row
. . .
Every await (e.g. an LLM call) is a yield point-the event loop can switch to another user’s request between your read and your write.
Posit Connect makes this worse: it scales by spinning up multiple separate processes (workers), each with its own copy of the file. No coordination means concurrent writes silently corrupt data - even if you never hit an await.
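The lost-update race in the diagram can be reproduced in a few lines of asyncio (a toy sketch: the dict stands in for logs.csv, the sleep for the LLM call):

```python
import asyncio

store = {"rows": 3}  # stands in for logs.csv

async def append_row():
    n = store["rows"]          # read
    await asyncio.sleep(0.01)  # await (e.g. the LLM call): event loop switches here
    store["rows"] = n + 1      # write based on the now-stale read

async def main():
    await asyncio.gather(append_row(), append_row())

asyncio.run(main())
print(store["rows"])  # 4, not 5 - one append was silently lost
```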
The Fix: External Storage
sequenceDiagram
participant W1 as Worker 1
participant S as External Service
participant W2 as Worker 2
W1->>S: append row
S-->>W1: ok
W2->>S: append row
S-->>W2: ok
Note over S: atomic writes, persistent,<br>survives restarts
. . .
Swap the backend-keep the same save_info() / load_data() wrapper in your app.
Persistent Storage: What’s Out There
| Service | Free tier | Schema | Python client | Good for |
|---|---|---|---|---|
| Google Sheets | ✅ Unlimited rows | None | gspread | Simple logs, familiar UI |
| Airtable | ✅ 1 000 rows | Must pre-define columns | pyairtable | Low-code apps, visual editing |
| PostgreSQL (server or cloud) | ✅ varies | SQL schema | psycopg2 / sqlalchemy | Structured data, joins |
| MongoDB (server or cloud) | ✅ varies | None (document store) | pymongo | Logs, surveys, free-form records |
. . .
Databases: Mode 1 — Local
Both Postgres and MongoDB can run on your laptop. This works fine during development.
sequenceDiagram
participant A as Shiny App<br/>(your laptop)
participant D as Database<br/>(localhost)
A->>D: INSERT / insert_one
D-->>A: ✅ ok
Note over D: lives on your machine,<br/>survives restarts
. . .
Problem: deploy the app and the database stays behind.
The container on Posit Connect has no idea localhost:5432 or localhost:27017 exists.
Databases: Mode 2 — Self-hosted Server
You run the database on a separate server, accessible over the network.
sequenceDiagram
participant A as App Container<br/>(Posit Connect)
participant F as Firewall / VPC
participant D as Database Server<br/>(your VM)
A->>F: connect (port 5432 / 27017)
Note over F: inbound rule open?<br/>IP allowlist? auth configured?
F->>D: forward
D-->>A: ✅ ok (if everything is configured)
. . .
This works — but you own the maintenance: updates, backups, auth rules, uptime.
For a student project or a small app, that’s a lot of overhead.
Databases: Mode 3 — Cloud
Someone else runs the database. You get a connection string.
sequenceDiagram
participant A as App Container<br/>(Posit Connect)
participant D as Cloud DB<br/>(MongoDB Atlas / Supabase / Neon / …)
A->>D: INSERT / insert_one<br/>via connection string
D-->>A: ✅ ok
Note over D: managed, persistent,<br/>free tier available,<br/>accessible from anywhere
No firewall config. No server to maintain. Free tier covers small apps easily.
Works identically locally and on Posit Connect — same connection string in .env.
| DB | Cloud providers |
|---|---|
| MongoDB | Atlas (M0 free), Azure Cosmos DB |
| Postgres | Neon, Supabase, Railway, Render |
Why MongoDB Atlas Today?
| | MongoDB | Postgres |
|---|---|---|
| Data shape | Documents (JSON), flexible | Tables, fixed schema |
| Best for | Logs, surveys, free-form records | Joins, constraints, reporting |
| Python client | pymongo | psycopg2 / ibis |
| Free cloud | Atlas M0, 512 MB | Neon, Supabase |
- Free tier — M0 cluster, no credit card required
- No schema — insert_one(dict), add fields any time
- Auth — one connection string in .env
- Concurrent writes — atomic at the document level
. . .
# .env
MONGODB_URI=mongodb+srv://user:pass@cluster.mongodb.net/?appName=Cluster0
Setup: mongodb.com → free account → create M0 cluster → create user with password → copy connection string
App 03: A Simple Form: The Pattern
Any Shiny input can trigger a write.
The simplest case: a feedback form.
# UI: plain inputs
ui.input_text("name", "Your name"),
ui.input_text_area("comment", "Comment"),
ui.input_action_button("submit", "Submit"),

# Server: write on button click
@reactive.effect
@reactive.event(input.submit)
def save_response():
    row = {"name": input.name(), "comment": input.comment(),
           "timestamp": datetime.now().isoformat()}
    collection.insert_one(row)
. . .
@reactive.event(input.submit) - only fires on button click, not on every keystroke.
Editable DataGrid -> MongoDB
Shiny’s DataGrid supports inline editing-students can fill in a table and submit it:
# Render an editable grid seeded with empty rows
@render.data_frame
def response_table():
    return render.DataGrid(template_df, editable=True)

# On submit: read edits and write to MongoDB
@reactive.effect
@reactive.event(input.submit)
def save_edits():
    edited = response_table.data_view()  # DataFrame with user edits
    for row in edited.to_dict("records"):
        collection.insert_one(row)
. . .
More complex loop: querychat Logger
What We’re Logging
Every time querychat generates SQL, we capture:
| Field | Source |
|---|---|
timestamp |
datetime.now() |
user_query |
on_tool_request hook: last user turn |
sql |
req.arguments["query"] from the tool call |
n_rows |
len(qc_vals.df()): how many rows (reactive domain caveat) |
model |
ChatAuto provider + model name |
section |
input.section(): user-selected |
. . .
qc_vals is what qc.server() returns - it exposes .sql(), .df(), .title(), and .client.
The Logging Hook: Why Not @reactive.effect?
The naive approach:
@reactive.effect
def log_query():
    sql = qc_vals.sql()  # fires on every reactive invalidation
    save_info(...)       # -> thousands of duplicate writes
. . .
qc_vals.sql() invalidates on every streaming token from the LLM, not just when the final SQL arrives.
. . .
Better: hook directly into the tool call (querychat allows that)
on_tool_request: One Fire Per Query
def on_query(req):
    if req.name not in ("querychat_update_dashboard", "querychat_query"):
        return  # ignore other tools
    sql = req.arguments.get("query", "")
    turns = qc_vals.client.get_turns()
    user_turns = [t for t in turns if t.role == "user"]
    pending.set({"user_query": user_turns[-1].text,
                 "sql": sql, "tool": req.name})  # .set() is safe from Extended Task

qc_vals.client.on_tool_request(on_query)  # register the hook
. . .
Fires exactly once when the LLM calls the tool-before the response streams back.
Data Flow: app-04
sequenceDiagram
participant U as User<br/>(browser)
box Shiny App Process
participant QC as querychat
participant OTR as on_tool_request
participant RV as reactive.value<br/>(pending)
participant FL as flush_log
end
participant MDB as MongoDB Atlas
box Log App (separate process)
participant LOG as Log DataGrid
end
U->>QC: sends message
QC->>OTR: LLM tool call fires
OTR->>RV: pending.set(row)
RV-->>FL: invalidates effect
FL->>MDB: insert_one(row)
LOG->>MDB: load_data()
MDB-->>LOG: DataFrame
. . .
app-04: MongoDB Backend
Only save_info() and load_data() change, everything else is identical to 04a:
from pymongo import MongoClient
import os

collection = MongoClient(os.environ["MONGODB_URI"])["dsci532"]["query_log"]

def save_info(row: dict) -> None:
    collection.insert_one(row)  # atomic - safe under multiple workers

def load_data(section: str) -> pd.DataFrame:
    rows = list(collection.find({"section": section}, {"_id": 0}))
    return pd.DataFrame(rows, columns=SCHEMA) if rows else pd.DataFrame(columns=SCHEMA)
. . .
insert_one() is atomic-no locking, no race conditions, works across workers.
Switching Providers
The same app works with any LLM-no code changes, just env vars:
# GitHub Models (free):
CHATLAS_CHAT_PROVIDER_MODEL=github/gpt-4.1-mini
GITHUB_TOKEN=...
# Anthropic:
CHATLAS_CHAT_PROVIDER_MODEL=anthropic/claude-haiku-4-5
ANTHROPIC_API_KEY=...
. . .
from chatlas import ChatAuto

_chat = ChatAuto()  # reads CHATLAS_CHAT_PROVIDER_MODEL
LLM_MODEL = f"{_chat.provider.name}/{_chat.provider.model}"
qc = querychat.QueryChat(titanic, "titanic", client=_chat)
. . .
Deploy the same code twice on Posit Connect with different env vars -> compare models in MongoDB.