BigQuery Query Infrastructure (Admin Portal) β Design Spec β
Date: 2026-04-30 Status: Draft, pending user review Owner: cattreedev Umbrella issue: to be filed (Admin: BigQuery query infrastructure (aggregate-backed canned reports))
Goal β
Build infrastructure for surfacing BigQuery-derived metrics in the admin portal, with a clean architectural split between aggregate-backed canned reports (curated dashboards reading from small precomputed tables) and raw-events ad-hoc SQL (admin-authored, cost-gated). Land aggregate-backed canned reports end-to-end this iteration with one initial entry β event counts by type over a chosen window β proving the path. The raw-SQL editor and its cost-gate infrastructure are a deliberate follow-up. Same backend will later serve merchant-facing metrics.
Architectural split (the core decision) β
Two distinct query modes with different backing strategies:
| Mode | Reads from | Cost profile | Cost gates needed | This iteration |
|---|---|---|---|---|
| Curated canned reports | Pre-aggregated rollup tables (e.g., analytics.event_counts_daily) | Near-zero, predictable | No β aggregates are tiny | β Built |
| Ad-hoc raw SQL | Raw analytics.events | Variable, can be expensive | Yes β maximumBytesBilled, row caps, SQL-shape checks, global ceiling | βοΈ Deferred |
Why the split matters: cost gates only need to exist where humans author SQL. Curated dashboards are server-controlled, so concurrent admin viewers cost the same as one viewer (BQ caches table reads cheaply, and aggregates are small regardless). This collapses ~half the runtime guardrails out of this iteration's scope and crisply names what cost gates protect against.
Non-goals (this iteration) β
- Raw-SQL admin editor and the runtime infrastructure that supports it (
maximumBytesBilledper-job caps, row-limit truncation, SQL-shape startswith check, global byte ceiling,dryRuncost estimation). All deferred to the follow-up that builds the editor. - Charting / visualization. Tables only. Sort/filter on the client.
- Scheduled / exported / emailed reports. Reports tab stays placeholder.
- Merchant-facing metrics. Architectural seam left for it; not built.
- Sub-daily freshness. Aggregation runs every 24h. "Today's count" trails by up to 24h. Acceptable for the canned reports shipping now; revisit if a future report needs minute-level freshness.
Existing surface area β
- Data source:
analytics.eventsper Firebase project. Schema in tooling/schemas/bigquery-events.json. - BQ writer: packages/forge/bigquery.js streams inserts via
@google-cloud/bigquery. - Admin scaffolding: apps/admin/src/admin/analytics/BigQueryWorkspace.jsx routes
/analytics/bigquery/{export,console,reports,retention}β all four currently render placeholder cards. This iteration fills in Console (its existing subtitle: "Run ad-hoc analytics queries and inspect event-level data" β the natural home for canned and eventually raw query running). - API service: services/api/analytics on Cloud Run, port 8082. Existing
routes/admin.jshosts admin reads behindverifyFirebaseToken+requireRole. Existingroutes/scheduled.jshosts cron-triggered jobs (e.g.,/daily-aggregation) β this is where the new aggregation job lands.
Architecture β
βββββββββββββββββββββββ
β analytics.events β β raw events stream (Forge writes)
β (raw, partitioned) β
ββββββββββββ¬βββββββββββ
β MERGE every 24h (Cloud Scheduler β analytics-api)
βΌ
βββββββββββββββββββββββββββββββ
β analytics.event_counts_ β β daily rollup, partitioned by day, clustered by event_name
β daily β
ββββββββββββ¬βββββββββββββββββββ
β POST /admin/bq-query { queryName: 'event_counts', params: { window: '7d' } }
βΌ
βββββββββββββββββββββββ JSON: { rows, schema, jobMeta, generatedAt } ββββββββββββββββββββ
β analytics-api β ββββββββββββββββββββββββββββββββββββββββββββββΊ β Admin portal β
β + query registry β ββββββββββββββββββββββββββββββββββββββββββββββ β Console tab β
β + Forge runner β β β
β + 60s memo cache β β β
βββββββββββββββββββββββ ββββββββββββββββββββAggregation infrastructure (new) β
Aggregate table: analytics.event_counts_daily β
CREATE TABLE `lantern-app-${env}.analytics.event_counts_daily` (
day DATE NOT NULL,
event_name STRING NOT NULL,
event_tier STRING NOT NULL, -- 'auto' | 'registered'
environment STRING NOT NULL, -- 'production' | 'development'
count INT64 NOT NULL,
aggregated_at TIMESTAMP NOT NULL -- when this row was computed
)
PARTITION BY day
CLUSTER BY event_name;Partition by day for cheap pruning on dashboard reads. Cluster by event_name because it's the primary group/filter axis. At ~50 event types Γ 2 tiers Γ 2 environments = ~200 rows/day, this stays tiny indefinitely (~73K rows/year). Schema lives at tooling/schemas/bigquery-event-counts-daily.json (new file, BQ JSON schema format).
Aggregation MERGE (idempotent, runs every 24h) β
MERGE `lantern-app-${env}.analytics.event_counts_daily` T
USING (
SELECT
DATE(timestamp) AS day,
event_name,
event_tier,
environment,
COUNT(*) AS count,
CURRENT_TIMESTAMP() AS aggregated_at
FROM `lantern-app-${env}.analytics.events`
WHERE DATE(timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY day, event_name, event_tier, environment
) S
ON T.day = S.day
AND T.event_name = S.event_name
AND T.event_tier = S.event_tier
AND T.environment = S.environment
WHEN MATCHED THEN UPDATE SET
count = S.count,
aggregated_at = S.aggregated_at
WHEN NOT MATCHED THEN INSERT (day, event_name, event_tier, environment, count, aggregated_at)
VALUES (S.day, S.event_name, S.event_tier, S.environment, S.count, S.aggregated_at);Why MERGE not INSERT: re-running is safe. The job can be replayed without producing duplicate rows. Why yesterday only: one full day is the smallest correct unit. Aggregating CURRENT_DATE() would lock in a partial number. Why these grouping columns: preserves dimensions admins are likely to filter on later (tier, env) without bloating the table.
Scheduled route: POST /scheduled/aggregate-event-counts-daily β
New endpoint in services/api/analytics/src/routes/scheduled.js alongside the existing /daily-aggregation. Runs the MERGE once per invocation. Auth: existing scheduled-route auth (Cloud Scheduler invoker β same pattern as /daily-aggregation). Cloud Scheduler config: cron 0 1 * * * (01:00 UTC daily β late enough that yesterday's events have all landed; existing scheduler entries cluster around this window).
Logs structured pino entry on completion: { event: 'event_counts_aggregation', daysProcessed: 1, rowsAffected, durationMs, jobId }. Failures surface to Cloud Logging error-budget alerts (existing infra).
Backfill route: POST /admin/aggregations/event-counts/backfill β
Sibling endpoint in routes/admin.js. Body: { days: number } (1β365). Runs the MERGE for the last N days in a single statement (loop unrolled into the SQL by adjusting the WHERE DATE(timestamp) BETWEEN β¦ clause). Admin-only auth. Called manually after first deploy to populate historical data; reusable if re-aggregation is ever needed.
Backend: packages/forge/query.js (read-side runner) β
New read-side companion to packages/forge/bigquery.js. Sole responsibility: execute a parameterized BQ query and return shaped results.
export async function runQuery({ sql, params }) {
// Returns { rows, schema, jobMeta: { bytesProcessed, totalBytesBilled, cacheHit, jobId, durationMs }, generatedAt }
// Logs structured pino entry: { event: 'bq_query_executed', queryName, params, userId, ...jobMeta, costUsd }
// No cost gates this iteration β aggregate tables make them unnecessary. Gates land with raw-SQL iteration.
}Why structured logging now even though aggregates make cost negligible: it builds the audit trail (who ran what, when) and the cost data sink that the future raw-SQL iteration's cost-association infrastructure will consume. Captures totalBytesBilled and computes costUsd = (totalBytesBilled / 1e12) * 5 (BQ on-demand price). One line per query, structured, exportable.
Backend: query registry β
New module at services/api/analytics/src/services/queryRegistry.js. Plain object mapping queryName β { sql, paramSchema, description, source, audience }.
export const QUERY_REGISTRY = {
event_counts: {
description: 'Event types and counts over a chosen time window',
source: 'aggregate', // vs 'raw' β informational this iteration; gates branch on it later
audience: 'admin', // vs 'merchant' β endpoint-level enforcement
paramSchema: { window: { enum: ['24h', '7d', '30d', 'all'], default: '7d' } },
sql: ({ window }) => ({
sql: `
SELECT event_name, SUM(count) AS count
FROM \`${getProjectId()}.analytics.event_counts_daily\`
WHERE day >= @since
GROUP BY event_name
ORDER BY count DESC
LIMIT 5000
`,
params: { since: resolveWindowToDate(window) },
}),
},
}Adding the second/third/Nth canned report = adding an entry (and its aggregate table + aggregation job, if it needs one). No new endpoint, no new auth wiring, no new result-table code.
resolveWindowToDate: 24h β todayβ1, 7d β todayβ7, 30d β todayβ30, all β todayβ365 (aggregate only goes back as far as we've aggregated anyway, but the cap remains as a defensive belt).
Note: LIMIT 5000 here is a query author convention (the registry can decide per-entry), not a runner-enforced cap this iteration. Runner-level row-limit truncation lands with the raw-SQL iteration.
Endpoint: POST /admin/bq-query β
In services/api/analytics/src/routes/admin.js. Auth via existing verifyFirebaseToken + requireRole (already wired for that route group).
POST chosen over GET: future raw-SQL mode requires bodies. Using POST consistently from day 1 keeps the contract stable. Server-side 60s memo cache gives GET-like cheapness regardless.
Request:
{ "queryName": "event_counts", "params": { "window": "7d" } }Reserved for next iteration: { "sql": "SELECT β¦" }. This iteration rejects with 400 if sql is present.
Response:
{
"queryName": "event_counts",
"params": { "window": "7d" },
"generatedAt": "2026-04-30T18:00:00.000Z",
"rows": [{ "event_name": "lantern_lit", "count": 1247 }],
"schema": [
{ "name": "event_name", "type": "STRING" },
{ "name": "count", "type": "INTEGER" }
],
"jobMeta": {
"bytesProcessed": 12480,
"totalBytesBilled": 10485760,
"costUsd": 0.0000524,
"cacheHit": false,
"durationMs": 184
}
}Validation:
queryNamemust exist in registry β else 400.paramsvalidated againstparamSchemaβ else 400 with field-level error.audienceon registry entry must match the route's audience β else 403 (defense-in-depth for the future merchant route).sqlfield present in body β 400 (this iteration).
Caching: Server-side memoize for 60s, keyed on (queryName, JSON.stringify(params)). Per Cloud Run instance.
Errors: 400 / 403 / 502, structured { error: { code, message, jobId? } }. No silent fallbacks.
IAM / service account β
The Cloud Run SA running analytics-api already has BQ access for writes. For reads + the new aggregation MERGE we need:
roles/bigquery.dataVieweronanalyticsdataset (for canned-report reads)roles/bigquery.dataEditoronanalyticsdataset (for the MERGE β needed whether foreventsor justevent_counts_daily; cleanest to scope to dataset)roles/bigquery.jobUseron the project (to run jobs)
Verified during plan execution: what the SA currently has. If it has dataEditor already, no change. If it has only dataViewer, add dataEditor on the dataset (needed for the MERGE writes to event_counts_daily). Not narrowing further this iteration β when raw-SQL ships, that path will use a separate, strictly-dataViewer SA so admin-authored queries can't write.
Frontend β
Console tab (/analytics/bigquery/console) β working surface this iteration.
βββββββββββββββββββββββββββββββββββββββββββ
β Saved Queries β
β βββββββββββββββββ β
β βΈ Event counts by type [Run] β β clickable card, expands to params
β Window: [7 days βΎ] β
β β
β βββββββββββββββββ β
β Raw SQL editor β coming next iteration β β inline placeholder block
β βββββββββββββββββ β
β β
β Results β
β βββββββββββββββββ β
β Event Type βΌ Count β β sortable
β lantern_lit 1,247 β
β wave_sent 312 β
β β¦ β
β β
β Scanned: 12 KB Β· Billed: 10 MB Β· β¦ β
βββββββββββββββββββββββββββββββββββββββββββComponents:
- Saved-query card list β driven by a small client-side manifest mirroring registry metadata (name, description, param schema). Initially one entry. Future entries auto-render.
- Param form β generated from
paramSchema.enumparams use<StyledSelect>(CLAUDE.md rule). Defaults pre-applied. - Raw SQL placeholder block β visible inline, links to the follow-up issue once filed.
- Result table β generic shared
QueryResultTablekeyed on responseschema. Sortable client-side. Right-aligns numeric columns. Reused by raw-SQL mode later. - Job metadata footer β bytes processed, billed, cost, cache hit/miss, duration. Lightweight observability.
- Refresh button β
.btn-secondary.
State: local useState/useEffect. No new global state.
Reports tab β stays placeholder. Recurring/scheduled/delivered reports are a separate future iteration on top of the same registry.
Future seam: merchant metrics β
Same runQuery, same registry pattern, but:
- Merchant-scoped registry entries flagged
audience: 'merchant'and take amerchantIdparam. - Mounted under a different route (e.g.
/merchant/bq-query) with merchant-role auth. - Endpoint enforces
audiencematches the route β admin queries can't be invoked from merchant route, and vice versa.
Out of scope this iteration. Called out so the registry shape doesn't accidentally preclude it.
Testing β
| Layer | Test |
|---|---|
packages/forge/query.js | Vitest unit, BQ client mocked. Param binding, schema mapping, error propagation, structured-log emission shape. |
| Query registry | Vitest unit. Validates paramSchema, SQL builders produce expected { sql, params }, windowβdate resolution at boundaries. Parameterized test exercising every entry with valid + known-bad params (catches missing entries when registry grows). |
| Aggregation route | Vitest endpoint test, BQ client mocked. Runs MERGE, returns rows-affected, handles BQ errors. |
| Backfill route | Vitest endpoint test. Admin auth gate, days-range validation, MERGE invocation. |
/admin/bq-query endpoint | Vitest endpoint test, registry + Forge mocked. Auth gate, queryName validation, param validation, audience enforcement, cache behavior, explicit reject of { sql } payloads. |
QueryResultTable | Vitest + RTL. Sort, numeric alignment, empty state. |
| Console tab | Vitest + RTL. Renders saved-query list, param form, runs query, displays results, shows raw-SQL placeholder. Mocked endpoint. |
| Storybook | Story for QueryResultTable with mocked rows + schema. |
Coverage threshold (75%) on new code. Existing files stay above their current line.
Open questions β
None blocking. Decisions locked:
- Endpoint path & method:
POST /admin/bq-queryinroutes/admin.js. - Aggregate table:
analytics.event_counts_daily, partitioned byday, clustered byevent_name. - Aggregation cadence: every 24h at
0 1 * * *UTC. - Aggregation strategy: MERGE for yesterday's events only (idempotent, replayable).
- Default window for
event_counts: 7 days.allcapped at 365 days. - Cache TTL: 60s.
- Result row hint per registry entry (
LIMIT 5000forevent_counts); runner-enforced cap deferred to raw-SQL iteration. - This iteration: aggregate-backed only. Raw SQL + cost gates deferred.
- Cost logging: structured pino on every query (audit + future cost-association substrate).
To verify during plan execution (non-blocking either way):
- Current
analytics-apiSA's BQ roles (dataViewervsdataEditor). - Cloud Scheduler config pattern used by existing
/daily-aggregation(replicate exactly).
Workflow notes (process, not feature) β
- Branch:
feat/<umbrella-N>-bq-query-infraonce umbrella issue is filed. - Worktree:
.claude/worktrees/bq-query-infrawith.env.localsymlinked from main checkout. Setup steps live in user'sfeedback_avoid_worktrees_for_sequential_prs.mdmemory. - PR: Not opened until user explicitly says so. Pre-PR pushes are CI-free per
feedback_no_unprompted_push.md; PR-creation needs go-ahead. - Issues bundled: This single PR closes the new umbrella issue. Files a follow-up issue for the raw-SQL Console (next iteration). Does not claim to close #272, #332, #333 β those remain on their own tracks.
Risks & mitigations β
| Risk | Mitigation |
|---|---|
| Generic endpoint = generic risk surface | Registry-only-emits-SELECT this iteration; { sql } payloads explicitly rejected; SQL written by registry author with parameterized queries (@param), never string-interpolated user input. |
| Aggregation job fails silently | Structured pino log on success/failure; failures surface via Cloud Logging error-budget alerts (existing infra). Re-running the job is safe (MERGE is idempotent). |
| Aggregation runs but yesterday's data still incomplete (timezone edge) | 0 1 * * * UTC is well after midnight in any business-relevant timezone; events stream finalizes near-real-time. If late-arriving events become an issue, switch to interval 2 days lookback. |
| Aggregate table schema drifts from raw events | New event_name values flow through automatically (no schema change needed). New dimensions (e.g., user_agent) require a table-schema migration if we want to aggregate by them β accept and revisit when needed. |
| BQ cost in registry queries grows unexpectedly | Aggregate tables stay tiny (~73K rows/year). Even unbounded scans on event_counts_daily cost fractions of a cent. Real cost surface lives with raw-SQL iteration. |
| Future merchant route misused for admin queries | audience flag on registry entries; endpoint-level enforcement that audience matches route. Seam exists, not built. |
Success criteria β
- After first deploy + manual backfill, an admin opens
/analytics/bigquery/console, picks "Event counts by type", chooses a window, runs it, and sees a sorted table within ~1s on a warm cache. - The aggregation runs every 24h on schedule, populates
event_counts_daily, and re-running it is provably safe (idempotent MERGE). - The endpoint enforces admin auth, rejects unknown
queryNames, rejects raw{ sql }payloads, and enforcesaudiencematches route. - Adding a hypothetical second aggregate-backed report (e.g., "Active users by day") requires only: a new aggregate table + aggregation job + registry entry + client manifest entry. No changes to endpoint, runner, auth, or result-table code.
- The inline raw-SQL placeholder on the Console tab tells the admin clearly that raw SQL is "coming next" without looking abandoned.
- Per-query structured logs (
event: 'bq_query_executed') are emitted with cost data, queryable in Cloud Logging, ready for cost-association infrastructure when it lands. - All new code is covered by tests;
npm run validatepasses. - The pattern documented in this spec is sufficient for the follow-up raw-SQL iteration to extend rather than redesign.