BigQuery Query Infrastructure (Admin Portal) — Design Spec

Date: 2026-04-30 Status: Draft, pending user review Owner: cattreedev Umbrella issue: to be filed (Admin: BigQuery query infrastructure (aggregate-backed canned reports))

Goal

Build infrastructure for surfacing BigQuery-derived metrics in the admin portal, with a clean architectural split between aggregate-backed canned reports (curated dashboards reading from small precomputed tables) and raw-events ad-hoc SQL (admin-authored, cost-gated). Land aggregate-backed canned reports end-to-end this iteration with one initial entry — event counts by type over a chosen window — proving the path. The raw-SQL editor and its cost-gate infrastructure are a deliberate follow-up. Same backend will later serve merchant-facing metrics.

Architectural split (the core decision)

Two distinct query modes with different backing strategies:

Mode	Reads from	Cost profile	Cost gates needed	This iteration
Curated canned reports	Pre-aggregated rollup tables (e.g., `analytics.event_counts_daily`)	Near-zero, predictable	No — aggregates are tiny	✅ Built
Ad-hoc raw SQL	Raw `analytics.events`	Variable, can be expensive	Yes — `maximumBytesBilled`, row caps, SQL-shape checks, global ceiling	⏭️ Deferred

Why the split matters: cost gates only need to exist where humans author SQL. Curated dashboards are server-controlled, so concurrent admin viewers cost the same as one viewer (BQ caches table reads cheaply, and aggregates are small regardless). This collapses ~half the runtime guardrails out of this iteration's scope and crisply names what cost gates protect against.

Non-goals (this iteration)

Raw-SQL admin editor and the runtime infrastructure that supports it (maximumBytesBilled per-job caps, row-limit truncation, SQL-shape startswith check, global byte ceiling, dryRun cost estimation). All deferred to the follow-up that builds the editor.
Charting / visualization. Tables only. Sort/filter on the client.
Scheduled / exported / emailed reports. Reports tab stays placeholder.
Merchant-facing metrics. Architectural seam left for it; not built.
Sub-daily freshness. Aggregation runs every 24h. "Today's count" trails by up to 24h. Acceptable for the canned reports shipping now; revisit if a future report needs minute-level freshness.

Existing surface area

Data source: analytics.events per Firebase project. Schema in tooling/schemas/bigquery-events.json.
BQ writer: packages/forge/bigquery.js streams inserts via @google-cloud/bigquery.
Admin scaffolding: apps/admin/src/admin/analytics/BigQueryWorkspace.jsx routes /analytics/bigquery/{export,console,reports,retention} — all four currently render placeholder cards. This iteration fills in Console (its existing subtitle: "Run ad-hoc analytics queries and inspect event-level data" — the natural home for canned and eventually raw query running).
API service: services/api/analytics on Cloud Run, port 8082. Existing routes/admin.js hosts admin reads behind verifyFirebaseToken + requireRole. Existing routes/scheduled.js hosts cron-triggered jobs (e.g., /daily-aggregation) — this is where the new aggregation job lands.

Architecture

┌─────────────────────┐
│  analytics.events   │  ← raw events stream (Forge writes)
│  (raw, partitioned) │
└──────────┬──────────┘
           │ MERGE every 24h (Cloud Scheduler → analytics-api)
           ▼
┌─────────────────────────────┐
│  analytics.event_counts_    │  ← daily rollup, partitioned by day, clustered by event_name
│  daily                      │
└──────────┬──────────────────┘
           │ POST /admin/bq-query { queryName: 'event_counts', params: { window: '7d' } }
           ▼
┌─────────────────────┐    JSON: { rows, schema, jobMeta, generatedAt }    ┌──────────────────┐
│  analytics-api      │  ─────────────────────────────────────────────►   │  Admin portal    │
│  + query registry   │  ◄─────────────────────────────────────────────   │  Console tab     │
│  + Forge runner     │                                                    │                  │
│  + 60s memo cache   │                                                    │                  │
└─────────────────────┘                                                    └──────────────────┘

Aggregation infrastructure (new)

Aggregate table: `analytics.event_counts_daily`

sql

CREATE TABLE `lantern-app-${env}.analytics.event_counts_daily` (
  day            DATE      NOT NULL,
  event_name     STRING    NOT NULL,
  event_tier     STRING    NOT NULL,   -- 'auto' | 'registered'
  environment    STRING    NOT NULL,   -- 'production' | 'development'
  count          INT64     NOT NULL,
  aggregated_at  TIMESTAMP NOT NULL    -- when this row was computed
)
PARTITION BY day
CLUSTER BY event_name;

Partition by day for cheap pruning on dashboard reads. Cluster by event_name because it's the primary group/filter axis. At ~50 event types × 2 tiers × 2 environments = ~200 rows/day, this stays tiny indefinitely (~73K rows/year). Schema lives at tooling/schemas/bigquery-event-counts-daily.json (new file, BQ JSON schema format).

Aggregation MERGE (idempotent, runs every 24h)

sql

MERGE `lantern-app-${env}.analytics.event_counts_daily` T
USING (
  SELECT
    DATE(timestamp)         AS day,
    event_name,
    event_tier,
    environment,
    COUNT(*)                AS count,
    CURRENT_TIMESTAMP()     AS aggregated_at
  FROM `lantern-app-${env}.analytics.events`
  WHERE DATE(timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY day, event_name, event_tier, environment
) S
ON  T.day = S.day
AND T.event_name = S.event_name
AND T.event_tier = S.event_tier
AND T.environment = S.environment
WHEN MATCHED THEN UPDATE SET
  count = S.count,
  aggregated_at = S.aggregated_at
WHEN NOT MATCHED THEN INSERT (day, event_name, event_tier, environment, count, aggregated_at)
  VALUES (S.day, S.event_name, S.event_tier, S.environment, S.count, S.aggregated_at);

Why MERGE not INSERT: re-running is safe. The job can be replayed without producing duplicate rows. Why yesterday only: one full day is the smallest correct unit. Aggregating CURRENT_DATE() would lock in a partial number. Why these grouping columns: preserves dimensions admins are likely to filter on later (tier, env) without bloating the table.

Scheduled route: `POST /scheduled/aggregate-event-counts-daily`

New endpoint in services/api/analytics/src/routes/scheduled.js alongside the existing /daily-aggregation. Runs the MERGE once per invocation. Auth: existing scheduled-route auth (Cloud Scheduler invoker — same pattern as /daily-aggregation). Cloud Scheduler config: cron 0 1 * * * (01:00 UTC daily — late enough that yesterday's events have all landed; existing scheduler entries cluster around this window).

Logs structured pino entry on completion: { event: 'event_counts_aggregation', daysProcessed: 1, rowsAffected, durationMs, jobId }. Failures surface to Cloud Logging error-budget alerts (existing infra).

Backfill route: `POST /admin/aggregations/event-counts/backfill`

Sibling endpoint in routes/admin.js. Body: { days: number } (1–365). Runs the MERGE for the last N days in a single statement (loop unrolled into the SQL by adjusting the WHERE DATE(timestamp) BETWEEN … clause). Admin-only auth. Called manually after first deploy to populate historical data; reusable if re-aggregation is ever needed.

Backend: `packages/forge/query.js` (read-side runner)

New read-side companion to packages/forge/bigquery.js. Sole responsibility: execute a parameterized BQ query and return shaped results.

export async function runQuery({ sql, params }) {
  // Returns { rows, schema, jobMeta: { bytesProcessed, totalBytesBilled, cacheHit, jobId, durationMs }, generatedAt }
  // Logs structured pino entry: { event: 'bq_query_executed', queryName, params, userId, ...jobMeta, costUsd }
  // No cost gates this iteration — aggregate tables make them unnecessary. Gates land with raw-SQL iteration.
}

Why structured logging now even though aggregates make cost negligible: it builds the audit trail (who ran what, when) and the cost data sink that the future raw-SQL iteration's cost-association infrastructure will consume. Captures totalBytesBilled and computes costUsd = (totalBytesBilled / 1e12) * 5 (BQ on-demand price). One line per query, structured, exportable.

Backend: query registry

New module at services/api/analytics/src/services/queryRegistry.js. Plain object mapping queryName → { sql, paramSchema, description, source, audience }.

export const QUERY_REGISTRY = {
  event_counts: {
    description: 'Event types and counts over a chosen time window',
    source: 'aggregate',                     // vs 'raw' — informational this iteration; gates branch on it later
    audience: 'admin',                       // vs 'merchant' — endpoint-level enforcement
    paramSchema: { window: { enum: ['24h', '7d', '30d', 'all'], default: '7d' } },
    sql: ({ window }) => ({
      sql: `
        SELECT event_name, SUM(count) AS count
        FROM \`${getProjectId()}.analytics.event_counts_daily\`
        WHERE day >= @since
        GROUP BY event_name
        ORDER BY count DESC
        LIMIT 5000
      `,
      params: { since: resolveWindowToDate(window) },
    }),
  },
}

Adding the second/third/Nth canned report = adding an entry (and its aggregate table + aggregation job, if it needs one). No new endpoint, no new auth wiring, no new result-table code.

resolveWindowToDate: 24h → today−1, 7d → today−7, 30d → today−30, all → today−365 (aggregate only goes back as far as we've aggregated anyway, but the cap remains as a defensive belt).

Note: LIMIT 5000 here is a query author convention (the registry can decide per-entry), not a runner-enforced cap this iteration. Runner-level row-limit truncation lands with the raw-SQL iteration.

Endpoint: `POST /admin/bq-query`

In services/api/analytics/src/routes/admin.js. Auth via existing verifyFirebaseToken + requireRole (already wired for that route group).

POST chosen over GET: future raw-SQL mode requires bodies. Using POST consistently from day 1 keeps the contract stable. Server-side 60s memo cache gives GET-like cheapness regardless.

Request:

json

{ "queryName": "event_counts", "params": { "window": "7d" } }

Reserved for next iteration: { "sql": "SELECT …" }. This iteration rejects with 400 if sql is present.

Response:

json

{
  "queryName": "event_counts",
  "params": { "window": "7d" },
  "generatedAt": "2026-04-30T18:00:00.000Z",
  "rows": [{ "event_name": "lantern_lit", "count": 1247 }],
  "schema": [
    { "name": "event_name", "type": "STRING" },
    { "name": "count",      "type": "INTEGER" }
  ],
  "jobMeta": {
    "bytesProcessed": 12480,
    "totalBytesBilled": 10485760,
    "costUsd": 0.0000524,
    "cacheHit": false,
    "durationMs": 184
  }
}

Validation:

queryName must exist in registry — else 400.
params validated against paramSchema — else 400 with field-level error.
audience on registry entry must match the route's audience — else 403 (defense-in-depth for the future merchant route).
sql field present in body — 400 (this iteration).

Caching: Server-side memoize for 60s, keyed on (queryName, JSON.stringify(params)). Per Cloud Run instance.

Errors: 400 / 403 / 502, structured { error: { code, message, jobId? } }. No silent fallbacks.

IAM / service account

The Cloud Run SA running analytics-api already has BQ access for writes. For reads + the new aggregation MERGE we need:

roles/bigquery.dataViewer on analytics dataset (for canned-report reads)
roles/bigquery.dataEditor on analytics dataset (for the MERGE — needed whether for events or just event_counts_daily; cleanest to scope to dataset)
roles/bigquery.jobUser on the project (to run jobs)

Verified during plan execution: what the SA currently has. If it has dataEditor already, no change. If it has only dataViewer, add dataEditor on the dataset (needed for the MERGE writes to event_counts_daily). Not narrowing further this iteration — when raw-SQL ships, that path will use a separate, strictly-dataViewer SA so admin-authored queries can't write.

Frontend

Console tab (/analytics/bigquery/console) — working surface this iteration.

┌─────────────────────────────────────────┐
│  Saved Queries                          │
│  ─────────────────                      │
│  ▸ Event counts by type    [Run]        │  ← clickable card, expands to params
│    Window: [7 days ▾]                   │
│                                         │
│  ─────────────────                      │
│  Raw SQL editor — coming next iteration │  ← inline placeholder block
│  ─────────────────                      │
│                                         │
│  Results                                │
│  ─────────────────                      │
│  Event Type      ▼  Count               │  ← sortable
│  lantern_lit         1,247              │
│  wave_sent           312                │
│  …                                      │
│                                         │
│  Scanned: 12 KB · Billed: 10 MB · …     │
└─────────────────────────────────────────┘

Components:

Saved-query card list — driven by a small client-side manifest mirroring registry metadata (name, description, param schema). Initially one entry. Future entries auto-render.
Param form — generated from paramSchema. enum params use <StyledSelect> (CLAUDE.md rule). Defaults pre-applied.
Raw SQL placeholder block — visible inline, links to the follow-up issue once filed.
Result table — generic shared QueryResultTable keyed on response schema. Sortable client-side. Right-aligns numeric columns. Reused by raw-SQL mode later.
Job metadata footer — bytes processed, billed, cost, cache hit/miss, duration. Lightweight observability.
Refresh button — .btn-secondary.

State: local useState/useEffect. No new global state.

Reports tab — stays placeholder. Recurring/scheduled/delivered reports are a separate future iteration on top of the same registry.

Future seam: merchant metrics

Same runQuery, same registry pattern, but:

Merchant-scoped registry entries flagged audience: 'merchant' and take a merchantId param.
Mounted under a different route (e.g. /merchant/bq-query) with merchant-role auth.
Endpoint enforces audience matches the route — admin queries can't be invoked from merchant route, and vice versa.

Out of scope this iteration. Called out so the registry shape doesn't accidentally preclude it.

Testing

Layer	Test
`packages/forge/query.js`	Vitest unit, BQ client mocked. Param binding, schema mapping, error propagation, structured-log emission shape.
Query registry	Vitest unit. Validates `paramSchema`, SQL builders produce expected `{ sql, params }`, window→date resolution at boundaries. Parameterized test exercising every entry with valid + known-bad params (catches missing entries when registry grows).
Aggregation route	Vitest endpoint test, BQ client mocked. Runs MERGE, returns rows-affected, handles BQ errors.
Backfill route	Vitest endpoint test. Admin auth gate, days-range validation, MERGE invocation.
`/admin/bq-query` endpoint	Vitest endpoint test, registry + Forge mocked. Auth gate, queryName validation, param validation, audience enforcement, cache behavior, explicit reject of `{ sql }` payloads.
`QueryResultTable`	Vitest + RTL. Sort, numeric alignment, empty state.
Console tab	Vitest + RTL. Renders saved-query list, param form, runs query, displays results, shows raw-SQL placeholder. Mocked endpoint.
Storybook	Story for `QueryResultTable` with mocked rows + schema.

Coverage threshold (75%) on new code. Existing files stay above their current line.

Open questions

None blocking. Decisions locked:

Endpoint path & method: POST /admin/bq-query in routes/admin.js.
Aggregate table: analytics.event_counts_daily, partitioned by day, clustered by event_name.
Aggregation cadence: every 24h at 0 1 * * * UTC.
Aggregation strategy: MERGE for yesterday's events only (idempotent, replayable).
Default window for event_counts: 7 days. all capped at 365 days.
Cache TTL: 60s.
Result row hint per registry entry (LIMIT 5000 for event_counts); runner-enforced cap deferred to raw-SQL iteration.
This iteration: aggregate-backed only. Raw SQL + cost gates deferred.
Cost logging: structured pino on every query (audit + future cost-association substrate).

To verify during plan execution (non-blocking either way):

Current analytics-api SA's BQ roles (dataViewer vs dataEditor).
Cloud Scheduler config pattern used by existing /daily-aggregation (replicate exactly).

Workflow notes (process, not feature)

Branch: feat/<umbrella-N>-bq-query-infra once umbrella issue is filed.
Worktree: .claude/worktrees/bq-query-infra with .env.local symlinked from main checkout. Setup steps live in user's feedback_avoid_worktrees_for_sequential_prs.md memory.
PR: Not opened until user explicitly says so. Pre-PR pushes are CI-free per feedback_no_unprompted_push.md; PR-creation needs go-ahead.
Issues bundled: This single PR closes the new umbrella issue. Files a follow-up issue for the raw-SQL Console (next iteration). Does not claim to close #272, #332, #333 — those remain on their own tracks.

Risks & mitigations

Risk	Mitigation
Generic endpoint = generic risk surface	Registry-only-emits-SELECT this iteration; `{ sql }` payloads explicitly rejected; SQL written by registry author with parameterized queries (`@param`), never string-interpolated user input.
Aggregation job fails silently	Structured pino log on success/failure; failures surface via Cloud Logging error-budget alerts (existing infra). Re-running the job is safe (MERGE is idempotent).
Aggregation runs but yesterday's data still incomplete (timezone edge)	`0 1 * * *` UTC is well after midnight in any business-relevant timezone; events stream finalizes near-real-time. If late-arriving events become an issue, switch to `interval 2 days` lookback.
Aggregate table schema drifts from raw events	New event_name values flow through automatically (no schema change needed). New dimensions (e.g., user_agent) require a table-schema migration if we want to aggregate by them — accept and revisit when needed.
BQ cost in registry queries grows unexpectedly	Aggregate tables stay tiny (~73K rows/year). Even unbounded scans on `event_counts_daily` cost fractions of a cent. Real cost surface lives with raw-SQL iteration.
Future merchant route misused for admin queries	`audience` flag on registry entries; endpoint-level enforcement that audience matches route. Seam exists, not built.

Success criteria

After first deploy + manual backfill, an admin opens /analytics/bigquery/console, picks "Event counts by type", chooses a window, runs it, and sees a sorted table within ~1s on a warm cache.
The aggregation runs every 24h on schedule, populates event_counts_daily, and re-running it is provably safe (idempotent MERGE).
The endpoint enforces admin auth, rejects unknown queryNames, rejects raw { sql } payloads, and enforces audience matches route.
Adding a hypothetical second aggregate-backed report (e.g., "Active users by day") requires only: a new aggregate table + aggregation job + registry entry + client manifest entry. No changes to endpoint, runner, auth, or result-table code.
The inline raw-SQL placeholder on the Console tab tells the admin clearly that raw SQL is "coming next" without looking abandoned.
Per-query structured logs (event: 'bq_query_executed') are emitted with cost data, queryable in Cloud Logging, ready for cost-association infrastructure when it lands.
All new code is covered by tests; npm run validate passes.
The pattern documented in this spec is sufficient for the follow-up raw-SQL iteration to extend rather than redesign.

BigQuery Query Infrastructure (Admin Portal) — Design Spec ​

Goal ​

Architectural split (the core decision) ​

Non-goals (this iteration) ​

Existing surface area ​

Architecture ​

Aggregation infrastructure (new) ​

Aggregate table: analytics.event_counts_daily ​

Aggregation MERGE (idempotent, runs every 24h) ​

Scheduled route: POST /scheduled/aggregate-event-counts-daily ​

Backfill route: POST /admin/aggregations/event-counts/backfill ​

Backend: packages/forge/query.js (read-side runner) ​

Backend: query registry ​

Endpoint: POST /admin/bq-query ​

IAM / service account ​

Frontend ​

Future seam: merchant metrics ​

Testing ​

Open questions ​

Workflow notes (process, not feature) ​

Risks & mitigations ​

Success criteria ​