Skip to content

BigQuery Query Infrastructure (Admin Portal) β€” Design Spec ​

Date: 2026-04-30 Status: Draft, pending user review Owner: cattreedev Umbrella issue: to be filed (Admin: BigQuery query infrastructure (aggregate-backed canned reports))

Goal ​

Build infrastructure for surfacing BigQuery-derived metrics in the admin portal, with a clean architectural split between aggregate-backed canned reports (curated dashboards reading from small precomputed tables) and raw-events ad-hoc SQL (admin-authored, cost-gated). Land aggregate-backed canned reports end-to-end this iteration with one initial entry β€” event counts by type over a chosen window β€” proving the path. The raw-SQL editor and its cost-gate infrastructure are a deliberate follow-up. Same backend will later serve merchant-facing metrics.

Architectural split (the core decision) ​

Two distinct query modes with different backing strategies:

ModeReads fromCost profileCost gates neededThis iteration
Curated canned reportsPre-aggregated rollup tables (e.g., analytics.event_counts_daily)Near-zero, predictableNo β€” aggregates are tinyβœ… Built
Ad-hoc raw SQLRaw analytics.eventsVariable, can be expensiveYes β€” maximumBytesBilled, row caps, SQL-shape checks, global ceiling⏭️ Deferred

Why the split matters: cost gates only need to exist where humans author SQL. Curated dashboards are server-controlled, so concurrent admin viewers cost the same as one viewer (BQ caches table reads cheaply, and aggregates are small regardless). This collapses ~half the runtime guardrails out of this iteration's scope and crisply names what cost gates protect against.

Non-goals (this iteration) ​

  • Raw-SQL admin editor and the runtime infrastructure that supports it (maximumBytesBilled per-job caps, row-limit truncation, SQL-shape startswith check, global byte ceiling, dryRun cost estimation). All deferred to the follow-up that builds the editor.
  • Charting / visualization. Tables only. Sort/filter on the client.
  • Scheduled / exported / emailed reports. Reports tab stays placeholder.
  • Merchant-facing metrics. Architectural seam left for it; not built.
  • Sub-daily freshness. Aggregation runs every 24h. "Today's count" trails by up to 24h. Acceptable for the canned reports shipping now; revisit if a future report needs minute-level freshness.

Existing surface area ​

  • Data source: analytics.events per Firebase project. Schema in tooling/schemas/bigquery-events.json.
  • BQ writer: packages/forge/bigquery.js streams inserts via @google-cloud/bigquery.
  • Admin scaffolding: apps/admin/src/admin/analytics/BigQueryWorkspace.jsx routes /analytics/bigquery/{export,console,reports,retention} β€” all four currently render placeholder cards. This iteration fills in Console (its existing subtitle: "Run ad-hoc analytics queries and inspect event-level data" β€” the natural home for canned and eventually raw query running).
  • API service: services/api/analytics on Cloud Run, port 8082. Existing routes/admin.js hosts admin reads behind verifyFirebaseToken + requireRole. Existing routes/scheduled.js hosts cron-triggered jobs (e.g., /daily-aggregation) β€” this is where the new aggregation job lands.

Architecture ​

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  analytics.events   β”‚  ← raw events stream (Forge writes)
β”‚  (raw, partitioned) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚ MERGE every 24h (Cloud Scheduler β†’ analytics-api)
           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  analytics.event_counts_    β”‚  ← daily rollup, partitioned by day, clustered by event_name
β”‚  daily                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚ POST /admin/bq-query { queryName: 'event_counts', params: { window: '7d' } }
           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    JSON: { rows, schema, jobMeta, generatedAt }    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  analytics-api      β”‚  ─────────────────────────────────────────────►   β”‚  Admin portal    β”‚
β”‚  + query registry   β”‚  ◄─────────────────────────────────────────────   β”‚  Console tab     β”‚
β”‚  + Forge runner     β”‚                                                    β”‚                  β”‚
β”‚  + 60s memo cache   β”‚                                                    β”‚                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Aggregation infrastructure (new) ​

Aggregate table: analytics.event_counts_daily ​

sql
CREATE TABLE `lantern-app-${env}.analytics.event_counts_daily` (
  day            DATE      NOT NULL,
  event_name     STRING    NOT NULL,
  event_tier     STRING    NOT NULL,   -- 'auto' | 'registered'
  environment    STRING    NOT NULL,   -- 'production' | 'development'
  count          INT64     NOT NULL,
  aggregated_at  TIMESTAMP NOT NULL    -- when this row was computed
)
PARTITION BY day
CLUSTER BY event_name;

Partition by day for cheap pruning on dashboard reads. Cluster by event_name because it's the primary group/filter axis. At ~50 event types Γ— 2 tiers Γ— 2 environments = ~200 rows/day, this stays tiny indefinitely (~73K rows/year). Schema lives at tooling/schemas/bigquery-event-counts-daily.json (new file, BQ JSON schema format).

Aggregation MERGE (idempotent, runs every 24h) ​

sql
MERGE `lantern-app-${env}.analytics.event_counts_daily` T
USING (
  SELECT
    DATE(timestamp)         AS day,
    event_name,
    event_tier,
    environment,
    COUNT(*)                AS count,
    CURRENT_TIMESTAMP()     AS aggregated_at
  FROM `lantern-app-${env}.analytics.events`
  WHERE DATE(timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY day, event_name, event_tier, environment
) S
ON  T.day = S.day
AND T.event_name = S.event_name
AND T.event_tier = S.event_tier
AND T.environment = S.environment
WHEN MATCHED THEN UPDATE SET
  count = S.count,
  aggregated_at = S.aggregated_at
WHEN NOT MATCHED THEN INSERT (day, event_name, event_tier, environment, count, aggregated_at)
  VALUES (S.day, S.event_name, S.event_tier, S.environment, S.count, S.aggregated_at);

Why MERGE not INSERT: re-running is safe. The job can be replayed without producing duplicate rows. Why yesterday only: one full day is the smallest correct unit. Aggregating CURRENT_DATE() would lock in a partial number. Why these grouping columns: preserves dimensions admins are likely to filter on later (tier, env) without bloating the table.

Scheduled route: POST /scheduled/aggregate-event-counts-daily ​

New endpoint in services/api/analytics/src/routes/scheduled.js alongside the existing /daily-aggregation. Runs the MERGE once per invocation. Auth: existing scheduled-route auth (Cloud Scheduler invoker β€” same pattern as /daily-aggregation). Cloud Scheduler config: cron 0 1 * * * (01:00 UTC daily β€” late enough that yesterday's events have all landed; existing scheduler entries cluster around this window).

Logs structured pino entry on completion: { event: 'event_counts_aggregation', daysProcessed: 1, rowsAffected, durationMs, jobId }. Failures surface to Cloud Logging error-budget alerts (existing infra).

Backfill route: POST /admin/aggregations/event-counts/backfill ​

Sibling endpoint in routes/admin.js. Body: { days: number } (1–365). Runs the MERGE for the last N days in a single statement (loop unrolled into the SQL by adjusting the WHERE DATE(timestamp) BETWEEN … clause). Admin-only auth. Called manually after first deploy to populate historical data; reusable if re-aggregation is ever needed.

Backend: packages/forge/query.js (read-side runner) ​

New read-side companion to packages/forge/bigquery.js. Sole responsibility: execute a parameterized BQ query and return shaped results.

js
export async function runQuery({ sql, params }) {
  // Returns { rows, schema, jobMeta: { bytesProcessed, totalBytesBilled, cacheHit, jobId, durationMs }, generatedAt }
  // Logs structured pino entry: { event: 'bq_query_executed', queryName, params, userId, ...jobMeta, costUsd }
  // No cost gates this iteration β€” aggregate tables make them unnecessary. Gates land with raw-SQL iteration.
}

Why structured logging now even though aggregates make cost negligible: it builds the audit trail (who ran what, when) and the cost data sink that the future raw-SQL iteration's cost-association infrastructure will consume. Captures totalBytesBilled and computes costUsd = (totalBytesBilled / 1e12) * 5 (BQ on-demand price). One line per query, structured, exportable.

Backend: query registry ​

New module at services/api/analytics/src/services/queryRegistry.js. Plain object mapping queryName β†’ { sql, paramSchema, description, source, audience }.

js
export const QUERY_REGISTRY = {
  event_counts: {
    description: 'Event types and counts over a chosen time window',
    source: 'aggregate',                     // vs 'raw' β€” informational this iteration; gates branch on it later
    audience: 'admin',                       // vs 'merchant' β€” endpoint-level enforcement
    paramSchema: { window: { enum: ['24h', '7d', '30d', 'all'], default: '7d' } },
    sql: ({ window }) => ({
      sql: `
        SELECT event_name, SUM(count) AS count
        FROM \`${getProjectId()}.analytics.event_counts_daily\`
        WHERE day >= @since
        GROUP BY event_name
        ORDER BY count DESC
        LIMIT 5000
      `,
      params: { since: resolveWindowToDate(window) },
    }),
  },
}

Adding the second/third/Nth canned report = adding an entry (and its aggregate table + aggregation job, if it needs one). No new endpoint, no new auth wiring, no new result-table code.

resolveWindowToDate: 24h β†’ todayβˆ’1, 7d β†’ todayβˆ’7, 30d β†’ todayβˆ’30, all β†’ todayβˆ’365 (aggregate only goes back as far as we've aggregated anyway, but the cap remains as a defensive belt).

Note: LIMIT 5000 here is a query author convention (the registry can decide per-entry), not a runner-enforced cap this iteration. Runner-level row-limit truncation lands with the raw-SQL iteration.

Endpoint: POST /admin/bq-query ​

In services/api/analytics/src/routes/admin.js. Auth via existing verifyFirebaseToken + requireRole (already wired for that route group).

POST chosen over GET: future raw-SQL mode requires bodies. Using POST consistently from day 1 keeps the contract stable. Server-side 60s memo cache gives GET-like cheapness regardless.

Request:

json
{ "queryName": "event_counts", "params": { "window": "7d" } }

Reserved for next iteration: { "sql": "SELECT …" }. This iteration rejects with 400 if sql is present.

Response:

json
{
  "queryName": "event_counts",
  "params": { "window": "7d" },
  "generatedAt": "2026-04-30T18:00:00.000Z",
  "rows": [{ "event_name": "lantern_lit", "count": 1247 }],
  "schema": [
    { "name": "event_name", "type": "STRING" },
    { "name": "count",      "type": "INTEGER" }
  ],
  "jobMeta": {
    "bytesProcessed": 12480,
    "totalBytesBilled": 10485760,
    "costUsd": 0.0000524,
    "cacheHit": false,
    "durationMs": 184
  }
}

Validation:

  • queryName must exist in registry β€” else 400.
  • params validated against paramSchema β€” else 400 with field-level error.
  • audience on registry entry must match the route's audience β€” else 403 (defense-in-depth for the future merchant route).
  • sql field present in body β€” 400 (this iteration).

Caching: Server-side memoize for 60s, keyed on (queryName, JSON.stringify(params)). Per Cloud Run instance.

Errors: 400 / 403 / 502, structured { error: { code, message, jobId? } }. No silent fallbacks.

IAM / service account ​

The Cloud Run SA running analytics-api already has BQ access for writes. For reads + the new aggregation MERGE we need:

  • roles/bigquery.dataViewer on analytics dataset (for canned-report reads)
  • roles/bigquery.dataEditor on analytics dataset (for the MERGE β€” needed whether for events or just event_counts_daily; cleanest to scope to dataset)
  • roles/bigquery.jobUser on the project (to run jobs)

Verified during plan execution: what the SA currently has. If it has dataEditor already, no change. If it has only dataViewer, add dataEditor on the dataset (needed for the MERGE writes to event_counts_daily). Not narrowing further this iteration β€” when raw-SQL ships, that path will use a separate, strictly-dataViewer SA so admin-authored queries can't write.

Frontend ​

Console tab (/analytics/bigquery/console) β€” working surface this iteration.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Saved Queries                          β”‚
β”‚  ─────────────────                      β”‚
β”‚  β–Έ Event counts by type    [Run]        β”‚  ← clickable card, expands to params
β”‚    Window: [7 days β–Ύ]                   β”‚
β”‚                                         β”‚
β”‚  ─────────────────                      β”‚
β”‚  Raw SQL editor β€” coming next iteration β”‚  ← inline placeholder block
β”‚  ─────────────────                      β”‚
β”‚                                         β”‚
β”‚  Results                                β”‚
β”‚  ─────────────────                      β”‚
β”‚  Event Type      β–Ό  Count               β”‚  ← sortable
β”‚  lantern_lit         1,247              β”‚
β”‚  wave_sent           312                β”‚
β”‚  …                                      β”‚
β”‚                                         β”‚
β”‚  Scanned: 12 KB Β· Billed: 10 MB Β· …     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Components:

  • Saved-query card list β€” driven by a small client-side manifest mirroring registry metadata (name, description, param schema). Initially one entry. Future entries auto-render.
  • Param form β€” generated from paramSchema. enum params use <StyledSelect> (CLAUDE.md rule). Defaults pre-applied.
  • Raw SQL placeholder block β€” visible inline, links to the follow-up issue once filed.
  • Result table β€” generic shared QueryResultTable keyed on response schema. Sortable client-side. Right-aligns numeric columns. Reused by raw-SQL mode later.
  • Job metadata footer β€” bytes processed, billed, cost, cache hit/miss, duration. Lightweight observability.
  • Refresh button β€” .btn-secondary.

State: local useState/useEffect. No new global state.

Reports tab β€” stays placeholder. Recurring/scheduled/delivered reports are a separate future iteration on top of the same registry.

Future seam: merchant metrics ​

Same runQuery, same registry pattern, but:

  • Merchant-scoped registry entries flagged audience: 'merchant' and take a merchantId param.
  • Mounted under a different route (e.g. /merchant/bq-query) with merchant-role auth.
  • Endpoint enforces audience matches the route β€” admin queries can't be invoked from merchant route, and vice versa.

Out of scope this iteration. Called out so the registry shape doesn't accidentally preclude it.

Testing ​

LayerTest
packages/forge/query.jsVitest unit, BQ client mocked. Param binding, schema mapping, error propagation, structured-log emission shape.
Query registryVitest unit. Validates paramSchema, SQL builders produce expected { sql, params }, window→date resolution at boundaries. Parameterized test exercising every entry with valid + known-bad params (catches missing entries when registry grows).
Aggregation routeVitest endpoint test, BQ client mocked. Runs MERGE, returns rows-affected, handles BQ errors.
Backfill routeVitest endpoint test. Admin auth gate, days-range validation, MERGE invocation.
/admin/bq-query endpointVitest endpoint test, registry + Forge mocked. Auth gate, queryName validation, param validation, audience enforcement, cache behavior, explicit reject of { sql } payloads.
QueryResultTableVitest + RTL. Sort, numeric alignment, empty state.
Console tabVitest + RTL. Renders saved-query list, param form, runs query, displays results, shows raw-SQL placeholder. Mocked endpoint.
StorybookStory for QueryResultTable with mocked rows + schema.

Coverage threshold (75%) on new code. Existing files stay above their current line.

Open questions ​

None blocking. Decisions locked:

  • Endpoint path & method: POST /admin/bq-query in routes/admin.js.
  • Aggregate table: analytics.event_counts_daily, partitioned by day, clustered by event_name.
  • Aggregation cadence: every 24h at 0 1 * * * UTC.
  • Aggregation strategy: MERGE for yesterday's events only (idempotent, replayable).
  • Default window for event_counts: 7 days. all capped at 365 days.
  • Cache TTL: 60s.
  • Result row hint per registry entry (LIMIT 5000 for event_counts); runner-enforced cap deferred to raw-SQL iteration.
  • This iteration: aggregate-backed only. Raw SQL + cost gates deferred.
  • Cost logging: structured pino on every query (audit + future cost-association substrate).

To verify during plan execution (non-blocking either way):

  • Current analytics-api SA's BQ roles (dataViewer vs dataEditor).
  • Cloud Scheduler config pattern used by existing /daily-aggregation (replicate exactly).

Workflow notes (process, not feature) ​

  • Branch: feat/<umbrella-N>-bq-query-infra once umbrella issue is filed.
  • Worktree: .claude/worktrees/bq-query-infra with .env.local symlinked from main checkout. Setup steps live in user's feedback_avoid_worktrees_for_sequential_prs.md memory.
  • PR: Not opened until user explicitly says so. Pre-PR pushes are CI-free per feedback_no_unprompted_push.md; PR-creation needs go-ahead.
  • Issues bundled: This single PR closes the new umbrella issue. Files a follow-up issue for the raw-SQL Console (next iteration). Does not claim to close #272, #332, #333 β€” those remain on their own tracks.

Risks & mitigations ​

RiskMitigation
Generic endpoint = generic risk surfaceRegistry-only-emits-SELECT this iteration; { sql } payloads explicitly rejected; SQL written by registry author with parameterized queries (@param), never string-interpolated user input.
Aggregation job fails silentlyStructured pino log on success/failure; failures surface via Cloud Logging error-budget alerts (existing infra). Re-running the job is safe (MERGE is idempotent).
Aggregation runs but yesterday's data still incomplete (timezone edge)0 1 * * * UTC is well after midnight in any business-relevant timezone; events stream finalizes near-real-time. If late-arriving events become an issue, switch to interval 2 days lookback.
Aggregate table schema drifts from raw eventsNew event_name values flow through automatically (no schema change needed). New dimensions (e.g., user_agent) require a table-schema migration if we want to aggregate by them β€” accept and revisit when needed.
BQ cost in registry queries grows unexpectedlyAggregate tables stay tiny (~73K rows/year). Even unbounded scans on event_counts_daily cost fractions of a cent. Real cost surface lives with raw-SQL iteration.
Future merchant route misused for admin queriesaudience flag on registry entries; endpoint-level enforcement that audience matches route. Seam exists, not built.

Success criteria ​

  1. After first deploy + manual backfill, an admin opens /analytics/bigquery/console, picks "Event counts by type", chooses a window, runs it, and sees a sorted table within ~1s on a warm cache.
  2. The aggregation runs every 24h on schedule, populates event_counts_daily, and re-running it is provably safe (idempotent MERGE).
  3. The endpoint enforces admin auth, rejects unknown queryNames, rejects raw { sql } payloads, and enforces audience matches route.
  4. Adding a hypothetical second aggregate-backed report (e.g., "Active users by day") requires only: a new aggregate table + aggregation job + registry entry + client manifest entry. No changes to endpoint, runner, auth, or result-table code.
  5. The inline raw-SQL placeholder on the Console tab tells the admin clearly that raw SQL is "coming next" without looking abandoned.
  6. Per-query structured logs (event: 'bq_query_executed') are emitted with cost data, queryable in Cloud Logging, ready for cost-association infrastructure when it lands.
  7. All new code is covered by tests; npm run validate passes.
  8. The pattern documented in this spec is sufficient for the follow-up raw-SQL iteration to extend rather than redesign.

Built with VitePress