BigQuery Billing Infrastructure — Plan

Last Updated: 2026-05-05
Status: Planning (pre-implementation)
Author: drafted by Claude on claude/bigquery-billing-infrastructure-IZf4R

Related Issues / PRs:

#52 — Billing Monitoring (canonical spec for the BQ layout)
#233 — System Health: Add Billing/Cost Data (Phase 2)
#302 — Phase 2 historical snapshots + 1hr cache + BQ pipeline (open draft, partial impl)
#240 — AI assistant billing tracking (originally filed as "OpenAI billing"; the codebase now uses Anthropic — issue title may need a follow-up rename)
#238 — Resend billing integration
#188 — AI usage transparency
docs/plans/2026-02-09_cloud-run-migration-optimized.md — Phase 2B references the combined analytics-api + billing-api Cloud Run deployment

1. Why this doc

Issue #52 describes a full multi-vendor BigQuery billing pipeline (raw → normalized → attributed → marts + ops). PR #302 ships a much narrower first slice: a single denormalized billing_snapshots table written daily by a scheduled Cloud Function.

We have two artefacts pointing in roughly the same direction but at very different levels of ambition, and no single place that says "here is where we are, here is where we want to end up, here is the path between them." This doc is that place. It is not a coding spec — it is a roadmap so the next implementation PR has a clear scope and the team has a shared mental model of the end state.

2. Current state (what already exists in `dev`)

Code in tree

Layer	Location	What it does
Per-vendor billing fetch	`services/functions/firebase/modules/billing.js`	On-demand pulls for GCP, Cloudflare, Railway, GitHub. Includes `getGcpBillingWithBigQuery()` which queries `billing_export_gcp.gcp_billing_export` + `cloud_pricing_export`.
Vendor API helpers	`services/functions/firebase/modules/billingShared.js`	Cloudflare GraphQL, Railway GraphQL, GitHub Actions usage.
System Health surface	`services/functions/firebase/modules/systemHealth.js` + `apps/admin/src/components/SystemHealth.jsx`	5-min health cache, 1-hr separate billing cache, MoM tile (PR #302).
BQ client (events)	`packages/forge/bigquery.js`	Streaming inserts to `analytics.events`.
BQ schema browser	`services/api/analytics/src/services/bqSchema.service.js`	Metadata-only, 5-min cache.
Daily MERGE rollup pattern	`services/api/analytics/src/services/eventCountsAggregation.service.js`	Reference pattern for "ingest-then-rollup" jobs.

Datasets that exist today (per env: `lantern-app-dev`, `lantern-app-prod`)

analytics — events, event_counts_daily (events product, not billing)
billing_export_gcp — gcp_billing_export, cloud_pricing_export (managed by Google Cloud Billing export, always-on, schema owned by Google). This stays exactly as-is — the export is the right way to capture GCP cost and we don't modify it. Our pipeline reads from it; it does not read from us.
(PR #302 adds) billing_export_gcp.billing_snapshots — auto-created on first scheduled-function run, daily denormalized rows, MONTH-partitioned. Co-locates app-owned data inside Google's managed-export dataset, which is the bit we want to fix in Phase 0 (see §6).

What PR #302 is doing right

Captures a daily historical record from day one (the issue-#52 comment correctly flags that historical billing data cannot be retroactively captured)
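Idempotent in intent (insertId = YYYY-MM-DD). Note that BigQuery treats insertId de-duplication on streaming inserts as best-effort, so this guards against quick retries rather than guaranteeing exactly one row per day.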
Non-fatal BigQuery write — Firestore fast-cache is the fallback
Auto-creates the table with month partitioning + handles the 409 race
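
The auto-create-and-tolerate-409 behaviour could look roughly like the sketch below. This is not PR #302's actual code: `ensureMonthPartitionedTable` and the `snapshot_date` partition column are illustrative assumptions, and `dataset` is only expected to behave like a `@google-cloud/bigquery` Dataset (a `createTable(id, options)` method that rejects with `err.code === 409` when the table already exists).

```javascript
// Hypothetical helper: create the table if missing, treat "already exists" as success.
async function ensureMonthPartitionedTable(dataset, tableId, schema) {
  const options = {
    schema,
    // MONTH partitioning on an assumed DATE column named snapshot_date.
    timePartitioning: { type: "MONTH", field: "snapshot_date" },
  };
  try {
    await dataset.createTable(tableId, options);
    return "created";
  } catch (err) {
    // Two concurrent first runs can both attempt the create; the loser gets a 409.
    if (err && err.code === 409) return "exists";
    throw err; // anything else (403, 5xx, ...) is a real failure
  }
}
```

Swallowing only the 409 keeps the race harmless while still surfacing permission or quota errors to the caller, which can then fall back to the Firestore cache.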

What PR #302 leaves on the table

All vendors squashed into one row, one table (cloudflare_cost_usd, railway_cost_usd, …) — fine as a snapshot, useless as a fact table
No GCS landing zone, no raw vendor payloads — just totals
No taxonomy / attribution layer — can't slice by env / app / feature
No anomaly / freshness / pipeline-runs ops tables
Lives in the Firebase Functions module rather than a dedicated billing-api service or Cloud Run job — same place where on-demand fetches happen
Reuses billing_export_gcp (a GCP-managed export dataset) for our app-managed billing_snapshots — convenient but conflates two ownership models

3. Target architecture (issue #52, restated)

Cloud Scheduler
  ├── Cloud Run Job: ingest-cloudflare ─┐
  ├── Cloud Run Job: ingest-railway     ├─→ GCS (raw payloads, optional)
  ├── Cloud Run Job: ingest-anthropic   │       │
  ├── Cloud Run Job: ingest-resend      │       ▼
  └── (always on) GCP Billing Export ───┴─→ BigQuery: billing_raw.*

       Dataform (or dbt) on schedule:
         billing_raw.*
           → billing_norm.fact_cost_line_items     (one canonical schema)
           → billing_attrib.fact_cost_attributed   (env / app / team labels)
           → billing_marts.*                        (BI rollups)

       Cloud Run Job: cost-health-check
         → ops.pipeline_runs / data_freshness / cost_anomalies
         → Slack / email alerts

       Reporting:
         Looker Studio + Admin Console → billing_marts.*

Datasets

Dataset	Purpose	Write pattern
`billing_raw`	Immutable, vendor-shaped raw rows	Insert-only by ingest jobs / managed exports
`billing_norm`	Canonical `fact_cost_line_items` (one row = one billable cost event)	Built by Dataform/dbt from `billing_raw`
`billing_attrib`	`fact_cost_attributed` + `dim_taxonomy`, `dim_cost_allocation_rules`	Built by Dataform/dbt; rules-driven
`billing_marts`	Daily / monthly rollups for dashboards	MERGE jobs (cf. `eventCountsAggregation.service.js`)
`ops`	`pipeline_runs`, `data_freshness`, `cost_anomalies`	Written by every ingest + health-check job
`billing_sandbox` (optional)	Backfills, ad-hoc analysis	Free-form, throwaway

Naming: fact_* for event-grain, dim_* for reference, daily_* / monthly_* for marts, everything in ops.* for observability. (Per issue #52.)

4. Categories of cost question

A useful organizing axis for the rest of this plan is the kind of question the data needs to answer. Five primary categories, with sub-categories where they're meaningfully different shapes. Sub-categories within a primary share fidelity requirements and source datasets — they're slices, not separate products.

Throughout this doc and the codebase, "merchant" is the project nomenclature for the paying business (venue owner). "User" means the Lantern end-user (consumer). Sub-categories 4.1 (cost-per-user) and 4.2 (cost-per-merchant) reflect that split.

Sub	Question	Source dataset(s)
1	Cost-over-time — "What did we spend last month? Show MoM trend."	`billing_marts.daily_cost_by_vendor` (PR #302's snapshot is a v0 of this)
2.1	Cost-by-service / SKU — "Within GCP, what's costing most?"	`billing_norm.fact_cost_line_items`
2.2	Cost-by-resource — "Which Cloud Run service / Cloudflare zone / Firestore collection is driving spend?"	`billing_norm.fact_cost_line_items` GROUP BY `resource_id`
3.1	Cost-by-environment — "How much is dev vs prod?"	`billing_attrib.fact_cost_attributed`
3.2	Cost-by-app — "web vs api vs admin vs functions vs merchant-portal?"	`billing_attrib.fact_cost_attributed`
3.3	Cost-by-team / cost-center — deferred, single-team org. Documented for future reference; no implementation planned.	(future)
4.1	Cost-per-user — usage-cost statistics for Lantern end-users (consumers). Not for revenue margin (no consumer revenue today) — for understanding consumer-side cost trends as we scale.	`billing_attrib.*` ⋈ `analytics.events` filtered to consumer events
4.2	Cost-per-merchant — highest-priority unit-economics view. Feeds the merchant pricing model in `docs/economics/ECONOMICS.md`.	`billing_attrib.*` ⋈ `analytics.events` keyed on `merchant_id`
4.3	Cost-per-assistant-interaction — Anthropic API cost ÷ assistant message events. Tells us if conversation costs are sustainable per-merchant.	`billing_attrib.*` (Anthropic rows) ⋈ `analytics.events WHERE event_type = 'assistant.message'`
4.4	Cost-per-product-event — cost-per-lantern-lit / cost-per-offer-redemption / cost-per-venue-refresh. Specific events picked at Phase 4 once we know which KPIs the team is tracking.	`billing_attrib.*` ⋈ `analytics.events` filtered to product KPIs
5.1	Anomaly detection — "Did something just get more expensive?" z-score baseline alerting.	`ops.cost_anomalies`
5.2	Forecasting — "Are we on track to exceed budget this month?"	`billing_marts.forecast_month_end`
5.3	Pipeline freshness — "Did every vendor land a row today?"	`ops.data_freshness`, `ops.pipeline_runs`
5.4	Attribution coverage — "What % of cost is unattributed? Are our rules decaying?"	`billing_marts.unattributed_cost_daily`
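
As an illustration of sub-category 5.3, the "did every vendor land a row today?" question reduces to comparing each vendor's latest landed date against today. The sketch below assumes a hypothetical input shape (a map of vendor to latest `YYYY-MM-DD` date, as might be derived from `ops.pipeline_runs`); the function name is not from the codebase.

```javascript
// Given the latest landed date per vendor, return the vendors that are stale for `today`.
function findStaleVendors(expectedVendors, latestDateByVendor, today) {
  return expectedVendors.filter((vendor) => {
    const latest = latestDateByVendor[vendor]; // "YYYY-MM-DD" or undefined if never landed
    // ISO dates sort lexicographically, so plain string comparison is enough.
    return latest === undefined || latest < today;
  });
}
```

For example, `findStaleVendors(["cloudflare", "railway"], { cloudflare: "2026-05-05" }, "2026-05-05")` returns `["railway"]`, because Railway has no landed row at all.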

Why this matters for the rollout

The arc is monotonic in fidelity — each primary category needs everything the previous one needed plus more. That maps almost 1:1 to the phases in §6:

Phase 1 enables Cat 1 (vendor totals over time) and 5.3 (pipeline freshness comes for free with ops.pipeline_runs).
Phase 2 enables 2.1, 2.2, 3.1, 3.2 for one vendor + 5.4 (attribution coverage starts being measurable as soon as attribution runs).
Phase 3 enables full consumption of Cats 1-3 via the admin surface + 5.2 (forecast can be built on MTD attributed cost).
Phase 4 fans out remaining vendors and lights up Cat 4 (event-joined unit economics needs attributed data on multiple vendors to be meaningful) + 5.1 (anomaly detection needs ≥14d trailing baseline, which lands here).

Cat 5 sub-categories are not their own phase — they're folded into Phases 1-4 by their natural prerequisites. The original plan parked all of Cat 5 in a final Phase 5; pulling them forward gives the team optimization-targeting signal (5.1 alongside the Cat 2/3 cost-driver views) at the same time as the visibility layer, not after it. That matters because "where is it expensive?" is one of the highest-value uses of cost data, not a polish item.

5. Gap analysis (current → target)

Capability	Current	Target	Gap
GCP cost capture	Managed export ✅	Same	None
Cloudflare cost capture	On-demand fetch + daily snapshot total	Daily ingest into `billing_raw.cloudflare_invoice_line_items_raw`	Need ingest job, raw schema, GCS landing
Railway cost capture	Same as Cloudflare	Same	Same
Anthropic cost capture	None (#240 open — originally OpenAI, now Anthropic)	Daily ingest from Anthropic Console API, line-item grain	Whole vertical
Resend cost capture	None (#238 open)	Same	Whole vertical
GitHub cost capture	On-demand fetch + daily snapshot total	Daily ingest (Actions usage + Copilot)	Need ingest job
Canonical fact table	Single denormalized snapshot row	`billing_norm.fact_cost_line_items`	Need normalization layer (Dataform/dbt)
Attribution (env / app / team)	None	`billing_attrib.fact_cost_attributed`	Whole layer + taxonomy + rules
Rollups	Snapshot-grain only	`billing_marts.daily_`, `monthly_`, `top_cost_drivers_30d`, `unattributed_cost_daily`, `cost_trend_rolling_7d`	Whole layer
Pipeline observability	App logs only	`ops.pipeline_runs`, `data_freshness`, `cost_anomalies`	Whole layer
Alerting	None	Slack/email on freshness lag + anomaly	Need alerting hooks
Backfill story	N/A (snapshots are daily-going-forward only)	`billing_sandbox.backfill_*` for vendors that expose history	Need per-vendor backfill scripts
Reporting surface	System Health "Billing" tab	+ admin Cost dashboard (Looker Studio dropped per D7 in §8)	Need marts → admin endpoints
Service home	Cloud Functions	Cloud Run job(s) + `billing-api` service	Need migration (aligns with Phase 2B in cloud-run plan)

6. Phased rollout

Five phases (numbered 0-4). Each phase is independently shippable, leaves the system in a working state, and unlocks specific cost-question categories from §4. PR #302 is roughly two-thirds of Phase 0; land or rebase it before starting Phase 1.

The original plan had a Phase 5 for ops/alerting/forecasting; that phase has been dissolved and its sub-categories (5.1-5.4) folded into Phases 1-4 by their natural prerequisites. See the note at the end of this section.

Phase 0 — Land what's in flight (PR #302) — this week

Resolve PR #302 review and merge (or close + cherry-pick the table-creation helper into Phase 1).
Move the billing_snapshots table out of billing_export_gcp and into a new app-owned dataset billing_app. Rationale: billing_export_gcp is managed by Google Cloud Billing export — its schema, partitioning, and retention policies are controlled by Google. Co-locating our app-written tables inside it conflates two ownership models and risks us having to re-architect later if Google changes export semantics. Hygiene only — the GCP export itself is automatic, correct, and untouched.
Categories enabled: rough Cat 1 at snapshot grain (totals only, no service / resource / attribution slicing yet).
Action items:
1. Decide: keep snapshot table or treat it as throwaway once Phase 1 lands? (Recommendation: keep — it's a useful sanity check vs. the normalized fact table and provides graceful degradation if ingest jobs lag.)
2. Create billing_app dataset in dev + prod, move the table, update BIGQUERY_BILLING_DATASET env (or split into two env vars: one pointing at the Google-managed export, one at the app-owned dataset).

Phase 1 — Stand up `billing_raw` + first ingest job — 2 weeks

Goal: one Cloud Run Job, one vendor (start with Cloudflare — has the cleanest GraphQL billing API and we already have the client code in billingShared.js).

Create billing_raw dataset in dev + prod (manual bq mk, document in docs/economics/billing/PLATFORMS.md).
New Cloud Run Job: services/jobs/ingest-cloudflare/ (mirrors the structure of services/api/*). Daily Cloud Scheduler trigger. Idempotent on (snapshot_date, line_item_id).
Optional GCS landing: gs://lantern-billing-raw/cloudflare/YYYY/MM/DD/*.json (skip in Phase 1 if it complicates things — billing_raw rows are sufficient for an audit trail, and we can backfill GCS later from the raw table).
New ops.pipeline_runs table — every ingest job writes a row at start + end. This single table is also the source for §4 sub-category 5.3 (Pipeline freshness) — cheap to add at this phase, pays off immediately.
Categories enabled: full Cat 1 for Cloudflare (line-item-grain rather than just totals); 5.3 Pipeline freshness.
Done when: billing_raw.cloudflare_invoice_line_items_raw has 7+ days of rows and ops.pipeline_runs shows 7 successful executions.
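
The "row at start + end" convention for `ops.pipeline_runs` can be sketched as a small wrapper around the ingest function. Everything here is an assumption for illustration: the wrapper name, the column names other than `error`, and the injected `writeRow` function (which in practice would insert into `ops.pipeline_runs`).

```javascript
// Hypothetical wrapper: records a "started" row, runs the job, then records the outcome.
async function runWithPipelineLog(jobName, writeRow, job, now = () => new Date().toISOString()) {
  const runId = `${jobName}:${now()}`;
  await writeRow({ run_id: runId, job: jobName, status: "started", at: now(), error: null });
  try {
    const rowsWritten = await job(); // the job resolves to the number of rows it landed
    await writeRow({ run_id: runId, job: jobName, status: "succeeded", at: now(), rows_written: rowsWritten, error: null });
    return rowsWritten;
  } catch (err) {
    // Fail loudly: persist the error text, then rethrow so the Cloud Run Job exits non-zero.
    await writeRow({ run_id: runId, job: jobName, status: "failed", at: now(), error: String(err.message || err) });
    throw err;
  }
}
```

A run that has a "started" row but no matching end row is itself a useful signal: it indicates the job crashed or timed out before it could report.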

Phase 1 implementation approach (locked)

No Terraform. Provision dataset, Cloud Run Job, and Scheduler trigger via bq mk / gcloud run jobs deploy / gcloud scheduler jobs create steps in the existing .github/workflows/deploy-{dev,prod}.yml pipeline. Matches the convention already in use; revisit Terraform if infra grows.
IAM bindings included defensively. The Phase 1 workflow runs gcloud projects add-iam-policy-binding for the WIF service account on each role we need (roles/bigquery.dataOwner on the new datasets, roles/run.developer, roles/cloudscheduler.admin, roles/secretmanager.secretAccessor). These bindings are idempotent, so they are safe to run even if the roles are already granted.
Cloudflare token scope verification. PR #302's source issue (#233) flagged that Account → Billing → Read may need to be added to the existing CLOUDFLARE_API_TOKEN. The ingest job will fail loudly on a 401/403 from the billing endpoint and emit a clear error to ops.pipeline_runs.error. Verification step is captured in the Phase 1 PR's "before merge" checklist.
Authentication. The Cloud Run Job runs as the same WIF service account the deploy uses; reads secrets directly from Secret Manager via the existing defineSecret pattern (services/functions/firebase/config.js).

Phase 2 — Normalize + attribute (one vendor) — 2 weeks

Goal: prove the Dataform/dbt layer with one vendor before fanning out.

Transform tool: Dataform rather than dbt (D1 in §8, decided). Dataform is native to GCP, needs no extra infra, and is free for the foreseeable future. dbt is more powerful but adds a runner (dbt Cloud or self-hosted) we don't need.
Define billing_norm.fact_cost_line_items schema (vendor-agnostic columns: vendor, service, sku, usage_amount, usage_unit, cost_usd, usage_start, usage_end, resource_id, labels JSON, …).
First Dataform model: cloudflare_invoice_line_items_raw → fact_cost_line_items.
Define dim_taxonomy (env, app, service for v1 per D6 in §8; team and cost_center deferred). Start with three values per dimension, expand later.
Define dim_cost_allocation_rules — initial rules: GCP project-id label → env (*-dev → dev, *-prod → prod); Cloudflare zone → app (lantern-app.com → web, api.lantern-app.com → api).
First attribution model: fact_cost_line_items → fact_cost_attributed.
Add a derived view billing_marts.unattributed_cost_daily (everything in fact_cost_line_items not in fact_cost_attributed for the same period). Sources §4 sub-category 5.4 (Attribution coverage).
Categories enabled: 2.1 (cost-by-service / SKU), 2.2 (cost-by-resource), 3.1 (cost-by-environment), 3.2 (cost-by-app) — all for Cloudflare; 5.4 Attribution coverage.
Done when: a SQL query against fact_cost_attributed returns Cloudflare costs sliced by env + app, matching the totals in billing_snapshots to within 1%.
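
To make the rules-driven attribution concrete, here is a minimal in-memory sketch of how the initial rules and the attribution-coverage figure could behave. In the plan itself this logic lives in Dataform SQL, not JavaScript; the rule shape (`dimension`, `field`, `pattern`, `value`), the function names, and the reading of "unattributed" as "no rule matched at all" are assumptions for illustration.

```javascript
// Hypothetical rule shape; "*" in a pattern matches any run of characters.
const rules = [
  { dimension: "env", field: "project_id", pattern: "*-dev", value: "dev" },
  { dimension: "env", field: "project_id", pattern: "*-prod", value: "prod" },
  { dimension: "app", field: "zone", pattern: "lantern-app.com", value: "web" },
  { dimension: "app", field: "zone", pattern: "api.lantern-app.com", value: "api" },
];

function globToRegExp(pattern) {
  // Escape regex metacharacters except "*", then turn "*" into ".*".
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

// Returns the labels assigned by the first matching rule per dimension.
function attribute(lineItem, ruleList) {
  const labels = {};
  for (const rule of ruleList) {
    const fieldValue = lineItem[rule.field];
    if (labels[rule.dimension] === undefined && typeof fieldValue === "string" &&
        globToRegExp(rule.pattern).test(fieldValue)) {
      labels[rule.dimension] = rule.value;
    }
  }
  return labels;
}

// Share of cost (0..1) that no rule matched; the quantity behind the coverage view.
function unattributedShare(lineItems, ruleList) {
  let total = 0;
  let unattributed = 0;
  for (const item of lineItems) {
    total += item.cost_usd;
    if (Object.keys(attribute(item, ruleList)).length === 0) unattributed += item.cost_usd;
  }
  return total === 0 ? 0 : unattributed / total;
}
```

Because patterns are anchored at both ends, `api.lantern-app.com` maps to `api` and is not also swallowed by the `lantern-app.com` rule. A line item on an unknown zone contributes to the unattributed share, which is what the Phase 3 "unattributed %" tile would surface.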

Phase 2 implementation approach (locked)

Decisions locked at Phase 2 kickoff (2026-05-10). See docs/superpowers/plans/2026-05-10-bigquery-billing-phase-2.md for the detailed spec.

Transform tool: Dataform (D1, locked in §8).
Repo location: in-tree at services/dataform/ — alongside services/jobs/ingest-cloudflare/. SQLX models, JS includes, and the workflow_settings.yaml all live in the GitHub repo and are reviewed together with ingest changes. The GCP-side Dataform repository (lantern-billing-transforms in lantern-app-{dev,prod}) mirrors from GitHub via gcloud dataform repositories create --git-remote-settings.
Schedule: Dataform workflow_config, daily 03:30 PT. Native to Dataform, no extra Cloud Run Job required, timezone-aware. Runs 30 min after ingest-cloudflare (03:00 PT) so the day's raw rows are present before normalization. Provisioned via gcloud dataform calls in the same deploy-{dev,prod}.yml pattern Phase 1 established. No Pub/Sub coupling in Phase 2 — Phase 4's multi-vendor fan-out is the natural revisit point.
CI gating: compile-only. A new dataform-compile job in .github/workflows/ci.yml runs dataform compile on every PR (no GCP credentials required, ~10s). Catches SQL syntax errors and broken refs pre-merge. BQ-against-live dry-runs are deliberately deferred — the daily dev workflow is the live-validation surface.
Taxonomy: env + app + service (D6, locked in §8).
Shared-resource attribution: omitted from Phase 2 (D8, locked in §8). fact_cost_attributed carries direct attribution only; no attribution_method / attribution_weight columns yet.
Datasets created defensively in deploy workflow (mirror Phase 1's bq mk pattern): billing_norm, billing_attrib, billing_marts.
IAM bindings (best-effort, same continue-on-error pattern as Phase 1): the WIF SA needs roles/dataform.editor on the project plus roles/bigquery.dataEditor on the three new datasets.

Phase 3 — Marts + Admin surface — 1-2 weeks

Materialize the marts from issue #52 (daily_cost_by_app_env_vendor, monthly_cost_by_app, top_cost_drivers_30d, cost_trend_rolling_7d, forecast_month_end). Use the daily-MERGE pattern from eventCountsAggregation.service.js.
forecast_month_end mart — naive linear projection of MTD attributed spend. Sources §4 sub-category 5.2 (Forecasting).
New endpoints in (or alongside) services/api/analytics/ — initially GET /admin/billing/daily, /admin/billing/top-drivers, /admin/billing/forecast. These read only from billing_marts.*, never from billing_norm or billing_raw (latency + byte-billed cost).
Admin UI: extend apps/admin/src/admin/billing/Billing.jsx with a "Historical" tab that hits the new endpoints. Reuse existing chart components.
Categories enabled: Cats 1 + 2 + 3 consumable end-to-end via admin for Cloudflare; 5.2 Forecasting.
Done when: admin can see a 30-day trend chart by env, by app, and by vendor; forecast tile renders; "unattributed %" tile is < 5%.
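
The "naive linear projection" behind forecast_month_end amounts to scaling the average daily spend so far to the full month. A minimal sketch, assuming the month-to-date figure includes the whole of the as-of day (the function name and signature are illustrative, and the real mart would be computed in SQL):

```javascript
// Naive linear projection: assume the average daily spend so far holds for the rest of the month.
function forecastMonthEnd(mtdCostUsd, asOfDate) {
  const [year, month, day] = asOfDate.split("-").map(Number); // "YYYY-MM-DD"
  // Day 0 of the following month is the last day of this month (Date.UTC months are 0-based).
  const daysInMonth = new Date(Date.UTC(year, month, 0)).getUTCDate();
  return (mtdCostUsd / day) * daysInMonth;
}
```

With $10 spent by 2026-05-10, the projection is (10 / 10) × 31 = $31 for May. The estimate is noisy early in the month and ignores step changes such as a plan upgrade, which is acceptable for a budget-tracking tile but not for anything contractual.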

Phase 4 — Fan out to remaining vendors + unit economics + anomalies — 2-3 weeks (parallelizable)

Two parallel tracks:

Track A: Vendor fan-out (~1-3 days each given the Phase 1-2 template):

ingest-railway (no tracking issue today; replaces the on-demand fetch in services/functions/firebase/modules/billing.js)
ingest-anthropic (closes #240 — needs ANTHROPIC_ADMIN_API_KEY in Secret Manager; pulls from the Anthropic Console Usage & Cost APIs)
ingest-resend (closes #238)
ingest-github (replaces the on-demand fetch for snapshots)

Each adds raw → norm → attrib stages mirroring Phase 2. Marts and admin endpoints already generalize from Phase 3.

Track B: Cat 4 unit economics + Cat 5.1 anomaly detection (depends on Track A):

First Cat-4 views once Anthropic ingest lands: 4.3 Cost-per-assistant-interaction (Anthropic API cost ⋈ analytics.events WHERE event_type = 'assistant.message') and 4.2 Cost-per-merchant (attributed cost ⋈ events keyed on merchant_id). These are the highest-value views per §4 — start here.
Fast-follows: 4.1 Cost-per-user, 4.4 Cost-per-product-event (pick 1-2 product KPIs based on what the team is tracking).
Surface as new admin endpoints /admin/billing/unit-economics/* and a "Unit Economics" tab in Billing.jsx.
Once trailing 14-day baseline exists across vendors, materialize ops.cost_anomalies — z-score per (vendor, env). Slack alert on |z| > 3. notification_log records every alert sent. Sources §4 sub-category 5.1 (Anomaly detection).
Categories enabled: Cats 1-3 for all vendors; Cat 4 unit economics (4.2, 4.3 first, then 4.1 / 4.4); 5.1 Anomaly detection.
Done when: cost-per-merchant and cost-per-assistant-message tiles render in admin; a deliberately-injected anomaly fires a Slack alert within 1 hour.
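
The z-score check described in Track B can be sketched as follows for a single (vendor, env) series. The 14-day minimum and the |z| > 3 threshold come from this plan; the function names, the use of the population standard deviation, and the handling of a flat baseline are assumptions.

```javascript
// z-score of today's cost against a trailing baseline (population standard deviation).
function zScore(baseline, today) {
  const mean = baseline.reduce((sum, x) => sum + x, 0) / baseline.length;
  const variance = baseline.reduce((sum, x) => sum + (x - mean) ** 2, 0) / baseline.length;
  const std = Math.sqrt(variance);
  // A perfectly flat baseline has std 0: any deviation is treated as infinitely unusual.
  if (std === 0) return today === mean ? 0 : Infinity * Math.sign(today - mean);
  return (today - mean) / std;
}

function isAnomaly(baseline, today, { minDays = 14, threshold = 3 } = {}) {
  if (baseline.length < minDays) return false; // not enough history to judge
  return Math.abs(zScore(baseline, today)) > threshold;
}
```

For a 14-day baseline alternating between $9 and $11 (mean 10, standard deviation 1), a $14 day scores z = 4 and would alert, while a $12 day scores z = 2 and would not. At our spend levels many series will be near-constant, so the flat-baseline branch matters more than it would for a larger bill.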

Note: where the dissolved Phase 5 work landed

The original plan parked all operational-health work (Cat 5: anomaly, forecasting, freshness, attribution coverage) in a final Phase 5. In practice each Cat 5 sub-category has a prerequisite that aligns with an earlier phase:

Sub	Prereq	Lands in
5.3 Pipeline freshness	`ops.pipeline_runs` (any ingest job)	Phase 1
5.4 Attribution coverage	Attribution model running	Phase 2
5.2 Forecasting	MTD attributed cost	Phase 3
5.1 Anomaly detection	≥14-day trailing baseline across vendors	Phase 4

Pulling Cat 5 forward gives the team optimization-targeting signal — the single highest-value use of cost data — at the same time as the visibility layer, not after it.

7. Build vs Buy — decision and rationale

Decision: Build. A SaaS option (Vantage Starter, free at our scale) was seriously considered. The deciding factors were the integration coverage gap and the unit-economics ceiling.

Pricing landscape (verified May 2026, for context)

Tool	Pricing model	Free tier	Fits us today?
Vantage	Tiered by monthly cloud spend: Starter $0 (≤$2.5k/mo), Pro $30 (≤$7.5k), Business $200 (≤$20k)	✅ Starter	Yes (we're ~$50/mo)
CloudZero	$19 per $1k of monthly AWS spend	❌	No — mid-market floor
Finout	~1% of cloud bill, fixed yearly	❌	No — mid-market floor
Cloudability (IBM)	2-3% of cloud spend	❌	No — enterprise

So the only realistic SaaS at our scale is Vantage Starter, free.

Why Build wins for us

Integration gap is fatal. Vantage GA covers AWS / Azure / GCP / Kubernetes / Datadog / Snowflake / OpenAI / Vercel / Fastly. Anthropic is not in Vantage's GA list at time of writing — relevant because the codebase uses Anthropic, not OpenAI, so the Vantage OpenAI integration doesn't help us. Cloudflare status was uncertain at time of writing (on the public roadmap as of 2022, not confirmed shipped). Railway, Resend, GitHub Actions are not supported. That means we'd be building Cloudflare (maybe), Railway, Resend, and GitHub Actions ingest jobs ourselves regardless — at which point Vantage is a duplicative $0-30/mo dependency covering only ~50% of our cost surface, not a replacement for the homegrown stack.
Unit economics require BQ-native joins. The high-leverage product questions for this codebase aren't "what did we spend?" — they're "what does it cost to serve a merchant / send an assistant message / light a lantern?". Those answers live in joins between billing data and analytics.events (e.g. cost-per-active-merchant = daily_cost_attributed_to_merchant_features / count(distinct merchant_id) FROM events WHERE event_type IN (...)). Vantage cannot see analytics.events. With both datasets in the same BigQuery project, it's a same-day JOIN. This is the durable differentiator that pays for the engineering time, not anomaly detection or forecasting (those are commodity features Vantage ships for free).
Taxonomy + rules in version control. dim_cost_allocation_rules lives in our repo, gets reviewed, has tests. SaaS taxonomy lives in someone else's web UI.
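
The cost-per-active-merchant calculation mentioned above is a plain aggregate: total attributed cost over distinct active merchants in the same window. In production this would be a BigQuery JOIN; the in-memory sketch below only illustrates the arithmetic, and the row shapes and function name are assumptions.

```javascript
// Aggregate unit economics: attributed cost divided by distinct active merchants in the window.
function costPerActiveMerchant(attributedCostRows, events, eventTypes) {
  const totalCost = attributedCostRows.reduce((sum, row) => sum + row.cost_usd, 0);
  const merchants = new Set(
    events
      .filter((e) => eventTypes.includes(e.event_type) && e.merchant_id)
      .map((e) => e.merchant_id)
  );
  // No active merchants means the ratio is undefined rather than zero.
  return merchants.size === 0 ? null : totalCost / merchants.size;
}
```

Returning `null` for an empty window avoids a misleading "$0 per merchant" tile on days when events simply failed to land.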

Honest cost of Build

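~3-5 weeks of one engineer for Phases 1-4 (the former Phase 5 is folded into them; see §6). The per-phase estimates in §6 sum to roughly 7-9 weeks if run strictly serially, so treat the lower figure as optimistic.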
Reimplementing commodity features (anomaly z-score, forecasts, alert routing) that Vantage ships out of the box.
Operational ownership of ingest jobs, schema migrations, BQ byte-billed spend on hot queries.

We accept these costs because reasons #1 and #2 above cannot be worked around. We would go in-house regardless of how Vantage's roadmap evolves: even if Cloudflare support shipped on their side, reason #2 (BQ-native unit economics) stands on its own.

8. Open decisions

Mixed state: rows marked (decided) were resolved during the planning conversation and are locked. The remaining rows (D4, D5) carry tentative recommendations or research findings, and still need reviewer sign-off before their phase starts.

#	Decision	Tentative	Why it matters
D0	Build vs Buy	Build (decided — see §7)	Resolved. Driven by integration coverage gaps (Railway / Resend / GitHub Actions unsupported by Vantage) and the need for BQ-native joins with `analytics.events` to answer unit-economics questions a SaaS can't see. Sanity-check Cloudflare integration status before Phase 1 starts.
D1	Dataform vs dbt for the transform layer	Dataform (decided)	Native to GCP — same project / auth / UI as BigQuery, free, no external runner. dbt's strengths (large package ecosystem, mature testing) don't apply much to billing transforms (linear SQL over a handful of tables); we'd be paying dbt Cloud or self-hosting dbt Core to win generic-modeling features we don't need. Re-evaluate only if we ever stand up a broader analytics-engineering practice.
D2	GCS landing zone in Phase 1, or defer?	Deferred (decided)	`billing_raw` is the audit trail; revisit at a Phase 4 retro if we hit a need for raw-payload replay. Keeps Phase 1 simpler.
D3	Where does ingest live: Cloud Run Jobs, Cloud Run Services with `/cron` endpoints, or stay in Firebase Functions?	Cloud Run Jobs (decided)	Aligns with `2026-02-09_cloud-run-migration-optimized.md` Phase 2B; provisioned via `gcloud run jobs deploy` in the existing GitHub Actions pipeline (no Terraform).
D4	Move `billing_snapshots` out of `billing_export_gcp`?	Yes, into new `billing_app` dataset	Separates app-owned tables from Google-managed exports.
D5	Backfill strategy per vendor (which support history-replay?)	Cloudflare ✅, Railway ✅, Anthropic ❓ (Console Usage & Cost APIs expose historical data; depth + granularity needs to be verified before Phase 4), Resend ✅ (`GET /emails` supports forward + backward cursor pagination — paginate to the start of the account, group by day, derive cost from email-count × tier rate)	Drives whether we treat snapshots as the source of truth or transient.
D6	Taxonomy granularity v1 — env/app, or env/app/team/feature?	env + app + service (decided)	Locked at the Phase 2 kickoff. Three dimensions, ~5-10 allocation rules to maintain. Maps to existing labels: GCP project → env, Cloudflare zone → app, vendor `service` field → service. Team / feature dimensions deferred — re-evaluate at Phase 4 retro per §9 risk note.
D7	Reporting surface — Looker Studio, Grafana, or admin-only?	Admin-only (decided)	Admin console (`apps/admin/src/admin/billing/Billing.jsx`) is the single consumer surface. No Looker Studio / Grafana. Keeps IAM simple (admin role + service account, no extra BI tool access) and reuses existing chart components + auth. Re-evaluate only if a non-engineering audience needs direct BQ access.
D8	Cost attribution for shared resources (e.g. shared Firebase project)	Omit until Phase 4 or later (decided)	Locked at the Phase 2 kickoff. Phase 2's only vendor (Cloudflare) has 1:1 zone → app mapping with no shared-resource ambiguity, so scaffolding `attribution_method` / `attribution_weight` columns now would be speculative. Add columns when the first vendor that needs them (likely Firebase / shared GCP project) lands. Tentative implementation when revisited: proportional split by request count.
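
The Resend backfill idea in D5 (group emails by day, then derive cost from email-count × tier rate) can be sketched as below. The flat per-email rate is a placeholder, not Resend's actual pricing, and the sketch assumes each email record carries a `created_at` timestamp whose first ten characters are the `YYYY-MM-DD` date.

```javascript
// Group emails by day and price each day at a flat per-email rate (assumed, not real pricing).
function dailyEmailCost(emails, ratePerEmailUsd) {
  const countByDay = {};
  for (const email of emails) {
    const day = email.created_at.slice(0, 10); // assumes an ISO-style timestamp prefix
    countByDay[day] = (countByDay[day] || 0) + 1;
  }
  return Object.keys(countByDay).sort().map((day) => ({
    day,
    email_count: countByDay[day],
    cost_usd: countByDay[day] * ratePerEmailUsd,
  }));
}
```

A real tier model is unlikely to be linear (plans typically bundle a monthly quota), so the per-day figure is an allocation of the monthly bill rather than a true marginal cost.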

9. Risks

billing_export_gcp schema drift. Google can change GCP billing export columns. Already an issue: getGcpBillingWithBigQuery() has fallback handling. Mitigation: Dataform model owns the column-projection, adapt centrally.
BigQuery byte-billing on hot queries. Marts must be partitioned + clustered; admin endpoints must read marts only, not norm/raw. Same posture as bq-query-infrastructure plan (maximumBytesBilled is enforced there for ad-hoc; we should re-use for canned reports).
Vendor API rate limits + auth rotation. Cloudflare/Railway/GitHub tokens rotate; ingest jobs should fail loudly to ops.pipeline_runs and Slack.
Cardinality explosion in dim_cost_allocation_rules. Resist the temptation to add per-feature attribution before measuring overhead. Re-evaluate at Phase 4 retro.
Cost of the cost system. BigQuery storage + query for billing_raw/norm is itself a line-item we'll see in the data. Set a soft budget (suggested: $5/mo) and alarm if exceeded.
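
For the byte-billing risk, one way to reuse the maximumBytesBilled posture for canned reports is to build every query's options through a single helper. The helper name and the 1 GiB default are assumptions; `maximumBytesBilled` itself is a standard BigQuery query setting, passed as a string of bytes.

```javascript
// Build query options with a hard byte cap; BigQuery rejects the query rather than billing past it.
function cappedQueryOptions(sql, maxGiB = 1) {
  return {
    query: sql,
    maximumBytesBilled: String(maxGiB * 1024 ** 3), // the API expects a string of bytes
    useLegacySql: false,
  };
}
```

The resulting object is intended to be handed to the BigQuery client's query call, so a mart query that accidentally scans an unpartitioned table fails fast instead of showing up later as a line item in its own cost data.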

10. Out of scope (for this plan)

Merchant-facing billing (invoices, Stripe, payment collection). Tracked in docs/economics/ECONOMICS.md and docs/business/MERCHANT_INTEGRATION_POA.md; separate concern from infrastructure cost monitoring.
Line-item-grain attribution to specific events (e.g. "this exact Claude API call cost $0.0021 and was for assistant message ID msg_abc123"). Cat 4 in §4 covers aggregate unit economics — SUM(cost) / COUNT(events) over a time window — which is what we actually need for pricing and cost trends. True line-item-level attribution requires capturing a usage receipt on every API call into a separate analytics.cost_attribution_events table; possible later if a use case demands it, but not on the roadmap.
FinOps showback / chargeback to feature teams. We're a small team — revisit when org structure justifies it.
Expense reports, accounts payable, vendor payments, merchant payouts. This pipeline is read-only observability ("what did we spend?"). Anything involving moving money — corporate-card expense reports (Ramp, Brex), vendor invoice payment (Bill.com), merchant payouts (Stripe Connect) — is fiduciary / write-side and will use vetted SaaS, not anything we build. The natural one-way export from this pipeline (someday): billing_marts.* → Ramp coding rules so an inbound Cloudflare invoice gets auto-coded to engineering > infra > prod. That's a thin downstream integration, not a phase of this plan, and not something to design for now.

11. Next concrete step

This PR adds only this planning doc. The follow-up PR(s) should be:

First (1-2 days): Resolve PR #302 — either land it, rebase it onto this plan's Phase 0, or close in favour of Phase 1 starting fresh.
Then (Phase 1 PR): Create billing_raw dataset + services/jobs/ingest-cloudflare/ Cloud Run Job + ops.pipeline_runs table. Aim for ~400 LOC + a Cloud Scheduler config. One vendor end-to-end is more valuable than five vendors at the snapshot layer.

When picking up Phase 1, open a new issue ("Phase 1: Cloudflare ingest job + billing_raw dataset") that links back to this doc and to issue #52.
