Skip to content

BigQuery Billing Infrastructure โ€” Plan โ€‹

Last Updated: 2026-05-05 Status: Planning (pre-implementation) Author: drafted by Claude on claude/bigquery-billing-infrastructure-IZf4RRelated Issues / PRs:

  • #52 โ€” Billing Monitoring (canonical spec for the BQ layout)
  • #233 โ€” System Health: Add Billing/Cost Data (Phase 2)
  • #302 โ€” Phase 2 historical snapshots + 1hr cache + BQ pipeline (open draft, partial impl)
  • #240 โ€” AI assistant billing tracking (originally filed as "OpenAI billing"; the codebase now uses Anthropic โ€” issue title may need a follow-up rename)
  • #238 โ€” Resend billing integration
  • #188 โ€” AI usage transparency
  • docs/plans/2026-02-09_cloud-run-migration-optimized.md โ€” Phase 2B references the combined analytics-api + billing-api Cloud Run deployment

1. Why this doc โ€‹

Issue #52 describes a full multi-vendor BigQuery billing pipeline (raw โ†’ normalized โ†’ attributed โ†’ marts + ops). PR #302 ships a much narrower first slice: a single denormalized billing_snapshots table written daily by a scheduled Cloud Function.

We have two artefacts pointing in roughly the same direction but at very different levels of ambition, and no single place that says "here is where we are, here is where we want to end up, here is the path between them." This doc is that place. It is not a coding spec โ€” it is a roadmap so the next implementation PR has a clear scope and the team has a shared mental model of the end state.

2. Current state (what already exists in dev) โ€‹

Code in tree โ€‹

LayerLocationWhat it does
Per-vendor billing fetchservices/functions/firebase/modules/billing.jsOn-demand pulls for GCP, Cloudflare, Railway, GitHub. Includes getGcpBillingWithBigQuery() which queries billing_export_gcp.gcp_billing_export + cloud_pricing_export.
Vendor API helpersservices/functions/firebase/modules/billingShared.jsCloudflare GraphQL, Railway GraphQL, GitHub Actions usage.
System Health surfaceservices/functions/firebase/modules/systemHealth.js + apps/admin/src/components/SystemHealth.jsx5-min health cache, 1-hr separate billing cache, MoM tile (PR #302).
BQ client (events)packages/forge/bigquery.jsStreaming inserts to analytics.events.
BQ schema browserservices/api/analytics/src/services/bqSchema.service.jsMetadata-only, 5-min cache.
Daily MERGE rollup patternservices/api/analytics/src/services/eventCountsAggregation.service.jsReference pattern for "ingest-then-rollup" jobs.

Datasets that exist today (per env: lantern-app-dev, lantern-app-prod) โ€‹

  • analytics โ€” events, event_counts_daily (events product, not billing)
  • billing_export_gcp โ€” gcp_billing_export, cloud_pricing_export (managed by Google Cloud Billing export, always-on, schema owned by Google). This stays exactly as-is โ€” the export is the right way to capture GCP cost and we don't modify it. Our pipeline reads from it; it does not read from us.
  • (PR #302 adds) billing_export_gcp.billing_snapshots โ€” auto-created on first scheduled-function run, daily denormalized rows, MONTH-partitioned. Co-locates app-owned data inside Google's managed-export dataset, which is the bit we want to fix in Phase 0 (see ยง6).

What PR #302 is doing right โ€‹

  • Captures a daily historical record from day one (the issue-#52 comment correctly flags that historical billing data cannot be retroactively captured)
  • Idempotent (insertId = YYYY-MM-DD)
  • Non-fatal BigQuery write โ€” Firestore fast-cache is the fallback
  • Auto-creates the table with month partitioning + handles the 409 race

What PR #302 leaves on the table โ€‹

  • All vendors squashed into one row, one table (cloudflare_cost_usd, railway_cost_usd, โ€ฆ) โ€” fine as a snapshot, useless as a fact table
  • No GCS landing zone, no raw vendor payloads โ€” just totals
  • No taxonomy / attribution layer โ€” can't slice by env / app / feature
  • No anomaly / freshness / pipeline-runs ops tables
  • Lives in the Firebase Functions module rather than a dedicated billing-api service or Cloud Run job โ€” same place where on-demand fetches happen
  • Reuses billing_export_gcp (a GCP-managed export dataset) for our app-managed billing_snapshots โ€” convenient but conflates two ownership models

3. Target architecture (issue #52, restated) โ€‹

Cloud Scheduler
  โ”œโ”€โ”€ Cloud Run Job: ingest-cloudflare โ”€โ”
  โ”œโ”€โ”€ Cloud Run Job: ingest-railway     โ”œโ”€โ†’ GCS (raw payloads, optional)
  โ”œโ”€โ”€ Cloud Run Job: ingest-anthropic   โ”‚       โ”‚
  โ”œโ”€โ”€ Cloud Run Job: ingest-resend      โ”‚       โ–ผ
  โ””โ”€โ”€ (always on) GCP Billing Export โ”€โ”€โ”€โ”ดโ”€โ†’ BigQuery: billing_raw.*

       Dataform (or dbt) on schedule:
         billing_raw.*
           โ†’ billing_norm.fact_cost_line_items     (one canonical schema)
           โ†’ billing_attrib.fact_cost_attributed   (env / app / team labels)
           โ†’ billing_marts.*                        (BI rollups)

       Cloud Run Job: cost-health-check
         โ†’ ops.pipeline_runs / data_freshness / cost_anomalies
         โ†’ Slack / email alerts

       Reporting:
         Looker Studio + Admin Console โ†’ billing_marts.*

Datasets โ€‹

DatasetPurposeWrite pattern
billing_rawImmutable, vendor-shaped raw rowsInsert-only by ingest jobs / managed exports
billing_normCanonical fact_cost_line_items (one row = one billable cost event)Built by Dataform/dbt from billing_raw
billing_attribfact_cost_attributed + dim_taxonomy, dim_cost_allocation_rulesBuilt by Dataform/dbt; rules-driven
billing_martsDaily / monthly rollups for dashboardsMERGE jobs (cf. eventCountsAggregation.service.js)
opspipeline_runs, data_freshness, cost_anomaliesWritten by every ingest + health-check job
billing_sandbox (optional)Backfills, ad-hoc analysisFree-form, throwaway

Naming: fact_* for event-grain, dim_* for reference, daily_* / monthly_* for marts, everything in ops.* for observability. (Per issue #52.)

4. Categories of cost question โ€‹

A useful organizing axis for the rest of this plan is the kind of question the data needs to answer. Five primary categories, with sub-categories where they're meaningfully different shapes. Sub-categories within a primary share fidelity requirements and source datasets โ€” they're slices, not separate products.

Throughout this doc and the codebase, "merchant" is the project nomenclature for the paying business (venue owner). "User" means the Lantern end-user (consumer). The two sub-categories under ยง4.x reflect that split.

SubQuestionSource dataset(s)
1Cost-over-time โ€” "What did we spend last month? Show MoM trend."billing_marts.daily_cost_by_vendor (PR #302's snapshot is a v0 of this)
2.1Cost-by-service / SKU โ€” "Within GCP, what's costing most?"billing_norm.fact_cost_line_items
2.2Cost-by-resource โ€” "Which Cloud Run service / Cloudflare zone / Firestore collection is driving spend?"billing_norm.fact_cost_line_items GROUP BY resource_id
3.1Cost-by-environment โ€” "How much is dev vs prod?"billing_attrib.fact_cost_attributed
3.2Cost-by-app โ€” "web vs api vs admin vs functions vs merchant-portal?"billing_attrib.fact_cost_attributed
3.3Cost-by-team / cost-center โ€” deferred, single-team org. Documented for future reference; no implementation planned.(future)
4.1Cost-per-user โ€” usage-cost statistics for Lantern end-users (consumers). Not for revenue margin (no consumer revenue today) โ€” for understanding consumer-side cost trends as we scale.billing_attrib.* โ‹ˆ analytics.events filtered to consumer events
4.2Cost-per-merchant โ€” highest-priority unit-economics view. Feeds the merchant pricing model in docs/economics/ECONOMICS.md.billing_attrib.* โ‹ˆ analytics.events keyed on merchant_id
4.3Cost-per-assistant-interaction โ€” Anthropic API cost รท assistant message events. Tells us if conversation costs are sustainable per-merchant.billing_attrib.* (Anthropic rows) โ‹ˆ analytics.events WHERE event_type = 'assistant.message'
4.4Cost-per-product-event โ€” cost-per-lantern-lit / cost-per-offer-redemption / cost-per-venue-refresh. Specific events picked at Phase 4 once we know which KPIs the team is tracking.billing_attrib.* โ‹ˆ analytics.events filtered to product KPIs
5.1Anomaly detection โ€” "Did something just get more expensive?" z-score baseline alerting.ops.cost_anomalies
5.2Forecasting โ€” "Are we on track to exceed budget this month?"billing_marts.forecast_month_end
5.3Pipeline freshness โ€” "Did every vendor land a row today?"ops.data_freshness, ops.pipeline_runs
5.4Attribution coverage โ€” "What % of cost is unattributed? Are our rules decaying?"billing_marts.unattributed_cost_daily

Why this matters for the rollout โ€‹

The arc is monotonic in fidelity โ€” each primary category needs everything the previous one needed plus more. That maps almost 1:1 to the phases in ยง6:

  • Phase 1 enables Cat 1 (vendor totals over time) and 5.3 (pipeline freshness comes for free with ops.pipeline_runs).
  • Phase 2 enables 2.1, 2.2, 3.1, 3.2 for one vendor + 5.4 (attribution coverage starts being measurable as soon as attribution runs).
  • Phase 3 enables full consumption of Cats 1-3 via the admin surface + 5.2 (forecast can be built on MTD attributed cost).
  • Phase 4 fans out remaining vendors and lights up Cat 4 (event-joined unit economics needs attributed data on multiple vendors to be meaningful) + 5.1 (anomaly detection needs โ‰ฅ14d trailing baseline, which lands here).

Cat 5 sub-categories are not their own phase โ€” they're folded into Phases 1-4 by their natural prerequisites. The original plan parked all of Cat 5 in a final Phase 5; pulling them forward gives the team optimization-targeting signal (5.1 alongside the Cat 2/3 cost-driver views) at the same time as the visibility layer, not after it. That matters because "where is it expensive?" is one of the highest-value uses of cost data, not a polish item.

5. Gap analysis (current โ†’ target) โ€‹

CapabilityCurrentTargetGap
GCP cost captureManaged export โœ…SameNone
Cloudflare cost captureOn-demand fetch + daily snapshot totalDaily ingest into billing_raw.cloudflare_invoice_line_items_rawNeed ingest job, raw schema, GCS landing
Railway cost captureSame as CloudflareSameSame
Anthropic cost captureNone (#240 open โ€” originally OpenAI, now Anthropic)Daily ingest from Anthropic Console API, line-item grainWhole vertical
Resend cost captureNone (#238 open)SameWhole vertical
GitHub cost captureOn-demand fetch + daily snapshot totalDaily ingest (Actions usage + Copilot)Need ingest job
Canonical fact tableSingle denormalized snapshot rowbilling_norm.fact_cost_line_itemsNeed normalization layer (Dataform/dbt)
Attribution (env / app / team)Nonebilling_attrib.fact_cost_attributedWhole layer + taxonomy + rules
RollupsSnapshot-grain onlybilling_marts.daily_*, monthly_*, top_cost_drivers_30d, unattributed_cost_daily, cost_trend_rolling_7dWhole layer
Pipeline observabilityApp logs onlyops.pipeline_runs, data_freshness, cost_anomaliesWhole layer
AlertingNoneSlack/email on freshness lag + anomalyNeed alerting hooks
Backfill storyN/A (snapshots are daily-going-forward only)billing_sandbox.backfill_* for vendors that expose historyNeed per-vendor backfill scripts
Reporting surfaceSystem Health "Billing" tab+ Looker Studio + admin Cost dashboardNeed marts โ†’ admin endpoints
Service homeCloud FunctionsCloud Run job(s) + billing-api serviceNeed migration (aligns with Phase 2B in cloud-run plan)

6. Phased rollout โ€‹

Five phases (numbered 0-4). Each phase is independently shippable, leaves the system in a working state, and unlocks specific cost-question categories from ยง4. PR #302 is roughly two-thirds of Phase 0; land or rebase it before starting Phase 1.

The original plan had a Phase 5 for ops/alerting/forecasting; that phase has been dissolved and its sub-categories (5.1-5.4) folded into Phases 1-4 by their natural prerequisites. See the note at the end of this section.

Phase 0 โ€” Land what's in flight (PR #302) โ€” this week โ€‹

  • Resolve PR #302 review and merge (or close + cherry-pick the table-creation helper into Phase 1).
  • Move the billing_snapshots table out of billing_export_gcp and into a new app-owned dataset billing_app. Rationale: billing_export_gcp is managed by Google Cloud Billing export โ€” its schema, partitioning, and retention policies are controlled by Google. Co-locating our app-written tables inside it conflates two ownership models and risks us having to re-architect later if Google changes export semantics. Hygiene only โ€” the GCP export itself is automatic, correct, and untouched.
  • Categories enabled: rough Cat 1 at snapshot grain (totals only, no service / resource / attribution slicing yet).
  • Action items:
    1. Decide: keep snapshot table or treat it as throwaway once Phase 1 lands? (Recommendation: keep โ€” it's a useful sanity check vs. the normalized fact table and provides graceful degradation if ingest jobs lag.)
    2. Create billing_app dataset in dev + prod, move the table, update BIGQUERY_BILLING_DATASET env (or split into two env vars: one pointing at the Google-managed export, one at the app-owned dataset).

Phase 1 โ€” Stand up billing_raw + first ingest job โ€” 2 weeks โ€‹

Goal: one Cloud Run Job, one vendor (start with Cloudflare โ€” has the cleanest GraphQL billing API and we already have the client code in billingShared.js).

  • Create billing_raw dataset in dev + prod (manual bq mk, document in docs/economics/billing/PLATFORMS.md).
  • New Cloud Run Job: services/jobs/ingest-cloudflare/ (mirrors the structure of services/api/*). Daily Cloud Scheduler trigger. Idempotent on (snapshot_date, line_item_id).
  • Optional GCS landing: gs://lantern-billing-raw/cloudflare/YYYY/MM/DD/*.json (skip in Phase 1 if it complicates things โ€” billing_raw rows are sufficient for an audit trail, and we can backfill GCS later from the raw table).
  • New ops.pipeline_runs table โ€” every ingest job writes a row at start + end. This single table is also the source for ยง4 sub-category 5.3 (Pipeline freshness) โ€” cheap to add at this phase, pays off immediately.
  • Categories enabled: full Cat 1 for Cloudflare (line-item-grain rather than just totals); 5.3 Pipeline freshness.
  • Done when: billing_raw.cloudflare_invoice_line_items_raw has 7+ days of rows and ops.pipeline_runs shows 7 successful executions.

Phase 1 implementation approach (locked) โ€‹

  • No Terraform. Provision dataset, Cloud Run Job, and Scheduler trigger via bq mk / gcloud run jobs deploy / gcloud scheduler jobs create steps in the existing .github/workflows/deploy-{dev,prod}.yml pipeline. Matches the convention already in use; revisit Terraform if infra grows.
  • IAM bindings included defensively. The Phase 1 workflow runs gcloud projects add-iam-policy-binding for the WIF service account on each role we need (roles/bigquery.dataOwner on the new datasets, roles/run.developer, roles/cloudscheduler.admin, roles/secretmanager.secretAccessor). These are idempotent โ€” safe to run even if scopes are already granted.
  • Cloudflare token scope verification. PR #302's source issue (#233) flagged that Account โ†’ Billing โ†’ Read may need to be added to the existing CLOUDFLARE_API_TOKEN. The ingest job will fail loudly on a 401/403 from the billing endpoint and emit a clear error to ops.pipeline_runs.error. Verification step is captured in the Phase 1 PR's "before merge" checklist.
  • Authentication. The Cloud Run Job runs as the same WIF service account the deploy uses; reads secrets directly from Secret Manager via the existing defineSecret pattern (services/functions/firebase/config.js).

Phase 2 โ€” Normalize + attribute (one vendor) โ€” 2 weeks โ€‹

Goal: prove the Dataform/dbt layer with one vendor before fanning out.

  • Decision needed: Dataform vs dbt. (See ยง8.) Tentative recommendation: Dataform โ€” native to GCP, no extra infra, free for the foreseeable future. dbt is more powerful but adds a runner (dbt Cloud or self-hosted) we don't need.
  • Define billing_norm.fact_cost_line_items schema (vendor-agnostic columns: vendor, service, sku, usage_amount, usage_unit, cost_usd, usage_start, usage_end, resource_id, labels JSON, โ€ฆ).
  • First Dataform model: cloudflare_invoice_line_items_raw โ†’ fact_cost_line_items.
  • Define dim_taxonomy (env, app, team, cost_center) โ€” start with three values per dimension, expand later.
  • Define dim_cost_allocation_rules โ€” initial rules: GCP project-id label โ†’ env (*-dev โ†’ dev, *-prod โ†’ prod); Cloudflare zone โ†’ app (lantern-app.com โ†’ web, api.lantern-app.com โ†’ api).
  • First attribution model: fact_cost_line_items โ†’ fact_cost_attributed.
  • Add a derived view billing_marts.unattributed_cost_daily (everything in fact_cost_line_items not in fact_cost_attributed for the same period). Sources ยง4 sub-category 5.4 (Attribution coverage).
  • Categories enabled: 2.1 (cost-by-service / SKU), 2.2 (cost-by-resource), 3.1 (cost-by-environment), 3.2 (cost-by-app) โ€” all for Cloudflare; 5.4 Attribution coverage.
  • Done when: a SQL query against fact_cost_attributed returns Cloudflare costs sliced by env + app, matching the totals in billing_snapshots to within 1%.

Phase 2 implementation approach (locked) โ€‹

Decisions locked at Phase 2 kickoff (2026-05-10). See docs/superpowers/plans/2026-05-10-bigquery-billing-phase-2.md for the detailed spec.

  • Transform tool: Dataform (D1, already locked in ยง8 above).
  • Repo location: in-tree at services/dataform/ โ€” alongside services/jobs/ingest-cloudflare/. SQLX models, JS includes, and the workflow_settings.yaml all live in the GitHub repo and are reviewed together with ingest changes. The GCP-side Dataform repository (lantern-billing-transforms in lantern-app-{dev,prod}) mirrors from GitHub via gcloud dataform repositories create --git-remote-settings.
  • Schedule: Dataform workflow_config, daily 03:30 PT. Native to Dataform, no extra Cloud Run Job required, timezone-aware. Runs 30 min after ingest-cloudflare (03:00 PT) so the day's raw rows are present before normalization. Provisioned via gcloud dataform calls in the same deploy-{dev,prod}.yml pattern Phase 1 established. No Pub/Sub coupling in Phase 2 โ€” Phase 4's multi-vendor fan-out is the natural revisit point.
  • CI gating: compile-only. A new dataform-compile job in .github/workflows/ci.yml runs dataform compile on every PR (no GCP credentials required, ~10s). Catches SQL syntax errors and broken refs pre-merge. BQ-against-live dry-runs are deliberately deferred โ€” the daily dev workflow is the live-validation surface.
  • Taxonomy: env + app + service (D6, locked above).
  • Shared-resource attribution: omitted from Phase 2 (D8, locked above). fact_cost_attributed carries direct attribution only; no attribution_method / attribution_weight columns yet.
  • Datasets created defensively in deploy workflow (mirror Phase 1's bq mk pattern): billing_norm, billing_attrib, billing_marts.
  • IAM bindings (best-effort, same continue-on-error pattern as Phase 1): the WIF SA needs roles/dataform.editor on the project plus roles/bigquery.dataEditor on the three new datasets.

Phase 3 โ€” Marts + Admin surface โ€” 1-2 weeks โ€‹

  • Materialize the marts from issue #52 (daily_cost_by_app_env_vendor, monthly_cost_by_app, top_cost_drivers_30d, cost_trend_rolling_7d, forecast_month_end). Use the daily-MERGE pattern from eventCountsAggregation.service.js.
  • forecast_month_end mart โ€” naive linear projection of MTD attributed spend. Sources ยง4 sub-category 5.2 (Forecasting).
  • New endpoints in (or alongside) services/api/analytics/ โ€” initially GET /admin/billing/daily, /admin/billing/top-drivers, /admin/billing/forecast. These read only from billing_marts.*, never from billing_norm or billing_raw (latency + byte-billed cost).
  • Admin UI: extend apps/admin/src/admin/billing/Billing.jsx with a "Historical" tab that hits the new endpoints. Reuse existing chart components.
  • Categories enabled: Cats 1 + 2 + 3 consumable end-to-end via admin for Cloudflare; 5.2 Forecasting.
  • Done when: admin can see a 30-day trend chart by env, by app, and by vendor; forecast tile renders; "unattributed %" tile is < 5%.

Phase 4 โ€” Fan out to remaining vendors + unit economics + anomalies โ€” 2-3 weeks (parallelizable) โ€‹

Two parallel tracks:

Track A: Vendor fan-out (~1-3 days each given the Phase 1-2 template):

  • ingest-railway (no tracking issue today; replaces the on-demand fetch in services/functions/firebase/modules/billing.js)
  • ingest-anthropic (closes #240 โ€” needs ANTHROPIC_ADMIN_API_KEY in Secret Manager; pulls from the Anthropic Console Usage & Cost APIs)
  • ingest-resend (closes #238)
  • ingest-github (replaces the on-demand fetch for snapshots)

Each adds raw โ†’ norm โ†’ attrib stages mirroring Phase 2. Marts and admin endpoints already generalize from Phase 3.

Track B: Cat 4 unit economics + Cat 5.1 anomaly detection (depends on Track A):

  • First Cat-4 views once Anthropic ingest lands: 4.3 Cost-per-assistant-interaction (Anthropic API cost โ‹ˆ analytics.events WHERE event_type = 'assistant.message') and 4.2 Cost-per-merchant (attributed cost โ‹ˆ events keyed on merchant_id). These are the highest-value views per ยง4 โ€” start here.
  • Fast-follows: 4.1 Cost-per-user, 4.4 Cost-per-product-event (pick 1-2 product KPIs based on what the team is tracking).
  • Surface as new admin endpoints /admin/billing/unit-economics/* and a "Unit Economics" tab in Billing.jsx.
  • Once trailing 14-day baseline exists across vendors, materialize ops.cost_anomalies โ€” z-score per (vendor, env). Slack alert on |z| > 3. notification_log records every alert sent. Sources ยง4 sub-category 5.1 (Anomaly detection).
  • Categories enabled: Cats 1-3 for all vendors; Cat 4 unit economics (4.2, 4.3 first, then 4.1 / 4.4); 5.1 Anomaly detection.
  • Done when: cost-per-merchant and cost-per-assistant-message tiles render in admin; a deliberately-injected anomaly fires a Slack alert within 1 hour.

Note: where the dissolved Phase 5 work landed โ€‹

The original plan parked all operational-health work (Cat 5: anomaly, forecasting, freshness, attribution coverage) in a final Phase 5. In practice each Cat 5 sub-category has a prerequisite that aligns with an earlier phase:

SubPrereqLands in
5.3 Pipeline freshnessops.pipeline_runs (any ingest job)Phase 1
5.4 Attribution coverageAttribution model runningPhase 2
5.2 ForecastingMTD attributed costPhase 3
5.1 Anomaly detectionโ‰ฅ14-day trailing baseline across vendorsPhase 4

Pulling Cat 5 forward gives the team optimization-targeting signal โ€” the single highest-value use of cost data โ€” at the same time as the visibility layer, not after it.

7. Build vs Buy โ€” decision and rationale โ€‹

Decision: Build. A SaaS option (Vantage Starter, free at our scale) was seriously considered. The deciding factor: the integration coverage gap and the unit-economics ceiling.

Pricing landscape (verified May 2026, for context) โ€‹

ToolPricing modelFree tierFits us today?
VantageTiered by monthly cloud spend: Starter $0 (โ‰ค$2.5k/mo), Pro $30 (โ‰ค$7.5k), Business $200 (โ‰ค$20k)โœ… StarterYes (we're ~$50/mo)
CloudZero$19 per $1k of monthly AWS spendโŒNo โ€” mid-market floor
Finout~1% of cloud bill, fixed yearlyโŒNo โ€” mid-market floor
Cloudability (IBM)2-3% of cloud spendโŒNo โ€” enterprise

So the only realistic SaaS at our scale is Vantage Starter, free.

Why Build wins for us โ€‹

  1. Integration gap is fatal. Vantage GA covers AWS / Azure / GCP / Kubernetes / Datadog / Snowflake / OpenAI / Vercel / Fastly. Anthropic is not in Vantage's GA list at time of writing โ€” relevant because the codebase uses Anthropic, not OpenAI, so the Vantage OpenAI integration doesn't help us. Cloudflare status was uncertain at time of writing (on the public roadmap as of 2022, not confirmed shipped). Railway, Resend, GitHub Actions are not supported. That means we'd be building Cloudflare (maybe), Railway, Resend, and GitHub Actions ingest jobs ourselves regardless โ€” at which point Vantage is a duplicative $0-30/mo dependency covering only ~50% of our cost surface, not a replacement for the homegrown stack.

  2. Unit economics require BQ-native joins. The high-leverage product questions for this codebase aren't "what did we spend?" โ€” they're "what does it cost to serve a merchant / send an assistant message / light a lantern?". Those answers live in joins between billing data and analytics.events (e.g. cost-per-active-merchant = daily_cost_attributed_to_merchant_features / count(distinct merchant_id) FROM events WHERE event_type IN (...)). Vantage cannot see analytics.events. With both datasets in the same BigQuery project, it's a same-day JOIN. This is the durable differentiator that pays for the engineering time, not anomaly detection or forecasting (those are commodity features Vantage ships for free).

  3. Taxonomy + rules in version control. dim_cost_allocation_rules lives in our repo, gets reviewed, has tests. SaaS taxonomy lives in someone else's web UI.

Honest cost of Build โ€‹

  • ~3-5 weeks of one engineer for Phases 1-5.
  • Reimplementing commodity features (anomaly z-score, forecasts, alert routing) that Vantage ships out of the box.
  • Operational ownership of ingest jobs, schema migrations, BQ byte-billed spend on hot queries.

We accept these costs because reasons #1 and #2 above are not workable around. Going in-house regardless of how Vantage's roadmap evolves โ€” even if Cloudflare shipped on their side, reason #2 (BQ-native unit economics) stands on its own.

8. Open decisions โ€‹

Mixed state: rows marked (decided) were resolved during the planning conversation and are locked. The remaining rows (D4, D5) carry tentative recommendations or research findings, and still need reviewer sign-off before their phase starts.

#DecisionTentativeWhy it matters
D0Build vs BuyBuild (decided โ€” see ยง7)Resolved. Driven by integration coverage gaps (Railway / Resend / GitHub Actions unsupported by Vantage) and the need for BQ-native joins with analytics.events to answer unit-economics questions a SaaS can't see. Sanity-check Cloudflare integration status before Phase 1 starts.
D1Dataform vs dbt for the transform layerDataform (decided)Native to GCP โ€” same project / auth / UI as BigQuery, free, no external runner. dbt's strengths (large package ecosystem, mature testing) don't apply much to billing transforms (linear SQL over a handful of tables); we'd be paying dbt Cloud or self-hosting dbt Core to win generic-modeling features we don't need. Re-evaluate only if we ever stand up a broader analytics-engineering practice.
D2GCS landing zone in Phase 1, or defer to Phase 5?Deferred (decided) โ€” billing_raw is the audit trail; revisit at a Phase 4 retro if we hit a need for raw-payload replay. Keeps Phase 1 simpler.
D3Where does ingest live: Cloud Run Jobs, Cloud Run Services with /cron endpoints, or stay in Firebase Functions?Cloud Run Jobs (decided) โ€” aligns with 2026-02-09_cloud-run-migration-optimized.md Phase 2B; provisioned via gcloud run jobs deploy in the existing GitHub Actions pipeline (no Terraform).
D4Move billing_snapshots out of billing_export_gcp?Yes, into new billing_app datasetSeparates app-owned tables from Google-managed exports.
D5Backfill strategy per vendor (which support history-replay?)Cloudflare โœ…, Railway โœ…, Anthropic โ“ (Console Usage & Cost APIs expose historical data; depth + granularity needs to be verified before Phase 4), Resend โœ… (GET /emails supports forward + backward cursor pagination โ€” paginate to the start of the account, group by day, derive cost from email-count ร— tier rate)Drives whether we treat snapshots as the source of truth or transient.
D6Taxonomy granularity v1 โ€” env/app, or env/app/team/feature?env + app + service (decided)Locked at the Phase 2 kickoff. Three dimensions, ~5-10 allocation rules to maintain. Maps to existing labels: GCP project โ†’ env, Cloudflare zone โ†’ app, vendor service field โ†’ service. Team / feature dimensions deferred โ€” re-evaluate at Phase 4 retro per ยง9 risk note.
D7Reporting surface โ€” Looker Studio, Grafana, or admin-only?Admin-only (decided)Admin console (apps/admin/src/admin/billing/Billing.jsx) is the single consumer surface. No Looker Studio / Grafana. Keeps IAM simple (admin role + service account, no extra BI tool access) and reuses existing chart components + auth. Re-evaluate only if a non-engineering audience needs direct BQ access.
D8Cost attribution for shared resources (e.g. shared Firebase project)Omit until Phase 4-5 (decided)Locked at the Phase 2 kickoff. Phase 2's only vendor (Cloudflare) has 1:1 zone โ†’ app mapping with no shared-resource ambiguity, so scaffolding attribution_method / attribution_weight columns now would be speculative. Add columns when the first vendor that needs them (likely Firebase / shared GCP project) lands. Tentative implementation when revisited: proportional split by request count.

9. Risks โ€‹

  • billing_export_gcp schema drift. Google can change GCP billing export columns. Already an issue: getGcpBillingWithBigQuery() has fallback handling. Mitigation: Dataform model owns the column-projection, adapt centrally.
  • BigQuery byte-billing on hot queries. Marts must be partitioned + clustered; admin endpoints must read marts only, not norm/raw. Same posture as bq-query-infrastructure plan (maximumBytesBilled is enforced there for ad-hoc; we should re-use for canned reports).
  • Vendor API rate limits + auth rotation. Cloudflare/Railway/GitHub tokens rotate; ingest jobs should fail loudly to ops.pipeline_runs and Slack.
  • Cardinality explosion in dim_cost_allocation_rules. Resist the temptation to add per-feature attribution before measuring overhead. Re-evaluate at Phase 4 retro.
  • Cost of the cost system. BigQuery storage + query for billing_raw/norm is itself a line-item we'll see in the data. Set a soft budget (suggested: $5/mo) and alarm if exceeded.

10. Out of scope (for this plan) โ€‹

  • Merchant-facing billing (invoices, Stripe, payment collection). Tracked in docs/economics/ECONOMICS.md and docs/business/MERCHANT_INTEGRATION_POA.md; separate concern from infrastructure cost monitoring.
  • Line-item-grain attribution to specific events (e.g. "this exact Claude API call cost $0.0021 and was for assistant message ID msg_abc123"). Cat 4 in ยง4 covers aggregate unit economics โ€” SUM(cost) / COUNT(events) over a time window โ€” which is what we actually need for pricing and cost trends. True line-item-level attribution requires capturing a usage receipt on every API call into a separate analytics.cost_attribution_events table; possible later if a use case demands it, but not on the roadmap.
  • FinOps showback / chargeback to feature teams. We're a small team โ€” revisit when org structure justifies it.
  • Expense reports, accounts payable, vendor payments, merchant payouts. This pipeline is read-only observability ("what did we spend?"). Anything involving moving money โ€” corporate-card expense reports (Ramp, Brex), vendor invoice payment (Bill.com), merchant payouts (Stripe Connect) โ€” is fiduciary / write-side and will use vetted SaaS, not anything we build. The natural one-way export from this pipeline (someday): billing_marts.* โ†’ Ramp coding rules so an inbound Cloudflare invoice gets auto-coded to engineering > infra > prod. That's a thin downstream integration, not a phase of this plan, and not something to design for now.

11. Next concrete step โ€‹

This PR adds only this planning doc. The follow-up PR(s) should be:

  1. First (1-2 days): Resolve PR #302 โ€” either land it, rebase it onto this plan's Phase 0, or close in favour of Phase 1 starting fresh.
  2. Then (Phase 1 PR): Create billing_raw dataset + services/jobs/ingest-cloudflare/ Cloud Run Job + ops.pipeline_runs table. Aim for ~400 LOC + a Cloud Scheduler config. One vendor end-to-end is more valuable than five vendors at the snapshot layer.

When picking up Phase 1, open a new issue ("Phase 1: Cloudflare ingest job + billing_raw dataset") that links back to this doc and to issue #52.

Built with VitePress