BigQuery Billing Infrastructure โ Plan โ
Last Updated: 2026-05-05 Status: Planning (pre-implementation) Author: drafted by Claude on claude/bigquery-billing-infrastructure-IZf4RRelated Issues / PRs:
- #52 โ Billing Monitoring (canonical spec for the BQ layout)
- #233 โ System Health: Add Billing/Cost Data (Phase 2)
- #302 โ Phase 2 historical snapshots + 1hr cache + BQ pipeline (open draft, partial impl)
- #240 โ AI assistant billing tracking (originally filed as "OpenAI billing"; the codebase now uses Anthropic โ issue title may need a follow-up rename)
- #238 โ Resend billing integration
- #188 โ AI usage transparency
docs/plans/2026-02-09_cloud-run-migration-optimized.mdโ Phase 2B references the combinedanalytics-api+billing-apiCloud Run deployment
1. Why this doc โ
Issue #52 describes a full multi-vendor BigQuery billing pipeline (raw โ normalized โ attributed โ marts + ops). PR #302 ships a much narrower first slice: a single denormalized billing_snapshots table written daily by a scheduled Cloud Function.
We have two artefacts pointing in roughly the same direction but at very different levels of ambition, and no single place that says "here is where we are, here is where we want to end up, here is the path between them." This doc is that place. It is not a coding spec โ it is a roadmap so the next implementation PR has a clear scope and the team has a shared mental model of the end state.
2. Current state (what already exists in dev) โ
Code in tree โ
| Layer | Location | What it does |
|---|---|---|
| Per-vendor billing fetch | services/functions/firebase/modules/billing.js | On-demand pulls for GCP, Cloudflare, Railway, GitHub. Includes getGcpBillingWithBigQuery() which queries billing_export_gcp.gcp_billing_export + cloud_pricing_export. |
| Vendor API helpers | services/functions/firebase/modules/billingShared.js | Cloudflare GraphQL, Railway GraphQL, GitHub Actions usage. |
| System Health surface | services/functions/firebase/modules/systemHealth.js + apps/admin/src/components/SystemHealth.jsx | 5-min health cache, 1-hr separate billing cache, MoM tile (PR #302). |
| BQ client (events) | packages/forge/bigquery.js | Streaming inserts to analytics.events. |
| BQ schema browser | services/api/analytics/src/services/bqSchema.service.js | Metadata-only, 5-min cache. |
| Daily MERGE rollup pattern | services/api/analytics/src/services/eventCountsAggregation.service.js | Reference pattern for "ingest-then-rollup" jobs. |
Datasets that exist today (per env: lantern-app-dev, lantern-app-prod) โ
analyticsโevents,event_counts_daily(events product, not billing)billing_export_gcpโgcp_billing_export,cloud_pricing_export(managed by Google Cloud Billing export, always-on, schema owned by Google). This stays exactly as-is โ the export is the right way to capture GCP cost and we don't modify it. Our pipeline reads from it; it does not read from us.- (PR #302 adds)
billing_export_gcp.billing_snapshotsโ auto-created on first scheduled-function run, daily denormalized rows, MONTH-partitioned. Co-locates app-owned data inside Google's managed-export dataset, which is the bit we want to fix in Phase 0 (see ยง6).
What PR #302 is doing right โ
- Captures a daily historical record from day one (the issue-#52 comment correctly flags that historical billing data cannot be retroactively captured)
- Idempotent (
insertId = YYYY-MM-DD) - Non-fatal BigQuery write โ Firestore fast-cache is the fallback
- Auto-creates the table with month partitioning + handles the 409 race
What PR #302 leaves on the table โ
- All vendors squashed into one row, one table (
cloudflare_cost_usd,railway_cost_usd, โฆ) โ fine as a snapshot, useless as a fact table - No GCS landing zone, no raw vendor payloads โ just totals
- No taxonomy / attribution layer โ can't slice by env / app / feature
- No anomaly / freshness / pipeline-runs ops tables
- Lives in the Firebase Functions module rather than a dedicated
billing-apiservice or Cloud Run job โ same place where on-demand fetches happen - Reuses
billing_export_gcp(a GCP-managed export dataset) for our app-managedbilling_snapshotsโ convenient but conflates two ownership models
3. Target architecture (issue #52, restated) โ
Cloud Scheduler
โโโ Cloud Run Job: ingest-cloudflare โโ
โโโ Cloud Run Job: ingest-railway โโโ GCS (raw payloads, optional)
โโโ Cloud Run Job: ingest-anthropic โ โ
โโโ Cloud Run Job: ingest-resend โ โผ
โโโ (always on) GCP Billing Export โโโโดโโ BigQuery: billing_raw.*
Dataform (or dbt) on schedule:
billing_raw.*
โ billing_norm.fact_cost_line_items (one canonical schema)
โ billing_attrib.fact_cost_attributed (env / app / team labels)
โ billing_marts.* (BI rollups)
Cloud Run Job: cost-health-check
โ ops.pipeline_runs / data_freshness / cost_anomalies
โ Slack / email alerts
Reporting:
Looker Studio + Admin Console โ billing_marts.*Datasets โ
| Dataset | Purpose | Write pattern |
|---|---|---|
billing_raw | Immutable, vendor-shaped raw rows | Insert-only by ingest jobs / managed exports |
billing_norm | Canonical fact_cost_line_items (one row = one billable cost event) | Built by Dataform/dbt from billing_raw |
billing_attrib | fact_cost_attributed + dim_taxonomy, dim_cost_allocation_rules | Built by Dataform/dbt; rules-driven |
billing_marts | Daily / monthly rollups for dashboards | MERGE jobs (cf. eventCountsAggregation.service.js) |
ops | pipeline_runs, data_freshness, cost_anomalies | Written by every ingest + health-check job |
billing_sandbox (optional) | Backfills, ad-hoc analysis | Free-form, throwaway |
Naming: fact_* for event-grain, dim_* for reference, daily_* / monthly_* for marts, everything in ops.* for observability. (Per issue #52.)
4. Categories of cost question โ
A useful organizing axis for the rest of this plan is the kind of question the data needs to answer. Five primary categories, with sub-categories where they're meaningfully different shapes. Sub-categories within a primary share fidelity requirements and source datasets โ they're slices, not separate products.
Throughout this doc and the codebase, "merchant" is the project nomenclature for the paying business (venue owner). "User" means the Lantern end-user (consumer). The two sub-categories under ยง4.x reflect that split.
| Sub | Question | Source dataset(s) |
|---|---|---|
| 1 | Cost-over-time โ "What did we spend last month? Show MoM trend." | billing_marts.daily_cost_by_vendor (PR #302's snapshot is a v0 of this) |
| 2.1 | Cost-by-service / SKU โ "Within GCP, what's costing most?" | billing_norm.fact_cost_line_items |
| 2.2 | Cost-by-resource โ "Which Cloud Run service / Cloudflare zone / Firestore collection is driving spend?" | billing_norm.fact_cost_line_items GROUP BY resource_id |
| 3.1 | Cost-by-environment โ "How much is dev vs prod?" | billing_attrib.fact_cost_attributed |
| 3.2 | Cost-by-app โ "web vs api vs admin vs functions vs merchant-portal?" | billing_attrib.fact_cost_attributed |
| 3.3 | Cost-by-team / cost-center โ deferred, single-team org. Documented for future reference; no implementation planned. | (future) |
| 4.1 | Cost-per-user โ usage-cost statistics for Lantern end-users (consumers). Not for revenue margin (no consumer revenue today) โ for understanding consumer-side cost trends as we scale. | billing_attrib.* โ analytics.events filtered to consumer events |
| 4.2 | Cost-per-merchant โ highest-priority unit-economics view. Feeds the merchant pricing model in docs/economics/ECONOMICS.md. | billing_attrib.* โ analytics.events keyed on merchant_id |
| 4.3 | Cost-per-assistant-interaction โ Anthropic API cost รท assistant message events. Tells us if conversation costs are sustainable per-merchant. | billing_attrib.* (Anthropic rows) โ analytics.events WHERE event_type = 'assistant.message' |
| 4.4 | Cost-per-product-event โ cost-per-lantern-lit / cost-per-offer-redemption / cost-per-venue-refresh. Specific events picked at Phase 4 once we know which KPIs the team is tracking. | billing_attrib.* โ analytics.events filtered to product KPIs |
| 5.1 | Anomaly detection โ "Did something just get more expensive?" z-score baseline alerting. | ops.cost_anomalies |
| 5.2 | Forecasting โ "Are we on track to exceed budget this month?" | billing_marts.forecast_month_end |
| 5.3 | Pipeline freshness โ "Did every vendor land a row today?" | ops.data_freshness, ops.pipeline_runs |
| 5.4 | Attribution coverage โ "What % of cost is unattributed? Are our rules decaying?" | billing_marts.unattributed_cost_daily |
Why this matters for the rollout โ
The arc is monotonic in fidelity โ each primary category needs everything the previous one needed plus more. That maps almost 1:1 to the phases in ยง6:
- Phase 1 enables Cat 1 (vendor totals over time) and 5.3 (pipeline freshness comes for free with
ops.pipeline_runs). - Phase 2 enables 2.1, 2.2, 3.1, 3.2 for one vendor + 5.4 (attribution coverage starts being measurable as soon as attribution runs).
- Phase 3 enables full consumption of Cats 1-3 via the admin surface + 5.2 (forecast can be built on MTD attributed cost).
- Phase 4 fans out remaining vendors and lights up Cat 4 (event-joined unit economics needs attributed data on multiple vendors to be meaningful) + 5.1 (anomaly detection needs โฅ14d trailing baseline, which lands here).
Cat 5 sub-categories are not their own phase โ they're folded into Phases 1-4 by their natural prerequisites. The original plan parked all of Cat 5 in a final Phase 5; pulling them forward gives the team optimization-targeting signal (5.1 alongside the Cat 2/3 cost-driver views) at the same time as the visibility layer, not after it. That matters because "where is it expensive?" is one of the highest-value uses of cost data, not a polish item.
5. Gap analysis (current โ target) โ
| Capability | Current | Target | Gap |
|---|---|---|---|
| GCP cost capture | Managed export โ | Same | None |
| Cloudflare cost capture | On-demand fetch + daily snapshot total | Daily ingest into billing_raw.cloudflare_invoice_line_items_raw | Need ingest job, raw schema, GCS landing |
| Railway cost capture | Same as Cloudflare | Same | Same |
| Anthropic cost capture | None (#240 open โ originally OpenAI, now Anthropic) | Daily ingest from Anthropic Console API, line-item grain | Whole vertical |
| Resend cost capture | None (#238 open) | Same | Whole vertical |
| GitHub cost capture | On-demand fetch + daily snapshot total | Daily ingest (Actions usage + Copilot) | Need ingest job |
| Canonical fact table | Single denormalized snapshot row | billing_norm.fact_cost_line_items | Need normalization layer (Dataform/dbt) |
| Attribution (env / app / team) | None | billing_attrib.fact_cost_attributed | Whole layer + taxonomy + rules |
| Rollups | Snapshot-grain only | billing_marts.daily_*, monthly_*, top_cost_drivers_30d, unattributed_cost_daily, cost_trend_rolling_7d | Whole layer |
| Pipeline observability | App logs only | ops.pipeline_runs, data_freshness, cost_anomalies | Whole layer |
| Alerting | None | Slack/email on freshness lag + anomaly | Need alerting hooks |
| Backfill story | N/A (snapshots are daily-going-forward only) | billing_sandbox.backfill_* for vendors that expose history | Need per-vendor backfill scripts |
| Reporting surface | System Health "Billing" tab | + Looker Studio + admin Cost dashboard | Need marts โ admin endpoints |
| Service home | Cloud Functions | Cloud Run job(s) + billing-api service | Need migration (aligns with Phase 2B in cloud-run plan) |
6. Phased rollout โ
Five phases (numbered 0-4). Each phase is independently shippable, leaves the system in a working state, and unlocks specific cost-question categories from ยง4. PR #302 is roughly two-thirds of Phase 0; land or rebase it before starting Phase 1.
The original plan had a Phase 5 for ops/alerting/forecasting; that phase has been dissolved and its sub-categories (5.1-5.4) folded into Phases 1-4 by their natural prerequisites. See the note at the end of this section.
Phase 0 โ Land what's in flight (PR #302) โ this week โ
- Resolve PR #302 review and merge (or close + cherry-pick the table-creation helper into Phase 1).
- Move the
billing_snapshotstable out ofbilling_export_gcpand into a new app-owned datasetbilling_app. Rationale:billing_export_gcpis managed by Google Cloud Billing export โ its schema, partitioning, and retention policies are controlled by Google. Co-locating our app-written tables inside it conflates two ownership models and risks us having to re-architect later if Google changes export semantics. Hygiene only โ the GCP export itself is automatic, correct, and untouched. - Categories enabled: rough Cat 1 at snapshot grain (totals only, no service / resource / attribution slicing yet).
- Action items:
- Decide: keep snapshot table or treat it as throwaway once Phase 1 lands? (Recommendation: keep โ it's a useful sanity check vs. the normalized fact table and provides graceful degradation if ingest jobs lag.)
- Create
billing_appdataset in dev + prod, move the table, updateBIGQUERY_BILLING_DATASETenv (or split into two env vars: one pointing at the Google-managed export, one at the app-owned dataset).
Phase 1 โ Stand up billing_raw + first ingest job โ 2 weeks โ
Goal: one Cloud Run Job, one vendor (start with Cloudflare โ has the cleanest GraphQL billing API and we already have the client code in billingShared.js).
- Create
billing_rawdataset in dev + prod (manualbq mk, document indocs/economics/billing/PLATFORMS.md). - New Cloud Run Job:
services/jobs/ingest-cloudflare/(mirrors the structure ofservices/api/*). Daily Cloud Scheduler trigger. Idempotent on(snapshot_date, line_item_id). - Optional GCS landing:
gs://lantern-billing-raw/cloudflare/YYYY/MM/DD/*.json(skip in Phase 1 if it complicates things โbilling_rawrows are sufficient for an audit trail, and we can backfill GCS later from the raw table). - New
ops.pipeline_runstable โ every ingest job writes a row at start + end. This single table is also the source for ยง4 sub-category 5.3 (Pipeline freshness) โ cheap to add at this phase, pays off immediately. - Categories enabled: full Cat 1 for Cloudflare (line-item-grain rather than just totals); 5.3 Pipeline freshness.
- Done when:
billing_raw.cloudflare_invoice_line_items_rawhas 7+ days of rows andops.pipeline_runsshows 7 successful executions.
Phase 1 implementation approach (locked) โ
- No Terraform. Provision dataset, Cloud Run Job, and Scheduler trigger via
bq mk/gcloud run jobs deploy/gcloud scheduler jobs createsteps in the existing.github/workflows/deploy-{dev,prod}.ymlpipeline. Matches the convention already in use; revisit Terraform if infra grows. - IAM bindings included defensively. The Phase 1 workflow runs
gcloud projects add-iam-policy-bindingfor the WIF service account on each role we need (roles/bigquery.dataOwneron the new datasets,roles/run.developer,roles/cloudscheduler.admin,roles/secretmanager.secretAccessor). These are idempotent โ safe to run even if scopes are already granted. - Cloudflare token scope verification. PR #302's source issue (#233) flagged that
Account โ Billing โ Readmay need to be added to the existingCLOUDFLARE_API_TOKEN. The ingest job will fail loudly on a 401/403 from the billing endpoint and emit a clear error toops.pipeline_runs.error. Verification step is captured in the Phase 1 PR's "before merge" checklist. - Authentication. The Cloud Run Job runs as the same WIF service account the deploy uses; reads secrets directly from Secret Manager via the existing
defineSecretpattern (services/functions/firebase/config.js).
Phase 2 โ Normalize + attribute (one vendor) โ 2 weeks โ
Goal: prove the Dataform/dbt layer with one vendor before fanning out.
- Decision needed: Dataform vs dbt. (See ยง8.) Tentative recommendation: Dataform โ native to GCP, no extra infra, free for the foreseeable future. dbt is more powerful but adds a runner (dbt Cloud or self-hosted) we don't need.
- Define
billing_norm.fact_cost_line_itemsschema (vendor-agnostic columns:vendor,service,sku,usage_amount,usage_unit,cost_usd,usage_start,usage_end,resource_id,labelsJSON, โฆ). - First Dataform model:
cloudflare_invoice_line_items_rawโfact_cost_line_items. - Define
dim_taxonomy(env, app, team, cost_center) โ start with three values per dimension, expand later. - Define
dim_cost_allocation_rulesโ initial rules: GCP project-id label โ env (*-devโdev,*-prodโprod); Cloudflare zone โ app (lantern-app.comโ web,api.lantern-app.comโ api). - First attribution model:
fact_cost_line_itemsโfact_cost_attributed. - Add a derived view
billing_marts.unattributed_cost_daily(everything infact_cost_line_itemsnot infact_cost_attributedfor the same period). Sources ยง4 sub-category 5.4 (Attribution coverage). - Categories enabled: 2.1 (cost-by-service / SKU), 2.2 (cost-by-resource), 3.1 (cost-by-environment), 3.2 (cost-by-app) โ all for Cloudflare; 5.4 Attribution coverage.
- Done when: a SQL query against
fact_cost_attributedreturns Cloudflare costs sliced by env + app, matching the totals inbilling_snapshotsto within 1%.
Phase 2 implementation approach (locked) โ
Decisions locked at Phase 2 kickoff (2026-05-10). See docs/superpowers/plans/2026-05-10-bigquery-billing-phase-2.md for the detailed spec.
- Transform tool: Dataform (D1, already locked in ยง8 above).
- Repo location: in-tree at
services/dataform/โ alongsideservices/jobs/ingest-cloudflare/. SQLX models, JS includes, and theworkflow_settings.yamlall live in the GitHub repo and are reviewed together with ingest changes. The GCP-side Dataform repository (lantern-billing-transformsinlantern-app-{dev,prod}) mirrors from GitHub viagcloud dataform repositories create --git-remote-settings. - Schedule: Dataform
workflow_config, daily 03:30 PT. Native to Dataform, no extra Cloud Run Job required, timezone-aware. Runs 30 min afteringest-cloudflare(03:00 PT) so the day's raw rows are present before normalization. Provisioned viagcloud dataformcalls in the samedeploy-{dev,prod}.ymlpattern Phase 1 established. No Pub/Sub coupling in Phase 2 โ Phase 4's multi-vendor fan-out is the natural revisit point. - CI gating: compile-only. A new
dataform-compilejob in.github/workflows/ci.ymlrunsdataform compileon every PR (no GCP credentials required, ~10s). Catches SQL syntax errors and broken refs pre-merge. BQ-against-live dry-runs are deliberately deferred โ the daily dev workflow is the live-validation surface. - Taxonomy: env + app + service (D6, locked above).
- Shared-resource attribution: omitted from Phase 2 (D8, locked above).
fact_cost_attributedcarries direct attribution only; noattribution_method/attribution_weightcolumns yet. - Datasets created defensively in deploy workflow (mirror Phase 1's
bq mkpattern):billing_norm,billing_attrib,billing_marts. - IAM bindings (best-effort, same
continue-on-errorpattern as Phase 1): the WIF SA needsroles/dataform.editoron the project plusroles/bigquery.dataEditoron the three new datasets.
Phase 3 โ Marts + Admin surface โ 1-2 weeks โ
- Materialize the marts from issue #52 (
daily_cost_by_app_env_vendor,monthly_cost_by_app,top_cost_drivers_30d,cost_trend_rolling_7d,forecast_month_end). Use the daily-MERGE pattern fromeventCountsAggregation.service.js. forecast_month_endmart โ naive linear projection of MTD attributed spend. Sources ยง4 sub-category 5.2 (Forecasting).- New endpoints in (or alongside)
services/api/analytics/โ initiallyGET /admin/billing/daily,/admin/billing/top-drivers,/admin/billing/forecast. These read only frombilling_marts.*, never frombilling_normorbilling_raw(latency + byte-billed cost). - Admin UI: extend
apps/admin/src/admin/billing/Billing.jsxwith a "Historical" tab that hits the new endpoints. Reuse existing chart components. - Categories enabled: Cats 1 + 2 + 3 consumable end-to-end via admin for Cloudflare; 5.2 Forecasting.
- Done when: admin can see a 30-day trend chart by env, by app, and by vendor; forecast tile renders; "unattributed %" tile is < 5%.
Phase 4 โ Fan out to remaining vendors + unit economics + anomalies โ 2-3 weeks (parallelizable) โ
Two parallel tracks:
Track A: Vendor fan-out (~1-3 days each given the Phase 1-2 template):
ingest-railway(no tracking issue today; replaces the on-demand fetch inservices/functions/firebase/modules/billing.js)ingest-anthropic(closes #240 โ needsANTHROPIC_ADMIN_API_KEYin Secret Manager; pulls from the Anthropic Console Usage & Cost APIs)ingest-resend(closes #238)ingest-github(replaces the on-demand fetch for snapshots)
Each adds raw โ norm โ attrib stages mirroring Phase 2. Marts and admin endpoints already generalize from Phase 3.
Track B: Cat 4 unit economics + Cat 5.1 anomaly detection (depends on Track A):
- First Cat-4 views once Anthropic ingest lands: 4.3 Cost-per-assistant-interaction (Anthropic API cost โ
analytics.events WHERE event_type = 'assistant.message') and 4.2 Cost-per-merchant (attributed cost โ events keyed onmerchant_id). These are the highest-value views per ยง4 โ start here. - Fast-follows: 4.1 Cost-per-user, 4.4 Cost-per-product-event (pick 1-2 product KPIs based on what the team is tracking).
- Surface as new admin endpoints
/admin/billing/unit-economics/*and a "Unit Economics" tab inBilling.jsx. - Once trailing 14-day baseline exists across vendors, materialize
ops.cost_anomaliesโ z-score per (vendor, env). Slack alert on |z| > 3.notification_logrecords every alert sent. Sources ยง4 sub-category 5.1 (Anomaly detection). - Categories enabled: Cats 1-3 for all vendors; Cat 4 unit economics (4.2, 4.3 first, then 4.1 / 4.4); 5.1 Anomaly detection.
- Done when: cost-per-merchant and cost-per-assistant-message tiles render in admin; a deliberately-injected anomaly fires a Slack alert within 1 hour.
Note: where the dissolved Phase 5 work landed โ
The original plan parked all operational-health work (Cat 5: anomaly, forecasting, freshness, attribution coverage) in a final Phase 5. In practice each Cat 5 sub-category has a prerequisite that aligns with an earlier phase:
| Sub | Prereq | Lands in |
|---|---|---|
| 5.3 Pipeline freshness | ops.pipeline_runs (any ingest job) | Phase 1 |
| 5.4 Attribution coverage | Attribution model running | Phase 2 |
| 5.2 Forecasting | MTD attributed cost | Phase 3 |
| 5.1 Anomaly detection | โฅ14-day trailing baseline across vendors | Phase 4 |
Pulling Cat 5 forward gives the team optimization-targeting signal โ the single highest-value use of cost data โ at the same time as the visibility layer, not after it.
7. Build vs Buy โ decision and rationale โ
Decision: Build. A SaaS option (Vantage Starter, free at our scale) was seriously considered. The deciding factor: the integration coverage gap and the unit-economics ceiling.
Pricing landscape (verified May 2026, for context) โ
| Tool | Pricing model | Free tier | Fits us today? |
|---|---|---|---|
| Vantage | Tiered by monthly cloud spend: Starter $0 (โค$2.5k/mo), Pro $30 (โค$7.5k), Business $200 (โค$20k) | โ Starter | Yes (we're ~$50/mo) |
| CloudZero | $19 per $1k of monthly AWS spend | โ | No โ mid-market floor |
| Finout | ~1% of cloud bill, fixed yearly | โ | No โ mid-market floor |
| Cloudability (IBM) | 2-3% of cloud spend | โ | No โ enterprise |
So the only realistic SaaS at our scale is Vantage Starter, free.
Why Build wins for us โ
Integration gap is fatal. Vantage GA covers AWS / Azure / GCP / Kubernetes / Datadog / Snowflake / OpenAI / Vercel / Fastly. Anthropic is not in Vantage's GA list at time of writing โ relevant because the codebase uses Anthropic, not OpenAI, so the Vantage OpenAI integration doesn't help us. Cloudflare status was uncertain at time of writing (on the public roadmap as of 2022, not confirmed shipped). Railway, Resend, GitHub Actions are not supported. That means we'd be building Cloudflare (maybe), Railway, Resend, and GitHub Actions ingest jobs ourselves regardless โ at which point Vantage is a duplicative $0-30/mo dependency covering only ~50% of our cost surface, not a replacement for the homegrown stack.
Unit economics require BQ-native joins. The high-leverage product questions for this codebase aren't "what did we spend?" โ they're "what does it cost to serve a merchant / send an assistant message / light a lantern?". Those answers live in joins between billing data and
analytics.events(e.g.cost-per-active-merchant = daily_cost_attributed_to_merchant_features / count(distinct merchant_id) FROM events WHERE event_type IN (...)). Vantage cannot seeanalytics.events. With both datasets in the same BigQuery project, it's a same-day JOIN. This is the durable differentiator that pays for the engineering time, not anomaly detection or forecasting (those are commodity features Vantage ships for free).Taxonomy + rules in version control.
dim_cost_allocation_ruleslives in our repo, gets reviewed, has tests. SaaS taxonomy lives in someone else's web UI.
Honest cost of Build โ
- ~3-5 weeks of one engineer for Phases 1-5.
- Reimplementing commodity features (anomaly z-score, forecasts, alert routing) that Vantage ships out of the box.
- Operational ownership of ingest jobs, schema migrations, BQ byte-billed spend on hot queries.
We accept these costs because reasons #1 and #2 above are not workable around. Going in-house regardless of how Vantage's roadmap evolves โ even if Cloudflare shipped on their side, reason #2 (BQ-native unit economics) stands on its own.
8. Open decisions โ
Mixed state: rows marked (decided) were resolved during the planning conversation and are locked. The remaining rows (D4, D5) carry tentative recommendations or research findings, and still need reviewer sign-off before their phase starts.
| # | Decision | Tentative | Why it matters |
|---|---|---|---|
| D0 | Build vs Buy | Build (decided โ see ยง7) | Resolved. Driven by integration coverage gaps (Railway / Resend / GitHub Actions unsupported by Vantage) and the need for BQ-native joins with analytics.events to answer unit-economics questions a SaaS can't see. Sanity-check Cloudflare integration status before Phase 1 starts. |
| D1 | Dataform vs dbt for the transform layer | Dataform (decided) | Native to GCP โ same project / auth / UI as BigQuery, free, no external runner. dbt's strengths (large package ecosystem, mature testing) don't apply much to billing transforms (linear SQL over a handful of tables); we'd be paying dbt Cloud or self-hosting dbt Core to win generic-modeling features we don't need. Re-evaluate only if we ever stand up a broader analytics-engineering practice. |
| D2 | GCS landing zone in Phase 1, or defer to Phase 5? | Deferred (decided) โ billing_raw is the audit trail; revisit at a Phase 4 retro if we hit a need for raw-payload replay. Keeps Phase 1 simpler. | |
| D3 | Where does ingest live: Cloud Run Jobs, Cloud Run Services with /cron endpoints, or stay in Firebase Functions? | Cloud Run Jobs (decided) โ aligns with 2026-02-09_cloud-run-migration-optimized.md Phase 2B; provisioned via gcloud run jobs deploy in the existing GitHub Actions pipeline (no Terraform). | |
| D4 | Move billing_snapshots out of billing_export_gcp? | Yes, into new billing_app dataset | Separates app-owned tables from Google-managed exports. |
| D5 | Backfill strategy per vendor (which support history-replay?) | Cloudflare โ
, Railway โ
, Anthropic โ (Console Usage & Cost APIs expose historical data; depth + granularity needs to be verified before Phase 4), Resend โ
(GET /emails supports forward + backward cursor pagination โ paginate to the start of the account, group by day, derive cost from email-count ร tier rate) | Drives whether we treat snapshots as the source of truth or transient. |
| D6 | Taxonomy granularity v1 โ env/app, or env/app/team/feature? | env + app + service (decided) | Locked at the Phase 2 kickoff. Three dimensions, ~5-10 allocation rules to maintain. Maps to existing labels: GCP project โ env, Cloudflare zone โ app, vendor service field โ service. Team / feature dimensions deferred โ re-evaluate at Phase 4 retro per ยง9 risk note. |
| D7 | Reporting surface โ Looker Studio, Grafana, or admin-only? | Admin-only (decided) | Admin console (apps/admin/src/admin/billing/Billing.jsx) is the single consumer surface. No Looker Studio / Grafana. Keeps IAM simple (admin role + service account, no extra BI tool access) and reuses existing chart components + auth. Re-evaluate only if a non-engineering audience needs direct BQ access. |
| D8 | Cost attribution for shared resources (e.g. shared Firebase project) | Omit until Phase 4-5 (decided) | Locked at the Phase 2 kickoff. Phase 2's only vendor (Cloudflare) has 1:1 zone โ app mapping with no shared-resource ambiguity, so scaffolding attribution_method / attribution_weight columns now would be speculative. Add columns when the first vendor that needs them (likely Firebase / shared GCP project) lands. Tentative implementation when revisited: proportional split by request count. |
9. Risks โ
billing_export_gcpschema drift. Google can change GCP billing export columns. Already an issue:getGcpBillingWithBigQuery()has fallback handling. Mitigation: Dataform model owns the column-projection, adapt centrally.- BigQuery byte-billing on hot queries. Marts must be partitioned + clustered; admin endpoints must read marts only, not norm/raw. Same posture as
bq-query-infrastructureplan (maximumBytesBilledis enforced there for ad-hoc; we should re-use for canned reports). - Vendor API rate limits + auth rotation. Cloudflare/Railway/GitHub tokens rotate; ingest jobs should fail loudly to
ops.pipeline_runsand Slack. - Cardinality explosion in
dim_cost_allocation_rules. Resist the temptation to add per-feature attribution before measuring overhead. Re-evaluate at Phase 4 retro. - Cost of the cost system. BigQuery storage + query for billing_raw/norm is itself a line-item we'll see in the data. Set a soft budget (suggested: $5/mo) and alarm if exceeded.
10. Out of scope (for this plan) โ
- Merchant-facing billing (invoices, Stripe, payment collection). Tracked in
docs/economics/ECONOMICS.mdanddocs/business/MERCHANT_INTEGRATION_POA.md; separate concern from infrastructure cost monitoring. - Line-item-grain attribution to specific events (e.g. "this exact Claude API call cost $0.0021 and was for assistant message ID
msg_abc123"). Cat 4 in ยง4 covers aggregate unit economics โSUM(cost) / COUNT(events)over a time window โ which is what we actually need for pricing and cost trends. True line-item-level attribution requires capturing a usage receipt on every API call into a separateanalytics.cost_attribution_eventstable; possible later if a use case demands it, but not on the roadmap. - FinOps showback / chargeback to feature teams. We're a small team โ revisit when org structure justifies it.
- Expense reports, accounts payable, vendor payments, merchant payouts. This pipeline is read-only observability ("what did we spend?"). Anything involving moving money โ corporate-card expense reports (Ramp, Brex), vendor invoice payment (Bill.com), merchant payouts (Stripe Connect) โ is fiduciary / write-side and will use vetted SaaS, not anything we build. The natural one-way export from this pipeline (someday):
billing_marts.*โ Ramp coding rules so an inbound Cloudflare invoice gets auto-coded toengineering > infra > prod. That's a thin downstream integration, not a phase of this plan, and not something to design for now.
11. Next concrete step โ
This PR adds only this planning doc. The follow-up PR(s) should be:
- First (1-2 days): Resolve PR #302 โ either land it, rebase it onto this plan's Phase 0, or close in favour of Phase 1 starting fresh.
- Then (Phase 1 PR): Create
billing_rawdataset +services/jobs/ingest-cloudflare/Cloud Run Job +ops.pipeline_runstable. Aim for ~400 LOC + a Cloud Scheduler config. One vendor end-to-end is more valuable than five vendors at the snapshot layer.
When picking up Phase 1, open a new issue ("Phase 1: Cloudflare ingest job + billing_raw dataset") that links back to this doc and to issue #52.