BigQuery Billing โ Phase 2: Normalize + Attribute (Cloudflare) โ
Last Updated: 2026-05-10 Status: Planning checkpoint (pre-implementation) Branch: claude/bigquery-billing-phase-2-v3K7yParent plan: docs/plans/2026-05-05_bigquery-billing-infrastructure.md ยง6 Phase 2 Phase 1 reference: PR #462 (scaffold) โ PR #465 (flesh-out); same pattern intended here.
1. Goal โ
Stand up the Dataform transform layer for the BigQuery billing pipeline, end-to-end for one vendor (Cloudflare). After Phase 2:
billing_raw.cloudflare_invoice_line_items_raw(Phase 1 output) is normalized into a vendor-agnosticbilling_norm.fact_cost_line_itemsdaily.- Rules in
billing_attrib.dim_cost_allocation_rulesjoinfact_cost_line_itemsagainstdim_taxonomyto producebilling_attrib.fact_cost_attributed. billing_marts.unattributed_cost_daily(view) surfaces attribution coverage.
Categories enabled per parent ยง4: 2.1 (cost-by-service / SKU), 2.2 (cost-by-resource), 3.1 (cost-by-environment), 3.2 (cost-by-app), 5.4 (attribution coverage). All for Cloudflare only.
2. Done when โ
-- 1. Cloudflare attributed totals match raw within 1%, sliced by env + app:
WITH raw AS (
SELECT snapshot_date, SUM(amount_usd) AS raw_total
FROM `lantern-app-dev.billing_raw.cloudflare_invoice_line_items_raw`
WHERE snapshot_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE()
GROUP BY snapshot_date
),
attr AS (
SELECT DATE(usage_start) AS snapshot_date, env, app,
SUM(cost_usd) AS attr_total
FROM `lantern-app-dev.billing_attrib.fact_cost_attributed`
WHERE vendor = 'cloudflare'
AND DATE(usage_start) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE()
GROUP BY snapshot_date, env, app
)
SELECT raw.snapshot_date, raw.raw_total,
SUM(attr.attr_total) AS attributed_total,
ABS(raw.raw_total - SUM(attr.attr_total)) / NULLIF(raw.raw_total, 0) AS delta_pct
FROM raw LEFT JOIN attr USING (snapshot_date)
GROUP BY raw.snapshot_date, raw.raw_total
ORDER BY raw.snapshot_date DESC;
-- All rows: delta_pct < 0.01
-- 2. Unattributed coverage check:
SELECT * FROM `lantern-app-dev.billing_marts.unattributed_cost_daily`
WHERE snapshot_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
-- All rows: unattributed_pct < 5%3. Locked decisions โ
| Ref | Decision | Source |
|---|---|---|
| D1 | Dataform (not dbt) | Parent ยง8 |
| D6 | Taxonomy: env + app + service | Phase 2 kickoff (2026-05-10) |
| D8 | Omit shared-resource attribution columns until Phase 4-5 | Phase 2 kickoff (2026-05-10) |
| P2-A | Repo: in-tree at services/dataform/, mirrored to GCP-side lantern-billing-transforms | Phase 2 kickoff |
| P2-B | Schedule: Dataform workflow_config, daily 03:30 PT (30 min after ingest-cloudflare) | Phase 2 kickoff |
| P2-C | CI: dataform compile only (no live BQ dry-run) on every PR | Phase 2 kickoff |
4. Repo layout โ
services/dataform/
โโโ workflow_settings.yaml # defaultProject, defaultLocation, defaultDataset
โโโ package.json # @dataform/core pin
โโโ definitions/
โ โโโ sources/
โ โ โโโ cloudflare_raw.sqlx # declares billing_raw.cloudflare_invoice_line_items_raw
โ โโโ norm/
โ โ โโโ fact_cost_line_items.sqlx # incremental, partitioned, clustered
โ โโโ attrib/
โ โ โโโ dim_taxonomy.sqlx # static seed table
โ โ โโโ dim_cost_allocation_rules.sqlx # static seed table
โ โ โโโ fact_cost_attributed.sqlx # incremental
โ โโโ marts/
โ โโโ unattributed_cost_daily.sqlx # view
โโโ includes/
โ โโโ constants.js # dataset names, partition keys
โ โโโ taxonomy_seed.js # initial dim_taxonomy rows
โ โโโ allocation_rules_seed.js # initial dim_cost_allocation_rules rows
โโโ README.mdworkflow_settings.yaml keyed off ${dataformCore.envOverrides} so the same compiled artifact targets lantern-app-dev or lantern-app-prod based on the workflow that invokes it.
5. Schemas โ
5.1 billing_norm.fact_cost_line_items โ
Vendor-agnostic. One row per billable cost event.
| Column | Type | Notes |
|---|---|---|
vendor | STRING REQUIRED | 'cloudflare' in Phase 2; 'gcp', 'railway', 'anthropic', 'resend', 'github' in Phase 4 |
service | STRING NULLABLE | Vendor's product/service name (e.g. 'workers', 'r2', 'zone_plan') |
sku | STRING NULLABLE | Vendor SKU / line-item code |
usage_amount | FLOAT NULLABLE | Quantity used |
usage_unit | STRING NULLABLE | e.g. 'requests', 'GB', 'hours' |
cost_usd | FLOAT REQUIRED | Cost in USD |
usage_start | TIMESTAMP REQUIRED | Period start; falls back to snapshot_date if vendor doesn't provide |
usage_end | TIMESTAMP NULLABLE | Period end |
resource_id | STRING NULLABLE | Vendor-shaped resource identifier โ for Cloudflare, zone_id; for GCP, project/instance |
labels | JSON NULLABLE | Vendor-specific extras (e.g. Cloudflare account_id, raw description) |
source_line_item_id | STRING REQUIRED | Foreign key into billing_raw.* for audit |
loaded_at | TIMESTAMP REQUIRED | Dataform run timestamp |
Partition: DAY on DATE(usage_start). Cluster: (vendor, service, resource_id). Incremental: WHERE DATE(usage_start) >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY) to handle late-arriving raw rows.
5.2 billing_attrib.dim_taxonomy โ
Static reference. Three dimensions per D6.
| Column | Type | Example |
|---|---|---|
env | STRING REQUIRED | 'dev', 'prod', 'shared' |
app | STRING REQUIRED | 'web', 'api', 'admin', 'functions', 'merchant-portal', 'analytics', 'unknown' |
service | STRING REQUIRED | Free-form vendor service name; '*' matches any |
description | STRING NULLABLE | Human-readable for review tooling |
Seeded by includes/taxonomy_seed.js. ~15-20 rows for the cartesian we care about. (Allocation rules โ ยง5.3 โ are seeded separately from includes/allocation_rules_seed.js; one seed file per dim table.)
5.3 billing_attrib.dim_cost_allocation_rules โ
Static. Match rules from raw fields to taxonomy.
| Column | Type | Notes |
|---|---|---|
rule_id | STRING REQUIRED | Stable identifier for tooling |
priority | INTEGER REQUIRED | Lower wins ties; default catch-alls have highest priority numbers |
vendor | STRING REQUIRED | Scope rule to a vendor |
match_field | STRING REQUIRED | One of 'resource_id', 'service', 'sku', or 'labels.<key>' (e.g. 'labels.account_id' for Cloudflare, 'labels.gcp_project_id' for GCP). The labels.<key> form is resolved with JSON_VALUE(labels, '$.' || <key>) at evaluation time. |
match_pattern | STRING REQUIRED | Exact match for v1; SQL LIKE pattern allowed ('%' wildcard) |
assigned_env | STRING NULLABLE | NULL means "don't override" |
assigned_app | STRING NULLABLE | NULL means "don't override" |
assigned_service | STRING NULLABLE | NULL โ use vendor's service field |
effective_from | DATE REQUIRED | For backfill correctness |
effective_to | DATE NULLABLE | NULL = open-ended |
Initial rules (per parent ยง6 Phase 2). Columns: vendor, match_field, match_pattern, assignment, priority.
gcp, labels.gcp_project_id, %-dev, env=dev (priority 100)
gcp, labels.gcp_project_id, %-prod, env=prod (priority 100)
cloudflare, resource_id, <ZONE_LANTERN_APP_COM>, app=web (priority 50)
cloudflare, resource_id, <ZONE_API_LANTERN_APP_COM>, app=api (priority 50)
cloudflare, service, zone_plan, service=zone_plan(priority 200)
cloudflare, *, %, app=unknown,env=shared (priority 999, catch-all)Phase 2 seeds Cloudflare rules only. GCP rules become live when GCP ingest lands (Phase 4) but the rule shape stays.
5.4 billing_attrib.fact_cost_attributed โ
Same shape as fact_cost_line_items plus three attribution columns:
| Added Column | Type | Notes |
|---|---|---|
env | STRING REQUIRED | Resolved from rules |
app | STRING REQUIRED | Resolved from rules |
attributed_service | STRING NULLABLE | Resolved (rules can override raw service) |
matched_rule_id | STRING NULLABLE | NULL when only the catch-all hit โ drives ยง5 unattributed marts |
No attribution_method / attribution_weight columns (D8 โ deferred).
Partition + cluster: same as fact_cost_line_items.
5.5 billing_marts.unattributed_cost_daily (view) โ
SELECT
DATE(usage_start) AS snapshot_date,
vendor,
SUM(cost_usd) AS total_cost_usd,
SUM(IF(matched_rule_id IS NULL OR app = 'unknown', cost_usd, 0)) AS unattributed_cost_usd,
SAFE_DIVIDE(
SUM(IF(matched_rule_id IS NULL OR app = 'unknown', cost_usd, 0)),
SUM(cost_usd)
) AS unattributed_pct
FROM ${ref('fact_cost_attributed')}
GROUP BY snapshot_date, vendor;Sources Category 5.4 in parent ยง4.
6. Cloudflare normalization mapping โ
billing_raw.cloudflare_invoice_line_items_raw โ fact_cost_line_items:
| Raw column | Norm column | Transform |
|---|---|---|
line_item_id | source_line_item_id | direct |
service | service | direct (NULL โ 'unknown') |
sku | sku | direct |
quantity | usage_amount | direct |
unit | usage_unit | direct |
amount_usd | cost_usd | direct |
period_start | usage_start | COALESCE(period_start, TIMESTAMP(snapshot_date)) |
period_end | usage_end | direct |
zone_id | resource_id | direct |
account_id, description, raw_payload | labels | JSON_OBJECT('account_id', account_id, 'description', description, 'raw_payload', raw_payload) โ produces a JSON value matching the column type. raw_payload is already JSON in the raw schema (see services/jobs/ingest-cloudflare/src/schemas.js), so it embeds without stringification. |
| (constant) | vendor | 'cloudflare' |
| (Dataform) | loaded_at | CURRENT_TIMESTAMP() |
Phase 1 finding (parent handoff): the /accounts/{id}/subscriptions endpoint returns two $0-cost free-tier subscriptions and misses zone-plan cost unless the fallback fires. Out of scope here โ fix in Phase 4 retro by always emitting zone-plan rows alongside subscriptions in cloudflareClient.js. Phase 2 normalization works correctly either way.
7. Deploy workflow changes โ
Mirror Phase 1's deploy-ingest-cloudflare-job step in both deploy-dev.yml and deploy-prod.yml. New job: deploy-dataform-billing-transforms. The deploy workflow ensures BigQuery datasets exist defensively (bq show then bq mk if missing) and provisions the Dataform repo + workflow config. It does not grant IAM โ see PR #474 for the manual-grant convention. Before this job's first run, a project owner adds roles/dataform.editor (project) and roles/bigquery.dataEditor on billing_norm, billing_attrib, billing_marts to the WIF SA via the loop documented in services/jobs/ingest-cloudflare/README.md.
deploy-dataform-billing-transforms:
needs: [check-ci-success, changes]
if: needs.changes.outputs.dataform == 'true' || needs.changes.outputs.shared-deps == 'true'
env:
PROJECT_ID: lantern-app-dev
REGION: us-central1
DATAFORM_REPO: lantern-billing-transforms
WORKFLOW_NAME: billing-daily
SCHEDULE: '30 3 * * *'
SCHEDULE_TZ: America/Los_Angeles
steps:
# ... auth + setup ...
- name: Ensure BigQuery datasets exist
run: |
for ds in billing_norm billing_attrib billing_marts; do
# bq show first; bq mk only when missing โ same pattern as Phase 1
done
# No IAM step here โ bindings are granted manually per
# services/jobs/ingest-cloudflare/README.md. New roles for Phase 2:
# roles/dataform.editor (project)
# roles/bigquery.dataEditor on billing_norm, billing_attrib, billing_marts
# Append to the loop in that README and re-run.
- name: Ensure Dataform repository
run: |
if ! gcloud dataform repositories describe $DATAFORM_REPO \
--region=$REGION --project=$PROJECT_ID > /dev/null 2>&1; then
gcloud dataform repositories create $DATAFORM_REPO \
--region=$REGION --project=$PROJECT_ID \
--git-remote-settings-url=https://github.com/cattreedev/lantern_app \
--git-remote-settings-default-branch=dev \
--git-remote-settings-token-secret-version=projects/$PROJECT_ID/secrets/GITHUB_DATAFORM_TOKEN/versions/latest
fi
- name: Ensure release configuration
run: |
# release_config = "billing-dev" tracking branch dev,
# compile_config.default_database = lantern-app-dev
- name: Ensure workflow configuration
run: |
# workflow_config = "billing-daily" referencing the release_config,
# cron_schedule = SCHEDULE, time_zone = SCHEDULE_TZchanges.outputs.dataform filter watches services/dataform/** (add to the existing dorny/paths-filter block).
8. CI changes (.github/workflows/ci.yml) โ
New dataform-compile job:
dataform-compile:
if: <changes in services/dataform/**>
steps:
- uses: actions/checkout@v6
- uses: actions/setup-node@v6
with: { node-version: 22, cache: npm, cache-dependency-path: services/dataform/package-lock.json }
- run: npm ci
working-directory: services/dataform
- run: npx @dataform/cli compile
working-directory: services/dataformNo GCP credentials. ~10s. Catches SQL syntax + broken ${ref(...)}.
9. Out of scope (Phase 4+) โ
- Other vendors (Railway, Anthropic, Resend, GitHub, GCP). Each adds a
definitions/<vendor>/directory and rules but reuses thefact_cost_line_items/fact_cost_attributedschema. - Phase 1 Cloudflare zone-plan fix (
cloudflareClient.js). - Cat 4 unit-economics views and Cat 5.1 anomaly detection.
attribution_method/attribution_weightcolumns (D8).- Team / feature taxonomy dimensions (D6).
- Phase 3 marts (
daily_cost_by_app_env_vendor,top_cost_drivers_30d, etc.) and admin endpoints.
10. Open follow-ups (non-blocking, from handoff) โ
- [x] Close PR #302 (superseded by Phase 1 PR #465) โ already closed; courtesy supersession comment posted.
- [x] Rename issue #240 โ was "OpenAI billing tracking", now "Add Anthropic billing to Admin Portal".
- [x] Drop or simplify the "Apply defensive IAM bindings (best-effort)" step in
deploy-{dev,prod}.ymlโ handled by PR #474 (deleted; manual-grant convention documented inservices/jobs/ingest-cloudflare/README.md).
11. Suggested implementation PR shape โ
Mirror Phase 1's #462 scaffold โ #465 flesh-out split:
- Scaffold PR (~200 LOC):
services/dataform/skeleton,workflow_settings.yaml, source declaration, emptyfact_cost_line_items.sqlx, CI compile job. No deploy-workflow changes yet. - Flesh-out PR (~400-500 LOC): Full normalize + attribute + marts SQLX, allocation-rules seed, deploy-workflow
deploy-dataform-billing-transformsjob, README. Done-when queries from ยง2 above run green againstlantern-app-dev.
Both PRs base on dev.