Cloud Run Migration Strategy (Optimized)

Refactor: Server-Side Configs (Cloud Run vs Cloud Functions & Client-to-Server Migrations)

Last Updated: 2026-02-09 Status: Planning Related Issues:

#224 (OSM Import Security)
#394 (Venue Refresh Abuse)
#52 (BigQuery Billing Pipeline)
#233 (System Health Billing)
#240 (OpenAI Billing)
#238 (Resend Billing)
#188 (AI Usage Transparency)

🎯 Executive Summary

This document outlines an optimized migration strategy from Cloud Functions to Cloud Run for services that require:

Long-running operations (>1 minute)
Batch processing (100+ items)
WebSocket/persistent connections
Multi-step workflows
Native/container dependencies
Complex state management
Multi-vendor data aggregation (billing, analytics)
Historical data retention (BigQuery)

Key Optimizations:

Consolidate related services to reduce operational overhead and improve code reuse
Combine analytics + billing infrastructure to maximize BigQuery investment and capture historical data from day 1
Address 8 open issues in unified deployments (security, billing, transparency)

🚀 High Priority – Cloud Run Services

1. Venue API (Combined Import + Refresh) 🔴 CRITICAL

Current State:

OSM import: CLI script with client-side triggers
Venue refresh: Client-side triggers with public write access (firestore.rules:394)
No Google Places integration yet

Why Combined Service:

Shared Domain Logic: Both handle venue data validation, deduplication, normalization
Unified Rate Limiting: Reusable infrastructure across OSM, Nominatim, Google Places
API Source Flexibility: Adding Google Places becomes a new route, not a new service
Operational Efficiency: Single deployment, shared connection pools, lower costs
Future-Proof: Easy to add more sources (Yelp, Foursquare), centralized quality scoring

Recommended Architecture:

Cloud Run Service: venue-api

javascript

// ===== Import Operations =====
POST /venues/import/osm
  Body: { lat, lng, radius, limit, filters }
  Returns: { jobId, status, estimatedTime }
  Description: Import venues from OSM Overpass API

POST /venues/import/google
  Body: { lat, lng, radius, types, keyword }
  Returns: { jobId, status, estimatedTime }
  Description: Import venues from Google Places API (future)

POST /venues/import/hybrid
  Body: { lat, lng, radius, strategy: "osm-primary" | "google-primary" }
  Returns: { jobId, status, estimatedTime }
  Description: Combined OSM + Google enrichment with deduplication

GET /venues/import/jobs/:jobId
  Returns: { status, progress, venuesProcessed, errors, results }
  Description: Check import job status and results

// ===== Enrichment & Refresh =====
POST /venues/:venueId/enrich
  Body: { fields: ["photos", "hours", "contact", "rating"] }
  Returns: { venue, enrichedFields, source }
  Description: Enrich single venue with external API data

POST /venues/refresh/batch
  Body: { venueIds: [...], priority: "high" | "normal" }
  Returns: { jobId, status, count }
  Description: Batch refresh venue metadata

POST /venues/refresh/scheduled
  Headers: X-CloudScheduler-JobName
  Returns: { processed, updated, errors }
  Description: Cloud Scheduler endpoint for automated refresh

// ===== Utilities =====
POST /venues/consolidate
  Body: { region: { lat, lng, radius }, strategy: "strict" | "fuzzy" }
  Returns: { duplicatesFound, merged, deleted }
  Description: Deduplicate venues in region

POST /venues/geocode
  Body: { address } or { lat, lng }
  Returns: { address, coordinates, confidence }
  Description: Forward or reverse geocoding via Nominatim

GET /venues/:venueId/metadata
  Returns: { venue, sources, lastUpdated, quality }
  Description: Get venue details with provenance info

// ===== Admin Operations =====
POST /venues/cleanup/orphaned
  Returns: { deleted, errors }
  Description: Remove venues with no activity for 90+ days

POST /venues/validate/all
  Returns: { total, invalid, fixed, errors }
  Description: Data quality check and auto-fix

POST /venues/rebuild-index
  Returns: { indexed, errors }
  Description: Rebuild search/geo indexes

Internal Structure:

venue-api/
├── src/
│   ├── index.js                   # Express app entry point
│   ├── routes/
│   │   ├── import.routes.js       # OSM, Google, hybrid imports
│   │   ├── refresh.routes.js      # Enrichment & scheduled tasks
│   │   ├── utility.routes.js      # Geocoding, consolidation
│   │   ├── admin.routes.js        # Cleanup, validation
│   ├── services/
│   │   ├── osm.service.js         # OSM Overpass API client
│   │   ├── google.service.js      # Google Places API client
│   │   ├── nominatim.service.js   # Geocoding service
│   │   ├── venue.service.js       # Core venue logic
│   │   ├── enrichment.service.js  # Data enrichment
│   ├── queue/
│   │   ├── job-manager.js         # Queue handling (Cloud Tasks)
│   │   ├── rate-limiter.js        # Global rate limiting
│   ├── middleware/
│   │   ├── auth.js                # Firebase Admin validation
│   │   ├── rate-limit.js          # Per-source rate limits
│   │   ├── error-handler.js       # Unified error handling
│   ├── utils/
│   │   ├── deduplication.js       # Venue matching algorithms
│   │   ├── validation.js          # Data validation
│   │   ├── cache.js               # In-memory caching
├── Dockerfile
├── package.json
└── cloudbuild.yaml

Benefits:

Security: Server-side validation prevents client abuse (fixes Issue #224)
Rate Limiting: Global throttling for OSM (5 req/sec), Nominatim (1 req/sec), Google Places
Batch Processing: Import 1000+ venues in single job (vs Cloud Functions 9-min timeout)
Deduplication: Automatic merging of OSM + Google Places data
Scheduled Jobs: Cloud Scheduler → nightly venue refresh for high-traffic locations
Audit Trail: Full logging of all import operations
Cost Efficiency: Single service vs two separate deployments

Migration Priority: 🔴 Phase 1 (Weeks 1-2) - Addresses critical security issues

2. Admin API (Consolidated) 🟡 HIGH PRIORITY

Current State:

10+ separate Cloud Functions for admin operations
Each function has separate auth, logging, error handling
No batch operations support
Difficult local development

Why Consolidated Service:

Shared Middleware: Auth, validation, logging, error handling
Unified API: Single endpoint for admin portal
Batch Operations: Process 100+ users/venues in single request
Better DX: Local development with docker run
Reduced Cold Starts: 1 warm service vs 10+ separate functions
Complex Workflows: Multi-step admin actions in single transaction

Recommended Architecture:

Cloud Run Service: admin-api

javascript

// ===== User Management =====
// Migrate existing Cloud Functions
GET  /admin/users/:uid
  Description: Get user by ID (replaces getUserByEmail function)

POST /admin/users/:uid/role
  Body: { role: "admin" | "merchant" | "user" }
  Description: Set user role (replaces setUserRole function)

POST /admin/users/:uid/ban
  Body: { reason, duration }
  Description: Ban user (replaces banUser function)

DELETE /admin/users/:uid/ban
  Description: Unban user (replaces unbanUser function)

// New batch operations
POST /admin/users/bulk-import
  Body: { users: [...], sendInvites: boolean }
  Returns: { jobId, status, count }
  Description: Import users from CSV

POST /admin/users/bulk-ban
  Body: { userIds: [...], reason }
  Returns: { banned, errors }
  Description: Batch ban users

POST /admin/users/bulk-role
  Body: { userIds: [...], role }
  Returns: { updated, errors }
  Description: Batch update user roles

GET /admin/users/search
  Query: ?q=email@example.com&role=admin&limit=50
  Returns: { users, total, hasMore }
  Description: Search users with filters

// ===== Venue Management =====
POST /admin/venues/bulk-update
  Body: { venueIds: [...], updates: {...} }
  Returns: { updated, errors }
  Description: Batch update venue metadata

POST /admin/venues/cleanup
  Body: { criteria: "inactive" | "duplicate" | "invalid" }
  Returns: { deleted, merged, errors }
  Description: Cleanup venues by criteria

GET /admin/venues/audit
  Returns: { venues, issues, recommendations }
  Description: Venue data quality audit

// ===== Content Moderation =====
GET /admin/moderation/queue
  Query: ?type=profile|lantern|chat&status=pending
  Returns: { items, total }
  Description: Get moderation queue

POST /admin/moderation/:itemId/approve
  Returns: { status }
  Description: Approve content

POST /admin/moderation/:itemId/reject
  Body: { reason, action: "hide" | "delete" | "ban-user" }
  Returns: { status }
  Description: Reject content

// ===== Analytics & Reports =====
POST /admin/reports/generate
  Body: { type: "engagement" | "revenue" | "venues", period: "daily" | "weekly" | "monthly" }
  Returns: { jobId, status }
  Description: Generate report job

GET /admin/reports/:jobId
  Returns: { status, downloadUrl, data }
  Description: Get report results

GET /admin/analytics/dashboard
  Returns: { metrics: {...}, trends: {...} }
  Description: Real-time dashboard metrics

POST /admin/analytics/export
  Body: { collection, filters, format: "csv" | "json" }
  Returns: { downloadUrl }
  Description: Export data

// ===== System Operations =====
POST /admin/system/cache/clear
  Body: { scope: "all" | "venues" | "users" }
  Returns: { cleared }
  Description: Clear system caches

GET /admin/system/health
  Returns: { status, services, errors }
  Description: System health check

POST /admin/system/maintenance
  Body: { enabled: boolean, message }
  Returns: { status }
  Description: Toggle maintenance mode

Internal Structure:

admin-api/
├── src/
│   ├── index.js                   # Express app
│   ├── routes/
│   │   ├── users.routes.js        # User management
│   │   ├── venues.routes.js       # Venue management
│   │   ├── moderation.routes.js   # Content moderation
│   │   ├── reports.routes.js      # Analytics & reports
│   │   ├── system.routes.js       # System operations
│   ├── services/
│   │   ├── user.service.js        # User operations
│   │   ├── venue.service.js       # Venue operations
│   │   ├── report.service.js      # Report generation
│   │   ├── analytics.service.js   # Metrics aggregation
│   ├── middleware/
│   │   ├── admin-auth.js          # Admin-only auth
│   │   ├── rate-limit.js          # Rate limiting
│   │   ├── audit-log.js           # Action logging
│   │   ├── error-handler.js       # Error handling
│   ├── utils/
│   │   ├── csv-parser.js          # CSV import/export
│   │   ├── pdf-generator.js       # PDF reports
├── Dockerfile
├── package.json

Benefits:

Unified Auth: Single admin authentication middleware
Batch Operations: Process 100+ records in one request
Better Logging: Centralized audit trail of all admin actions
Easier Development: Run entire admin backend locally
Complex Workflows: Multi-step operations in single transaction
Reduced Costs: 1 service vs 10+ Cloud Functions

Migration Priority: 🟡 Phase 2 (Weeks 3-4) - Improve admin operations

3. Analytics Service 🟡 HIGH PRIORITY

Current State:

No analytics infrastructure
No merchant-facing metrics
No cost tracking or usage monitoring
Manual reporting and data exports

Why Cloud Run:

Data Aggregation: Complex Firestore queries across multiple collections
Large Data Processing: Aggregate 100k+ events for reports
Scheduled Jobs: Daily/weekly/monthly report generation
Memory Intensive: Process 10k+ records for trend analysis
Export Generation: CSV/PDF reports can take minutes
Real-time Metrics: In-memory caching for dashboard
Cost Tracking: Parse Firebase/GCP billing data

Why High Priority:

Product Feature: Merchants need analytics to see venue performance
Business Intelligence: Track engagement, conversion, revenue
Cost Management: Monitor Firebase/API costs before they spike
Growth Metrics: Understand user behavior and retention

Recommended Architecture:

Cloud Run Service: analytics-api

javascript

// ===== Merchant Analytics =====
GET /analytics/merchant/dashboard
  Query: ?merchantId={id}&period=7d|30d|90d
  Returns: {
    venueStats: { views, lanterns, waves, conversion },
    offerStats: { impressions, claims, revenue },
    trends: { daily, weekly },
    topVenues: [...]
  }
  Description: Merchant dashboard metrics

GET /analytics/merchant/venue/:venueId
  Query: ?period=7d|30d|90d
  Returns: {
    activity: { lanterns, waves, uniqueUsers },
    peakHours: [...],
    demographics: { ageGroups, genderSplit },
    trends: { daily, hourly }
  }
  Description: Individual venue performance

GET /analytics/merchant/offer/:offerId
  Returns: {
    impressions: number,
    claims: number,
    redemptions: number,
    conversionRate: number,
    revenue: number,
    timeline: [...]
  }
  Description: Offer engagement metrics

POST /analytics/merchant/export
  Body: {
    merchantId,
    type: "venue-performance" | "offer-analytics" | "revenue",
    period: "7d" | "30d" | "90d",
    format: "csv" | "pdf"
  }
  Returns: { jobId, status }
  Description: Generate export job for merchant data

// ===== Admin/Internal Analytics =====
GET /analytics/admin/overview
  Returns: {
    users: { total, active, new, churn },
    venues: { total, active, topPerforming },
    engagement: { lanterns, waves, chats, matches },
    revenue: { total, byMerchant, byVenue },
    costs: { firebase, cloudRun, apis }
  }
  Description: High-level platform metrics

GET /analytics/admin/engagement
  Query: ?period=7d|30d|90d
  Returns: {
    dau: number,  // Daily active users
    wau: number,  // Weekly active users
    mau: number,  // Monthly active users
    retention: { day1, day7, day30 },
    sessionDuration: { avg, median, p95 },
    actions: { lanterns, waves, chats, matches }
  }
  Description: User engagement metrics

GET /analytics/admin/costs
  Query: ?period=7d|30d|90d
  Returns: {
    total: number,
    breakdown: {
      firestore: { reads, writes, deletes, cost },
      storage: { gb, cost },
      cloudRun: { requests, cpu, memory, cost },
      apis: { osm, nominatim, googlePlaces, cost }
    },
    trends: [...],
    projections: { monthly, annual }
  }
  Description: Cost tracking and projections

GET /analytics/admin/venues
  Query: ?sortBy=activity|lanterns|revenue&limit=50
  Returns: {
    venues: [{
      id, name, location,
      stats: { lanterns, waves, uniqueUsers },
      revenue: number,
      merchantShare: number
    }],
    total: number
  }
  Description: Venue performance rankings

GET /analytics/admin/merchants
  Returns: {
    merchants: [{
      id, name,
      venues: number,
      offers: number,
      revenue: number,
      engagement: { impressions, conversions }
    }]
  }
  Description: Merchant performance

// ===== Real-time Metrics =====
GET /analytics/realtime/activity
  Returns: {
    activeLanterns: number,
    activeWaves: number,
    onlineUsers: number,
    topVenues: [{ venueId, name, activeLanterns }],
    recentMatches: number
  }
  Description: Real-time platform activity

GET /analytics/realtime/venue/:venueId
  Returns: {
    activeLanterns: number,
    recentWaves: number,
    onlineUsers: number
  }
  Description: Real-time venue activity

// ===== Reports & Exports =====
POST /analytics/reports/generate
  Body: {
    type: "engagement" | "revenue" | "costs" | "venues" | "merchants",
    period: "daily" | "weekly" | "monthly" | "custom",
    startDate: "2026-01-01",
    endDate: "2026-01-31",
    format: "csv" | "pdf" | "json",
    recipients: ["email@example.com"]  // Optional email delivery
  }
  Returns: { jobId, status, estimatedTime }
  Description: Generate scheduled report

GET /analytics/reports/:jobId
  Returns: {
    status: "queued" | "processing" | "completed" | "failed",
    progress: number,
    downloadUrl: string,
    expiresAt: timestamp
  }
  Description: Get report job status and download link

GET /analytics/reports/scheduled
  Returns: { reports: [{...}] }
  Description: List scheduled/recurring reports

// ===== Data Export =====
POST /analytics/export/raw
  Body: {
    collection: "lanterns" | "waves" | "chats" | "venues",
    filters: {...},
    dateRange: { start, end },
    format: "csv" | "json" | "parquet"
  }
  Returns: { jobId, status }
  Description: Export raw data (admin-only)

// ===== Usage Tracking (Internal) =====
POST /analytics/track/api-usage
  Body: { service: "osm" | "nominatim" | "googlePlaces", requests: number }
  Returns: { recorded: true }
  Description: Track external API usage for cost monitoring

POST /analytics/track/feature-usage
  Body: { feature: string, userId: string, metadata: {...} }
  Returns: { recorded: true }
  Description: Track feature usage for product analytics

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171

Internal Structure:

analytics-api/
├── src/
│   ├── index.js                   # Express app entry point
│   ├── routes/
│   │   ├── merchant.routes.js     # Merchant analytics
│   │   ├── admin.routes.js        # Admin/internal analytics
│   │   ├── realtime.routes.js     # Real-time metrics
│   │   ├── reports.routes.js      # Report generation
│   │   ├── export.routes.js       # Data exports
│   ├── services/
│   │   ├── aggregation.service.js # Data aggregation from Firestore
│   │   ├── metrics.service.js     # Metrics calculation
│   │   ├── cost-tracking.service.js # Parse billing data
│   │   ├── report-generator.service.js # PDF/CSV generation
│   │   ├── realtime.service.js    # Real-time metrics (cached)
│   ├── jobs/
│   │   ├── daily-aggregation.js   # Daily metrics aggregation
│   │   ├── cost-sync.js           # Sync billing data
│   │   ├── report-scheduler.js    # Scheduled reports
│   ├── middleware/
│   │   ├── auth.js                # Role-based auth (merchant vs admin)
│   │   ├── rate-limit.js          # Rate limiting
│   │   ├── cache.js               # Redis/in-memory cache
│   ├── utils/
│   │   ├── csv-generator.js       # CSV export
│   │   ├── pdf-generator.js       # PDF reports (puppeteer)
│   │   ├── date-utils.js          # Period calculations
│   │   ├── aggregators.js         # Data aggregation helpers
├── Dockerfile
├── package.json
└── cloudbuild.yaml

Data Storage Strategy (Hybrid with BigQuery):

Recommended: Three-Tier Architecture

javascript

// Tier 1: Real-time Metrics (Redis Cache)
// - Active lanterns/waves right now
// - Online users count
// - Top venues by current activity
// - TTL: 60 seconds
// - Read latency: <10ms

// Tier 2: Recent Analytics (Firestore)
// - Pre-aggregated daily/weekly/monthly metrics
// - Last 90 days of data
// - Fast reads for merchant dashboards
// - Read latency: 50-200ms
analytics/
  daily/
    {venueId}_{date} → { lanterns, waves, uniqueUsers, revenue, ... }
  weekly/
    {venueId}_{weekStart} → { ... }
  monthly/
    {venueId}_{monthStart} → { ... }

// Tier 3: Historical Analytics (BigQuery)
// - All historical data (unlimited retention)
// - Complex multi-table joins and aggregations
// - Ad-hoc queries and deep analytics
// - Cost tracking and projections
// - Machine learning / predictive analytics (future)
// - Read latency: 1-10 seconds (acceptable for reports)

BigQuery Architecture:

yaml

# BigQuery Dataset Structure
lantern_analytics:

  # Raw event tables (streamed from Firestore)
  raw_lanterns:              # All lantern events
    - lantern_id
    - user_id
    - venue_id
    - created_at
    - ended_at
    - status
    - metadata (JSON)

  raw_waves:                 # All wave events
    - wave_id
    - sender_id
    - receiver_id
    - venue_id
    - created_at
    - status
    - response

  raw_chats:                 # All chat events
    - chat_id
    - connection_id
    - sender_id
    - created_at
    - message_length

  raw_offers:                # Offer impressions/claims
    - offer_id
    - merchant_id
    - venue_id
    - user_id
    - action (impression/claim/redemption)
    - created_at
    - metadata (JSON)

  # Pre-aggregated tables (for fast queries)
  daily_venue_stats:         # Daily rollups
    - date
    - venue_id
    - lantern_count
    - wave_count
    - unique_users
    - revenue
    - avg_session_duration

  daily_merchant_stats:      # Daily merchant metrics
    - date
    - merchant_id
    - impression_count
    - claim_count
    - revenue
    - conversion_rate

  # Cost tracking tables
  daily_costs:               # GCP billing data
    - date
    - service (firestore/storage/cloudrun/apis)
    - operation_type
    - cost_usd
    - quantity (reads/writes/requests)

  # User cohort analysis
  user_cohorts:              # Retention analysis
    - cohort_date
    - users_count
    - day_1_retention
    - day_7_retention
    - day_30_retention

Data Flow Architecture:

┌─────────────────────────────────────────────────────────────┐
│  Real-time Events (Firestore)                                │
│  • User creates lantern → writes to Firestore               │
│  • Triggers Cloud Function → streams to BigQuery            │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  BigQuery Streaming Insert (Cloud Function)                  │
│  • Low latency (~1 second)                                   │
│  • Automatic schema detection                                │
│  • Cost: $0.05 per GB streamed                              │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  BigQuery Tables (Raw Events)                                │
│  • Partitioned by date (cost optimization)                   │
│  • Clustered by venue_id, user_id                           │
│  • Retention: Unlimited (storage $0.02/GB/month)            │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Scheduled Queries (BigQuery → Aggregated Tables)            │
│  • Run daily at 2 AM UTC                                    │
│  • Create daily_venue_stats, daily_merchant_stats           │
│  • Cost: $5 per TB scanned (optimized with partitions)      │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Analytics API (Cloud Run)                                   │
│  • Recent data (90 days): Read from Firestore               │
│  • Historical data (>90 days): Query BigQuery               │
│  • Reports: BigQuery → CSV/PDF export                       │
│  • Cache results in Redis (5-minute TTL)                    │
└─────────────────────────────────────────────────────────────┘

Implementation Approach:

Phase 1: Initial Setup (Week 1)

bash

# 1. Create BigQuery dataset
bq mk --dataset --location=us-central1 lantern_analytics

# 2. Create raw event tables with partitioning
bq mk --table \
  --time_partitioning_field created_at \
  --clustering_fields venue_id,user_id \
  lantern_analytics.raw_lanterns \
  schema_lanterns.json

# 3. Deploy Cloud Functions for streaming
# Trigger: onCreate("lanterns/{lanternId}")
# Action: Stream insert to BigQuery

Phase 2: Streaming Pipeline (Week 2)

javascript

// Cloud Function: streamLanternToBigQuery
exports.streamLanternToBigQuery = functions.firestore
  .document('lanterns/{lanternId}')
  .onCreate(async (snap, context) => {
    const bigquery = new BigQuery();
    const dataset = bigquery.dataset('lantern_analytics');
    const table = dataset.table('raw_lanterns');

    const row = {
      lantern_id: context.params.lanternId,
      user_id: snap.data().userId,
      venue_id: snap.data().venueId,
      created_at: snap.data().createdAt.toDate().toISOString(),
      status: snap.data().status,
      metadata: JSON.stringify(snap.data())
    };

    await table.insert([row]);
  });

Phase 3: Scheduled Aggregations (Week 3)

sql

-- Scheduled query: daily_venue_stats (runs daily at 2 AM)
CREATE OR REPLACE TABLE lantern_analytics.daily_venue_stats
PARTITION BY date
AS
SELECT
  DATE(created_at) as date,
  venue_id,
  COUNT(DISTINCT lantern_id) as lantern_count,
  COUNT(DISTINCT user_id) as unique_users,
  AVG(TIMESTAMP_DIFF(ended_at, created_at, MINUTE)) as avg_duration_minutes
FROM lantern_analytics.raw_lanterns
WHERE DATE(created_at) = CURRENT_DATE() - 1  -- Yesterday only
GROUP BY date, venue_id;

Phase 4: Analytics API Integration (Week 4)

javascript

// analytics-api service
async function getVenueAnalytics(venueId, period) {
  const startDate = calculateStartDate(period);

  // Recent data: Use Firestore (fast)
  if (startDate > Date.now() - 90 * 24 * 60 * 60 * 1000) {
    return getFromFirestore(venueId, startDate);
  }

  // Historical data: Use BigQuery
  const query = `
    SELECT
      date,
      lantern_count,
      unique_users,
      avg_duration_minutes
    FROM lantern_analytics.daily_venue_stats
    WHERE venue_id = @venueId
      AND date >= @startDate
    ORDER BY date DESC
  `;

  const [rows] = await bigquery.query({
    query,
    params: { venueId, startDate }
  });

  return rows;
}

Benefits:

Merchant Value: Data-driven insights to improve venue performance
Cost Control: Early warning for runaway Firebase/API costs
Product Intelligence: Understand feature usage and user behavior
Revenue Optimization: Track offer performance and conversion
Scheduled Reports: Automated daily/weekly/monthly reports
Real-time Dashboards: Fast metrics without hitting Firestore constantly
Unlimited Historical Data: BigQuery retains all data forever at low cost
Complex Analytics: Multi-table joins, cohort analysis, funnel tracking
Cost-Effective at Scale: Firestore reads cost $0.36/million; BigQuery $5/TB scanned
ML/AI Ready: BigQuery ML for predictive analytics and recommendations

BigQuery Cost Analysis (10,000 MAU):

Streaming Inserts:
- 10k users × 5 events/day × 30 days = 1.5M events/month
- 1.5M events × 1KB avg = 1.5GB/month
- Cost: 1.5GB × $0.05/GB = $0.08/month

Storage:
- Year 1: 18GB × $0.02/GB/month = $0.36/month
- Year 2: 36GB × $0.02/GB/month = $0.72/month
- Year 3: 54GB × $0.02/GB/month = $1.08/month

Query Costs (with partitioning):
- 100 merchant dashboard queries/day × 30 days = 3,000 queries/month
- Avg 10MB scanned per query (90-day window, partitioned)
- 3,000 × 10MB = 30GB = 0.03TB
- Cost: 0.03TB × $5/TB = $0.15/month

Scheduled Aggregations:
- Daily aggregation: 500MB scanned/day × 30 = 15GB/month
- Cost: 0.015TB × $5/TB = $0.08/month

Total BigQuery Cost: ~$0.70/month (Year 1)

Compare to Firestore-only approach:
- 3,000 queries × 50 document reads = 150k reads/month
- 150k × $0.36/million = $0.05/month (limited to 90 days)

BigQuery enables unlimited historical analysis for <$1/month!

When to Use Each Storage Tier:

Use Case	Storage	Read Latency	Cost	Best For
Real-time metrics	Redis	<10ms	Free (in-memory)	Live dashboards, current activity
Recent analytics (90d)	Firestore	50-200ms	$0.36/million reads	Merchant dashboards, daily reports
Historical analytics	BigQuery	1-10s	$5/TB scanned	Trend analysis, cohort reports, ML
Cost tracking	BigQuery	1-10s	$5/TB scanned	GCP billing, projections, alerts

Migration Priority: 🟡 Phase 2B (Weeks 4-6) - High priority for product/business

3B. Billing & Cost Monitoring Service 🟡 HIGH PRIORITY

Current State:

No centralized billing visibility
Manual cost tracking across 3+ vendors (GCP, Cloudflare, Railway)
No historical cost data
No anomaly detection or cost alerts
No OpenAI/Resend usage tracking

Why Cloud Run + BigQuery:

Multi-Vendor Consolidation: Aggregate billing from all providers
Historical Analysis: Unlimited retention of billing data
Automated Ingestion: Daily jobs pull data from each vendor API
Anomaly Detection: Alert on cost spikes before invoices arrive
Attribution: Tag costs by environment, service, feature
Forecasting: Predict month-end costs based on trends
Compliance: Audit trail for all billing data

Why High Priority:

Cost Control: Prevent runaway bills before they happen (Issue #52)
Budget Planning: Accurate forecasts for investor/board reporting
Vendor Comparison: Identify where to optimize spend
Transparency: AI usage visibility for users (Issue #188)
Operational Health: Billing visibility in admin portal (Issue #233)
Historical Data: Capture data from day 1 (can't retroactively get history)

Related Issues: #52, #233, #240, #238, #188

Recommended Architecture:

Cloud Run Service: billing-api

javascript

// ===== Billing Dashboard APIs =====
GET /billing/overview
  Query: ?period=current|last|custom&startDate&endDate
  Returns: {
    totalCost: number,
    byVendor: { gcp, cloudflare, railway, openai, resend },
    byService: { firestore, storage, cloudRun, functions, pages },
    trend: { daily, weekly, monthly },
    projectedMonthEnd: number,
    alerts: [{ type, severity, message }]
  }
  Description: High-level billing overview

GET /billing/vendor/:vendor
  Params: vendor = "gcp" | "cloudflare" | "railway" | "openai" | "resend"
  Query: ?period=30d|90d|custom
  Returns: {
    invoices: [{...}],
    lineItems: [{...}],
    usageMetrics: {...},
    costTrend: [...]
  }
  Description: Vendor-specific billing details

GET /billing/service/:service
  Params: service = "firestore" | "storage" | "cloudRun" | "functions" | etc.
  Returns: {
    cost: number,
    usage: { reads, writes, storage, bandwidth },
    costPerUnit: {...},
    optimization: [{ recommendation, estimatedSavings }]
  }
  Description: Service-level cost breakdown

GET /billing/environment/:env
  Params: env = "dev" | "prod"
  Returns: { cost, breakdown, trend }
  Description: Environment-based cost attribution

// ===== Cost Tracking & Analytics =====
GET /billing/trends
  Query: ?metric=cost|usage&groupBy=vendor|service|env&period=90d
  Returns: { dataPoints: [...], trend, forecast }
  Description: Cost trend analysis

GET /billing/anomalies
  Query: ?severity=critical|warning|info&period=30d
  Returns: {
    anomalies: [{
      date, service, expectedCost, actualCost, variance, severity
    }]
  }
  Description: Cost anomaly detection

GET /billing/forecast
  Query: ?horizon=7d|30d|90d
  Returns: {
    projectedCost: number,
    confidence: number,
    trend: "increasing" | "stable" | "decreasing",
    breakdown: {...}
  }
  Description: Cost forecasting based on trends

// ===== AI Usage Tracking (Issue #240, #188) =====
GET /billing/ai/usage
  Query: ?service=openai|anthropic&period=30d
  Returns: {
    totalTokens: number,
    totalCost: number,
    byModel: { "gpt-4": {...}, "claude-3": {...} },
    byFeature: { "chatbot": {...}, "triage": {...} },
    dailyUsage: [...]
  }
  Description: AI API usage and costs

POST /billing/ai/track
  Body: { service, model, tokens, cost, feature, userId }
  Returns: { recorded: true }
  Description: Track AI usage (called by services)

// ===== Export & Reporting =====
POST /billing/reports/generate
  Body: {
    type: "monthly-summary" | "vendor-comparison" | "cost-attribution",
    period: { start, end },
    format: "csv" | "pdf" | "json",
    recipients: ["email@example.com"]  // Optional
  }
  Returns: { jobId, status }
  Description: Generate billing report

GET /billing/reports/:jobId
  Returns: { status, downloadUrl, expiresAt }
  Description: Get report status and download

GET /billing/export/raw
  Query: ?vendor&startDate&endDate&format=csv|json
  Returns: { downloadUrl }
  Description: Export raw billing data (admin-only)

// ===== Admin Operations =====
POST /billing/sync/trigger
  Body: { vendor: "all" | "gcp" | "cloudflare" | "railway" }
  Returns: { jobId, status }
  Description: Manually trigger billing data sync

GET /billing/sync/status
  Returns: {
    lastSync: { vendor, timestamp, status, rowsImported },
    nextSync: { vendor, scheduledTime }
  }
  Description: Check sync job status

POST /billing/alerts/configure
  Body: {
    type: "daily-threshold" | "spike-detection" | "forecast-exceeded",
    threshold: number,
    recipients: [...]
  }
  Returns: { alertId, status }
  Description: Configure cost alerts

GET /billing/health
  Returns: {
    dataFreshness: { vendor, lastUpdate, status },
    pipelineStatus: "healthy" | "degraded" | "failing",
    errors: [...]
  }
  Description: Billing pipeline health check

Cloud Run Jobs (Scheduled):

javascript

// Daily ingestion jobs
job: ingest-gcp-billing        // Runs: Daily 2 AM UTC
  - Reads from GCP Billing Export (auto-configured)
  - Writes to billing_raw.gcp_billing_export

job: ingest-cloudflare         // Runs: Daily 3 AM UTC
  - Pulls billing API data
  - Writes to billing_raw.cloudflare_invoices_raw

job: ingest-railway            // Runs: Daily 3:15 AM UTC
  - Pulls usage metrics
  - Writes to billing_raw.railway_invoices_raw

job: ingest-openai             // Runs: Daily 3:30 AM UTC (Issue #240)
  - Pulls usage data
  - Writes to billing_raw.openai_usage_raw

job: ingest-resend             // Runs: Daily 3:45 AM UTC (Issue #238)
  - Pulls email sending metrics
  - Writes to billing_raw.resend_usage_raw

// Hourly health check
job: cost-health-check         // Runs: Every hour
  - Validate data freshness
  - Detect anomalies (cost spikes >20% day-over-day)
  - Check unattributed costs
  - Send Slack/email alerts if issues found

Internal Structure:

billing-api/
├── src/
│   ├── index.js                   # Express app
│   ├── routes/
│   │   ├── dashboard.routes.js    # Overview, trends, forecasts
│   │   ├── vendor.routes.js       # Vendor-specific billing
│   │   ├── ai.routes.js           # AI usage tracking
│   │   ├── reports.routes.js      # Report generation
│   │   ├── admin.routes.js        # Sync, alerts, health
│   ├── services/
│   │   ├── gcp.service.js         # GCP billing API client
│   │   ├── cloudflare.service.js  # Cloudflare billing API
│   │   ├── railway.service.js     # Railway API client
│   │   ├── openai.service.js      # OpenAI usage tracking
│   │   ├── resend.service.js      # Resend metrics
│   │   ├── bigquery.service.js    # BigQuery read/write
│   │   ├── anomaly.service.js     # Anomaly detection
│   │   ├── forecast.service.js    # Cost forecasting
│   ├── jobs/
│   │   ├── ingest-gcp.js          # GCP billing ingestion
│   │   ├── ingest-cloudflare.js   # Cloudflare ingestion
│   │   ├── ingest-railway.js      # Railway ingestion
│   │   ├── ingest-openai.js       # OpenAI ingestion
│   │   ├── ingest-resend.js       # Resend ingestion
│   │   ├── normalize.js           # Transform to canonical schema
│   │   ├── attribute.js           # Apply attribution rules
│   │   ├── health-check.js        # Pipeline health validation
│   ├── middleware/
│   │   ├── admin-auth.js          # Admin-only endpoints
│   │   ├── rate-limit.js          # Rate limiting
│   │   ├── cache.js               # Redis caching (1hr TTL)
│   ├── utils/
│   │   ├── schema-mapper.js       # Vendor → canonical mapping
│   │   ├── attribution.js         # Cost attribution logic
│   │   ├── pdf-generator.js       # PDF reports
│   │   ├── alert-sender.js        # Slack/email alerts
├── Dockerfile
├── package.json
└── cloudbuild.yaml

BigQuery Architecture (Expanded from Issue #52):

yaml

# Combined Analytics + Billing Dataset Structure

# ========== APP ANALYTICS (existing plan) ==========
lantern_analytics:
  # User engagement data
  raw_lanterns:              # All lantern events
  raw_waves:                 # All wave events
  raw_chats:                 # All chat events
  raw_offers:                # Offer impressions/claims

  # Pre-aggregated app metrics
  daily_venue_stats:         # Daily venue performance
  daily_merchant_stats:      # Daily merchant metrics
  user_cohorts:              # Retention analysis

# ========== BILLING DATA (new - Issue #52) ==========
billing_raw:
  # Immutable vendor-shaped data
  gcp_billing_export:               # GCP auto-export (native)
  gcp_billing_export_resource_v1:   # Detailed resource costs

  cloudflare_invoices_raw:          # Cloudflare invoices
  cloudflare_invoice_line_items:    # Line items
  cloudflare_usage_raw:             # Usage metrics

  railway_invoices_raw:             # Railway invoices
  railway_invoice_line_items:       # Line items
  railway_usage_raw:                # Usage metrics

  openai_usage_raw:                 # OpenAI API usage (Issue #240)
  resend_usage_raw:                 # Resend email metrics (Issue #238)

billing_norm:
  # Canonical cost schema (all vendors normalized)
  fact_cost_line_items:     # One row = one cost event
    - date
    - vendor (gcp|cloudflare|railway|openai|resend)
    - service (firestore|storage|cloudRun|etc.)
    - sku
    - quantity
    - unit_price
    - total_cost
    - currency
    - metadata (JSON)

  # Helper dimensions
  dim_vendor_sku:           # SKU → human-readable mapping
  dim_resource_map:         # Resource → service mapping

billing_attrib:
  # Attribution rules and taxonomy
  dim_taxonomy:             # Allowed env/app/team/cost_center
  dim_cost_allocation_rules: # Rule-based cost attribution

  # Attributed costs
  fact_cost_attributed:     # fact_cost_line_items + attribution
    - All fields from fact_cost_line_items
    - environment (dev|prod)
    - service_name (venue-api|admin-api|analytics-api)
    - feature (lanterns|waves|venues|ai-triage)
    - cost_center
    - team

  # Shared cost allocation (optional)
  fact_cost_allocated:      # Proportional cost splits

billing_marts:
  # BI-friendly rollups (fast dashboards)
  daily_cost_by_vendor:             # Vendor totals per day
  daily_cost_by_service:            # Service costs per day
  daily_cost_by_environment:        # Dev vs prod costs
  monthly_cost_by_vendor:           # Monthly vendor totals
  monthly_cost_by_service:          # Monthly service costs
  top_cost_drivers_30d:             # Biggest cost sources
  unattributed_cost_daily:          # Costs without attribution
  cost_trend_rolling_7d:            # 7-day rolling averages
  cost_forecast_30d:                # Forecasted costs
  ai_usage_by_model:                # AI API usage breakdown (Issue #188)

ops:
  # Pipeline health and monitoring
  pipeline_runs:            # Job execution logs
  data_freshness:           # Vendor data freshness checks
  cost_anomalies:           # Spend spikes, attribution drift
  alert_history:            # Alert notifications sent

Data Flow:

┌─────────────────────────────────────────────────────────────┐
│  Vendor APIs (Daily Scheduled Jobs)                          │
│  • GCP Billing Export → BigQuery (auto)                      │
│  • Cloudflare API → Cloud Run Job → billing_raw.cloudflare_*│
│  • Railway API → Cloud Run Job → billing_raw.railway_*      │
│  • OpenAI API → Cloud Run Job → billing_raw.openai_*        │
│  • Resend API → Cloud Run Job → billing_raw.resend_*        │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  BigQuery Scheduled Queries (Daily 4 AM)                     │
│  • Normalize vendor data → billing_norm.fact_cost_line_items│
│  • Apply attribution rules → billing_attrib.fact_cost_attributed│
│  • Build dashboard marts → billing_marts.*                   │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Cost Health Check Job (Hourly)                              │
│  • Validate data freshness (last 24 hours)                   │
│  • Detect anomalies (>20% day-over-day spike)               │
│  • Check unattributed % (<5% threshold)                     │
│  • Write to ops.cost_anomalies                              │
│  • Send alerts if issues found                              │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Billing API (Cloud Run Service)                             │
│  • Admin portal dashboard (real-time metrics)                │
│  • Cost reports (CSV/PDF exports)                            │
│  • AI usage transparency (public-facing, Issue #188)         │
│  • Forecasting and trend analysis                            │
│  • Reads from billing_marts.* (fast queries)                │
└─────────────────────────────────────────────────────────────┘

Cost Attribution Example:

sql

-- Example: Attribute Firestore costs to specific features
INSERT INTO billing_attrib.fact_cost_attributed
SELECT
  fcl.*,
  CASE
    WHEN resource LIKE '%/lanterns/%' THEN 'lanterns'
    WHEN resource LIKE '%/waves/%' THEN 'waves'
    WHEN resource LIKE '%/venues/%' THEN 'venues'
    WHEN resource LIKE '%/chats/%' THEN 'chats'
    ELSE 'unattributed'
  END as feature,
  CASE
    WHEN project_id = 'lantern-app-dev' THEN 'dev'
    WHEN project_id = 'lantern-app-prod' THEN 'prod'
  END as environment
FROM billing_norm.fact_cost_line_items fcl
WHERE vendor = 'gcp' AND service = 'firestore';

Benefits:

Multi-Vendor Visibility: Single dashboard for all billing (GCP, Cloudflare, Railway, OpenAI, Resend)
Historical Analysis: Unlimited retention for trend analysis and forecasting
Early Warnings: Anomaly detection alerts before invoices arrive
Cost Attribution: Understand which features/environments drive costs
AI Transparency: Public-facing AI usage dashboard (Issue #188)
Automated Reports: Daily/weekly/monthly billing summaries
Forecasting: Predict month-end costs with 7-day trend analysis
Optimization: Identify cost-saving opportunities by service
Audit Trail: Complete billing history for compliance
Addresses 5 Issues: #52 (billing pipeline), #233 (system health), #240 (OpenAI), #238 (Resend), #188 (AI transparency)

Migration Priority: 🟡 Phase 2B (Weeks 4-6) - Combined with analytics-api deployment

4. Image Processing Service 🟢 MEDIUM PRIORITY

Current State:

Direct Firebase Storage uploads
Client-side validation only
No image optimization or resizing

Why Cloud Run:

Multi-step Processing: Resize, compress, watermark, format conversion
Native Libraries: Sharp, ImageMagick for advanced operations
Large Files: Handle 5MB+ uploads without timeout
Multiple Outputs: Generate thumbnails, previews, full-size in one request
Security: Malware scanning, EXIF stripping, content moderation

Recommended Architecture:

Cloud Run Service: image-service

javascript

POST /images/upload
  Body: multipart/form-data image file
  Returns: { imageId, urls: { thumbnail, preview, full }, metadata }
  Description: Upload and process image (auto-generates all sizes)

POST /images/:imageId/resize
  Body: { width, height, fit: "cover" | "contain" | "fill" }
  Returns: { url, dimensions }
  Description: Generate custom size

POST /images/:imageId/compress
  Body: { quality: 1-100, format: "webp" | "jpeg" }
  Returns: { url, originalSize, compressedSize, savings }
  Description: Compress image

GET /images/:imageId/metadata
  Returns: { dimensions, format, size, exif, colors }
  Description: Get image metadata

POST /images/:imageId/moderate
  Returns: { safe, labels, nsfw_score }
  Description: Content moderation check

DELETE /images/:imageId
  Returns: { deleted }
  Description: Delete image and all variants

Benefits:

Unified Pipeline: All image processing in one place
Consistent Quality: Automatic optimization and sizing
Security: Malware scanning and EXIF stripping
Multiple Formats: WebP for modern browsers, JPEG fallback
Watermarking: For merchant offer photos

Migration Priority: 🟢 Phase 3 (Weeks 6-8) - If image uploads are high-volume

5. Real-time Updates Service (Future) 🔵 LOW PRIORITY

Current State:

Firestore listeners for real-time updates
Works well for current scale

When to Build: Only if you need:

WebSocket connections for 10,000+ concurrent users
Sub-100ms latency for live updates
Presence system (online/offline, typing indicators)
Live lantern count updates at high traffic venues

Recommended Architecture:

Cloud Run Service: realtime-service

javascript

WS   /ws/connect
  Description: Establish WebSocket connection

WS   /ws/venue/:venueId
  Description: Subscribe to venue updates

WS   /ws/chat/:connectionId
  Description: Subscribe to chat messages

HTTP /presence/status
  Returns: { online, venueActivity }
  Description: Get current presence status

Migration Priority: 🔵 Phase 4+ (Weeks 9+) - Only if Firestore listeners don't scale

6. Full-Text Search Service (Future) 🔵 LOW PRIORITY

Current State:

Basic Firestore queries
Limited search capabilities

When to Build: Only if you need:

Fuzzy search ("cofee" → "coffee")
Multi-field search (name + address + tags)
Real-time autocomplete
Ranking and relevance scoring

Recommended Architecture:

Cloud Run Service: search-service

javascript

GET  /search/venues?q=coffee+downtown&limit=20
  Returns: { results, total, suggestions }
  Description: Search venues with fuzzy matching

GET  /search/users?q=alice
  Returns: { results, total }
  Description: Search users (admin-only)

POST /search/index/rebuild
  Returns: { indexed, errors }
  Description: Rebuild search index

WS   /search/suggest
  Description: Real-time autocomplete

Technology Options:

MeiliSearch (easy setup, good DX)
Typesense (fast, typo-tolerant)
Elasticsearch (powerful, complex)

Migration Priority: 🔵 Phase 4+ (Weeks 9+) - Only if basic Firestore queries insufficient

🔵 Keep as Cloud Functions

Well-suited for Cloud Functions:

Short-lived operations (<30 seconds)
Event-driven triggers (onCreate, onUpdate, etc.)
Simple HTTP endpoints
Lightweight utilities

Functions to Keep:

javascript

// Auth Triggers
onCreate("users/{uid}") → initializeUserProfile
onDelete("users/{uid}") → cleanupUserData

// Firestore Triggers
onUpdate("lanterns/{lanternId}") → updateVenueActivityCount
onDelete("venues/{venueId}") → cleanupVenueLanterns

// Scheduled Functions (if simple)
schedule("every 24 hours") → cleanupExpiredLanterns
schedule("every 1 hour") → cleanupExpiredWaves

// Simple HTTP Endpoints
POST /logClientError → Log client error to Cloud Logging
GET /health → Service health check

📊 Cloud Functions vs Cloud Run Comparison

Feature	Cloud Functions	Cloud Run
Max Execution Time	9 minutes	60 minutes (configurable)
Cold Start	~1-2 seconds	~2-5 seconds
Concurrency	1-1000 requests/instance	Configurable (≤1000)
WebSocket Support	❌ No	✅ Yes
Container Size	Limited	Up to 10GB
Memory	256MB - 16GB	128MB - 32GB
CPU	Always allocated	Can be CPU-only during request
Pricing Model	Invocation + GB-sec	Request + CPU-time + Memory
Best For	Event-driven, simple HTTP	APIs, batch jobs, long tasks
Local Development	Firebase Emulator	Docker
State Management	None (stateless)	In-memory caching supported
Native Dependencies	Limited	Full container support

🎯 Optimized Migration Roadmap

Phase 1 – Immediate (Weeks 1-2) 🔴 CRITICAL

Service: venue-api (combined import + refresh)

Why First:

Fixes critical security issue (Issue #224 - OSM import abuse)
Addresses firestore.rules:394 public write vulnerability
Unblocks Google Places API integration
High impact, clear requirements

Deliverables:

[ ] Cloud Run service deployed to dev
[ ] OSM import endpoint with job queue
[ ] Venue refresh endpoint with rate limiting
[ ] Admin consolidation endpoint (dedupe)
[ ] Firestore security rules updated (remove public write)
[ ] Documentation and API reference
[ ] Integration tests

Success Metrics:

OSM imports no longer client-accessible
Venue refresh rate limited to 1 req/sec
Import jobs handle 1000+ venues without timeout

Phase 2 – Short Term (Weeks 3-4) 🟡 HIGH PRIORITY

Service: admin-api (consolidated admin operations)

Why Second:

Improves admin portal backend
Enables batch operations for efficiency
Better local development experience
Reduces cold start latency

Deliverables:

[ ] Cloud Run service deployed to dev
[ ] Migrate existing Cloud Functions to Express routes:
- getUserByEmail → GET /admin/users/:uid
- setUserRole → POST /admin/users/:uid/role
- banUser → POST /admin/users/:uid/ban
- unbanUser → DELETE /admin/users/:uid/ban
[ ] New batch operations:
- Bulk ban users
- Bulk import users
- Bulk venue updates
[ ] Admin audit logging
[ ] Admin portal updated to use new API
[ ] Deprecation plan for old Cloud Functions

Success Metrics:

Admin operations 50% faster (reduced cold starts)
Batch operations process 100+ records in <10 seconds
Local development setup documented

Phase 2B – Analytics + Billing (Weeks 4-6) 🟡 HIGH PRIORITY

Services: analytics-api + billing-api (combined deployment)

Why Critical:

Merchant-facing feature (product requirement)
Cost tracking and usage monitoring (Issues #52, #233)
Business intelligence for growth
Revenue optimization for merchants
Multi-vendor billing consolidation (GCP, Cloudflare, Railway, OpenAI, Resend)
AI usage transparency (Issue #188)
Historical billing data (can't be retroactively captured)

Related Issues: #52, #233, #240, #238, #188

Deliverables:

BigQuery Infrastructure:

[ ] Create lantern_analytics dataset:
- Create partitioned raw event tables (lanterns, waves, chats, offers)
- Deploy Cloud Functions for streaming inserts
- Create scheduled queries for daily aggregations
[ ] Create billing_raw dataset:
- Set up GCP Billing Export (auto-configured)
- Create tables: cloudflare_invoices_raw, railway_invoices_raw
- Create tables: openai_usage_raw, resend_usage_raw
[ ] Create billing_norm dataset:
- fact_cost_line_items (canonical schema)
- dim_vendor_sku, dim_resource_map
[ ] Create billing_attrib dataset:
- dim_taxonomy, dim_cost_allocation_rules
- fact_cost_attributed (with env/service/feature tags)
[ ] Create billing_marts dataset:
- daily_cost_by_vendor, daily_cost_by_service
- monthly_cost_by_vendor, cost_trend_rolling_7d
- cost_forecast_30d, ai_usage_by_model
[ ] Create ops dataset:
- pipeline_runs, data_freshness, cost_anomalies

Analytics API (analytics-api):

[ ] Merchant analytics endpoints:
- Dashboard metrics (venue/offer performance)
- Venue analytics (activity, demographics, trends)
- Offer analytics (impressions, claims, conversions)
- Export reports (CSV/PDF)
[ ] Admin analytics endpoints:
- Platform overview (users, engagement, revenue)
- Venue rankings, merchant performance
[ ] Real-time metrics (cached):
- Active lanterns/waves, online users, top venues
[ ] Hybrid query logic:
- Recent data (90 days): Read from Firestore
- Historical data (>90 days): Query BigQuery

Billing API (billing-api):

[ ] Billing dashboard endpoints:
- Overview (total cost, by vendor, by service, trends)
- Vendor-specific details (GCP, Cloudflare, Railway, OpenAI, Resend)
- Service-level cost breakdown
- Environment-based attribution (dev vs prod)
[ ] Cost tracking & analytics:
- Trend analysis, anomaly detection, forecasting
- AI usage tracking (Issue #240 - OpenAI, Issue #188 - transparency)
- Cost attribution by feature/service
[ ] Export & reporting:
- Generate monthly billing reports (CSV/PDF)
- Raw data exports (admin-only)
- Email delivery of scheduled reports
[ ] Admin operations:
- Manual sync triggers
- Alert configuration (thresholds, spike detection)
- Pipeline health checks

Cloud Run Jobs (Scheduled):

[ ] Daily billing ingestion jobs (3 AM UTC):
- ingest-cloudflare (Cloudflare billing API)
- ingest-railway (Railway usage API)
- ingest-openai (OpenAI usage API, Issue #240)
- ingest-resend (Resend metrics API, Issue #238)
[ ] Hourly cost health check:
- Validate data freshness (<24hr)
- Detect cost anomalies (>20% day-over-day)
- Alert on unattributed costs (>5%)
- Send Slack/email alerts
[ ] Daily analytics aggregation:
- Firestore → BigQuery streaming
- BigQuery scheduled queries (normalize, attribute, rollup)
- Weekly/monthly reports (BigQuery queries)
- Cost sync from GCP billing API Portal Integration:
[ ] Merchant portal updates:
- Analytics dashboard (venue/offer performance)
- Export buttons (CSV/PDF reports)
- Real-time venue stats
[ ] Admin portal updates:
- System Health "Billing" tab (Issue #233)
- Multi-vendor cost dashboard (GCP, Cloudflare, Railway, OpenAI, Resend)
- Platform metrics (users, engagement, revenue)
- Merchant reports
- Cost anomaly alerts
- AI usage transparency page (Issue #188 - public-facing)

Success Metrics:

Merchants can view venue performance in <2 seconds
Multi-vendor billing data visible in admin portal (Issues #52, #233)
Cost tracking updates daily from all vendors
Anomaly detection alerts fire within 1 hour of spike
Reports generate in <30 seconds for 90-day periods
Real-time metrics cached and served in <100ms
AI usage transparency page loads in <1 second (Issue #188)
Historical billing data captured from deployment date forward

Phase 3 – Medium Term (Weeks 7-9) 🟢 MEDIUM PRIORITY

Service: image-service (if needed)

Why Third:

Image processing only needed if uploads become high-volume
Can start with direct Firebase Storage uploads
Migrate when optimization/security becomes critical

Deliverables:

[ ] Assess current image upload volume and requirements
[ ] If needed: Deploy image-service with Sharp/ImageMagick
[ ] Multi-size image generation (thumbnail, preview, full)
[ ] Image compression and format optimization
[ ] Content moderation integration
[ ] Watermarking for merchant offers

Phase 4 – Long Term (Weeks 10+) 🔵 LOW PRIORITY

Services (build only if needed):

realtime-service (if Firestore listeners don't scale)
search-service (if Firestore queries insufficient)

Why Last:

Current solutions may be sufficient for 1-2 years
Build these only when you hit scaling issues
Focus on product features instead

💰 Cost Analysis

Cloud Functions (Current)

Estimate for 10,000 monthly active users:
- 10 admin functions × $0.40/million invocations = $4/month
- 2GB-sec per invocation × 50,000 invocations = $1.25/month
Total: ~$5.25/month

Cloud Run (Proposed)

Estimate for 10,000 monthly active users:

venue-api:
- 100,000 requests/month
- 512MB memory, 1 vCPU
- Avg 2 seconds per request
- Cost: ~$15/month

admin-api:
- 10,000 requests/month
- 256MB memory, 1 vCPU
- Avg 500ms per request
- Cost: ~$3/month

analytics-api:
- 50,000 requests/month (merchant dashboards + admin)
- 512MB memory, 1 vCPU
- Avg 1 second per request (cached metrics)
- Scheduled jobs: 1 daily (aggregation)
- Cost: ~$6/month

billing-api:
- 20,000 requests/month (admin dashboards + reports)
- 512MB memory, 1 vCPU
- Avg 1 second per request (cached metrics)
- Scheduled jobs: 5 daily (GCP, Cloudflare, Railway, OpenAI, Resend ingest)
- Hourly health check job
- Cost: ~$8/month

BigQuery (analytics + billing):
- Streaming inserts: 1.5GB/month app data + 0.5GB/month billing data = $0.10/month
- Storage (Year 1): 24GB × $0.02/GB = $0.48/month
- Query costs: 3,500 queries × 15MB avg = 0.05TB × $5/TB = $0.25/month
- Scheduled aggregations: 0.02TB × $5/TB = $0.10/month
- Total BigQuery: ~$1/month

Cloud Functions (streaming):
- 5 Firestore triggers for BigQuery streaming (lanterns, waves, chats, offers, venues)
- ~50,000 invocations/month × $0.40/million = $0.02/month

Total: ~$34/month

Cost Increase: ~$29/month (~5.5x) from current $5/month

Value Gained:

Security fixes (Issues #224, #394 - critical vulnerabilities)
Multi-vendor billing visibility (Issues #52, #233 - GCP, Cloudflare, Railway, OpenAI, Resend)
Cost tracking & anomaly detection (prevents runaway bills before they happen)
AI usage transparency (Issue #188 - public-facing dashboard)
Historical billing data (can't be retroactively captured - start now!)
Merchant analytics (product feature, revenue driver)
Business intelligence (growth metrics, forecasting, attribution)
Enables Google Places integration
Supports 1000+ venue imports
Batch admin operations
Better monitoring and logging
Foundation for future growth
Addresses 8 open issues (#52, #224, #233, #238, #240, #188, #394, plus venues/admin improvements)

Break-even Analysis: As usage grows, Cloud Run becomes more cost-effective:

At 100k users: Cloud Run is 50% cheaper
At 1M users: Cloud Run is 70% cheaper
Long-running operations impossible on Cloud Functions

🏗️ Implementation Guidelines

Cloud Run Best Practices

Service Configuration

yaml

# cloudbuild.yaml example
timeout: 3600s  # 60 minutes
memory: 1Gi
cpu: 1
minInstances: 0  # Scale to zero when idle
maxInstances: 10
concurrency: 80  # Requests per instance

Environment Variables

bash

FIREBASE_PROJECT_ID=lantern-app-dev
ENVIRONMENT=dev
LOG_LEVEL=info
RATE_LIMIT_OSM=5   # 5 req/sec
RATE_LIMIT_NOMINATIM=1  # 1 req/sec

Health Checks

javascript

app.get('/health', (req, res) => {
  res.json({ status: 'ok', timestamp: Date.now() });
});

Graceful Shutdown

javascript

process.on('SIGTERM', () => {
  console.log('SIGTERM received, closing server...');
  server.close(() => {
    console.log('Server closed');
    process.exit(0);
  });
});

Security

Authentication
- Use Firebase Admin SDK to verify ID tokens
- Implement role-based access control (RBAC)
- Admin endpoints require admin role
Rate Limiting
- Per-user rate limits (10 req/min for imports)
- Global rate limits for external APIs
- IP-based rate limiting for abuse prevention
Input Validation
- Validate all request bodies with Joi/Zod
- Sanitize user inputs
- Limit request body size (10MB max)

Monitoring

Cloud Logging
- Structured JSON logs
- Include request ID, user ID, action
- Log errors with stack traces
Cloud Monitoring
- Request latency (p50, p95, p99)
- Error rate
- Active instances
- Memory/CPU usage
Alerts
- Error rate >5% for 5 minutes
- Latency p99 >10 seconds
- Memory usage >80%

🧪 Testing Strategy

Local Development

bash

# venue-api local testing
cd services/venue-api
docker build -t venue-api .
docker run -p 8080:8080 \
  -e FIREBASE_PROJECT_ID=lantern-app-dev \
  -e ENVIRONMENT=local \
  venue-api

# Test endpoints
curl http://localhost:8080/health
curl -X POST http://localhost:8080/venues/import/osm \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"lat":40.7128,"lng":-74.0060,"radius":1000}'

Integration Tests

javascript

// venue-api integration test
describe('POST /venues/import/osm', () => {
  it('should import venues from OSM', async () => {
    const response = await request(app)
      .post('/venues/import/osm')
      .set('Authorization', `Bearer ${adminToken}`)
      .send({ lat: 40.7128, lng: -74.0060, radius: 1000 })
      .expect(200);

    expect(response.body).toHaveProperty('jobId');
    expect(response.body.status).toBe('queued');
  });

  it('should reject unauthorized users', async () => {
    await request(app)
      .post('/venues/import/osm')
      .send({ lat: 40.7128, lng: -74.0060, radius: 1000 })
      .expect(401);
  });
});

Load Testing

bash

# Use Apache Bench for load testing
ab -n 1000 -c 10 -H "Authorization: Bearer $TOKEN" \
  http://venue-api-dev.a.run.app/health

# Use k6 for complex scenarios
k6 run load-test.js

📝 Migration Checklist

Pre-Migration

[ ] Review current Cloud Functions usage and costs
[ ] Audit API rate limits (OSM, Nominatim, Google Places)
[ ] Set up Cloud Run development environment
[ ] Create Dockerfiles for each service
[ ] Set up Cloud Build pipelines
[ ] Configure secret management (Secret Manager)

During Migration

[ ] Deploy to dev environment first
[ ] Run integration tests
[ ] Monitor error rates and latency
[ ] Gradual rollout (10% → 50% → 100%)
[ ] Keep Cloud Functions as fallback

Post-Migration

[ ] Monitor for 1 week before deprecating Cloud Functions
[ ] Update documentation
[ ] Update admin portal to use new APIs
[ ] Remove old Cloud Functions
[ ] Update Firestore security rules
[ ] Create changelog entry

🎓 Key Takeaways

✅ Use Cloud Run When You Need:

Batch processing (100+ items)
Long-running tasks (>1 minute, up to 60 minutes)
WebSockets or persistent connections
Multi-step workflows with state
Native dependencies (ImageMagick, etc.)
Container-based applications
High concurrency (1000+ req/instance)

✅ Use Cloud Functions When You Need:

Simple, single-purpose logic (<50 lines)
Event-driven triggers (onCreate, onUpdate)
Fast execution (<30 seconds)
Minimal configuration
Pay-per-invocation model for low traffic

🎯 Lantern-Specific Priorities:

Phase 1 (Weeks 1-2): venue-api 🔴
- Security fix (Issues #224, #394)
- Google Places foundation
Phase 2 (Weeks 3-4): admin-api 🟡
- Better admin operations
- Batch processing
Phase 2B (Weeks 4-6): analytics-api + billing-api 🟡 EXPANDED SCOPE
- Merchant analytics (product feature)
- Multi-vendor billing (Issues #52, #233, #240, #238, #188)
- Cost tracking & anomaly detection
- AI usage transparency
- Critical: Historical data capture starts from deployment
Phase 3 (Weeks 7-9): image-service 🟢
- Only if needed
Phase 4+ (Weeks 10+): Real-time, search 🔵
- Only if scaling issues

Most Important:

Start with Phase 1 (venue-api) to fix critical security vulnerabilities
Phase 2B is CRITICAL because:
- Can't retroactively capture historical billing data - must start early
- Prevents cost surprises before they happen (anomaly detection)
- Merchant analytics drives product value and revenue
- AI transparency builds user trust (Issue #188)
- Addresses 5 open billing/analytics issues in one deployment

📚 References

Document Owner: @cattreedev Last Review: 2026-02-09 Next Review: After Phase 1 completion

Cloud Run Migration Strategy (Optimized) ​

🎯 Executive Summary ​

🚀 High Priority – Cloud Run Services ​

1. Venue API (Combined Import + Refresh) 🔴 CRITICAL ​

2. Admin API (Consolidated) 🟡 HIGH PRIORITY ​

3. Analytics Service 🟡 HIGH PRIORITY ​

3B. Billing & Cost Monitoring Service 🟡 HIGH PRIORITY ​

4. Image Processing Service 🟢 MEDIUM PRIORITY ​

5. Real-time Updates Service (Future) 🔵 LOW PRIORITY ​

6. Full-Text Search Service (Future) 🔵 LOW PRIORITY ​

🔵 Keep as Cloud Functions ​

📊 Cloud Functions vs Cloud Run Comparison ​

🎯 Optimized Migration Roadmap ​

Phase 1 – Immediate (Weeks 1-2) 🔴 CRITICAL ​

Phase 2 – Short Term (Weeks 3-4) 🟡 HIGH PRIORITY ​

Phase 2B – Analytics + Billing (Weeks 4-6) 🟡 HIGH PRIORITY ​

Phase 3 – Medium Term (Weeks 7-9) 🟢 MEDIUM PRIORITY ​

Phase 4 – Long Term (Weeks 10+) 🔵 LOW PRIORITY ​

💰 Cost Analysis ​

Cloud Functions (Current) ​

Cloud Run (Proposed) ​

🏗️ Implementation Guidelines ​

Cloud Run Best Practices ​

Security ​

Monitoring ​

🧪 Testing Strategy ​

Local Development ​

Integration Tests ​

Load Testing ​

📝 Migration Checklist ​

Pre-Migration ​

During Migration ​

Post-Migration ​

🎓 Key Takeaways ​

✅ Use Cloud Run When You Need: ​

✅ Use Cloud Functions When You Need: ​

🎯 Lantern-Specific Priorities: ​

📚 References ​

Cloud Run Migration Strategy (Optimized)

🎯 Executive Summary

🚀 High Priority – Cloud Run Services

1. Venue API (Combined Import + Refresh) 🔴 CRITICAL

2. Admin API (Consolidated) 🟡 HIGH PRIORITY

3. Analytics Service 🟡 HIGH PRIORITY

3B. Billing & Cost Monitoring Service 🟡 HIGH PRIORITY

4. Image Processing Service 🟢 MEDIUM PRIORITY

5. Real-time Updates Service (Future) 🔵 LOW PRIORITY

6. Full-Text Search Service (Future) 🔵 LOW PRIORITY

🔵 Keep as Cloud Functions

📊 Cloud Functions vs Cloud Run Comparison

🎯 Optimized Migration Roadmap

Phase 1 – Immediate (Weeks 1-2) 🔴 CRITICAL

Phase 2 – Short Term (Weeks 3-4) 🟡 HIGH PRIORITY

Phase 2B – Analytics + Billing (Weeks 4-6) 🟡 HIGH PRIORITY

Phase 3 – Medium Term (Weeks 7-9) 🟢 MEDIUM PRIORITY

Phase 4 – Long Term (Weeks 10+) 🔵 LOW PRIORITY

💰 Cost Analysis

Cloud Functions (Current)

Cloud Run (Proposed)

🏗️ Implementation Guidelines

Cloud Run Best Practices

Security

Monitoring

🧪 Testing Strategy

Local Development

Integration Tests

Load Testing

📝 Migration Checklist

Pre-Migration

During Migration

Post-Migration

🎓 Key Takeaways

✅ Use Cloud Run When You Need:

✅ Use Cloud Functions When You Need:

🎯 Lantern-Specific Priorities:

📚 References