Skip to content

Cloud Run Migration Strategy (Optimized) โ€‹

Refactor: Server-Side Configs (Cloud Run vs Cloud Functions & Client-to-Server Migrations)

Last Updated: 2026-02-09 Status: Planning Related Issues:

  • #224 (OSM Import Security)
  • #394 (Venue Refresh Abuse)
  • #52 (BigQuery Billing Pipeline)
  • #233 (System Health Billing)
  • #240 (OpenAI Billing)
  • #238 (Resend Billing)
  • #188 (AI Usage Transparency)

๐ŸŽฏ Executive Summary โ€‹

This document outlines an optimized migration strategy from Cloud Functions to Cloud Run for services that require:

  • Long-running operations (>1 minute)
  • Batch processing (100+ items)
  • WebSocket/persistent connections
  • Multi-step workflows
  • Native/container dependencies
  • Complex state management
  • Multi-vendor data aggregation (billing, analytics)
  • Historical data retention (BigQuery)

Key Optimizations:

  1. Consolidate related services to reduce operational overhead and improve code reuse
  2. Combine analytics + billing infrastructure to maximize BigQuery investment and capture historical data from day 1
  3. Address 8 open issues in unified deployments (security, billing, transparency)

๐Ÿš€ High Priority โ€“ Cloud Run Services โ€‹


1. Venue API (Combined Import + Refresh) ๐Ÿ”ด CRITICAL โ€‹

Current State:

  • OSM import: CLI script with client-side triggers
  • Venue refresh: Client-side triggers with public write access (firestore.rules:394)
  • No Google Places integration yet

Why Combined Service:

  • Shared Domain Logic: Both handle venue data validation, deduplication, normalization
  • Unified Rate Limiting: Reusable infrastructure across OSM, Nominatim, Google Places
  • API Source Flexibility: Adding Google Places becomes a new route, not a new service
  • Operational Efficiency: Single deployment, shared connection pools, lower costs
  • Future-Proof: Easy to add more sources (Yelp, Foursquare), centralized quality scoring

Recommended Architecture:

Cloud Run Service: venue-api

javascript
// ===== Import Operations =====
POST /venues/import/osm
  Body: { lat, lng, radius, limit, filters }
  Returns: { jobId, status, estimatedTime }
  Description: Import venues from OSM Overpass API

POST /venues/import/google
  Body: { lat, lng, radius, types, keyword }
  Returns: { jobId, status, estimatedTime }
  Description: Import venues from Google Places API (future)

POST /venues/import/hybrid
  Body: { lat, lng, radius, strategy: "osm-primary" | "google-primary" }
  Returns: { jobId, status, estimatedTime }
  Description: Combined OSM + Google enrichment with deduplication

GET /venues/import/jobs/:jobId
  Returns: { status, progress, venuesProcessed, errors, results }
  Description: Check import job status and results

// ===== Enrichment & Refresh =====
POST /venues/:venueId/enrich
  Body: { fields: ["photos", "hours", "contact", "rating"] }
  Returns: { venue, enrichedFields, source }
  Description: Enrich single venue with external API data

POST /venues/refresh/batch
  Body: { venueIds: [...], priority: "high" | "normal" }
  Returns: { jobId, status, count }
  Description: Batch refresh venue metadata

POST /venues/refresh/scheduled
  Headers: X-CloudScheduler-JobName
  Returns: { processed, updated, errors }
  Description: Cloud Scheduler endpoint for automated refresh

// ===== Utilities =====
POST /venues/consolidate
  Body: { region: { lat, lng, radius }, strategy: "strict" | "fuzzy" }
  Returns: { duplicatesFound, merged, deleted }
  Description: Deduplicate venues in region

POST /venues/geocode
  Body: { address } or { lat, lng }
  Returns: { address, coordinates, confidence }
  Description: Forward or reverse geocoding via Nominatim

GET /venues/:venueId/metadata
  Returns: { venue, sources, lastUpdated, quality }
  Description: Get venue details with provenance info

// ===== Admin Operations =====
POST /venues/cleanup/orphaned
  Returns: { deleted, errors }
  Description: Remove venues with no activity for 90+ days

POST /venues/validate/all
  Returns: { total, invalid, fixed, errors }
  Description: Data quality check and auto-fix

POST /venues/rebuild-index
  Returns: { indexed, errors }
  Description: Rebuild search/geo indexes

Internal Structure:

venue-api/
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ index.js                   # Express app entry point
โ”‚   โ”œโ”€โ”€ routes/
โ”‚   โ”‚   โ”œโ”€โ”€ import.routes.js       # OSM, Google, hybrid imports
โ”‚   โ”‚   โ”œโ”€โ”€ refresh.routes.js      # Enrichment & scheduled tasks
โ”‚   โ”‚   โ”œโ”€โ”€ utility.routes.js      # Geocoding, consolidation
โ”‚   โ”‚   โ”œโ”€โ”€ admin.routes.js        # Cleanup, validation
โ”‚   โ”œโ”€โ”€ services/
โ”‚   โ”‚   โ”œโ”€โ”€ osm.service.js         # OSM Overpass API client
โ”‚   โ”‚   โ”œโ”€โ”€ google.service.js      # Google Places API client
โ”‚   โ”‚   โ”œโ”€โ”€ nominatim.service.js   # Geocoding service
โ”‚   โ”‚   โ”œโ”€โ”€ venue.service.js       # Core venue logic
โ”‚   โ”‚   โ”œโ”€โ”€ enrichment.service.js  # Data enrichment
โ”‚   โ”œโ”€โ”€ queue/
โ”‚   โ”‚   โ”œโ”€โ”€ job-manager.js         # Queue handling (Cloud Tasks)
โ”‚   โ”‚   โ”œโ”€โ”€ rate-limiter.js        # Global rate limiting
โ”‚   โ”œโ”€โ”€ middleware/
โ”‚   โ”‚   โ”œโ”€โ”€ auth.js                # Firebase Admin validation
โ”‚   โ”‚   โ”œโ”€โ”€ rate-limit.js          # Per-source rate limits
โ”‚   โ”‚   โ”œโ”€โ”€ error-handler.js       # Unified error handling
โ”‚   โ”œโ”€โ”€ utils/
โ”‚   โ”‚   โ”œโ”€โ”€ deduplication.js       # Venue matching algorithms
โ”‚   โ”‚   โ”œโ”€โ”€ validation.js          # Data validation
โ”‚   โ”‚   โ”œโ”€โ”€ cache.js               # In-memory caching
โ”œโ”€โ”€ Dockerfile
โ”œโ”€โ”€ package.json
โ””โ”€โ”€ cloudbuild.yaml

Benefits:

  • Security: Server-side validation prevents client abuse (fixes Issue #224)
  • Rate Limiting: Global throttling for OSM (5 req/sec), Nominatim (1 req/sec), Google Places
  • Batch Processing: Import 1000+ venues in single job (vs Cloud Functions 9-min timeout)
  • Deduplication: Automatic merging of OSM + Google Places data
  • Scheduled Jobs: Cloud Scheduler โ†’ nightly venue refresh for high-traffic locations
  • Audit Trail: Full logging of all import operations
  • Cost Efficiency: Single service vs two separate deployments

Migration Priority: ๐Ÿ”ด Phase 1 (Weeks 1-2) - Addresses critical security issues


2. Admin API (Consolidated) ๐ŸŸก HIGH PRIORITY โ€‹

Current State:

  • 10+ separate Cloud Functions for admin operations
  • Each function has separate auth, logging, error handling
  • No batch operations support
  • Difficult local development

Why Consolidated Service:

  • Shared Middleware: Auth, validation, logging, error handling
  • Unified API: Single endpoint for admin portal
  • Batch Operations: Process 100+ users/venues in single request
  • Better DX: Local development with docker run
  • Reduced Cold Starts: 1 warm service vs 10+ separate functions
  • Complex Workflows: Multi-step admin actions in single transaction

Recommended Architecture:

Cloud Run Service: admin-api

javascript
// ===== User Management =====
// Migrate existing Cloud Functions
GET  /admin/users/:uid
  Description: Get user by ID (replaces getUserByEmail function)

POST /admin/users/:uid/role
  Body: { role: "admin" | "merchant" | "user" }
  Description: Set user role (replaces setUserRole function)

POST /admin/users/:uid/ban
  Body: { reason, duration }
  Description: Ban user (replaces banUser function)

DELETE /admin/users/:uid/ban
  Description: Unban user (replaces unbanUser function)

// New batch operations
POST /admin/users/bulk-import
  Body: { users: [...], sendInvites: boolean }
  Returns: { jobId, status, count }
  Description: Import users from CSV

POST /admin/users/bulk-ban
  Body: { userIds: [...], reason }
  Returns: { banned, errors }
  Description: Batch ban users

POST /admin/users/bulk-role
  Body: { userIds: [...], role }
  Returns: { updated, errors }
  Description: Batch update user roles

GET /admin/users/search
  Query: ?q=email@example.com&role=admin&limit=50
  Returns: { users, total, hasMore }
  Description: Search users with filters

// ===== Venue Management =====
POST /admin/venues/bulk-update
  Body: { venueIds: [...], updates: {...} }
  Returns: { updated, errors }
  Description: Batch update venue metadata

POST /admin/venues/cleanup
  Body: { criteria: "inactive" | "duplicate" | "invalid" }
  Returns: { deleted, merged, errors }
  Description: Cleanup venues by criteria

GET /admin/venues/audit
  Returns: { venues, issues, recommendations }
  Description: Venue data quality audit

// ===== Content Moderation =====
GET /admin/moderation/queue
  Query: ?type=profile|lantern|chat&status=pending
  Returns: { items, total }
  Description: Get moderation queue

POST /admin/moderation/:itemId/approve
  Returns: { status }
  Description: Approve content

POST /admin/moderation/:itemId/reject
  Body: { reason, action: "hide" | "delete" | "ban-user" }
  Returns: { status }
  Description: Reject content

// ===== Analytics & Reports =====
POST /admin/reports/generate
  Body: { type: "engagement" | "revenue" | "venues", period: "daily" | "weekly" | "monthly" }
  Returns: { jobId, status }
  Description: Generate report job

GET /admin/reports/:jobId
  Returns: { status, downloadUrl, data }
  Description: Get report results

GET /admin/analytics/dashboard
  Returns: { metrics: {...}, trends: {...} }
  Description: Real-time dashboard metrics

POST /admin/analytics/export
  Body: { collection, filters, format: "csv" | "json" }
  Returns: { downloadUrl }
  Description: Export data

// ===== System Operations =====
POST /admin/system/cache/clear
  Body: { scope: "all" | "venues" | "users" }
  Returns: { cleared }
  Description: Clear system caches

GET /admin/system/health
  Returns: { status, services, errors }
  Description: System health check

POST /admin/system/maintenance
  Body: { enabled: boolean, message }
  Returns: { status }
  Description: Toggle maintenance mode

Internal Structure:

admin-api/
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ index.js                   # Express app
โ”‚   โ”œโ”€โ”€ routes/
โ”‚   โ”‚   โ”œโ”€โ”€ users.routes.js        # User management
โ”‚   โ”‚   โ”œโ”€โ”€ venues.routes.js       # Venue management
โ”‚   โ”‚   โ”œโ”€โ”€ moderation.routes.js   # Content moderation
โ”‚   โ”‚   โ”œโ”€โ”€ reports.routes.js      # Analytics & reports
โ”‚   โ”‚   โ”œโ”€โ”€ system.routes.js       # System operations
โ”‚   โ”œโ”€โ”€ services/
โ”‚   โ”‚   โ”œโ”€โ”€ user.service.js        # User operations
โ”‚   โ”‚   โ”œโ”€โ”€ venue.service.js       # Venue operations
โ”‚   โ”‚   โ”œโ”€โ”€ report.service.js      # Report generation
โ”‚   โ”‚   โ”œโ”€โ”€ analytics.service.js   # Metrics aggregation
โ”‚   โ”œโ”€โ”€ middleware/
โ”‚   โ”‚   โ”œโ”€โ”€ admin-auth.js          # Admin-only auth
โ”‚   โ”‚   โ”œโ”€โ”€ rate-limit.js          # Rate limiting
โ”‚   โ”‚   โ”œโ”€โ”€ audit-log.js           # Action logging
โ”‚   โ”‚   โ”œโ”€โ”€ error-handler.js       # Error handling
โ”‚   โ”œโ”€โ”€ utils/
โ”‚   โ”‚   โ”œโ”€โ”€ csv-parser.js          # CSV import/export
โ”‚   โ”‚   โ”œโ”€โ”€ pdf-generator.js       # PDF reports
โ”œโ”€โ”€ Dockerfile
โ”œโ”€โ”€ package.json

Benefits:

  • Unified Auth: Single admin authentication middleware
  • Batch Operations: Process 100+ records in one request
  • Better Logging: Centralized audit trail of all admin actions
  • Easier Development: Run entire admin backend locally
  • Complex Workflows: Multi-step operations in single transaction
  • Reduced Costs: 1 service vs 10+ Cloud Functions

Migration Priority: ๐ŸŸก Phase 2 (Weeks 3-4) - Improve admin operations


3. Analytics Service ๐ŸŸก HIGH PRIORITY โ€‹

Current State:

  • No analytics infrastructure
  • No merchant-facing metrics
  • No cost tracking or usage monitoring
  • Manual reporting and data exports

Why Cloud Run:

  • Data Aggregation: Complex Firestore queries across multiple collections
  • Large Data Processing: Aggregate 100k+ events for reports
  • Scheduled Jobs: Daily/weekly/monthly report generation
  • Memory Intensive: Process 10k+ records for trend analysis
  • Export Generation: CSV/PDF reports can take minutes
  • Real-time Metrics: In-memory caching for dashboard
  • Cost Tracking: Parse Firebase/GCP billing data

Why High Priority:

  • Product Feature: Merchants need analytics to see venue performance
  • Business Intelligence: Track engagement, conversion, revenue
  • Cost Management: Monitor Firebase/API costs before they spike
  • Growth Metrics: Understand user behavior and retention

Recommended Architecture:

Cloud Run Service: analytics-api

javascript
// ===== Merchant Analytics =====
GET /analytics/merchant/dashboard
  Query: ?merchantId={id}&period=7d|30d|90d
  Returns: {
    venueStats: { views, lanterns, waves, conversion },
    offerStats: { impressions, claims, revenue },
    trends: { daily, weekly },
    topVenues: [...]
  }
  Description: Merchant dashboard metrics

GET /analytics/merchant/venue/:venueId
  Query: ?period=7d|30d|90d
  Returns: {
    activity: { lanterns, waves, uniqueUsers },
    peakHours: [...],
    demographics: { ageGroups, genderSplit },
    trends: { daily, hourly }
  }
  Description: Individual venue performance

GET /analytics/merchant/offer/:offerId
  Returns: {
    impressions: number,
    claims: number,
    redemptions: number,
    conversionRate: number,
    revenue: number,
    timeline: [...]
  }
  Description: Offer engagement metrics

POST /analytics/merchant/export
  Body: {
    merchantId,
    type: "venue-performance" | "offer-analytics" | "revenue",
    period: "7d" | "30d" | "90d",
    format: "csv" | "pdf"
  }
  Returns: { jobId, status }
  Description: Generate export job for merchant data

// ===== Admin/Internal Analytics =====
GET /analytics/admin/overview
  Returns: {
    users: { total, active, new, churn },
    venues: { total, active, topPerforming },
    engagement: { lanterns, waves, chats, matches },
    revenue: { total, byMerchant, byVenue },
    costs: { firebase, cloudRun, apis }
  }
  Description: High-level platform metrics

GET /analytics/admin/engagement
  Query: ?period=7d|30d|90d
  Returns: {
    dau: number,  // Daily active users
    wau: number,  // Weekly active users
    mau: number,  // Monthly active users
    retention: { day1, day7, day30 },
    sessionDuration: { avg, median, p95 },
    actions: { lanterns, waves, chats, matches }
  }
  Description: User engagement metrics

GET /analytics/admin/costs
  Query: ?period=7d|30d|90d
  Returns: {
    total: number,
    breakdown: {
      firestore: { reads, writes, deletes, cost },
      storage: { gb, cost },
      cloudRun: { requests, cpu, memory, cost },
      apis: { osm, nominatim, googlePlaces, cost }
    },
    trends: [...],
    projections: { monthly, annual }
  }
  Description: Cost tracking and projections

GET /analytics/admin/venues
  Query: ?sortBy=activity|lanterns|revenue&limit=50
  Returns: {
    venues: [{
      id, name, location,
      stats: { lanterns, waves, uniqueUsers },
      revenue: number,
      merchantShare: number
    }],
    total: number
  }
  Description: Venue performance rankings

GET /analytics/admin/merchants
  Returns: {
    merchants: [{
      id, name,
      venues: number,
      offers: number,
      revenue: number,
      engagement: { impressions, conversions }
    }]
  }
  Description: Merchant performance

// ===== Real-time Metrics =====
GET /analytics/realtime/activity
  Returns: {
    activeLanterns: number,
    activeWaves: number,
    onlineUsers: number,
    topVenues: [{ venueId, name, activeLanterns }],
    recentMatches: number
  }
  Description: Real-time platform activity

GET /analytics/realtime/venue/:venueId
  Returns: {
    activeLanterns: number,
    recentWaves: number,
    onlineUsers: number
  }
  Description: Real-time venue activity

// ===== Reports & Exports =====
POST /analytics/reports/generate
  Body: {
    type: "engagement" | "revenue" | "costs" | "venues" | "merchants",
    period: "daily" | "weekly" | "monthly" | "custom",
    startDate: "2026-01-01",
    endDate: "2026-01-31",
    format: "csv" | "pdf" | "json",
    recipients: ["email@example.com"]  // Optional email delivery
  }
  Returns: { jobId, status, estimatedTime }
  Description: Generate scheduled report

GET /analytics/reports/:jobId
  Returns: {
    status: "queued" | "processing" | "completed" | "failed",
    progress: number,
    downloadUrl: string,
    expiresAt: timestamp
  }
  Description: Get report job status and download link

GET /analytics/reports/scheduled
  Returns: { reports: [{...}] }
  Description: List scheduled/recurring reports

// ===== Data Export =====
POST /analytics/export/raw
  Body: {
    collection: "lanterns" | "waves" | "chats" | "venues",
    filters: {...},
    dateRange: { start, end },
    format: "csv" | "json" | "parquet"
  }
  Returns: { jobId, status }
  Description: Export raw data (admin-only)

// ===== Usage Tracking (Internal) =====
POST /analytics/track/api-usage
  Body: { service: "osm" | "nominatim" | "googlePlaces", requests: number }
  Returns: { recorded: true }
  Description: Track external API usage for cost monitoring

POST /analytics/track/feature-usage
  Body: { feature: string, userId: string, metadata: {...} }
  Returns: { recorded: true }
  Description: Track feature usage for product analytics

Internal Structure:

analytics-api/
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ index.js                   # Express app entry point
โ”‚   โ”œโ”€โ”€ routes/
โ”‚   โ”‚   โ”œโ”€โ”€ merchant.routes.js     # Merchant analytics
โ”‚   โ”‚   โ”œโ”€โ”€ admin.routes.js        # Admin/internal analytics
โ”‚   โ”‚   โ”œโ”€โ”€ realtime.routes.js     # Real-time metrics
โ”‚   โ”‚   โ”œโ”€โ”€ reports.routes.js      # Report generation
โ”‚   โ”‚   โ”œโ”€โ”€ export.routes.js       # Data exports
โ”‚   โ”œโ”€โ”€ services/
โ”‚   โ”‚   โ”œโ”€โ”€ aggregation.service.js # Data aggregation from Firestore
โ”‚   โ”‚   โ”œโ”€โ”€ metrics.service.js     # Metrics calculation
โ”‚   โ”‚   โ”œโ”€โ”€ cost-tracking.service.js # Parse billing data
โ”‚   โ”‚   โ”œโ”€โ”€ report-generator.service.js # PDF/CSV generation
โ”‚   โ”‚   โ”œโ”€โ”€ realtime.service.js    # Real-time metrics (cached)
โ”‚   โ”œโ”€โ”€ jobs/
โ”‚   โ”‚   โ”œโ”€โ”€ daily-aggregation.js   # Daily metrics aggregation
โ”‚   โ”‚   โ”œโ”€โ”€ cost-sync.js           # Sync billing data
โ”‚   โ”‚   โ”œโ”€โ”€ report-scheduler.js    # Scheduled reports
โ”‚   โ”œโ”€โ”€ middleware/
โ”‚   โ”‚   โ”œโ”€โ”€ auth.js                # Role-based auth (merchant vs admin)
โ”‚   โ”‚   โ”œโ”€โ”€ rate-limit.js          # Rate limiting
โ”‚   โ”‚   โ”œโ”€โ”€ cache.js               # Redis/in-memory cache
โ”‚   โ”œโ”€โ”€ utils/
โ”‚   โ”‚   โ”œโ”€โ”€ csv-generator.js       # CSV export
โ”‚   โ”‚   โ”œโ”€โ”€ pdf-generator.js       # PDF reports (puppeteer)
โ”‚   โ”‚   โ”œโ”€โ”€ date-utils.js          # Period calculations
โ”‚   โ”‚   โ”œโ”€โ”€ aggregators.js         # Data aggregation helpers
โ”œโ”€โ”€ Dockerfile
โ”œโ”€โ”€ package.json
โ””โ”€โ”€ cloudbuild.yaml

Data Storage Strategy (Hybrid with BigQuery):

Recommended: Three-Tier Architecture

javascript
// Tier 1: Real-time Metrics (Redis Cache)
// - Active lanterns/waves right now
// - Online users count
// - Top venues by current activity
// - TTL: 60 seconds
// - Read latency: <10ms

// Tier 2: Recent Analytics (Firestore)
// - Pre-aggregated daily/weekly/monthly metrics
// - Last 90 days of data
// - Fast reads for merchant dashboards
// - Read latency: 50-200ms
analytics/
  daily/
    {venueId}_{date} โ†’ { lanterns, waves, uniqueUsers, revenue, ... }
  weekly/
    {venueId}_{weekStart} โ†’ { ... }
  monthly/
    {venueId}_{monthStart} โ†’ { ... }

// Tier 3: Historical Analytics (BigQuery)
// - All historical data (unlimited retention)
// - Complex multi-table joins and aggregations
// - Ad-hoc queries and deep analytics
// - Cost tracking and projections
// - Machine learning / predictive analytics (future)
// - Read latency: 1-10 seconds (acceptable for reports)

BigQuery Architecture:

yaml
# BigQuery Dataset Structure
lantern_analytics:

  # Raw event tables (streamed from Firestore)
  raw_lanterns:              # All lantern events
    - lantern_id
    - user_id
    - venue_id
    - created_at
    - ended_at
    - status
    - metadata (JSON)

  raw_waves:                 # All wave events
    - wave_id
    - sender_id
    - receiver_id
    - venue_id
    - created_at
    - status
    - response

  raw_chats:                 # All chat events
    - chat_id
    - connection_id
    - sender_id
    - created_at
    - message_length

  raw_offers:                # Offer impressions/claims
    - offer_id
    - merchant_id
    - venue_id
    - user_id
    - action (impression/claim/redemption)
    - created_at
    - metadata (JSON)

  # Pre-aggregated tables (for fast queries)
  daily_venue_stats:         # Daily rollups
    - date
    - venue_id
    - lantern_count
    - wave_count
    - unique_users
    - revenue
    - avg_session_duration

  daily_merchant_stats:      # Daily merchant metrics
    - date
    - merchant_id
    - impression_count
    - claim_count
    - revenue
    - conversion_rate

  # Cost tracking tables
  daily_costs:               # GCP billing data
    - date
    - service (firestore/storage/cloudrun/apis)
    - operation_type
    - cost_usd
    - quantity (reads/writes/requests)

  # User cohort analysis
  user_cohorts:              # Retention analysis
    - cohort_date
    - users_count
    - day_1_retention
    - day_7_retention
    - day_30_retention

Data Flow Architecture:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Real-time Events (Firestore)                                โ”‚
โ”‚  โ€ข User creates lantern โ†’ writes to Firestore               โ”‚
โ”‚  โ€ข Triggers Cloud Function โ†’ streams to BigQuery            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                              โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  BigQuery Streaming Insert (Cloud Function)                  โ”‚
โ”‚  โ€ข Low latency (~1 second)                                   โ”‚
โ”‚  โ€ข Automatic schema detection                                โ”‚
โ”‚  โ€ข Cost: $0.05 per GB streamed                              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                              โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  BigQuery Tables (Raw Events)                                โ”‚
โ”‚  โ€ข Partitioned by date (cost optimization)                   โ”‚
โ”‚  โ€ข Clustered by venue_id, user_id                           โ”‚
โ”‚  โ€ข Retention: Unlimited (storage $0.02/GB/month)            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                              โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Scheduled Queries (BigQuery โ†’ Aggregated Tables)            โ”‚
โ”‚  โ€ข Run daily at 2 AM UTC                                    โ”‚
โ”‚  โ€ข Create daily_venue_stats, daily_merchant_stats           โ”‚
โ”‚  โ€ข Cost: $5 per TB scanned (optimized with partitions)      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                              โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Analytics API (Cloud Run)                                   โ”‚
โ”‚  โ€ข Recent data (90 days): Read from Firestore               โ”‚
โ”‚  โ€ข Historical data (>90 days): Query BigQuery               โ”‚
โ”‚  โ€ข Reports: BigQuery โ†’ CSV/PDF export                       โ”‚
โ”‚  โ€ข Cache results in Redis (5-minute TTL)                    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Implementation Approach:

Phase 1: Initial Setup (Week 1)

bash
# 1. Create BigQuery dataset
bq mk --dataset --location=us-central1 lantern_analytics

# 2. Create raw event tables with partitioning
bq mk --table \
  --time_partitioning_field created_at \
  --clustering_fields venue_id,user_id \
  lantern_analytics.raw_lanterns \
  schema_lanterns.json

# 3. Deploy Cloud Functions for streaming
# Trigger: onCreate("lanterns/{lanternId}")
# Action: Stream insert to BigQuery

Phase 2: Streaming Pipeline (Week 2)

javascript
// Cloud Function: streamLanternToBigQuery
exports.streamLanternToBigQuery = functions.firestore
  .document('lanterns/{lanternId}')
  .onCreate(async (snap, context) => {
    const bigquery = new BigQuery();
    const dataset = bigquery.dataset('lantern_analytics');
    const table = dataset.table('raw_lanterns');

    const row = {
      lantern_id: context.params.lanternId,
      user_id: snap.data().userId,
      venue_id: snap.data().venueId,
      created_at: snap.data().createdAt.toDate().toISOString(),
      status: snap.data().status,
      metadata: JSON.stringify(snap.data())
    };

    await table.insert([row]);
  });

Phase 3: Scheduled Aggregations (Week 3)

sql
-- Scheduled query: daily_venue_stats (runs daily at 2 AM)
CREATE OR REPLACE TABLE lantern_analytics.daily_venue_stats
PARTITION BY date
AS
SELECT
  DATE(created_at) as date,
  venue_id,
  COUNT(DISTINCT lantern_id) as lantern_count,
  COUNT(DISTINCT user_id) as unique_users,
  AVG(TIMESTAMP_DIFF(ended_at, created_at, MINUTE)) as avg_duration_minutes
FROM lantern_analytics.raw_lanterns
WHERE DATE(created_at) = CURRENT_DATE() - 1  -- Yesterday only
GROUP BY date, venue_id;

Phase 4: Analytics API Integration (Week 4)

javascript
// analytics-api service
async function getVenueAnalytics(venueId, period) {
  const startDate = calculateStartDate(period);

  // Recent data: Use Firestore (fast)
  if (startDate > Date.now() - 90 * 24 * 60 * 60 * 1000) {
    return getFromFirestore(venueId, startDate);
  }

  // Historical data: Use BigQuery
  const query = `
    SELECT
      date,
      lantern_count,
      unique_users,
      avg_duration_minutes
    FROM lantern_analytics.daily_venue_stats
    WHERE venue_id = @venueId
      AND date >= @startDate
    ORDER BY date DESC
  `;

  const [rows] = await bigquery.query({
    query,
    params: { venueId, startDate }
  });

  return rows;
}

Benefits:

  • Merchant Value: Data-driven insights to improve venue performance
  • Cost Control: Early warning for runaway Firebase/API costs
  • Product Intelligence: Understand feature usage and user behavior
  • Revenue Optimization: Track offer performance and conversion
  • Scheduled Reports: Automated daily/weekly/monthly reports
  • Real-time Dashboards: Fast metrics without hitting Firestore constantly
  • Unlimited Historical Data: BigQuery retains all data forever at low cost
  • Complex Analytics: Multi-table joins, cohort analysis, funnel tracking
  • Cost-Effective at Scale: Firestore reads cost $0.36/million; BigQuery $5/TB scanned
  • ML/AI Ready: BigQuery ML for predictive analytics and recommendations

BigQuery Cost Analysis (10,000 MAU):

Streaming Inserts:
- 10k users ร— 5 events/day ร— 30 days = 1.5M events/month
- 1.5M events ร— 1KB avg = 1.5GB/month
- Cost: 1.5GB ร— $0.05/GB = $0.08/month

Storage:
- Year 1: 18GB ร— $0.02/GB/month = $0.36/month
- Year 2: 36GB ร— $0.02/GB/month = $0.72/month
- Year 3: 54GB ร— $0.02/GB/month = $1.08/month

Query Costs (with partitioning):
- 100 merchant dashboard queries/day ร— 30 days = 3,000 queries/month
- Avg 10MB scanned per query (90-day window, partitioned)
- 3,000 ร— 10MB = 30GB = 0.03TB
- Cost: 0.03TB ร— $5/TB = $0.15/month

Scheduled Aggregations:
- Daily aggregation: 500MB scanned/day ร— 30 = 15GB/month
- Cost: 0.015TB ร— $5/TB = $0.08/month

Total BigQuery Cost: ~$0.70/month (Year 1)

Compare to Firestore-only approach:
- 3,000 queries ร— 50 document reads = 150k reads/month
- 150k ร— $0.36/million = $0.05/month (limited to 90 days)

BigQuery enables unlimited historical analysis for <$1/month!

When to Use Each Storage Tier:

Use CaseStorageRead LatencyCostBest For
Real-time metricsRedis<10msFree (in-memory)Live dashboards, current activity
Recent analytics (90d)Firestore50-200ms$0.36/million readsMerchant dashboards, daily reports
Historical analyticsBigQuery1-10s$5/TB scannedTrend analysis, cohort reports, ML
Cost trackingBigQuery1-10s$5/TB scannedGCP billing, projections, alerts

Migration Priority: ๐ŸŸก Phase 2B (Weeks 4-6) - High priority for product/business


3B. Billing & Cost Monitoring Service ๐ŸŸก HIGH PRIORITY โ€‹

Current State:

  • No centralized billing visibility
  • Manual cost tracking across 3+ vendors (GCP, Cloudflare, Railway)
  • No historical cost data
  • No anomaly detection or cost alerts
  • No OpenAI/Resend usage tracking

Why Cloud Run + BigQuery:

  • Multi-Vendor Consolidation: Aggregate billing from all providers
  • Historical Analysis: Unlimited retention of billing data
  • Automated Ingestion: Daily jobs pull data from each vendor API
  • Anomaly Detection: Alert on cost spikes before invoices arrive
  • Attribution: Tag costs by environment, service, feature
  • Forecasting: Predict month-end costs based on trends
  • Compliance: Audit trail for all billing data

Why High Priority:

  • Cost Control: Prevent runaway bills before they happen (Issue #52)
  • Budget Planning: Accurate forecasts for investor/board reporting
  • Vendor Comparison: Identify where to optimize spend
  • Transparency: AI usage visibility for users (Issue #188)
  • Operational Health: Billing visibility in admin portal (Issue #233)
  • Historical Data: Capture data from day 1 (can't retroactively get history)

Related Issues: #52, #233, #240, #238, #188

Recommended Architecture:

Cloud Run Service: billing-api

javascript
// ===== Billing Dashboard APIs =====
GET /billing/overview
  Query: ?period=current|last|custom&startDate&endDate
  Returns: {
    totalCost: number,
    byVendor: { gcp, cloudflare, railway, openai, resend },
    byService: { firestore, storage, cloudRun, functions, pages },
    trend: { daily, weekly, monthly },
    projectedMonthEnd: number,
    alerts: [{ type, severity, message }]
  }
  Description: High-level billing overview

GET /billing/vendor/:vendor
  Params: vendor = "gcp" | "cloudflare" | "railway" | "openai" | "resend"
  Query: ?period=30d|90d|custom
  Returns: {
    invoices: [{...}],
    lineItems: [{...}],
    usageMetrics: {...},
    costTrend: [...]
  }
  Description: Vendor-specific billing details

GET /billing/service/:service
  Params: service = "firestore" | "storage" | "cloudRun" | "functions" | etc.
  Returns: {
    cost: number,
    usage: { reads, writes, storage, bandwidth },
    costPerUnit: {...},
    optimization: [{ recommendation, estimatedSavings }]
  }
  Description: Service-level cost breakdown

GET /billing/environment/:env
  Params: env = "dev" | "prod"
  Returns: { cost, breakdown, trend }
  Description: Environment-based cost attribution

// ===== Cost Tracking & Analytics =====
GET /billing/trends
  Query: ?metric=cost|usage&groupBy=vendor|service|env&period=90d
  Returns: { dataPoints: [...], trend, forecast }
  Description: Cost trend analysis

GET /billing/anomalies
  Query: ?severity=critical|warning|info&period=30d
  Returns: {
    anomalies: [{
      date, service, expectedCost, actualCost, variance, severity
    }]
  }
  Description: Cost anomaly detection

GET /billing/forecast
  Query: ?horizon=7d|30d|90d
  Returns: {
    projectedCost: number,
    confidence: number,
    trend: "increasing" | "stable" | "decreasing",
    breakdown: {...}
  }
  Description: Cost forecasting based on trends

// ===== AI Usage Tracking (Issue #240, #188) =====
GET /billing/ai/usage
  Query: ?service=openai|anthropic&period=30d
  Returns: {
    totalTokens: number,
    totalCost: number,
    byModel: { "gpt-4": {...}, "claude-3": {...} },
    byFeature: { "chatbot": {...}, "triage": {...} },
    dailyUsage: [...]
  }
  Description: AI API usage and costs

POST /billing/ai/track
  Body: { service, model, tokens, cost, feature, userId }
  Returns: { recorded: true }
  Description: Track AI usage (called by services)

// ===== Export & Reporting =====
POST /billing/reports/generate
  Body: {
    type: "monthly-summary" | "vendor-comparison" | "cost-attribution",
    period: { start, end },
    format: "csv" | "pdf" | "json",
    recipients: ["email@example.com"]  // Optional
  }
  Returns: { jobId, status }
  Description: Generate billing report

GET /billing/reports/:jobId
  Returns: { status, downloadUrl, expiresAt }
  Description: Get report status and download

GET /billing/export/raw
  Query: ?vendor&startDate&endDate&format=csv|json
  Returns: { downloadUrl }
  Description: Export raw billing data (admin-only)

// ===== Admin Operations =====
POST /billing/sync/trigger
  Body: { vendor: "all" | "gcp" | "cloudflare" | "railway" }
  Returns: { jobId, status }
  Description: Manually trigger billing data sync

GET /billing/sync/status
  Returns: {
    lastSync: { vendor, timestamp, status, rowsImported },
    nextSync: { vendor, scheduledTime }
  }
  Description: Check sync job status

POST /billing/alerts/configure
  Body: {
    type: "daily-threshold" | "spike-detection" | "forecast-exceeded",
    threshold: number,
    recipients: [...]
  }
  Returns: { alertId, status }
  Description: Configure cost alerts

GET /billing/health
  Returns: {
    dataFreshness: { vendor, lastUpdate, status },
    pipelineStatus: "healthy" | "degraded" | "failing",
    errors: [...]
  }
  Description: Billing pipeline health check

Cloud Run Jobs (Scheduled):

javascript
// Daily ingestion jobs
job: ingest-gcp-billing        // Runs: Daily 2 AM UTC
  - Reads from GCP Billing Export (auto-configured)
  - Writes to billing_raw.gcp_billing_export

job: ingest-cloudflare         // Runs: Daily 3 AM UTC
  - Pulls billing API data
  - Writes to billing_raw.cloudflare_invoices_raw

job: ingest-railway            // Runs: Daily 3:15 AM UTC
  - Pulls usage metrics
  - Writes to billing_raw.railway_invoices_raw

job: ingest-openai             // Runs: Daily 3:30 AM UTC (Issue #240)
  - Pulls usage data
  - Writes to billing_raw.openai_usage_raw

job: ingest-resend             // Runs: Daily 3:45 AM UTC (Issue #238)
  - Pulls email sending metrics
  - Writes to billing_raw.resend_usage_raw

// Hourly health check
job: cost-health-check         // Runs: Every hour
  - Validate data freshness
  - Detect anomalies (cost spikes >20% day-over-day)
  - Check unattributed costs
  - Send Slack/email alerts if issues found

Internal Structure:

billing-api/
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ index.js                   # Express app
โ”‚   โ”œโ”€โ”€ routes/
โ”‚   โ”‚   โ”œโ”€โ”€ dashboard.routes.js    # Overview, trends, forecasts
โ”‚   โ”‚   โ”œโ”€โ”€ vendor.routes.js       # Vendor-specific billing
โ”‚   โ”‚   โ”œโ”€โ”€ ai.routes.js           # AI usage tracking
โ”‚   โ”‚   โ”œโ”€โ”€ reports.routes.js      # Report generation
โ”‚   โ”‚   โ”œโ”€โ”€ admin.routes.js        # Sync, alerts, health
โ”‚   โ”œโ”€โ”€ services/
โ”‚   โ”‚   โ”œโ”€โ”€ gcp.service.js         # GCP billing API client
โ”‚   โ”‚   โ”œโ”€โ”€ cloudflare.service.js  # Cloudflare billing API
โ”‚   โ”‚   โ”œโ”€โ”€ railway.service.js     # Railway API client
โ”‚   โ”‚   โ”œโ”€โ”€ openai.service.js      # OpenAI usage tracking
โ”‚   โ”‚   โ”œโ”€โ”€ resend.service.js      # Resend metrics
โ”‚   โ”‚   โ”œโ”€โ”€ bigquery.service.js    # BigQuery read/write
โ”‚   โ”‚   โ”œโ”€โ”€ anomaly.service.js     # Anomaly detection
โ”‚   โ”‚   โ”œโ”€โ”€ forecast.service.js    # Cost forecasting
โ”‚   โ”œโ”€โ”€ jobs/
โ”‚   โ”‚   โ”œโ”€โ”€ ingest-gcp.js          # GCP billing ingestion
โ”‚   โ”‚   โ”œโ”€โ”€ ingest-cloudflare.js   # Cloudflare ingestion
โ”‚   โ”‚   โ”œโ”€โ”€ ingest-railway.js      # Railway ingestion
โ”‚   โ”‚   โ”œโ”€โ”€ ingest-openai.js       # OpenAI ingestion
โ”‚   โ”‚   โ”œโ”€โ”€ ingest-resend.js       # Resend ingestion
โ”‚   โ”‚   โ”œโ”€โ”€ normalize.js           # Transform to canonical schema
โ”‚   โ”‚   โ”œโ”€โ”€ attribute.js           # Apply attribution rules
โ”‚   โ”‚   โ”œโ”€โ”€ health-check.js        # Pipeline health validation
โ”‚   โ”œโ”€โ”€ middleware/
โ”‚   โ”‚   โ”œโ”€โ”€ admin-auth.js          # Admin-only endpoints
โ”‚   โ”‚   โ”œโ”€โ”€ rate-limit.js          # Rate limiting
โ”‚   โ”‚   โ”œโ”€โ”€ cache.js               # Redis caching (1hr TTL)
โ”‚   โ”œโ”€โ”€ utils/
โ”‚   โ”‚   โ”œโ”€โ”€ schema-mapper.js       # Vendor โ†’ canonical mapping
โ”‚   โ”‚   โ”œโ”€โ”€ attribution.js         # Cost attribution logic
โ”‚   โ”‚   โ”œโ”€โ”€ pdf-generator.js       # PDF reports
โ”‚   โ”‚   โ”œโ”€โ”€ alert-sender.js        # Slack/email alerts
โ”œโ”€โ”€ Dockerfile
โ”œโ”€โ”€ package.json
โ””โ”€โ”€ cloudbuild.yaml

BigQuery Architecture (Expanded from Issue #52):

yaml
# Combined Analytics + Billing Dataset Structure

# ========== APP ANALYTICS (existing plan) ==========
lantern_analytics:
  # User engagement data
  raw_lanterns:              # All lantern events
  raw_waves:                 # All wave events
  raw_chats:                 # All chat events
  raw_offers:                # Offer impressions/claims

  # Pre-aggregated app metrics
  daily_venue_stats:         # Daily venue performance
  daily_merchant_stats:      # Daily merchant metrics
  user_cohorts:              # Retention analysis

# ========== BILLING DATA (new - Issue #52) ==========
billing_raw:
  # Immutable vendor-shaped data
  gcp_billing_export:               # GCP auto-export (native)
  gcp_billing_export_resource_v1:   # Detailed resource costs

  cloudflare_invoices_raw:          # Cloudflare invoices
  cloudflare_invoice_line_items:    # Line items
  cloudflare_usage_raw:             # Usage metrics

  railway_invoices_raw:             # Railway invoices
  railway_invoice_line_items:       # Line items
  railway_usage_raw:                # Usage metrics

  openai_usage_raw:                 # OpenAI API usage (Issue #240)
  resend_usage_raw:                 # Resend email metrics (Issue #238)

billing_norm:
  # Canonical cost schema (all vendors normalized)
  fact_cost_line_items:     # One row = one cost event
    - date
    - vendor (gcp|cloudflare|railway|openai|resend)
    - service (firestore|storage|cloudRun|etc.)
    - sku
    - quantity
    - unit_price
    - total_cost
    - currency
    - metadata (JSON)

  # Helper dimensions
  dim_vendor_sku:           # SKU โ†’ human-readable mapping
  dim_resource_map:         # Resource โ†’ service mapping

billing_attrib:
  # Attribution rules and taxonomy
  dim_taxonomy:             # Allowed env/app/team/cost_center
  dim_cost_allocation_rules: # Rule-based cost attribution

  # Attributed costs
  fact_cost_attributed:     # fact_cost_line_items + attribution
    - All fields from fact_cost_line_items
    - environment (dev|prod)
    - service_name (venue-api|admin-api|analytics-api)
    - feature (lanterns|waves|venues|ai-triage)
    - cost_center
    - team

  # Shared cost allocation (optional)
  fact_cost_allocated:      # Proportional cost splits

billing_marts:
  # BI-friendly rollups (fast dashboards)
  daily_cost_by_vendor:             # Vendor totals per day
  daily_cost_by_service:            # Service costs per day
  daily_cost_by_environment:        # Dev vs prod costs
  monthly_cost_by_vendor:           # Monthly vendor totals
  monthly_cost_by_service:          # Monthly service costs
  top_cost_drivers_30d:             # Biggest cost sources
  unattributed_cost_daily:          # Costs without attribution
  cost_trend_rolling_7d:            # 7-day rolling averages
  cost_forecast_30d:                # Forecasted costs
  ai_usage_by_model:                # AI API usage breakdown (Issue #188)

ops:
  # Pipeline health and monitoring
  pipeline_runs:            # Job execution logs
  data_freshness:           # Vendor data freshness checks
  cost_anomalies:           # Spend spikes, attribution drift
  alert_history:            # Alert notifications sent

Data Flow:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Vendor APIs (Daily Scheduled Jobs)                          โ”‚
โ”‚  โ€ข GCP Billing Export โ†’ BigQuery (auto)                      โ”‚
โ”‚  โ€ข Cloudflare API โ†’ Cloud Run Job โ†’ billing_raw.cloudflare_*โ”‚
โ”‚  โ€ข Railway API โ†’ Cloud Run Job โ†’ billing_raw.railway_*      โ”‚
โ”‚  โ€ข OpenAI API โ†’ Cloud Run Job โ†’ billing_raw.openai_*        โ”‚
โ”‚  โ€ข Resend API โ†’ Cloud Run Job โ†’ billing_raw.resend_*        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                              โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  BigQuery Scheduled Queries (Daily 4 AM)                     โ”‚
โ”‚  โ€ข Normalize vendor data โ†’ billing_norm.fact_cost_line_itemsโ”‚
โ”‚  โ€ข Apply attribution rules โ†’ billing_attrib.fact_cost_attributedโ”‚
โ”‚  โ€ข Build dashboard marts โ†’ billing_marts.*                   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                              โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Cost Health Check Job (Hourly)                              โ”‚
โ”‚  โ€ข Validate data freshness (last 24 hours)                   โ”‚
โ”‚  โ€ข Detect anomalies (>20% day-over-day spike)               โ”‚
โ”‚  โ€ข Check unattributed % (<5% threshold)                     โ”‚
โ”‚  โ€ข Write to ops.cost_anomalies                              โ”‚
โ”‚  โ€ข Send alerts if issues found                              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                              โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Billing API (Cloud Run Service)                             โ”‚
โ”‚  โ€ข Admin portal dashboard (real-time metrics)                โ”‚
โ”‚  โ€ข Cost reports (CSV/PDF exports)                            โ”‚
โ”‚  โ€ข AI usage transparency (public-facing, Issue #188)         โ”‚
โ”‚  โ€ข Forecasting and trend analysis                            โ”‚
โ”‚  โ€ข Reads from billing_marts.* (fast queries)                โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Cost Attribution Example:

sql
-- Example: Attribute Firestore costs to specific features
INSERT INTO billing_attrib.fact_cost_attributed
SELECT
  fcl.*,
  CASE
    WHEN resource LIKE '%/lanterns/%' THEN 'lanterns'
    WHEN resource LIKE '%/waves/%' THEN 'waves'
    WHEN resource LIKE '%/venues/%' THEN 'venues'
    WHEN resource LIKE '%/chats/%' THEN 'chats'
    ELSE 'unattributed'
  END as feature,
  CASE
    WHEN project_id = 'lantern-app-dev' THEN 'dev'
    WHEN project_id = 'lantern-app-prod' THEN 'prod'
  END as environment
FROM billing_norm.fact_cost_line_items fcl
WHERE vendor = 'gcp' AND service = 'firestore';

Benefits:

  • Multi-Vendor Visibility: Single dashboard for all billing (GCP, Cloudflare, Railway, OpenAI, Resend)
  • Historical Analysis: Unlimited retention for trend analysis and forecasting
  • Early Warnings: Anomaly detection alerts before invoices arrive
  • Cost Attribution: Understand which features/environments drive costs
  • AI Transparency: Public-facing AI usage dashboard (Issue #188)
  • Automated Reports: Daily/weekly/monthly billing summaries
  • Forecasting: Predict month-end costs with 7-day trend analysis
  • Optimization: Identify cost-saving opportunities by service
  • Audit Trail: Complete billing history for compliance
  • Addresses 5 Issues: #52 (billing pipeline), #233 (system health), #240 (OpenAI), #238 (Resend), #188 (AI transparency)

Migration Priority: ๐ŸŸก Phase 2B (Weeks 4-6) - Combined with analytics-api deployment


4. Image Processing Service ๐ŸŸข MEDIUM PRIORITY โ€‹

Current State:

  • Direct Firebase Storage uploads
  • Client-side validation only
  • No image optimization or resizing

Why Cloud Run:

  • Multi-step Processing: Resize, compress, watermark, format conversion
  • Native Libraries: Sharp, ImageMagick for advanced operations
  • Large Files: Handle 5MB+ uploads without timeout
  • Multiple Outputs: Generate thumbnails, previews, full-size in one request
  • Security: Malware scanning, EXIF stripping, content moderation

Recommended Architecture:

Cloud Run Service: image-service

javascript
POST /images/upload
  Body: multipart/form-data image file
  Returns: { imageId, urls: { thumbnail, preview, full }, metadata }
  Description: Upload and process image (auto-generates all sizes)

POST /images/:imageId/resize
  Body: { width, height, fit: "cover" | "contain" | "fill" }
  Returns: { url, dimensions }
  Description: Generate custom size

POST /images/:imageId/compress
  Body: { quality: 1-100, format: "webp" | "jpeg" }
  Returns: { url, originalSize, compressedSize, savings }
  Description: Compress image

GET /images/:imageId/metadata
  Returns: { dimensions, format, size, exif, colors }
  Description: Get image metadata

POST /images/:imageId/moderate
  Returns: { safe, labels, nsfw_score }
  Description: Content moderation check

DELETE /images/:imageId
  Returns: { deleted }
  Description: Delete image and all variants

Benefits:

  • Unified Pipeline: All image processing in one place
  • Consistent Quality: Automatic optimization and sizing
  • Security: Malware scanning and EXIF stripping
  • Multiple Formats: WebP for modern browsers, JPEG fallback
  • Watermarking: For merchant offer photos

Migration Priority: ๐ŸŸข Phase 3 (Weeks 6-8) - If image uploads are high-volume


5. Real-time Updates Service (Future) ๐Ÿ”ต LOW PRIORITY โ€‹

Current State:

  • Firestore listeners for real-time updates
  • Works well for current scale

When to Build: Only if you need:

  • WebSocket connections for 10,000+ concurrent users
  • Sub-100ms latency for live updates
  • Presence system (online/offline, typing indicators)
  • Live lantern count updates at high traffic venues

Recommended Architecture:

Cloud Run Service: realtime-service

javascript
WS   /ws/connect
  Description: Establish WebSocket connection

WS   /ws/venue/:venueId
  Description: Subscribe to venue updates

WS   /ws/chat/:connectionId
  Description: Subscribe to chat messages

HTTP /presence/status
  Returns: { online, venueActivity }
  Description: Get current presence status

Migration Priority: ๐Ÿ”ต Phase 4+ (Weeks 9+) - Only if Firestore listeners don't scale


6. Full-Text Search Service (Future) ๐Ÿ”ต LOW PRIORITY โ€‹

Current State:

  • Basic Firestore queries
  • Limited search capabilities

When to Build: Only if you need:

  • Fuzzy search ("cofee" โ†’ "coffee")
  • Multi-field search (name + address + tags)
  • Real-time autocomplete
  • Ranking and relevance scoring

Recommended Architecture:

Cloud Run Service: search-service

javascript
GET  /search/venues?q=coffee+downtown&limit=20
  Returns: { results, total, suggestions }
  Description: Search venues with fuzzy matching

GET  /search/users?q=alice
  Returns: { results, total }
  Description: Search users (admin-only)

POST /search/index/rebuild
  Returns: { indexed, errors }
  Description: Rebuild search index

WS   /search/suggest
  Description: Real-time autocomplete

Technology Options:

  • MeiliSearch (easy setup, good DX)
  • Typesense (fast, typo-tolerant)
  • Elasticsearch (powerful, complex)

Migration Priority: ๐Ÿ”ต Phase 4+ (Weeks 9+) - Only if basic Firestore queries insufficient


๐Ÿ”ต Keep as Cloud Functions โ€‹

Well-suited for Cloud Functions:

  • Short-lived operations (<30 seconds)
  • Event-driven triggers (onCreate, onUpdate, etc.)
  • Simple HTTP endpoints
  • Lightweight utilities

Functions to Keep:

javascript
// Auth Triggers
onCreate("users/{uid}") โ†’ initializeUserProfile
onDelete("users/{uid}") โ†’ cleanupUserData

// Firestore Triggers
onUpdate("lanterns/{lanternId}") โ†’ updateVenueActivityCount
onDelete("venues/{venueId}") โ†’ cleanupVenueLanterns

// Scheduled Functions (if simple)
schedule("every 24 hours") โ†’ cleanupExpiredLanterns
schedule("every 1 hour") โ†’ cleanupExpiredWaves

// Simple HTTP Endpoints
POST /logClientError โ†’ Log client error to Cloud Logging
GET /health โ†’ Service health check

๐Ÿ“Š Cloud Functions vs Cloud Run Comparison โ€‹

FeatureCloud FunctionsCloud Run
Max Execution Time9 minutes60 minutes (configurable)
Cold Start~1-2 seconds~2-5 seconds
Concurrency1-1000 requests/instanceConfigurable (โ‰ค1000)
WebSocket SupportโŒ Noโœ… Yes
Container SizeLimitedUp to 10GB
Memory256MB - 16GB128MB - 32GB
CPUAlways allocatedCan be CPU-only during request
Pricing ModelInvocation + GB-secRequest + CPU-time + Memory
Best ForEvent-driven, simple HTTPAPIs, batch jobs, long tasks
Local DevelopmentFirebase EmulatorDocker
State ManagementNone (stateless)In-memory caching supported
Native DependenciesLimitedFull container support

๐ŸŽฏ Optimized Migration Roadmap โ€‹

Phase 1 โ€“ Immediate (Weeks 1-2) ๐Ÿ”ด CRITICAL โ€‹

Service: venue-api (combined import + refresh)

Why First:

  • Fixes critical security issue (Issue #224 - OSM import abuse)
  • Addresses firestore.rules:394 public write vulnerability
  • Unblocks Google Places API integration
  • High impact, clear requirements

Deliverables:

  • [ ] Cloud Run service deployed to dev
  • [ ] OSM import endpoint with job queue
  • [ ] Venue refresh endpoint with rate limiting
  • [ ] Admin consolidation endpoint (dedupe)
  • [ ] Firestore security rules updated (remove public write)
  • [ ] Documentation and API reference
  • [ ] Integration tests

Success Metrics:

  • OSM imports no longer client-accessible
  • Venue refresh rate limited to 1 req/sec
  • Import jobs handle 1000+ venues without timeout

Phase 2 โ€“ Short Term (Weeks 3-4) ๐ŸŸก HIGH PRIORITY โ€‹

Service: admin-api (consolidated admin operations)

Why Second:

  • Improves admin portal backend
  • Enables batch operations for efficiency
  • Better local development experience
  • Reduces cold start latency

Deliverables:

  • [ ] Cloud Run service deployed to dev
  • [ ] Migrate existing Cloud Functions to Express routes:
    • getUserByEmail โ†’ GET /admin/users/:uid
    • setUserRole โ†’ POST /admin/users/:uid/role
    • banUser โ†’ POST /admin/users/:uid/ban
    • unbanUser โ†’ DELETE /admin/users/:uid/ban
  • [ ] New batch operations:
    • Bulk ban users
    • Bulk import users
    • Bulk venue updates
  • [ ] Admin audit logging
  • [ ] Admin portal updated to use new API
  • [ ] Deprecation plan for old Cloud Functions

Success Metrics:

  • Admin operations 50% faster (reduced cold starts)
  • Batch operations process 100+ records in <10 seconds
  • Local development setup documented

Phase 2B โ€“ Analytics + Billing (Weeks 4-6) ๐ŸŸก HIGH PRIORITY โ€‹

Services: analytics-api + billing-api (combined deployment)

Why Critical:

  • Merchant-facing feature (product requirement)
  • Cost tracking and usage monitoring (Issues #52, #233)
  • Business intelligence for growth
  • Revenue optimization for merchants
  • Multi-vendor billing consolidation (GCP, Cloudflare, Railway, OpenAI, Resend)
  • AI usage transparency (Issue #188)
  • Historical billing data (can't be retroactively captured)

Related Issues: #52, #233, #240, #238, #188

Deliverables:

BigQuery Infrastructure:

  • [ ] Create lantern_analytics dataset:
    • Create partitioned raw event tables (lanterns, waves, chats, offers)
    • Deploy Cloud Functions for streaming inserts
    • Create scheduled queries for daily aggregations
  • [ ] Create billing_raw dataset:
    • Set up GCP Billing Export (auto-configured)
    • Create tables: cloudflare_invoices_raw, railway_invoices_raw
    • Create tables: openai_usage_raw, resend_usage_raw
  • [ ] Create billing_norm dataset:
    • fact_cost_line_items (canonical schema)
    • dim_vendor_sku, dim_resource_map
  • [ ] Create billing_attrib dataset:
    • dim_taxonomy, dim_cost_allocation_rules
    • fact_cost_attributed (with env/service/feature tags)
  • [ ] Create billing_marts dataset:
    • daily_cost_by_vendor, daily_cost_by_service
    • monthly_cost_by_vendor, cost_trend_rolling_7d
    • cost_forecast_30d, ai_usage_by_model
  • [ ] Create ops dataset:
    • pipeline_runs, data_freshness, cost_anomalies

Analytics API (analytics-api):

  • [ ] Merchant analytics endpoints:
    • Dashboard metrics (venue/offer performance)
    • Venue analytics (activity, demographics, trends)
    • Offer analytics (impressions, claims, conversions)
    • Export reports (CSV/PDF)
  • [ ] Admin analytics endpoints:
    • Platform overview (users, engagement, revenue)
    • Venue rankings, merchant performance
  • [ ] Real-time metrics (cached):
    • Active lanterns/waves, online users, top venues
  • [ ] Hybrid query logic:
    • Recent data (90 days): Read from Firestore
    • Historical data (>90 days): Query BigQuery

Billing API (billing-api):

  • [ ] Billing dashboard endpoints:
    • Overview (total cost, by vendor, by service, trends)
    • Vendor-specific details (GCP, Cloudflare, Railway, OpenAI, Resend)
    • Service-level cost breakdown
    • Environment-based attribution (dev vs prod)
  • [ ] Cost tracking & analytics:
    • Trend analysis, anomaly detection, forecasting
    • AI usage tracking (Issue #240 - OpenAI, Issue #188 - transparency)
    • Cost attribution by feature/service
  • [ ] Export & reporting:
    • Generate monthly billing reports (CSV/PDF)
    • Raw data exports (admin-only)
    • Email delivery of scheduled reports
  • [ ] Admin operations:
    • Manual sync triggers
    • Alert configuration (thresholds, spike detection)
    • Pipeline health checks

Cloud Run Jobs (Scheduled):

  • [ ] Daily billing ingestion jobs (3 AM UTC):
    • ingest-cloudflare (Cloudflare billing API)
    • ingest-railway (Railway usage API)
    • ingest-openai (OpenAI usage API, Issue #240)
    • ingest-resend (Resend metrics API, Issue #238)
  • [ ] Hourly cost health check:
    • Validate data freshness (<24hr)
    • Detect cost anomalies (>20% day-over-day)
    • Alert on unattributed costs (>5%)
    • Send Slack/email alerts
  • [ ] Daily analytics aggregation:
    • Firestore โ†’ BigQuery streaming
    • BigQuery scheduled queries (normalize, attribute, rollup)
    • Weekly/monthly reports (BigQuery queries)
    • Cost sync from GCP billing API Portal Integration:
  • [ ] Merchant portal updates:
    • Analytics dashboard (venue/offer performance)
    • Export buttons (CSV/PDF reports)
    • Real-time venue stats
  • [ ] Admin portal updates:
    • System Health "Billing" tab (Issue #233)
    • Multi-vendor cost dashboard (GCP, Cloudflare, Railway, OpenAI, Resend)
    • Platform metrics (users, engagement, revenue)
    • Merchant reports
    • Cost anomaly alerts
    • AI usage transparency page (Issue #188 - public-facing)

Success Metrics:

  • Merchants can view venue performance in <2 seconds
  • Multi-vendor billing data visible in admin portal (Issues #52, #233)
  • Cost tracking updates daily from all vendors
  • Anomaly detection alerts fire within 1 hour of spike
  • Reports generate in <30 seconds for 90-day periods
  • Real-time metrics cached and served in <100ms
  • AI usage transparency page loads in <1 second (Issue #188)
  • Historical billing data captured from deployment date forward

Phase 3 โ€“ Medium Term (Weeks 7-9) ๐ŸŸข MEDIUM PRIORITY โ€‹

Service: image-service (if needed)

Why Third:

  • Image processing only needed if uploads become high-volume
  • Can start with direct Firebase Storage uploads
  • Migrate when optimization/security becomes critical

Deliverables:

  • [ ] Assess current image upload volume and requirements
  • [ ] If needed: Deploy image-service with Sharp/ImageMagick
  • [ ] Multi-size image generation (thumbnail, preview, full)
  • [ ] Image compression and format optimization
  • [ ] Content moderation integration
  • [ ] Watermarking for merchant offers

Phase 4 โ€“ Long Term (Weeks 10+) ๐Ÿ”ต LOW PRIORITY โ€‹

Services (build only if needed):

  • realtime-service (if Firestore listeners don't scale)
  • search-service (if Firestore queries insufficient)

Why Last:

  • Current solutions may be sufficient for 1-2 years
  • Build these only when you hit scaling issues
  • Focus on product features instead

๐Ÿ’ฐ Cost Analysis โ€‹

Cloud Functions (Current) โ€‹

Estimate for 10,000 monthly active users:
- 10 admin functions ร— $0.40/million invocations = $4/month
- 2GB-sec per invocation ร— 50,000 invocations = $1.25/month
Total: ~$5.25/month

Cloud Run (Proposed) โ€‹

Estimate for 10,000 monthly active users:

venue-api:
- 100,000 requests/month
- 512MB memory, 1 vCPU
- Avg 2 seconds per request
- Cost: ~$15/month

admin-api:
- 10,000 requests/month
- 256MB memory, 1 vCPU
- Avg 500ms per request
- Cost: ~$3/month

analytics-api:
- 50,000 requests/month (merchant dashboards + admin)
- 512MB memory, 1 vCPU
- Avg 1 second per request (cached metrics)
- Scheduled jobs: 1 daily (aggregation)
- Cost: ~$6/month

billing-api:
- 20,000 requests/month (admin dashboards + reports)
- 512MB memory, 1 vCPU
- Avg 1 second per request (cached metrics)
- Scheduled jobs: 5 daily (GCP, Cloudflare, Railway, OpenAI, Resend ingest)
- Hourly health check job
- Cost: ~$8/month

BigQuery (analytics + billing):
- Streaming inserts: 1.5GB/month app data + 0.5GB/month billing data = $0.10/month
- Storage (Year 1): 24GB ร— $0.02/GB = $0.48/month
- Query costs: 3,500 queries ร— 15MB avg = 0.05TB ร— $5/TB = $0.25/month
- Scheduled aggregations: 0.02TB ร— $5/TB = $0.10/month
- Total BigQuery: ~$1/month

Cloud Functions (streaming):
- 5 Firestore triggers for BigQuery streaming (lanterns, waves, chats, offers, venues)
- ~50,000 invocations/month ร— $0.40/million = $0.02/month

Total: ~$34/month

Cost Increase: ~$29/month (~5.5x) from current $5/month

Value Gained:

  • Security fixes (Issues #224, #394 - critical vulnerabilities)
  • Multi-vendor billing visibility (Issues #52, #233 - GCP, Cloudflare, Railway, OpenAI, Resend)
  • Cost tracking & anomaly detection (prevents runaway bills before they happen)
  • AI usage transparency (Issue #188 - public-facing dashboard)
  • Historical billing data (can't be retroactively captured - start now!)
  • Merchant analytics (product feature, revenue driver)
  • Business intelligence (growth metrics, forecasting, attribution)
  • Enables Google Places integration
  • Supports 1000+ venue imports
  • Batch admin operations
  • Better monitoring and logging
  • Foundation for future growth
  • Addresses 8 open issues (#52, #224, #233, #238, #240, #188, #394, plus venues/admin improvements)

Break-even Analysis: As usage grows, Cloud Run becomes more cost-effective:

  • At 100k users: Cloud Run is 50% cheaper
  • At 1M users: Cloud Run is 70% cheaper
  • Long-running operations impossible on Cloud Functions

๐Ÿ—๏ธ Implementation Guidelines โ€‹

Cloud Run Best Practices โ€‹

  1. Service Configuration

    yaml
    # cloudbuild.yaml example
    timeout: 3600s  # 60 minutes
    memory: 1Gi
    cpu: 1
    minInstances: 0  # Scale to zero when idle
    maxInstances: 10
    concurrency: 80  # Requests per instance
  2. Environment Variables

    bash
    FIREBASE_PROJECT_ID=lantern-app-dev
    ENVIRONMENT=dev
    LOG_LEVEL=info
    RATE_LIMIT_OSM=5   # 5 req/sec
    RATE_LIMIT_NOMINATIM=1  # 1 req/sec
  3. Health Checks

    javascript
    app.get('/health', (req, res) => {
      res.json({ status: 'ok', timestamp: Date.now() });
    });
  4. Graceful Shutdown

    javascript
    process.on('SIGTERM', () => {
      console.log('SIGTERM received, closing server...');
      server.close(() => {
        console.log('Server closed');
        process.exit(0);
      });
    });

Security โ€‹

  1. Authentication

    • Use Firebase Admin SDK to verify ID tokens
    • Implement role-based access control (RBAC)
    • Admin endpoints require admin role
  2. Rate Limiting

    • Per-user rate limits (10 req/min for imports)
    • Global rate limits for external APIs
    • IP-based rate limiting for abuse prevention
  3. Input Validation

    • Validate all request bodies with Joi/Zod
    • Sanitize user inputs
    • Limit request body size (10MB max)

Monitoring โ€‹

  1. Cloud Logging

    • Structured JSON logs
    • Include request ID, user ID, action
    • Log errors with stack traces
  2. Cloud Monitoring

    • Request latency (p50, p95, p99)
    • Error rate
    • Active instances
    • Memory/CPU usage
  3. Alerts

    • Error rate >5% for 5 minutes
    • Latency p99 >10 seconds
    • Memory usage >80%

๐Ÿงช Testing Strategy โ€‹

Local Development โ€‹

bash
# venue-api local testing
cd services/venue-api
docker build -t venue-api .
docker run -p 8080:8080 \
  -e FIREBASE_PROJECT_ID=lantern-app-dev \
  -e ENVIRONMENT=local \
  venue-api

# Test endpoints
curl http://localhost:8080/health
curl -X POST http://localhost:8080/venues/import/osm \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"lat":40.7128,"lng":-74.0060,"radius":1000}'

Integration Tests โ€‹

javascript
// venue-api integration test
describe('POST /venues/import/osm', () => {
  it('should import venues from OSM', async () => {
    const response = await request(app)
      .post('/venues/import/osm')
      .set('Authorization', `Bearer ${adminToken}`)
      .send({ lat: 40.7128, lng: -74.0060, radius: 1000 })
      .expect(200);

    expect(response.body).toHaveProperty('jobId');
    expect(response.body.status).toBe('queued');
  });

  it('should reject unauthorized users', async () => {
    await request(app)
      .post('/venues/import/osm')
      .send({ lat: 40.7128, lng: -74.0060, radius: 1000 })
      .expect(401);
  });
});

Load Testing โ€‹

bash
# Use Apache Bench for load testing
ab -n 1000 -c 10 -H "Authorization: Bearer $TOKEN" \
  http://venue-api-dev.a.run.app/health

# Use k6 for complex scenarios
k6 run load-test.js

๐Ÿ“ Migration Checklist โ€‹

Pre-Migration โ€‹

  • [ ] Review current Cloud Functions usage and costs
  • [ ] Audit API rate limits (OSM, Nominatim, Google Places)
  • [ ] Set up Cloud Run development environment
  • [ ] Create Dockerfiles for each service
  • [ ] Set up Cloud Build pipelines
  • [ ] Configure secret management (Secret Manager)

During Migration โ€‹

  • [ ] Deploy to dev environment first
  • [ ] Run integration tests
  • [ ] Monitor error rates and latency
  • [ ] Gradual rollout (10% โ†’ 50% โ†’ 100%)
  • [ ] Keep Cloud Functions as fallback

Post-Migration โ€‹

  • [ ] Monitor for 1 week before deprecating Cloud Functions
  • [ ] Update documentation
  • [ ] Update admin portal to use new APIs
  • [ ] Remove old Cloud Functions
  • [ ] Update Firestore security rules
  • [ ] Create changelog entry

๐ŸŽ“ Key Takeaways โ€‹

โœ… Use Cloud Run When You Need: โ€‹

  • Batch processing (100+ items)
  • Long-running tasks (>1 minute, up to 60 minutes)
  • WebSockets or persistent connections
  • Multi-step workflows with state
  • Native dependencies (ImageMagick, etc.)
  • Container-based applications
  • High concurrency (1000+ req/instance)

โœ… Use Cloud Functions When You Need: โ€‹

  • Simple, single-purpose logic (<50 lines)
  • Event-driven triggers (onCreate, onUpdate)
  • Fast execution (<30 seconds)
  • Minimal configuration
  • Pay-per-invocation model for low traffic

๐ŸŽฏ Lantern-Specific Priorities: โ€‹

  1. Phase 1 (Weeks 1-2): venue-api ๐Ÿ”ด

    • Security fix (Issues #224, #394)
    • Google Places foundation
  2. Phase 2 (Weeks 3-4): admin-api ๐ŸŸก

    • Better admin operations
    • Batch processing
  3. Phase 2B (Weeks 4-6): analytics-api + billing-api ๐ŸŸก EXPANDED SCOPE

    • Merchant analytics (product feature)
    • Multi-vendor billing (Issues #52, #233, #240, #238, #188)
    • Cost tracking & anomaly detection
    • AI usage transparency
    • Critical: Historical data capture starts from deployment
  4. Phase 3 (Weeks 7-9): image-service ๐ŸŸข

    • Only if needed
  5. Phase 4+ (Weeks 10+): Real-time, search ๐Ÿ”ต

    • Only if scaling issues

Most Important:

  • Start with Phase 1 (venue-api) to fix critical security vulnerabilities
  • Phase 2B is CRITICAL because:
    • Can't retroactively capture historical billing data - must start early
    • Prevents cost surprises before they happen (anomaly detection)
    • Merchant analytics drives product value and revenue
    • AI transparency builds user trust (Issue #188)
    • Addresses 5 open billing/analytics issues in one deployment

๐Ÿ“š References โ€‹


Document Owner: @cattreedev Last Review: 2026-02-09 Next Review: After Phase 1 completion

Built with VitePress