Cloud Run Migration Strategy (Optimized) โ
Refactor: Server-Side Configs (Cloud Run vs Cloud Functions & Client-to-Server Migrations)
Last Updated: 2026-02-09 Status: Planning Related Issues:
- #224 (OSM Import Security)
- #394 (Venue Refresh Abuse)
- #52 (BigQuery Billing Pipeline)
- #233 (System Health Billing)
- #240 (OpenAI Billing)
- #238 (Resend Billing)
- #188 (AI Usage Transparency)
๐ฏ Executive Summary โ
This document outlines an optimized migration strategy from Cloud Functions to Cloud Run for services that require:
- Long-running operations (>1 minute)
- Batch processing (100+ items)
- WebSocket/persistent connections
- Multi-step workflows
- Native/container dependencies
- Complex state management
- Multi-vendor data aggregation (billing, analytics)
- Historical data retention (BigQuery)
Key Optimizations:
- Consolidate related services to reduce operational overhead and improve code reuse
- Combine analytics + billing infrastructure to maximize BigQuery investment and capture historical data from day 1
- Address 8 open issues in unified deployments (security, billing, transparency)
๐ High Priority โ Cloud Run Services โ
1. Venue API (Combined Import + Refresh) ๐ด CRITICAL โ
Current State:
- OSM import: CLI script with client-side triggers
- Venue refresh: Client-side triggers with public write access (
firestore.rules:394) - No Google Places integration yet
Why Combined Service:
- Shared Domain Logic: Both handle venue data validation, deduplication, normalization
- Unified Rate Limiting: Reusable infrastructure across OSM, Nominatim, Google Places
- API Source Flexibility: Adding Google Places becomes a new route, not a new service
- Operational Efficiency: Single deployment, shared connection pools, lower costs
- Future-Proof: Easy to add more sources (Yelp, Foursquare), centralized quality scoring
Recommended Architecture:
Cloud Run Service: venue-api
// ===== Import Operations =====
POST /venues/import/osm
Body: { lat, lng, radius, limit, filters }
Returns: { jobId, status, estimatedTime }
Description: Import venues from OSM Overpass API
POST /venues/import/google
Body: { lat, lng, radius, types, keyword }
Returns: { jobId, status, estimatedTime }
Description: Import venues from Google Places API (future)
POST /venues/import/hybrid
Body: { lat, lng, radius, strategy: "osm-primary" | "google-primary" }
Returns: { jobId, status, estimatedTime }
Description: Combined OSM + Google enrichment with deduplication
GET /venues/import/jobs/:jobId
Returns: { status, progress, venuesProcessed, errors, results }
Description: Check import job status and results
// ===== Enrichment & Refresh =====
POST /venues/:venueId/enrich
Body: { fields: ["photos", "hours", "contact", "rating"] }
Returns: { venue, enrichedFields, source }
Description: Enrich single venue with external API data
POST /venues/refresh/batch
Body: { venueIds: [...], priority: "high" | "normal" }
Returns: { jobId, status, count }
Description: Batch refresh venue metadata
POST /venues/refresh/scheduled
Headers: X-CloudScheduler-JobName
Returns: { processed, updated, errors }
Description: Cloud Scheduler endpoint for automated refresh
// ===== Utilities =====
POST /venues/consolidate
Body: { region: { lat, lng, radius }, strategy: "strict" | "fuzzy" }
Returns: { duplicatesFound, merged, deleted }
Description: Deduplicate venues in region
POST /venues/geocode
Body: { address } or { lat, lng }
Returns: { address, coordinates, confidence }
Description: Forward or reverse geocoding via Nominatim
GET /venues/:venueId/metadata
Returns: { venue, sources, lastUpdated, quality }
Description: Get venue details with provenance info
// ===== Admin Operations =====
POST /venues/cleanup/orphaned
Returns: { deleted, errors }
Description: Remove venues with no activity for 90+ days
POST /venues/validate/all
Returns: { total, invalid, fixed, errors }
Description: Data quality check and auto-fix
POST /venues/rebuild-index
Returns: { indexed, errors }
Description: Rebuild search/geo indexesInternal Structure:
venue-api/
โโโ src/
โ โโโ index.js # Express app entry point
โ โโโ routes/
โ โ โโโ import.routes.js # OSM, Google, hybrid imports
โ โ โโโ refresh.routes.js # Enrichment & scheduled tasks
โ โ โโโ utility.routes.js # Geocoding, consolidation
โ โ โโโ admin.routes.js # Cleanup, validation
โ โโโ services/
โ โ โโโ osm.service.js # OSM Overpass API client
โ โ โโโ google.service.js # Google Places API client
โ โ โโโ nominatim.service.js # Geocoding service
โ โ โโโ venue.service.js # Core venue logic
โ โ โโโ enrichment.service.js # Data enrichment
โ โโโ queue/
โ โ โโโ job-manager.js # Queue handling (Cloud Tasks)
โ โ โโโ rate-limiter.js # Global rate limiting
โ โโโ middleware/
โ โ โโโ auth.js # Firebase Admin validation
โ โ โโโ rate-limit.js # Per-source rate limits
โ โ โโโ error-handler.js # Unified error handling
โ โโโ utils/
โ โ โโโ deduplication.js # Venue matching algorithms
โ โ โโโ validation.js # Data validation
โ โ โโโ cache.js # In-memory caching
โโโ Dockerfile
โโโ package.json
โโโ cloudbuild.yamlBenefits:
- Security: Server-side validation prevents client abuse (fixes Issue #224)
- Rate Limiting: Global throttling for OSM (5 req/sec), Nominatim (1 req/sec), Google Places
- Batch Processing: Import 1000+ venues in single job (vs Cloud Functions 9-min timeout)
- Deduplication: Automatic merging of OSM + Google Places data
- Scheduled Jobs: Cloud Scheduler โ nightly venue refresh for high-traffic locations
- Audit Trail: Full logging of all import operations
- Cost Efficiency: Single service vs two separate deployments
Migration Priority: ๐ด Phase 1 (Weeks 1-2) - Addresses critical security issues
2. Admin API (Consolidated) ๐ก HIGH PRIORITY โ
Current State:
- 10+ separate Cloud Functions for admin operations
- Each function has separate auth, logging, error handling
- No batch operations support
- Difficult local development
Why Consolidated Service:
- Shared Middleware: Auth, validation, logging, error handling
- Unified API: Single endpoint for admin portal
- Batch Operations: Process 100+ users/venues in single request
- Better DX: Local development with
docker run - Reduced Cold Starts: 1 warm service vs 10+ separate functions
- Complex Workflows: Multi-step admin actions in single transaction
Recommended Architecture:
Cloud Run Service: admin-api
// ===== User Management =====
// Migrate existing Cloud Functions
GET /admin/users/:uid
Description: Get user by ID (replaces getUserByEmail function)
POST /admin/users/:uid/role
Body: { role: "admin" | "merchant" | "user" }
Description: Set user role (replaces setUserRole function)
POST /admin/users/:uid/ban
Body: { reason, duration }
Description: Ban user (replaces banUser function)
DELETE /admin/users/:uid/ban
Description: Unban user (replaces unbanUser function)
// New batch operations
POST /admin/users/bulk-import
Body: { users: [...], sendInvites: boolean }
Returns: { jobId, status, count }
Description: Import users from CSV
POST /admin/users/bulk-ban
Body: { userIds: [...], reason }
Returns: { banned, errors }
Description: Batch ban users
POST /admin/users/bulk-role
Body: { userIds: [...], role }
Returns: { updated, errors }
Description: Batch update user roles
GET /admin/users/search
Query: ?q=email@example.com&role=admin&limit=50
Returns: { users, total, hasMore }
Description: Search users with filters
// ===== Venue Management =====
POST /admin/venues/bulk-update
Body: { venueIds: [...], updates: {...} }
Returns: { updated, errors }
Description: Batch update venue metadata
POST /admin/venues/cleanup
Body: { criteria: "inactive" | "duplicate" | "invalid" }
Returns: { deleted, merged, errors }
Description: Cleanup venues by criteria
GET /admin/venues/audit
Returns: { venues, issues, recommendations }
Description: Venue data quality audit
// ===== Content Moderation =====
GET /admin/moderation/queue
Query: ?type=profile|lantern|chat&status=pending
Returns: { items, total }
Description: Get moderation queue
POST /admin/moderation/:itemId/approve
Returns: { status }
Description: Approve content
POST /admin/moderation/:itemId/reject
Body: { reason, action: "hide" | "delete" | "ban-user" }
Returns: { status }
Description: Reject content
// ===== Analytics & Reports =====
POST /admin/reports/generate
Body: { type: "engagement" | "revenue" | "venues", period: "daily" | "weekly" | "monthly" }
Returns: { jobId, status }
Description: Generate report job
GET /admin/reports/:jobId
Returns: { status, downloadUrl, data }
Description: Get report results
GET /admin/analytics/dashboard
Returns: { metrics: {...}, trends: {...} }
Description: Real-time dashboard metrics
POST /admin/analytics/export
Body: { collection, filters, format: "csv" | "json" }
Returns: { downloadUrl }
Description: Export data
// ===== System Operations =====
POST /admin/system/cache/clear
Body: { scope: "all" | "venues" | "users" }
Returns: { cleared }
Description: Clear system caches
GET /admin/system/health
Returns: { status, services, errors }
Description: System health check
POST /admin/system/maintenance
Body: { enabled: boolean, message }
Returns: { status }
Description: Toggle maintenance modeInternal Structure:
admin-api/
โโโ src/
โ โโโ index.js # Express app
โ โโโ routes/
โ โ โโโ users.routes.js # User management
โ โ โโโ venues.routes.js # Venue management
โ โ โโโ moderation.routes.js # Content moderation
โ โ โโโ reports.routes.js # Analytics & reports
โ โ โโโ system.routes.js # System operations
โ โโโ services/
โ โ โโโ user.service.js # User operations
โ โ โโโ venue.service.js # Venue operations
โ โ โโโ report.service.js # Report generation
โ โ โโโ analytics.service.js # Metrics aggregation
โ โโโ middleware/
โ โ โโโ admin-auth.js # Admin-only auth
โ โ โโโ rate-limit.js # Rate limiting
โ โ โโโ audit-log.js # Action logging
โ โ โโโ error-handler.js # Error handling
โ โโโ utils/
โ โ โโโ csv-parser.js # CSV import/export
โ โ โโโ pdf-generator.js # PDF reports
โโโ Dockerfile
โโโ package.jsonBenefits:
- Unified Auth: Single admin authentication middleware
- Batch Operations: Process 100+ records in one request
- Better Logging: Centralized audit trail of all admin actions
- Easier Development: Run entire admin backend locally
- Complex Workflows: Multi-step operations in single transaction
- Reduced Costs: 1 service vs 10+ Cloud Functions
Migration Priority: ๐ก Phase 2 (Weeks 3-4) - Improve admin operations
3. Analytics Service ๐ก HIGH PRIORITY โ
Current State:
- No analytics infrastructure
- No merchant-facing metrics
- No cost tracking or usage monitoring
- Manual reporting and data exports
Why Cloud Run:
- Data Aggregation: Complex Firestore queries across multiple collections
- Large Data Processing: Aggregate 100k+ events for reports
- Scheduled Jobs: Daily/weekly/monthly report generation
- Memory Intensive: Process 10k+ records for trend analysis
- Export Generation: CSV/PDF reports can take minutes
- Real-time Metrics: In-memory caching for dashboard
- Cost Tracking: Parse Firebase/GCP billing data
Why High Priority:
- Product Feature: Merchants need analytics to see venue performance
- Business Intelligence: Track engagement, conversion, revenue
- Cost Management: Monitor Firebase/API costs before they spike
- Growth Metrics: Understand user behavior and retention
Recommended Architecture:
Cloud Run Service: analytics-api
// ===== Merchant Analytics =====
GET /analytics/merchant/dashboard
Query: ?merchantId={id}&period=7d|30d|90d
Returns: {
venueStats: { views, lanterns, waves, conversion },
offerStats: { impressions, claims, revenue },
trends: { daily, weekly },
topVenues: [...]
}
Description: Merchant dashboard metrics
GET /analytics/merchant/venue/:venueId
Query: ?period=7d|30d|90d
Returns: {
activity: { lanterns, waves, uniqueUsers },
peakHours: [...],
demographics: { ageGroups, genderSplit },
trends: { daily, hourly }
}
Description: Individual venue performance
GET /analytics/merchant/offer/:offerId
Returns: {
impressions: number,
claims: number,
redemptions: number,
conversionRate: number,
revenue: number,
timeline: [...]
}
Description: Offer engagement metrics
POST /analytics/merchant/export
Body: {
merchantId,
type: "venue-performance" | "offer-analytics" | "revenue",
period: "7d" | "30d" | "90d",
format: "csv" | "pdf"
}
Returns: { jobId, status }
Description: Generate export job for merchant data
// ===== Admin/Internal Analytics =====
GET /analytics/admin/overview
Returns: {
users: { total, active, new, churn },
venues: { total, active, topPerforming },
engagement: { lanterns, waves, chats, matches },
revenue: { total, byMerchant, byVenue },
costs: { firebase, cloudRun, apis }
}
Description: High-level platform metrics
GET /analytics/admin/engagement
Query: ?period=7d|30d|90d
Returns: {
dau: number, // Daily active users
wau: number, // Weekly active users
mau: number, // Monthly active users
retention: { day1, day7, day30 },
sessionDuration: { avg, median, p95 },
actions: { lanterns, waves, chats, matches }
}
Description: User engagement metrics
GET /analytics/admin/costs
Query: ?period=7d|30d|90d
Returns: {
total: number,
breakdown: {
firestore: { reads, writes, deletes, cost },
storage: { gb, cost },
cloudRun: { requests, cpu, memory, cost },
apis: { osm, nominatim, googlePlaces, cost }
},
trends: [...],
projections: { monthly, annual }
}
Description: Cost tracking and projections
GET /analytics/admin/venues
Query: ?sortBy=activity|lanterns|revenue&limit=50
Returns: {
venues: [{
id, name, location,
stats: { lanterns, waves, uniqueUsers },
revenue: number,
merchantShare: number
}],
total: number
}
Description: Venue performance rankings
GET /analytics/admin/merchants
Returns: {
merchants: [{
id, name,
venues: number,
offers: number,
revenue: number,
engagement: { impressions, conversions }
}]
}
Description: Merchant performance
// ===== Real-time Metrics =====
GET /analytics/realtime/activity
Returns: {
activeLanterns: number,
activeWaves: number,
onlineUsers: number,
topVenues: [{ venueId, name, activeLanterns }],
recentMatches: number
}
Description: Real-time platform activity
GET /analytics/realtime/venue/:venueId
Returns: {
activeLanterns: number,
recentWaves: number,
onlineUsers: number
}
Description: Real-time venue activity
// ===== Reports & Exports =====
POST /analytics/reports/generate
Body: {
type: "engagement" | "revenue" | "costs" | "venues" | "merchants",
period: "daily" | "weekly" | "monthly" | "custom",
startDate: "2026-01-01",
endDate: "2026-01-31",
format: "csv" | "pdf" | "json",
recipients: ["email@example.com"] // Optional email delivery
}
Returns: { jobId, status, estimatedTime }
Description: Generate scheduled report
GET /analytics/reports/:jobId
Returns: {
status: "queued" | "processing" | "completed" | "failed",
progress: number,
downloadUrl: string,
expiresAt: timestamp
}
Description: Get report job status and download link
GET /analytics/reports/scheduled
Returns: { reports: [{...}] }
Description: List scheduled/recurring reports
// ===== Data Export =====
POST /analytics/export/raw
Body: {
collection: "lanterns" | "waves" | "chats" | "venues",
filters: {...},
dateRange: { start, end },
format: "csv" | "json" | "parquet"
}
Returns: { jobId, status }
Description: Export raw data (admin-only)
// ===== Usage Tracking (Internal) =====
POST /analytics/track/api-usage
Body: { service: "osm" | "nominatim" | "googlePlaces", requests: number }
Returns: { recorded: true }
Description: Track external API usage for cost monitoring
POST /analytics/track/feature-usage
Body: { feature: string, userId: string, metadata: {...} }
Returns: { recorded: true }
Description: Track feature usage for product analyticsInternal Structure:
analytics-api/
โโโ src/
โ โโโ index.js # Express app entry point
โ โโโ routes/
โ โ โโโ merchant.routes.js # Merchant analytics
โ โ โโโ admin.routes.js # Admin/internal analytics
โ โ โโโ realtime.routes.js # Real-time metrics
โ โ โโโ reports.routes.js # Report generation
โ โ โโโ export.routes.js # Data exports
โ โโโ services/
โ โ โโโ aggregation.service.js # Data aggregation from Firestore
โ โ โโโ metrics.service.js # Metrics calculation
โ โ โโโ cost-tracking.service.js # Parse billing data
โ โ โโโ report-generator.service.js # PDF/CSV generation
โ โ โโโ realtime.service.js # Real-time metrics (cached)
โ โโโ jobs/
โ โ โโโ daily-aggregation.js # Daily metrics aggregation
โ โ โโโ cost-sync.js # Sync billing data
โ โ โโโ report-scheduler.js # Scheduled reports
โ โโโ middleware/
โ โ โโโ auth.js # Role-based auth (merchant vs admin)
โ โ โโโ rate-limit.js # Rate limiting
โ โ โโโ cache.js # Redis/in-memory cache
โ โโโ utils/
โ โ โโโ csv-generator.js # CSV export
โ โ โโโ pdf-generator.js # PDF reports (puppeteer)
โ โ โโโ date-utils.js # Period calculations
โ โ โโโ aggregators.js # Data aggregation helpers
โโโ Dockerfile
โโโ package.json
โโโ cloudbuild.yamlData Storage Strategy (Hybrid with BigQuery):
Recommended: Three-Tier Architecture
// Tier 1: Real-time Metrics (Redis Cache)
// - Active lanterns/waves right now
// - Online users count
// - Top venues by current activity
// - TTL: 60 seconds
// - Read latency: <10ms
// Tier 2: Recent Analytics (Firestore)
// - Pre-aggregated daily/weekly/monthly metrics
// - Last 90 days of data
// - Fast reads for merchant dashboards
// - Read latency: 50-200ms
analytics/
daily/
{venueId}_{date} โ { lanterns, waves, uniqueUsers, revenue, ... }
weekly/
{venueId}_{weekStart} โ { ... }
monthly/
{venueId}_{monthStart} โ { ... }
// Tier 3: Historical Analytics (BigQuery)
// - All historical data (unlimited retention)
// - Complex multi-table joins and aggregations
// - Ad-hoc queries and deep analytics
// - Cost tracking and projections
// - Machine learning / predictive analytics (future)
// - Read latency: 1-10 seconds (acceptable for reports)BigQuery Architecture:
# BigQuery Dataset Structure
lantern_analytics:
# Raw event tables (streamed from Firestore)
raw_lanterns: # All lantern events
- lantern_id
- user_id
- venue_id
- created_at
- ended_at
- status
- metadata (JSON)
raw_waves: # All wave events
- wave_id
- sender_id
- receiver_id
- venue_id
- created_at
- status
- response
raw_chats: # All chat events
- chat_id
- connection_id
- sender_id
- created_at
- message_length
raw_offers: # Offer impressions/claims
- offer_id
- merchant_id
- venue_id
- user_id
- action (impression/claim/redemption)
- created_at
- metadata (JSON)
# Pre-aggregated tables (for fast queries)
daily_venue_stats: # Daily rollups
- date
- venue_id
- lantern_count
- wave_count
- unique_users
- revenue
- avg_session_duration
daily_merchant_stats: # Daily merchant metrics
- date
- merchant_id
- impression_count
- claim_count
- revenue
- conversion_rate
# Cost tracking tables
daily_costs: # GCP billing data
- date
- service (firestore/storage/cloudrun/apis)
- operation_type
- cost_usd
- quantity (reads/writes/requests)
# User cohort analysis
user_cohorts: # Retention analysis
- cohort_date
- users_count
- day_1_retention
- day_7_retention
- day_30_retentionData Flow Architecture:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Real-time Events (Firestore) โ
โ โข User creates lantern โ writes to Firestore โ
โ โข Triggers Cloud Function โ streams to BigQuery โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ BigQuery Streaming Insert (Cloud Function) โ
โ โข Low latency (~1 second) โ
โ โข Automatic schema detection โ
โ โข Cost: $0.05 per GB streamed โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ BigQuery Tables (Raw Events) โ
โ โข Partitioned by date (cost optimization) โ
โ โข Clustered by venue_id, user_id โ
โ โข Retention: Unlimited (storage $0.02/GB/month) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Scheduled Queries (BigQuery โ Aggregated Tables) โ
โ โข Run daily at 2 AM UTC โ
โ โข Create daily_venue_stats, daily_merchant_stats โ
โ โข Cost: $5 per TB scanned (optimized with partitions) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Analytics API (Cloud Run) โ
โ โข Recent data (90 days): Read from Firestore โ
โ โข Historical data (>90 days): Query BigQuery โ
โ โข Reports: BigQuery โ CSV/PDF export โ
โ โข Cache results in Redis (5-minute TTL) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโImplementation Approach:
Phase 1: Initial Setup (Week 1)
# 1. Create BigQuery dataset
bq mk --dataset --location=us-central1 lantern_analytics
# 2. Create raw event tables with partitioning
bq mk --table \
--time_partitioning_field created_at \
--clustering_fields venue_id,user_id \
lantern_analytics.raw_lanterns \
schema_lanterns.json
# 3. Deploy Cloud Functions for streaming
# Trigger: onCreate("lanterns/{lanternId}")
# Action: Stream insert to BigQueryPhase 2: Streaming Pipeline (Week 2)
// Cloud Function: streamLanternToBigQuery
exports.streamLanternToBigQuery = functions.firestore
.document('lanterns/{lanternId}')
.onCreate(async (snap, context) => {
const bigquery = new BigQuery();
const dataset = bigquery.dataset('lantern_analytics');
const table = dataset.table('raw_lanterns');
const row = {
lantern_id: context.params.lanternId,
user_id: snap.data().userId,
venue_id: snap.data().venueId,
created_at: snap.data().createdAt.toDate().toISOString(),
status: snap.data().status,
metadata: JSON.stringify(snap.data())
};
await table.insert([row]);
});Phase 3: Scheduled Aggregations (Week 3)
-- Scheduled query: daily_venue_stats (runs daily at 2 AM)
CREATE OR REPLACE TABLE lantern_analytics.daily_venue_stats
PARTITION BY date
AS
SELECT
DATE(created_at) as date,
venue_id,
COUNT(DISTINCT lantern_id) as lantern_count,
COUNT(DISTINCT user_id) as unique_users,
AVG(TIMESTAMP_DIFF(ended_at, created_at, MINUTE)) as avg_duration_minutes
FROM lantern_analytics.raw_lanterns
WHERE DATE(created_at) = CURRENT_DATE() - 1 -- Yesterday only
GROUP BY date, venue_id;Phase 4: Analytics API Integration (Week 4)
// analytics-api service
async function getVenueAnalytics(venueId, period) {
const startDate = calculateStartDate(period);
// Recent data: Use Firestore (fast)
if (startDate > Date.now() - 90 * 24 * 60 * 60 * 1000) {
return getFromFirestore(venueId, startDate);
}
// Historical data: Use BigQuery
const query = `
SELECT
date,
lantern_count,
unique_users,
avg_duration_minutes
FROM lantern_analytics.daily_venue_stats
WHERE venue_id = @venueId
AND date >= @startDate
ORDER BY date DESC
`;
const [rows] = await bigquery.query({
query,
params: { venueId, startDate }
});
return rows;
}Benefits:
- Merchant Value: Data-driven insights to improve venue performance
- Cost Control: Early warning for runaway Firebase/API costs
- Product Intelligence: Understand feature usage and user behavior
- Revenue Optimization: Track offer performance and conversion
- Scheduled Reports: Automated daily/weekly/monthly reports
- Real-time Dashboards: Fast metrics without hitting Firestore constantly
- Unlimited Historical Data: BigQuery retains all data forever at low cost
- Complex Analytics: Multi-table joins, cohort analysis, funnel tracking
- Cost-Effective at Scale: Firestore reads cost $0.36/million; BigQuery $5/TB scanned
- ML/AI Ready: BigQuery ML for predictive analytics and recommendations
BigQuery Cost Analysis (10,000 MAU):
Streaming Inserts:
- 10k users ร 5 events/day ร 30 days = 1.5M events/month
- 1.5M events ร 1KB avg = 1.5GB/month
- Cost: 1.5GB ร $0.05/GB = $0.08/month
Storage:
- Year 1: 18GB ร $0.02/GB/month = $0.36/month
- Year 2: 36GB ร $0.02/GB/month = $0.72/month
- Year 3: 54GB ร $0.02/GB/month = $1.08/month
Query Costs (with partitioning):
- 100 merchant dashboard queries/day ร 30 days = 3,000 queries/month
- Avg 10MB scanned per query (90-day window, partitioned)
- 3,000 ร 10MB = 30GB = 0.03TB
- Cost: 0.03TB ร $5/TB = $0.15/month
Scheduled Aggregations:
- Daily aggregation: 500MB scanned/day ร 30 = 15GB/month
- Cost: 0.015TB ร $5/TB = $0.08/month
Total BigQuery Cost: ~$0.70/month (Year 1)
Compare to Firestore-only approach:
- 3,000 queries ร 50 document reads = 150k reads/month
- 150k ร $0.36/million = $0.05/month (limited to 90 days)
BigQuery enables unlimited historical analysis for <$1/month!When to Use Each Storage Tier:
| Use Case | Storage | Read Latency | Cost | Best For |
|---|---|---|---|---|
| Real-time metrics | Redis | <10ms | Free (in-memory) | Live dashboards, current activity |
| Recent analytics (90d) | Firestore | 50-200ms | $0.36/million reads | Merchant dashboards, daily reports |
| Historical analytics | BigQuery | 1-10s | $5/TB scanned | Trend analysis, cohort reports, ML |
| Cost tracking | BigQuery | 1-10s | $5/TB scanned | GCP billing, projections, alerts |
Migration Priority: ๐ก Phase 2B (Weeks 4-6) - High priority for product/business
3B. Billing & Cost Monitoring Service ๐ก HIGH PRIORITY โ
Current State:
- No centralized billing visibility
- Manual cost tracking across 3+ vendors (GCP, Cloudflare, Railway)
- No historical cost data
- No anomaly detection or cost alerts
- No OpenAI/Resend usage tracking
Why Cloud Run + BigQuery:
- Multi-Vendor Consolidation: Aggregate billing from all providers
- Historical Analysis: Unlimited retention of billing data
- Automated Ingestion: Daily jobs pull data from each vendor API
- Anomaly Detection: Alert on cost spikes before invoices arrive
- Attribution: Tag costs by environment, service, feature
- Forecasting: Predict month-end costs based on trends
- Compliance: Audit trail for all billing data
Why High Priority:
- Cost Control: Prevent runaway bills before they happen (Issue #52)
- Budget Planning: Accurate forecasts for investor/board reporting
- Vendor Comparison: Identify where to optimize spend
- Transparency: AI usage visibility for users (Issue #188)
- Operational Health: Billing visibility in admin portal (Issue #233)
- Historical Data: Capture data from day 1 (can't retroactively get history)
Related Issues: #52, #233, #240, #238, #188
Recommended Architecture:
Cloud Run Service: billing-api
// ===== Billing Dashboard APIs =====
GET /billing/overview
Query: ?period=current|last|custom&startDate&endDate
Returns: {
totalCost: number,
byVendor: { gcp, cloudflare, railway, openai, resend },
byService: { firestore, storage, cloudRun, functions, pages },
trend: { daily, weekly, monthly },
projectedMonthEnd: number,
alerts: [{ type, severity, message }]
}
Description: High-level billing overview
GET /billing/vendor/:vendor
Params: vendor = "gcp" | "cloudflare" | "railway" | "openai" | "resend"
Query: ?period=30d|90d|custom
Returns: {
invoices: [{...}],
lineItems: [{...}],
usageMetrics: {...},
costTrend: [...]
}
Description: Vendor-specific billing details
GET /billing/service/:service
Params: service = "firestore" | "storage" | "cloudRun" | "functions" | etc.
Returns: {
cost: number,
usage: { reads, writes, storage, bandwidth },
costPerUnit: {...},
optimization: [{ recommendation, estimatedSavings }]
}
Description: Service-level cost breakdown
GET /billing/environment/:env
Params: env = "dev" | "prod"
Returns: { cost, breakdown, trend }
Description: Environment-based cost attribution
// ===== Cost Tracking & Analytics =====
GET /billing/trends
Query: ?metric=cost|usage&groupBy=vendor|service|env&period=90d
Returns: { dataPoints: [...], trend, forecast }
Description: Cost trend analysis
GET /billing/anomalies
Query: ?severity=critical|warning|info&period=30d
Returns: {
anomalies: [{
date, service, expectedCost, actualCost, variance, severity
}]
}
Description: Cost anomaly detection
GET /billing/forecast
Query: ?horizon=7d|30d|90d
Returns: {
projectedCost: number,
confidence: number,
trend: "increasing" | "stable" | "decreasing",
breakdown: {...}
}
Description: Cost forecasting based on trends
// ===== AI Usage Tracking (Issue #240, #188) =====
GET /billing/ai/usage
Query: ?service=openai|anthropic&period=30d
Returns: {
totalTokens: number,
totalCost: number,
byModel: { "gpt-4": {...}, "claude-3": {...} },
byFeature: { "chatbot": {...}, "triage": {...} },
dailyUsage: [...]
}
Description: AI API usage and costs
POST /billing/ai/track
Body: { service, model, tokens, cost, feature, userId }
Returns: { recorded: true }
Description: Track AI usage (called by services)
// ===== Export & Reporting =====
POST /billing/reports/generate
Body: {
type: "monthly-summary" | "vendor-comparison" | "cost-attribution",
period: { start, end },
format: "csv" | "pdf" | "json",
recipients: ["email@example.com"] // Optional
}
Returns: { jobId, status }
Description: Generate billing report
GET /billing/reports/:jobId
Returns: { status, downloadUrl, expiresAt }
Description: Get report status and download
GET /billing/export/raw
Query: ?vendor&startDate&endDate&format=csv|json
Returns: { downloadUrl }
Description: Export raw billing data (admin-only)
// ===== Admin Operations =====
POST /billing/sync/trigger
Body: { vendor: "all" | "gcp" | "cloudflare" | "railway" }
Returns: { jobId, status }
Description: Manually trigger billing data sync
GET /billing/sync/status
Returns: {
lastSync: { vendor, timestamp, status, rowsImported },
nextSync: { vendor, scheduledTime }
}
Description: Check sync job status
POST /billing/alerts/configure
Body: {
type: "daily-threshold" | "spike-detection" | "forecast-exceeded",
threshold: number,
recipients: [...]
}
Returns: { alertId, status }
Description: Configure cost alerts
GET /billing/health
Returns: {
dataFreshness: { vendor, lastUpdate, status },
pipelineStatus: "healthy" | "degraded" | "failing",
errors: [...]
}
Description: Billing pipeline health checkCloud Run Jobs (Scheduled):
// Daily ingestion jobs
job: ingest-gcp-billing // Runs: Daily 2 AM UTC
- Reads from GCP Billing Export (auto-configured)
- Writes to billing_raw.gcp_billing_export
job: ingest-cloudflare // Runs: Daily 3 AM UTC
- Pulls billing API data
- Writes to billing_raw.cloudflare_invoices_raw
job: ingest-railway // Runs: Daily 3:15 AM UTC
- Pulls usage metrics
- Writes to billing_raw.railway_invoices_raw
job: ingest-openai // Runs: Daily 3:30 AM UTC (Issue #240)
- Pulls usage data
- Writes to billing_raw.openai_usage_raw
job: ingest-resend // Runs: Daily 3:45 AM UTC (Issue #238)
- Pulls email sending metrics
- Writes to billing_raw.resend_usage_raw
// Hourly health check
job: cost-health-check // Runs: Every hour
- Validate data freshness
- Detect anomalies (cost spikes >20% day-over-day)
- Check unattributed costs
- Send Slack/email alerts if issues foundInternal Structure:
billing-api/
โโโ src/
โ โโโ index.js # Express app
โ โโโ routes/
โ โ โโโ dashboard.routes.js # Overview, trends, forecasts
โ โ โโโ vendor.routes.js # Vendor-specific billing
โ โ โโโ ai.routes.js # AI usage tracking
โ โ โโโ reports.routes.js # Report generation
โ โ โโโ admin.routes.js # Sync, alerts, health
โ โโโ services/
โ โ โโโ gcp.service.js # GCP billing API client
โ โ โโโ cloudflare.service.js # Cloudflare billing API
โ โ โโโ railway.service.js # Railway API client
โ โ โโโ openai.service.js # OpenAI usage tracking
โ โ โโโ resend.service.js # Resend metrics
โ โ โโโ bigquery.service.js # BigQuery read/write
โ โ โโโ anomaly.service.js # Anomaly detection
โ โ โโโ forecast.service.js # Cost forecasting
โ โโโ jobs/
โ โ โโโ ingest-gcp.js # GCP billing ingestion
โ โ โโโ ingest-cloudflare.js # Cloudflare ingestion
โ โ โโโ ingest-railway.js # Railway ingestion
โ โ โโโ ingest-openai.js # OpenAI ingestion
โ โ โโโ ingest-resend.js # Resend ingestion
โ โ โโโ normalize.js # Transform to canonical schema
โ โ โโโ attribute.js # Apply attribution rules
โ โ โโโ health-check.js # Pipeline health validation
โ โโโ middleware/
โ โ โโโ admin-auth.js # Admin-only endpoints
โ โ โโโ rate-limit.js # Rate limiting
โ โ โโโ cache.js # Redis caching (1hr TTL)
โ โโโ utils/
โ โ โโโ schema-mapper.js # Vendor โ canonical mapping
โ โ โโโ attribution.js # Cost attribution logic
โ โ โโโ pdf-generator.js # PDF reports
โ โ โโโ alert-sender.js # Slack/email alerts
โโโ Dockerfile
โโโ package.json
โโโ cloudbuild.yamlBigQuery Architecture (Expanded from Issue #52):
# Combined Analytics + Billing Dataset Structure
# ========== APP ANALYTICS (existing plan) ==========
lantern_analytics:
# User engagement data
raw_lanterns: # All lantern events
raw_waves: # All wave events
raw_chats: # All chat events
raw_offers: # Offer impressions/claims
# Pre-aggregated app metrics
daily_venue_stats: # Daily venue performance
daily_merchant_stats: # Daily merchant metrics
user_cohorts: # Retention analysis
# ========== BILLING DATA (new - Issue #52) ==========
billing_raw:
# Immutable vendor-shaped data
gcp_billing_export: # GCP auto-export (native)
gcp_billing_export_resource_v1: # Detailed resource costs
cloudflare_invoices_raw: # Cloudflare invoices
cloudflare_invoice_line_items: # Line items
cloudflare_usage_raw: # Usage metrics
railway_invoices_raw: # Railway invoices
railway_invoice_line_items: # Line items
railway_usage_raw: # Usage metrics
openai_usage_raw: # OpenAI API usage (Issue #240)
resend_usage_raw: # Resend email metrics (Issue #238)
billing_norm:
# Canonical cost schema (all vendors normalized)
fact_cost_line_items: # One row = one cost event
- date
- vendor (gcp|cloudflare|railway|openai|resend)
- service (firestore|storage|cloudRun|etc.)
- sku
- quantity
- unit_price
- total_cost
- currency
- metadata (JSON)
# Helper dimensions
dim_vendor_sku: # SKU โ human-readable mapping
dim_resource_map: # Resource โ service mapping
billing_attrib:
# Attribution rules and taxonomy
dim_taxonomy: # Allowed env/app/team/cost_center
dim_cost_allocation_rules: # Rule-based cost attribution
# Attributed costs
fact_cost_attributed: # fact_cost_line_items + attribution
- All fields from fact_cost_line_items
- environment (dev|prod)
- service_name (venue-api|admin-api|analytics-api)
- feature (lanterns|waves|venues|ai-triage)
- cost_center
- team
# Shared cost allocation (optional)
fact_cost_allocated: # Proportional cost splits
billing_marts:
# BI-friendly rollups (fast dashboards)
daily_cost_by_vendor: # Vendor totals per day
daily_cost_by_service: # Service costs per day
daily_cost_by_environment: # Dev vs prod costs
monthly_cost_by_vendor: # Monthly vendor totals
monthly_cost_by_service: # Monthly service costs
top_cost_drivers_30d: # Biggest cost sources
unattributed_cost_daily: # Costs without attribution
cost_trend_rolling_7d: # 7-day rolling averages
cost_forecast_30d: # Forecasted costs
ai_usage_by_model: # AI API usage breakdown (Issue #188)
ops:
# Pipeline health and monitoring
pipeline_runs: # Job execution logs
data_freshness: # Vendor data freshness checks
cost_anomalies: # Spend spikes, attribution drift
alert_history: # Alert notifications sentData Flow:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Vendor APIs (Daily Scheduled Jobs) โ
โ โข GCP Billing Export โ BigQuery (auto) โ
โ โข Cloudflare API โ Cloud Run Job โ billing_raw.cloudflare_*โ
โ โข Railway API โ Cloud Run Job โ billing_raw.railway_* โ
โ โข OpenAI API โ Cloud Run Job โ billing_raw.openai_* โ
โ โข Resend API โ Cloud Run Job โ billing_raw.resend_* โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ BigQuery Scheduled Queries (Daily 4 AM) โ
โ โข Normalize vendor data โ billing_norm.fact_cost_line_itemsโ
โ โข Apply attribution rules โ billing_attrib.fact_cost_attributedโ
โ โข Build dashboard marts โ billing_marts.* โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Cost Health Check Job (Hourly) โ
โ โข Validate data freshness (last 24 hours) โ
โ โข Detect anomalies (>20% day-over-day spike) โ
โ โข Check unattributed % (<5% threshold) โ
โ โข Write to ops.cost_anomalies โ
โ โข Send alerts if issues found โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Billing API (Cloud Run Service) โ
โ โข Admin portal dashboard (real-time metrics) โ
โ โข Cost reports (CSV/PDF exports) โ
โ โข AI usage transparency (public-facing, Issue #188) โ
โ โข Forecasting and trend analysis โ
โ โข Reads from billing_marts.* (fast queries) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโCost Attribution Example:
-- Example: Attribute Firestore costs to specific features
INSERT INTO billing_attrib.fact_cost_attributed
SELECT
fcl.*,
CASE
WHEN resource LIKE '%/lanterns/%' THEN 'lanterns'
WHEN resource LIKE '%/waves/%' THEN 'waves'
WHEN resource LIKE '%/venues/%' THEN 'venues'
WHEN resource LIKE '%/chats/%' THEN 'chats'
ELSE 'unattributed'
END as feature,
CASE
WHEN project_id = 'lantern-app-dev' THEN 'dev'
WHEN project_id = 'lantern-app-prod' THEN 'prod'
END as environment
FROM billing_norm.fact_cost_line_items fcl
WHERE vendor = 'gcp' AND service = 'firestore';Benefits:
- Multi-Vendor Visibility: Single dashboard for all billing (GCP, Cloudflare, Railway, OpenAI, Resend)
- Historical Analysis: Unlimited retention for trend analysis and forecasting
- Early Warnings: Anomaly detection alerts before invoices arrive
- Cost Attribution: Understand which features/environments drive costs
- AI Transparency: Public-facing AI usage dashboard (Issue #188)
- Automated Reports: Daily/weekly/monthly billing summaries
- Forecasting: Predict month-end costs with 7-day trend analysis
- Optimization: Identify cost-saving opportunities by service
- Audit Trail: Complete billing history for compliance
- Addresses 5 Issues: #52 (billing pipeline), #233 (system health), #240 (OpenAI), #238 (Resend), #188 (AI transparency)
Migration Priority: ๐ก Phase 2B (Weeks 4-6) - Combined with analytics-api deployment
4. Image Processing Service ๐ข MEDIUM PRIORITY โ
Current State:
- Direct Firebase Storage uploads
- Client-side validation only
- No image optimization or resizing
Why Cloud Run:
- Multi-step Processing: Resize, compress, watermark, format conversion
- Native Libraries: Sharp, ImageMagick for advanced operations
- Large Files: Handle 5MB+ uploads without timeout
- Multiple Outputs: Generate thumbnails, previews, full-size in one request
- Security: Malware scanning, EXIF stripping, content moderation
Recommended Architecture:
Cloud Run Service: image-service
POST /images/upload
Body: multipart/form-data image file
Returns: { imageId, urls: { thumbnail, preview, full }, metadata }
Description: Upload and process image (auto-generates all sizes)
POST /images/:imageId/resize
Body: { width, height, fit: "cover" | "contain" | "fill" }
Returns: { url, dimensions }
Description: Generate custom size
POST /images/:imageId/compress
Body: { quality: 1-100, format: "webp" | "jpeg" }
Returns: { url, originalSize, compressedSize, savings }
Description: Compress image
GET /images/:imageId/metadata
Returns: { dimensions, format, size, exif, colors }
Description: Get image metadata
POST /images/:imageId/moderate
Returns: { safe, labels, nsfw_score }
Description: Content moderation check
DELETE /images/:imageId
Returns: { deleted }
Description: Delete image and all variantsBenefits:
- Unified Pipeline: All image processing in one place
- Consistent Quality: Automatic optimization and sizing
- Security: Malware scanning and EXIF stripping
- Multiple Formats: WebP for modern browsers, JPEG fallback
- Watermarking: For merchant offer photos
Migration Priority: ๐ข Phase 3 (Weeks 6-8) - If image uploads are high-volume
5. Real-time Updates Service (Future) ๐ต LOW PRIORITY โ
Current State:
- Firestore listeners for real-time updates
- Works well for current scale
When to Build: Only if you need:
- WebSocket connections for 10,000+ concurrent users
- Sub-100ms latency for live updates
- Presence system (online/offline, typing indicators)
- Live lantern count updates at high traffic venues
Recommended Architecture:
Cloud Run Service: realtime-service
WS /ws/connect
Description: Establish WebSocket connection
WS /ws/venue/:venueId
Description: Subscribe to venue updates
WS /ws/chat/:connectionId
Description: Subscribe to chat messages
HTTP /presence/status
Returns: { online, venueActivity }
Description: Get current presence statusMigration Priority: ๐ต Phase 4+ (Weeks 9+) - Only if Firestore listeners don't scale
6. Full-Text Search Service (Future) ๐ต LOW PRIORITY โ
Current State:
- Basic Firestore queries
- Limited search capabilities
When to Build: Only if you need:
- Fuzzy search ("cofee" โ "coffee")
- Multi-field search (name + address + tags)
- Real-time autocomplete
- Ranking and relevance scoring
Recommended Architecture:
Cloud Run Service: search-service
GET /search/venues?q=coffee+downtown&limit=20
Returns: { results, total, suggestions }
Description: Search venues with fuzzy matching
GET /search/users?q=alice
Returns: { results, total }
Description: Search users (admin-only)
POST /search/index/rebuild
Returns: { indexed, errors }
Description: Rebuild search index
WS /search/suggest
Description: Real-time autocompleteTechnology Options:
- MeiliSearch (easy setup, good DX)
- Typesense (fast, typo-tolerant)
- Elasticsearch (powerful, complex)
Migration Priority: ๐ต Phase 4+ (Weeks 9+) - Only if basic Firestore queries insufficient
๐ต Keep as Cloud Functions โ
Well-suited for Cloud Functions:
- Short-lived operations (<30 seconds)
- Event-driven triggers (onCreate, onUpdate, etc.)
- Simple HTTP endpoints
- Lightweight utilities
Functions to Keep:
// Auth Triggers
onCreate("users/{uid}") โ initializeUserProfile
onDelete("users/{uid}") โ cleanupUserData
// Firestore Triggers
onUpdate("lanterns/{lanternId}") โ updateVenueActivityCount
onDelete("venues/{venueId}") โ cleanupVenueLanterns
// Scheduled Functions (if simple)
schedule("every 24 hours") โ cleanupExpiredLanterns
schedule("every 1 hour") โ cleanupExpiredWaves
// Simple HTTP Endpoints
POST /logClientError โ Log client error to Cloud Logging
GET /health โ Service health check๐ Cloud Functions vs Cloud Run Comparison โ
| Feature | Cloud Functions | Cloud Run |
|---|---|---|
| Max Execution Time | 9 minutes | 60 minutes (configurable) |
| Cold Start | ~1-2 seconds | ~2-5 seconds |
| Concurrency | 1-1000 requests/instance | Configurable (โค1000) |
| WebSocket Support | โ No | โ Yes |
| Container Size | Limited | Up to 10GB |
| Memory | 256MB - 16GB | 128MB - 32GB |
| CPU | Always allocated | Can be CPU-only during request |
| Pricing Model | Invocation + GB-sec | Request + CPU-time + Memory |
| Best For | Event-driven, simple HTTP | APIs, batch jobs, long tasks |
| Local Development | Firebase Emulator | Docker |
| State Management | None (stateless) | In-memory caching supported |
| Native Dependencies | Limited | Full container support |
๐ฏ Optimized Migration Roadmap โ
Phase 1 โ Immediate (Weeks 1-2) ๐ด CRITICAL โ
Service: venue-api (combined import + refresh)
Why First:
- Fixes critical security issue (Issue #224 - OSM import abuse)
- Addresses
firestore.rules:394public write vulnerability - Unblocks Google Places API integration
- High impact, clear requirements
Deliverables:
- [ ] Cloud Run service deployed to dev
- [ ] OSM import endpoint with job queue
- [ ] Venue refresh endpoint with rate limiting
- [ ] Admin consolidation endpoint (dedupe)
- [ ] Firestore security rules updated (remove public write)
- [ ] Documentation and API reference
- [ ] Integration tests
Success Metrics:
- OSM imports no longer client-accessible
- Venue refresh rate limited to 1 req/sec
- Import jobs handle 1000+ venues without timeout
Phase 2 โ Short Term (Weeks 3-4) ๐ก HIGH PRIORITY โ
Service: admin-api (consolidated admin operations)
Why Second:
- Improves admin portal backend
- Enables batch operations for efficiency
- Better local development experience
- Reduces cold start latency
Deliverables:
- [ ] Cloud Run service deployed to dev
- [ ] Migrate existing Cloud Functions to Express routes:
getUserByEmailโGET /admin/users/:uidsetUserRoleโPOST /admin/users/:uid/rolebanUserโPOST /admin/users/:uid/banunbanUserโDELETE /admin/users/:uid/ban
- [ ] New batch operations:
- Bulk ban users
- Bulk import users
- Bulk venue updates
- [ ] Admin audit logging
- [ ] Admin portal updated to use new API
- [ ] Deprecation plan for old Cloud Functions
Success Metrics:
- Admin operations 50% faster (reduced cold starts)
- Batch operations process 100+ records in <10 seconds
- Local development setup documented
Phase 2B โ Analytics + Billing (Weeks 4-6) ๐ก HIGH PRIORITY โ
Services: analytics-api + billing-api (combined deployment)
Why Critical:
- Merchant-facing feature (product requirement)
- Cost tracking and usage monitoring (Issues #52, #233)
- Business intelligence for growth
- Revenue optimization for merchants
- Multi-vendor billing consolidation (GCP, Cloudflare, Railway, OpenAI, Resend)
- AI usage transparency (Issue #188)
- Historical billing data (can't be retroactively captured)
Related Issues: #52, #233, #240, #238, #188
Deliverables:
BigQuery Infrastructure:
- [ ] Create
lantern_analyticsdataset:- Create partitioned raw event tables (lanterns, waves, chats, offers)
- Deploy Cloud Functions for streaming inserts
- Create scheduled queries for daily aggregations
- [ ] Create
billing_rawdataset:- Set up GCP Billing Export (auto-configured)
- Create tables: cloudflare_invoices_raw, railway_invoices_raw
- Create tables: openai_usage_raw, resend_usage_raw
- [ ] Create
billing_normdataset:- fact_cost_line_items (canonical schema)
- dim_vendor_sku, dim_resource_map
- [ ] Create
billing_attribdataset:- dim_taxonomy, dim_cost_allocation_rules
- fact_cost_attributed (with env/service/feature tags)
- [ ] Create
billing_martsdataset:- daily_cost_by_vendor, daily_cost_by_service
- monthly_cost_by_vendor, cost_trend_rolling_7d
- cost_forecast_30d, ai_usage_by_model
- [ ] Create
opsdataset:- pipeline_runs, data_freshness, cost_anomalies
Analytics API (analytics-api):
- [ ] Merchant analytics endpoints:
- Dashboard metrics (venue/offer performance)
- Venue analytics (activity, demographics, trends)
- Offer analytics (impressions, claims, conversions)
- Export reports (CSV/PDF)
- [ ] Admin analytics endpoints:
- Platform overview (users, engagement, revenue)
- Venue rankings, merchant performance
- [ ] Real-time metrics (cached):
- Active lanterns/waves, online users, top venues
- [ ] Hybrid query logic:
- Recent data (90 days): Read from Firestore
- Historical data (>90 days): Query BigQuery
Billing API (billing-api):
- [ ] Billing dashboard endpoints:
- Overview (total cost, by vendor, by service, trends)
- Vendor-specific details (GCP, Cloudflare, Railway, OpenAI, Resend)
- Service-level cost breakdown
- Environment-based attribution (dev vs prod)
- [ ] Cost tracking & analytics:
- Trend analysis, anomaly detection, forecasting
- AI usage tracking (Issue #240 - OpenAI, Issue #188 - transparency)
- Cost attribution by feature/service
- [ ] Export & reporting:
- Generate monthly billing reports (CSV/PDF)
- Raw data exports (admin-only)
- Email delivery of scheduled reports
- [ ] Admin operations:
- Manual sync triggers
- Alert configuration (thresholds, spike detection)
- Pipeline health checks
Cloud Run Jobs (Scheduled):
- [ ] Daily billing ingestion jobs (3 AM UTC):
- ingest-cloudflare (Cloudflare billing API)
- ingest-railway (Railway usage API)
- ingest-openai (OpenAI usage API, Issue #240)
- ingest-resend (Resend metrics API, Issue #238)
- [ ] Hourly cost health check:
- Validate data freshness (<24hr)
- Detect cost anomalies (>20% day-over-day)
- Alert on unattributed costs (>5%)
- Send Slack/email alerts
- [ ] Daily analytics aggregation:
- Firestore โ BigQuery streaming
- BigQuery scheduled queries (normalize, attribute, rollup)
- Weekly/monthly reports (BigQuery queries)
- Cost sync from GCP billing API Portal Integration:
- [ ] Merchant portal updates:
- Analytics dashboard (venue/offer performance)
- Export buttons (CSV/PDF reports)
- Real-time venue stats
- [ ] Admin portal updates:
- System Health "Billing" tab (Issue #233)
- Multi-vendor cost dashboard (GCP, Cloudflare, Railway, OpenAI, Resend)
- Platform metrics (users, engagement, revenue)
- Merchant reports
- Cost anomaly alerts
- AI usage transparency page (Issue #188 - public-facing)
Success Metrics:
- Merchants can view venue performance in <2 seconds
- Multi-vendor billing data visible in admin portal (Issues #52, #233)
- Cost tracking updates daily from all vendors
- Anomaly detection alerts fire within 1 hour of spike
- Reports generate in <30 seconds for 90-day periods
- Real-time metrics cached and served in <100ms
- AI usage transparency page loads in <1 second (Issue #188)
- Historical billing data captured from deployment date forward
Phase 3 โ Medium Term (Weeks 7-9) ๐ข MEDIUM PRIORITY โ
Service: image-service (if needed)
Why Third:
- Image processing only needed if uploads become high-volume
- Can start with direct Firebase Storage uploads
- Migrate when optimization/security becomes critical
Deliverables:
- [ ] Assess current image upload volume and requirements
- [ ] If needed: Deploy
image-servicewith Sharp/ImageMagick - [ ] Multi-size image generation (thumbnail, preview, full)
- [ ] Image compression and format optimization
- [ ] Content moderation integration
- [ ] Watermarking for merchant offers
Phase 4 โ Long Term (Weeks 10+) ๐ต LOW PRIORITY โ
Services (build only if needed):
realtime-service(if Firestore listeners don't scale)search-service(if Firestore queries insufficient)
Why Last:
- Current solutions may be sufficient for 1-2 years
- Build these only when you hit scaling issues
- Focus on product features instead
๐ฐ Cost Analysis โ
Cloud Functions (Current) โ
Estimate for 10,000 monthly active users:
- 10 admin functions ร $0.40/million invocations = $4/month
- 2GB-sec per invocation ร 50,000 invocations = $1.25/month
Total: ~$5.25/monthCloud Run (Proposed) โ
Estimate for 10,000 monthly active users:
venue-api:
- 100,000 requests/month
- 512MB memory, 1 vCPU
- Avg 2 seconds per request
- Cost: ~$15/month
admin-api:
- 10,000 requests/month
- 256MB memory, 1 vCPU
- Avg 500ms per request
- Cost: ~$3/month
analytics-api:
- 50,000 requests/month (merchant dashboards + admin)
- 512MB memory, 1 vCPU
- Avg 1 second per request (cached metrics)
- Scheduled jobs: 1 daily (aggregation)
- Cost: ~$6/month
billing-api:
- 20,000 requests/month (admin dashboards + reports)
- 512MB memory, 1 vCPU
- Avg 1 second per request (cached metrics)
- Scheduled jobs: 5 daily (GCP, Cloudflare, Railway, OpenAI, Resend ingest)
- Hourly health check job
- Cost: ~$8/month
BigQuery (analytics + billing):
- Streaming inserts: 1.5GB/month app data + 0.5GB/month billing data = $0.10/month
- Storage (Year 1): 24GB ร $0.02/GB = $0.48/month
- Query costs: 3,500 queries ร 15MB avg = 0.05TB ร $5/TB = $0.25/month
- Scheduled aggregations: 0.02TB ร $5/TB = $0.10/month
- Total BigQuery: ~$1/month
Cloud Functions (streaming):
- 5 Firestore triggers for BigQuery streaming (lanterns, waves, chats, offers, venues)
- ~50,000 invocations/month ร $0.40/million = $0.02/month
Total: ~$34/monthCost Increase: ~$29/month (~5.5x) from current $5/month
Value Gained:
- Security fixes (Issues #224, #394 - critical vulnerabilities)
- Multi-vendor billing visibility (Issues #52, #233 - GCP, Cloudflare, Railway, OpenAI, Resend)
- Cost tracking & anomaly detection (prevents runaway bills before they happen)
- AI usage transparency (Issue #188 - public-facing dashboard)
- Historical billing data (can't be retroactively captured - start now!)
- Merchant analytics (product feature, revenue driver)
- Business intelligence (growth metrics, forecasting, attribution)
- Enables Google Places integration
- Supports 1000+ venue imports
- Batch admin operations
- Better monitoring and logging
- Foundation for future growth
- Addresses 8 open issues (#52, #224, #233, #238, #240, #188, #394, plus venues/admin improvements)
Break-even Analysis: As usage grows, Cloud Run becomes more cost-effective:
- At 100k users: Cloud Run is 50% cheaper
- At 1M users: Cloud Run is 70% cheaper
- Long-running operations impossible on Cloud Functions
๐๏ธ Implementation Guidelines โ
Cloud Run Best Practices โ
Service Configuration
yaml# cloudbuild.yaml example timeout: 3600s # 60 minutes memory: 1Gi cpu: 1 minInstances: 0 # Scale to zero when idle maxInstances: 10 concurrency: 80 # Requests per instanceEnvironment Variables
bashFIREBASE_PROJECT_ID=lantern-app-dev ENVIRONMENT=dev LOG_LEVEL=info RATE_LIMIT_OSM=5 # 5 req/sec RATE_LIMIT_NOMINATIM=1 # 1 req/secHealth Checks
javascriptapp.get('/health', (req, res) => { res.json({ status: 'ok', timestamp: Date.now() }); });Graceful Shutdown
javascriptprocess.on('SIGTERM', () => { console.log('SIGTERM received, closing server...'); server.close(() => { console.log('Server closed'); process.exit(0); }); });
Security โ
Authentication
- Use Firebase Admin SDK to verify ID tokens
- Implement role-based access control (RBAC)
- Admin endpoints require
adminrole
Rate Limiting
- Per-user rate limits (10 req/min for imports)
- Global rate limits for external APIs
- IP-based rate limiting for abuse prevention
Input Validation
- Validate all request bodies with Joi/Zod
- Sanitize user inputs
- Limit request body size (10MB max)
Monitoring โ
Cloud Logging
- Structured JSON logs
- Include request ID, user ID, action
- Log errors with stack traces
Cloud Monitoring
- Request latency (p50, p95, p99)
- Error rate
- Active instances
- Memory/CPU usage
Alerts
- Error rate >5% for 5 minutes
- Latency p99 >10 seconds
- Memory usage >80%
๐งช Testing Strategy โ
Local Development โ
# venue-api local testing
cd services/venue-api
docker build -t venue-api .
docker run -p 8080:8080 \
-e FIREBASE_PROJECT_ID=lantern-app-dev \
-e ENVIRONMENT=local \
venue-api
# Test endpoints
curl http://localhost:8080/health
curl -X POST http://localhost:8080/venues/import/osm \
-H "Authorization: Bearer $TOKEN" \
-d '{"lat":40.7128,"lng":-74.0060,"radius":1000}'Integration Tests โ
// venue-api integration test
describe('POST /venues/import/osm', () => {
it('should import venues from OSM', async () => {
const response = await request(app)
.post('/venues/import/osm')
.set('Authorization', `Bearer ${adminToken}`)
.send({ lat: 40.7128, lng: -74.0060, radius: 1000 })
.expect(200);
expect(response.body).toHaveProperty('jobId');
expect(response.body.status).toBe('queued');
});
it('should reject unauthorized users', async () => {
await request(app)
.post('/venues/import/osm')
.send({ lat: 40.7128, lng: -74.0060, radius: 1000 })
.expect(401);
});
});Load Testing โ
# Use Apache Bench for load testing
ab -n 1000 -c 10 -H "Authorization: Bearer $TOKEN" \
http://venue-api-dev.a.run.app/health
# Use k6 for complex scenarios
k6 run load-test.js๐ Migration Checklist โ
Pre-Migration โ
- [ ] Review current Cloud Functions usage and costs
- [ ] Audit API rate limits (OSM, Nominatim, Google Places)
- [ ] Set up Cloud Run development environment
- [ ] Create Dockerfiles for each service
- [ ] Set up Cloud Build pipelines
- [ ] Configure secret management (Secret Manager)
During Migration โ
- [ ] Deploy to dev environment first
- [ ] Run integration tests
- [ ] Monitor error rates and latency
- [ ] Gradual rollout (10% โ 50% โ 100%)
- [ ] Keep Cloud Functions as fallback
Post-Migration โ
- [ ] Monitor for 1 week before deprecating Cloud Functions
- [ ] Update documentation
- [ ] Update admin portal to use new APIs
- [ ] Remove old Cloud Functions
- [ ] Update Firestore security rules
- [ ] Create changelog entry
๐ Key Takeaways โ
โ Use Cloud Run When You Need: โ
- Batch processing (100+ items)
- Long-running tasks (>1 minute, up to 60 minutes)
- WebSockets or persistent connections
- Multi-step workflows with state
- Native dependencies (ImageMagick, etc.)
- Container-based applications
- High concurrency (1000+ req/instance)
โ Use Cloud Functions When You Need: โ
- Simple, single-purpose logic (<50 lines)
- Event-driven triggers (onCreate, onUpdate)
- Fast execution (<30 seconds)
- Minimal configuration
- Pay-per-invocation model for low traffic
๐ฏ Lantern-Specific Priorities: โ
Phase 1 (Weeks 1-2):
venue-api๐ด- Security fix (Issues #224, #394)
- Google Places foundation
Phase 2 (Weeks 3-4):
admin-api๐ก- Better admin operations
- Batch processing
Phase 2B (Weeks 4-6):
analytics-api+billing-api๐ก EXPANDED SCOPE- Merchant analytics (product feature)
- Multi-vendor billing (Issues #52, #233, #240, #238, #188)
- Cost tracking & anomaly detection
- AI usage transparency
- Critical: Historical data capture starts from deployment
Phase 3 (Weeks 7-9):
image-service๐ข- Only if needed
Phase 4+ (Weeks 10+): Real-time, search ๐ต
- Only if scaling issues
Most Important:
- Start with Phase 1 (
venue-api) to fix critical security vulnerabilities - Phase 2B is CRITICAL because:
- Can't retroactively capture historical billing data - must start early
- Prevents cost surprises before they happen (anomaly detection)
- Merchant analytics drives product value and revenue
- AI transparency builds user trust (Issue #188)
- Addresses 5 open billing/analytics issues in one deployment
๐ References โ
Document Owner: @cattreedev Last Review: 2026-02-09 Next Review: After Phase 1 completion