Privacy Hardening Roadmap

Living document. Last updated: 2026-05-11. Status: Sprint A in progress. Owner: privacy workstream. Companion brief: SEALED_IDENTITY.md

1. The bar

The architectural goal, agreed 2026-05-11, is that a complete Firestore dump or BigQuery export, OR a subpoena providing a phone number, should yield no usefully identifiable user content. Identifiers should be hashed against a KMS pepper the server doesn't directly hold (Cloud KMS-resident long term). Content fields should be encrypted with keys derived from material the server cannot reproduce.

Stated differently: privacy stops being a policy we keep and becomes a property of the data. We can't disclose what we don't structurally hold.

This is aspirational but not theatre: it's the strict reading of Immutable Right #6 (Cofounder Agreement) and Business Plan §3.2 (anonymity, k-anonymity ≥ 3 on merchant-surfaced metrics).

2. What's already shipped

| PR | Title | Merged |
| --- | --- | --- |
| #479 | Sealed-identity Stage A phases 1–2 + log hygiene (#307) | 2026-05-10 |
| #485 | Stage A phase 3 + banned_accounts end-to-end | 2026-05-11 |
| #490 | Stage A phase 4: server-side createUser (stops writing plaintext users.phone on new signups) | 2026-05-11 |

Effect today: new accounts store phoneHash only; existing accounts still have plaintext phone until Sprint C below. Profile encryption (PBKDF2 + AES-GCM via apps/web/src/lib/encryption.js) was already in place before this roadmap began.
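
To make the phoneHash idea concrete, here is a minimal sketch of peppered HMAC-SHA-256 hashing, the scheme recorded in the decision log. The names `normalizePhone` and `hashPhone`, the normalization rule, and the in-memory pepper are assumptions for illustration only; in the architecture described here the pepper is meant to live in Cloud KMS rather than in process memory.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: strip formatting so the same number always hashes the same way.
function normalizePhone(raw) {
  const digits = String(raw).replace(/[^\d]/g, '');
  if (digits.length === 0) throw new Error('empty phone number');
  return `+${digits}`;
}

// Deterministic keyed hash: equal phones give equal hashes, so the login path
// can do an equality query, while a DB dump without the pepper cannot be
// brute-forced over the phone-number space.
function hashPhone(rawPhone, pepper) {
  return crypto
    .createHmac('sha256', pepper)
    .update(normalizePhone(rawPhone), 'utf8')
    .digest('hex');
}

// Placeholder pepper for the demo only (assumption: real pepper is KMS-held).
const demoPepper = Buffer.alloc(32, 7);
const demoHashA = hashPhone('+1 (555) 010-0199', demoPepper);
const demoHashB = hashPhone('+15550100199', demoPepper);
console.log(demoHashA === demoHashB); // same number, same hash
```

Determinism is the property that bcrypt's per-row salt would break: a lookup by phone needs the same input to map to the same stored value.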

3. Current threat model: what a dump or subpoena yields

| Surface | Today (post-Stage-A) | Useful to attacker / court? |
| --- | --- | --- |
| users.phoneHash (new accounts) | hex hash, no preimage without our KMS pepper | Only confirms "this phone has an account" if you can compute the hash. |
| users.phone (existing accounts, pre-migration only) | plaintext on rows the Sprint C migration script hasn't run against yet | Yes for those rows; eliminated by the operator running tooling/scripts/drop-plaintext-phone.mjs after the Stage A flag has soaked. |
| users.encryptedSeed, phoneSalt, authProofHash, encryptedBirthDate, encryptionCanary | sealed | No; we cannot decrypt. |
| users.lanternName, lastLoginAt, authMethod | pseudonym + timestamps | Fuzzy. |
| users.encryptedMood | AES-GCM ciphertext (post-Sprint-A) | No; we cannot decrypt. |
| users.encryptedInterests | AES-GCM ciphertext (post-Sprint-A) | No; we cannot decrypt. |
| users.mood / users.vibe / users.interests (legacy plaintext) | plaintext on not-yet-migrated docs only | Yes for those docs; eliminated as lazy migration runs on next read. |
| lanterns.profileVibe / lanterns.profileInterests (denormalized at light time) | plaintext, but bounded by 48h TTL (Sprint B) | User-elected published content visible to nearby users by design; bounded exposure. |
| lanterns.interest (free-text quote) | plaintext, bounded by 48h TTL (Sprint B) | User-elected published content; see Sprint A scope decision in §4 and decision log §7. |
| lanterns.lat, lanterns.lng | post-Sprint-A truncated to ~111m (3 decimals) | Fine-grained movement reconstruction now structurally degraded. |
| chats.* message bodies | encrypted client-side | No |
| chats.* metadata (timestamps, participant userIds) | userId pairs + timestamps | Some; userId linkability remains until Stage B. |
| waves.* | userId pairs + timestamps | Some; same. |
| banned_accounts | hashes only, server-only Firestore rule | No |
| BigQuery analytics.events (raw, user-keyed) | user_id-indexed event stream, 90-day partition expiration enforced by BigQuery itself (partitioning.expirationMs = 7776000000 on the timestamp field). Partitions older than 90 days are auto-deleted; events older than 90 days simply do not exist. | No, for events outside the 90-day window. For events inside the window: Sprint D.2 pseudonymizes a deleted user's events immediately; Sprint B.2 is a safety net in case partition expiration is ever lengthened. |
| BigQuery analytics.event_counts_daily (aggregated export) | columns: day, event_name, event_tier, environment, count, aggregated_at. No user_id, no entity_id, no per-user breakdown. The MERGE in eventCountsAggregation.service.js groups by (day, event_name, event_tier, environment) only; user identity is aggregated away at the bridge between raw and aggregated tiers. Indefinite retention. | No; there is no per-user information to leak. This is the merchant-facing analytics surface; long-tail retention is intentional and privacy-safe. |
| Cloud Run access logs | IPs + timestamps + userIds (30-day retention from #307) | Some; structural to GCP. |
| users/{uid} ↔ activity link via plaintext userId | direct | Yes; Stage B (Sprint E) seals this last surface. |

4. Sprint plan

Status legend: 🟦 not started / 🟡 in progress / 🟢 merged

Sprint A: Encrypt the loudest plaintext leaks 🟢

Effort: ~3–4 days. PR: #498, merged 2026-05-11 02:30 UTC. Follow-up review fixes in [#TBD] (Copilot review on #498 surfaced input validation + NaN edge cases; addressed).

Scope decision (2026-05-11): the lantern free-text quote (lantern.interest) stays plaintext at rest. It's user-elected published content visible to nearby users by design: the product's social-discovery use case (e.g. "Looking for hiking partners") depends on viewers being able to read it. Encrypting it with the lighter's key breaks the feature; encrypting with a venue/proximity-derived key is out of scope pre-launch. Breach exposure is bounded by Sprint B's 48h TTL on lantern docs.

What we still encrypt in Sprint A:

  • [x] Encrypt users.mood at write, decrypt at read. Stored as encryptedMood. (#281 item 1)
  • [x] Encrypt users.interests (array) at write, decrypt at read. Stored as encryptedInterests. (#281 item 1)
  • [x] Geohash-truncate lanterns.lat/lng to ~3 decimal places (~111 m precision) before write. Exact coords stay in memory for proximity checks only. (#281 item 4)
  • [x] Update lantern denormalization: at light time, the lighter's decrypted mood + interests are passed to the Cloud Function via formData.profileVibe / formData.profileInterests and written onto the lantern doc. This keeps LanternMiniProfile working for viewers who can't decrypt the lighter's profile.
  • [x] Lazy migration: existing user docs get re-encrypted on next profile read when the encryption key is cached. No bulk migration job.
  • [x] Tests for profileService + lanternService updated.
  • [x] Follow-up: input validation on formData.profileInterests/profileVibe (server-side caps), truncateCoord rejects non-finite values, additional unit tests for encrypted-read paths (per Copilot review on #498).
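
The coordinate-truncation item above can be sketched as follows. This is an illustrative guess at the shape of a `truncateCoord`-style helper, not the code from #498: the signature, the default of 3 decimals, and the error type are assumptions.

```javascript
// Hypothetical sketch: drop everything past `decimals` places (toward zero).
// At 3 decimals one step of latitude is roughly 111 m.
function truncateCoord(value, decimals = 3) {
  // Reject NaN, Infinity, and non-numbers instead of writing them to the doc.
  if (typeof value !== 'number' || !Number.isFinite(value)) {
    throw new TypeError('coordinate must be a finite number');
  }
  const factor = 10 ** decimals;
  // Note: binary floating point can shift a value that sits exactly on a
  // step boundary by one step; that is harmless for coarsening purposes.
  return Math.trunc(value * factor) / factor;
}

console.log(truncateCoord(40.7128));  // 40.712
console.log(truncateCoord(-74.0064)); // -74.006
```

Exact coordinates would stay in memory for the proximity check, and only the truncated values would be written to the lantern doc.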

What stays plaintext (and why):

  • lantern.interest (free-text quote): user-elected published content. Bounded by 48h TTL (Sprint B). Documented in §7 below.
  • lantern.profileVibe / lantern.profileInterests (denormalized): copies of the user's mood/interests written to the lantern at light time, so nearby users can read them without holding the lighter's key. Bounded by the same 48h TTL.
  • lantern.encryptedMetadata: the schema comment in lanternService.js mentions an optional "encrypted user notes" field. We are NOT implementing this in Sprint A. It's a private-note feature for the lighter only, separate from the published mood/interest. Tracked as a future enhancement, not blocking the privacy bar.

What Sprint A buys you: a Firestore dump of users/* shows ciphertext where there used to be plaintext sentences about user vibes and interests. A dump of lanterns/* still shows the published-content fields (intentional), but coordinates are truncated to ~111m precision so user-movement reconstruction from a leak is rendered much coarser. Combined with Sprint B's TTL enforcement, the practical exposure window for any lantern data closes to 48h.
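
For readers unfamiliar with the PBKDF2 + AES-GCM pairing, the sketch below shows roughly what "ciphertext where there used to be plaintext" means for a field like encryptedMood. It uses Node's built-in crypto for brevity, whereas apps/web/src/lib/encryption.js runs in the browser; the function names, iteration count, salt handling, and the iv+tag+ciphertext layout are all assumptions, not the shipped format.

```javascript
const crypto = require('node:crypto');

// Assumed parameters: 100k iterations, SHA-256, 32-byte key.
function deriveKey(passphrase, salt, iterations = 100000) {
  return crypto.pbkdf2Sync(passphrase, salt, iterations, 32, 'sha256');
}

// Encrypts one field; output is base64(iv || authTag || ciphertext).
function encryptField(plaintext, key) {
  const iv = crypto.randomBytes(12); // fresh 96-bit nonce per write
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString('base64');
}

// Reverses encryptField; throws if the key is wrong or the data was altered.
function decryptField(encoded, key) {
  const raw = Buffer.from(encoded, 'base64');
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, raw.subarray(0, 12));
  decipher.setAuthTag(raw.subarray(12, 28));
  return Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]).toString('utf8');
}

const demoKey = deriveKey('user passphrase', Buffer.alloc(16, 1));
const demoSealedMood = encryptField('quiet, open to a chat', demoKey);
console.log(decryptField(demoSealedMood, demoKey)); // quiet, open to a chat
```

Because the key is derived from material only the client holds, a server-side dump contains the base64 blob and nothing that can open it.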

Sprint B: Retention and pseudonymization 🟡

Split into two PRs for review-surface reasons.

Sprint B.1: TTL enforcement (Firestore) 🟡

Effort: ~2 days. PR: claude/privacy-sprint-b (this branch).

  • [x] Scheduled Cloud Function purging lanterns docs older than 48h from litAt. Runs every 6h.
  • [x] Scheduled Cloud Function purging waves docs older than 7d from createdAt. Runs daily at 03:00 UTC.
  • [x] Scheduled Cloud Function purging connections docs older than 30d (by lastActivityAt with createdAt fallback for legacy schema) and cascade-deleting their messages/* sub-collection. Runs daily at 04:00 UTC.
  • [x] Each job batched (400 ops per Firestore batch) with a per-run cap (5 000 docs) so any single invocation has bounded cost.
  • [x] Exports registered in main.js.
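
The batching and per-run cap described in the list above can be sketched as a pure planning step. The helper names `isExpired` and `planPurgeBatches` are hypothetical, and the real jobs operate on Firestore query results rather than plain arrays; only the 400 and 5 000 figures and the 48h window come from this roadmap.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// True when a doc's timestamp is older than the retention window.
function isExpired(timestampMs, maxAgeMs, nowMs) {
  return nowMs - timestampMs > maxAgeMs;
}

// Splits the expired doc ids into Firestore-sized batches, and stops at the
// per-run cap so one invocation has bounded cost. Anything past the cap is
// left for the next scheduled run.
function planPurgeBatches(docIds, { batchSize = 400, runCap = 5000 } = {}) {
  const capped = docIds.slice(0, runCap);
  const batches = [];
  for (let i = 0; i < capped.length; i += batchSize) {
    batches.push(capped.slice(i, i + batchSize));
  }
  return { batches, deferred: docIds.length - capped.length };
}

const demoIds = Array.from({ length: 5200 }, (_, i) => `lantern-${i}`);
const demoPlan = planPurgeBatches(demoIds);
console.log(demoPlan.batches.length, demoPlan.deferred); // 13 200
```

With 5 200 expired docs, the first run handles 5 000 in 13 batches (twelve of 400 and one of 200) and defers the remaining 200.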

Sprint B.2: BigQuery user_id pseudonymization 🟡

Effort: ~1–2 days. PR: claude/privacy-sprint-b2 (this branch).

  • [x] services/api/analytics/src/services/userIdPseudonymization.service.js: SQL builder + runner that issues UPDATE analytics.events SET user_id = TO_HEX(SHA256(CONCAT(user_id, @salt))) against rows older than 90 days. Salt is crypto.randomBytes(32).toString('hex'), generated fresh per invocation, never persisted, logged, or returned.
  • [x] Idempotency: NOT REGEXP_CONTAINS(user_id, r'^[a-f0-9]{64}$') skips already-pseudonymized rows.
  • [x] Dry-run mode: returns COUNT(*) of would-affect rows without mutating.
  • [x] Range guard: ageDays clamped to [30, 730]. A too-low value would pseudonymize fresh operational data.
  • [x] HTTP entrypoint: POST /analytics/scheduled/pseudonymize-user-ids (Cloud-Scheduler-header OR admin auth, same pattern as the existing aggregation jobs).
  • [x] 8 vitest cases covering SQL shape, idempotency regex, ageDays validation, dry-run/UPDATE result shapes, salt-freshness, and the cryptographic-erasure invariant.
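
As a rough illustration of the builder described above, the sketch below assembles the dry-run and UPDATE statements with the range guard, the idempotency regex, and a fresh salt. It is not the code in userIdPseudonymization.service.js: the function names, the return shape, and the use of a column named `timestamp` in the age filter are assumptions.

```javascript
const crypto = require('node:crypto');

const MIN_AGE_DAYS = 30;
const MAX_AGE_DAYS = 730;

// Range guard: a too-low value would pseudonymize fresh operational data.
function clampAgeDays(ageDays) {
  if (!Number.isInteger(ageDays)) throw new TypeError('ageDays must be an integer');
  return Math.min(MAX_AGE_DAYS, Math.max(MIN_AGE_DAYS, ageDays));
}

function buildPseudonymizeQuery({ ageDays = 90, dryRun = false } = {}) {
  const days = clampAgeDays(ageDays);
  const where = [
    // Assumed column name for the event time.
    'timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL ' + days + ' DAY)',
    // Idempotency: skip rows that already look like a SHA-256 hex digest.
    "NOT REGEXP_CONTAINS(user_id, r'^[a-f0-9]{64}$')",
  ].join(' AND ');

  if (dryRun) {
    return { sql: 'SELECT COUNT(*) AS would_affect FROM analytics.events WHERE ' + where, params: {} };
  }
  // Fresh salt per invocation; the caller must not persist, log, or return it.
  const salt = crypto.randomBytes(32).toString('hex');
  return {
    sql: 'UPDATE analytics.events SET user_id = TO_HEX(SHA256(CONCAT(user_id, @salt))) WHERE ' + where,
    params: { salt },
  };
}

console.log(buildPseudonymizeQuery({ ageDays: 5, dryRun: true }).sql);
```

Discarding the salt after the UPDATE is what makes the mapping irreversible: without it, nobody can recompute which hash a given user_id became.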

Operator follow-up (after merge):

  • [ ] Create the Cloud Scheduler job: daily at 05:00 UTC POST /analytics/scheduled/pseudonymize-user-ids.
  • [ ] Confirm analytics service account has roles/bigquery.dataEditor on the analytics dataset.
  • [ ] Privacy-policy line: "Analytics events older than 90 days are pseudonymized; even Lantern cannot link them to a current account."

What this buys you (combined): A breach today can't expose data outside the retention window. Firestore content is purged within hours/days of its useful life ending. BigQuery events older than 90 days become aggregable but no longer user-linkable, even by Lantern.

Sprint C: Finish Stage A (Phase 5) 🟡

Effort: ~2 days. PR: claude/privacy-sprint-c (this branch). Code changes merge in this PR; the migration script is operator-run after deploy.

Code (merged with this PR):

  • [x] tooling/scripts/drop-plaintext-phone.mjs: cursor-paginated, --dry-run + --limit=N flags, idempotent. Refuses to drop phone from a row that doesn't already have phoneHash (defense-in-depth against running out of order).
  • [x] Updated services/functions/firebase/modules/phoneLookup.js: uses phoneHash when STAGE_A_PHONE_HASH_LOOKUP_ENABLED=true; falls back to plaintext otherwise. Same selectLookupQuery pattern as services/api/auth/src/routes/phone.js.
  • [x] Updated services/api/auth/src/routes/phoneRecycling.js: same Stage A flag pattern for the reclaim-by-phone lookup.
  • [x] Audited remaining users.phone reads. The only Firestore queries on the plaintext field were the two updated above. Other phone references in the codebase point at adminProfiles.phone (separate collection) or Firebase Auth user.phoneNumber (managed by Firebase Auth, not Firestore); none of those are affected.
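
The "refuse to drop without phoneHash" guard in the migration script can be sketched as a per-row classification. The function names and the three outcome labels are hypothetical; the actual script's internals are not shown in this roadmap.

```javascript
// Hypothetical per-row decision for a drop-plaintext-phone style migration.
function classifyUserRow(data) {
  const hasPhone = typeof data.phone === 'string' && data.phone.length > 0;
  const hasHash = typeof data.phoneHash === 'string' && data.phoneHash.length > 0;
  if (!hasPhone) return 'already-clean';   // idempotent: nothing left to do
  if (!hasHash) return 'skip-missing-hash'; // never drop the only copy
  return 'drop';
}

// Dry-run style summary: counts per outcome without touching any row.
function summarizeRows(rows) {
  const counts = { drop: 0, 'skip-missing-hash': 0, 'already-clean': 0 };
  for (const row of rows) counts[classifyUserRow(row)] += 1;
  return counts;
}

const demoRows = [
  { phone: '+15550100199', phoneHash: 'ab12' },
  { phoneHash: 'cd34' },
  { phone: '+15550100200' },
];
console.log(summarizeRows(demoRows));
// { drop: 1, 'skip-missing-hash': 1, 'already-clean': 1 }
```

A non-zero skip count in a dry run is the signal to stop and re-run the backfill before dropping anything.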

Operator runbook (after merge):

  • [ ] Confirm STAGE_A_PHONE_HASH_LOOKUP_ENABLED=true is set on both auth-api Cloud Run AND the Cloud Functions runtime in every target environment.
  • [ ] Confirm tooling/scripts/backfill-phone-hash.mjs has been run on the target environment so every existing users/* row has phoneHash populated.
  • [ ] Let the new code soak with the flag on for at least 48h to surface any plaintext-path regressions.
  • [ ] Run tooling/scripts/drop-plaintext-phone.mjs --dry-run and inspect the counts. Expect Would drop = (#rows with plaintext phone), Skipped (missing phoneHash) = 0. If any rows are missing the hash, stop and re-run the backfill before continuing.
  • [ ] Run tooling/scripts/drop-plaintext-phone.mjs (no --dry-run). Idempotent: safe to re-run if interrupted.
  • [ ] Optional: drop any composite Firestore index on users.phone if one exists. Single-field indexes are auto-created and will be auto-removed by Firestore over time once nothing queries the field. No active harm from leaving them.

What this buys you: The "plaintext phone (existing accounts)" row in the §3 threat-model table goes away entirely. A Firestore dump of users/* post-migration shows hashed phones across the board, with no remaining direct-PII leak via that surface.

Sprint D: GDPR cryptographic erasure cascade 🟡

Split into two PRs for review-surface reasons (mirrors the Sprint B split rationale).

Sprint D.1: Server cascade (Firestore + Auth + Storage) 🟡

Effort: ~2 days. PR: claude/privacy-sprint-d (this branch).

  • [x] services/functions/firebase/modules/userDeletion.js: deleteUserCompletely callable Cloud Function. Cascades across:
    • Firebase Auth user record
    • users/{userId} Firestore doc
    • lanterns where userId = X
    • waves where senderId = X OR receiverId = X (two-pass query)
    • connections where participants contains X, with messages sub-collection cascade
    • Cloud Storage avatars/{userId}/* (best-effort)
  • [x] Authorization: admin-deleting-other (verifyAdmin) OR self-delete (callerUid matches userId).
  • [x] Audit row in adminActions: userId-only references per docs/privacy/LOG_HYGIENE.md, capped 280-char reason. Returns counts per surface.
  • [x] Batched writes (400/batch) with safety break on runaway loops.
  • [x] Resilient: each cascade is wrapped in try/catch and continues on partial failure. Auth delete treats auth/user-not-found as success.
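
The "continue on partial failure" behaviour listed above can be sketched as a small runner that wraps each surface in try/catch and reports counts. The `runCascade` name, the step map, and the error list are assumptions for illustration; deleteUserCompletely itself talks to Firestore, Auth, and Storage.

```javascript
// Hypothetical runner: `steps` maps a surface name to an async function that
// resolves to the number of records it removed.
async function runCascade(steps) {
  const counts = {};
  const errors = [];
  for (const [surface, step] of Object.entries(steps)) {
    try {
      counts[surface] = await step();
    } catch (err) {
      counts[surface] = 0;
      // An already-missing Auth user is treated as success, not as a failure.
      if (err && err.code === 'auth/user-not-found') continue;
      // Record the failure and keep going with the remaining surfaces.
      errors.push({ surface, message: String(err && err.message) });
    }
  }
  return { counts, errors };
}

runCascade({
  lanterns: async () => 2,
  waves: async () => { throw new Error('transient failure'); },
  userDoc: async () => 1,
}).then((result) => console.log(result.counts, result.errors.length));
```

The trade-off is that a failed surface leaves data behind for a later retry, which is why the per-surface counts and errors are worth returning to the caller.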

Sprint D.2: BigQuery pseudonymization-on-deletion 🟡

Effort: ~0.5 day. PR: claude/privacy-sprint-d2 (this branch).

  • [x] pseudonymizeUserBigQueryEvents(userId, projectId) helper inline in services/functions/firebase/modules/userDeletion.js. Mirrors the bulk job from Sprint B.2 but scoped to a single userId with a fresh per-call salt (crypto.randomBytes(32)).
  • [x] Wired into the deleteUserCompletely cascade as step 7 (between user-doc + Auth cleanup and the audit row). Best-effort: BigQuery failures are logged and swallowed so they don't roll back the Firestore/Auth deletion that's already completed.
  • [x] BigQuery dep available transitively via @lantern/forge (already a dep of services/functions/firebase). No new top-level dep added.
  • [x] counts.bigQueryEvents added to the return shape so the caller can see how many rows were pseudonymized.

Architectural choice (worth noting): inlined in the Cloud Function rather than calling the analytics service via HTTPS. Cross-service auth (Cloud Function → Cloud Run with IAM ID tokens) adds setup. The BigQuery client is small enough that inlining is the lower-friction call. If we ever extract analytics into its own deploy with strict isolation, this is a clean refactor candidate.

Sprint D.3: UI integration 🟦

Effort: ~1 day. PR: (TBD; separate UI work)

  • [ ] Admin panel button for moderator-triggered deletion (with reason field).
  • [ ] User-initiated "delete my account" path from profile settings, with confirmation.
  • [ ] Privacy policy line: "we cascade and cryptographically erase on request."

What this buys you (combined): A user requesting deletion (or required under GDPR Art. 17) leaves no recoverable trace within seconds, not 90 days. The "account deletion leaves orphaned data" gap from #281 item 3 is closed.

Sprint E: Stage B (seal the userId resolution) 🟦

Effort: ~2 sprints, ~8–10 working days. PR: (TBD, likely multiple)

  • [ ] Replace plaintext userId in the auth-lookup row with encryptedUserIdBlob = AES-256-GCM(userId, key = HKDF(entropy)).
  • [ ] Update /auth/phone/lookup to return encryptedUserIdBlob instead of userId. Client decrypts after PIN unlock.
  • [ ] Existing-user migration: re-encrypt user_id blobs at next successful login.
  • [ ] Recovery flow when client clears app data but retains passphrase.
  • [ ] CS workflow redesign (no more "lookup user by phone" from the admin tool).
  • [ ] Anti-fraud detection redesign for userId-only signals.

What this buys you: A subpoena providing a phone returns hashed phone + ciphertext blob; the server cannot decrypt it without the user's passphrase. The remaining userId-keyed records become useful only if the subpoena starts with a userId (typically law enforcement starts with a phone or an email).
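
Since Sprint E has not started, the following is only a speculative sketch of the encryptedUserIdBlob = AES-256-GCM(userId, key = HKDF(entropy)) construction. The function names, the HKDF salt and info label, and the blob's field layout are all assumptions, and in the planned design this would run on the client after unlock, not on the server.

```javascript
const crypto = require('node:crypto');

// Derive a 32-byte key from client-held entropy. The info label is a
// hypothetical domain-separation string, not a value from the spec.
function deriveBlobKey(entropy, salt) {
  return Buffer.from(crypto.hkdfSync('sha256', entropy, salt, 'userId-blob-v1', 32));
}

function sealUserId(userId, key) {
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(userId, 'utf8'), cipher.final()]);
  return {
    iv: iv.toString('base64'),
    tag: cipher.getAuthTag().toString('base64'),
    ciphertext: ciphertext.toString('base64'),
  };
}

// Throws if the key does not match, which is the server's position.
function openUserId(blob, key) {
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, Buffer.from(blob.iv, 'base64'));
  decipher.setAuthTag(Buffer.from(blob.tag, 'base64'));
  const plain = Buffer.concat([
    decipher.update(Buffer.from(blob.ciphertext, 'base64')),
    decipher.final(),
  ]);
  return plain.toString('utf8');
}

const demoBlobKey = deriveBlobKey(Buffer.alloc(32, 3), Buffer.alloc(16, 4));
const demoBlob = sealUserId('uid_12345', demoBlobKey);
console.log(openUserId(demoBlob, demoBlobKey)); // uid_12345
```

The lookup row would then hold only the phone hash and this blob, so resolving a phone to a userId requires the client-side entropy.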

5. Governance (parallel, non-engineering)

  • [ ] Mission Arbiter call on #145 Layers 2–5 (device fingerprinting, behavioral, social-graph, payment-method bans). All four directly contradict Right #6 and the sealed-identity brief's non-negotiables. Currently marked BLOCKED. Recommend formally closing them with a Mission Arbiter decision rather than leaving as "pending."
  • [ ] Privacy policy + ToS audit against shipped architecture after each sprint. Avoid both over-claiming ("we never see your phone" is wrong: we see it transiently to compute the hash) and under-claiming (silence on the encrypted-userId blob loses the marketing benefit).
  • [ ] Subpoena response playbook documenting per-request-shape what we can/cannot produce.
  • [ ] DPIA prep if/when EU market entry is on the roadmap.

6. Residual threats: what stays "theoretically yielding something" even after all sprints

To be straight: after Sprints A–E ship, the smallest defensible attack surface remaining is:

  • Account existence by phone hash. With a known phone and the pepper, a court can compel Lantern to compute the hash and return yes/no. Removing this requires removing auth.
  • Cloud Run access logs. Timestamps + IPs + userIds at the GCP infrastructure layer. App logs have no PII (#307 fixed); the access log itself is structural to running on GCP.
  • GCP as a tenant. A court compelling Google directly is outside our architectural defenses. Mitigation is jurisdiction selection: UK IPA / Australia TOLA are flagged in SEALED_IDENTITY.md as do-not-enter without Stage B already shipped.
  • Aggregate inference under combined queries. k-anonymity ≥ 3 is necessary but not sufficient; differential privacy (Laplace noise on aggregates) is a future Phase 3+ item.

That's the floor. After Sprints A–E we reach it.
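
The "necessary but not sufficient" point about k-anonymity can be shown with a small worked example: two aggregates that each clear the k ≥ 3 threshold can still isolate an individual when subtracted. The helper names below are hypothetical, and the check is a toy, not a proposed defence.

```javascript
const K_MIN = 3;

// A single aggregate is releasable when it covers at least k users.
function isReleasable(count, k = K_MIN) {
  return Number.isInteger(count) && count >= k;
}

// Differencing check: both counts pass the threshold on their own, yet the
// gap between a superset query and a subset query covers fewer than k users.
function differencingLeak(supersetCount, subsetCount, k = K_MIN) {
  const delta = supersetCount - subsetCount;
  return isReleasable(supersetCount, k) && isReleasable(subsetCount, k) && delta > 0 && delta < k;
}

// Example: 4 users matched "visited this week", 3 matched "visited this week
// excluding Friday". Each count is releasable; together they reveal that
// exactly one user visited only on Friday.
console.log(isReleasable(4), isReleasable(3), differencingLeak(4, 3)); // true true true
```

Calibrated noise on each released aggregate, as in the differential-privacy item above, is the usual way to blunt this kind of subtraction.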

7. Decision log

| Date | Decision | Rationale |
| --- | --- | --- |
| 2026-05-10 | Hash users.phone with HMAC-SHA-256 + KMS pepper, not bcrypt | bcrypt is per-row salted → can't equality-query for the hot login path. HMAC + KMS pepper gives equivalent dictionary-attack resistance under the "attacker has DB dump, no pepper" threat model. (Stage A spec §6.1) |
| 2026-05-10 | Descope #145 Layers 2–5 (device + behavioral + social-graph + payment fingerprinting) | Directly contradicts Immutable Right #6. Layer 1 (hashed phone/email ban list) is shipped. Layers 2–5 await Mission Arbiter governance decision. |
| 2026-05-10 | Counsel review is parallel documentation work, not a permission gate | Privacy architecture is not subject to legal-review veto. Counsel describes what's shipped (privacy policy, subpoena playbook, DPIA); they don't authorize what ships. |
| 2026-05-11 | Phase 4 = server-side createUser endpoint, not client-fetches-hash-and-writes | Client-fetches-hash doesn't prevent ban-bypass by malicious client. Server owns the write → ban check is unbypassable. |
| 2026-05-11 | This roadmap document is canonical for the multi-sprint privacy hardening work | Survives context compaction; lets multiple PRs reference a single source of truth. |
| 2026-05-11 | lantern.interest free-text quote stays plaintext at rest; relies on Sprint B's 48h TTL for bounded exposure | Encrypting it with the lighter's key breaks social discovery (nearby users can't read). Proximity-derived shared-key encryption is out of scope pre-launch. User-elected publishing is acknowledged as part of the product. |
| 2026-05-11 | Profile encryption uses lazy migration on next login, not a bulk backfill job | Active-user population converges naturally; inactive accounts age out via separate cleanup. Avoids long-running migrations against the live database. |
| 2026-05-11 | Sprint B split into B.1 (Firestore TTL) and B.2 (BigQuery pseudonymization) | Two distinct infrastructure surfaces. B.1 is straightforward scheduled functions; B.2 requires BigQuery client + Service Account permissions and benefits from its own review/rollout. Both still belong to "Sprint B" in the roadmap narrative. |
| 2026-05-11 | Use scheduled Cloud Functions, not Firestore native TTL, for retention | Firestore native TTL doesn't cascade to sub-collections: connections/{cid}/messages/* would remain after the parent doc deletes. Doing all three retention jobs in one Cloud Functions module keeps the story coherent. |
| 2026-05-11 | Two-tier BigQuery retention model: raw events (user-keyed, 90-day partition expiration) vs aggregated event_counts_daily (no user_id, indefinite retention) | Partition expiration on analytics.events is enforced by BigQuery itself; events older than 90 days don't exist. The aggregation MERGE that builds event_counts_daily groups by (day, event_name, event_tier, environment) only, so user identity is aggregated away at the bridge between tiers. Long-tail merchant analytics stay queryable; user identifiability does not. Verified by inspecting schema 2026-05-11: event_counts_daily has no user_id / entity_id / service_id columns. |

8. References

9. How to use this document

  • Each sprint PR updates this file's checkboxes in §4 to reflect what shipped.
  • Decisions get appended to §7 with date + rationale. Never silently overwrite; append.
  • The threat-model table in §3 gets updated after each sprint merges, to reflect the new floor.
  • Residual threats in §6 are honest: never claim more sealing than the architecture actually provides.
