Privacy Hardening Roadmap

Living document.
Last updated: 2026-05-11
Status: Sprint A in progress
Owner: privacy workstream
Companion brief: SEALED_IDENTITY.md

1. The bar

The architectural goal, agreed 2026-05-11, is that a complete Firestore dump or BigQuery export, or a subpoena providing a phone number, should yield no usefully identifiable user content. Identifiers should be hashed with a pepper the server doesn't directly hold (resident in Cloud KMS long term). Content fields should be encrypted with keys derived from material the server cannot reproduce.

Stated differently: privacy stops being a policy we keep and becomes a property of the data. We can't disclose what we don't structurally hold.

This is aspirational but not theatre — it's the strict reading of Immutable Right #6 (Cofounder Agreement) and Business Plan §3.2 (anonymity, k-anonymity ≥ 3 on merchant-surfaced metrics).

2. What's already shipped

PR	Title	Merged
#479	Sealed-identity Stage A phases 1–2 + log hygiene (#307)	2026-05-10
#485	Stage A phase 3 + `banned_accounts` end-to-end	2026-05-11
#490	Stage A phase 4 — server-side `createUser` (stops writing plaintext `users.phone` on new signups)	2026-05-11

Effect today: new accounts store phoneHash only; existing accounts still have plaintext phone until Sprint C below. Profile encryption (PBKDF2 + AES-GCM via apps/web/src/lib/encryption.js) was already in place before this roadmap began.
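
As a rough illustration of what "phoneHash only" means, here is a minimal sketch of a peppered HMAC-SHA-256 phone hash, the construction named in the §7 decision log. The names hashPhone and normalizePhone, the normalization rule, and the in-memory pepper are assumptions made for this example; per §1 the real pepper is meant to live in Cloud KMS rather than in process memory.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper, not the project's actual implementation.
// Strips formatting so "+1 (555) 010-0000" and "+15550100000" hash identically.
function normalizePhone(phone) {
  return phone.replace(/[^\d+]/g, '');
}

// Peppered HMAC-SHA-256: deterministic, so the login path can equality-query on it,
// but not computable from a database dump alone without the pepper.
function hashPhone(phone, pepper) {
  return crypto
    .createHmac('sha256', pepper)
    .update(normalizePhone(phone), 'utf8')
    .digest('hex');
}

// Placeholder pepper for the sketch only (assumption: the real one is never held like this).
const demoPepper = crypto.randomBytes(32);
const a = hashPhone('+1 (555) 010-0000', demoPepper);
const b = hashPhone('+15550100000', demoPepper);
console.log(a === b); // same phone and same pepper give the same hash
```

Because the hash is deterministic, anyone holding both a phone number and the pepper can confirm account existence, which is the residual threat listed in §6.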

3. Current threat model — what a dump or subpoena yields

Surface	Today (post-Stage-A)	Useful to attacker / court?
`users.phoneHash` (new accounts)	hex hash, no preimage without our KMS pepper	Only confirms "this phone has an account" if you can compute the hash.
`users.phone` (existing accounts, pre-migration only)	plaintext on rows the Sprint C migration script hasn't run against yet	Yes for those rows; eliminated by the operator running `tooling/scripts/drop-plaintext-phone.mjs` after the Stage A flag has soaked.
`users.encryptedSeed`, `phoneSalt`, `authProofHash`, `encryptedBirthDate`, `encryptionCanary`	sealed	No — we cannot decrypt.
`users.lanternName`, `lastLoginAt`, `authMethod`	pseudonym + timestamps	Fuzzy.
`users.encryptedMood`	AES-GCM ciphertext (post-Sprint-A)	No — we cannot decrypt.
`users.encryptedInterests`	AES-GCM ciphertext (post-Sprint-A)	No — we cannot decrypt.
`users.mood` / `users.vibe` / `users.interests` (legacy plaintext)	plaintext on not-yet-migrated docs only	Yes for those docs; eliminated as lazy migration runs on next read.
`lanterns.profileVibe` / `lanterns.profileInterests` (denormalized at light time)	plaintext, but bounded by 48h TTL (Sprint B)	User-elected published content visible to nearby users by design; bounded exposure.
`lanterns.interest` (free-text quote)	plaintext, bounded by 48h TTL (Sprint B)	User-elected published content; see Sprint A scope decision in §4 and decision log §7.
`lanterns.lat`, `lanterns.lng`	post-Sprint-A truncated to ~111m (3 decimals)	Fine-grained movement reconstruction now structurally degraded.
`chats.*` message bodies	encrypted client-side	No
`chats.*` metadata (timestamps, participant userIds)	userId pairs + timestamps	Some — userId linkability remains until Stage B.
`waves.*`	userId pairs + timestamps	Some — same.
`banned_accounts`	hashes only, server-only Firestore rule	No
BigQuery `analytics.events` (raw, user-keyed)	user_id-indexed event stream, 90-day partition expiration enforced by BigQuery itself (`partitioning.expirationMs = 7776000000` on the `timestamp` field). Partitions older than 90 days are auto-deleted — events older than 90 days simply do not exist.	No, for events outside the 90-day window. For events inside the window: Sprint D.2 pseudonymizes a deleted user's events immediately; Sprint B.2 is a safety net in case partition expiration is ever lengthened.
BigQuery `analytics.event_counts_daily` (aggregated export)	columns: `day, event_name, event_tier, environment, count, aggregated_at`. No `user_id`, no `entity_id`, no per-user breakdown. The MERGE in `eventCountsAggregation.service.js` groups by `(day, event_name, event_tier, environment)` only — user identity is aggregated away at the bridge between raw and aggregated tiers. Indefinite retention.	No — there is no per-user information to leak. This is the merchant-facing analytics surface; long-tail retention is intentional and privacy-safe.
Cloud Run access logs	IPs + timestamps + userIds (30-day retention from #307)	Some — structural to GCP.
`users/{uid}` ↔ activity link via plaintext `userId`	direct	Yes — Stage B (Sprint E) seals this last surface.
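
The partition-expiration figure quoted in the `analytics.events` row can be checked with simple arithmetic. The constant names below are illustrative only.

```javascript
// 90 days expressed in milliseconds, the unit used by the expirationMs setting.
const MS_PER_DAY = 24 * 60 * 60 * 1000; // 86,400,000 ms in one day
const RETENTION_DAYS = 90;
const expirationMs = RETENTION_DAYS * MS_PER_DAY;

console.log(expirationMs); // 7776000000
```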

4. Sprint plan

Status legend: 🟦 not started / 🟡 in progress / 🟢 merged

Sprint A — Encrypt the loudest plaintext leaks 🟢

Effort: ~3–4 days. PR: #498 — merged 2026-05-11 02:30 UTC. Follow-up review fixes in [#TBD] (Copilot review on #498 surfaced input validation + NaN edge cases — addressed).

Scope decision (2026-05-11): the lantern free-text quote (lantern.interest) stays plaintext at rest. It's user-elected published content visible to nearby users by design — the product's social-discovery use case (e.g. "Looking for hiking partners") depends on viewers being able to read it. Encrypting it with the lighter's key breaks the feature; encrypting with a venue/proximity-derived key is out of scope pre-launch. Breach exposure is bounded by Sprint B's 48h TTL on lantern docs.

What we still encrypt in Sprint A:

[x] Encrypt users.mood at write, decrypt at read. Stored as encryptedMood. (#281 item 1)
[x] Encrypt users.interests (array) at write, decrypt at read. Stored as encryptedInterests. (#281 item 1)
[x] Truncate lanterns.lat/lng to 3 decimal places (~111 m precision) before write. This is decimal truncation via truncateCoord, not a geohash encoding. Exact coords stay in memory for proximity checks only. (#281 item 4)
[x] Update lantern denormalization: at light time, the lighter's decrypted mood + interests are passed to the Cloud Function via formData.profileVibe / formData.profileInterests and written onto the lantern doc. This keeps LanternMiniProfile working for viewers who can't decrypt the lighter's profile.
[x] Lazy migration: existing user docs get re-encrypted on next profile read when the encryption key is cached. No bulk migration job.
[x] Tests for profileService + lanternService updated.
[x] Follow-up: input validation on formData.profileInterests/profileVibe (server-side caps), truncateCoord rejects non-finite values, additional unit tests for encrypted-read paths (per Copilot review on #498).
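
The lazy-migration item in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code in profileService: migrateOnRead, the document shape, and the injected encrypt function are all hypothetical, and the key is treated as null whenever the client has no cached key.

```javascript
// Hypothetical sketch of lazy migration on profile read.
// `encrypt(plaintext, key)` is injected (for example an AES-GCM helper).
// `key` is null when no encryption key is cached on this client.
function migrateOnRead(userDoc, key, encrypt) {
  const hasLegacy = userDoc.mood !== undefined || userDoc.interests !== undefined;
  // Nothing to migrate, or no cached key: leave the doc untouched and retry on a later read.
  if (!hasLegacy || key === null) {
    return { doc: userDoc, migrated: false };
  }
  const { mood, interests, ...rest } = userDoc;
  const next = { ...rest };
  if (mood !== undefined) {
    next.encryptedMood = encrypt(mood, key);
  }
  if (interests !== undefined) {
    // Arrays are serialized before encryption so they round-trip as one ciphertext.
    next.encryptedInterests = encrypt(JSON.stringify(interests), key);
  }
  // The caller would persist `next`, which no longer carries the plaintext fields.
  return { doc: next, migrated: true };
}
```

The point of the shape is that a read without a cached key is a no-op, so migration converges only as users come back with their keys available.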

What stays plaintext (and why):

lantern.interest (free-text quote) — user-elected published content. Bounded by 48h TTL (Sprint B). Documented in §7 below.
lantern.profileVibe / lantern.profileInterests (denormalized) — copies of the user's mood/interests written to the lantern at light time, so nearby users can read them without holding the lighter's key. Bounded by the same 48h TTL.
lantern.encryptedMetadata — the schema comment in lanternService.js mentions an optional "encrypted user notes" field. We are NOT implementing this in Sprint A. It's a private-note feature for the lighter only — separate from the published mood/interest. Tracked as a future enhancement, not blocking the privacy bar.

What Sprint A buys you: a Firestore dump of users/* shows ciphertext where there used to be plaintext sentences about user vibes and interests. A dump of lanterns/* still shows the published-content fields (intentional), but coordinates are truncated to ~111 m precision, so movement reconstruction from a leak becomes much coarser. Combined with Sprint B's TTL enforcement, the practical exposure window for any lantern data shrinks to roughly 48h (up to ~54h in the worst case, since the purge job runs every 6h).
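
To make the ~111 m figure concrete: one degree of latitude spans roughly 111 km, so the third decimal place corresponds to about 111 m (longitude cells are narrower away from the equator). Below is a minimal sketch of the truncation, assuming truncation toward zero. Only the name truncateCoord and its rejection of non-finite values come from this document; the real helper may round or floor instead.

```javascript
// Sketch of 3-decimal coordinate truncation (assumption: truncates toward zero).
function truncateCoord(value, decimals = 3) {
  if (typeof value !== 'number' || !Number.isFinite(value)) {
    throw new TypeError('coordinate must be a finite number');
  }
  const factor = 10 ** decimals;
  return Math.trunc(value * factor) / factor;
}

console.log(truncateCoord(51.5074)); // 51.507
console.log(truncateCoord(-0.1278)); // -0.127
```

Binary floating point can occasionally pull a value just below a decimal boundary before truncation, which only makes the stored coordinate coarser, never finer.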

Sprint B — Retention and pseudonymization 🟡

Split into two PRs for review-surface reasons.

Sprint B.1 — TTL enforcement (Firestore) 🟡

Effort: ~2 days. PR: claude/privacy-sprint-b (this branch).

[x] Scheduled Cloud Function purging lanterns docs older than 48h from litAt. Runs every 6h.
[x] Scheduled Cloud Function purging waves docs older than 7d from createdAt. Runs daily at 03:00 UTC.
[x] Scheduled Cloud Function purging connections docs older than 30d (by lastActivityAt with createdAt fallback for legacy schema) and cascade-deleting their messages/* sub-collection. Runs daily at 04:00 UTC.
[x] Each job batched (400 ops per Firestore batch) with a per-run cap (5 000 docs) so any single invocation has bounded cost.
[x] Exports registered in main.js.
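
The batching and per-run cap described above can be sketched as a pure planning step. planPurge and the { id, timestampMs } shape are hypothetical; in the real jobs the candidates would come from a Firestore query and each batch would be committed as a Firestore write batch.

```javascript
const BATCH_SIZE = 400; // ops per Firestore batch, per the list above
const RUN_CAP = 5000;   // max docs per invocation, per the list above

// Hypothetical pure helper: selects expired doc ids and groups them into batches.
function planPurge(docs, nowMs, ttlMs, batchSize = BATCH_SIZE, runCap = RUN_CAP) {
  const cutoff = nowMs - ttlMs;
  const expired = docs
    .filter((d) => d.timestampMs < cutoff) // strictly older than the TTL
    .slice(0, runCap)                      // bound the cost of a single run
    .map((d) => d.id);
  const batches = [];
  for (let i = 0; i < expired.length; i += batchSize) {
    batches.push(expired.slice(i, i + batchSize));
  }
  return batches;
}

const HOUR_MS = 60 * 60 * 1000;
const now = Date.now();
const sample = [
  { id: 'old', timestampMs: now - 49 * HOUR_MS },
  { id: 'fresh', timestampMs: now - 1 * HOUR_MS },
];
console.log(planPurge(sample, now, 48 * HOUR_MS)); // [ [ 'old' ] ]
```

Anything beyond the cap is simply left for the next scheduled run, which is what keeps a backlog from turning into one unbounded invocation.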

Sprint B.2 — BigQuery user_id pseudonymization 🟡

Effort: ~1–2 days. PR: claude/privacy-sprint-b2 (this branch).

[x] services/api/analytics/src/services/userIdPseudonymization.service.js: SQL builder + runner that issues UPDATE analytics.events SET user_id = TO_HEX(SHA256(CONCAT(user_id, @salt))) against rows older than 90 days. Salt is crypto.randomBytes(32).toString('hex'), generated fresh per invocation, never persisted, logged, or returned.
[x] Idempotency: NOT REGEXP_CONTAINS(user_id, r'^[a-f0-9]{64}$') skips already-pseudonymized rows.
[x] Dry-run mode: returns COUNT(*) of would-affect rows without mutating.
[x] Range guard: ageDays clamped to [30, 730]. A too-low value would pseudonymize fresh operational data.
[x] HTTP entrypoint: POST /analytics/scheduled/pseudonymize-user-ids (Cloud-Scheduler-header OR admin auth, same pattern as the existing aggregation jobs).
[x] 8 vitest cases covering SQL shape, idempotency regex, ageDays validation, dry-run/UPDATE result shapes, salt-freshness, and the cryptographic-erasure invariant.
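
A minimal sketch of how the SQL builder, range guard, dry-run mode, and per-invocation salt fit together. The function names, the returned { query, params } shape, and the exact WHERE clause are assumptions assembled from the bullets above, not the contents of userIdPseudonymization.service.js. The salt appears in the returned params only so a caller could hand it to the BigQuery client; the real service may structure this differently.

```javascript
const crypto = require('node:crypto');

const MIN_AGE_DAYS = 30;
const MAX_AGE_DAYS = 730;

// Clamp ageDays into [30, 730] so a bad value can't touch fresh operational data.
function clampAgeDays(ageDays) {
  if (!Number.isFinite(ageDays)) throw new TypeError('ageDays must be a finite number');
  return Math.min(MAX_AGE_DAYS, Math.max(MIN_AGE_DAYS, Math.floor(ageDays)));
}

// Hypothetical builder: returns a parameterized statement plus its named params.
function buildPseudonymizeQuery({ ageDays = 90, dryRun = false } = {}) {
  const where = [
    'timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL @ageDays DAY)',
    // Idempotency: skip rows whose user_id already looks like a SHA-256 hex digest.
    "AND NOT REGEXP_CONTAINS(user_id, r'^[a-f0-9]{64}$')",
  ].join(' ');
  const params = { ageDays: clampAgeDays(ageDays) };
  if (dryRun) {
    // Dry run only counts; no salt is generated and nothing is mutated.
    return { query: `SELECT COUNT(*) AS would_affect FROM analytics.events WHERE ${where}`, params };
  }
  // Fresh salt per invocation; it exists only in this object and is never persisted.
  params.salt = crypto.randomBytes(32).toString('hex');
  return {
    query: `UPDATE analytics.events SET user_id = TO_HEX(SHA256(CONCAT(user_id, @salt))) WHERE ${where}`,
    params,
  };
}
```

Since the salt is discarded after the statement runs, nobody can recompute the mapping from a userId to its pseudonym afterwards.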

Operator follow-up (after merge):

[ ] Create the Cloud Scheduler job: daily at 05:00 UTC POST /analytics/scheduled/pseudonymize-user-ids.
[ ] Confirm analytics service account has roles/bigquery.dataEditor on the analytics dataset.
[ ] Privacy-policy line: "Analytics events older than 90 days are pseudonymized — even Lantern cannot link them to a current account."

What this buys you (combined): A breach today can't expose data outside the retention window. Firestore content is purged within hours/days of its useful life ending. BigQuery events older than 90 days become aggregable but no longer user-linkable, even by Lantern.

Sprint C — Finish Stage A (Phase 5) 🟡

Effort: ~2 days. PR: claude/privacy-sprint-c (this branch). Code changes merge in this PR; the migration script is operator-run after deploy.

Code (merged with this PR):

[x] tooling/scripts/drop-plaintext-phone.mjs — cursor-paginated, --dry-run + --limit=N flags, idempotent. Refuses to drop phone from a row that doesn't already have phoneHash (defense-in-depth against running out of order).
[x] Updated services/functions/firebase/modules/phoneLookup.js — uses phoneHash when STAGE_A_PHONE_HASH_LOOKUP_ENABLED=true; falls back to plaintext otherwise. Same selectLookupQuery pattern as services/api/auth/src/routes/phone.js.
[x] Updated services/api/auth/src/routes/phoneRecycling.js — same Stage A flag pattern for the reclaim-by-phone lookup.
[x] Audited remaining users.phone reads. The only Firestore queries on the plaintext field were the two updated above. Other phone references in the codebase point at adminProfiles.phone (separate collection) or Firebase Auth user.phoneNumber (managed by Firebase Auth, not Firestore) — none of those are affected.
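
The flag-gated lookup described above might look roughly like this. chooseLookup is a hypothetical stand-in; the real selectLookupQuery signature is not shown in this document, and hashPhone is injected here rather than assumed.

```javascript
// Hypothetical sketch of the Stage A flag pattern for phone lookups.
function chooseLookup(env, phone, hashPhone) {
  if (env.STAGE_A_PHONE_HASH_LOOKUP_ENABLED === 'true') {
    // Flag on: query by the hashed identifier only.
    return { field: 'phoneHash', value: hashPhone(phone) };
  }
  // Flag off or unset: fall back to the legacy plaintext field.
  return { field: 'phone', value: phone };
}
```

This is why the flag has to be on everywhere before plaintext phones are dropped: a runtime still on the fallback branch would query a field that no longer exists.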

Operator runbook (after merge):

[ ] Confirm STAGE_A_PHONE_HASH_LOOKUP_ENABLED=true is set on both auth-api Cloud Run AND the Cloud Functions runtime in every target environment.
[ ] Confirm tooling/scripts/backfill-phone-hash.mjs has been run on the target environment so every existing users/* row has phoneHash populated.
[ ] Let the new code soak with the flag on for at least 48h to surface any plaintext-path regressions.
[ ] Run tooling/scripts/drop-plaintext-phone.mjs --dry-run and inspect the counts. Expect Would drop = (#rows with plaintext phone), Skipped (missing phoneHash) = 0. If any rows are missing the hash, stop and re-run the backfill before continuing.
[ ] Run tooling/scripts/drop-plaintext-phone.mjs (no --dry-run). Idempotent — safe to re-run if interrupted.
[ ] Optional: drop any composite Firestore index on users.phone if one exists. Single-field indexes are auto-created and will be auto-removed by Firestore over time once nothing queries the field. No active harm from leaving them.
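
The dry-run counts the runbook asks the operator to inspect can be sketched as a simple classification. classifyRows and the count names are hypothetical, chosen to mirror the "Would drop" and "Skipped (missing phoneHash)" labels above; the real script's output format is not shown in this document.

```javascript
// Hypothetical classifier mirroring the dry-run counts in the runbook.
function classifyRows(rows) {
  const counts = { wouldDrop: 0, skippedMissingHash: 0, alreadyClean: 0 };
  for (const row of rows) {
    if (row.phone === undefined) {
      counts.alreadyClean += 1;       // idempotent: nothing left to drop
    } else if (!row.phoneHash) {
      counts.skippedMissingHash += 1; // refuse: the backfill hasn't covered this row
    } else {
      counts.wouldDrop += 1;          // safe: the hash exists, plaintext can go
    }
  }
  return counts;
}

console.log(classifyRows([
  { phone: '+15550100000', phoneHash: 'abc' },
  { phone: '+15550100001' },
  { phoneHash: 'def' },
]));
```

A non-zero skippedMissingHash is the stop condition from the runbook: re-run the backfill before the real drop.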

What this buys you: The "plaintext phone (existing accounts)" row in the §3 threat-model table goes away entirely. A Firestore dump of users/* post-migration shows hashed phones across the board — no remaining direct-PII leak via that surface.

Sprint D — GDPR cryptographic erasure cascade 🟡

Split into separate PRs for review-surface reasons (mirrors the Sprint B split rationale).

Sprint D.1 — Server cascade (Firestore + Auth + Storage) 🟡

Effort: ~2 days. PR: claude/privacy-sprint-d (this branch).

[x] services/functions/firebase/modules/userDeletion.js: deleteUserCompletely callable Cloud Function. Cascades across:
- Firebase Auth user record
- users/{userId} Firestore doc
- lanterns where userId = X
- waves where senderId = X OR receiverId = X (two-pass query)
- connections where participants contains X, with messages sub-collection cascade
- Cloud Storage avatars/{userId}/* (best-effort)
[x] Authorization: admin-deleting-other (verifyAdmin) OR self-delete (callerUid matches userId).
[x] Audit row in adminActions — userId-only references per docs/privacy/LOG_HYGIENE.md, capped 280-char reason. Returns counts per surface.
[x] Batched writes (400/batch) with safety break on runaway loops.
[x] Resilient: each cascade is wrapped in try/catch and continues on partial failure. Auth delete treats auth/user-not-found as success.
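
The resilience rule in the list above (each cascade wrapped in try/catch, with auth/user-not-found treated as success) might look roughly like this. runCascade and the { name, run } step shape are hypothetical; the real deleteUserCompletely is not shown in this document.

```javascript
// Hypothetical runner: each step is { name, run }, where run() resolves to a count
// of records removed on that surface.
async function runCascade(steps) {
  const counts = {};
  const errors = [];
  for (const { name, run } of steps) {
    try {
      counts[name] = await run();
    } catch (err) {
      counts[name] = 0;
      if (err && err.code === 'auth/user-not-found') {
        continue; // already gone: treat as success, not as a failure
      }
      // Record the failure and keep going so one bad surface doesn't block the rest.
      errors.push({ name, message: String(err && err.message) });
    }
  }
  return { counts, errors };
}
```

Returning the errors alongside the per-surface counts lets the caller see a partial failure without the cascade having aborted midway.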

Sprint D.2 — BigQuery pseudonymization-on-deletion 🟡

Effort: ~0.5 day. PR: claude/privacy-sprint-d2 (this branch).

[x] pseudonymizeUserBigQueryEvents(userId, projectId) helper inline in services/functions/firebase/modules/userDeletion.js. Mirrors the bulk job from Sprint B.2 but scoped to a single userId with a fresh per-call salt (crypto.randomBytes(32)).
[x] Wired into the deleteUserCompletely cascade as step 7 (between user-doc + Auth cleanup and the audit row). Best-effort: BigQuery failures are logged and swallowed so they don't roll back the Firestore/Auth deletion that's already completed.
[x] BigQuery dep available transitively via @lantern/forge (already a dep of services/functions/firebase). No new top-level dep added.
[x] counts.bigQueryEvents added to the return shape so the caller can see how many rows were pseudonymized.

Architectural choice (worth noting): inlined in the Cloud Function rather than calling the analytics service via HTTPS. Cross-service auth (Cloud Function → Cloud Run with IAM ID tokens) adds setup. The BigQuery client is small enough that inlining is the lower-friction call. If we ever extract analytics into its own deploy with strict isolation, this is a clean refactor candidate.
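
To show why a discarded per-call salt amounts to erasure, here is a local mirror of the SQL expression TO_HEX(SHA256(CONCAT(user_id, salt))) in Node. The pseudonymize function and the sample userId are illustrative assumptions; the real helper runs the hashing inside BigQuery.

```javascript
const crypto = require('node:crypto');

// Local mirror of TO_HEX(SHA256(CONCAT(user_id, salt))), for illustration only.
function pseudonymize(userId, saltHex) {
  return crypto.createHash('sha256').update(userId + saltHex, 'utf8').digest('hex');
}

const saltA = crypto.randomBytes(32).toString('hex');
const saltB = crypto.randomBytes(32).toString('hex');
const p1 = pseudonymize('user_123', saltA);
const p2 = pseudonymize('user_123', saltB);

// The output has the 64-hex-character shape that the Sprint B.2 idempotency check skips.
console.log(/^[a-f0-9]{64}$/.test(p1));
// The same userId under two salts yields unrelated pseudonyms, so without the
// discarded salt the mapping can't be recomputed.
console.log(p1 === p2);
```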

Sprint D.3 — UI integration 🟦

Effort: ~1 day. PR: (TBD — separate UI work)

[ ] Admin panel button for moderator-triggered deletion (with reason field).
[ ] User-initiated "delete my account" path from profile settings, with confirmation.
[ ] Privacy policy line: "we cascade and cryptographically erase on request."

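What this buys you (combined): A deletion request, whether user-initiated or required under GDPR Art. 17, removes or pseudonymizes the user's Firestore, Auth, Storage, and BigQuery records within seconds rather than waiting out the 90-day window. Cloud Run access logs (§3, §6) are the exception and age out on their own 30-day retention. The "account deletion leaves orphaned data" gap from #281 item 3 is closed.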

Sprint E — Stage B (seal the userId resolution) 🟦

Effort: ~2 sprints, ~8–10 working days. PR: (TBD, likely multiple)

[ ] Replace plaintext userId in the auth-lookup row with encryptedUserIdBlob = AES-256-GCM(userId, key = HKDF(entropy)).
[ ] Update /auth/phone/lookup to return encryptedUserIdBlob instead of userId. Client decrypts after PIN unlock.
[ ] Existing-user migration: re-encrypt user_id blobs at next successful login.
[ ] Recovery flow when client clears app data but retains passphrase.
[ ] CS workflow redesign (no more "lookup user by phone" from the admin tool).
[ ] Anti-fraud detection redesign for userId-only signals.
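
A minimal sketch of the encryptedUserIdBlob idea from the first item, using Node's built-in crypto. Everything beyond "AES-256-GCM with an HKDF-derived key" is an assumption: the function names, the HKDF info label, the salt handling, and the iv + tag + ciphertext blob layout are invented for this example, and in the planned design the derivation would happen on the client, not the server.

```javascript
const crypto = require('node:crypto');

// Derive a 32-byte AES key from client-held entropy. The info label is a placeholder.
function deriveKey(entropy, salt) {
  return Buffer.from(crypto.hkdfSync('sha256', entropy, salt, 'userid-blob-sketch', 32));
}

// Encrypt the userId; the blob layout is iv (12 bytes) + auth tag (16 bytes) + ciphertext.
function sealUserId(userId, key) {
  const iv = crypto.randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(userId, 'utf8'), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString('base64');
}

// Decrypt and authenticate; throws if the key is wrong or the blob was tampered with.
function openUserId(blob, key) {
  const raw = Buffer.from(blob, 'base64');
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, raw.subarray(0, 12));
  decipher.setAuthTag(raw.subarray(12, 28));
  return Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]).toString('utf8');
}

const key = deriveKey(crypto.randomBytes(32), crypto.randomBytes(16));
const blob = sealUserId('user_123', key);
console.log(openUserId(blob, key)); // user_123
```

A server that stores only the blob, and never sees the entropy, has nothing to decrypt it with, which is the property Stage B is after.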

What this buys you: A subpoena providing a phone returns hashed phone + ciphertext blob — the server cannot decrypt it without the user's passphrase. The remaining userId-keyed records become useful only if the subpoena starts with a userId (typically law enforcement starts with a phone or an email).

5. Governance (parallel, non-engineering)

[ ] Mission Arbiter call on #145 Layers 2–5 (device fingerprinting, behavioral, social-graph, payment-method bans). All four directly contradict Right #6 and the sealed-identity brief's non-negotiables. Currently marked BLOCKED. Recommend formally closing them with a Mission Arbiter decision rather than leaving as "pending."
[ ] Privacy policy + ToS audit against shipped architecture after each sprint. Avoid both over-claiming ("we never see your phone" — we see it transiently to compute the hash) and under-claiming (silence on the encrypted-userId blob loses the marketing benefit).
[ ] Subpoena response playbook documenting per-request-shape what we can/cannot produce.
[ ] DPIA prep if/when EU market entry is on the roadmap.

6. Residual threats — what stays "theoretically yielding something" even after all sprints

To be straight: after Sprints A–E ship, the smallest defensible attack surface remaining is:

Account existence by phone hash. With a known phone and the pepper, a court can compel Lantern to compute the hash and return yes/no. Removing this requires removing auth.
Cloud Run access logs. Timestamps + IPs + userIds at the GCP infrastructure layer. App logs have no PII (#307 fixed); the access log itself is structural to running on GCP.
GCP as a tenant. A court compelling Google directly is outside our architectural defenses. Mitigation is jurisdiction selection — UK IPA / Australia TOLA flagged in SEALED_IDENTITY.md as do-not-enter without Stage B already shipped.
Aggregate inference under combined queries. k-anonymity ≥ 3 is necessary but not sufficient; differential privacy (Laplace noise on aggregates) is a future Phase 3+ item.

That's the floor. After Sprints A–E we reach it.
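
A small worked example of why k-anonymity ≥ 3 is not sufficient on its own: two aggregate queries that each pass the threshold can still isolate one person when subtracted. The data and the countVisitors helper are invented for illustration and don't reflect the real analytics schema.

```javascript
const K = 3;

// Hypothetical aggregate: returns a count only when at least K users contribute.
function countVisitors(visits, predicate) {
  const n = visits.filter(predicate).length;
  return n >= K ? n : null; // suppressed below the threshold
}

const visits = [
  { user: 'a', hour: 20 }, { user: 'b', hour: 20 },
  { user: 'c', hour: 20 }, { user: 'd', hour: 21 },
];

const all = countVisitors(visits, () => true);              // 4, passes k >= 3
const before21 = countVisitors(visits, (v) => v.hour < 21); // 3, passes k >= 3
console.log(all - before21); // 1: exactly one user visited at 21:00 or later
```

Noise added to each released count, as in the differential-privacy item above, is what blunts this kind of subtraction.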

7. Decision log

Date	Decision	Rationale
2026-05-10	Hash `users.phone` with HMAC-SHA-256 + KMS pepper, not bcrypt	bcrypt is per-row salted → can't equality-query for the hot login path. HMAC + KMS pepper gives equivalent dictionary-attack resistance under the "attacker has DB dump, no pepper" threat model. (Stage A spec §6.1)
2026-05-10	Descope #145 Layers 2–5 (device + behavioral + social-graph + payment fingerprinting)	Directly contradicts Immutable Right #6. Layer 1 (hashed phone/email ban list) is shipped. Layers 2–5 await Mission Arbiter governance decision.
2026-05-10	Counsel review is parallel documentation work, not a permission gate	Privacy architecture is not subject to legal-review veto. Counsel describes what's shipped (privacy policy, subpoena playbook, DPIA); they don't authorize what ships.
2026-05-11	Phase 4 = server-side `createUser` endpoint, not client-fetches-hash-and-writes	Client-fetches-hash doesn't prevent ban-bypass by malicious client. Server owns the write → ban check is unbypassable.
2026-05-11	This roadmap document is canonical for the multi-sprint privacy hardening work	Survives context compaction; lets multiple PRs reference a single source of truth.
2026-05-11	`lantern.interest` free-text quote stays plaintext at rest; relies on Sprint B's 48h TTL for bounded exposure	Encrypting it with the lighter's key breaks social discovery (nearby users can't read). Proximity-derived shared-key encryption is out of scope pre-launch. User-elected publishing is acknowledged as part of the product.
2026-05-11	Profile encryption uses lazy migration on next login, not a bulk backfill job	Active-user population converges naturally; inactive accounts age out via separate cleanup. Avoids long-running migrations against the live database.
2026-05-11	Sprint B split into B.1 (Firestore TTL) and B.2 (BigQuery pseudonymization)	Two distinct infrastructure surfaces. B.1 is straightforward scheduled functions; B.2 requires BigQuery client + Service Account permissions and benefits from its own review/rollout. Both still belong to "Sprint B" in the roadmap narrative.
2026-05-11	Use scheduled Cloud Functions, not Firestore native TTL, for retention	Firestore native TTL doesn't cascade to sub-collections — `connections/{cid}/messages/*` would remain after the parent doc deletes. Doing all three retention jobs in one Cloud Functions module keeps the story coherent.
2026-05-11	Two-tier BigQuery retention model: raw events (user-keyed, 90-day partition expiration) vs aggregated `event_counts_daily` (no user_id, indefinite retention)	Partition expiration on `analytics.events` is enforced by BigQuery itself — events older than 90 days don't exist. The aggregation MERGE that builds `event_counts_daily` groups by `(day, event_name, event_tier, environment)` only, so user identity is aggregated away at the bridge between tiers. Long-tail merchant analytics stay queryable; user identifiability does not. Verified by inspecting schema 2026-05-11: `event_counts_daily` has no `user_id` / `entity_id` / `service_id` columns.

8. References

Brief: docs/privacy/SEALED_IDENTITY.md
Stage A design spec: docs/superpowers/specs/2026-05-10-sealed-identity-stage-a-design.md
Stage A spike plan: docs/superpowers/plans/2026-05-10-sealed-identity-spike.md
Phase 4 plan: docs/superpowers/plans/2026-05-11-sealed-identity-stage-a-phase-4-impl.md
Privacy audit issue: #281
GDPR deletion + BQ pseudonymization issue: #308
Existing privacy docs: HOW_ENCRYPTION_WORKS.md, PRIVACY_PRESERVING_DATA_COLLECTION.md
Log hygiene policy: docs/privacy/LOG_HYGIENE.md

9. How to use this document

Each sprint PR updates this file's checkboxes in §4 to reflect what shipped.
Decisions get appended to §7 with date + rationale. Never silently overwrite — append.
The threat-model table in §3 gets updated after each sprint merges, to reflect the new floor.
Residual threats in §6 are honest — never claim more sealing than the architecture actually provides.
