Privacy Hardening Roadmap
Living document. Last updated: 2026-05-11. Status: Sprint A in progress. Owner: privacy workstream. Companion brief: SEALED_IDENTITY.md
1. The bar
The architectural goal, agreed 2026-05-11, is that a complete Firestore dump or BigQuery export, OR a subpoena providing a phone number, should yield no usefully identifiable user content. Identifiers should be hashed against a KMS pepper the server doesn't directly hold (Cloud KMS-resident long term). Content fields should be encrypted with keys derived from material the server cannot reproduce.
Stated differently: privacy stops being a policy we keep and becomes a property of the data. We can't disclose what we don't structurally hold.
This is aspirational but not theatre: it's the strict reading of Immutable Right #6 (Cofounder Agreement) and Business Plan §3.2 (anonymity, k-anonymity ≥ 3 on merchant-surfaced metrics).
2. What's already shipped
| PR | Title | Merged |
|---|---|---|
| #479 | Sealed-identity Stage A phases 1–2 + log hygiene (#307) | 2026-05-10 |
| #485 | Stage A phase 3 + banned_accounts end-to-end | 2026-05-11 |
| #490 | Stage A phase 4: server-side createUser (stops writing plaintext users.phone on new signups) | 2026-05-11 |
Effect today: new accounts store phoneHash only; existing accounts still have plaintext phone until Sprint C below. Profile encryption (PBKDF2 + AES-GCM via apps/web/src/lib/encryption.js) was already in place before this roadmap began.
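
As a rough illustration of why a keyed hash still supports equality lookups, here is a minimal sketch of an HMAC-SHA-256 phone hash. The helper name `hashPhone`, the E.164 validation, and the locally generated pepper are assumptions made so the example runs on its own. In the architecture described here the pepper is KMS-resident, and this is not the code shipped in #479–#490.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper (not from the codebase): HMAC-SHA-256 over an E.164 number.
// The pepper is passed in as a Buffer purely so the sketch runs locally;
// the roadmap's design keeps the real pepper in Cloud KMS.
function hashPhone(phoneE164, pepper) {
  if (!/^\+[1-9]\d{1,14}$/.test(phoneE164)) {
    throw new TypeError('expected an E.164 phone number such as +14155550123');
  }
  return crypto.createHmac('sha256', pepper).update(phoneE164, 'utf8').digest('hex');
}

const pepper = crypto.randomBytes(32); // stand-in for the KMS-held pepper
const first = hashPhone('+14155550123', pepper);
const again = hashPhone('+14155550123', pepper);
const otherPepper = hashPhone('+14155550123', crypto.randomBytes(32));

// Same pepper: identical hash, so an equality query on phoneHash works.
console.log('deterministic under one pepper:', first === again);
// Different pepper: unrelated hash, so a dump without the pepper gives no preimage shortcut.
console.log('differs under another pepper:', first !== otherPepper);
```

The deterministic output is what makes the hot login path queryable; the secrecy of the pepper is what keeps a database dump from being brute-forced over the phone-number space.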
3. Current threat model: what a dump or subpoena yields
| Surface | Today (post-Stage-A) | Useful to attacker / court? |
|---|---|---|
| users.phoneHash (new accounts) | hex hash, no preimage without our KMS pepper | Only confirms "this phone has an account" if you can compute the hash. |
| users.phone (existing accounts, pre-migration only) | plaintext on rows the Sprint C migration script hasn't run against yet | Yes for those rows; eliminated by the operator running tooling/scripts/drop-plaintext-phone.mjs after the Stage A flag has soaked. |
| users.encryptedSeed, phoneSalt, authProofHash, encryptedBirthDate, encryptionCanary | sealed | No. We cannot decrypt. |
| users.lanternName, lastLoginAt, authMethod | pseudonym + timestamps | Fuzzy. |
| users.encryptedMood | AES-GCM ciphertext (post-Sprint-A) | No. We cannot decrypt. |
| users.encryptedInterests | AES-GCM ciphertext (post-Sprint-A) | No. We cannot decrypt. |
| users.mood / users.vibe / users.interests (legacy plaintext) | plaintext on not-yet-migrated docs only | Yes for those docs; eliminated as lazy migration runs on next read. |
| lanterns.profileVibe / lanterns.profileInterests (denormalized at light time) | plaintext, but bounded by 48h TTL (Sprint B) | User-elected published content visible to nearby users by design; bounded exposure. |
| lanterns.interest (free-text quote) | plaintext, bounded by 48h TTL (Sprint B) | User-elected published content; see Sprint A scope decision in §4 and decision log §7. |
| lanterns.lat, lanterns.lng | post-Sprint-A truncated to ~111m (3 decimals) | Fine-grained movement reconstruction now structurally degraded. |
| chats.* message bodies | encrypted client-side | No |
| chats.* metadata (timestamps, participant userIds) | userId pairs + timestamps | Some. userId linkability remains until Stage B. |
| waves.* | userId pairs + timestamps | Some. Same as above. |
| banned_accounts | hashes only, server-only Firestore rule | No |
| BigQuery analytics.events (raw, user-keyed) | user_id-indexed event stream, 90-day partition expiration enforced by BigQuery itself (partitioning.expirationMs = 7776000000 on the timestamp field). Partitions older than 90 days are auto-deleted, so events older than 90 days simply do not exist. | No, for events outside the 90-day window. For events inside the window: Sprint D.2 pseudonymizes a deleted user's events immediately; Sprint B.2 is a safety net in case partition expiration is ever lengthened. |
| BigQuery analytics.event_counts_daily (aggregated export) | columns: day, event_name, event_tier, environment, count, aggregated_at. No user_id, no entity_id, no per-user breakdown. The MERGE in eventCountsAggregation.service.js groups by (day, event_name, event_tier, environment) only, so user identity is aggregated away at the bridge between raw and aggregated tiers. Indefinite retention. | No. There is no per-user information to leak. This is the merchant-facing analytics surface; long-tail retention is intentional and privacy-safe. |
| Cloud Run access logs | IPs + timestamps + userIds (30-day retention from #307) | Some. Structural to GCP. |
| users/{uid} → activity link via plaintext userId | direct | Yes. Stage B (Sprint E) seals this last surface. |
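
To make the "aggregated away at the bridge" claim concrete, the following is an in-memory sketch of grouping raw events by (day, event_name, event_tier, environment). It is only an illustration of the grouping idea, not the MERGE in eventCountsAggregation.service.js. The function name `aggregateDaily` and the shape of the input events are assumptions.

```javascript
// Hypothetical in-memory analogue of the daily aggregation.
// Input events carry user_id; output rows deliberately do not.
function aggregateDaily(events) {
  const counts = new Map();
  for (const e of events) {
    const day = new Date(e.timestamp).toISOString().slice(0, 10); // UTC calendar day
    // The grouping key contains no user_id, so identity cannot survive the fold.
    const key = JSON.stringify([day, e.event_name, e.event_tier, e.environment]);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts].map(([key, count]) => {
    const [day, event_name, event_tier, environment] = JSON.parse(key);
    return { day, event_name, event_tier, environment, count };
  });
}

const sampleEvents = [
  { user_id: 'u1', timestamp: '2026-05-10T09:00:00Z', event_name: 'lantern_lit', event_tier: 'core', environment: 'prod' },
  { user_id: 'u2', timestamp: '2026-05-10T21:30:00Z', event_name: 'lantern_lit', event_tier: 'core', environment: 'prod' },
  { user_id: 'u1', timestamp: '2026-05-11T08:00:00Z', event_name: 'wave_sent', event_tier: 'core', environment: 'prod' },
];
const dailyRows = aggregateDaily(sampleEvents);
console.log(dailyRows);
```

Because the key is built only from the four grouping columns, there is no code path by which a user identifier could reach the aggregated rows.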
4. Sprint plan
Status legend: 🟦 not started / 🟡 in progress / 🟢 merged
Sprint A: Encrypt the loudest plaintext leaks 🟢
Effort: ~3–4 days. PR: #498, merged 2026-05-11 02:30 UTC. Follow-up review fixes in [#TBD] (Copilot review on #498 surfaced input validation + NaN edge cases; addressed).
Scope decision (2026-05-11): the lantern free-text quote (lantern.interest) stays plaintext at rest. It's user-elected published content visible to nearby users by design: the product's social-discovery use case (e.g. "Looking for hiking partners") depends on viewers being able to read it. Encrypting it with the lighter's key breaks the feature; encrypting with a venue/proximity-derived key is out of scope pre-launch. Breach exposure is bounded by Sprint B's 48h TTL on lantern docs.
What we still encrypt in Sprint A:
- [x] Encrypt `users.mood` at write, decrypt at read. Stored as `encryptedMood`. (#281 item 1)
- [x] Encrypt `users.interests` (array) at write, decrypt at read. Stored as `encryptedInterests`. (#281 item 1)
- [x] Geohash-truncate `lanterns.lat/lng` to ~3 decimal places (~111 m precision) before write. Exact coords stay in memory for proximity checks only. (#281 item 4)
- [x] Update lantern denormalization: at light time, the lighter's decrypted mood + interests are passed to the Cloud Function via `formData.profileVibe` / `formData.profileInterests` and written onto the lantern doc. This keeps `LanternMiniProfile` working for viewers who can't decrypt the lighter's profile.
- [x] Lazy migration: existing user docs get re-encrypted on next profile read when the encryption key is cached. No bulk migration job.
- [x] Tests for profileService + lanternService updated.
- [x] Follow-up: input validation on `formData.profileInterests` / `profileVibe` (server-side caps), `truncateCoord` rejects non-finite values, additional unit tests for encrypted-read paths (per Copilot review on #498).
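
The coordinate truncation and the non-finite guard from the checklist can be sketched as below. The name `truncateCoord` comes from the checklist, but this body is an assumption about its behaviour, not the code from #498. Three decimal places of latitude correspond to roughly 111 m, since one degree of latitude is about 111 km. The longitude step is shorter away from the equator.

```javascript
const SCALE = 1000; // 3 decimal places

// Sketch only: truncates toward zero and refuses NaN / Infinity / non-numbers,
// so a bad client value fails loudly instead of being written as NaN.
function truncateCoord(value) {
  if (typeof value !== 'number' || !Number.isFinite(value)) {
    throw new RangeError('coordinate must be a finite number');
  }
  return Math.trunc(value * SCALE) / SCALE;
}

const stored = {
  lat: truncateCoord(37.7749),
  lng: truncateCoord(-122.4194),
};
console.log(stored);
```

Truncating (rather than rounding) is one possible choice. Either way, the stored value cannot distinguish positions that share the same three-decimal prefix.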
What stays plaintext (and why):
- `lantern.interest` (free-text quote): user-elected published content. Bounded by 48h TTL (Sprint B). Documented in §7 below.
- `lantern.profileVibe` / `lantern.profileInterests` (denormalized): copies of the user's mood/interests written to the lantern at light time, so nearby users can read them without holding the lighter's key. Bounded by the same 48h TTL.
- `lantern.encryptedMetadata`: the schema comment in `lanternService.js` mentions an optional "encrypted user notes" field. We are NOT implementing this in Sprint A. It's a private-note feature for the lighter only, separate from the published mood/interest. Tracked as a future enhancement, not blocking the privacy bar.
What Sprint A buys you: a Firestore dump of users/* shows ciphertext where there used to be plaintext sentences about user vibes and interests. A dump of lanterns/* still shows the published-content fields (intentional), but coordinates are truncated to ~111m precision so user-movement reconstruction from a leak is rendered much coarser. Combined with Sprint B's TTL enforcement, the practical exposure window for any lantern data closes to 48h.
Sprint B: Retention and pseudonymization 🟡
Split into two PRs for review-surface reasons.
Sprint B.1: TTL enforcement (Firestore) 🟡
Effort: ~2 days. PR: claude/privacy-sprint-b (this branch).
- [x] Scheduled Cloud Function purging `lanterns` docs older than 48h from `litAt`. Runs every 6h.
- [x] Scheduled Cloud Function purging `waves` docs older than 7d from `createdAt`. Runs daily at 03:00 UTC.
- [x] Scheduled Cloud Function purging `connections` docs older than 30d (by `lastActivityAt` with `createdAt` fallback for legacy schema) and cascade-deleting their `messages/*` sub-collection. Runs daily at 04:00 UTC.
- [x] Each job batched (400 ops per Firestore batch) with a per-run cap (5 000 docs) so any single invocation has bounded cost.
- [x] Exports registered in `main.js`.
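
The batching and per-run cap described above can be sketched without any Firestore dependency. The helper names `cutoffDate` and `planBatches` are hypothetical. The 400 and 5 000 figures are the ones stated in the checklist.

```javascript
const BATCH_SIZE = 400; // ops per Firestore batch, per the checklist
const RUN_CAP = 5000;   // max docs handled by one invocation

// Everything strictly older than the returned instant is eligible for purge.
function cutoffDate(now, maxAgeMs) {
  return new Date(now.getTime() - maxAgeMs);
}

// Splits candidate doc IDs into batches, dropping anything beyond the run cap.
// Docs past the cap are simply picked up by the next scheduled run.
function planBatches(docIds, batchSize = BATCH_SIZE, runCap = RUN_CAP) {
  const capped = docIds.slice(0, runCap);
  const batches = [];
  for (let i = 0; i < capped.length; i += batchSize) {
    batches.push(capped.slice(i, i + batchSize));
  }
  return batches;
}

const HOURS = 60 * 60 * 1000;
const lanternCutoff = cutoffDate(new Date('2026-05-11T00:00:00Z'), 48 * HOURS);
const expiredIds = Array.from({ length: 6000 }, (_, i) => `lantern-${i}`);
const plan = planBatches(expiredIds);
console.log(lanternCutoff.toISOString(), plan.length, plan[plan.length - 1].length);
```

With 6 000 candidates, the cap leaves 5 000 for this run: twelve full batches of 400 and one of 200.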
Sprint B.2: BigQuery user_id pseudonymization 🟡
Effort: ~1โ2 days. PR: claude/privacy-sprint-b2 (this branch).
- [x] `services/api/analytics/src/services/userIdPseudonymization.service.js`: SQL builder + runner that issues `UPDATE analytics.events SET user_id = TO_HEX(SHA256(CONCAT(user_id, @salt)))` against rows older than 90 days. Salt is `crypto.randomBytes(32).toString('hex')`, generated fresh per invocation, never persisted, logged, or returned.
- [x] Idempotency: `NOT REGEXP_CONTAINS(user_id, r'^[a-f0-9]{64}$')` skips already-pseudonymized rows.
- [x] Dry-run mode: returns `COUNT(*)` of would-affect rows without mutating.
- [x] Range guard: `ageDays` clamped to [30, 730]. A too-low value would pseudonymize fresh operational data.
- [x] HTTP entrypoint: `POST /analytics/scheduled/pseudonymize-user-ids` (Cloud-Scheduler-header OR admin auth, same pattern as the existing aggregation jobs).
- [x] 8 vitest cases covering SQL shape, idempotency regex, ageDays validation, dry-run/UPDATE result shapes, salt-freshness, and the cryptographic-erasure invariant.
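
A minimal sketch of how such a builder might fit together is shown below. The SET expression, the idempotency predicate, the salt generation, and the [30, 730] clamp are taken from the checklist. The function names, the `timestamp` column in the WHERE clause, the `TIMESTAMP_SUB` age filter, and the `would_affect` alias are assumptions, and the real service may be structured differently.

```javascript
const crypto = require('node:crypto');

const MIN_AGE_DAYS = 30;
const MAX_AGE_DAYS = 730;
const ALREADY_HASHED = /^[a-f0-9]{64}$/; // JS twin of the REGEXP_CONTAINS guard

function clampAgeDays(ageDays) {
  if (!Number.isInteger(ageDays)) throw new TypeError('ageDays must be an integer');
  return Math.min(MAX_AGE_DAYS, Math.max(MIN_AGE_DAYS, ageDays));
}

// Returns { query, params } suitable for a parameterized BigQuery job.
// Assumption: the event time column is named `timestamp`.
function buildPseudonymizeQuery({ ageDays = 90, dryRun = false } = {}) {
  const where = [
    'timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL @ageDays DAY)',
    "NOT REGEXP_CONTAINS(user_id, r'^[a-f0-9]{64}$')",
  ].join(' AND ');
  const params = { ageDays: clampAgeDays(ageDays) };
  if (dryRun) {
    // Dry run never generates a salt because nothing is rewritten.
    return { query: `SELECT COUNT(*) AS would_affect FROM analytics.events WHERE ${where}`, params };
  }
  // Fresh salt per call; it exists only in this object and the job parameters.
  params.salt = crypto.randomBytes(32).toString('hex');
  return {
    query: `UPDATE analytics.events SET user_id = TO_HEX(SHA256(CONCAT(user_id, @salt))) WHERE ${where}`,
    params,
  };
}

// Local mirror of TO_HEX(SHA256(CONCAT(user_id, @salt))) to show the output shape.
function pseudonymizeLocally(userId, salt) {
  return crypto.createHash('sha256').update(userId + salt, 'utf8').digest('hex');
}

const dry = buildPseudonymizeQuery({ ageDays: 90, dryRun: true });
const run1 = buildPseudonymizeQuery({ ageDays: 5 }); // clamped up to 30
const run2 = buildPseudonymizeQuery();
console.log(dry.query);
console.log(run1.params.ageDays, run1.params.salt !== run2.params.salt);
```

Because the salt is discarded after the job, the mapping from original `user_id` to pseudonym cannot be recomputed later, which is the cryptographic-erasure property the tests refer to. The output is 64 lowercase hex characters, so the idempotency regex recognizes it on the next run.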
Operator follow-up (after merge):
- [ ] Create the Cloud Scheduler job: `daily at 05:00 UTC POST /analytics/scheduled/pseudonymize-user-ids`.
- [ ] Confirm analytics service account has `roles/bigquery.dataEditor` on the `analytics` dataset.
- [ ] Privacy-policy line: "Analytics events older than 90 days are pseudonymized; even Lantern cannot link them to a current account."
What this buys you (combined): A breach today can't expose data outside the retention window. Firestore content is purged within hours/days of its useful life ending. BigQuery events older than 90 days become aggregable but no longer user-linkable, even by Lantern.
Sprint C: Finish Stage A (Phase 5) 🟡
Effort: ~2 days. PR: claude/privacy-sprint-c (this branch). Code changes merge in this PR; the migration script is operator-run after deploy.
Code (merged with this PR):
- [x] `tooling/scripts/drop-plaintext-phone.mjs`: cursor-paginated, `--dry-run` + `--limit=N` flags, idempotent. Refuses to drop `phone` from a row that doesn't already have `phoneHash` (defense-in-depth against running out of order).
- [x] Updated `services/functions/firebase/modules/phoneLookup.js`: uses `phoneHash` when `STAGE_A_PHONE_HASH_LOOKUP_ENABLED=true`; falls back to plaintext otherwise. Same `selectLookupQuery` pattern as `services/api/auth/src/routes/phone.js`.
- [x] Updated `services/api/auth/src/routes/phoneRecycling.js`: same Stage A flag pattern for the reclaim-by-phone lookup.
- [x] Audited remaining `users.phone` reads. The only Firestore queries on the plaintext field were the two updated above. Other `phone` references in the codebase point at `adminProfiles.phone` (separate collection) or `Firebase Auth user.phoneNumber` (managed by Firebase Auth, not Firestore); none of those are affected.
Operator runbook (after merge):
- [ ] Confirm `STAGE_A_PHONE_HASH_LOOKUP_ENABLED=true` is set on both auth-api Cloud Run AND the Cloud Functions runtime in every target environment.
- [ ] Confirm `tooling/scripts/backfill-phone-hash.mjs` has been run on the target environment so every existing `users/*` row has `phoneHash` populated.
- [ ] Let the new code soak with the flag on for at least 48h to surface any plaintext-path regressions.
- [ ] Run `tooling/scripts/drop-plaintext-phone.mjs --dry-run` and inspect the counts. Expect `Would drop = (#rows with plaintext phone)`, `Skipped (missing phoneHash) = 0`. If any rows are missing the hash, stop and re-run the backfill before continuing.
- [ ] Run `tooling/scripts/drop-plaintext-phone.mjs` (no `--dry-run`). Idempotent, so it is safe to re-run if interrupted.
- [ ] Optional: drop any composite Firestore index on `users.phone` if one exists. Single-field indexes are maintained automatically, and their entries disappear as the `phone` field is deleted from each document, so they need no manual cleanup.
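
The dry-run check in the runbook amounts to classifying each row and refusing to proceed if any row would lose its only phone reference. A sketch of that decision logic, using hypothetical names (`classifyRow`, `summarize`) and plain objects in place of Firestore documents, might look like this. It is not the script's actual code.

```javascript
// Hypothetical per-row decision mirroring the runbook's two counters.
function classifyRow(row) {
  const hasPhone = typeof row.phone === 'string' && row.phone.length > 0;
  const hasHash = typeof row.phoneHash === 'string' && row.phoneHash.length > 0;
  if (!hasPhone) return 'already-clean';            // nothing to drop; makes re-runs no-ops
  return hasHash ? 'would-drop' : 'skipped-missing-hash';
}

function summarize(rows) {
  const counts = { 'would-drop': 0, 'skipped-missing-hash': 0, 'already-clean': 0 };
  for (const row of rows) counts[classifyRow(row)] += 1;
  // Per the runbook: any skipped row means stop and re-run the backfill first.
  return { ...counts, safeToProceed: counts['skipped-missing-hash'] === 0 };
}

const report = summarize([
  { phone: '+14155550123', phoneHash: 'ab12' },
  { phoneHash: 'cd34' },
  { phone: '+14155550199' }, // backfill missed this row
]);
console.log(report);
```

The third row is why the ordering guard matters: dropping `phone` there would leave the account with no lookup key at all.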
What this buys you: The "plaintext phone (existing accounts)" row in the §3 threat-model table goes away entirely. A Firestore dump of users/* post-migration shows hashed phones across the board, with no remaining direct-PII leak via that surface.
Sprint D: GDPR cryptographic erasure cascade 🟡
Split into two PRs for review-surface reasons (mirrors the Sprint B split rationale).
Sprint D.1: Server cascade (Firestore + Auth + Storage) 🟡
Effort: ~2 days. PR: claude/privacy-sprint-d (this branch).
- [x] `services/functions/firebase/modules/userDeletion.js`: `deleteUserCompletely` callable Cloud Function. Cascades across:
  - Firebase Auth user record
  - `users/{userId}` Firestore doc
  - `lanterns` where `userId = X`
  - `waves` where `senderId = X` OR `receiverId = X` (two-pass query)
  - `connections` where `participants` contains `X`, with `messages` sub-collection cascade
  - Cloud Storage `avatars/{userId}/*` (best-effort)
- [x] Authorization: admin-deleting-other (verifyAdmin) OR self-delete (callerUid matches userId).
- [x] Audit row in `adminActions`: userId-only references per `docs/privacy/LOG_HYGIENE.md`, capped 280-char reason. Returns counts per surface.
- [x] Batched writes (400/batch) with safety break on runaway loops.
- [x] Resilient: each cascade is wrapped in try/catch and continues on partial failure. Auth delete treats `auth/user-not-found` as success.
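
The authorization rule and the continue-on-failure behaviour can be sketched with injected step functions, so the example runs without Firebase. `canDelete` and `runCascade` are hypothetical names, and the surface keys are illustrative. Only the `auth/user-not-found` handling and the per-surface counts come from the checklist.

```javascript
// Self-delete or admin-deleting-other; anything else is rejected.
function canDelete({ callerUid, callerIsAdmin, targetUserId }) {
  return Boolean(callerUid) && (callerIsAdmin === true || callerUid === targetUserId);
}

// Runs each surface's deletion step in order. A failing step is recorded and
// the cascade moves on, so one broken surface cannot strand the others.
async function runCascade(steps) {
  const counts = {};
  const errors = {};
  for (const [surface, step] of Object.entries(steps)) {
    try {
      counts[surface] = await step();
    } catch (err) {
      counts[surface] = 0;
      const alreadyGone = surface === 'auth' && err && err.code === 'auth/user-not-found';
      if (!alreadyGone) {
        errors[surface] = String(err && err.message ? err.message : err);
      }
    }
  }
  return { counts, errors };
}

// Usage with stubbed steps standing in for the real Firestore/Auth/Storage calls.
const demo = runCascade({
  auth: async () => {
    const e = new Error('no such user');
    e.code = 'auth/user-not-found';
    throw e; // treated as success: the record is already gone
  },
  lanterns: async () => 2,
  waves: async () => { throw new Error('deadline exceeded'); },
  connections: async () => 1,
});
demo.then((result) => console.log(result));
```

Returning `errors` alongside `counts` is a choice made for this sketch; it lets a caller or audit row distinguish "nothing to delete" from "deletion failed".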
Sprint D.2: BigQuery pseudonymization-on-deletion 🟡
Effort: ~0.5 day. PR: claude/privacy-sprint-d2 (this branch).
- [x] `pseudonymizeUserBigQueryEvents(userId, projectId)` helper inline in `services/functions/firebase/modules/userDeletion.js`. Mirrors the bulk job from Sprint B.2 but scoped to a single userId with a fresh per-call salt (`crypto.randomBytes(32)`).
- [x] Wired into the `deleteUserCompletely` cascade as step 7 (between user-doc + Auth cleanup and the audit row). Best-effort: BigQuery failures are logged and swallowed so they don't roll back the Firestore/Auth deletion that's already completed.
- [x] BigQuery dep available transitively via `@lantern/forge` (already a dep of `services/functions/firebase`). No new top-level dep added.
- [x] `counts.bigQueryEvents` added to the return shape so the caller can see how many rows were pseudonymized.
Architectural choice (worth noting): inlined in the Cloud Function rather than calling the analytics service via HTTPS. Cross-service auth (Cloud Function → Cloud Run with IAM ID tokens) adds setup. The BigQuery client is small enough that inlining is the lower-friction call. If we ever extract analytics into its own deploy with strict isolation, this is a clean refactor candidate.
Sprint D.3: UI integration 🟦
Effort: ~1 day. PR: (TBD, separate UI work)
- [ ] Admin panel button for moderator-triggered deletion (with reason field).
- [ ] User-initiated "delete my account" path from profile settings, with confirmation.
- [ ] Privacy policy line: "we cascade and cryptographically erase on request."
What this buys you (combined): A user requesting deletion (or required under GDPR Art. 17) leaves no recoverable trace within seconds, not 90 days. The "account deletion leaves orphaned data" gap from #281 item 3 is closed.
Sprint E: Stage B (seal the userId resolution) 🟦
Effort: ~2 sprints, ~8–10 working days. PR: (TBD, likely multiple)
- [ ] Replace plaintext `userId` in the auth-lookup row with `encryptedUserIdBlob = AES-256-GCM(userId, key = HKDF(entropy))`.
- [ ] Update `/auth/phone/lookup` to return `encryptedUserIdBlob` instead of `userId`. Client decrypts after PIN unlock.
- [ ] Existing-user migration: re-encrypt user_id blobs at next successful login.
- [ ] Recovery flow when client clears app data but retains passphrase.
- [ ] CS workflow redesign (no more "lookup user by phone" from the admin tool).
- [ ] Anti-fraud detection redesign for userId-only signals.
What this buys you: A subpoena providing a phone returns hashed phone + ciphertext blob, and the server cannot decrypt it without the user's passphrase. The remaining userId-keyed records become useful only if the subpoena starts with a userId (typically law enforcement starts with a phone or an email).
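
Since Stage B is not yet designed in detail, the following is only a sketch of the `AES-256-GCM(userId, key = HKDF(entropy))` formula using Node's built-in crypto. The blob layout (IV, then auth tag, then ciphertext, base64-encoded), the HKDF `info` label, and all function names are assumptions. On the client the same construction would more likely use Web Crypto.

```javascript
const crypto = require('node:crypto');

// HKDF-SHA-256 from client-held entropy to a 32-byte AES key.
// hkdfSync returns an ArrayBuffer, so wrap it for the cipher APIs.
function deriveKey(entropy, salt) {
  return Buffer.from(crypto.hkdfSync('sha256', entropy, salt, 'userId-blob-v1', 32));
}

// Assumed blob layout: 12-byte IV | 16-byte GCM tag | ciphertext, base64-encoded.
function sealUserId(userId, key) {
  const iv = crypto.randomBytes(12); // must be unique per encryption under one key
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(userId, 'utf8'), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString('base64');
}

function openUserId(blob, key) {
  const raw = Buffer.from(blob, 'base64');
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, raw.subarray(0, 12));
  decipher.setAuthTag(raw.subarray(12, 28));
  // final() throws if the key is wrong or the blob was tampered with.
  return Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]).toString('utf8');
}

const salt = crypto.randomBytes(16);
const clientKey = deriveKey(crypto.randomBytes(32), salt);
const serverGuess = deriveKey(crypto.randomBytes(32), salt); // server lacks the entropy
const encryptedUserIdBlob = sealUserId('user_abc123', clientKey);
console.log(openUserId(encryptedUserIdBlob, clientKey));
```

The server can store and return `encryptedUserIdBlob`, but without the client-held entropy it has no key for which `openUserId` succeeds.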
5. Governance (parallel, non-engineering)
- [ ] Mission Arbiter call on #145 Layers 2–5 (device fingerprinting, behavioral, social-graph, payment-method bans). All four directly contradict Right #6 and the sealed-identity brief's non-negotiables. Currently marked BLOCKED. Recommend formally closing them with a Mission Arbiter decision rather than leaving as "pending."
- [ ] Privacy policy + ToS audit against shipped architecture after each sprint. Avoid both over-claiming ("we never see your phone", when in fact we see it transiently to compute the hash) and under-claiming (silence on the encrypted-userId blob loses the marketing benefit).
- [ ] Subpoena response playbook documenting per-request-shape what we can/cannot produce.
- [ ] DPIA prep if/when EU market entry is on the roadmap.
6. Residual threats: what stays "theoretically yielding something" even after all sprints
To be straight: after Sprints A–E ship, the smallest defensible attack surface remaining is:
- Account existence by phone hash. With a known phone and the pepper, a court can compel Lantern to compute the hash and return yes/no. Removing this requires removing auth.
- Cloud Run access logs. Timestamps + IPs + userIds at the GCP infrastructure layer. App logs have no PII (#307 fixed); the access log itself is structural to running on GCP.
- GCP as a tenant. A court compelling Google directly is outside our architectural defenses. Mitigation is jurisdiction selection: UK IPA / Australia TOLA are flagged in SEALED_IDENTITY.md as do-not-enter without Stage B already shipped.
- Aggregate inference under combined queries. k-anonymity ≥ 3 is necessary but not sufficient; differential privacy (Laplace noise on aggregates) is a future Phase 3+ item.
That's the floor. After Sprints A–E we reach it.
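
The "necessary but not sufficient" point about k-anonymity can be shown with a toy differencing example. The row shape, metric names, and the `suppressSmallGroups` helper are hypothetical; only the threshold of 3 comes from this document.

```javascript
const K_MIN = 3;

// Drop any aggregate row backed by fewer than k distinct users.
function suppressSmallGroups(rows, k = K_MIN) {
  return rows.filter((row) => row.distinctUsers >= k);
}

const candidateRows = [
  { metric: 'visits_all', distinctUsers: 4 },
  { metric: 'visits_excluding_first_timers', distinctUsers: 3 },
  { metric: 'visits_first_timers', distinctUsers: 1 }, // below k, suppressed
];

const published = suppressSmallGroups(candidateRows);

// Both published rows pass the k >= 3 check, yet their difference
// reconstructs the suppressed single-user group.
const inferred = published[0].distinctUsers - published[1].distinctUsers;
console.log(published.map((r) => r.metric), inferred);
```

Each published number is individually safe under the threshold, but the combination is not, which is the gap that noise-based approaches such as differential privacy aim to close.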
7. Decision log
| Date | Decision | Rationale |
|---|---|---|
| 2026-05-10 | Hash users.phone with HMAC-SHA-256 + KMS pepper, not bcrypt | bcrypt is per-row salted, so it can't be equality-queried for the hot login path. HMAC + KMS pepper gives equivalent dictionary-attack resistance under the "attacker has DB dump, no pepper" threat model. (Stage A spec §6.1) |
| 2026-05-10 | Descope #145 Layers 2–5 (device + behavioral + social-graph + payment fingerprinting) | Directly contradicts Immutable Right #6. Layer 1 (hashed phone/email ban list) is shipped. Layers 2–5 await Mission Arbiter governance decision. |
| 2026-05-10 | Counsel review is parallel documentation work, not a permission gate | Privacy architecture is not subject to legal-review veto. Counsel describes what's shipped (privacy policy, subpoena playbook, DPIA); they don't authorize what ships. |
| 2026-05-11 | Phase 4 = server-side createUser endpoint, not client-fetches-hash-and-writes | Client-fetches-hash doesn't prevent ban-bypass by malicious client. Server owns the write, so the ban check is unbypassable. |
| 2026-05-11 | This roadmap document is canonical for the multi-sprint privacy hardening work | Survives context compaction; lets multiple PRs reference a single source of truth. |
| 2026-05-11 | lantern.interest free-text quote stays plaintext at rest; relies on Sprint B's 48h TTL for bounded exposure | Encrypting it with the lighter's key breaks social discovery (nearby users can't read). Proximity-derived shared-key encryption is out of scope pre-launch. User-elected publishing is acknowledged as part of the product. |
| 2026-05-11 | Profile encryption uses lazy migration on next login, not a bulk backfill job | Active-user population converges naturally; inactive accounts age out via separate cleanup. Avoids long-running migrations against the live database. |
| 2026-05-11 | Sprint B split into B.1 (Firestore TTL) and B.2 (BigQuery pseudonymization) | Two distinct infrastructure surfaces. B.1 is straightforward scheduled functions; B.2 requires BigQuery client + Service Account permissions and benefits from its own review/rollout. Both still belong to "Sprint B" in the roadmap narrative. |
| 2026-05-11 | Use scheduled Cloud Functions, not Firestore native TTL, for retention | Firestore native TTL doesn't cascade to sub-collections, so connections/{cid}/messages/* would remain after the parent doc deletes. Doing all three retention jobs in one Cloud Functions module keeps the story coherent. |
| 2026-05-11 | Two-tier BigQuery retention model: raw events (user-keyed, 90-day partition expiration) vs aggregated event_counts_daily (no user_id, indefinite retention) | Partition expiration on analytics.events is enforced by BigQuery itself, so events older than 90 days don't exist. The aggregation MERGE that builds event_counts_daily groups by (day, event_name, event_tier, environment) only, so user identity is aggregated away at the bridge between tiers. Long-tail merchant analytics stay queryable; user identifiability does not. Verified by inspecting schema 2026-05-11: event_counts_daily has no user_id / entity_id / service_id columns. |
8. References
- Brief: docs/privacy/SEALED_IDENTITY.md
- Stage A design spec: docs/superpowers/specs/2026-05-10-sealed-identity-stage-a-design.md
- Stage A spike plan: docs/superpowers/plans/2026-05-10-sealed-identity-spike.md
- Phase 4 plan: docs/superpowers/plans/2026-05-11-sealed-identity-stage-a-phase-4-impl.md
- Privacy audit issue: #281
- GDPR deletion + BQ pseudonymization issue: #308
- Existing privacy docs: HOW_ENCRYPTION_WORKS.md, PRIVACY_PRESERVING_DATA_COLLECTION.md
- Log hygiene policy: docs/privacy/LOG_HYGIENE.md
9. How to use this document
- Each sprint PR updates this file's checkboxes in §4 to reflect what shipped.
- Decisions get appended to §7 with date + rationale. Never silently overwrite; append.
- The threat-model table in §3 gets updated after each sprint merges, to reflect the new floor.
- Residual threats in §6 are honest: never claim more sealing than the architecture actually provides.