2026-04-29 - Issue #337 Cloud Run Deploy Gaps: Shipped + Lessons
Outcome: All four #337 acceptance items shipped via PRs #338, #339, #345, plus an outstanding GitHub-API rewrite (this worklog's branch). Five follow-up issues filed (#340, #342, #343 + the IAM-binding work + the SW staleness fix). Today exposed a number of process gaps worth writing down so I don't repeat them.
What shipped
| PR | Branch | What it did |
|---|---|---|
| #338 | chore/preview-label-gate | Gate deploy-preview.yml on a preview PR label so 4 Cloudflare Pages previews don't fire on every PR push. Also catalogs the new label in triage-categories.json excludedLabels. |
| #339 | fix/337-cloud-run-deploy-gaps | The headline #337 work: registry devUrl for auth-api + lanterns-api + assistant-api; merchants-api Dockerfile + deploy job; assistant-api deploy job (initial, snapshot-bundling design); <LanternChat /> UX guard for unset VITE_ASSISTANT_API_URL; admin-deploy-job env wiring for assistant URL. |
| #341 | fix/337-followup-assistant-deploy | First follow-up: fixed cp -r services recursing into the snapshot destination (now staged via mktemp -d + mv). Set merchants-api devUrl (deferred in the original PR for no good reason). |
| #345 | fix/337-revert-iam-step | Revert of an IAM-binding step I added across all 7 deploy jobs in PR #344. gcloud run services add-iam-policy-binding failed with PERMISSION_DENIED in CI because the WIF service account lacks run.services.setIamPolicy. |
| (this branch) | fix/337-assistant-github-runtime | Rip out the repo-snapshot bundling design entirely. New repoFs.js reads via the GitHub Contents/Search API at runtime. Smaller image, no per-deploy snapshot churn, content stays fresh between deploys. Includes a test-repofs.mjs smoke-test script. |
What broke + why
The #337 work surfaced five distinct failures before everything was healthy. Three of them were on the assistant-api deploy alone:
1. `cp` recursion in the snapshot stage: destination `services/api/assistant/repo-snapshot/` lives inside `services/`, so `cp -r services …` recursed into itself. Fixed in #341.
2. `cp` failed on missing allow-list paths: I copied the `repoFs.js` allow-list verbatim into the workflow, but `.github/instructions/` and `AGENTS.md` don't exist in the tree. Fixed (defensively) by looping with `[ -e ]` checks.
3. `gcloud --source` excluded the staged dirs from the build context: the per-service `.gitignore` lists `.shared-pkg/` and `repo-snapshot/`, and `gcloud run deploy --source` falls back to `.gitignore` when no `.gcloudignore` exists. Build context was 115KB instead of expected ~MB; Docker `COPY .shared-pkg/` failed. Fixed by adding an explicit `.gcloudignore` (and ultimately by removing the snapshot pattern entirely in this branch).
4. IAM binding step PERMISSION_DENIED: I added a `gcloud run services add-iam-policy-binding` step to all 7 deploy jobs to work around `--allow-unauthenticated` being silently no-op'd on this account. It worked locally with my admin creds; failed in CI because the WIF service account only has deploy perms, not `setIamPolicy`. The actual `gcloud run deploy` succeeded, so services kept serving, but my added step exited non-zero, making 5 previously-green deploy jobs report failure. This is the one that genuinely scared us. Reverted in PR #345.
5. Localhost-fallback CSP error in `merchantsApi.js`: admin client fell back to `http://localhost:8085` in prod when `VITE_MERCHANTS_API_URL` was unset; CSP correctly blocked. Same shape of bug latent in `assistantApi.js` (empty-string fallback → relative URL → SPA → 405). Both fixed.
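
The first two failures share a shape that a small staging helper avoids. The sketch below is a minimal illustration, not the workflow's actual script: the function name `stage_snapshot` and its argument order are hypothetical. It stages into a `mktemp -d` directory outside the source tree (the #341 approach) and skips allow-list entries that are not present (the `[ -e ]` check).

```shell
# stage_snapshot SRC_ROOT DEST PATH...   (hypothetical helper name)
# Copies each allow-listed PATH, relative to SRC_ROOT, into a scratch
# directory and then moves the scratch directory to DEST.
# The ( ... ) body runs in a subshell, so its variables stay private.
stage_snapshot() (
  set -eu
  src_root=$1
  dest=$2
  shift 2
  rm -rf "$dest"            # drop any stale snapshot so it is not copied again
  tmp=$(mktemp -d)          # lives outside SRC_ROOT, so cp cannot recurse into it
  for rel in "$@"; do
    rel=${rel%/}            # tolerate a trailing slash in the allow-list
    if [ -e "$src_root/$rel" ]; then
      mkdir -p "$tmp/$(dirname "$rel")"
      cp -r "$src_root/$rel" "$tmp/$rel"
    else
      echo "skip: $rel (not in tree)" >&2
    fi
  done
  mkdir -p "$(dirname "$dest")"
  mv "$tmp" "$dest"
)
```

From the repo root this could be called as `stage_snapshot . services/api/assistant/repo-snapshot services .github/instructions AGENTS.md`; missing entries produce a `skip:` line on stderr instead of a failed step.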
Lessons
These are the things I want to internalize so I don't burn another deploy cycle on them.
1. Local creds ≠ CI creds. Test the actual CI auth path.
The IAM step (#4 above) is the hardest one to swallow. I tested gcloud run services add-iam-policy-binding locally, it worked, I shipped it. It worked because my gcloud was authenticated as me, an admin. CI runs as a Workload Identity Federation service account with a much narrower role set.
Going forward: when adding any new gcloud step, look up which role grants the operation (here: roles/run.admin for setIamPolicy) and confirm the WIF SA has it before merging. If you can't read the WIF SA's IAM directly, at minimum document the assumption in the commit message so the failure is traceable.
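
A minimal sketch of that pre-merge check, assuming I can read the project's IAM policy. The helper names `list_sa_roles` and `has_role` are hypothetical, as are the project and service-account names in the example. It only sees project-level bindings, so a role granted on an individual service or through a custom role would not show up here.

```shell
# has_role ROLE: reads role names (one per line) on stdin and succeeds
# only if ROLE is one of them (exact, whole-line match).
has_role() {
  grep -Fxq -- "$1"
}

# list_sa_roles PROJECT_ID SA_EMAIL: prints the project-level roles bound
# to the given service account, one per line.
list_sa_roles() {
  gcloud projects get-iam-policy "$1" \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:$2" \
    --format="value(bindings.role)"
}

# Example (placeholder project and SA names):
#   list_sa_roles my-project ci-deployer@my-project.iam.gserviceaccount.com \
#     | has_role roles/run.admin \
#     || echo "WIF SA lacks roles/run.admin; setIamPolicy will fail in CI" >&2
```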
2. Dry-run the actual bash on a clean tree before pushing.
Bugs #1 and #2 (the cp failures) were both shell-level. They would have been caught in seconds by:

    # in a clean checkout, paste the run: block exactly as it appears in the workflow
    bash -c '<the full script>'

I did this once after #341, and doing it earlier would have caught the .github/instructions failure before the push. I just didn't do it consistently. From now on: any new shell in a workflow gets dry-run locally on a clean checkout before commit. No exceptions.
3. gcloud run deploy --source upload uses .gitignore by default.
Worth knowing as a first-class fact. If you stage runtime artifacts into a build context that are gitignored (and you don't have a .gcloudignore), gcloud silently drops them. Symptom: tiny build context + COPY failure inside Docker. Always pair runtime-staged artifacts with an explicit .gcloudignore that overrides the relevant .gitignore exclusions.
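
As a sketch of that pairing: the helper below writes a `.gcloudignore` into a service directory. The function name `write_gcloudignore` and the exact ignore entries are assumptions for illustration, not the file shipped in this branch. The point is only that once the file exists, `.gitignore` is no longer consulted for the upload, so a staged directory such as `.shared-pkg/` is uploaded as long as it is not listed.

```shell
# write_gcloudignore DIR   (hypothetical helper name)
# Creates DIR/.gcloudignore. With this file present, gcloud stops falling
# back to .gitignore when building the --source upload.
write_gcloudignore() {
  cat > "$1/.gcloudignore" <<'EOF'
# Exclude only what the image build does not need (entries are illustrative).
.git
.gitignore
node_modules/
# .shared-pkg/ is deliberately NOT listed, so the staged dir is uploaded.
EOF
}

# To preview what would be uploaded, run this from the service directory:
#   gcloud meta list-files-for-upload | grep shared-pkg
```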
4. When a substantial change ships, ship a way to test it locally.
The test-repofs.mjs script in this branch is the model for what I should have done with the snapshot logic too. A 100-line smoke test that you can re-run after every change catches a class of bugs that no amount of YAML linting or local script-testing will catch.
Pattern: for any change that touches a backend service's behavior, ship a test-<thing>.mjs (or equivalent) alongside the change that exercises the new code path against a realistic environment. Calling it out in the PR description makes the reviewer's job 10× easier.
5. Bundle related work into a single PR.
I split the original #337 work into two PRs (the gate detour + the main work) when one would have done. That created the very mess the user objected to: orphan PRs, ordering issues, and a fragile "land #338 first to save preview spend" plan that didn't hold up. Saved as a feedback memory.
6. Don't trust apparent damage without checking.
When the IAM step failed, the GitHub Actions UI showed 5 red X's and the user understandably panicked. The actual gcloud run deploy had succeeded for all five โ only my added IAM step had failed. All services were still serving 200. I should have led with curl /health checks rather than reverting blind.
This is also a CI design concern: a post-deploy validation step that fails should not make a successful deploy report failure. We should split "deploy" and "validate" into separate GitHub Actions jobs (or at least separate steps with continue-on-error: true on the validation) so the apparent-vs-actual damage signals stay aligned.
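
The "lead with curl /health" check could look like the sketch below. The function name `check_health` is hypothetical, and it assumes every service exposes `/health` under its base URL and answers 200 when healthy.

```shell
# check_health URL...   (hypothetical helper name)
# Requests URL/health for each base URL, prints the HTTP status, and
# returns non-zero if any service does not answer 200.
check_health() {
  failed=0
  for base in "$@"; do
    # -s hides progress, -o discards the body, -w prints only the status code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$base/health") || code=000
    echo "$base -> $code"
    [ "$code" = "200" ] || failed=1
  done
  return "$failed"
}

# Example (placeholder URLs):
#   check_health https://auth-api.example.test https://lanterns-api.example.test
```

Running this against the five "failed" services before touching anything would have shown they were still serving, which separates the red X in the Actions UI from actual damage.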
7. Localhost fallbacks are a liability.
Two API clients had silent localhost / empty-string fallbacks that bit users in prod. The pattern across the codebase is inconsistent: 3 clients fall back to a Cloud Run dev URL, 2 fall back to localhost-or-empty, 4 require an env var. Issue #343 is the proper fix: a registry-driven getApiBaseUrl(slug, envVar) helper that fails loudly in misconfigured prod builds.
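
Until #343 lands, the same "fail loudly" idea can be approximated on the CI side. This is a complementary build-time guard and not the `getApiBaseUrl` helper itself; the function name `require_api_urls` is hypothetical, and it assumes the prod build job can run a shell step before the Vite build.

```shell
# require_api_urls VAR...   (hypothetical helper name)
# Fails if any named environment variable is unset, empty, or points at
# localhost, so a misconfigured prod build stops before bundling.
# Variable names are passed in by the workflow author, so eval is only
# ever given trusted identifiers.
require_api_urls() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    case "$value" in
      ''|*localhost*|*127.0.0.1*)
        echo "error: $name is unset or local: '$value'" >&2
        missing=1
        ;;
    esac
  done
  return "$missing"
}

# Example: require_api_urls VITE_MERCHANTS_API_URL VITE_ASSISTANT_API_URL
```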
8. Worktrees aren't always worth it for sequential PR work.
We started in a worktree to keep the main checkout clean while we worked on #337. Hit two real frictions: .env.local doesn't transfer, and Claude Code's per-project context (memory, allowed permissions, hooks) splits across worktree paths. Switched back to plain branches in the main checkout, which was strictly better for this workflow. Saved as a feedback memory.
Open follow-ups (issues filed)
- #340: `validate` orchestrator regenerates `version.json` + client SDK on every run. Stop mutating in `validate`; only check freshness.
- #342: Admin PWA serves stale JS bundle until tabs fully closed. Add `skipWaiting` + `clients.claim()` or a "new version available" banner.
- #343: Refactor 9 API client base-URL definitions to a single registry-driven helper.
- (unfiled): IAM binding for new Cloud Run services should be a one-time setup task per service, not a per-deploy step. Either grant the WIF SA `roles/run.admin`, or document a manual `gcloud run services add-iam-policy-binding` runbook step for each new service.
- (unfiled): `octokit` `/search/code` endpoint deprecation 2026-09-27. The new `repoFs.js` will need to switch to the GraphQL search API or the announced replacement before then.
Files in this branch
- services/api/assistant/src/services/repoFs.js: full rewrite, GitHub-backed
- services/api/assistant/test-repofs.mjs: smoke test
- services/api/assistant/Dockerfile: drop snapshot COPY + env
- services/api/assistant/.gcloudignore: new; controls upload separately from gitignore
- services/api/assistant/.gitignore: drop `repo-snapshot/`, add `.forge-pkg/`
- services/api/assistant/package.json: `@octokit/rest`
- .github/workflows/deploy-dev.yml: drop staging step; wire `GITHUB_TOKEN`/`OWNER`/`REPO` from Secret Manager