# Incident 2026-04-21 — Production database found empty
- **Status:** Resolved
- **Severity:** Major (full data loss, no user impact — product pre-release)
- **Duration:** ~2h from detection to defensive fix merged
- **Authors:** Jérémy Soriano (human) + Claude Code (pair)
## TL;DR
On 2026-04-21, shortly after the `feat(api): add /stations and /stations/:id/measurements` deploy, a smoke test revealed that the production database contained zero rows across every seed-managed table. The `alpimonitor-pgdata` volume itself was intact — the data had been deleted in place. The root cause could not be pinned to a single command; forensic evidence exonerated the Claude Code tooling. The resolution ships an idempotent seed-on-boot entrypoint so the system converges back to a known-good context state on every container restart.
## Timeline (UTC)
| When (2026-04-21) | Event |
|---|---|
| ~11:50 | `feat(api): /stations` endpoints pushed to `main`. Coolify rebuilds + redeploys the `api` service. |
| ~12:10 | Smoke test: `curl https://api.alpimonitor.fr/api/v1/stations` returns `{"data": []}`; `/status` reports `ingestion.lastRun.stationsSeenCount = 0`. |
| ~12:15 | SSH diagnostic from the VPS (95.216.196.69): `docker volume inspect alpimonitor-prod_alpimonitor-pgdata` shows `CreatedAt: 2026-04-20T18:38:…Z`, never recreated. `SELECT COUNT(*) FROM "Station"` = 0. Postgres log shows `clean shutdown requested` / `database system is ready` at 12:30 UTC, no `FATAL`, no `initializing database`. |
| ~12:25 | Conclusion: no volume wipe, no reinit. Data deleted via SQL from an application container. |
| ~12:30 | Forensic: shell-history grep + JSONL transcript grep for any `prisma db seed` / `migrate reset` with `DATABASE_URL` pointing at `alpimonitor.fr`. Zero hits. Every destructive invocation used `localhost:5432`. |
| ~12:45 | Manual seed attempt via SSH fails: `spawn tsx ENOENT`. `tsx` was only in `devDependencies`, so `pnpm deploy --prod` stripped it from the runtime image. `prisma db seed` could not spawn the seed script. |
| ~13:00 | Defensive bundle built locally, validated against dev Postgres across three boot scenarios (see Resolution). |
| ~13:10 | Bundle committed + pushed. Coolify redeploys; entrypoint re-applies migrations, re-seeds the context tables, restarts the API; the ingestion cron re-populates `Measurement` rows on its next tick. |
## Symptoms
- `GET /api/v1/stations` → `{"data": []}` (expected: 7 stations).
- `GET /api/v1/status` → `ingestion.lastRun.stationsSeenCount: 0` (expected: 4 LIVE stations with an OFEV code).
- Postgres container healthy; `alpimonitor-pgdata` volume present with the original creation date.
- No OOM / crash in the API logs leading up to the event.
## Diagnostic
- **Volume:** intact, not recreated. Rules out `docker volume rm` / a Coolify redeploy wipe.
- **Postgres boot log:** clean shutdown at 12:30 UTC, `database system is ready` afterwards. No `PostgreSQL init process` message → no first-boot reinit. Rules out a blank init.
- **Row count:** `Station`, `Sensor`, `Threshold`, `Glacier`, `StationGlacier`, `Withdrawal`, `Catchment` all at zero. `_prisma_migrations` still populated. The schema is intact; only the data is gone.
- **Vector:** consistent with a SQL-level delete executed against the live DB by something that held the `DATABASE_URL` secret — i.e. an application container, the seed script, or an interactive tool.
## Hypothesis considered — `pruneStaleStations` in `apps/api/prisma/seed.ts`
The seed runs `pruneStaleStations(currentOfevCodes)`, which deletes every `Station` whose `ofevCode` is not in the current seed list, cascading to `Measurement`, `Alert`, `Threshold`, `Sensor`, `StationGlacier`. If the seed had been run against prod with a diverging station list, this is exactly the shape of damage observed.
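To make the failure shape concrete, here is a minimal model of the pruning logic. The function name comes from the seed; the `StationRow` shape, codes, and names are invented for illustration, and the Prisma delete is modeled on an in-memory list:

```typescript
// Hypothetical model of the pruning step: everything whose ofevCode is
// absent from the current seed list is deleted (the real seed cascades
// that delete to Measurement, Alert, Threshold, Sensor, StationGlacier).
interface StationRow {
  ofevCode: string;
  name: string;
}

function staleStations(dbRows: StationRow[], currentOfevCodes: string[]): StationRow[] {
  const keep = new Set(currentOfevCodes);
  return dbRows.filter((row) => !keep.has(row.ofevCode));
}

const liveDb: StationRow[] = [
  { ofevCode: "2009", name: "Station A" },
  { ofevCode: "2174", name: "Station B" },
];
// A diverging seed list that only knows "2009" marks "2174" stale; run
// against prod, that station and all its cascading rows would vanish.
const stale = staleStations(liveDb, ["2009"]);
```

If the seed list diverges completely, every row comes back stale, which matches the all-tables-empty damage observed.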
## Forensic — Claude Code tooling
All `prisma db seed` / `prisma migrate reset` invocations initiated from this Claude Code session (and from preceding sessions on this machine, checked via the JSONL session transcripts at `~/.claude/projects/-home-student-Desktop-alpimonitor/*.jsonl`) were executed with `DATABASE_URL="postgresql://alpimonitor:alpimonitor_dev@localhost:5432/alpimonitor"`. No invocation targeted `alpimonitor.fr` or the production credentials. Evidence: `grep -iE "(prisma.*seed|migrate.*reset|DATABASE_URL.*alpimonitor\.fr)"` over the session logs returned only localhost-scoped matches.
## Root cause
Undetermined. The forensic trail rules out the Claude Code tooling but does not identify the concrete command responsible. Plausible candidates, none proven:

- An interactive `pnpm prisma db seed` run from a shell that had sourced `.env.production` earlier.
- A one-off `docker exec` into the API container, with a stale seed list in an older image layer.
- A Coolify "Run command" invocation against a container whose env was `.env.production`.

Because none of these are verifiable after the fact (no audit log captured `DATABASE_URL` at the time of the delete), we stop chasing ghosts and make the system converge back to a known-good state deterministically.
## Resolution
### A. Defensive boot (this commit)
`apps/api/entrypoint.sh`:

- `prisma migrate deploy` — mandatory, fatal on failure (`set -eu`). Starting the API against a stale schema would be worse than not starting.
- `prisma db seed` — runs iff `SEED_ON_BOOT=true`. Tolerant to failure (`if ! cmd; then WARN`): a broken seed must not take the API offline; `/health` should stay up while ops investigates.
- `exec node dist/index.js` — `exec` so `tini` (PID 1) forwards signals straight to Node.
The seed is idempotent (`upsert` on every row, keyed by `ofevCode` / compound keys), so re-running on every boot is safe. `pruneStaleStations` still runs, but its input is the current seed list, so a boot-time run only deletes rows the seed list declares stale — the same behaviour as a local `pnpm prisma:seed`.
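The boot contract can be sketched as follows. This is an illustrative reconstruction, not the shipped `entrypoint.sh`: the Prisma CLI is stubbed with a shell function so only the control flow runs, and the flow is wrapped in a `boot` function so both branches can be exercised.

```shell
#!/bin/sh
# Sketch of the apps/api/entrypoint.sh boot contract (prisma stubbed).
set -eu

prisma() { echo "prisma $*"; }   # stub: the real script calls the Prisma CLI

boot() {
  # prisma db seed spawns `tsx` by bare name; make the workspace bin visible.
  export PATH="/app/node_modules/.bin:$PATH"

  # 1. Mandatory: under `set -eu`, a failed migrate aborts the boot.
  prisma migrate deploy

  # 2. Opt-in and tolerant: a broken seed warns but must not block /health.
  if [ "${SEED_ON_BOOT:-}" = "true" ]; then
    if ! prisma db seed; then
      echo "[boot] WARN: seed failed, starting API anyway" >&2
    fi
  else
    echo "[boot] SEED_ON_BOOT not set to 'true' — skipping seed"
  fi

  # 3. The real script ends with: exec node dist/index.js
  #    (exec'd so tini, PID 1, forwards signals straight to Node)
  echo "exec node dist/index.js"
}

seeded="$(SEED_ON_BOOT=true; boot)"
skipped="$(SEED_ON_BOOT=; boot)"
```

The two captured runs mirror the first two validation scenarios below: seed enabled, and seed skipped when the flag is unset.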
### B. `tsx` promoted to runtime dependency
`apps/api/package.json`: `tsx` moved from `devDependencies` to `dependencies`. `pnpm deploy --prod --legacy /prod/api` now keeps it, so `prisma db seed` can spawn the seed script (`"prisma": { "seed": "tsx prisma/seed.ts" }`) inside the runtime image. The 12:45 UTC `spawn tsx ENOENT` failure is the exact symptom this fixes.
### C. `PATH` export in the entrypoint
`export PATH="/app/node_modules/.bin:$PATH"` — `prisma db seed` spawns `tsx` by bare name and relies on `PATH`, which is not guaranteed to include the workspace bin dir under `node:20-alpine`. Explicit is better than implicit.
### D. Compose wiring
`docker-compose.prod.yml`: `SEED_ON_BOOT: ${SEED_ON_BOOT:-true}` defaults the flag on for this deployment (demo/POC phase, before real operational data exists). `.env.production.example` documents the flag and tells operators to flip it to `false` once real data must not be overwritten by seed fixtures.
## Local validation before push
Three scenarios run against the dev Postgres container:
| Scenario | Expected | Observed |
|---|---|---|
| `SEED_ON_BOOT=true` | migrate no-op, seed restores rows, server starts | `Seed complete: { catchments: 1, stations: 7, stationsLive: 4, stationsResearch: 3, sensors: 14, thresholds: 7, glaciers: 2, stationGlaciers: 6, withdrawals: 2 }`, server up, clean SIGTERM. ✓ |
| `SEED_ON_BOOT` unset | migrate runs, seed skipped, server starts | `[boot] SEED_ON_BOOT not set to 'true' — skipping seed`, server up. ✓ |
| bad `DATABASE_URL` | migrate fails fatally, container exits | `Error: P1000` authentication failed, exit non-zero. ✓ |
## Prevention
- **Entrypoint contract:** every boot re-applies migrations and, if enabled, re-seeds context. Data loss no longer lingers past a restart.
- **Seed idempotency:** enforced by `upsert` on stable natural keys (`ofevCode` for stations, compound `{stationId, glacierId}` for junctions, etc.). Replay is free of side effects beyond converging to the declared fixture set.
- **Runtime dependencies auditable:** `tsx` being a runtime dep is now visible in `apps/api/package.json` — no more surprise `ENOENT` when ops tries to run the seed.
- **Separation of data classes:** the seed only governs the context tables (stations, glaciers, thresholds). The operational tables (`Measurement`, `Alert`, `IngestionRun`) are owned by the ingestion cron and untouched by the seed. A seed replay therefore cannot delete ingested measurements older than the current run — it can only delete them as a side effect of `pruneStaleStations` for stations removed from the seed list, which is the intended behaviour.
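The idempotency property can be shown in miniature. This models the upsert on an in-memory map keyed by `ofevCode` rather than the real `prisma.station.upsert`, and the fixture values are invented for illustration:

```typescript
// Idempotency in miniature: replaying the same fixture set converges to
// the same state, because every write is keyed on a stable natural key.
type StationFixture = { ofevCode: string; name: string };

function upsertAll(db: Map<string, StationFixture>, fixtures: StationFixture[]): void {
  for (const s of fixtures) db.set(s.ofevCode, s); // create-or-update by natural key
}

const db = new Map<string, StationFixture>();
const fixtures: StationFixture[] = [
  { ofevCode: "2009", name: "Station A" },
  { ofevCode: "2174", name: "Station B" },
];
upsertAll(db, fixtures);
upsertAll(db, fixtures); // replay: no duplicates, identical final state
```

This is the property that makes a seed-on-boot entrypoint safe: a second, third, or hundredth boot converges instead of accumulating rows.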
## Lessons
- **Runtime vs. dev dependencies matter for operational tooling.** Anything ops might need to run in a production container (seed, one-off migrations, debug scripts) must live in `dependencies`, not `devDependencies`. `pnpm deploy --prod` is unforgiving.
- **Self-healing beats forensic archaeology.** We spent 30 min trying to pin the exact deletion command and could not. The 30 min spent making the system re-converge on boot has a much better ROI.
- **Idempotent seeds are a safety net, not a shortcut.** The seed was already idempotent for good reasons (dev velocity, no "reset your DB" ritual). That property is what let us ship a seed-on-boot entrypoint with confidence in a single afternoon.
- **Session-level audit trails pay off.** The Claude Code JSONL transcripts + shell history were enough to rule out the pair-programming tooling in under five minutes. Keep them.
- **Demo phase ≠ test phase.** A pre-release product still earns production-grade incident hygiene: timeline, evidence, resolution, prevention. Write it down. This document is part of the deliverable.
## Follow-ups
- Flip `SEED_ON_BOOT` to `false` once the product leaves the demo phase and operational data (acknowledged alerts, thresholds tuned by domain experts) must not be overwritten by fixture re-runs.
- Add a `scripts/prod-shell-safety.md` note listing exactly which commands are safe to run against a live `DATABASE_URL` and which are not (`prisma migrate reset`: never; `prisma db seed`: only when `pruneStaleStations` matches the live station set).
- Consider a pre-flight check in `seed.ts` that refuses to run if the `DATABASE_URL` host resolves to a non-local address and `ALLOW_PROD_SEED=1` is not set — belt-and-braces for the next contributor.
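The proposed guard could look roughly like this. The `seedAllowed` helper and its local-host heuristic are assumptions; only the `ALLOW_PROD_SEED=1` flag name comes from the follow-up above. Hostname extraction relies on the WHATWG `URL` API, which parses `postgresql://` connection strings because they carry a `//` authority:

```typescript
// Hypothetical pre-flight guard for apps/api/prisma/seed.ts: refuse to
// seed a non-local database unless the operator explicitly opts in.
function seedAllowed(
  databaseUrl: string,
  env: Record<string, string | undefined>,
): boolean {
  const host = new URL(databaseUrl).hostname;
  const isLocal = host === "localhost" || host === "127.0.0.1" || host === "[::1]";
  return isLocal || env.ALLOW_PROD_SEED === "1";
}

// Dev database: always allowed.
const dev = seedAllowed("postgresql://alpimonitor:pw@localhost:5432/alpimonitor", {});
// Remote host without the escape hatch: refused.
const prod = seedAllowed("postgresql://u:p@db.alpimonitor.fr:5432/alpimonitor", {});
```

In the real `seed.ts` this would sit before any Prisma call, throwing (and thus aborting the seed) when `seedAllowed` returns false.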