# Incident 2026-04-21 — Production database found empty
- **Status:** Resolved
- **Severity:** Major (full data loss, no user impact — product pre-release)
- **Duration:** ~2h from detection to defensive fix merged
- **Authors:** Jérémy Soriano (human) + Claude Code (pair)
## TL;DR
On 2026-04-21, shortly after the `feat(api): add /stations and /stations/:id/measurements` deploy, a smoke test revealed that the production database contained zero rows across every seed-managed table. The `alpimonitor-pgdata` volume itself was intact — the data had been deleted in place. The root cause could not be pinned to a single command; forensic evidence exonerated the Claude Code tooling. The resolution ships an idempotent seed-on-boot entrypoint so the system converges back to a known-good context state on every container restart.
## Timeline (UTC)
| When (2026-04-21) | Event |
|---|---|
| ~11:50 | `feat(api): /stations` endpoints pushed to `main`. Coolify rebuilds + redeploys the `api` service. |
| ~12:10 | Smoke test: `curl https://api.alpimonitor.fr/api/v1/stations` returns `{"data": []}`; `/status` reports `ingestion.lastRun.stationsSeenCount = 0`. |
| ~12:15 | SSH diagnostic from the VPS (95.216.196.69): `docker volume inspect alpimonitor-prod_alpimonitor-pgdata` shows `CreatedAt: 2026-04-20T18:38:…Z`, never recreated. `SELECT COUNT(*) FROM "Station"` = 0. Postgres log shows `clean shutdown requested` / `database system is ready` at 12:30 UTC, no `FATAL`, no `initializing database`. |
| ~12:25 | Conclusion: no volume wipe, no reinit. Data deleted via SQL from an application container. |
| ~12:30 | Forensic: shell-history grep + JSONL transcript grep for any `prisma db seed` / `migrate reset` with `DATABASE_URL` pointing at `alpimonitor.fr`. Zero hits. Every destructive invocation used `localhost:5432`. |
| ~12:45 | Manual seed attempt via SSH fails: `spawn tsx ENOENT`. `tsx` was only in `devDependencies`, so `pnpm deploy --prod` stripped it from the runtime image. `prisma db seed` could not spawn the seed script. |
| ~13:00 | Defensive bundle built locally, validated against dev Postgres across three boot scenarios (see Resolution). |
| ~13:10 | Bundle committed + pushed. Coolify redeploys; entrypoint re-applies migrations, re-seeds the context tables, restarts the API; the ingestion cron re-populates `Measurement` rows on its next tick. |
## Symptoms
- `GET /api/v1/stations` → `{"data": []}` (expected: 7 stations).
- `GET /api/v1/status` → `ingestion.lastRun.stationsSeenCount: 0` (expected: 4 LIVE stations with an OFEV code).
- Postgres container healthy; `alpimonitor-pgdata` volume present with the original creation date.
- No OOM / crash in the API logs leading up to the event.
## Diagnostic
- **Volume:** intact, not recreated. Rules out `docker volume rm` / a Coolify redeploy wipe.
- **Postgres boot log:** clean shutdown at 12:30 UTC, `database system is ready` afterwards. No `PostgreSQL init process` message → no first-boot reinit. Rules out a blank init.
- **Row count:** `Station`, `Sensor`, `Threshold`, `Glacier`, `StationGlacier`, `Withdrawal`, `Catchment` all at zero. `_prisma_migrations` still populated. The schema is intact; only the data is gone.
- **Vector:** consistent with a SQL-level delete executed against the live DB by something that held the `DATABASE_URL` secret — i.e. an application container, the seed script, or an interactive tool.
## Hypothesis considered — `pruneStaleStations` in `apps/api/prisma/seed.ts`
The seed runs `pruneStaleStations(currentOfevCodes)`, which deletes every `Station` whose `ofevCode` is not in the current seed list, cascading to `Measurement`, `Alert`, `Threshold`, `Sensor`, `StationGlacier`. If the seed had been run against prod with a diverging station list, this is exactly the shape of damage observed.
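To make the failure shape concrete, here is a minimal model of the pruning logic. The function name comes from the seed; the `StationRow` shape, codes, and names are invented for illustration, and the Prisma delete is modeled on an in-memory list:

```typescript
// Hypothetical model of the pruning step: everything whose ofevCode is
// absent from the current seed list is deleted (the real seed cascades
// that delete to Measurement, Alert, Threshold, Sensor, StationGlacier).
interface StationRow {
  ofevCode: string;
  name: string;
}

function staleStations(dbRows: StationRow[], currentOfevCodes: string[]): StationRow[] {
  const keep = new Set(currentOfevCodes);
  return dbRows.filter((row) => !keep.has(row.ofevCode));
}

const liveDb: StationRow[] = [
  { ofevCode: "2009", name: "Station A" },
  { ofevCode: "2174", name: "Station B" },
];
// A diverging seed list that only knows "2009" marks "2174" stale; run
// against prod, that station and all its cascading rows would vanish.
const stale = staleStations(liveDb, ["2009"]);
```

If the seed list diverges completely, every row comes back stale, which matches the all-tables-empty damage observed.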
## Forensic — Claude Code tooling
All `prisma db seed` / `prisma migrate reset` invocations initiated from this Claude Code session (and from preceding sessions on this machine, checked via the JSONL session transcripts at `~/.claude/projects/-home-student-Desktop-alpimonitor/*.jsonl`) were executed with `DATABASE_URL="postgresql://alpimonitor:alpimonitor_dev@localhost:5432/alpimonitor"`. No invocation targeted `alpimonitor.fr` or the production credentials. Evidence: `grep -iE "(prisma.*seed|migrate.*reset|DATABASE_URL.*alpimonitor\.fr)"` over the session logs returned only localhost-scoped matches.
## Root cause
Undetermined. The forensic trail rules out the Claude Code tooling but does not identify the concrete command responsible. Plausible candidates, none proven:

- An interactive `pnpm prisma db seed` run from a shell that had sourced `.env.production` earlier.
- A one-off `docker exec` into the API container, with a stale seed list in an older image layer.
- A Coolify "Run command" invocation against a container whose env was `.env.production`.

Because none of these are verifiable after the fact (no audit log captured `DATABASE_URL` at the time of the delete), we stop chasing ghosts and make the system converge back to a known-good state deterministically.
## Resolution
### A. Defensive boot (this commit)
`apps/api/entrypoint.sh`:

- `prisma migrate deploy` — mandatory, fatal on failure (`set -eu`). Starting the API against a stale schema would be worse than not starting.
- `prisma db seed` — runs iff `SEED_ON_BOOT=true`. Tolerant to failure (`if ! cmd; then WARN`): a broken seed must not take the API offline; `/health` should stay up while ops investigates.
- `exec node dist/index.js` — `exec` so `tini` (PID 1) forwards signals straight to Node.
The seed is idempotent (`upsert` on every row, keyed by `ofevCode` / compound keys), so re-running on every boot is safe. `pruneStaleStations` still runs, but its input is the current seed list, so a boot-time run only deletes rows the seed list declares stale — the same behaviour as a local `pnpm prisma:seed`.
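The boot contract can be sketched as follows. This is an illustrative reconstruction, not the shipped `entrypoint.sh`: the Prisma CLI is stubbed with a shell function so only the control flow runs, and the flow is wrapped in a `boot` function so both branches can be exercised.

```shell
#!/bin/sh
# Sketch of the apps/api/entrypoint.sh boot contract (prisma stubbed).
set -eu

prisma() { echo "prisma $*"; }   # stub: the real script calls the Prisma CLI

boot() {
  # prisma db seed spawns `tsx` by bare name; make the workspace bin visible.
  export PATH="/app/node_modules/.bin:$PATH"

  # 1. Mandatory: under `set -eu`, a failed migrate aborts the boot.
  prisma migrate deploy

  # 2. Opt-in and tolerant: a broken seed warns but must not block /health.
  if [ "${SEED_ON_BOOT:-}" = "true" ]; then
    if ! prisma db seed; then
      echo "[boot] WARN: seed failed, starting API anyway" >&2
    fi
  else
    echo "[boot] SEED_ON_BOOT not set to 'true' — skipping seed"
  fi

  # 3. The real script ends with: exec node dist/index.js
  #    (exec'd so tini, PID 1, forwards signals straight to Node)
  echo "exec node dist/index.js"
}

seeded="$(SEED_ON_BOOT=true; boot)"
skipped="$(SEED_ON_BOOT=; boot)"
```

The two captured runs mirror the first two validation scenarios below: seed enabled, and seed skipped when the flag is unset.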
### B. `tsx` promoted to runtime dependency
`apps/api/package.json`: `tsx` moved from `devDependencies` to `dependencies`. `pnpm deploy --prod --legacy /prod/api` now keeps it, so `prisma db seed` can spawn the seed script (`"prisma": { "seed": "tsx prisma/seed.ts" }`) inside the runtime image. The 12:45 UTC `spawn tsx ENOENT` failure is the exact symptom this fixes.
### C. `PATH` export in the entrypoint
`export PATH="/app/node_modules/.bin:$PATH"` — `prisma db seed` spawns `tsx` by bare name and relies on `PATH`, which is not guaranteed to include the workspace bin dir under `node:20-alpine`. Explicit is better than implicit.
### D. Compose wiring
`docker-compose.prod.yml`: `SEED_ON_BOOT: ${SEED_ON_BOOT:-true}` defaults the flag on for this deployment (demo/POC phase, before real operational data exists). `.env.production.example` documents the flag and tells operators to flip it to `false` once real data must not be overwritten by seed fixtures.
## Local validation before push
Three scenarios run against the dev Postgres container:
| Scenario | Expected | Observed |
|---|---|---|
| `SEED_ON_BOOT=true` | migrate no-op, seed restores rows, server starts | `Seed complete: { catchments: 1, stations: 7, stationsLive: 4, stationsResearch: 3, sensors: 14, thresholds: 7, glaciers: 2, stationGlaciers: 6, withdrawals: 2 }`, server up, clean SIGTERM. ✓ |
| `SEED_ON_BOOT` unset | migrate runs, seed skipped, server starts | `[boot] SEED_ON_BOOT not set to 'true' — skipping seed`, server up. ✓ |
| bad `DATABASE_URL` | migrate fails fatally, container exits | `Error: P1000` authentication failed, exit non-zero. ✓ |
## Prevention
- **Entrypoint contract:** every boot re-applies migrations and, if enabled, re-seeds context. Data loss no longer lingers past a restart.
- **Seed idempotency:** enforced by `upsert` on stable natural keys (`ofevCode` for stations, compound `{stationId, glacierId}` for junctions, etc.). Replay is free of side effects beyond converging to the declared fixture set.
- **Runtime dependencies auditable:** `tsx` being a runtime dep is now visible in `apps/api/package.json` — no more surprise `ENOENT` when ops tries to run the seed.
- **Separation of data classes:** the seed only governs the context tables (stations, glaciers, thresholds). The operational tables (`Measurement`, `Alert`, `IngestionRun`) are owned by the ingestion cron and untouched by the seed. A seed replay therefore cannot delete ingested measurements older than the current run — it can only delete them as a side effect of `pruneStaleStations` for stations removed from the seed list, which is the intended behaviour.
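The idempotency property can be shown in miniature. This models the upsert on an in-memory map keyed by `ofevCode` rather than the real `prisma.station.upsert`, and the fixture values are invented for illustration:

```typescript
// Idempotency in miniature: replaying the same fixture set converges to
// the same state, because every write is keyed on a stable natural key.
type StationFixture = { ofevCode: string; name: string };

function upsertAll(db: Map<string, StationFixture>, fixtures: StationFixture[]): void {
  for (const s of fixtures) db.set(s.ofevCode, s); // create-or-update by natural key
}

const db = new Map<string, StationFixture>();
const fixtures: StationFixture[] = [
  { ofevCode: "2009", name: "Station A" },
  { ofevCode: "2174", name: "Station B" },
];
upsertAll(db, fixtures);
upsertAll(db, fixtures); // replay: no duplicates, identical final state
```

This is the property that makes a seed-on-boot entrypoint safe: a second, third, or hundredth boot converges instead of accumulating rows.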
## Lessons
- **Runtime vs. dev dependencies matter for operational tooling.** Anything ops might need to run in a production container (seed, one-off migrations, debug scripts) must live in `dependencies`, not `devDependencies`. `pnpm deploy --prod` is unforgiving.
- **Self-healing beats forensic archaeology.** We spent 30 min trying to pin the exact deletion command and could not. The 30 min spent making the system re-converge on boot has a much better ROI.
- **Idempotent seeds are a safety net, not a shortcut.** The seed was already idempotent for good reasons (dev velocity, no "reset your DB" ritual). That property is what let us ship a seed-on-boot entrypoint with confidence in a single afternoon.
- **Session-level audit trails pay off.** The Claude Code JSONL transcripts + shell history were enough to rule out the pair-programming tooling in under five minutes. Keep them.
- **Demo phase ≠ test phase.** A pre-release product still earns production-grade incident hygiene: timeline, evidence, resolution, prevention. Write it down. This document is part of the deliverable.
## Follow-ups
- Flip `SEED_ON_BOOT` to `false` once the product leaves the demo phase and operational data (acknowledged alerts, thresholds tuned by domain experts) must not be overwritten by fixture re-runs.
- Add a `scripts/prod-shell-safety.md` note listing exactly which commands are safe to run against a live `DATABASE_URL` and which are not (`prisma migrate reset`: never; `prisma db seed`: only when `pruneStaleStations` matches the live station set).
- Consider a pre-flight check in `seed.ts` that refuses to run if the `DATABASE_URL` host resolves to a non-local address and `ALLOW_PROD_SEED=1` is not set — belt-and-braces for the next contributor.
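The proposed guard could look roughly like this. The `seedAllowed` helper and its local-host heuristic are assumptions; only the `ALLOW_PROD_SEED=1` flag name comes from the follow-up above. Hostname extraction relies on the WHATWG `URL` API, which parses `postgresql://` connection strings because they carry a `//` authority:

```typescript
// Hypothetical pre-flight guard for apps/api/prisma/seed.ts: refuse to
// seed a non-local database unless the operator explicitly opts in.
function seedAllowed(
  databaseUrl: string,
  env: Record<string, string | undefined>,
): boolean {
  const host = new URL(databaseUrl).hostname;
  const isLocal = host === "localhost" || host === "127.0.0.1" || host === "[::1]";
  return isLocal || env.ALLOW_PROD_SEED === "1";
}

// Dev database: always allowed.
const dev = seedAllowed("postgresql://alpimonitor:pw@localhost:5432/alpimonitor", {});
// Remote host without the escape hatch: refused.
const prod = seedAllowed("postgresql://u:p@db.alpimonitor.fr:5432/alpimonitor", {});
```

In the real `seed.ts` this would sit before any Prisma call, throwing (and thus aborting the seed) when `seedAllowed` returns false.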