Fix wedged-container outage: TCP healthcheck + tini-managed PID 1 #1

Merged
jknapp merged 1 commit from fix/wedged-container-outage into main 2026-05-05 14:30:12 +00:00

Summary

Fixes the transcribe.shadowdao.com-01 outage from 2026-05-04 (~17h wedged container after a 10s postgresql restart). Three coordinated changes to the cnoc image:

  • Healthcheck (Dockerfile): replaced wget --spider http://localhost/ (which passed on the redirect-only port-80 vhost regardless of Node's state) with TCP probes on 127.0.0.1:3000 and :80. Tenant-agnostic — no /ping dependency. Catches the exact incident scenario: when pm2's --no-daemon process exits, port 3000 stops listening → unhealthy.
  • PID 1 (scripts/entrypoint.sh): exec tini -- su - $user -c "...pm2..." so tini becomes PID 1 with pm2 running in the foreground beneath it. When pm2 exhausts max_restarts and exits, tini (PID 1) exits with it and Docker's unless-stopped policy restarts the container, which is the actual recovery mechanism. Logs are tailed in the background with tail -F so logrotate-recreated files keep streaming.
  • tini (Dockerfile): added from EPEL for proper signal forwarding to pm2 and zombie reaping of nginx/crond/memcached children that reparent to PID 1.
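
For readers unfamiliar with TCP-level probing, the healthcheck idea can be sketched in plain bash. This is a minimal illustration, not the command actually committed to the Dockerfile: the function names probe_tcp and healthcheck are hypothetical, and it assumes bash (for /dev/tcp redirection) and coreutils timeout are present in the image.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not taken from the PR.

# probe_tcp HOST PORT
# Succeeds only if a TCP connection to HOST:PORT can be opened.
# Uses bash's /dev/tcp redirection; `timeout` bounds a hung connect.
probe_tcp() {
  local host=$1 port=$2
  timeout 3 bash -c 'exec 3<>/dev/tcp/$0/$1' "$host" "$port" 2>/dev/null
}

# healthcheck [NODE_PORT] [NGINX_PORT]
# Healthy only when both Node and nginx accept connections.
# Defaults mirror the ports named in the summary (3000 and 80).
healthcheck() {
  local node_port=${1:-3000} nginx_port=${2:-80}
  probe_tcp 127.0.0.1 "$node_port" && probe_tcp 127.0.0.1 "$nginx_port"
}
```

Because the probe only checks that a listener exists, it needs no application route such as /ping, which is what makes it tenant-agnostic. A script like this would typically end with a call to healthcheck so that its exit status is what Docker sees.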

Out of scope (deliberately, per the incident report's Fixes 3–5): generate-ecosystem-config.sh backoff defaults, patching existing tenants' ecosystem.config.js, and app-side DB retry. With the changes in this PR, pm2 hitting max_restarts: 5 results in a clean ~30–60s container restart instead of a 17-hour wedge — those further fixes can land separately as nice-to-haves.
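
The exit propagation behind that clean restart can be sketched in a few lines of shell. This is an illustration of the pattern, not the contents of scripts/entrypoint.sh: run_foreground and its arguments are hypothetical names, and the --pid flag (GNU tail) is an addition so the background tail cleans up when the sketch is run outside a container.

```shell
#!/usr/bin/env bash
# Sketch only: run_foreground is a hypothetical helper, not the PR's entrypoint.

# run_foreground LOGFILE CMD [ARGS...]
# Streams LOGFILE in the background, then replaces the shell with CMD.
run_foreground() {
  local log=$1
  shift
  # -F follows the file by name, so a log recreated by logrotate keeps streaming.
  # --pid makes GNU tail exit once this process is gone (assumption added for
  # tidiness; inside a container everything dies with PID 1 anyway).
  tail --pid="$BASHPID" -n 0 -F "$log" &
  # exec replaces this shell, so CMD's exit status becomes the process's exit
  # status. A trailing foreground `tail -f` would instead keep the process
  # alive after CMD had already given up.
  exec "$@"
}
```

In the PR's arrangement the command handed to exec is tini, which in turn runs pm2, so pm2 giving up ends PID 1 and the restart policy takes over.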

Test plan (staging)

  • Build and push under a test tag (e.g. cnoc:node22-staging); pull on staging host
  • docker exec <c> ps -ef --forest | head — PID 1 is tini, not bash entrypoint.sh
  • Wait past 60s start_period, confirm Health.Status = healthy
  • docker exec <c> su - <user> -c "pm2 stop all" → RestartCount increments within ~30s
  • docker exec <c> nginx -s stop → Health.Status flips to unhealthy within ~100s
  • Postgres-blip simulation: docker restart postgresql on staging — container survives via auto-restart, comes back in ~30–60s
  • Logrotate test: truncate -s 0 .../out.log, write a new line, confirm it shows in docker logs
  • Confirm a tenant shipping its own ecosystem.config.js still starts cleanly
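
Several of the checks above amount to "poll until a value appears or a deadline passes". A small helper along these lines could make them scriptable. It is a sketch under assumptions: wait_until is a hypothetical name, and the docker inspect invocations in the comments are examples of how it might be used, not commands taken from this PR.

```shell
#!/usr/bin/env bash
# Sketch only: wait_until is a hypothetical helper for the staging checks.

# wait_until TIMEOUT_SECONDS EXPECTED CMD [ARGS...]
# Re-runs CMD once per second until its stdout equals EXPECTED (return 0)
# or TIMEOUT_SECONDS have elapsed (return 1).
wait_until() {
  local timeout=$1 expected=$2
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  local actual
  while :; do
    actual=$("$@" 2>/dev/null)
    [ "$actual" = "$expected" ] && return 0
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 1
  done
}

# Example usage (container name supplied by the tester):
#   wait_until 120 healthy   docker inspect --format '{{.State.Health.Status}}' "$c"
#   wait_until 120 unhealthy docker inspect --format '{{.State.Health.Status}}' "$c"
#   wait_until 60  1         docker inspect --format '{{.RestartCount}}' "$c"
```

The timeouts in the usage comments are placeholders; the plan's own figures (60s start_period, ~30s, ~100s) are the ones to check against.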
jknapp added 1 commit 2026-05-05 14:23:15 +00:00
A 10s postgresql restart took down transcribe.shadowdao.com-01 for ~17h
because pm2 gave up after 5 fast retries, the entrypoint's trailing
tail -f kept PID 1 alive, and the healthcheck (wget --spider on nginx
port 80) succeeded on the 301-to-https redirect regardless of whether
Node was alive.

Three coordinated fixes to the cnoc image:

- HEALTHCHECK: replace the redirect-passing wget probe with TCP-level
  checks on 127.0.0.1:3000 (Node) and :80 (nginx). Tenant-agnostic, no
  /ping dependency — catches the exact incident scenario (port 3000
  closed when pm2 exits).
- entrypoint.sh: exec pm2 via tini so it becomes PID 1. When pm2
  exhausts max_restarts and exits, the container exits and the
  unless-stopped restart policy brings it back. Logs are tailed in the
  background with -F (logrotate-safe).
- Dockerfile: install tini from EPEL for proper signal forwarding and
  zombie reaping of nginx/crond children that reparent to PID 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jknapp merged commit bdea2939c7 into main 2026-05-05 14:30:12 +00:00
jknapp deleted branch fix/wedged-container-outage 2026-05-05 14:32:45 +00:00

Reference: cloud-hosting-platform/cloud-node-container#1