Fix wedged-container outage: TCP healthcheck + tini-managed PID 1 #1
Summary
Fixes the `transcribe.shadowdao.com-01` outage from 2026-05-04 (~17h wedged container after a 10s postgresql restart). Three coordinated changes to the `cnoc` image:

- Healthcheck (`Dockerfile`): replaced `wget --spider http://localhost/` (which passed on the redirect-only port-80 vhost regardless of Node's state) with TCP probes on `127.0.0.1:3000` and `:80`. Tenant-agnostic, with no `/ping` dependency. Catches the exact incident scenario: when pm2's `--no-daemon` process exits, port 3000 stops listening → unhealthy.
- Entrypoint (`scripts/entrypoint.sh`): `exec tini -- su - $user -c "...pm2..."` so that tini becomes PID 1 and pm2 runs in the foreground beneath it. When pm2 exhausts `max_restarts` and exits, PID 1 exits too and Docker's `unless-stopped` policy restarts the container, which is the actual recovery mechanism. Logs are tailed in the background with `-F` so logrotate-recreated files keep streaming.
- tini (`Dockerfile`): added from EPEL for proper signal forwarding to pm2 and zombie reaping of nginx/crond/memcached children that reparent to PID 1.

Out of scope (deliberately, per the incident report's Fixes 3–5):
`generate-ecosystem-config.sh` backoff defaults, patching existing tenants' `ecosystem.config.js`, and app-side DB retry. With the changes in this PR, pm2 hitting `max_restarts: 5` results in a clean ~30–60s container restart instead of a 17-hour wedge, so those further fixes can land separately as nice-to-haves.

Test plan (staging)
- [ ] Build the image (`cnoc:node22-staging`); pull on staging host
- [ ] `docker exec <c> ps -ef --forest | head`: PID 1 is `tini`, not `bash entrypoint.sh`
- [ ] `Health.Status` = `healthy`
- [ ] `docker exec <c> su - <user> -c "pm2 stop all"`: `RestartCount` increments within ~30s
- [ ] `docker exec <c> nginx -s stop`: `Health.Status` flips to `unhealthy` within ~100s
- [ ] `docker restart postgresql` on staging: container survives via auto-restart, comes back in ~30–60s
- [ ] `truncate -s 0 .../out.log`, write a new line, confirm it shows in `docker logs`
- [ ] Existing tenant's `ecosystem.config.js` still starts cleanly
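
The health and restart checks in the test plan could be scripted rather than watched by hand. The following is a minimal sketch, not part of this PR: the helper names `health_status`, `restart_count`, and `wait_for_health` are hypothetical, and it assumes bash plus a `docker` CLI on the staging host. It only reads state through `docker inspect --format`.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for the staging test plan, not shipped in the image.
set -u

# health_status <container>: print the Docker health status
# (starting, healthy, or unhealthy).
health_status() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}

# restart_count <container>: print how many times Docker has restarted the container.
restart_count() {
  docker inspect --format '{{.RestartCount}}' "$1"
}

# wait_for_health <container> <expected> <timeout_s> [interval_s]
# Returns 0 once the status equals <expected>, 1 if the timeout elapses first.
wait_for_health() {
  local container=$1 expected=$2 timeout=$3 interval=${4:-5}
  local waited=0 status
  while :; do
    # Treat an inspect failure (e.g. container mid-restart) as "unknown" and keep polling.
    status=$(health_status "$container" 2>/dev/null) || status=unknown
    if [ "$status" = "$expected" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s; last status: $status" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

For example, after `docker exec <c> nginx -s stop`, running `wait_for_health <c> unhealthy 100` would cover the "~100s" item, and comparing `restart_count <c>` before and after `pm2 stop all` would cover the `RestartCount` item.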