A 10s PostgreSQL restart took down transcribe.shadowdao.com-01 for ~17h
because pm2 gave up after 5 fast retries, the entrypoint's trailing
tail -f kept PID 1 alive, and the healthcheck (wget --spider on nginx
port 80) succeeded on the 301-to-https redirect regardless of whether
Node was alive.

Three coordinated fixes to the cnoc image:

- HEALTHCHECK: replace the redirect-passing wget probe with TCP-level
  checks on 127.0.0.1:3000 (Node) and :80 (nginx). Tenant-agnostic, with
  no /ping dependency; catches the exact incident scenario (port 3000
  closed when pm2 exits).
- entrypoint.sh: exec pm2 via tini so it becomes PID 1. When pm2
  exhausts max_restarts and exits, the container exits and the
  unless-stopped restart policy brings it back. Logs are tailed in the
  background with -F (logrotate-safe).
- Dockerfile: install tini from EPEL for proper signal forwarding and
  zombie reaping of nginx/crond children that reparent to PID 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
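For illustration, a TCP-level probe of the kind described in the HEALTHCHECK fix could look roughly like the sketch below. This is not the script from the commit: the function names port_open and check_ports are hypothetical, and it assumes bash (for its /dev/tcp redirection) and coreutils timeout are present in the image, which the message does not state.

```shell
# Hypothetical helper names; a sketch, not the healthcheck shipped in the image.

# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
port_open() {
  # /dev/tcp/HOST/PORT is a bash redirection feature (assumed available).
  # Inside the child shell, $0 is the host and $1 is the port; the socket
  # is closed automatically when that child shell exits.
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Return 0 only if every listed port on 127.0.0.1 accepts a connection.
check_ports() {
  for p in "$@"; do
    port_open 127.0.0.1 "$p" || return 1
  done
  return 0
}
```

A healthcheck command built on this would call check_ports 3000 80 and exit non-zero on failure. Because the probe only opens a socket, a 301 redirect from nginx cannot mask a dead Node process: a closed port 3000 fails the check on its own.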
3.3 KiB
Executable File