From b431a66a7b1efdb1796054bee698ad1024d10e6e Mon Sep 17 00:00:00 2001
From: Josh Knapp
Date: Tue, 5 May 2026 06:59:52 -0700
Subject: [PATCH] Fix wedged-container outage: TCP healthcheck + tini-managed PID 1
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A 10s postgresql restart took down transcribe.shadowdao.com-01 for ~17h
because pm2 gave up after 5 fast retries, the entrypoint's trailing
tail -f kept PID 1 alive, and the healthcheck (wget --spider on nginx
port 80) succeeded on the 301-to-https redirect regardless of whether
Node was alive.

Three coordinated fixes to the cnoc image:

- HEALTHCHECK: replace the redirect-passing wget probe with TCP-level
  checks on 127.0.0.1:3000 (Node) and :80 (nginx). Tenant-agnostic, no
  /ping dependency; catches the exact incident scenario (port 3000
  closed when pm2 exits).
- entrypoint.sh: exec pm2 via tini so it becomes PID 1. When pm2
  exhausts max_restarts and exits, the container exits and the
  unless-stopped restart policy brings it back. Logs are tailed in the
  background with -F (logrotate-safe).
- Dockerfile: install tini from EPEL for proper signal forwarding and
  zombie reaping of nginx/crond children that reparent to PID 1.
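
For readers unfamiliar with TCP-level probing, the idea behind the new
healthcheck can be sketched roughly as below. This is an illustration
only, not the HEALTHCHECK command from this patch: the helper names
port_open and probe_all are hypothetical, and the sketch assumes a bash
built with /dev/tcp support plus coreutils timeout.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating a TCP-level liveness probe.
# NOT the command used in the Dockerfile; names and the 2s timeout are assumptions.

# port_open HOST PORT: return 0 if a TCP connection can be opened, non-zero otherwise.
port_open() {
    local host=$1 port=$2
    # Redirecting from /dev/tcp/HOST/PORT makes bash attempt a TCP connect.
    # The child shell closes the socket on exit; timeout bounds a hung connect.
    timeout 2 bash -c ": </dev/tcp/${host}/${port}" 2>/dev/null
}

# probe_all HOST PORT...: succeed only if every listed port accepts a connection.
probe_all() {
    local host=$1; shift
    local port
    for port in "$@"; do
        port_open "$host" "$port" || return 1
    done
}
```

A healthcheck built on this would call something like
`probe_all 127.0.0.1 3000 80 || exit 1`. Unlike an HTTP probe answered
by nginx's redirect, the connect to port 3000 fails as soon as nothing
is listening there.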

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 Dockerfile            |  5 +++--
 scripts/entrypoint.sh | 30 ++++++++++++++----------------
 2 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index ea8cf68..7dfeaaf 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -5,7 +5,7 @@ ARG NODEVER=20
 RUN dnf install -y \
     https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
     dnf update -y && \
-    dnf install -y wget procps cronie iproute nginx openssl git microdnf make gcc gcc-c++ && \
+    dnf install -y wget procps cronie iproute nginx openssl git microdnf make gcc gcc-c++ tini && \
     dnf group install -y 'Development Tools' && \
     dnf clean all && \
     rm -rf /var/cache/dnf /usr/share/doc /usr/share/man /usr/share/locale/* \
@@ -36,6 +36,7 @@ COPY ./examples/ /examples/
 RUN echo "15 */12 * * * root /scripts/log-rotate.sh" >> /etc/crontab
 HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
-    CMD wget --spider -q http://localhost/ || exit 1
+    CMD bash -c ': /dev/null &
+
+# Start PM2 under tini so it becomes PID 1 (with proper signal forwarding
+# and zombie reaping for nginx/crond/memcached children that reparent here).
+# When pm2 exits (e.g. max_restarts exhausted), tini exits and Docker's
+# restart policy brings the container back.
+echo "Starting PM2 as user $user (under tini as PID 1)..."
 cd /home/$user/app
-# Use su with login shell to ensure clean environment
-su - $user -c "cd /home/$user/app && NODE_ENV=production pm2 start ecosystem.config.js --no-daemon" &
-
-# Give PM2 time to start
-sleep 5
-
-# Check if the app is running
-echo "Checking PM2 status..."
-su -c "pm2 status" $user
-
-# Follow logs
-tail -f /home/$user/logs/nginx/* /home/$user/logs/nodejs/*
-
-exit 0
\ No newline at end of file
+exec tini -- su - $user -c "cd /home/$user/app && NODE_ENV=production pm2 start ecosystem.config.js --no-daemon"
\ No newline at end of file