Fix wedged-container outage: TCP healthcheck + tini-managed PID 1
A 10s PostgreSQL restart took down transcribe.shadowdao.com-01 for ~17h
because pm2 gave up after 5 fast retries, the entrypoint's trailing
tail -f kept PID 1 alive, and the healthcheck (wget --spider on nginx
port 80) succeeded on the 301-to-https redirect regardless of whether
Node was alive.

Three coordinated fixes to the cnoc image:

- HEALTHCHECK: replace the redirect-passing wget probe with TCP-level
  checks on 127.0.0.1:3000 (Node) and :80 (nginx). Tenant-agnostic, with
  no /ping dependency; catches the exact incident scenario (port 3000
  closed when pm2 exits).
- entrypoint.sh: exec tini in front of pm2 so that tini becomes PID 1
  and pm2 runs directly beneath it. When pm2 exhausts max_restarts and
  exits, tini exits, the container exits, and the unless-stopped restart
  policy brings it back. Logs are tailed in the background with -F
  (logrotate-safe).
- Dockerfile: install tini from EPEL for proper signal forwarding and
  zombie reaping of nginx/crond children that reparent to PID 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -5,7 +5,7 @@ ARG NODEVER=20
 RUN dnf install -y \
 https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
 dnf update -y && \
-dnf install -y wget procps cronie iproute nginx openssl git microdnf make gcc gcc-c++ && \
+dnf install -y wget procps cronie iproute nginx openssl git microdnf make gcc gcc-c++ tini && \
 dnf group install -y 'Development Tools' && \
 dnf clean all && \
 rm -rf /var/cache/dnf /usr/share/doc /usr/share/man /usr/share/locale/* \
@@ -36,6 +36,7 @@ COPY ./examples/ /examples/
 RUN echo "15 */12 * * * root /scripts/log-rotate.sh" >> /etc/crontab
 
 HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
-CMD wget --spider -q http://localhost/ || exit 1
+CMD bash -c ': </dev/tcp/127.0.0.1/3000 && : </dev/tcp/127.0.0.1/80' \
+|| exit 1
 
 ENTRYPOINT [ "/scripts/entrypoint.sh" ]
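The new probe relies on bash's /dev/tcp pseudo-device: redirecting from /dev/tcp/HOST/PORT makes bash attempt a TCP connection, and the redirection fails if nothing is listening. A minimal sketch of the same check as reusable functions follows. The names port_open, check_health, NODE_PORT and NGINX_PORT are hypothetical, introduced here for illustration only, and are not part of the image.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if something accepts TCP
# connections on host:port.
port_open() {
    local host=$1 port=$2
    # Opening /dev/tcp/<host>/<port> makes bash attempt a TCP connect.
    # The subshell closes the descriptor again as soon as it exits.
    (: </dev/tcp/"$host"/"$port") 2>/dev/null
}

# Hypothetical wrapper: both listeners must be up to count as healthy.
# The port defaults mirror the values used in the HEALTHCHECK above.
check_health() {
    port_open 127.0.0.1 "${NODE_PORT:-3000}" &&
        port_open 127.0.0.1 "${NGINX_PORT:-80}"
}
```

A healthcheck would call this as `check_health || exit 1`. A successful connect only shows that a listener exists on the port; it does not show that the application behind it is answering requests.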
@@ -89,20 +89,18 @@ if [ ! -f /home/$user/app/ecosystem.config.js ]; then
 chown $user:$user /home/$user/app/ecosystem.config.js
 fi
 
-# Start PM2 as the user with HOME environment set
-echo "Starting PM2 as user $user..."
+# Mirror logs to docker logs in the background.
+# Use -F (capital) so logrotate-recreated files keep streaming.
+tail -F /home/$user/logs/nginx/access.log \
+/home/$user/logs/nginx/error.log \
+/home/$user/logs/nodejs/app.log \
+/home/$user/logs/nodejs/out.log \
+/home/$user/logs/nodejs/error.log 2>/dev/null &
+
+# Start PM2 under tini so it becomes PID 1 (with proper signal forwarding
+# and zombie reaping for nginx/crond/memcached children that reparent here).
+# When pm2 exits (e.g. max_restarts exhausted), tini exits and Docker's
+# restart policy brings the container back.
+echo "Starting PM2 as user $user (under tini as PID 1)..."
 cd /home/$user/app
-# Use su with login shell to ensure clean environment
-su - $user -c "cd /home/$user/app && NODE_ENV=production pm2 start ecosystem.config.js --no-daemon" &
-
-# Give PM2 time to start
-sleep 5
-
-# Check if the app is running
-echo "Checking PM2 status..."
-su -c "pm2 status" $user
-
-# Follow logs
-tail -f /home/$user/logs/nginx/* /home/$user/logs/nodejs/*
-
-exit 0
+exec tini -- su - $user -c "cd /home/$user/app && NODE_ENV=production pm2 start ecosystem.config.js --no-daemon"
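The new entrypoint depends on `exec` replacing the shell in place instead of forking a child: the entrypoint script starts as the container's PID 1, so the exec'd tini inherits that PID. The sketch below demonstrates that property outside a container. show_exec_pid is a hypothetical helper, and a plain sh stands in for tini.

```shell
#!/bin/sh
# Hypothetical helper: print a shell's PID, then exec a second shell
# that prints its own PID. exec replaces the process image without
# creating a new process, so both lines report the same number.
show_exec_pid() {
    sh -c 'echo "before=$$"; exec sh -c "echo after=\$\$"'
}
```

Running `show_exec_pid` prints a `before=` line and an `after=` line carrying the same PID. Without `exec`, the second shell would normally run as a child with a different PID, which is how the previous entrypoint left the script itself (blocked on its trailing tail -f) as PID 1.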