swap werkzeug dev server for gunicorn + accept all HTTP methods on default/blocked pages

Two related fixes for the issues the AI Monitor surfaced on whp01 on
2026-05-12 (haproxy-manager going "healthy but stalled" after long
uptime, and noise from POST /blocked-ip returning 405):

1. Production WSGI server. The Flask app was running on werkzeug's
   built-in dev server (the one that prints "WARNING: This is a
   development server" on every startup). werkzeug is single-threaded
   and accumulates worker state over long uptimes; after ~24h on whp01
   the health endpoint stops responding while the container still
   reports "healthy" because Docker's HEALTHCHECK uses an HTTP probe
   from inside the same werkzeug process that's stalled.

   Replace with gunicorn (gthread worker class, --max-requests=1000
   with jitter so workers recycle periodically). Two gunicorn instances,
   one per Flask app — port 8000 for the management API, port 8080 for
   the default/blocked-ip page server. Both lift their app objects from
   the haproxy_manager module so gunicorn can import them.

   Required structural change: default_app was created INSIDE the
   __name__ == '__main__' block at module bottom, where gunicorn could
   never reach it. Moved to module level. The __main__ block now stays
   only for `python haproxy_manager.py` local-dev workflow.
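
   The restructured module bottom looks roughly like this (a minimal
   sketch: only the `app` and `default_app` names come from this commit,
   everything else is illustrative):

   ```python
   # haproxy_manager.py, module bottom (sketch; only `app` and `default_app`
   # are names from the commit, the rest is illustrative).
   import os

   from flask import Flask

   app = Flask("haproxy_manager")          # management API, gunicorn binds :8000

   # Module level, NOT inside the __main__ guard, so
   # `gunicorn 'haproxy_manager:default_app'` can import it.
   default_app = Flask("haproxy_default")  # default/blocked-ip pages, :8080

   if __name__ == "__main__":
       # Local-dev workflow only (`python haproxy_manager.py`); production
       # serving is gunicorn's job. The env guard is just so importing or
       # executing this sketch never starts a blocking dev server.
       if os.environ.get("HAPROXY_MGR_DEV"):
           default_app.run(host="0.0.0.0", port=8080, debug=True)
   ```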

   Container init (init_db, certbot register, generate_config,
   start_haproxy) extracted into a do_initial_setup() function called
   from a new scripts/init.py. start-up.sh runs init.py to completion
   before either gunicorn binds, which keeps HAProxy startup off the
   WSGI workers' fork paths (no race between workers all trying to
   start_haproxy() at once).
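
   With stub bodies, the extracted function is roughly this (the four
   step names are from the commit message; `certbot_register` is a
   hypothetical spelling of the certbot step, and the stubs only exist
   to make the ordering visible):

   ```python
   # do_initial_setup(): init logic extracted from haproxy_manager's old
   # __main__ block. Step functions are stubs so the ordering is visible.
   calls = []

   def init_db():
       calls.append("init_db")            # create/migrate the DB schema

   def certbot_register():                # hypothetical name for the certbot step
       calls.append("certbot_register")

   def generate_config():
       calls.append("generate_config")    # render haproxy.cfg

   def start_haproxy():
       calls.append("start_haproxy")      # exactly one haproxy start, pre-fork

   def do_initial_setup():
       # Called once from scripts/init.py; start-up.sh waits for it to finish
       # before either gunicorn binds, so WSGI workers never race to start
       # haproxy themselves.
       init_db()
       certbot_register()
       generate_config()
       start_haproxy()
   ```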

2. /blocked-ip and / accept ALL methods. HAProxy proxies blocked-IP
   traffic to default_app preserving the original verb, so a blocked
   POST used to hit Flask's GET-only route and draw a 405, which the
   AI Monitor flagged as noise. Adding the full method list lets the
   403 page render regardless of verb.
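
   The route change amounts to something like this (a sketch: the two
   paths are from the commit, the handler bodies and the exact method
   list are assumed):

   ```python
   # Accept every verb on / and /blocked-ip so HAProxy-forwarded requests
   # never 405 before the page can render. Handler bodies are placeholders.
   from flask import Flask

   default_app = Flask(__name__)

   ALL_METHODS = ["GET", "POST", "PUT", "PATCH", "DELETE", "HEAD", "OPTIONS"]

   @default_app.route("/", methods=ALL_METHODS)
   def default_page():
       return "default backend page", 200   # placeholder body

   @default_app.route("/blocked-ip", methods=ALL_METHODS)
   def blocked_ip_page():
       # HAProxy preserves the original verb when proxying blocked traffic
       # here, so a blocked POST must reach this handler instead of bouncing
       # off a GET-only route with a 405.
       return "403 Forbidden", 403          # the blocked-IP 403 page
   ```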

Gunicorn settings tunable via env (workers, timeout, max-requests).
API gets --timeout 120 because ACME cert issuance can be slow; the
default page server stays on the gunicorn default 30s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:24:28 -07:00
parent 8a86beac73
commit bdd7d2f098
4 changed files with 131 additions and 38 deletions

scripts/start-up.sh (53 changes, Normal file → Executable file)

@@ -1,6 +1,20 @@
 #!/usr/bin/env bash
+# Container entrypoint. Two-phase startup:
+# 1. One-shot init (init.py): DB schema, certbot register, config gen, start HAProxy.
+#    Runs synchronously and to completion so haproxy is up before the API binds.
+# 2. WSGI serving via gunicorn (replacing the Flask dev server). Two gunicorn
+#    instances:
+#    - port 8080 -> default_app (default page + blocked-ip page; HAProxy
+#      proxies unmatched / blocked traffic here)
+#    - port 8000 -> app (management API)
+#
+# Why gunicorn:
+# Flask's built-in werkzeug "development server" is single-threaded and leaks
+# workers under sustained load. It carried haproxy-manager for a long time but
+# stalled out around 24-48h uptime ("healthy" health-check, but every request
+# queued behind a stuck worker). Gunicorn with --max-requests cycles workers
+# periodically, which prevents the slow-leak failure mode entirely.
 # Exit on error
 set -eo pipefail
 # Ensure trusted IP whitelist files exist (volume-mounted /etc/haproxy may shadow image defaults)
@@ -9,4 +23,39 @@
 mkdir -p /etc/haproxy
 [ -f /etc/haproxy/trusted_ips.map ] || : > /etc/haproxy/trusted_ips.map
 cron &
-python /haproxy/haproxy_manager.py
+# Phase 1: container init
+python /haproxy/scripts/init.py
+# Phase 2: WSGI servers
+# Tunable via env: HAPROXY_MGR_API_WORKERS (default 1), HAPROXY_MGR_API_TIMEOUT
+# (default 120 — API can do slow ACME calls), HAPROXY_MGR_MAX_REQUESTS (default
+# 1000 — worker recycle frequency).
+API_WORKERS="${HAPROXY_MGR_API_WORKERS:-1}"
+API_TIMEOUT="${HAPROXY_MGR_API_TIMEOUT:-120}"
+MAX_REQ="${HAPROXY_MGR_MAX_REQUESTS:-1000}"
+MAX_REQ_JITTER="${HAPROXY_MGR_MAX_REQUESTS_JITTER:-100}"
+# Default page server on :8080. Stays in the background.
+# --threads 4 lets one worker handle bursts of blocked-IP/default-page hits
+# without forking. --max-requests recycles the worker to bound memory drift.
+gunicorn \
+    --bind 0.0.0.0:8080 \
+    --workers 1 --threads 4 --worker-class gthread \
+    --max-requests "${MAX_REQ}" --max-requests-jitter "${MAX_REQ_JITTER}" \
+    --timeout 30 \
+    --access-logfile - --error-logfile - --log-level info \
+    --pythonpath /haproxy \
+    'haproxy_manager:default_app' &
+# Main API server on :8000 in the foreground. exec so signals propagate
+# correctly and the container exits if the API dies (docker --restart picks it
+# up). Longer --timeout because cert issuance hits ACME and can take a while.
+exec gunicorn \
+    --bind 0.0.0.0:8000 \
+    --workers "${API_WORKERS}" --threads 4 --worker-class gthread \
+    --max-requests "${MAX_REQ}" --max-requests-jitter "${MAX_REQ_JITTER}" \
+    --timeout "${API_TIMEOUT}" \
+    --access-logfile - --error-logfile - --log-level info \
+    --pythonpath /haproxy \
+    'haproxy_manager:app'