Two related fixes for the issues the AI Monitor surfaced on whp01 on
2026-05-12 (haproxy-manager going "healthy but stalled" after long
uptime, and noise from POST /blocked-ip returning 405):
1. Production WSGI server. The Flask app was running on werkzeug's
built-in dev server (the one that prints "WARNING: This is a
development server" on every startup). werkzeug is single-threaded
and accumulates worker state over long uptimes; after ~24h on whp01
the health endpoint stops responding while the container still
reports "healthy" because Docker's HEALTHCHECK uses an HTTP probe
from inside the same werkzeug process that's stalled.
Replace with gunicorn (gthread worker class, --max-requests=1000
with jitter so workers recycle periodically). Two gunicorn instances,
one per Flask app — port 8000 for the management API, port 8080 for
the default/blocked-ip page server. Both lift their app objects from
the haproxy_manager module so gunicorn can import them.
Required structural change: default_app was created INSIDE the
__name__ == '__main__' block at module bottom, where gunicorn could
never reach it. Moved to module level. The __main__ block now stays
only for `python haproxy_manager.py` local-dev workflow.
Container init (init_db, certbot register, generate_config,
start_haproxy) extracted into a do_initial_setup() function called
from a new scripts/init.py. start-up.sh runs init.py to completion
before either gunicorn binds, which keeps HAProxy startup off the
WSGI workers' fork paths (no race between workers all trying to
start_haproxy() at once).
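The resulting start-up flow, sketched (module:app names and env-var names beyond those above are assumptions):

```shell
#!/bin/sh
# Sketch of start-up.sh: init must run to completion before either
# gunicorn binds, so no worker ever races to start_haproxy().
python /haproxy/scripts/init.py || exit 1

# Management API on 8000: longer timeout because ACME issuance is slow.
gunicorn --worker-class gthread --workers "${GUNICORN_WORKERS:-2}" \
         --max-requests 1000 --max-requests-jitter 100 \
         --timeout "${GUNICORN_TIMEOUT:-120}" \
         --bind 0.0.0.0:8000 haproxy_manager:app &

# Default/blocked-ip page server on 8080: gunicorn's default 30s timeout.
gunicorn --worker-class gthread --workers 2 \
         --max-requests 1000 --max-requests-jitter 100 \
         --bind 0.0.0.0:8080 haproxy_manager:default_app &

wait
```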
2. /blocked-ip and / accept ALL methods. HAProxy proxies blocked-IP
traffic to default_app preserving the original verb, so a blocked
POST request used to hit Flask's GET-only route and draw a 405,
which the AI Monitor flagged as noise. Adding the full method list
lets the 403 page render regardless of verb.
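Illustrative sketch of the route change, assuming the app object is named default_app as above (the real route body lives in haproxy_manager):

```python
from flask import Flask

default_app = Flask(__name__)

# Accept every verb HAProxy might forward so blocked clients always get
# the 403 page instead of a 405. (Sketch; method list is illustrative.)
@default_app.route("/blocked-ip",
                   methods=["GET", "POST", "PUT", "PATCH",
                            "DELETE", "HEAD", "OPTIONS"])
def blocked_ip():
    return "Access denied", 403
```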
Gunicorn settings tunable via env (workers, timeout, max-requests).
API gets --timeout 120 because ACME cert issuance can be slow; the
default page server stays on the gunicorn default 30s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
certbot uses fasteners (fcntl-based locking) to serialize concurrent
invocations. The kernel auto-releases fcntl locks when the holding
process exits, but the .certbot.lock FILES persist on disk — and we've
seen real cases where subsequent runs report "Another instance of
Certbot is already running" even when no certbot process is alive.
Observed during the 2026-05-09 bundling rollout when a hung worker
held a lock across container-internal Python crashes.
When SSL is blocked on a customer site, this is high-impact: the
certbot lock can sit stale until somebody manually deletes it.
clear_stale_certbot_locks():
- probes each known lock path with fcntl.LOCK_NB
- if the lock is unheld → file is stale → delete it
- if the lock IS held → leave it alone (real certbot is running)
Wired in:
- container startup (init block)
- /api/ssl single-domain handler
- /api/ssl/bundle handler
- /api/certificates/renew handler
Safe to call repeatedly; never deletes a lock a real process holds, so
can never trigger concurrent certbot runs.
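The probe-then-delete logic, sketched with flock standing in for the fcntl lock certbot actually takes (lock path list is illustrative):

```python
import fcntl
import os

# Hypothetical lock paths; the real list lives in haproxy_manager.
CERTBOT_LOCK_PATHS = ["/tmp/demo/.certbot.lock"]

def clear_stale_certbot_locks(paths=CERTBOT_LOCK_PATHS):
    """Delete certbot lock files nobody holds; leave held ones alone."""
    removed = []
    for path in paths:
        if not os.path.exists(path):
            continue
        try:
            with open(path, "a") as fh:
                # Non-blocking probe: raises OSError if a live certbot
                # process holds the lock.
                fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
            os.unlink(path)      # probe succeeded -> unheld -> stale
            removed.append(path)
        except OSError:
            continue             # held by a real process -> leave it
    return removed
```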
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bundle endpoint correctly issued multi-SAN certs but left old
single-SAN .pem files (e.g. <name>-0001.pem) in /etc/haproxy/certs/.
HAProxy's `bind ... ssl crt /etc/haproxy/certs` loads everything in the
directory and picked the alphabetically-first matching file — typically
the older single-SAN one — so the new bundle had no effect on what was
served. Repro on peptidesaver.net: bundle covered 4 SANs but HAProxy
kept serving peptidesaver.net-0001.pem (single SAN, April-issued).
After a successful bundle write, walk SSL_CERTS_DIR and remove any
.pem whose CN is in the new bundle's name list (excluding the bundle's
own combined file). Drop the matching certbot lineage with
`certbot delete --cert-name <X> -n` so `certbot renew` stops touching
the dead lineage too.
Returns a `cleanup` summary in the API response so callers can log /
display what was deleted.
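A minimal sketch of the cleanup walk (helper name and the suffix-parsing rule are assumptions; the real code may match CNs differently):

```python
import os
import re

def cleanup_stale_single_san_pems(certs_dir, primary, sans):
    """Remove per-domain .pem files superseded by the new bundle.

    The bundle's own combined file <primary>.pem is excluded. A trailing
    certbot-style -NNNN suffix is stripped to recover the CN.
    """
    covered = {primary, *sans}
    bundle_file = primary + ".pem"
    deleted = []
    for fname in os.listdir(certs_dir):
        if not fname.endswith(".pem") or fname == bundle_file:
            continue
        cn = re.sub(r"-\d+$", "", fname[:-4])  # peptidesaver.net-0001 -> peptidesaver.net
        if cn in covered:
            os.unlink(os.path.join(certs_dir, fname))
            deleted.append(fname)
    return deleted
```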
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WHP's renewal orchestrator now bundles a site's domains into one cert
covering all SANs, instead of N separate single-domain orders. Single
ACME order means better behavior under Let's Encrypt's 50 orders/hour limit
when many domains need attention at once.
Endpoint: POST /api/ssl/bundle
Body: {"primary": "example.com", "sans": ["www.example.com", ...]}
- Uses --cert-name <primary> so the lineage stays stable across renewals
(no -0001/-0002 proliferation seen with the legacy single-domain flow).
- Single combined .pem at /etc/haproxy/certs/<primary>.pem; HAProxy SNI-
matches against the cert's SAN list, so one file serves all included
hostnames.
- Updates the domains table for every SAN in the bundle.
- Hard cap at 100 SANs (LE limit).
Existing /api/ssl single-domain endpoint kept for backwards compat.
The WHP haproxy_manager::bundleSSL() helper falls back to a per-domain
loop if /api/ssl/bundle returns 404, so the WHP side keeps working
during the rolling image upgrade window.
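The certbot invocation behind the endpoint isn't shown in this log; a minimal sketch of how the SAN list, --cert-name flag, and 100-SAN cap might be assembled (function name hypothetical):

```python
def build_bundle_certbot_cmd(primary, sans):
    """Assemble a single multi-SAN certbot order (sketch)."""
    # Primary first, dedup any SAN that repeats the primary.
    domains = [primary] + [s for s in sans if s != primary]
    if len(domains) > 100:
        raise ValueError("Let's Encrypt caps a certificate at 100 SANs")
    # --cert-name keeps the lineage stable across renewals
    # (no -0001/-0002 proliferation).
    cmd = ["certbot", "certonly", "--non-interactive",
           "--cert-name", primary]
    for d in domains:
        cmd += ["-d", d]
    return cmd
```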
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Volume-mounted /etc/haproxy can shadow the image-baked
trusted_ips.list/trusted_ips.map, causing HAProxy to fail
config validation with "failed to open pattern file" on
non-WHP deployments. Touch empty files if they don't exist
so the ACLs always parse.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The resolvers section was inserted inside the global section, causing
HAProxy to parse global directives (pidfile, maxconn, etc.) as
resolver keywords. Moved resolvers to its own top-level section
between global and defaults where HAProxy expects it.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When Docker containers restart, they can get new IPs on the bridge
network. HAProxy caches DNS at config load time, so stale IPs cause
503s until config is regenerated.
Added a 'docker_dns' resolvers section pointing to Docker's embedded
DNS (127.0.0.11) with 10s hold time. Backend servers now use
'resolvers docker_dns init-addr last,libc,none' so HAProxy:
- Re-resolves container names every 10 seconds
- Falls back to last known IP if DNS is temporarily unavailable
- Starts even if a backend can't be resolved yet (init-addr none)
This eliminates 503s from container restarts, scaling, and recreation
without requiring a HAProxy config regeneration.
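A sketch of the generated config (backend and server names illustrative; the resolvers block sits at top level, not inside global):

```
global
    maxconn 4096

resolvers docker_dns            # own top-level section, between global and defaults
    nameserver dns1 127.0.0.11:53
    hold valid 10s              # re-resolve container names every 10s

defaults
    mode http

backend site_backend
    server app mysite-container:80 check resolvers docker_dns init-addr last,libc,none
```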
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The renewal script was exiting immediately when certbot returned a
non-zero exit code, which happens when ANY cert fails to renew. A
single dead domain (e.g., DNS no longer pointed here) would block
ALL other certificates from being processed and combined for HAProxy.
Now logs the failures but continues to copy/combine successfully
renewed certificates and reload HAProxy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents HAProxy health checks, watchdog, rate limiting, trusted IP
whitelist, timeout hardening, HTTP/2 protection, and the AI-powered
log monitor system with two-tier analysis, auto-remediation, and
notification support.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Generous thresholds that accommodate sites with many images/assets
while still catching obvious automated floods:
- Request rate: tarpit at 300 req/s, block at 500 req/s
- Connection rate: 500/10s
- Concurrent connections: 500
- Error rate: 100/30s
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous thresholds (200/500 req/10s) were too aggressive — WordPress
login pages with their CSS/JS/image assets can easily burst 30-50
requests per page load, triggering tarpits and blocks on legitimate
users.
New thresholds:
- Request rate: tarpit at 1000/10s (100 req/s), block at 2000/10s (200 req/s)
- Connection rate: 300/10s (was 150)
- Concurrent connections: 200 (was 100)
- Error rate: 50/30s (was 20)
These still catch real floods and scanners while giving normal web
traffic plenty of headroom.
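At these thresholds the rules might look like the following (frontend name and table sizing illustrative; note the stick-table stays on one line per the later multiline-syntax fix):

```
frontend http_in
    stick-table type ip size 100k expire 10m store conn_cur,conn_rate(10s),http_req_rate(10s),http_err_rate(30s)
    http-request track-sc0 src
    http-request tarpit if { sc_http_req_rate(0) gt 1000 }
    http-request deny   if { sc_http_req_rate(0) gt 2000 }
    http-request reject if { sc_conn_rate(0) gt 300 }
    http-request deny   if { sc_conn_cur(0) gt 200 }
    http-request tarpit if { sc_http_err_rate(0) gt 50 }
```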
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds trusted_ips.list and trusted_ips.map files that exempt specific
IPs from all rate limiting rules. Supports both direct source IP
matching (is_trusted_ip) and proxy-header real IP matching
(is_whitelisted). Files are baked into the image and can be updated
by editing and rebuilding.
Adds phone system IP 172.116.197.166 to the whitelist.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gives more headroom for customers with code that makes frequent
callbacks to itself, while still catching connection floods.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Activate HAProxy's built-in attack prevention to stop floods that cause
the container to become unresponsive:
- Stick table tracks per-IP: conn_cur, conn_rate, http_req_rate, http_err_rate
- Rate limit rules: deny at 50 req/s, tarpit at 20 req/s, connection
rate limit at 60/10s, concurrent connection cap at 100, error rate
tarpit at 20 errors/30s
- Harden timeouts: http-request 300s→30s, connect 120s→10s, client
10m→5m, keep-alive 120s→30s
- HTTP/2 Rapid Reset protection (CVE-2023-44487): stream and glitch limits
- Stats frontend on localhost:8404 for monitoring
- HEALTHCHECK now validates both port 80 (HAProxy) and 8000 (API)
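The hardened timeouts as config (sketch):

```
defaults
    timeout http-request 30s       # was 300s
    timeout connect 10s            # was 120s
    timeout client 5m              # was 10m
    timeout http-keep-alive 30s    # was 120s
```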
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Captures the Host header in HAProxy httplog output so high-connection
alerts can be correlated to specific domains.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add find_certbot_live_dir() helper to locate the most recent certbot live
directory for a domain, handling -NNNN suffixed dirs from repeated requests.
Fix combined cert filename from *.domain.pem to _wildcard_.domain.pem.
Apply the helper across all SSL endpoints (request, renew, verify, download).
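A sketch of the helper (the selection rule, highest numeric suffix wins, is an assumption):

```python
import os
import re

def find_certbot_live_dir(live_base, domain):
    """Return the newest live dir for `domain`, handling -NNNN suffixes.

    certbot creates example.com, example.com-0001, ... on repeated
    requests; the highest suffix is treated as most recent.
    """
    pattern = re.compile(re.escape(domain) + r"(?:-(\d+))?")
    best_dir, best_n = None, -1
    for entry in os.listdir(live_base):
        m = pattern.fullmatch(entry)
        if m and os.path.isdir(os.path.join(live_base, entry)):
            n = int(m.group(1)) if m.group(1) else 0
            if n > best_n:
                best_dir, best_n = entry, n
    return os.path.join(live_base, best_dir) if best_dir else None
```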
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hook scripts are at /haproxy/scripts/ inside the container (per
Dockerfile COPY), not /app/scripts/. Also added logging of certbot
stdout/stderr so failures are visible in haproxy-manager.log.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Support wildcard domains (*.domain.tld) in HAProxy config generation
with exact-match ACLs prioritized over wildcard ACLs. Add DNS-01
challenge endpoints that coordinate with certbot via auth/cleanup
hook scripts for wildcard SSL certificate issuance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changes:
- Detect SSE via Accept header (text/event-stream) or ?action=stream parameter
- Disable http-server-close to allow long-lived SSE connections
- Enable http-no-delay for immediate event delivery
- Set 1-hour timeouts for SSE support (also fine for normal requests)
- Force Connection: keep-alive for detected SSE requests
Benefits:
- SSE now works automatically without special backend configuration
- Fixes transcription server display disconnection issues
- Normal HTTP requests still work perfectly
- No need for separate SSE-specific backends
Fixes: Server-Sent Events timing out through HAProxy
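A sketch of the SSE handling in config form (frontend name illustrative):

```
frontend http_in
    # Detect SSE by Accept header or explicit ?action=stream
    acl is_sse req.hdr(Accept) -m sub text/event-stream
    acl is_sse url_param(action) -m str stream
    option http-no-delay            # flush events immediately
    no option http-server-close     # allow long-lived connections
    timeout client 1h               # also fine for normal requests
    http-request set-header Connection keep-alive if is_sse
```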
Improved certificate renewal and sync scripts to be more resilient:
- Removed 'set -e' to prevent silent failures when individual domains error
- Scripts now continue processing remaining domains even if one fails
- Replaced database queries with direct filesystem scanning of /etc/letsencrypt/live/
- Uses 'find' command to discover all domains with Let's Encrypt certificates
- More reliable as it works even if database is out of sync
Benefits:
- No silent failures - errors are logged but don't stop the entire process
- Works independently of database state
- Simpler and more straightforward
- All domains with certificates get processed regardless of database
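The DB-free loop, sketched as a function with paths parameterized for illustration (the real script hard-codes /etc/letsencrypt/live/):

```shell
# Scan the live/ tree directly instead of querying the database, and
# combine cert+key per domain; one failing domain doesn't stop the rest.
combine_certs() {
    live_dir="$1"; certs_dir="$2"
    find "$live_dir" -mindepth 1 -maxdepth 1 -type d | while read -r d; do
        domain=$(basename "$d")
        if cat "$d/fullchain.pem" "$d/privkey.pem" > "$certs_dir/$domain.pem"; then
            echo "combined: $domain"
        else
            echo "combine failed: $domain" >&2   # log and keep going (no set -e)
        fi
    done
}
```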
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Simplified all certificate renewal scripts to be more straightforward and reliable:
- Scripts now just run certbot renew and copy cert+key files to HAProxy format
- Removed overly complex retry logic and error handling
- Both in-container and host-side scripts work with cron scheduling
Added automatic certbot cleanup when domains are removed:
- When a domain is deleted via API, certbot certificate is also removed
- Prevents renewal errors for domains that no longer exist in HAProxy
- Cleans up both HAProxy combined cert and Let's Encrypt certificate
Script changes:
- renew-certificates.sh: Simplified to 87 lines (from 215)
- sync-certificates.sh: Simplified to 79 lines (from 200+)
- host-renew-certificates.sh: Simplified to 36 lines (from 40)
- All scripts use same pattern: query DB, copy certs, reload HAProxy
Python changes:
- remove_domain() now calls 'certbot delete' to remove certificates
- Prevents orphaned certificates from causing renewal failures
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Write combined certificates to temporary file first
- Verify file is not empty before moving to final location
- Use atomic mv operation to prevent HAProxy from reading partial files
- Add proper cleanup of temporary files on all error paths
- Matches robust patterns from haproxy_manager.py
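The write-verify-rename pattern, sketched as a shell function (function name hypothetical):

```shell
# Combine cert+key via a temp file so HAProxy never sees a partial cert.
atomic_combine() {
    src_cert="$1"; src_key="$2"; dest="$3"
    tmp="${dest}.tmp.$$"
    if ! cat "$src_cert" "$src_key" > "$tmp" 2>/dev/null; then
        rm -f "$tmp"; return 1          # clean up on read failure
    fi
    if ! [ -s "$tmp" ]; then
        rm -f "$tmp"; return 1          # refuse to install an empty cert
    fi
    mv "$tmp" "$dest"                   # atomic within one filesystem
}
```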
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Update map file format to include a value column (`<IP/CIDR> 1`)
- Fix HAProxy template to use map_ip() for CIDR support
- Update runtime map commands to include value
- Document CIDR range blocking in API documentation
- Support blocking entire network ranges (e.g., 192.168.1.0/24)
This allows blocking compromised ISP ranges and other large-scale attacks.
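Sketch of the new map format and the map_ip() lookup (file path from the docs above; rule placement illustrative):

```
# /etc/haproxy/blocked_ips.map -- key is an IP or CIDR, value column
# required so map_ip() can match:
#   203.0.113.7      1
#   192.168.1.0/24   1

# Template rule (sketch): deny when the source IP falls in any entry.
http-request deny if { src,map_ip(/etc/haproxy/blocked_ips.map) -m found }
```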
After certbot renews certificates, the separate fullchain.pem and privkey.pem
files must be combined into a single .pem file for HAProxy. The renewal script
was missing this critical step, causing HAProxy to continue using old certificates.
Changes:
- Add update_combined_certificates() function to renew-certificates.sh
- Query database for all SSL-enabled domains
- Combine Let's Encrypt cert + key files using cat (matches haproxy_manager.py pattern)
- Always update combined certs after renewal, even if certbot says no renewal needed
- Add new sync-certificates.sh script for syncing all existing certificates
- Smart update detection in sync script (only updates when source is newer)
This ensures HAProxy always gets properly formatted certificate files after renewal.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit simplifies the HAProxy configuration by removing automatic
threat detection and blocking rules while preserving essential functionality.
Changes:
- Removed all automatic ACL-based security rules (SQL injection detection,
scanner detection, rate limiting, brute force protection, etc.)
- Removed complex stick-table tracking with 15 GPC counters
- Removed graduated threat response system (tarpit, deny based on threat scores)
- Removed HTTP/2 security tuning parameters specific to threat detection
- Commented out IP header forwarding in hap_backend_basic.tpl
Preserved functionality:
- Real client IP detection from proxy headers (CF-Connecting-IP, X-Real-IP,
X-Forwarded-For) with proper fallback to source IP
- Manual IP blocking via map file (/etc/haproxy/blocked_ips.map)
- Runtime map updates for immediate blocking without reload
- Backend IP forwarding capabilities (available in hap_backend.tpl)
The configuration now focuses on manual IP blocking only, which can be
managed through the API endpoints (/api/blocked-ips).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Fixed crontab permissions (600) and ownership for proper cron execution
- Added PATH environment variable to crontab to prevent command not found issues
- Created dedicated renewal script with comprehensive logging and error handling
- Added retry logic (3 attempts) for HAProxy reload with socket health checks
- Implemented host-side renewal script for external cron scheduling via docker exec
- Added crontab configuration examples for various renewal schedules
- Updated README with detailed certificate renewal documentation
This resolves issues where the cron job would not run or hang during execution.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Remove semicolons from variable initialization in AWK scripts
- Each variable now on separate line to prevent syntax errors
- Fixes "syntax error at or near ," in monitor-attacks.sh and manage-blocked-ips.sh
- Scripts now properly parse HAProxy 3.0.11 threat intelligence data
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Remove reference to non-existent security_blacklist table
- Use single table tracking with consolidated array-based GPC system
- Remove res.hdr(X-Threat-Level) from log-format as response headers not available in request phase
- Maintains threat intelligence logging with available request-phase data
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Replace compound ACL xmlrpc_abuse with separate conditions
- Use xmlrpc_rate_abuse for rate detection and combine with is_xmlrpc in http-request rule
- Prevents ACL-to-ACL reference which is not supported in HAProxy 3.0.11
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add -m int matcher for all var(txn.threat_score) comparisons
- Fix set-header, tarpit, deny, and set-log-level conditions
- Ensures proper variable type matching for HAProxy 3.0.11
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Fix tune.h2.fe-max-total-streams parameter name in global config
- Fix stick-table multiline syntax by removing line continuations
- Replace sc0_get_gpc with sc_get_gpc for proper 3.0.11 syntax
- Replace sc-set-gpc with sc-set-gpt for value assignments
- Update ACL definitions to use correct GPT fetch methods
- Simplify threat scoring to avoid unsupported add-var operations
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Escape inner quotes in the certbot renewal cron job to properly
send reload command to HAProxy via socat after certificate renewal.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Removed all 40X error tracking and rate limiting from HAProxy templates
- Preserved critical IP forwarding headers (X-CLIENT-IP, X-Real-IP, X-Forwarded-For)
- Kept stick table and IP blocking infrastructure for potential future use
- Rate limiting can now be implemented at container level with proper context
This change prevents legitimate developers from being rate-limited during
normal development activities while maintaining proper client IP forwarding
for container-level security and logging.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Remove invalid ACL combination syntax (can't use 'or' to combine ACLs)
- Use multiple http-response lines instead (each line is OR'd together)
- Each line checks specific scan pattern with 404 AND not legitimate assets
- Simplify logic to be HAProxy 3.0 compatible
This fixes the config parsing errors while maintaining the same
detection logic - only counting suspicious script/config 404s, not
missing assets.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Handle common missing files (favicon.ico, robots.txt) without counting as errors
- Return 404 directly from frontend for these files (bypasses backend counting)
- Add clear-ip.sh script to remove specific IPs from stick-table
- Keep trusted networks whitelist for local/private IPs
This prevents legitimate users from being blocked due to browser
requests for common files that don't exist.
Usage: ./scripts/clear-ip.sh <IP_ADDRESS>
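A sketch of the frontend short-circuit (frontend name illustrative):

```
frontend http_in
    # Answer common missing files from the frontend itself, so the 404
    # never reaches a backend or the error-rate counters.
    acl common_missing path /favicon.ico /robots.txt
    http-request return status 404 if common_missing
```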
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Remove unsupported set-timeout tarpit directives
- Use fixed 30s global tarpit timeout (reduced from 60s)
- Keep escalation tracking via gpc1 for monitoring repeat offenders
- HAProxy 3.0 doesn't support variable tarpit timeouts per request
The escalation level (gpc1) is still tracked and visible in monitoring
but all tarpits use the same 30s delay.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>