cpanel-importer/scripts/scan-dbs.php
Claude (bootstrap) 08b995a29c
All checks were successful
cpanel-importer Build and Push / Build-and-Push (push) Successful in 1m21s
scan-dbs: stream the SQL file instead of loading 5GB+ into memory
Surfaced on whp02 alsacorp retry: scan-dbs.php hit a PHP fatal error at
line 86, "Allowed memory size of 134217728 bytes exhausted (tried to
allocate 5488440384 bytes)", while loading alsacorp_alsa1.sql via
file_get_contents. The dump is multi-GB (typical for WooCommerce stores
with media metadata); both the 128MB default PHP memory_limit and the
2GB cgroup limit on the container fall well below the actual file size.

Rewrote the per-DB pass as a streaming loop over 4MB chunks:

  - engine_swap_chunk: same `\bENGINE=MyISAM\b` regex, mutates a
    per-DB counter via reference so the per-chunk counts accumulate
    into a single myisam_to_innodb total.

  - is_wp_chunk_scan: OR-folds the four WP fingerprint regexes
    (CREATE TABLE *_options, *_posts, *_users + the
    'siteurl|home|template|stylesheet' sentinel) into a state dict;
    any chunk that flips a flag from false to true keeps it true for
    the rest of the file. Caller AND-folds at finalization.

  - wp_options_chunk_scan: extracts (option_name, option_value)
    tuples from INSERT INTO *_options statements as they pass through.
    First occurrence wins so we keep the live value, not later
    duplicates.

  - wp_content_scan_from_values: extracted the finalization logic
    from the legacy wp_content_scan() so the streaming path can
    submit a pre-built option-values map instead of re-scanning the
    full string.
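
A minimal sketch of how the per-chunk helpers above could accumulate
state across chunks. The function names follow this message where it
names them; `is_wp_verdict` and `keep_first_options` are hypothetical,
and all bodies and regexes are assumptions, not the code in scan-dbs.php:

```php
<?php
// Hypothetical sketch: bodies and regexes are assumed for illustration.

/** Swap MyISAM for InnoDB in one chunk; add this chunk's hits to $count. */
function engine_swap_chunk(string $chunk, int &$count): string
{
    $out = preg_replace('/\bENGINE=MyISAM\b/', 'ENGINE=InnoDB', $chunk, -1, $n);
    if ($out === null) {
        throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
    }
    $count += $n; // by-reference counter accumulates across chunks
    return $out;
}

/** OR-fold fingerprint hits into $state: a flag that turns true stays true. */
function is_wp_chunk_scan(string $chunk, array &$state): void
{
    // Assumed patterns; the real regexes are not shown in this message.
    $patterns = [
        'options'  => '/CREATE TABLE `?\w*_options`?/',
        'posts'    => '/CREATE TABLE `?\w*_posts`?/',
        'users'    => '/CREATE TABLE `?\w*_users`?/',
        'sentinel' => "/'(?:siteurl|home|template|stylesheet)'/",
    ];
    foreach ($patterns as $key => $re) {
        $state[$key] = ($state[$key] ?? false) || preg_match($re, $chunk) === 1;
    }
}

/** AND-fold at finalization: WordPress only if every fingerprint was seen. */
function is_wp_verdict(array $state): bool
{
    foreach (['options', 'posts', 'users', 'sentinel'] as $key) {
        if (empty($state[$key])) {
            return false;
        }
    }
    return true;
}

/** First occurrence wins: array union keeps keys already on the left side. */
function keep_first_options(array &$values, array $found): void
{
    $values += $found;
}
```

The caller would initialize `$count = 0`, `$state = []` and `$values = []`
once per DB, call the helpers on every chunk, and read the verdict only
after the last chunk.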

Per-chunk carry: the last 128 bytes of each chunk are held back and
prepended to the next chunk so a pattern split across a chunk boundary
(e.g. "ENGINE=" ending at byte 4194303, "MyISAM" starting at byte
4194304) is still seen by the regex. 128 bytes is generous for our
patterns (longest is "ENGINE = MyISAM" with whitespace flex).
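
One way the carry could be implemented, sketched for counting only.
`count_in_range` and `count_matches_streaming` are hypothetical names.
The sketch assumes no match is longer than `$carry` bytes. It also keeps
one extra byte of left context so `\b` behaves at the seam; whether the
real loop does this is not stated here:

```php
<?php
// Hypothetical sketch of a carry-over scan; not the code in scan-dbs.php.

/** Count matches of $regex whose start offset lies in [$from, $to). */
function count_in_range(string $regex, string $window, int $from, int $to): int
{
    preg_match_all($regex, $window, $m, PREG_OFFSET_CAPTURE);
    $n = 0;
    foreach ($m[0] as [$text, $offset]) {
        if ($offset >= $from && $offset < $to) {
            $n++;
        }
    }
    return $n;
}

/**
 * Count regex matches in stream $fh, reading $chunkSize bytes at a time.
 * ASSUMPTION: no single match is longer than $carry bytes.
 */
function count_matches_streaming($fh, string $regex, int $chunkSize, int $carry): int
{
    $total = 0;
    $tail = ''; // bytes held back from the previous window
    $ctx = 0;   // 1 once $tail begins with a context-only byte
    while (($chunk = fread($fh, $chunkSize)) !== false && $chunk !== '') {
        $window = $tail . $chunk;
        $n = strlen($window);
        if ($n <= $carry) {
            $tail = $window; // not enough data yet, keep accumulating
            continue;
        }
        $cut = $n - $carry;
        // Matches starting before $cut are fully visible in this window;
        // matches starting at or after $cut are deferred to the next one.
        $total += count_in_range($regex, $window, $ctx, $cut);
        // Keep one byte before the cut purely as left context for \b.
        $tail = substr($window, $cut - 1);
        $ctx = 1;
    }
    // End of input: everything still held back is scanned now.
    return $total + count_in_range($regex, $tail, $ctx, strlen($tail));
}
```

Counting by start offset means a match inside the held-back region is
never counted twice, and a match straddling the cut is never missed.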

Output goes to `<db>.sql.tmp` first and is renamed to
`<db>.sql{,.flagged}` once we know the flag verdict, which avoids
leaving a partial file under the final name if the scan dies mid-stream.
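
The write-then-rename step could look roughly like this.
`finalize_output` is a hypothetical name, and how the verdict is
computed is left to the caller:

```php
<?php
// Hypothetical sketch of the tmp-then-rename step; not the real script.

/**
 * Move a fully written temp file to its final name.
 * $finalBase is the "<db>.sql" path; ".flagged" is appended when $flagged.
 */
function finalize_output(string $tmpPath, string $finalBase, bool $flagged): string
{
    $final = $flagged ? $finalBase . '.flagged' : $finalBase;
    // On POSIX, rename() within one filesystem replaces the target
    // atomically, so readers never observe a half-written final file.
    if (!rename($tmpPath, $final)) {
        throw new RuntimeException("rename failed: $tmpPath -> $final");
    }
    return $final;
}
```

If the scan dies before this call, only the `.tmp` file is left behind,
and a later run can delete or overwrite it.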

Legacy `engine_swap`, `is_wordpress_dump`, and the unused
`wp_content_scan`+`extract_wp_options` are kept in place for the
small-file path (none of them currently called from the new
streaming loop, but they're public-ish helpers the next dbsanitize
revision could reuse).

Resident memory now bounded to <16 MB per DB regardless of input
file size — should handle the 30 GB+ outliers we'll inevitably see.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 20:31:34 -07:00

22 KiB
Executable File