cpanel-importer/CONTRIBUTING.md
Claude (bootstrap) 5487dfc8f1 Initial bootstrap: cpanel-importer sanitization sandbox
Skeleton for the cpanel-importer Docker container — a one-shot
sandbox the WHP panel invokes BEFORE extracting a customer cpmove
tarball. See cpanel-import-container-spec.md (in /workspace/) for the
full design.

What this ships in v1.0:

- Dockerfile: almalinux:10-minimal + PHP 8.4 (Remi) + ClamAV 1.4 +
  SaneSecurity Foxhole.PHP rules + tar/mariadb-client/rsync. Runs as
  UID 999 (whp-import) via the panel-side --user 999:999 flag.

- scripts/entrypoint.sh: validates env, runs (optional) freshclam,
  drives extract -> scan-files -> scan-dbs -> rsync -> report.json.

- scripts/extract.sh + scripts/lib/scan-symlinks.php: pre-extract
  symlink scan ported standalone from
  web-files/libs/CpanelBackupImporter.php (the existing 2026-05-29
  whp02 destruction-vector fix). Aborts with exit 3 before tar runs
  if any DANGEROUS symlink is found.

- scripts/scan-files.php: ClamAV walk + classify-and-action. v1.0
  ships with an empty cleaner registry — every hit is
  QUARANTINE_ONLY. Cleaner hooks are stubbed for v1.1.

- scripts/scan-dbs.php: regex MyISAM -> InnoDB rewrite (always
  applied), WordPress identification, and ONE WP content scan check
  (siteurl_external_domain). v1.1 will grow the check set.

- scripts/lib/safety-net.php: container-narrow open_basedir
  allow-list, much tighter than the panel-side one.

- .gitea/workflows/build-push.yaml: builds + smoke-tests +
  PHP-syntax-checks + bash-syntax-checks before pushing to
  repo.anhonesthost.net/cloud-hosting-platform/cpanel-importer.

- tests/build-fixtures.sh: builds cpmove-clean.tar.gz (benign WP
  dump) and cpmove-alfa.tar.gz (the ALFA-shell symlink-to-/etc
  vector) for local end-to-end testing.

- README.md / CONTRIBUTING.md: docker-run invocation, bind-mount
  catalog, report.json schema, how to add a cleaner pattern or a WP
  scan signature.

Local acceptance test results:
- clean fixture -> status=completed, 3 MyISAM->InnoDB, no flags, exit 0
- ALFA fixture -> exit 1, status=failed, failed_stage=extract,
  "tarball contains dangerous symlinks; aborting" on stderr
- compromised-siteurl fixture -> imported_into_new_server=false,
  .flagged file written, summary_for_panel.show_alert=true

Image size: 197 MB compressed (gzipped docker save), ~397 MB unique
layers extracted. Well under the spec's 600 MB compressed / 1.2 GB
extracted budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 19:56:57 -07:00


Contributing — cpanel-importer

How to add an auto-cleaner pattern

Auto-cleaners live in scripts/scan-files.php, in the $cleaners registry at the top of the main flow.

A cleaner has three parts:

$cleaners['short-cleaner-name'] = [
    'class' => 'KNOWN_REMOVABLE',  // or 'REMOVABLE_WITH_BACKUP'
    'match' => fn(string $sig): bool => str_contains($sig, 'PHP.Trojan.EvalB64'),
    'clean' => function (string $path): bool {
        // Read $path, transform, write back; return true on success.
        // The file at $path is the LIVE extracted file — your edit
        // here is what ends up in /host/sanitized/<id>/extracted/.
        // The original has ALREADY been backed up to <path>.original
        // by the orchestrator before this is called.
        return false;  // placeholder: replace with your transform; false falls back to quarantine
    },
];
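
As an illustration only, here is one way the template could be filled in. The cleaner name `strip-evalb64-prefix` and the injected-line pattern are hypothetical, chosen to show the shape of a conservative cleaner; they are not taken from a real signature or from the shipped registry (which is empty in v1.0).

```php
<?php
// In scan-files.php the registry already exists; this line only keeps the sketch standalone.
$cleaners ??= [];

// Hypothetical cleaner: removes one exact injected first line and nothing else.
$cleaners['strip-evalb64-prefix'] = [
    'class' => 'REMOVABLE_WITH_BACKUP',
    'match' => fn(string $sig): bool => str_contains($sig, 'PHP.Trojan.EvalB64'),
    'clean' => function (string $path): bool {
        // Illustrative pattern (an assumption): a single eval(base64_decode('...')) line
        // prepended to an otherwise legitimate PHP file.
        $injected = '/\A<\?php eval\(base64_decode\(\'[A-Za-z0-9+\/=]+\'\)\); \?>\r?\n/';

        $src = file_get_contents($path);
        if ($src === false) {
            return false;
        }
        // Conservative: require exactly one match, anchored at the start of the file.
        $out = preg_replace($injected, '', $src, 1, $count);
        if ($out === null || $count !== 1) {
            return false;
        }
        // Refuse a half-clean: if anything payload-like remains, let the file be quarantined.
        if (str_contains($out, 'base64_decode(')) {
            return false;
        }
        // Succeed only if the whole cleaned file was written back.
        return file_put_contents($path, $out) === strlen($out);
    },
];
```

When this sketch is run a second time on an already-cleaned file, it finds no match, returns false, and leaves the file untouched.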

Safety checklist before merging a new cleaner

  1. Backup is guaranteed. The orchestrator copies the file to <quarantine>/<relpath>.original BEFORE calling clean(). Verify this is still true in scan-files.php if you refactor the dispatch.
  2. Cleaner is idempotent. Running it a second time on an already-cleaned file must leave the file byte-for-byte unchanged.
  3. Cleaner is conservative. If the file does NOT match your transform exactly, return false (the orchestrator will fall back to quarantining). Never "best-effort" a half-clean.
  4. Cleaner has a regression test. Add a fixture under tests/fixtures/cleaner-<name>/ with input + expected output, and exercise it from tests/run-tests.sh (or your CI step).
  5. Cleaner classification is correct.
    • KNOWN_REMOVABLE = the whole pattern is known-safe to strip.
    • REMOVABLE_WITH_BACKUP = legit file with injected lines; we are confident in surgical removal but back up anyway.
    • QUARANTINE_ONLY = no clean variant; don't write a clean().
  6. Signature match is tight. Prefer str_contains($sig, 'specific-sig-name') over broad regex matches. A false-positive cleaner can corrupt customer files.
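
Items 2 and 4 of the checklist can be exercised together by a small helper. The sketch below is an assumption about how such a test might look, not an existing part of tests/run-tests.sh; the function name `cleaner_regression` and the fixture file names in the usage note are hypothetical.

```php
<?php
// Runs $clean on a scratch copy of $inputPath, compares the result with
// $expectedPath, then runs it again to confirm the file no longer changes.
// Returns a list of problems; an empty list means the cleaner passed.
function cleaner_regression(callable $clean, string $inputPath, string $expectedPath): array
{
    $scratch = tempnam(sys_get_temp_dir(), 'cleaner-');
    if ($scratch === false || !copy($inputPath, $scratch)) {
        return ['could not create scratch copy'];
    }
    $problems = [];
    try {
        if ($clean($scratch) !== true) {
            $problems[] = 'first run did not return true';
        }
        $first = file_get_contents($scratch);
        if ($first !== file_get_contents($expectedPath)) {
            $problems[] = 'output differs from expected fixture';
        }
        // Second run: the return value is ignored, only the file content matters.
        $clean($scratch);
        if (file_get_contents($scratch) !== $first) {
            $problems[] = 'second run changed the file (not idempotent)';
        }
    } finally {
        unlink($scratch);
    }
    return $problems;
}
```

A call might look like `cleaner_regression($cleaners['my-cleaner']['clean'], "$dir/in.php", "$dir/expected.php")`, where `$dir` points at tests/fixtures/cleaner-<name>/ and the two file names are placeholders.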

Manual test loop

docker build -t cpanel-importer:dev .
# Place a known-infected synthetic file under tests/fixtures/cleaner-X/in/
# Run scan-files.php directly against it:
docker run --rm \
    --entrypoint /scripts/scan-files.php \
    -v "$PWD/tests/fixtures/cleaner-X/in:/tmp/extract" \
    -v "$PWD/tests/fixtures/cleaner-X/quarantine:/host/quarantine" \
    cpanel-importer:dev \
    --extract /tmp/extract --quarantine /host/quarantine \
    --report /tmp/r.json --import-id test

How to add a WordPress content scan signature

Scan checks live in scripts/scan-dbs.php, in wp_content_scan().

Each check should produce a flag dict on hit:

$flags[] = [
    'severity' => 'high',    // 'high' refuses the DB (per default threshold N=1)
                             // 'medium' / 'low' flag in the report but allow import
    'code'     => 'short_machine_readable_code',
    'details'  => 'Human-readable explanation including the matched value(s).',
];

Safety checklist

  1. Severity reflects confidence. Use high only when the cost of a false positive is acceptable: the customer then has to re-import via the "import anyway" UI button. Misjudged severity translates directly into admin support tickets.
  2. Check is fast. The whole .sql dump is in memory as a string; prefer preg_match on the raw string or a pre-built map (see extract_wp_options()) over re-parsing the full dump.
  3. Check is well-tested. Add a fixture under tests/fixtures/wp-scan-<code>/ with a synthetic dump that triggers the flag and one that does not.
  4. Allow-list awareness. If the check is comparing a value against the customer's domain list, use domain_in_allowlist($host, $allowedDomains) so subdomain matches work consistently with the rest of the scanner.
  5. Don't break engine swap. wp_content_scan() runs AFTER the engine swap on the same $rewritten string. Both your check and the engine swap must be tolerant of each other's output.
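
To make the flag shape and the allow-list rule concrete, here is a minimal sketch of a check. Everything in it is hypothetical: the check code `home_external_domain`, the function names, and the assumption that the options arrive as an option_name => option_value map. `example_domain_in_allowlist()` is a stand-in written for this example; a real check should call the scanner's own `domain_in_allowlist()`, whose exact behavior may differ.

```php
<?php
// Stand-in for the scanner's domain_in_allowlist(); illustrative only.
function example_domain_in_allowlist(string $host, array $allowedDomains): bool
{
    $host = strtolower($host);
    foreach ($allowedDomains as $domain) {
        $domain = strtolower($domain);
        // Accept the domain itself or any subdomain of it.
        if ($host === $domain || str_ends_with($host, '.' . $domain)) {
            return true;
        }
    }
    return false;
}

// Hypothetical check: flag a 'home' option whose host is outside the allow-list.
function check_home_external_domain(array $options, array $allowedDomains): array
{
    $flags = [];
    $home = $options['home'] ?? null;
    if (!is_string($home)) {
        return $flags;
    }
    $host = parse_url($home, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return $flags; // unparseable value: not enough confidence to flag
    }
    if (!example_domain_in_allowlist($host, $allowedDomains)) {
        $flags[] = [
            'severity' => 'high',
            'code'     => 'home_external_domain',
            'details'  => "WordPress 'home' option points at '{$host}', which is not one of the customer's domains.",
        ];
    }
    return $flags;
}
```

The check returns an empty array rather than a low-severity flag when the value cannot be parsed, which keeps the flag tied to a concrete matched host.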

How to test locally

Build the image

docker build -t cpanel-importer:dev .

Confirm the image is under the budget:

docker images cpanel-importer:dev --format '{{.Size}}'

Target: < 1 GB extracted (spec asks < 600 MB compressed for prod, but local builds typically come in around 700 to 900 MB extracted including ClamAV signature DBs).

Build the fixtures

bash tests/build-fixtures.sh

Two tarballs land under tests/fixtures/:

  • cpmove-clean.tar.gz — a benign cpmove with a WordPress MyISAM dump.
  • cpmove-alfa.tar.gz — same shape PLUS an ALFA-style symlink to /etc.

Run against the clean fixture

mkdir -p /tmp/test-quarantine /tmp/test-sanitized
docker run --rm \
    -e IMPORT_ID=test-clean \
    -e IMPORT_USERNAME=testuser \
    -e IMPORT_BACKUP_FILE=/host/backup/cpmove-clean.tar.gz \
    -e CLAMAV_REFRESH=false \
    -v "$PWD/tests/fixtures/cpmove-clean.tar.gz:/host/backup/cpmove-clean.tar.gz:ro" \
    -v /tmp/test-quarantine:/host/quarantine \
    -v /tmp/test-sanitized:/host/sanitized \
    cpanel-importer:dev

Expect status=completed, MyISAM count > 0, no flags, exit 0.

Run against the ALFA fixture

docker run --rm \
    -e IMPORT_ID=test-alfa \
    -e IMPORT_USERNAME=testuser \
    -e IMPORT_BACKUP_FILE=/host/backup/cpmove-alfa.tar.gz \
    -e CLAMAV_REFRESH=false \
    -v "$PWD/tests/fixtures/cpmove-alfa.tar.gz:/host/backup/cpmove-alfa.tar.gz:ro" \
    -v /tmp/test-quarantine:/host/quarantine \
    -v /tmp/test-sanitized:/host/sanitized \
    cpanel-importer:dev

Expect non-zero exit, status=failed, failed_stage=extract, and stderr from inside the container containing "tarball contains dangerous symlinks; aborting".
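
The expectations for both runs can be checked mechanically once you have located the report.json that the run produced. The helper below is a sketch under two assumptions: that `status` and `failed_stage` are top-level keys of the report, and that you pass in the report path yourself (see the README for the schema and location). The function name `check_report` is hypothetical.

```php
<?php
// Compares selected top-level keys of a report.json against expected values.
// Returns a list of human-readable mismatches; an empty list means it matches.
function check_report(string $reportPath, array $expected): array
{
    $raw = @file_get_contents($reportPath);
    if ($raw === false) {
        return ["cannot read {$reportPath}"];
    }
    $report = json_decode($raw, true);
    if (!is_array($report)) {
        return ["{$reportPath} is not a JSON object"];
    }
    $problems = [];
    foreach ($expected as $key => $want) {
        $got = $report[$key] ?? null;
        if ($got !== $want) {
            $problems[] = sprintf('%s: expected %s, got %s', $key, json_encode($want), json_encode($got));
        }
    }
    return $problems;
}
```

For the clean fixture you would pass `['status' => 'completed']`; for the ALFA fixture, `['status' => 'failed', 'failed_stage' => 'extract']`.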

Iterating on PHP / shell scripts

The scripts/ directory is COPYed in late in the Dockerfile, so edits there only re-trigger the last layer of the build — typical turnaround is ~5 seconds.


Code style

  • Bash scripts: set -euo pipefail, absolute paths only, every external command on its own logical line, comment each non-obvious flag.
  • PHP scripts: 4-space indent, single quotes for non-interpolated strings, <?php opener on line 1, no closing ?>.
  • All scripts must be idempotent — the worker may be re-run against the same IMPORT_ID on retry; second runs must overwrite the prior report.json cleanly.
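
One common way to make a retry overwrite report.json cleanly is to write the new report to a temporary file in the same directory and rename it over the old one. The sketch below shows that pattern; it is not a description of how entrypoint.sh or the PHP scripts currently write the report, and the function name `write_report_atomically` is hypothetical.

```php
<?php
// Writes $report as JSON to $reportPath via a temp file and rename(), so a
// re-run replaces any previous report in one step instead of truncating it first.
function write_report_atomically(string $reportPath, array $report): bool
{
    $json = json_encode($report, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
    if ($json === false) {
        return false;
    }
    // Same directory as the target, so rename() stays on one filesystem.
    $tmp = $reportPath . '.tmp.' . getmypid();
    if (file_put_contents($tmp, $json . "\n") === false) {
        return false;
    }
    if (!rename($tmp, $reportPath)) {
        @unlink($tmp);
        return false;
    }
    return true;
}
```

Because the rename replaces the destination, a reader of report.json sees either the previous report or the new one, never a partially written file.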

CI

Pushes to trunk build and push the image to repo.anhonesthost.net/cloud-hosting-platform/cpanel-importer:latest and ...:<sha>. Pushes of a YYYY.MM.NNN tag additionally tag ...:YYYY.MM.NNN. Before pushing, CI runs the smoke test (the image starts and echo ok runs) and php -l / bash -n syntax checks on every script.

See .gitea/workflows/build-push.yaml.