Initial bootstrap: cpanel-importer sanitization sandbox

Skeleton for the cpanel-importer Docker container — a one-shot
sandbox the WHP panel invokes BEFORE extracting a customer cpmove
tarball. See cpanel-import-container-spec.md (in /workspace/) for the
full design.

What this ships in v1.0:

- Dockerfile: almalinux:10-minimal + PHP 8.4 (Remi) + ClamAV 1.4 +
  SaneSecurity Foxhole.PHP rules + tar/mariadb-client/rsync. Runs as
  UID 999 (whp-import) via the panel-side --user 999:999 flag.

- scripts/entrypoint.sh: validates env, runs (optional) freshclam,
  drives extract -> scan-files -> scan-dbs -> rsync -> report.json.

- scripts/extract.sh + scripts/lib/scan-symlinks.php: pre-extract
  symlink scan ported standalone from
  web-files/libs/CpanelBackupImporter.php (the existing 2026-05-29
  whp02 destruction-vector fix). Aborts with exit 3 before tar runs
  if any DANGEROUS symlink is found.

- scripts/scan-files.php: ClamAV walk + classify-and-action. v1.0
  ships with an empty cleaner registry — every hit is
  QUARANTINE_ONLY. Cleaner hooks are stubbed for v1.1.

- scripts/scan-dbs.php: regex MyISAM -> InnoDB rewrite (always
  applied), WordPress identification, and ONE WP content scan check
  (siteurl_external_domain). v1.1 will grow the check set.

- scripts/lib/safety-net.php: container-narrow open_basedir
  allow-list, much tighter than the panel-side one.

- .gitea/workflows/build-push.yaml: builds + smoke-tests +
  PHP-syntax-checks + bash-syntax-checks before pushing to
  repo.anhonesthost.net/cloud-hosting-platform/cpanel-importer.

- tests/build-fixtures.sh: builds cpmove-clean.tar.gz (benign WP
  dump) and cpmove-alfa.tar.gz (the ALFA-shell symlink-to-/etc
  vector) for local end-to-end testing.

- README.md / CONTRIBUTING.md: docker-run invocation, bind-mount
  catalog, report.json schema, how to add a cleaner pattern or a WP
  scan signature.

Local acceptance test results:
- clean fixture -> status=completed, 3 MyISAM->InnoDB, no flags, exit 0
- ALFA fixture -> exit 1, status=failed, failed_stage=extract,
  "tarball contains dangerous symlinks; aborting" on stderr
- compromised-siteurl fixture -> imported_into_new_server=false,
  .flagged file written, summary_for_panel.show_alert=true

Image size: 197 MB compressed (gzipped docker save), ~397 MB unique
layers extracted. Well under the spec's 600 MB compressed / 1.2 GB
extracted budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude (bootstrap)
2026-05-30 19:56:57 -07:00
commit 5487dfc8f1
17 changed files with 2008 additions and 0 deletions

CONTRIBUTING.md Normal file

@@ -0,0 +1,192 @@
# Contributing — cpanel-importer
## How to add an auto-cleaner pattern
Auto-cleaners live in `scripts/scan-files.php`, in the `$cleaners`
registry at the top of the main flow.
A cleaner has three parts:
```php
$cleaners['short-cleaner-name'] = [
    'class' => 'KNOWN_REMOVABLE', // or 'REMOVABLE_WITH_BACKUP'
    'match' => fn(string $sig): bool => str_contains($sig, 'PHP.Trojan.EvalB64'),
    'clean' => function (string $path): bool {
        // Read $path, transform, write back; return true on success.
        // The file at $path is the LIVE extracted file; your edit
        // here is what ends up in /host/sanitized/<id>/extracted/.
        // The original has ALREADY been backed up to <path>.original
        // by the orchestrator before this is called.
        return false; // template placeholder: no transform yet, so fall back to quarantine
    },
];
```
### Safety checklist before merging a new cleaner
1. **Backup is guaranteed.** The orchestrator copies the file to
`<quarantine>/<relpath>.original` BEFORE calling `clean()`. Verify
this is still true in `scan-files.php` if you refactor the dispatch.
2. **Cleaner is idempotent.** Running it a second time on an
already-cleaned file must leave that file unchanged.
3. **Cleaner is conservative.** If the file does NOT match your
transform exactly, return `false` (the orchestrator will fall back
to quarantining). Never "best-effort" a half-clean.
4. **Cleaner has a regression test.** Add a fixture under
`tests/fixtures/cleaner-<name>/` with input + expected output, and
exercise it from `tests/run-tests.sh` (or your CI step).
5. **Cleaner classification is correct.**
   - `KNOWN_REMOVABLE` = the whole pattern is known-safe to strip.
   - `REMOVABLE_WITH_BACKUP` = legit file with injected lines; we are
     confident in surgical removal but back up anyway.
   - `QUARANTINE_ONLY` = no clean variant; don't write a `clean()`.
6. **Signature match is tight.** Prefer
`str_contains($sig, 'specific-sig-name')` over broad regex matches.
A false-positive cleaner can corrupt customer files.
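Items 2 and 4 of the checklist can be exercised together from a shell test. The following is a minimal sketch, not the contents of `tests/run-tests.sh`: `check_cleaner_fixture` and `strip_marker` are hypothetical names, the expected-output file is an assumed fixture layout, and the cleaner is passed in as any command that edits a file in place (a `sed` stand-in here, not a real `clean()` closure).

```bash
# Hypothetical helper: run an in-place cleaner command against a copy of a
# fixture, compare the result with the expected output, then run the cleaner
# again to confirm the second pass changes nothing (idempotency).
# Usage: check_cleaner_fixture <input-file> <expected-file> <cleaner-cmd...>
check_cleaner_fixture() {
    local input=$1 expected=$2 work
    shift 2
    work=$(mktemp -d) || return 1

    # Work on a copy so the fixture itself is never modified.
    cp -- "$input" "$work/first"
    "$@" "$work/first" || { rm -rf -- "$work"; return 1; }

    # First pass must match the expected output byte for byte.
    if ! cmp -s "$work/first" "$expected"; then
        echo "cleaner output differs from expected" >&2
        rm -rf -- "$work"
        return 1
    fi

    # Second pass on already-cleaned content must be a no-op.
    cp -- "$work/first" "$work/second"
    "$@" "$work/second" || { rm -rf -- "$work"; return 1; }
    if ! cmp -s "$work/first" "$work/second"; then
        echo "cleaner is not idempotent" >&2
        rm -rf -- "$work"
        return 1
    fi

    rm -rf -- "$work"
}

# Stand-in cleaner for illustration only: deletes lines containing a marker.
strip_marker() {
    sed -i.bak '/INJECTED/d' "$1" && rm -f -- "$1.bak"
}
```

A cleaner that keeps appending or re-wrapping content on every run fails the second comparison even when its first pass matches the expected output, which is the failure mode item 2 is meant to catch.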
### Manual test loop
```bash
docker build -t cpanel-importer:dev .
# Place a known-infected synthetic file under tests/fixtures/cleaner-X/in/
# Run scan-files.php directly against it:
docker run --rm \
--entrypoint /scripts/scan-files.php \
-v "$PWD/tests/fixtures/cleaner-X/in:/tmp/extract" \
-v "$PWD/tests/fixtures/cleaner-X/quarantine:/host/quarantine" \
cpanel-importer:dev \
--extract /tmp/extract --quarantine /host/quarantine \
--report /tmp/r.json --import-id test
```
---
## How to add a WordPress content scan signature
Scan checks live in `scripts/scan-dbs.php`, in `wp_content_scan()`.
Each check should produce a flag dict on hit:
```php
$flags[] = [
    'severity' => 'high', // 'high' refuses the DB (per default threshold N=1)
                          // 'medium' / 'low' flag in the report but allow import
    'code' => 'short_machine_readable_code',
    'details' => 'Human-readable explanation including the matched value(s).',
];
```
### Safety checklist
1. **Severity reflects confidence.** Use `high` only when a false
positive is acceptable for the customer (they can re-import via the
"import anyway" UI button). A misjudged severity here translates
directly into admin support tickets.
2. **Check is fast.** The whole `.sql` dump is in memory as a string;
prefer `preg_match` on the raw string or a pre-built map (see
`extract_wp_options()`) over re-parsing the full dump.
3. **Check is well-tested.** Add a fixture under
`tests/fixtures/wp-scan-<code>/` with a synthetic dump that
triggers the flag and one that does not.
4. **Allow-list awareness.** If the check is comparing a value against
the customer's domain list, use
`domain_in_allowlist($host, $allowedDomains)` so subdomain matches
work consistently with the rest of the scanner.
5. **Don't break engine swap.** `wp_content_scan()` runs AFTER the
engine swap on the same `$rewritten` string. Both your check and
the engine swap must be tolerant of each other's output.
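To make item 4 concrete, here is a shell sketch of the kind of matching an allow-list helper usually performs. It is not a port of the PHP `domain_in_allowlist()`: the rules shown (case-insensitive comparison, trailing dot ignored, exact match or subdomain on a label boundary) are assumptions, so check them against the real function before relying on them.

```bash
# Sketch only: assumed semantics, not the PHP implementation.
# Usage: domain_in_allowlist <host> <allowed-domain>...
domain_in_allowlist() {
    local host allowed
    # Lower-case and drop a trailing dot so "Example.COM." equals "example.com".
    host=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    host=${host%.}
    shift
    for allowed in "$@"; do
        allowed=$(printf '%s' "$allowed" | tr '[:upper:]' '[:lower:]')
        allowed=${allowed%.}
        [ -n "$allowed" ] || continue
        case "$host" in
            "$allowed")   return 0 ;;  # exact match
            *."$allowed") return 0 ;;  # subdomain; the literal dot enforces a label boundary
        esac
    done
    return 1
}
```

The label boundary matters: under these rules `shop.example.com` matches `example.com`, while `evilexample.com` and `example.com.evil.net` do not.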
---
## How to test locally
### Build the image
```bash
docker build -t cpanel-importer:dev .
```
Confirm the image is under the budget:
```bash
docker images cpanel-importer:dev --format '{{.Size}}'
```
Target: < 1 GB extracted. The spec asks for < 600 MB compressed for
prod, but local builds typically come in around 700 to 900 MB
extracted, including ClamAV signature DBs.
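`docker images --format '{{.Size}}'` prints a human-readable string, which is awkward to compare in a script. One option is to read the size in bytes with `docker image inspect` and compare it numerically. The sketch below assumes that approach; `size_within_budget` is a hypothetical helper and the limit is passed in decimal megabytes.

```bash
# Hypothetical helper: succeed when a size in bytes is strictly under a limit
# given in decimal megabytes (1 MB = 1,000,000 bytes).
# Usage: size_within_budget <size-bytes> <limit-mb>
size_within_budget() {
    local bytes=$1 limit_mb=$2
    [ "$bytes" -lt $(( limit_mb * 1000000 )) ]
}

# Example, assuming the dev image has already been built:
#   bytes=$(docker image inspect cpanel-importer:dev --format '{{.Size}}')
#   size_within_budget "$bytes" 1000 || echo "image is over the 1 GB target" >&2
```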
### Build the fixtures
```bash
bash tests/build-fixtures.sh
```
Two tarballs land under `tests/fixtures/`:
- `cpmove-clean.tar.gz` — a benign cpmove with a WordPress MyISAM dump.
- `cpmove-alfa.tar.gz` — same shape PLUS an ALFA-style symlink to /etc.
### Run against the clean fixture
```bash
mkdir -p /tmp/test-quarantine /tmp/test-sanitized
docker run --rm \
-e IMPORT_ID=test-clean \
-e IMPORT_USERNAME=testuser \
-e IMPORT_BACKUP_FILE=/host/backup/cpmove-clean.tar.gz \
-e CLAMAV_REFRESH=false \
-v "$PWD/tests/fixtures/cpmove-clean.tar.gz:/host/backup/cpmove-clean.tar.gz:ro" \
-v /tmp/test-quarantine:/host/quarantine \
-v /tmp/test-sanitized:/host/sanitized \
cpanel-importer:dev
```
Expect `status=completed`, MyISAM count > 0, no flags, exit 0.
### Run against the ALFA fixture
```bash
docker run --rm \
-e IMPORT_ID=test-alfa \
-e IMPORT_USERNAME=testuser \
-e IMPORT_BACKUP_FILE=/host/backup/cpmove-alfa.tar.gz \
-e CLAMAV_REFRESH=false \
-v "$PWD/tests/fixtures/cpmove-alfa.tar.gz:/host/backup/cpmove-alfa.tar.gz:ro" \
-v /tmp/test-quarantine:/host/quarantine \
-v /tmp/test-sanitized:/host/sanitized \
cpanel-importer:dev
```
Expect non-zero exit, `status=failed`, `failed_stage=extract`, and
stderr from inside the container containing
`tarball contains dangerous symlinks; aborting`.
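The expectations above can be checked mechanically rather than by eye. This is a minimal sketch under stated assumptions: `expect_failure_with` is a hypothetical helper, and it only checks the exit code and stderr, not the contents of `report.json`.

```bash
# Hypothetical helper: run a command, require a non-zero exit code and a
# given substring on stderr.
# Usage: expect_failure_with <needle> <cmd...>
expect_failure_with() {
    local needle=$1 errfile status=0
    shift
    errfile=$(mktemp) || return 1

    # Capture stderr only; stdout is discarded.
    "$@" >/dev/null 2>"$errfile" || status=$?

    if [ "$status" -eq 0 ]; then
        echo "expected non-zero exit, got 0" >&2
        rm -f -- "$errfile"
        return 1
    fi
    # -F treats the needle as a fixed string, not a regex.
    if ! grep -qF -- "$needle" "$errfile"; then
        echo "stderr did not contain: $needle" >&2
        rm -f -- "$errfile"
        return 1
    fi
    rm -f -- "$errfile"
}

# Example: prefix the ALFA docker run command shown above with
#   expect_failure_with 'tarball contains dangerous symlinks; aborting'
```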
### Iterating on PHP / shell scripts
The `scripts/` directory is `COPY`ed in late in the Dockerfile, so
edits there only re-trigger the last layer of the build — typical
turnaround is ~5 seconds.
---
## Code style
- Bash scripts: `set -euo pipefail`, absolute paths only, every external
command on its own logical line, comment each non-obvious flag.
- PHP scripts: 4-space indent, single quotes for non-interpolated
strings, `<?php` opener on line 1, no closing `?>`.
- All scripts must be idempotent — the worker may be re-run against the
same `IMPORT_ID` on retry; second runs must overwrite the prior
`report.json` cleanly.
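One common way to satisfy the overwrite requirement is to write to a temporary file and rename it over the target, so a retry never leaves a mix of old and new content. The sketch below illustrates that pattern only; `write_report` is a hypothetical helper and is not taken from `scripts/entrypoint.sh`.

```bash
# Hypothetical helper: replace a report file in a single rename step.
# Usage: write_report <target-path> <content>
write_report() {
    local target=$1 content=$2 tmp
    # Create the temp file next to the target so the rename stays on one
    # filesystem. mktemp creates it with mode 0600; chmod afterwards if
    # another user needs to read the report.
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    printf '%s\n' "$content" > "$tmp" || { rm -f -- "$tmp"; return 1; }
    mv -f -- "$tmp" "$target"
}
```

Calling it twice with the same target leaves only the second report behind and no stray temp files.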
---
## CI
Pushes to `trunk` build + push the image to
`repo.anhonesthost.net/cloud-hosting-platform/cpanel-importer:latest` and
`...:<sha>`. Pushing a `YYYY.MM.NNN` tag additionally publishes
`...:YYYY.MM.NNN`. CI runs the smoke test (the image starts and
`echo ok` runs) and `php -l` / `bash -n` syntax checks on every script
before pushing.
See `.gitea/workflows/build-push.yaml`.
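The syntax checks can be reproduced locally before pushing. The following is a sketch of such a check, not a copy of the workflow step: `check_syntax` is a hypothetical helper, and it assumes `bash` and (when `.php` files are present) `php` are on `PATH`.

```bash
# Hypothetical helper: syntax-check every .sh and .php file under a directory.
# Returns non-zero if any file fails; all files are checked before returning.
# Usage: check_syntax <dir>
check_syntax() {
    local dir=$1 status=0

    # bash -n parses a script without executing it.
    find "$dir" -type f -name '*.sh' \
        -exec sh -c 'rc=0; for f; do bash -n "$f" || rc=1; done; exit $rc' sh {} + \
        || status=1

    # php -l lints a single file; its "No syntax errors" output is discarded.
    find "$dir" -type f -name '*.php' \
        -exec sh -c 'rc=0; for f; do php -l "$f" >/dev/null || rc=1; done; exit $rc' sh {} + \
        || status=1

    return "$status"
}

# Example: check_syntax scripts
```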