cpanel-importer/CONTRIBUTING.md
Claude (bootstrap) 5487dfc8f1 Initial bootstrap: cpanel-importer sanitization sandbox
Skeleton for the cpanel-importer Docker container — a one-shot
sandbox the WHP panel invokes BEFORE extracting a customer cpmove
tarball. See cpanel-import-container-spec.md (in /workspace/) for the
full design.

What this ships in v1.0:

- Dockerfile: almalinux:10-minimal + PHP 8.4 (Remi) + ClamAV 1.4 +
  SaneSecurity Foxhole.PHP rules + tar/mariadb-client/rsync. Runs as
  UID 999 (whp-import) via the panel-side --user 999:999 flag.

- scripts/entrypoint.sh: validates env, runs (optional) freshclam,
  drives extract -> scan-files -> scan-dbs -> rsync -> report.json.

- scripts/extract.sh + scripts/lib/scan-symlinks.php: pre-extract
  symlink scan ported standalone from
  web-files/libs/CpanelBackupImporter.php (the existing 2026-05-29
  whp02 destruction-vector fix). Aborts with exit 3 before tar runs
  if any DANGEROUS symlink is found.

- scripts/scan-files.php: ClamAV walk + classify-and-action. v1.0
  ships with an empty cleaner registry — every hit is
  QUARANTINE_ONLY. Cleaner hooks are stubbed for v1.1.

- scripts/scan-dbs.php: regex MyISAM -> InnoDB rewrite (always
  applied), WordPress identification, and ONE WP content scan check
  (siteurl_external_domain). v1.1 will grow the check set.

- scripts/lib/safety-net.php: container-narrow open_basedir
  allow-list, much tighter than the panel-side one.

- .gitea/workflows/build-push.yaml: builds + smoke-tests +
  PHP-syntax-checks + bash-syntax-checks before pushing to
  repo.anhonesthost.net/cloud-hosting-platform/cpanel-importer.

- tests/build-fixtures.sh: builds cpmove-clean.tar.gz (benign WP
  dump) and cpmove-alfa.tar.gz (the ALFA-shell symlink-to-/etc
  vector) for local end-to-end testing.

- README.md / CONTRIBUTING.md: docker-run invocation, bind-mount
  catalog, report.json schema, how to add a cleaner pattern or a WP
  scan signature.

Local acceptance test results:
- clean fixture -> status=completed, 3 MyISAM->InnoDB, no flags, exit 0
- ALFA fixture -> exit 1, status=failed, failed_stage=extract,
  "tarball contains dangerous symlinks; aborting" on stderr
- compromised-siteurl fixture -> imported_into_new_server=false,
  .flagged file written, summary_for_panel.show_alert=true

Image size: 197 MB compressed (gzipped docker save), ~397 MB unique
layers extracted. Well under the spec's 600 MB compressed / 1.2 GB
extracted budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 19:56:57 -07:00

# Contributing — cpanel-importer
## How to add an auto-cleaner pattern
Auto-cleaners live in `scripts/scan-files.php`, in the `$cleaners`
registry at the top of the main flow.
A cleaner has three parts:
```php
$cleaners['short-cleaner-name'] = [
    'class' => 'KNOWN_REMOVABLE', // or 'REMOVABLE_WITH_BACKUP'
    'match' => fn(string $sig): bool => str_contains($sig, 'PHP.Trojan.EvalB64'),
    'clean' => function (string $path): bool {
        // Read $path, transform, write back; return true on success.
        // The file at $path is the LIVE extracted file; your edit
        // here is what ends up in /host/sanitized/<id>/extracted/.
        // The original has ALREADY been backed up to <path>.original
        // by the orchestrator before this is called.
        return false; // placeholder: returning false falls back to quarantine
    },
];
```
### Safety checklist before merging a new cleaner
1. **Backup is guaranteed.** The orchestrator copies the file to
`<quarantine>/<relpath>.original` BEFORE calling `clean()`. Verify
this is still true in `scan-files.php` if you refactor the dispatch.
2. **Cleaner is idempotent.** Running it twice on the same file must
produce the same output the second time as the first.
3. **Cleaner is conservative.** If the file does NOT match your
transform exactly, return `false` (the orchestrator will fall back
to quarantining). Never "best-effort" a half-clean.
4. **Cleaner has a regression test.** Add a fixture under
`tests/fixtures/cleaner-<name>/` with input + expected output, and
exercise it from `tests/run-tests.sh` (or your CI step).
5. **Cleaner classification is correct.**
   - `KNOWN_REMOVABLE` = the whole pattern is known-safe to strip.
   - `REMOVABLE_WITH_BACKUP` = legit file with injected lines; we are
     confident in surgical removal but back up anyway.
   - `QUARANTINE_ONLY` = no clean variant; don't write a `clean()`.
6. **Signature match is tight.** Prefer
`str_contains($sig, 'specific-sig-name')` over broad regex matches.
A false-positive cleaner can corrupt customer files.
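Items 2 and 4 of the checklist (idempotence and a regression fixture) can be checked with one small helper. The sketch below is only an illustration: the `in/` and `expected/` sub-directory layout, the function name `check_cleaner_fixture`, and the convention that a cleaner is invoked as `<command> <file>` are all assumptions, not the repository's actual harness. It is meant to be sourced from a script such as `tests/run-tests.sh`.

```bash
# Sketch only: fixture layout and calling convention are assumptions.
#
# check_cleaner_fixture <fixture-dir> <command...>
# Runs "<command...> <file>" on a scratch copy of every file in
# <fixture-dir>/in, compares the result with <fixture-dir>/expected,
# then runs the command a second time to confirm the output is stable.
check_cleaner_fixture() {
    local fixture="$1"; shift
    local work rc=0 src name
    work="$(mktemp -d)"
    for src in "$fixture"/in/*; do
        name="$(basename "$src")"
        cp "$src" "$work/$name"
        "$@" "$work/$name"                       # first pass
        if ! cmp -s "$work/$name" "$fixture/expected/$name"; then
            echo "FAIL (output mismatch): $name" >&2
            rc=1
            continue
        fi
        "$@" "$work/$name"                       # second pass
        if ! cmp -s "$work/$name" "$fixture/expected/$name"; then
            echo "FAIL (not idempotent): $name" >&2
            rc=1
        fi
    done
    rm -rf "$work"
    return "$rc"
}
```

Working on a scratch copy keeps the fixture input pristine, so the same fixture can be reused on every run.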
### Manual test loop
```bash
docker build -t cpanel-importer:dev .
# Place a known-infected synthetic file under tests/fixtures/cleaner-X/in/
# Run scan-files.php directly against it:
docker run --rm \
--entrypoint /scripts/scan-files.php \
-v "$PWD/tests/fixtures/cleaner-X/in:/tmp/extract" \
-v "$PWD/tests/fixtures/cleaner-X/quarantine:/host/quarantine" \
cpanel-importer:dev \
--extract /tmp/extract --quarantine /host/quarantine \
--report /tmp/r.json --import-id test
```
---
## How to add a WordPress content scan signature
Scan checks live in `scripts/scan-dbs.php`, in `wp_content_scan()`.
Each check should produce a flag dict on hit:
```php
$flags[] = [
    'severity' => 'high', // 'high' refuses the DB (per default threshold N=1);
                          // 'medium' / 'low' flag in the report but allow import
    'code' => 'short_machine_readable_code',
    'details' => 'Human-readable explanation including the matched value(s).',
];
```
### Safety checklist
1. **Severity reflects confidence.** Use `high` only when a false
positive is acceptable for the customer (they can re-import via the
"import anyway" UI button). A misjudged severity here translates
directly into admin support tickets.
2. **Check is fast.** The whole `.sql` dump is in memory as a string;
prefer `preg_match` on the raw string or a pre-built map (see
`extract_wp_options()`) over re-parsing the full dump.
3. **Check is well-tested.** Add a fixture under
`tests/fixtures/wp-scan-<code>/` with a synthetic dump that
triggers the flag and one that does not.
4. **Allow-list awareness.** If the check is comparing a value against
the customer's domain list, use
`domain_in_allowlist($host, $allowedDomains)` so subdomain matches
work consistently with the rest of the scanner.
5. **Don't break engine swap.** `wp_content_scan()` runs AFTER the
engine swap on the same `$rewritten` string. Both your check and
the engine swap must be tolerant of each other's output.
---
## How to test locally
### Build the image
```bash
docker build -t cpanel-importer:dev .
```
Confirm the image is under the budget:
```bash
docker images cpanel-importer:dev --format '{{.Size}}'
```
Target: < 1 GB extracted (spec asks < 600 MB compressed for prod, but
local builds typically come in around 700-900 MB extracted including
ClamAV signature DBs).
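The human-readable size string is awkward to compare in a script. A minimal sketch of an automated budget check follows, assuming you read the size in bytes from `docker image inspect` (which reports `.Size` as a byte count). The helper name `size_within_budget` and the choice of 1 MB = 1000 × 1000 bytes are assumptions made for illustration.

```bash
# Sketch only: helper name and MB definition are assumptions.
#
# size_within_budget <size-bytes> <limit-mb>
# Succeeds when the size is strictly below the limit.
size_within_budget() {
    local bytes="$1" limit_mb="$2" limit_bytes
    limit_bytes=$(( limit_mb * 1000 * 1000 ))
    [ "$bytes" -lt "$limit_bytes" ]
}

# Usage against a local image:
#   size="$(docker image inspect cpanel-importer:dev --format '{{.Size}}')"
#   size_within_budget "$size" 1000 || echo "image exceeds the 1 GB target" >&2
```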
### Build the fixtures
```bash
bash tests/build-fixtures.sh
```
Two tarballs land under `tests/fixtures/`:
- `cpmove-clean.tar.gz` — a benign cpmove with a WordPress MyISAM dump.
- `cpmove-alfa.tar.gz` — same shape PLUS an ALFA-style symlink to /etc.
### Run against the clean fixture
```bash
mkdir -p /tmp/test-quarantine /tmp/test-sanitized
docker run --rm \
-e IMPORT_ID=test-clean \
-e IMPORT_USERNAME=testuser \
-e IMPORT_BACKUP_FILE=/host/backup/cpmove-clean.tar.gz \
-e CLAMAV_REFRESH=false \
-v "$PWD/tests/fixtures/cpmove-clean.tar.gz:/host/backup/cpmove-clean.tar.gz:ro" \
-v /tmp/test-quarantine:/host/quarantine \
-v /tmp/test-sanitized:/host/sanitized \
cpanel-importer:dev
```
Expect `status=completed`, MyISAM count > 0, no flags, exit 0.
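To turn that expectation into a scripted check, a small helper can pull fields out of `report.json`. This is a sketch under assumptions: it only handles top-level string fields written as `"name": "value"`, the helper names are hypothetical, and the location of `report.json` for your run is not stated here (see the README for the bind-mount catalog and schema). If `jq` is available, it is a more robust choice than the `sed` expression used below.

```bash
# Sketch only: handles simple "field": "value" string pairs, nothing more.
#
# report_field <report.json> <field>
# Prints the value of the first matching string field, or nothing.
report_field() {
    local report="$1" field="$2"
    sed -n "s/.*\"${field}\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$report" | head -n 1
}

# assert_report_field <report.json> <field> <expected>
# Returns non-zero and explains the mismatch on stderr when the field differs.
assert_report_field() {
    local report="$1" field="$2" expected="$3" actual
    actual="$(report_field "$report" "$field")"
    if [ "$actual" != "$expected" ]; then
        echo "FAIL: ${field}=${actual:-<missing>} (expected ${expected})" >&2
        return 1
    fi
    echo "ok: ${field}=${expected}"
}

# Usage, with REPORT set to wherever report.json landed for this run:
#   assert_report_field "$REPORT" status completed
```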
### Run against the ALFA fixture
```bash
docker run --rm \
-e IMPORT_ID=test-alfa \
-e IMPORT_USERNAME=testuser \
-e IMPORT_BACKUP_FILE=/host/backup/cpmove-alfa.tar.gz \
-e CLAMAV_REFRESH=false \
-v "$PWD/tests/fixtures/cpmove-alfa.tar.gz:/host/backup/cpmove-alfa.tar.gz:ro" \
-v /tmp/test-quarantine:/host/quarantine \
-v /tmp/test-sanitized:/host/sanitized \
cpanel-importer:dev
```
Expect non-zero exit, `status=failed`, `failed_stage=extract`, and
stderr from inside the container containing
`tarball contains dangerous symlinks; aborting`.
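The two observable parts of that expectation (a non-zero exit and the stderr message) can be asserted in one step. The helper below is a hypothetical sketch, not part of the repository; it wraps any command, so the `docker run` invocation above can be passed to it unchanged.

```bash
# Sketch only: helper name is hypothetical.
#
# expect_failure_with_message <needle> <command...>
# Succeeds only when the command exits non-zero AND its stderr
# contains <needle> as a literal substring.
expect_failure_with_message() {
    local needle="$1"; shift
    local err_file rc=0
    err_file="$(mktemp)"
    "$@" >/dev/null 2>"$err_file" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "FAIL: command exited 0, expected a non-zero exit" >&2
        rm -f "$err_file"
        return 1
    fi
    if ! grep -qF -- "$needle" "$err_file"; then
        echo "FAIL: stderr did not contain: $needle" >&2
        rm -f "$err_file"
        return 1
    fi
    rm -f "$err_file"
    echo "ok: exit=$rc, stderr matched"
}

# Usage: prefix the docker run command from this section with
#   expect_failure_with_message 'tarball contains dangerous symlinks; aborting'
```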
### Iterating on PHP / shell scripts
The `scripts/` directory is `COPY`ed in late in the Dockerfile, so
edits there only re-trigger the last layer of the build — typical
turnaround is ~5 seconds.
---
## Code style
- Bash scripts: `set -euo pipefail`, absolute paths only, every external
command on its own logical line, comment each non-obvious flag.
- PHP scripts: 4-space indent, single quotes for non-interpolated
strings, `<?php` opener on line 1, no closing `?>`.
- All scripts must be idempotent — the worker may be re-run against the
same `IMPORT_ID` on retry; second runs must overwrite the prior
`report.json` cleanly.
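One common way to meet the "overwrite cleanly on retry" requirement is to write to a temporary file and rename it over the target. The sketch below illustrates that pattern in bash; it is not a description of how the worker currently writes `report.json`, and the helper name is hypothetical.

```bash
# Sketch only: illustrates write-then-rename, not the worker's actual code.
#
# write_report_atomically <target-path>
# Reads the new report body from stdin, writes it to a temp file in the
# same directory, then renames it over the target. A retry replaces the
# old report in one step instead of leaving a half-written file behind.
write_report_atomically() {
    local target="$1" dir tmp
    dir="$(dirname "$target")"
    tmp="$(mktemp "$dir/.report.XXXXXX")"
    cat > "$tmp"
    mv -f "$tmp" "$target"
}
```

Creating the temp file in the target's own directory keeps the rename on a single filesystem, which is what makes the replacement a single step.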
---
## CI
Pushes to `trunk` build + push the image to
`repo.anhonesthost.net/cloud-hosting-platform/cpanel-importer:latest` and
`...:<sha>`. Pushes of a `YYYY.MM.NNN` tag additionally tag
`...:YYYY.MM.NNN`. CI runs the smoke test (image starts and
`echo ok` runs) and PHP `-l` / `bash -n` syntax checks on every script
before pushing.
See `.gitea/workflows/build-push.yaml`.
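To catch syntax errors before pushing, the same kind of check can be approximated locally. The following is a sketch, not a copy of the workflow: the helper name `syntax_check_tree` is hypothetical, and the PHP pass is skipped when no `php` binary is on `PATH`.

```bash
# Sketch only: a local approximation of the CI syntax checks.
#
# syntax_check_tree <dir>
# Runs `bash -n` on every *.sh file and, when php is available,
# `php -l` on every *.php file. Returns non-zero if any file fails.
syntax_check_tree() {
    local root="$1" rc=0
    # With "-exec ... {} +", find exits non-zero if any invocation fails.
    find "$root" -type f -name '*.sh' \
        -exec sh -c 'rc=0; for f; do bash -n "$f" || { echo "bash -n failed: $f" >&2; rc=1; }; done; exit "$rc"' _ {} + || rc=1
    if command -v php >/dev/null 2>&1; then
        find "$root" -type f -name '*.php' \
            -exec sh -c 'rc=0; for f; do php -l "$f" >/dev/null || { echo "php -l failed: $f" >&2; rc=1; }; done; exit "$rc"' _ {} + || rc=1
    fi
    return "$rc"
}

# Usage from the repository root:
#   syntax_check_tree scripts
```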