The boring dependency broke the advanced platform
VCF 9.1 brings enough moving parts that it is tempting to look for a complicated failure when registration or operations integration breaks.
In this rebuild, the root cause was simpler and more painful: the FQDNs in the bring-up specification did not match what DNS could resolve.
The platform components deployed. Some services appeared healthy. Credentials, licensing, and connectivity were not the issue. But the management plane was still built on names, and the names were wrong.
What failed
The failing path was registration and federation around the operations components. The spec referenced service names that either did not exist in forward DNS, differed from the shortened names actually deployed, or had reverse records that did not round-trip cleanly.
That created a split-brain feeling: appliances were present, some endpoints were reachable, and yet the higher-level integration never completed.
This is exactly the kind of failure that wastes time because it looks like an application problem after deployment. It was really a pre-deployment naming problem.
```console
# Sanitized shape of the failure
$ dig +short operations-lb.example.internal
$ dig +short -x 192.0.2.45
old-platform-name.example.internal.
$ installer precheck
ERROR: operations load balancer FQDN does not resolve
ERROR: reverse DNS does not match requested endpoint identity
```
VCF names are not labels
The important mental shift is that names in the bring-up spec are not friendly labels. They are integration contracts.
Operations registration, SSO federation, monitoring adapter activation, load balancer endpoints, collectors, managers, and service discovery all rely on exact names resolving correctly.
If a component is deployed under a normalized or shortened hostname but the spec still points at the longer original name, the platform does not infer intent. It tries the name it was given.
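One way to catch this before bring-up is to compare the names in the spec against the hostnames the appliances actually report. The following is a minimal Python sketch, not a VCF tool: `normalize` and `find_name_drift` are hypothetical helpers, `ops-lb` and `nsx-manager-01` are made-up example names, and collecting the two lists (parsing the spec, asking each appliance for its hostname) is left out because it depends on the environment.

```python
def normalize(name):
    """Lowercase and drop the trailing dot so DNS-style names compare equal."""
    return name.strip().rstrip(".").lower()


def find_name_drift(spec_fqdns, deployed_hostnames):
    """Return spec FQDNs that no deployed appliance reports as its hostname.

    Both arguments are plain iterables of strings. Only case and a trailing
    dot are ignored; a shortened or renamed host counts as drift.
    """
    deployed = {normalize(h) for h in deployed_hostnames}
    wanted = {normalize(f) for f in spec_fqdns}
    return sorted(name for name in wanted if name not in deployed)


if __name__ == "__main__":
    # Hypothetical inputs: the spec still uses the long name, while the
    # appliance was deployed under a shortened one.
    spec = ["operations-lb.example.internal", "nsx-manager-01.example.internal"]
    deployed = ["ops-lb.example.internal", "NSX-Manager-01.example.internal."]
    for name in find_name_drift(spec, deployed):
        print(f"spec name with no matching deployed hostname: {name}")
```

The comparison is deliberately strict. It does not try to guess that a shortened name was meant to be the longer one, because the platform will not guess either.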
The checklist that should happen before bring-up
Every FQDN in the spec should have a forward A record before bring-up. Every address should have a reverse PTR record. Forward and reverse lookups should round-trip to the same expected name.
That includes management appliances, load balancer VIPs, NSX managers, operations nodes, fleet services, identity services, collectors, automation components, log components, and ESXi hosts.
The point is not only that DNS exists. The point is that the exact spelling in the spec, the deployed hostname, the forward record, and the reverse record all agree.
| Check | Pass condition | Why it matters |
|---|---|---|
| Forward lookup | Every spec FQDN returns the expected management IP or VIP | The installer and registration paths use the names you gave them. |
| Reverse lookup | Every IP returns the same FQDN used in the spec | Identity, federation, and monitoring adapters hate drift. |
| Length/normalization | Deployed hostnames match the names in the spec | Shortened appliance names can leave the spec pointing at ghosts. |
| Precheck | No DNS warnings are waived | The cheapest repair window is before bring-up starts. |
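The forward and reverse rows of the table can be approximated with a short script. This is a sketch under assumptions, not the installer's own validation: `check_round_trip` is a hypothetical helper, and the name-to-IP mapping is something you would extract from your own spec. The default lookups go through Python's `socket` module, which uses the host's resolver configuration (including any local hosts file), so run it from a machine that resolves names the same way the management network does.

```python
import socket


def forward_lookup(fqdn):
    """Return the set of IPv4 addresses the name resolves to (empty if none)."""
    try:
        infos = socket.getaddrinfo(fqdn, None, family=socket.AF_INET)
    except socket.gaierror:
        return set()
    return {info[4][0] for info in infos}


def reverse_lookup(ip):
    """Return the primary reverse name for an address, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None


def _norm(name):
    # Compare names case-insensitively and without the trailing dot.
    return name.rstrip(".").lower()


def check_round_trip(expected, forward=forward_lookup, reverse=reverse_lookup):
    """Check FQDN -> expected IP -> same FQDN; return a list of findings."""
    findings = []
    for fqdn, ip in expected.items():
        addresses = forward(fqdn)
        if not addresses:
            findings.append(f"{fqdn}: no forward record")
        elif ip not in addresses:
            findings.append(f"{fqdn}: resolves to {sorted(addresses)}, expected {ip}")
        ptr = reverse(ip)
        if ptr is None:
            findings.append(f"{ip}: no PTR record")
        elif _norm(ptr) != _norm(fqdn):
            findings.append(f"{ip}: PTR is {ptr}, expected {fqdn}")
    return findings


if __name__ == "__main__":
    # Offline demo with stub resolvers that mimic the failure described in
    # this post: no forward record, and a PTR that names an old platform.
    findings = check_round_trip(
        {"operations-lb.example.internal": "192.0.2.45"},
        forward=lambda fqdn: set(),
        reverse=lambda ip: "old-platform-name.example.internal.",
    )
    for finding in findings:
        print("FAIL", finding)
```

Passing the resolvers as parameters keeps the check testable without live DNS. Dropping the two stub arguments makes the same function query the real resolver.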
Do not bypass prechecks
The process change is more important than the individual typo.
VCF installer validation exists to catch this class of failure while the environment is still cheap to fix. If the precheck says a name does not resolve, that is not a warning to mentally file away. It is the platform telling you the bring-up contract is invalid.
A later workaround can sometimes repair a record, but it cannot always repair every registration and federation attempt that already failed against the wrong identity.
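In automation, the same discipline can be enforced by making every DNS finding fatal. A minimal sketch, assuming your own checks produce a list of finding strings; `gate` is a hypothetical helper and not part of the VCF installer:

```python
import sys


def gate(findings, stream=sys.stderr):
    """Return a process exit code: 0 only when there are no findings at all.

    There is deliberately no waive or warn-only option.
    """
    for finding in findings:
        print(f"BLOCKING: {finding}", file=stream)
    return 1 if findings else 0


if __name__ == "__main__":
    # Hypothetical finding, shaped like the precheck error in this post.
    code = gate(["operations-lb.example.internal: no forward record"])
    print(f"exit code would be {code}")
```

A pipeline step would call `sys.exit(gate(findings))` so that bring-up cannot start while any name is unresolved.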
The reusable lesson
This was a VCF 9.1 rebuild, but the lesson applies to every private AI or cloud foundation build.
Advanced platforms fail on basic substrate. DNS, time, certificates, identity, IP plans, and reverse lookup are not prerequisites you finish once and stop thinking about. They are active dependencies of the control plane.
Before blaming the new thing, prove the boring things. The boring things usually still own the outage.