Forge + ESXi 9.1: What Actually Broke and Why

Forge is an application I am actively developing. Lab-specific addresses, hostnames, and domains have been generalized; the failure modes and implementation lessons are the point.

The pattern

ESXi 9.1 provisioning did not fail in one clean place. It failed in layers. The job engine reported success when the host was not configured. Kickstart syntax looked valid while ESXi ignored the command that mattered. Redfish calls returned HTTP 200 while the server booted the old operating system. Connectivity checks ran to completion, but they were testing from the wrong network.

That is exactly why I am building Forge as a job-driven system with logs, state, and outcome checks. Bare-metal automation cannot trust a happy API response. It has to verify what happened on the machine.

The failure pattern

API layer: Redfish returned success while boot-once was not applied
Installer layer: Kickstart completed while ESXi ignored the network command
Job layer: Forge marked complete before post-install configuration succeeded

1. Jobs reported complete when they were not

The first failure was self-inflicted. Jobs showed complete in the UI even though the log said SSH was unavailable and post-install configuration had been skipped.

The cause was simple: job["status"] = "complete" was unconditional. Forge was recording that the code path had finished, not that the installation had succeeded.

The fix was to make job state conditional on the actual result. If SSH never comes up, the job now fails and returns early. A job is only complete when post-install configuration actually runs.

Job state has to follow outcome

Bad model: Function exits, job becomes complete
Better model: SSH reachable, post-install config applied, job becomes complete
Failure path: SSH unavailable or config skipped, job becomes failed

That distinction matters. A job engine that reports success because the function reached the bottom is not an infrastructure automation system. It is a progress bar with false confidence.
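The outcome-driven model can be sketched roughly as follows. This is a minimal illustration, not Forge's actual code: the function name run_install_job and the two injected callables (wait_for_ssh, apply_post_install) are hypothetical stand-ins for whatever checks the real job performs.

```python
from typing import Callable


def run_install_job(
    job: dict,
    wait_for_ssh: Callable[[], bool],
    apply_post_install: Callable[[], bool],
) -> dict:
    """Set job status from verified outcomes, not from reaching the end.

    Both callables are hypothetical: each returns True only when the
    step was observed to succeed on the host.
    """
    job.setdefault("log", [])

    if not wait_for_ssh():
        job["log"].append("SSH unavailable; post-install configuration skipped")
        job["status"] = "failed"
        return job  # return early so nothing below can mark this complete

    if not apply_post_install():
        job["log"].append("post-install configuration did not apply")
        job["status"] = "failed"
        return job

    # Only reached when every outcome check passed.
    job["status"] = "complete"
    return job
```

Passing the checks in as callables keeps the status logic testable without a real host: a test can force the SSH wait to fail and assert that the job ends up failed.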

2. ESXi 9.x ignored the network kickstart command

Hosts installed successfully but came up without the expected management VLAN on the port group. The kickstart looked right. The install completed. The network state was wrong.

The root cause was that ESXi 9.x ignored the traditional network kickstart command in this path. No useful error. No obvious warning. Just a host that installed cleanly and did not apply the network configuration Forge asked for.

The fix was to move the host configuration into a %firstboot --interpreter=busybox block and use localcli directly.

%firstboot --interpreter=busybox
localcli network ip interface ipv4 set \
  --interface-name=vmk0 \
  --ipv4=<management-ip> \
  --netmask=<netmask> \
  --type=static

localcli network ip route ipv4 add \
  --gateway=<gateway> \
  --network=0.0.0.0/0

localcli system hostname set \
  --host=<host-name> \
  --domain=<domain>

localcli network ip dns server add --server=<dns-server>
localcli system ntp set --enabled=true --server=<ntp-server>
localcli network firewall ruleset set --enabled=true --ruleset-id=sshServer
/sbin/chkconfig SSH on
/sbin/chkconfig TSM-SSH on

# Set the management VLAN last, because it changes reachability.
localcli network vswitch standard portgroup set \
  --portgroup-name='Management Network' \
  --vlan-id=<management-vlan>

The localcli detail matters. localcli talks directly to the local system and does not depend on hostd being fully available. In early first boot, that makes it a better fit than esxcli for this stage of the install.

The VLAN step also has to be treated carefully. Changing the VLAN on the management port group can drop connectivity. In first boot, that is acceptable because the commands are running locally, but it still needs to be the last network operation.
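One way to keep that ordering from regressing is to generate the block from code that always emits the VLAN command last. The sketch below is an assumption about how such a renderer could look; HostNetConfig and render_firstboot_network are hypothetical names, and only the network commands already shown above are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostNetConfig:
    """Hypothetical container for the per-host values in the kickstart."""
    management_ip: str
    netmask: str
    gateway: str
    management_vlan: int


def render_firstboot_network(cfg: HostNetConfig) -> str:
    """Render the network part of %firstboot with the VLAN change last."""
    if not 0 <= cfg.management_vlan <= 4095:
        raise ValueError(f"VLAN ID out of range: {cfg.management_vlan}")

    steps = [
        "localcli network ip interface ipv4 set --interface-name=vmk0 "
        f"--ipv4={cfg.management_ip} --netmask={cfg.netmask} --type=static",
        "localcli network ip route ipv4 add "
        f"--gateway={cfg.gateway} --network=0.0.0.0/0",
    ]
    # Kept separate from `steps` so it cannot be reordered by accident.
    vlan_step = (
        "localcli network vswitch standard portgroup set "
        "--portgroup-name='Management Network' "
        f"--vlan-id={cfg.management_vlan}"
    )
    lines = ["%firstboot --interpreter=busybox", *steps, vlan_step]
    return "\n".join(lines) + "\n"
```

The point of the design is only that the VLAN step is structurally last, so adding a new network command later cannot push it into the middle of the block.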

3. The Redfish boot-once patch that did nothing

Forge logged that it had set one-time boot to virtual CD and then rebooted the server. The server came back running the existing ESXi install. The Redfish calls looked successful. The machine ignored them.

The first problem was the standard Redfish boot override path. A PATCH to the system's boot settings with a virtual CD target returned success on the iDRAC hosts in this lab, but it did not change the next boot. The allowable values the BMC exposed for the override target made the problem visible: the standard target was not a real option on that platform.
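A pre-flight check along these lines would have caught the no-op before the reboot. Redfish publishes the permitted override targets in the BootSourceOverrideTarget@Redfish.AllowableValues annotation on the system resource; the helper name boot_target_allowed and the sample document are illustrative assumptions, not values captured from the lab hosts.

```python
def boot_target_allowed(system: dict, target: str) -> bool:
    """Return True if `target` is listed as an allowable boot override.

    `system` is the parsed JSON of a Redfish ComputerSystem resource.
    A missing annotation is treated as "not allowed" so that the caller
    has to confirm support instead of assuming it.
    """
    boot = system.get("Boot", {})
    allowed = boot.get("BootSourceOverrideTarget@Redfish.AllowableValues", [])
    return target in allowed


# Illustrative sample only; real BMCs report their own list.
sample_system = {
    "Boot": {
        "BootSourceOverrideTarget": "None",
        "BootSourceOverrideTarget@Redfish.AllowableValues": ["None", "Pxe", "Hdd"],
    }
}
```

With the sample above, asking for "Cd" returns False, which is the signal to pick a different mechanism rather than send a PATCH that will be accepted and ignored.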

The working path was to use BIOS settings instead: set one-time UEFI boot mode and point the one-time UEFI boot device at the iDRAC virtual media device discovered from live BIOS attributes.

Dell iDRAC boot-once path

No-op path: Standard boot override returned success but did not change next boot
Pending path: BIOS settings patch queued the change but did not apply it by itself
Working path: BIOS settings patch plus Dell OEM job caused the next boot to consume the setting

That still was not enough. On Dell iDRAC, pending BIOS settings need a separate BIOS configuration job. Without that job, the settings sit in the pending endpoint and never get applied on reboot.

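# After PATCH /Bios/Settings succeeds, create the Dell BIOS config job.
# Assumes `s` is an authenticated requests.Session and `base` ends in /redfish/v1.
resp = s.post(
    f"{base}/Managers/iDRAC.Embedded.1/Oem/Dell/Jobs",
    json={
        "TargetSettingsURI":
            "/redfish/v1/Systems/System.Embedded.1/Bios/Settings"
    },
)
resp.raise_for_status()  # a 2xx only means the job was queued, not applied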

The verification was not the HTTP response. The verification was checking current BIOS attributes after reboot and seeing that the one-time boot setting had been consumed.
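Checking state after the fact usually means polling, because the BMC and the host take time to settle. A generic helper like the one below is one way to express that; wait_until is a hypothetical name, and the probe passed to it would be whatever reads the current BIOS attributes and compares them with the expected post-reboot values.

```python
import time
from typing import Callable


def wait_until(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses.

    `sleep` and `clock` are injectable so the helper can be tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A False result here should fail the job. The HTTP status of the earlier PATCH and POST never enters the decision.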

4. Connectivity tests ran from the wrong machine

Another failure looked like the installed hosts were unreachable. All port checks against the management subnet failed, which suggested the install had not configured the host correctly.

The test was wrong. The development workstation did not have a route to the ESXi management subnet. The checks were proving that the workstation could not reach that network, not that the hosts were down.

The fix was to run SSH and port checks from the Forge server, which does have the right route, while keeping Redfish checks pointed at the reachable BMC network.

A clean negative result and a routing black hole can look identical. Before trusting connectivity tests, verify the test path.
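A port check that reports how it failed, not just that it failed, makes the test path easier to reason about. The sketch below is an illustration using Python's standard socket module; probe_tcp and its result labels are hypothetical, not Forge's API.

```python
import socket


def probe_tcp(host: str, port: int, timeout_s: float = 3.0) -> str:
    """Classify a TCP connection attempt instead of returning a bare bool.

    "refused" means something answered, so a route to the host exists.
    "timeout" means nothing came back: the host may be down, filtered,
    or simply not routable from where this code is running.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        # For example, the local machine has no route to that network.
        return "unreachable"
```

If every host on a subnet returns "timeout" or "unreachable" and none returns "refused" or "open", the test source is a more likely suspect than the hosts.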

5. Background jobs died during redeploy

Some jobs froze mid-run. Logs stopped updating. Status stayed at booting. The SSH wait never completed.

The cause was process lifecycle, not ESXi. Forge was running install jobs as Python background threads inside the web process. Redeploying the application tore down the container and killed any in-flight threads instantly. The job record kept whatever state and log lines existed at the moment of termination.

The immediate mitigation is simple: do not redeploy while jobs are active. The long-term answer is a persistent job queue or task runner that survives process restarts and can mark interrupted work correctly.
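Until a real queue exists, a startup sweep can at least stop interrupted jobs from looking alive. The following is a sketch under the assumption that job records are persisted as dictionaries with id, status, and log fields; mark_interrupted and the "interrupted" status are hypothetical names.

```python
TERMINAL_STATES = {"complete", "failed", "interrupted"}


def mark_interrupted(jobs: list[dict]) -> list[str]:
    """Mark every non-terminal job as interrupted and return their ids.

    Intended to run once at process start, before any new job is
    accepted: whatever was still in flight belonged to a process that
    no longer exists.
    """
    interrupted = []
    for job in jobs:
        if job.get("status") not in TERMINAL_STATES:
            job.setdefault("log", []).append(
                "process restarted while job was in flight"
            )
            job["status"] = "interrupted"
            interrupted.append(job["id"])
    return interrupted
```

This does not resume anything. It only makes the record honest, so an operator sees an interrupted job instead of one that has been booting forever.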

This is one of those product lessons that only shows up once the tool starts managing real hardware. A job engine has to survive the operator's workflow, not just the happy path in code.

6. NTP ran once but did not persist

Fresh installs showed NTP configured and synchronized. After reboot, the service was not running.

The issue was startup policy. Setting NTP servers and enabling the service for the current session did not guarantee the daemon would start at boot.

The fix was to make startup explicit in both first boot and post-install configuration.

/sbin/chkconfig ntpd on

On this ESXi path, chkconfig was the mechanism that mattered for persistence.
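Persistence can be checked the same way as everything else: read the state back after a reboot. The parser below assumes the output of a chkconfig listing is one service per line in the form "name on" or "name off"; that shape is an assumption to confirm on the target build, and enabled_at_boot is a hypothetical helper name.

```python
def enabled_at_boot(chkconfig_output: str, service: str) -> bool:
    """Return True if `service` is listed with state "on".

    Assumes each line looks like "<name> <on|off>" with arbitrary
    whitespace between the two fields. Unknown services count as
    not enabled.
    """
    for line in chkconfig_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == service:
            return fields[1] == "on"
    return False
```

Running this against output collected over SSH after the reboot, not before it, is what separates "NTP worked once" from "NTP will start at boot".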

Summary

Issue	Misleading signal	Actual cause
Jobs always complete	UI status said complete	Status was assigned unconditionally
Network config missing	Install completed	ESXi 9.x ignored the kickstart network command
Boot-once ignored	Redfish returned success	Standard boot override was not functional for this platform
BIOS setting not consumed	Pending setting existed	Dell required an OEM BIOS configuration job
Connectivity failed	All checks returned negative	The test source had no route to the management subnet
Jobs froze	Logs stopped, status stayed booting	Redeploy killed in-process background threads
NTP not persistent	NTP worked before reboot	Service startup policy was not set

The lesson

The theme across all six issues is that success responses are not the same as success on the metal. A Redfish call can return 200 and do nothing useful. A kickstart can be syntactically valid and still not apply the setting you need. A connection test can complete while proving only that the test source is in the wrong place. A job can finish its code path while the host is not configured.

Forge has to be skeptical by design. The job engine should report outcomes, not code paths. The BMC layer should verify current state, not just response codes. The installer path should test what ESXi actually applied. Connectivity checks should run from the machine that represents the real operational path.

That is the difference between writing an installer wrapper and building infrastructure automation. The wrapper runs commands. The automation system proves the machine ended up where it was supposed to be.