Forge: From Powered-Off to Production-Ready

Forge is an application I am actively developing. This post is the first in a series on the problem space, architecture, implementation direction, and lessons learned while building it.

The problem

Every infrastructure platform has an automation story. Terraform provisions cloud resources. Ansible configures running systems. vSphere Lifecycle Manager (vLCM) manages ESXi hosts at scale. Nutanix Foundation images nodes. VMware Cloud Foundation bring-up orchestrates a full SDDC stack.

Almost every one of them assumes the operating system, or at least a node the installer can already reach, is already there.

That assumption is where lab engineers, small infrastructure teams, and anyone who has ever racked a physical server still spend hours they should not have to spend. Booting ISOs. Babysitting installers. Re-running kickstarts. Checking firmware by hand. Running post-install scripts from a terminal. Wondering which host is in which state after a rebuild gets interrupted.

In the environments I work in, the gap between a powered-off server and a running platform has never had a clean answer. That is the gap Forge is meant to own.

What Forge is

Forge is a bare-metal-to-platform automation system. It owns the physical layer: the part every other tool quietly assumes is already done.

Starting from a powered-off server with an out-of-band management interface, Forge can:

Pull hardware inventory from the BMC over Redfish.
Capture CPU, memory, storage topology, NIC state, and firmware details.
Generate a host-specific OS installer configuration.
Remaster a source ISO with that configuration embedded.
Mount the ISO as virtual media.
Force a one-time boot through the BIOS job queue.
Track the install as a live job with streamed logs.
Apply post-install configuration over SSH once the OS is up.
Hand off to the next platform-specific workflow stage.

No PXE infrastructure. No DHCP reservations. No manually staged ISOs. No hoping a screen session survived. One form submission, one job, one audit trail: from powered-off metal to configured operating system.
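
As a rough sketch of the virtual media and one-time boot steps, the Redfish requests can be built as plain data before anything is sent to a BMC. This is not Forge's actual code. The RedfishRequest type, the helper names, and the two resource paths are assumptions for illustration; real BMCs expose their own resource IDs, and the paths differ between iDRAC firmware generations. The sketch also swaps in the standard Redfish boot source override instead of the BIOS job queue approach described above, because the override is the vendor-neutral form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedfishRequest:
    """One HTTP call against a BMC, described as data (hypothetical type)."""
    method: str
    path: str
    body: dict


# Assumed resource paths. Discover the real ones by walking /redfish/v1
# on the target BMC instead of hardcoding them.
SYSTEM_PATH = "/redfish/v1/Systems/System.Embedded.1"
VIRTUAL_MEDIA_PATH = "/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD"


def insert_media(iso_url: str) -> RedfishRequest:
    """Build the standard VirtualMedia.InsertMedia action for a remote ISO."""
    return RedfishRequest(
        method="POST",
        path=f"{VIRTUAL_MEDIA_PATH}/Actions/VirtualMedia.InsertMedia",
        body={"Image": iso_url},
    )


def boot_once_from_cd() -> RedfishRequest:
    """Build the standard one-time boot override to the virtual CD."""
    return RedfishRequest(
        method="PATCH",
        path=SYSTEM_PATH,
        body={
            "Boot": {
                "BootSourceOverrideTarget": "Cd",
                "BootSourceOverrideEnabled": "Once",
            }
        },
    )


def install_boot_plan(iso_url: str) -> list:
    """Order matters: the media must be attached before the boot override."""
    return [insert_media(iso_url), boot_once_from_cd()]


plan = install_boot_plan("http://forge.example/iso/host01.iso")
for request in plan:
    print(request.method, request.path)
```

Keeping the requests as data makes each step easy to log in the job record and to test without touching hardware.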

The part Forge owns

Input: BMC access, hardware inventory, OS media, host intent
Forge work: Install media, virtual media boot, job tracking, post-install config
Output: A configured host ready for VCF, Nutanix, Proxmox, Azure Local, or Linux workflows

The workflow idea

An OS install is stage one. The larger idea is a workflow engine for platform standup: an ordered chain of jobs across multiple hosts with dependency tracking, retries, logs, and status rollup.

Workflow	Stage 1	Stage 2	Stage 3	End state
VMware VCF	ESXi on N hosts	Installer appliance	Bring-up API	SDDC ready
Nutanix AHV	Phoenix imaging	Cluster formation	Prism configuration	Cluster ready
Proxmox	Proxmox VE on N hosts	Cluster join	Ceph initialization	Cluster ready
Azure Local	Windows Server on N hosts	Failover cluster	S2D / Arc registration	HCI ready
Generic Linux	Distro install	Custom post-install	Optional handoff	Ready host

Each workflow is a template. Assign hosts, provide platform credentials, submit, and let Forge handle sequencing. Every stage is a job. Every job has a log. The whole standup becomes something you can run again later without depending on memory, notes, or a terminal scrollback that disappeared three rebuilds ago.
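
One way to picture the sequencing is a dependency graph sorted into waves, where every stage in a wave can run in parallel. The sketch below uses Python's standard graphlib module. The stage names and the shape of the workflow dictionary are hypothetical, not Forge's template format.

```python
from graphlib import TopologicalSorter

# Hypothetical stages for a three-host VCF-style workflow.
# Each key maps to the set of stages that must finish before it can start.
workflow = {
    "esxi-host01": set(),
    "esxi-host02": set(),
    "esxi-host03": set(),
    "installer-appliance": {"esxi-host01"},
    "bring-up": {
        "esxi-host01",
        "esxi-host02",
        "esxi-host03",
        "installer-appliance",
    },
}


def run_in_waves(stages: dict) -> list:
    """Group stages into waves whose members have no unmet dependencies."""
    sorter = TopologicalSorter(stages)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        # A real engine would run the jobs here and only mark the
        # successful ones as done; this sketch assumes every stage succeeds.
        sorter.done(*ready)
    return waves


waves = run_in_waves(workflow)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

With this input the three ESXi installs land in the first wave, the installer appliance in the second, and bring-up in the third.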

Workflow state model

Host layer: Inventory, virtual media, one-time boot, OS install, post-install config
Job layer: State, logs, retries, dependencies, and failure reporting
Platform layer: VCF, Nutanix, Proxmox, Azure Local, or Linux handoff workflows
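
The job layer can be sketched as a small state machine with a status rollup. The state names, the retry budget, and the rollup rules below are assumptions chosen for illustration; the post does not specify how Forge models them.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed transitions. FAILED -> PENDING represents a retry being queued.
ALLOWED = {
    JobState.PENDING: {JobState.RUNNING},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED},
    JobState.FAILED: {JobState.PENDING},
    JobState.SUCCEEDED: set(),
}


class Job:
    def __init__(self, name: str, max_retries: int = 2):
        self.name = name
        self.state = JobState.PENDING
        self.retries_left = max_retries
        self.log = []  # every transition is recorded for the audit trail

    def transition(self, new_state: JobState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.name}: {self.state.value} -> {new_state.value}")
        if self.state is JobState.FAILED:
            if self.retries_left == 0:
                raise RuntimeError(f"{self.name}: retry budget exhausted")
            self.retries_left -= 1
        self.log.append(f"{self.state.value} -> {new_state.value}")
        self.state = new_state


def rollup(jobs: list) -> JobState:
    """Collapse many job states into one workflow-level status."""
    states = {job.state for job in jobs}
    if JobState.FAILED in states:
        return JobState.FAILED
    if states and states == {JobState.SUCCEEDED}:
        return JobState.SUCCEEDED
    if states <= {JobState.PENDING}:
        return JobState.PENDING
    return JobState.RUNNING


install = Job("esxi-host01")
config = Job("post-install-host01")
install.transition(JobState.RUNNING)
install.transition(JobState.SUCCEEDED)
print(rollup([install, config]).value)
```

Rejecting illegal transitions up front is what lets an interrupted rebuild be inspected later: the log shows exactly which state each job reached.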

Why it is different

Forge starts from zero. Not from a running operating system. Not from a provisioned VM. Not from a node that a platform installer can already reach. From a powered-off physical server.

That is not a subtle distinction. It is where real infrastructure work begins, and it is where most higher-level automation tools intentionally stop caring.

Forge is also not trying to replace Terraform, Ansible, VCF bring-up, Nutanix Foundation, or platform-native automation. It runs before those tools. The goal is to put the machine into a known, configured, auditable state and then hand off cleanly to whatever comes next.

Hardware-aware by design

The inventory step matters. Forge reads real hardware state from the BMC before writing installer configuration: CPUs, memory, disks, controllers, NICs, link state, firmware versions, and management details.

That turns provisioning from "I hope this host looks like the spreadsheet" into something the system can reason about. If a NIC link is down, storage looks wrong, or firmware does not match the expected profile, that should be visible before the install starts, not discovered halfway through a platform bring-up.
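
A preflight check along those lines might look like the sketch below: compare what the BMC reported against an expected profile and return every mismatch before the install starts. The Inventory and Profile shapes, the field names, and the sample values are hypothetical, not Forge's host model.

```python
from dataclasses import dataclass


@dataclass
class Inventory:
    """What the BMC reported (hypothetical, simplified shape)."""
    firmware: dict   # component name -> installed version
    nic_links: dict  # NIC name -> True if link is up
    disk_count: int


@dataclass
class Profile:
    """What the host is expected to look like (hypothetical shape)."""
    firmware: dict
    required_nics: list
    min_disks: int


def preflight(inventory: Inventory, profile: Profile) -> list:
    """Return a list of human-readable problems; empty means safe to install."""
    problems = []
    for component, wanted in profile.firmware.items():
        found = inventory.firmware.get(component)
        if found != wanted:
            problems.append(f"firmware {component}: expected {wanted}, found {found}")
    for nic in profile.required_nics:
        if not inventory.nic_links.get(nic, False):
            problems.append(f"NIC {nic}: link down or missing")
    if inventory.disk_count < profile.min_disks:
        problems.append(
            f"disks: expected at least {profile.min_disks}, found {inventory.disk_count}"
        )
    return problems


reported = Inventory(
    firmware={"BIOS": "2.1.0"},
    nic_links={"NIC.Slot.1-1": True, "NIC.Slot.1-2": False},
    disk_count=4,
)
expected = Profile(
    firmware={"BIOS": "2.2.0"},
    required_nics=["NIC.Slot.1-1", "NIC.Slot.1-2"],
    min_disks=4,
)
for problem in preflight(reported, expected):
    print(problem)
```

Collecting all problems instead of stopping at the first one means a single inventory pass tells the operator everything that needs fixing.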

Repeatability is the product

Labs get rebuilt. VCF domains get redeployed. Nutanix clusters get re-imaged. Proxmox clusters get torn down and built again. That is not an edge case. For the environments I work in, rebuilds are part of the operating model.

Forge treats repeatability as the product. Host definitions, installer configs, kickstart libraries, workflow templates, and job history all become part of the system. The tenth rebuild should be as boring as the first.

Why a web app

I want Forge to be visible and auditable from a browser. The work it performs is too important to hide inside a one-off terminal session.

A web app gives the full workflow a surface area: hosts, inventory, job state, live logs, notes, install history, workflow status, and failures that can be reviewed after the fact. If a host fails during install, the question should not be "who had the terminal open?" It should be "what does the job log say?"

Current state

Forge currently handles the full ESXi bare-metal provisioning pipeline on Dell servers managed through iDRAC:

Redfish inventory and lifecycle actions.
Virtual media mount and BIOS boot-once through the job queue.
Kickstart generation and ISO remastering with host-specific config.
Post-install SSH configuration for networking, hostname, DNS, NTP, and TLS regeneration.
Live job tracking with streamed logs, host notes, and a kickstart library.
Multiple ESXi hosts under management with active installs running in the lab.
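
To make the kickstart generation step concrete, here is a minimal sketch that renders a host-specific ESXi kickstart from a small host definition. The HostIntent fields and the render_kickstart helper are hypothetical, and the choice of what goes into the kickstart versus the post-install SSH stage is an assumption; Forge's real templates may split that work differently. The kickstart directives themselves are standard ESXi scripted-install commands.

```python
from dataclasses import dataclass
from ipaddress import ip_interface


@dataclass(frozen=True)
class HostIntent:
    """Per-host values fed into the installer config (hypothetical shape)."""
    hostname: str
    address: str  # management address in CIDR form, e.g. "192.0.2.11/24"
    gateway: str
    dns: str
    root_password_hash: str
    nic: str = "vmnic0"


def render_kickstart(host: HostIntent) -> str:
    """Render a minimal ESXi kickstart with static management networking."""
    iface = ip_interface(host.address)  # validates the address, derives netmask
    lines = [
        "vmaccepteula",
        f"rootpw --iscrypted {host.root_password_hash}",
        "install --firstdisk --overwritevmfs",
        "network --bootproto=static "
        f"--device={host.nic} --ip={iface.ip} --netmask={iface.netmask} "
        f"--gateway={host.gateway} --nameserver={host.dns} "
        f"--hostname={host.hostname}",
        "reboot",
    ]
    return "\n".join(lines) + "\n"


# Documentation-range addresses and a placeholder hash, not real values.
host = HostIntent(
    hostname="esx01.lab.example",
    address="192.0.2.11/24",
    gateway="192.0.2.1",
    dns="192.0.2.53",
    root_password_hash="$6$placeholder",
)
kickstart = render_kickstart(host)
print(kickstart)
```

Parsing the address with ipaddress rejects malformed input before an ISO is ever remastered, which is cheaper than discovering the typo at the installer console.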

VCF Installer appliance deployment is the next active stage, with Nutanix Phoenix ISO support designed in parallel.

Where it goes

The foundation is in place. The Redfish layer is hardware-aware. The job engine exists. The host model is extensible. The next step is to move from individual host provisioning into full platform workflows.

Appliance deployment with govc-based OVA workflows.
Multi-host workflow chaining with dependency graphs and status rollup.
OS pluggability for Phoenix, Windows unattended, Proxmox, and Linux installers.
Workflow templates for VCF, Nutanix, Proxmox, and Azure Local.
Workflow definitions as code: version-controlled, shareable, and executable from an API.

The end state is simple to describe and hard to build: one self-hosted system that can take supported infrastructure from powered-off metal to a running platform with logs, state, and repeatability built in.

The opportunity

Every organization running on-premises infrastructure rebuilds environments. Every platform vendor has a bring-up process that assumes someone else handled the hardware. Every lab engineer has spent a weekend reimaging servers that a well-designed system should have handled in an afternoon.

Forge is my attempt to build that system.

The physical layer is the beginning. The workflow engine is next. The platforms are the payload.