The views in this post are my own. Lab details are generalized and sanitized; hostnames, addresses, credentials, and object identifiers have been removed or replaced with role-based descriptions.
Everything looked fine until it had to work
There is a special kind of frustration that comes from a failure where every component looks healthy until the exact moment traffic needs to move through the whole system.
For five days, every attempt to run Contour in a vSphere Supervisor environment ended the same way. The pod would start, wait exactly ten seconds, and exit with code 1. No drift. No partial success. Just ten seconds, every time.
This was not a prompt-and-answer session. It was an agentic troubleshooting loop: Claude Code moving through evidence, me adding constraints and correcting assumptions, and the investigation tightening one layer at a time.
The first symptom looked like a Kubernetes service endpoint or NSX load-balancer problem. The actual issue was a silent MTU drop three layers below that, in the physical routing path that Geneve overlay packets traverse between an ESXi host and an NSX Edge VM.
This was another joint effort between me and Claude Code. The value was not that an agent magically knew the answer. It was that the agent could keep collecting evidence across layers while I kept adding the operational context it could not infer from commands alone.
The environment
Contour was being deployed as a vSphere Supervisor Service on a new VCF cluster. The important detail is that Contour ran as a CRX pod, which is a vSphere Pod backed by a lightweight VM on ESXi. It was not running as a normal container inside the Supervisor control-plane VM.
That distinction ended up being the whole story. Other Supervisor services ran close to the API server path and did not exercise the same overlay and load-balancer chain. Contour did.
| Component | Detail |
|---|---|
| Service IP | Kubernetes ClusterIP for the API server |
| VIP path | Distributed Load Balancer to an NSX Edge load-balancer VIP |
| Backend | Supervisor kube-apiserver on an NSX overlay segment |
In simplified form, the flow looked like this:
Contour CRX pod
-> Kubernetes service IP
-> NSX Distributed Load Balancer in the ESXi datapath
-> NSX Edge load-balancer VIP
-> Supervisor kube-apiserver backend on an overlay segment
That path matters because the return traffic from the Supervisor VM to the NSX Edge is carried over Geneve. If the underlay path cannot pass the encapsulated frame size, the overlay can look alive while still dropping the first full-size payload that matters.
The crash
The Contour log was short and brutal:
unable to initialize Server dependencies required to start Contour
failed to determine if *v1.Secret is namespaced
failed to get server groups
Get "https://<kubernetes-service-ip>:443/api": net/http: TLS handshake timeout
TLS handshake timeout. Go's default HTTP transport sets a ten-second TLS handshake timeout. That explained the clock-like precision of the crash, but it did not explain where the packet was disappearing.
Direct access to the Supervisor API server through its management address worked. Certificates were valid. Responses were fast. RBAC, Contour CRDs, the Contour certificate secret, and Kubernetes environment injection all checked out.
The broken endpoint was the API path through the NSX VIP.
Where the agent helped
Claude Code was useful because it could stay in the problem. It could pull logs, compare endpoints, inspect pod environment, read packet captures, summarize what had already been ruled out, and keep following the thread without getting bored or skipping the tedious checks.
That matters in infrastructure troubleshooting. The work is often not one brilliant command. It is accumulation: small facts, failed theories, packet sizes, path differences, and the discipline to keep asking whether the current explanation fits all of the evidence.
But the agent did not have the lived model of the environment. It could observe symptoms, but it still needed human context to decide which symptoms were meaningful and which tests were exercising the wrong path.
The deceptive symptom
Testing the VIP produced the kind of result that sends you down the wrong path if you trust only surface checks:
| Test | Result |
|---|---|
| TCP connect | Succeeded quickly |
| Plain HTTP to the HTTPS port | Returned 400 Bad Request |
| TLS handshake | Hung for about ten seconds |
Plain HTTP working was the trap. The kube-apiserver can reject a cleartext HTTP request on an HTTPS port with a small 400 response. That response does not require a large TLS record to survive the path.
TLS is different. The server eventually has to send its certificate, and that certificate record is much larger than the tiny SYNs, ClientHello fragments, keepalives, and error responses that were making the environment look healthy.
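The gap between "TCP connects" and "TLS completes" can be measured directly. Here is a minimal sketch of that kind of probe, assuming a Python interpreter with network access to the endpoint. The `probe` and `classify` names are mine, not from any tool used in the investigation, and certificate verification is disabled only because the point is timing the handshake, not validating the server.

```python
import socket
import ssl
import time


def probe(host, port=443, timeout=10.0):
    """Return (tcp_ok, tls_ok, seconds spent in the TLS handshake attempt)."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False, False, 0.0

    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # sketch only: we time the handshake, we do not trust it

    start = time.monotonic()
    try:
        # wrap_socket performs the handshake, which is where the large
        # certificate record has to survive the path.
        with ctx.wrap_socket(sock, server_hostname=host):
            pass
        return True, True, time.monotonic() - start
    except OSError:  # covers timeouts and ssl.SSLError
        return True, False, time.monotonic() - start
    finally:
        sock.close()


def classify(tcp_ok, tls_ok):
    """Turn the two booleans into a rough first-pass diagnosis."""
    if not tcp_ok:
        return "unreachable"
    if not tls_ok:
        return "tcp ok, tls stalled: suspect large-packet loss on the path"
    return "healthy"
```

A result of `(True, False, ~timeout)` is the signature described here: small packets pass, and the handshake stalls until the client gives up.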
The first reasonable theory was wrong
At first, the failure looked like an NSX load-balancer backend issue. That was a reasonable hypothesis. The VIP accepted TCP, but TLS never completed. The direct API endpoint worked. The service endpoint pointed through the NSX path. It smelled like a broken backend pool or a load-balancer profile problem.
We tried a workaround by pointing Contour at the direct API endpoint. The pods still crashed, but differently: different exit code, different timing, and enough change to prove the original failure path had been partially bypassed.
More importantly, packet capture during the original VIP path showed the first real clue. The Supervisor VM sent a large TCP segment containing the TLS certificate record. The NSX Edge never ACKed it. The Supervisor retransmitted the same segment over and over until the client hit the ten-second TLS timeout.
The packet that gave it away
The key observation was not "TLS failed." It was this:
- The backend TCP connection established.
- Small TLS records moved successfully.
- The large certificate segment was retransmitted repeatedly.
- The Edge side never ACKed that segment.
That shifted the investigation away from certificates, RBAC, CRDs, and basic load-balancer reachability. The system could pass small packets. It could not pass the first large payload in the TLS handshake.
| Stage | What happened |
|---|---|
| TCP handshake | Small packets passed |
| TLS setup | Small records passed |
| Certificate | Large Geneve-encapsulated segment dropped silently |
| Application result | Contour hit Go's ten-second TLS handshake timeout |
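Spotting that retransmission pattern by eye in a long capture is tedious. A minimal sketch of how it could be flagged, assuming the capture has already been exported to `(sender, sequence_number, payload_length)` tuples. That export format and the `repeated_segments` name are hypothetical, not the output of any specific capture tool.

```python
from collections import Counter


def repeated_segments(segments, min_repeats=3):
    """Return TCP segments that were sent at least min_repeats times.

    segments: iterable of (sender, seq, payload_len) tuples.
    Pure ACKs (payload_len == 0) are ignored, since repeated ACKs
    are normal and are not evidence of a dropped data segment.
    """
    counts = Counter(
        (sender, seq, length)
        for sender, seq, length in segments
        if length > 0
    )
    return {key: n for key, n in counts.items() if n >= min_repeats}
```

A large segment that shows up several times from the backend, with no matching progress from the other side, is the pattern that moved this investigation toward packet size.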
The physical network looked clean
The obvious suspect for large-packet loss is MTU. We started with the physical switching layer. Host-facing ports were configured for jumbo frames. Uplinks were configured for jumbo frames. The relevant VLANs were trunked where expected. The distributed switch and vmnics were also set for jumbo frames.
That ruled out the easy version of the MTU problem. It did not rule out every L3 hop.
This is where the trap was. A configuration check can tell you what an interface is supposed to support. It does not always prove what the routed path will actually pass, especially when firewall parent interfaces, VLAN sub-interfaces, and inherited MTU behavior are all involved.
Where the agent needed human context
The path was awkward to test. Management-network tests did not always exercise the same flow that a real Contour CRX pod used. Some tests could prove the API server was healthy without proving that the CRX pod to DLB to Edge VIP to overlay backend path worked.
Packet capture was also less direct than I wanted. Standard captures could see symptoms and sizes, but Geneve encapsulation makes it hard to inspect the useful inner payload without the right datapath-level access. NSX trace tooling also could not synthesize the exact payload size that was failing.
BFD was another red herring. The tunnel health signals stayed up because BFD packets are tiny. A healthy BFD session tells you the tunnel can pass keepalives. It does not prove the underlay can carry full-size encapsulated application payloads.
This is the part of agentic operations that interests me. The agent can be excellent at running the maze, but the human still has to know when the maze is drawn wrong. In this case, "the API works from management" was true and still not enough. It was not the same path Contour used.
The wrong big hammer
At one point, the evidence seemed to point at stale NSX Edge state. Both Edge VMs were redeployed from scratch. New transport state, fresh Edge VMs, reconverged control plane, and Edge interfaces with 9000-byte MTU.
TLS still failed.
That was useful, even though it was not satisfying. It eliminated Edge state corruption and forced the investigation back into the underlay path between the host TEP network and the Edge TEP network.
The MTU wall
The breakthrough came from testing the host TEP to Edge TEP path directly with the don't-fragment bit set.
This is where the human/agent loop mattered. Claude Code was useful because it could hold the packet evidence, NSX behavior, pod restart timing, and previous dead ends in one working context. But the next useful question was architectural: who actually routes between the host TEP and Edge TEP networks?
| Payload | Outer IP frame | Result |
|---|---|---|
| 1472 bytes | 1500 bytes | Passed |
| 1473 bytes | 1501 bytes | Dropped |
That is as clean as a signal gets. The wall was exactly 1500 bytes. The physical switches were already jumbo-capable, so the constraint had to be at the Layer 3 routing boundary between the host TEP subnet and the Edge TEP subnet.
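Finding that wall does not have to be trial and error. A minimal sketch of a binary search over don't-fragment probe sizes, assuming a caller-supplied `probe(payload_bytes)` function that sends one don't-fragment ping with the given payload and returns `True` if a reply comes back. The `probe` callable is hypothetical and would wrap whatever ping tool is available on the source host. The 28-byte constant is the IPv4 header (20) plus the ICMP header (8), which is why a 1472-byte payload corresponds to a 1500-byte packet.

```python
ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP echo header


def find_path_mtu(probe, low=576, high=9000):
    """Return the largest IP packet size (bytes) that crosses the path.

    probe: hypothetical callable taking an ICMP payload size and
           returning True if a don't-fragment ping of that size succeeds.
    Returns None if even the smallest size fails.
    """
    if not probe(low - ICMP_OVERHEAD):
        return None
    # Invariant: a packet of size `low` passes; anything above `high` does not.
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid - ICMP_OVERHEAD):
            low = mid
        else:
            high = mid - 1
    return low
```

With a path that behaves like the one in this post, `find_path_mtu(lambda payload: payload + 28 <= 1500)` returns `1500`.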
The router in that path was a firewall acting as the gateway for both TEP networks. The important finding was not just a line in the interface configuration. The effective path MTU across that routed firewall boundary was 1500 bytes, even though other parts of the fabric and overlay stack appeared to be configured for jumbo frames.
That distinction matters. If you stop at "show me the MTU on the interface," it is easy to convince yourself the network is fine. The don't-fragment test proved the path was not fine.
The math
The TLS certificate segment itself was not exotic. It was a normal large TCP payload. The problem was what happened after NSX wrapped it in Geneve.
inner TCP data payload        1448 bytes
inner IP/TCP headers            40 bytes
inner Ethernet header           14 bytes
Geneve header                    8 bytes
outer UDP header                 8 bytes
outer IP header                 20 bytes
----------------------------------------
outer IP frame               ~1538 bytes
A 1538-byte outer frame will not cross a 1500-byte routed interface when fragmentation is not available or not acceptable for that flow. The small packets kept working because they fit. The first full TLS certificate record did not.
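The same arithmetic as a small calculator, for checking other payload sizes against a given path MTU. This is a sketch under stated assumptions: it counts the 14-byte inner Ethernet header that Geneve carries along with the inner IP packet, uses the 8-byte Geneve base header, and leaves Geneve options as a parameter defaulting to zero. Real deployments may add options or TCP options, so treat the result as a lower bound. The function names are mine.

```python
INNER_ETHERNET = 14  # inner Ethernet header carried inside the tunnel
GENEVE_BASE = 8      # Geneve base header, without options
OUTER_UDP = 8
OUTER_IPV4 = 20


def outer_ip_size(tcp_payload, inner_ip_tcp_headers=40, geneve_options=0):
    """Size in bytes of the outer IPv4 packet after Geneve encapsulation."""
    return (
        tcp_payload
        + inner_ip_tcp_headers
        + INNER_ETHERNET
        + GENEVE_BASE
        + geneve_options
        + OUTER_UDP
        + OUTER_IPV4
    )


def fits(tcp_payload, path_mtu, **kwargs):
    """True if the encapsulated packet crosses a hop with the given MTU unfragmented."""
    return outer_ip_size(tcp_payload, **kwargs) <= path_mtu
```

`outer_ip_size(1448)` gives 1538, so `fits(1448, 1500)` is `False` while `fits(1448, 9000)` is `True`. Under these assumptions, the largest TCP payload that survives a 1500-byte hop is 1410 bytes.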
That explained everything: TCP connected, plain HTTP returned a small 400, BFD stayed up, and TLS died at ten seconds.
The fix
The fix was to make jumbo behavior explicit on the firewall path: enable jumbo frames and set the fabric-facing parent plus the relevant TEP-facing VLAN interfaces to a jumbo MTU. That removed any ambiguity around parent/sub-interface inheritance and made the effective routed path match what the overlay required.
config system global
    set jumbo-frame enable
end
config system interface
    edit "<fabric-aggregate>"
        set mtu-override enable
        set mtu 9000
    next
    edit "<edge-tep-vlan-interface>"
        set mtu-override enable
        set mtu 9000
    next
    edit "<host-tep-vlan-interface>"
        set mtu-override enable
        set mtu 9000
    next
end
After the change, larger don't-fragment tests passed across the TEP path. TLS to the NSX VIP completed. Contour pods came up Running and Ready on the next restart cycle.
What we ruled out
This was not a short investigation. Along the way, we ruled out:
- Contour RBAC and service-account permissions.
- Contour CRDs and certificate secret validity.
- Kubernetes service environment injection.
- Supervisor API server health on the direct management path.
- Distributed switch and ESXi vmnic MTU.
- Physical switch MTU and VLAN trunking.
- NSX distributed firewall policy.
- NSX VDR software drops.
- NSX Edge VM interface MTU.
- NSX Edge state corruption.
- MSS clamping as a practical fix in that NSX version.
That list matters because the first three or four answers were reasonable. They were also wrong.
The lesson: agentic AI still needs architecture judgment
The technical lesson is about MTU. The more interesting lesson is about how agentic AI fits into real troubleshooting.
The MTU lesson is not simply "remember to enable jumbo frames." Everyone knows NSX overlay traffic needs headroom. The sharper lesson is that visible MTU configuration is not the same thing as verified path MTU. Parent interfaces, VLAN sub-interfaces, inherited settings, and firewall routing behavior can make the config look less suspicious than the packet path really is.
NSX overlay networks need jumbo frames on every underlay hop that carries encapsulated traffic, including routed firewall interfaces between TEP subnets. It is not enough for the ESXi vmnics, distributed switch, Edge interfaces, and physical switch ports to be correct. The L3 boundary has to be correct too.
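One way to keep configured and verified MTU separate is to record both per hop and flag any disagreement. A minimal sketch, assuming each hop has a configured value and a measured value taken from a don't-fragment test across that hop. The `Hop` type, the `audit` function, and the hop names in the example are hypothetical, and the required MTU is a parameter because the right threshold depends on the overlay's encapsulation overhead.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    name: str
    configured_mtu: int  # what the interface configuration claims
    measured_mtu: int    # largest don't-fragment packet that actually crossed


def audit(hops, required_mtu):
    """Return one finding per hop that cannot be trusted for overlay traffic."""
    findings = []
    for hop in hops:
        if hop.measured_mtu < required_mtu:
            findings.append(
                f"{hop.name}: measured {hop.measured_mtu} < required {required_mtu}"
            )
        elif hop.measured_mtu < hop.configured_mtu:
            findings.append(
                f"{hop.name}: measured {hop.measured_mtu} < configured {hop.configured_mtu}"
            )
    return findings
```

For example, `audit([Hop("host-uplink", 9000, 9000), Hop("tep-gateway", 9000, 1500)], 1600)` flags only the gateway, which is the shape of the problem in this post: the configuration looked jumbo-capable, and the measured path was not. The 1600 threshold in that call is an illustrative value, not a figure from this investigation.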
The failure mode is especially nasty because small packets continue to pass. TCP handshakes work. BFD stays up. Some HTTP tests return responses. The tunnel looks alive. Then the first large TLS record gets Geneve-wrapped, crosses the 1500-byte boundary, and disappears.
The misconfiguration had been sitting there quietly. Contour did not create it. Contour found it, because it was the first workload in that environment to exercise the full CRX pod to NSX load-balancer to overlay backend path with a payload large enough to matter.
Claude Code helped compress the investigation. It could test, compare, capture, summarize, and keep pressure on the problem for hours. But it still needed a human operator to recognize when a conclusion did not fit the architecture, ask better questions, and connect a TLS timeout to an underlay routing boundary that was not obvious from the first symptom.