Chasing a 10-Second Timeout: How an MTU Misconfiguration Broke Contour for Five Days

The views in this post are my own. Lab details are generalized and sanitized; hostnames, addresses, credentials, and object identifiers have been removed or replaced with role-based descriptions.

Everything looked fine until it had to work

There is a special kind of frustration that comes from a failure where every component looks healthy until the exact moment traffic needs to move through the whole system.

For five days, every attempt to run Contour in a vSphere Supervisor environment ended the same way. The pod would start, wait exactly ten seconds, and exit with code 1. No drift. No partial success. Just ten seconds, every time.

This was not a prompt-and-answer session. It was an agentic troubleshooting loop: Claude Code moving through evidence, me adding constraints and correcting assumptions, and the investigation tightening one layer at a time.

The first symptom looked like a Kubernetes service endpoint or NSX load-balancer problem. The actual issue was a silent MTU drop three layers below that, in the physical routing path that Geneve overlay packets traverse between an ESXi host and an NSX Edge VM.

This was another joint effort between me and Claude Code. The value was not that an agent magically knew the answer. It was that the agent could keep collecting evidence across layers while I kept adding the operational context it could not infer from commands alone.

The environment

Contour was being deployed as a vSphere Supervisor Service on a new VCF cluster. The important detail is that Contour ran as a CRX pod, which is a vSphere Pod backed by a lightweight VM on ESXi. It was not running as a normal container inside the Supervisor control-plane VM.

That distinction ended up being the whole story. Other Supervisor services ran close to the API server path and did not exercise the same overlay and load-balancer chain. Contour did.

The traffic path that mattered

Service IP: Kubernetes ClusterIP for the API server
VIP path: Distributed Load Balancer to an NSX Edge load-balancer VIP
Backend: Supervisor kube-apiserver on an NSX overlay segment

In simplified form, the flow looked like this:

Contour CRX pod
  -> Kubernetes service IP
  -> NSX Distributed Load Balancer in the ESXi datapath
  -> NSX Edge load-balancer VIP
  -> Supervisor kube-apiserver backend on an overlay segment

That path matters because the return traffic from the Supervisor VM to the NSX Edge is carried over Geneve. If the underlay path cannot pass the encapsulated frame size, the overlay can look alive while still dropping the first full-size payload that matters.

The crash

The Contour log was short and brutal:

unable to initialize Server dependencies required to start Contour
failed to determine if *v1.Secret is namespaced
failed to get server groups
Get "https://<kubernetes-service-ip>:443/api": net/http: TLS handshake timeout

TLS handshake timeout. Go's default HTTP transport (DefaultTransport in net/http) sets TLSHandshakeTimeout to ten seconds. That explained the clock-like precision of the crash, but it did not explain where the packet was disappearing.

Direct access to the Supervisor API server through its management address worked. Certificates were valid. Responses were fast. RBAC, Contour CRDs, the Contour certificate secret, and Kubernetes environment injection all checked out.

The broken endpoint was the API path through the NSX VIP.

Where the agent helped

Claude Code was useful because it could stay in the problem. It could pull logs, compare endpoints, inspect pod environment, read packet captures, summarize what had already been ruled out, and keep following the thread without getting bored or skipping the tedious checks.

That matters in infrastructure troubleshooting. The work is often not one brilliant command. It is accumulation: small facts, failed theories, packet sizes, path differences, and the discipline to keep asking whether the current explanation fits all of the evidence.

But the agent did not have the lived model of the environment. It could observe symptoms, but it still needed human context to decide which symptoms were meaningful and which tests were exercising the wrong path.

The deceptive symptom

Testing the VIP produced the kind of result that sends you down the wrong path if you trust only surface checks:

Test	Result
TCP connect	Succeeded quickly
Plain HTTP to the HTTPS port	Returned 400 Bad Request
TLS handshake	Hung for about ten seconds

Plain HTTP working was the trap. The kube-apiserver can reject a cleartext HTTP request on an HTTPS port with a small 400 response. That response does not require a large TLS record to survive the path.

TLS is different. The server eventually has to send its certificate, and that certificate record is much larger than the small SYNs, ClientHello messages, keepalives, and error responses that were making the environment look healthy.
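The difference between the two tests can be made explicit with a small probe that times the TCP connect and the TLS handshake separately. This is a minimal sketch, not the tooling used in the investigation: the function names are hypothetical, the ten-second timeout simply mirrors the Go default, and certificate verification is switched off because the only question is whether the handshake bytes survive the path.

```python
import socket
import ssl
import time


def classify(tcp_ok: bool, tls_ok: bool) -> str:
    """Turn two probe results into a rough hint about where to look next."""
    if not tcp_ok:
        return "no TCP connectivity: check routing, firewall rules, or the VIP"
    if not tls_ok:
        return "TCP works but TLS stalls: suspect large-packet loss (path MTU)"
    return "TCP and TLS both complete: the path carries full-size payloads"


def probe(host: str, port: int = 443, timeout: float = 10.0):
    """Return (tcp_ok, tls_ok, seconds_spent) for one connection attempt."""
    start = time.monotonic()
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False, False, time.monotonic() - start

    # Verification is disabled on purpose: this only checks whether the
    # handshake completes, not whether the peer should be trusted.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        # wrap_socket performs the handshake and inherits the socket timeout.
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            tls_ok = tls.version() is not None
    except (ssl.SSLError, OSError):
        raw.close()
        tls_ok = False
    return True, tls_ok, time.monotonic() - start
```

Running probe() against the VIP and against the direct management address, then passing the first two results to classify(), would reproduce the split described above: a fast TCP connect on both, with the handshake stalling only on the VIP path.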

The first reasonable theory was wrong

At first, the failure looked like an NSX load-balancer backend issue. That was a reasonable hypothesis. The VIP accepted TCP, but TLS never completed. The direct API endpoint worked. The service endpoint pointed through the NSX path. It smelled like a broken backend pool or a load-balancer profile problem.

We tried a workaround by pointing Contour at the direct API endpoint. The pods still crashed, but differently: different exit code, different timing, and enough change to prove the original failure path had been partially bypassed.

More importantly, a packet capture taken while traffic used the original VIP path showed the first real clue. The Supervisor VM sent a large TCP segment containing the TLS certificate record. The NSX Edge never ACKed it. The Supervisor retransmitted the same segment over and over until the client hit the ten-second TLS timeout.

The packet that gave it away

The key observation was not "TLS failed." It was this:

The backend TCP connection established.
Small TLS records moved successfully.
The large certificate segment was retransmitted repeatedly.
The Edge side never ACKed that segment.

That shifted the investigation away from certificates, RBAC, CRDs, and basic load-balancer reachability. The system could pass small packets. It could not pass the first large payload in the TLS handshake.
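The pattern in that capture can be reduced to a simple question: what is the largest segment the peer acknowledged, and what is the smallest one it never did? The sketch below assumes the capture has already been exported as (sequence number, payload length) pairs for one direction of the flow. The Segment type, the summarize function, and the sample values are illustrative only and are not taken from the actual capture.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    seq: int     # TCP sequence number of the first payload byte
    length: int  # TCP payload length in bytes


def summarize(segments: Iterable[Segment], acked_up_to: int) -> dict:
    """Summarize one direction of a TCP flow as seen in a capture.

    acked_up_to is the highest ACK number the peer ever sent back.
    """
    sends = Counter(segments)  # how many times each segment appeared
    delivered = [s for s in sends if s.seq + s.length <= acked_up_to]
    stuck = [s for s in sends if s.seq + s.length > acked_up_to]
    return {
        "largest_delivered": max((s.length for s in delivered), default=0),
        "smallest_stuck": min((s.length for s in stuck), default=0),
        "retransmits_of_stuck": sum(sends[s] - 1 for s in stuck),
    }


# Illustrative values: one small record that was ACKed, then a full-size
# segment sent five times without ever being acknowledged.
example = [Segment(1, 90)] + [Segment(91, 1448)] * 5
print(summarize(example, acked_up_to=91))
```

When the largest delivered segment is small and the smallest stuck segment is close to a full MSS, the size boundary between them is a stronger lead than anything in the TLS layer itself.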

The failure signature

TCP handshake: Small packets passed
TLS setup: Small records passed
Certificate: Large Geneve-encapsulated segment dropped silently
Application result: Contour hit Go's ten-second TLS handshake timeout

The physical network looked clean

The obvious suspect for large-packet loss is MTU. We started with the physical switching layer. Host-facing ports were configured for jumbo frames. Uplinks were configured for jumbo frames. The relevant VLANs were trunked where expected. The distributed switch and vmnics were also set for jumbo frames.

That ruled out the easy version of the MTU problem. It did not rule out every L3 hop.

This is where the trap was. A configuration check can tell you what an interface is supposed to support. It does not always prove what the routed path will actually pass, especially when firewall parent interfaces, VLAN sub-interfaces, and inherited MTU behavior are all involved.

Where the agent needed human context

The path was awkward to test. Management-network tests did not always exercise the same flow that a real Contour CRX pod used. Some tests could prove the API server was healthy without proving that the path from the CRX pod through the DLB and the Edge VIP to the overlay backend worked.

Packet capture was also less direct than I wanted. Standard captures could see symptoms and sizes, but Geneve encapsulation makes it hard to inspect the useful inner payload without the right datapath-level access. NSX trace tooling also could not synthesize the exact payload size that was failing.

BFD was another red herring. The tunnel health signals stayed up because BFD packets are tiny. A healthy BFD session tells you the tunnel can pass keepalives. It does not prove the underlay can carry full-size encapsulated application payloads.

This is the part of agentic operations that interests me. The agent can be excellent at running the maze, but the human still has to know when the maze is drawn wrong. In this case, "the API works from management" was true and still not enough. It was not the same path Contour used.

The wrong big hammer

At one point, the evidence seemed to point at stale NSX Edge state. Both Edge VMs were redeployed from scratch. New transport state, fresh Edge VMs, reconverged control plane, and Edge interfaces with 9000-byte MTU.

TLS still failed.

That was useful, even though it was not satisfying. It eliminated Edge state corruption and forced the investigation back into the underlay path between the host TEP network and the Edge TEP network.

The MTU wall

The breakthrough came from testing the host TEP to Edge TEP path directly with the don't-fragment bit set.

This is where the human/agent loop mattered. Claude Code was useful because it could hold the packet evidence, NSX behavior, pod restart timing, and previous dead ends in one working context. But the next useful question was architectural: who actually routes between the host TEP and Edge TEP networks?

ICMP payload	IP packet size	Result
1472 bytes	1500 bytes	Passed
1473 bytes	1501 bytes	Dropped

That is as clean as a signal gets. The wall was exactly 1500 bytes. The physical switches were already jumbo-capable, so the constraint had to be at the Layer 3 routing boundary between the host TEP subnet and the Edge TEP subnet.

The router in that path was a firewall acting as the gateway for both TEP networks. The important finding was not just a line in the interface configuration. The effective path MTU across that routed firewall boundary was 1500 bytes, even though other parts of the fabric and overlay stack appeared to be configured for jumbo frames.

That distinction matters. If you stop at "show me the MTU on the interface," it is easy to convince yourself the network is fine. The don't-fragment test proved the path was not fine.
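A don't-fragment sweep like this is easy to automate. The sketch below binary-searches the largest ICMP payload that gets through and adds the 28 bytes of IPv4 and ICMP headers to report the effective path MTU. It assumes a Linux host with iputils ping (the -M do option sets don't-fragment); on an ESXi host the equivalent probe would normally be vmkping with -d and -s from the TEP vmkernel interface, which is not shown here. The function names are hypothetical.

```python
import subprocess
from typing import Callable

ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def linux_df_ping(target: str) -> Callable[[int], bool]:
    """Build a probe that sends one don't-fragment ping of a given payload size."""
    def probe(payload: int) -> bool:
        cmd = ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), target]
        return subprocess.run(cmd, capture_output=True).returncode == 0
    return probe


def largest_passing_payload(probe: Callable[[int], bool],
                            low: int = 0, high: int = 8972) -> int:
    """Binary-search the largest payload for which probe() succeeds.

    Assumes every size up to the limit passes and every size above it fails.
    Returns -1 if even the smallest payload is dropped.
    """
    if not probe(low):
        return -1
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def path_mtu(probe: Callable[[int], bool]) -> int:
    """Effective IPv4 path MTU implied by the largest passing payload."""
    payload = largest_passing_payload(probe)
    return -1 if payload < 0 else payload + ICMP_OVERHEAD
```

Calling path_mtu(linux_df_ping("<edge-tep-address>")) from a machine on the host TEP network would return 1500 for a path like the one described here, whatever the interface configuration claims. A single lost ping can skew a binary search, so a real run should retry each size a few times.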

The math

The TLS certificate segment itself was not exotic. It was a normal large TCP payload. The problem was what happened after NSX wrapped it in Geneve.

inner TCP data payload        1448 bytes
inner IP/TCP headers            40 bytes
inner Ethernet header           14 bytes
Geneve header                    8 bytes
outer UDP header                 8 bytes
outer IP header                 20 bytes
----------------------------------------
outer IP packet              ~1538 bytes

A 1538-byte outer frame will not cross a 1500-byte routed interface when fragmentation is not available or not acceptable for that flow. The small packets kept working because they fit. The first full TLS certificate record did not.

That explained everything: TCP connected, plain HTTP returned a small 400, BFD stayed up, and TLS died at ten seconds.
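The same arithmetic can be written as a small helper for checking any payload against any path MTU. This is a sketch under stated assumptions: the total of roughly 1538 bytes includes the 14-byte inner Ethernet header that Geneve carries along with the inner IP packet, the Geneve header is counted at its 8-byte base size with options as a separate parameter, and the outer header is IPv4. TCP options on the inner segment would add to the 40-byte header figure. The constant and function names are illustrative.

```python
INNER_ETHERNET = 14  # inner Ethernet header carried inside the Geneve tunnel
GENEVE_BASE = 8      # Geneve header without options
OUTER_UDP = 8
OUTER_IPV4 = 20


def outer_ip_size(tcp_payload: int, inner_ip_tcp_headers: int = 40,
                  geneve_options: int = 0) -> int:
    """Size of the outer IP packet after Geneve encapsulation, in bytes."""
    inner_ip_packet = tcp_payload + inner_ip_tcp_headers
    return (inner_ip_packet + INNER_ETHERNET + GENEVE_BASE
            + geneve_options + OUTER_UDP + OUTER_IPV4)


def fits(tcp_payload: int, path_mtu: int, **kwargs: int) -> bool:
    """True if the encapsulated packet fits the path without fragmentation."""
    return outer_ip_size(tcp_payload, **kwargs) <= path_mtu


print(outer_ip_size(1448))  # 1538
print(fits(1448, 1500))     # False: the certificate segment is dropped
print(fits(1448, 9000))     # True: the same segment passes on a jumbo path
```

With these assumptions, the largest TCP payload that still fits a 1500-byte routed hop is 1410 bytes, which is why handshakes, keepalives, and small error responses passed while the first full-size segment did not.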

The fix

The fix was to make jumbo behavior explicit on the firewall path: enable jumbo frames and set the fabric-facing parent plus the relevant TEP-facing VLAN interfaces to a jumbo MTU. That removed any ambiguity around parent/sub-interface inheritance and made the effective routed path match what the overlay required.

config system global
    set jumbo-frame enable
end

config system interface
    edit "<fabric-aggregate>"
        set mtu-override enable
        set mtu 9000
    next
    edit "<edge-tep-vlan-interface>"
        set mtu-override enable
        set mtu 9000
    next
    edit "<host-tep-vlan-interface>"
        set mtu-override enable
        set mtu 9000
    next
end

After the change, larger don't-fragment tests passed across the TEP path. TLS to the NSX VIP completed. Contour pods came up Running and Ready on the next restart cycle.

What we ruled out

This was not a short investigation. Along the way, we ruled out:

Contour RBAC and service-account permissions.
Contour CRDs and certificate secret validity.
Kubernetes service environment injection.
Supervisor API server health on the direct management path.
Distributed switch and ESXi vmnic MTU.
Physical switch MTU and VLAN trunking.
NSX distributed firewall policy.
NSX VDR software drops.
NSX Edge VM interface MTU.
NSX Edge state corruption.
MSS clamping as a practical fix in that NSX version.

That list matters because the first three or four answers were reasonable. They were also wrong.

The lesson: agentic AI still needs architecture judgment

The technical lesson is about MTU. The more interesting lesson is about how agentic AI fits into real troubleshooting.

The MTU lesson is not simply "remember to enable jumbo frames." Everyone knows NSX overlay traffic needs headroom. The sharper lesson is that visible MTU configuration is not the same thing as verified path MTU. Parent interfaces, VLAN sub-interfaces, inherited settings, and firewall routing behavior can make the configuration look healthier than the packet path really is.

NSX overlay networks need jumbo frames on every underlay hop that carries encapsulated traffic, including routed firewall interfaces between TEP subnets. It is not enough for the ESXi vmnics, distributed switch, Edge interfaces, and physical switch ports to be correct. The L3 boundary has to be correct too.

The failure mode is especially nasty because small packets continue to pass. TCP handshakes work. BFD stays up. Some HTTP tests return responses. The tunnel looks alive. Then the first large TLS record gets Geneve-wrapped, crosses the 1500-byte boundary, and disappears.

The misconfiguration had been sitting there quietly. Contour did not create it. Contour found it, because it was the first workload in that environment to exercise the full CRX pod to NSX load-balancer to overlay backend path with a payload large enough to matter.

Claude Code helped compress the investigation. It could test, compare, capture, summarize, and keep pressure on the problem for hours. But it still needed a human operator to recognize when a conclusion did not fit the architecture, ask better questions, and connect a TLS timeout to an underlay routing boundary that was not obvious from the first symptom.