The views in this post are my own. Lab details are generalized and sanitized; the technical pattern is the point, not the specific environment.
The interesting part was not that AI helped
The interesting part was that it confidently reached the wrong conclusion, and one piece of human operational context changed the entire investigation.
I recently used Claude Code with full visibility into a spine-leaf lab fabric. Not screenshots. Not pasted snippets. Real access: show commands, shell output, packet captures, and enough context to move across the fabric instead of waiting for me to feed it one command at a time.
The failure involved EVPN BUM replication after replacing Dell OS10 spine switches with SONiC spines running FRR. BGP EVPN sessions were up. Routes were present. But the Dell OS10 leaf switches were not programming the NVE flood list, which meant broadcast, unknown unicast, and multicast traffic stopped crossing VXLAN tunnels between leaves.
The agent worked for almost two hours with very little engagement from me. It jumped across the fabric, checked BGP state, inspected EVPN route tables, looked at NVE state, and captured BGP UPDATEs on the wire.
Then it came back with the wrong conclusion: the Dell switches were the problem and the issue could not be fixed.
That answer did not sit right. I told it one simple thing: this worked when the spines were also Dell OS10.
That changed the investigation. With that one piece of operational history added, the agent spent another hour working through the control-plane behavior and found the real root cause.
The actual failure
The lab had Dell OS10 leaf switches and SONiC/FRR spine switches. After the spine replacement, EVPN Type-3 IMET routes were still being exchanged, but the Dell OS10 leaves did not populate their NVE replication lists.
The visible symptoms were straightforward:
- BGP EVPN sessions were established.
- EVPN Type-3 routes were present in the control plane.
- The NVE flood list stayed empty on the Dell OS10 leaves.
- BUM traffic did not cross VXLAN tunnels between leaf switches.
The important detail was that the routes existed in BGP, but the leaf switches did not install them into the data-plane replication list. That distinction matters. This was not simply "EVPN is broken." It was a control-plane-to-data-plane programming failure triggered by a missing route attribute.
- EVPN control plane: Type-3 IMET route present
- VXLAN data plane: NVE flood list missing
- Root cause: Encapsulation Extended Community not re-advertised
The root cause
Dell OS10 10.5.3.x expects the Encapsulation Extended Community to be present on received EVPN Type-3 IMET routes before it programs the NVE flood list. In this case, the relevant community identifies VXLAN encapsulation (tunnel type 8 in the IANA registry referenced by RFC 9012).
When the spines were Dell OS10, that extended community was preserved transparently. When the spines were SONiC running FRR 10.3, FRR parsed the received Encapsulation Extended Community and showed the route internally with VXLAN encapsulation, but did not include that community when re-advertising the EVPN route to eBGP peers.
Packet captures confirmed the distinction. The outgoing BGP UPDATEs from the SONiC/FRR spines contained the PMSI Tunnel Attribute with the expected VNI, but did not contain the Encapsulation Extended Community that the Dell OS10 leaves required for data-plane programming.
The result was subtle: the Dell leaf switches accepted the EVPN route into BGP, but did not use it to build the VXLAN flood list.
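The check that the capture came down to can be sketched in a few lines of Python. This is a minimal sketch, not the tooling used in the lab: it assumes the raw value of the Extended Communities path attribute has already been extracted from an UPDATE, and the function name and sample bytes are hypothetical. The constants follow RFC 9012, where the Encapsulation Extended Community is a transitive opaque community (type 0x03, sub-type 0x0c) carrying a 2-byte tunnel type, and VXLAN is tunnel type 8.

```python
import struct

ENCAP_TYPE = 0x03       # transitive opaque extended community
ENCAP_SUBTYPE = 0x0C    # Encapsulation Extended Community (RFC 9012)
TUNNEL_TYPE_VXLAN = 8   # IANA tunnel type for VXLAN


def encap_tunnel_types(ext_communities: bytes) -> list:
    """Return the tunnel types of all Encapsulation Extended Communities.

    `ext_communities` is the value of a BGP Extended Communities path
    attribute: a sequence of 8-byte communities.
    """
    if len(ext_communities) % 8:
        raise ValueError("value must be a multiple of 8 bytes")
    found = []
    for offset in range(0, len(ext_communities), 8):
        # Layout: type (1), sub-type (1), reserved (4), tunnel type (2).
        ec_type, ec_subtype, _reserved, tunnel_type = struct.unpack_from(
            "!BBIH", ext_communities, offset
        )
        if ec_type == ENCAP_TYPE and ec_subtype == ENCAP_SUBTYPE:
            found.append(tunnel_type)
    return found


# Illustrative bytes only, not taken from the lab capture:
# a route target 65000:10100, with and without the encapsulation community.
received = bytes.fromhex("0002fde800002774" "030c000000000008")
readvertised = bytes.fromhex("0002fde800002774")

print(TUNNEL_TYPE_VXLAN in encap_tunnel_types(received))
print(TUNNEL_TYPE_VXLAN in encap_tunnel_types(readvertised))
```

With these sample bytes the first check is true and the second is false, which mirrors the difference between what the spine received and what it sent on.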
The secondary issue
A second problem appeared during packet capture and configuration review: graceful shutdown was enabled in FRR on the SONiC spines. That caused FRR to attach the well-known graceful-shutdown community to outgoing eBGP advertisements.
That was not the root cause of the missing flood list, but it was still wrong. Peers that honor the community (65535:0, defined in RFC 8326) treat routes carrying it as least-preferred, so every route advertised by the spines would be deprioritized. Removing that setting was part of the cleanup.
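Spotting this in a capture is a simple membership test. A minimal Python sketch, assuming the raw value of the standard COMMUNITIES attribute is already in hand; the function name and sample values are hypothetical. The constant is the well-known GRACEFUL_SHUTDOWN community 65535:0 from RFC 8326.

```python
import struct

GRACEFUL_SHUTDOWN = 0xFFFF0000  # well-known community 65535:0 (RFC 8326)


def has_graceful_shutdown(communities: bytes) -> bool:
    """Check the value of a BGP COMMUNITIES attribute (4 bytes per community)."""
    if len(communities) % 4:
        raise ValueError("value must be a multiple of 4 bytes")
    count = len(communities) // 4
    return GRACEFUL_SHUTDOWN in struct.unpack(f"!{count}I", communities)


# Illustrative values only: 65000:100 alone, then 65000:100 plus 65535:0.
print(has_graceful_shutdown(bytes.fromhex("fde80064")))
print(has_graceful_shutdown(bytes.fromhex("fde80064" "ffff0000")))
```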
What the diagnostic path looked like
The useful troubleshooting sequence was:
- Confirm EVPN BGP sessions were established.
- Confirm Type-3 IMET routes existed on the spines.
- Confirm the leaf switches received the routes.
- Confirm the NVE replication list stayed empty.
- Capture BGP UPDATEs on the spine-to-leaf path.
- Compare received route attributes with re-advertised attributes.
- Verify whether the missing attribute could be added with policy.
- Remove the unrelated graceful-shutdown misconfiguration.
The packet capture was the turning point. It moved the conversation from "the route is there" to "the route is missing the attribute this receiver requires to program the data plane."
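The comparison step is easy to mechanize once both sides of the spine have been decoded. A minimal Python sketch, assuming each UPDATE has been reduced to a dictionary of attribute name to values; the dictionary shape, the labels such as "ENCAP:8", and the function name are all hypothetical, not the output format of any particular tool.

```python
def dropped_attributes(received: dict, readvertised: dict) -> dict:
    """Return attribute values seen on the received route but absent
    from the re-advertised copy of the same route."""
    dropped = {}
    for name, values in received.items():
        missing = set(values) - set(readvertised.get(name, ()))
        if missing:
            dropped[name] = sorted(missing)
    return dropped


# Illustrative data only, shaped after the failure described in this post.
received = {
    "pmsi_tunnel": ["vni 10100"],
    "ext_communities": ["RT:65000:10100", "ENCAP:8"],
}
readvertised = {
    "pmsi_tunnel": ["vni 10100"],
    "ext_communities": ["RT:65000:10100"],
}
print(dropped_attributes(received, readvertised))
```

For this sample the only difference reported is the encapsulation entry under `ext_communities`; the PMSI Tunnel Attribute matches on both sides.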
The workaround
There was no practical FRR 10.3 configuration knob in this environment to force the Encapsulation Extended Community back into the outgoing EVPN UPDATEs. The Dell OS10 leaves also could not synthesize the required encapsulation community with an inbound route-map.
The working remediation was to remove graceful shutdown on the SONiC spines and add static remote VTEP entries on the Dell OS10 leaves for the affected VNIs.
router bgp 65000
 no bgp graceful-shutdown
clear bgp l2vpn evpn * soft out
On the leaves, each affected virtual network was given the remote VTEP IPs needed for head-end replication:
virtual-network <virtual-network-id>
 vxlan-vni <vni>
  remote-vtep <remote-vtep-ip>
  remote-vtep <remote-vtep-ip>
  remote-vtep <remote-vtep-ip>
After that, the kernel bridge forwarding database showed static flood entries for the VXLAN interfaces, and BUM replication was restored.
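Because the leaf stanza repeats for every affected virtual network, it is a natural candidate for templating. A minimal Python sketch that renders the three commands shown above from an inventory; the function name, the indentation of the output, and the sample IDs and addresses are assumptions for illustration, not values from the lab.

```python
import ipaddress


def render_static_flood_config(virtual_network_id, vni, remote_vteps):
    """Render one virtual-network stanza with static remote VTEPs.

    Duplicates are dropped and addresses are sorted numerically.
    ipaddress.ip_address() also rejects malformed addresses.
    """
    vteps = sorted({ipaddress.ip_address(ip) for ip in remote_vteps})
    lines = [
        f"virtual-network {virtual_network_id}",
        f" vxlan-vni {vni}",
    ]
    lines += [f"  remote-vtep {ip}" for ip in vteps]
    return "\n".join(lines)


# Illustrative inventory only (documentation address range).
print(render_static_flood_config(100, 10100, ["192.0.2.12", "192.0.2.11", "192.0.2.12"]))
```

Generating the stanza from a single inventory does not remove the static nature of the workaround, but it keeps every leaf consistent when the VTEP list changes.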
The tradeoff
The workaround is static. It bypasses the dynamic EVPN control-plane learning path for BUM replication and relies on manually provisioned remote VTEPs.
That has operational consequences:
- New leaf pairs require manual remote-VTEP updates.
- Removed VTEPs can leave stale replication entries behind.
- New or removed VNIs require configuration updates.
- Anycast VTEP behavior must be understood and documented.
It is a valid lab workaround, but not the end state I would want for a production fabric. The long-term answer is a control-plane fix: a SONiC/FRR version that re-advertises the required extended community correctly, a source patch, or a different BGP implementation in the spine role.
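Until that fix exists, the stale-entry and missing-entry risks listed above can at least be detected. A minimal Python sketch of a drift check for one VNI on one leaf, assuming the configured and intended VTEP lists have been collected by some other means; the function name and sample addresses are hypothetical.

```python
def vtep_drift(configured, intended):
    """Compare configured static remote VTEPs against the intended set."""
    configured, intended = set(configured), set(intended)
    return {
        "stale": sorted(configured - intended),    # still configured, no longer wanted
        "missing": sorted(intended - configured),  # wanted, not yet configured
    }


# Illustrative data only: one leaf was removed and one was added.
print(vtep_drift(
    configured=["192.0.2.11", "192.0.2.12"],
    intended=["192.0.2.12", "192.0.2.13"],
))
```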
What I learned about using AI this way
The lesson was not that AI magically solved the problem. It did not. It spent a long time exploring and still made a bad call.
The useful part was that it could hold a large operational context, move through the fabric, collect data, and keep grinding through packet-level evidence. The missing part was architectural memory. It did not know that the fabric had worked with Dell OS10 spines until I told it.
One sentence from the operator changed the investigation:
This worked when the spines were OS10.
That is the real pattern. AI can accelerate troubleshooting, but it still needs a human who understands the system well enough to challenge a conclusion, add a missing constraint, and recognize when a technically plausible answer does not fit the history of the environment.
The value was not replacement. It was compression. It compressed the grind of data collection and packet inspection, while the architectural judgment still had to come from someone who knew what the fabric was supposed to be doing.
References
- RFC 7432: BGP MPLS-Based Ethernet VPN
- RFC 9012: The BGP Tunnel Encapsulation Attribute
- FRRouting
- SONiC buildimage