Complexity as a Single Point of Failure

A network can run flawlessly for months, seemingly validating every design decision you made. Sometimes though, all it takes is one packet interacting with an implementation quirk to expose the setup as the house of cards it really is.

The setup

I needed to extend my home network to a barn about 300 feet away. To do this, I installed an EnStation5-AC at the barn, configured it as a client bridge, and pointed it at my house.

My home network is segmented into multiple VLANs. Since I needed at least 2 of these VLANs at the barn, and most Wi-Fi gear does not support VLAN-tagged traffic, I chose to implement multiple VXLAN tunnels instead (one per VLAN). Each VLAN was bridged to its corresponding VXLAN tunnel endpoint (VTEP) on both sides of the Wi-Fi link.
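
To make the layout concrete, here is a minimal sketch of a single VLAN-to-VXLAN bridge in Linux (iproute2) terms. The interface names, VXLAN ID, and addresses are placeholders rather than my actual configuration, and the OPNsense side is set up through its GUI rather than with commands like these.

    # VTEP for VLAN 10: VXLAN ID 10, carried between the two ends of the
    # Wi-Fi link (192.168.100.1 at the house, 192.168.100.2 at the barn)
    ip link add vxlan10 type vxlan id 10 local 192.168.100.2 \
        remote 192.168.100.1 dstport 4789

    # Bridge the VTEP to the local VLAN 10 subinterface (assumed to exist
    # as eth0.10) so tagged traffic flows transparently through the tunnel
    ip link add br10 type bridge
    ip link set vxlan10 master br10
    ip link set eth0.10 master br10
    ip link set dev vxlan10 up
    ip link set dev eth0.10 up
    ip link set dev br10 up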

Figure 1. VXLAN-based network architecture

If at this point you, the reader, are wondering “hey, the barn is only 300 feet away, wouldn’t it have been less complicated to just pull some fiber?”, you would be right. Unfortunately for you and me, we have not even finished describing the complexity of this setup.

See how it says “OPNsense 1” in the diagram above? That’s right, there’s more than one. I have OPNsense set up for high availability, which means there are actually 2 OPNsense routers in an active-passive failover configuration. If one fails (or needs to be taken offline for maintenance), the other one seamlessly takes over.

To make sure that the VXLAN tunnels to the barn aren’t cut off if OPNsense 1 is unavailable, OPNsense 2 also needs to be configured with VTEPs and associated bridges. Since VXLAN traffic is carried over UDP, and the VTEPs are configured to listen on the virtual IP shared between the two routers, there would be no conflicts between the two sets of VTEPs: only one of the routers owns the virtual IP at any given point in time.
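
OPNsense handles the shared virtual IP with CARP, all of it configured through the GUI. Under the hood it boils down to something like the FreeBSD ifconfig lines below; the interface name, VHID, password, and address are placeholders for illustration, not my real values.

    # Primary router: advertise the shared virtual IP via CARP
    ifconfig igb1 vhid 1 advskew 0 pass s3cret alias 192.168.100.4/24

    # Backup router: same VHID and password, higher advskew so it only
    # takes over the address when the primary stops advertising
    ifconfig igb1 vhid 1 advskew 100 pass s3cret alias 192.168.100.4/24

The VTEPs bind to that shared address, so only whichever router currently holds it has live tunnels.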

Figure 2. High availability VXLAN endpoint architecture

The fact that each router had bridges between the same logical network segments seemed a bit network-loop-ish to me, so I ran through a few scenarios to make sure there were no such gremlins lurking. For example:

  1. A broadcast packet is sent from the barn over the VXLAN tunnel for VLAN 1
  2. The packet is received and decapsulated by the VTEP on OPNsense 1
  3. The packet exits OPNsense 1 after being forwarded by bridge br1 to interface vlan01
  4. The packet is broadcast by the switch, and enters OPNsense 2 on interface vlan01
  5. The packet is forwarded by bridge br1 to the VTEP, and is encapsulated
  6. The VTEP is unable to send the encapsulated packet, as OPNsense 2 does not have control of the virtual IP the VTEP is bound to

Everything seemed alright, and the network ran fine for months with this setup. Plus, even if there were any network loops, that’s what STP is for, right?

Everything was not, in fact, alright.

One day, the network went down. Hard. Devices on the network seemed to lose not only internet access, but access to other devices on the network as well. Restarting the primary router fixed the network, so hey, it was probably just a wayward solar flare, right?

Nope. The network went down again, and again, seemingly without rhyme or reason. While devices on the network could not access the internet, monitoring from the internet side would never show any problems; services that I had hosted at home would continue to be accessible throughout the outage. And of course, at this point I still had no clue what was causing the outage, and therefore not even an educated guess as to how to reproduce it.

When I have Wireshark open, you know I’m having an excellent day.

Eventually, I got Wireshark capturing during one of the outages. The problem was immediately apparent, scrolling by at a thousand packets per second: a storm of mDNS query responses for the hostname of a new Mac. Great, that makes sense: the Mac was a recent addition to the network, which lines up with when the outages started happening. If the outages were caused by the Mac, taking it off the network must solve the problem, right? Nope. While turning on the Mac had a 99% probability of triggering an outage due to the storm of mDNS responses, turning the Mac off would never stop the storm.

Looking a bit closer at the storm, I noticed that each packet originated from one of 2 MAC addresses. Since there were 2, I suspected that they belonged to the routers. Sure enough, inspection of the routers showed that those were the MAC addresses for bridge br1 on each one.
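
For anyone doing similar triage: tallying source MAC addresses is quicker than scrolling. Assuming a saved capture called storm.pcapng, something like this gets you the count per MAC:

    # Count mDNS packets per source MAC address in the capture
    tshark -r storm.pcapng -Y mdns -T fields -e eth.src | sort | uniq -c | sort -rn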

Bridges are Layer 2 constructs; when they forward packets, Layer 2 details (e.g. source/destination MAC address) should not change. The fact that these packets had source MAC addresses belonging to the routers means that these mDNS response packets were being generated by the routers themselves, not just being forwarded through.

Why is my router impersonating a client?

My first guess was mDNS repeater functionality. If the mDNS repeater was running on both OPNsense machines, then maybe it was bouncing a response between VLANs (i.e. OPNsense 1 picks up an mDNS response on VLAN 1 and sends it on VLAN 2, where it gets picked up by OPNsense 2 and sent back on VLAN 1). But I was only seeing the packet storm on VLAN 1; if it were the mDNS repeater, I would’ve expected to see the same storm on VLAN 2. Additionally, disabling the mDNS repeater did not stop the outages from happening.

I could not think of anything else that would cause OPNsense to generate a packet storm like this. In any case, I figured I needed to take a closer look at the start of a storm to make sense of the whole thing.

Wireshark running? Check. Mac ready to be turned on? Check. Lights, camera, action!

As it came online, the Mac sent a whole bunch of mDNS queries asking about services on its own hostname. It then immediately replied to those queries.1 The packet storm trigger? One of these replies was too big for a single IPv6 packet, and had to be fragmented.
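
For the curious, fragmented IPv6 traffic is easy to pick out of a capture, since the fragments carry a Fragment extension header (Next Header value 44). Something along these lines surfaces them (boot.pcapng being a hypothetical capture taken while the Mac boots):

    # Show IPv6 packets whose base header points at a Fragment extension header
    tshark -r boot.pcapng -Y "ipv6.nxt == 44"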

So, what’s wrong with a fragmented packet?

The core issue is a quirk in the way pf (the OPNsense/BSD packet filter) handles packet fragmentation.

pf is a Layer 3 firewall. For it to filter packets properly, it needs to see the whole packet, which means it will reassemble any fragmented packets it encounters. However, when a fragmented packet is reassembled, all the original Layer 2 headers are thrown away. The firewall now doesn’t know anything (nor does it care) about the source or destination MAC addresses.
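
The reassembly comes from pf’s normalization (“scrub”) stage. In raw pf.conf terms it is the kind of rule sketched below; OPNsense manages normalization for you, so this is only to show where the reassembly happens, not a suggestion to go turn it off.

    # pf normalization: reassemble fragmented packets before filtering.
    # Reassembly is what discards the original Layer 2 framing.
    scrub in all fragment reassemble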

This is not a problem in most cases, since routers like OPNsense usually sit between different Layer 2 network segments. Once an incoming packet goes through the firewall, it is sent on the next network segment with the source MAC address set to that of the router and the destination MAC address set to that of the next hop. The data from the original Layer 2 header is irrelevant on the new network segment.

In this case, though, we’re dealing with a bridge; both sides are on the same Layer 2 network segment. All that usually irrelevant Layer 2 header data is now very relevant; treating the packet like one that needs to be routed at Layer 3 leads to some very interesting potential modes of failure.

Since I have two routers on the same Layer 2 network segment with the same quirk, the following set of events would happen (in a loop) every time the Mac sent out a fragmented mDNS response packet:

  1. OPNsense 1 receives the fragmented packet on bridge br1.
  2. pf on OPNsense 1 reassembles the packet as part of its inspection, kicking it onto the Layer 3 processing path.
  3. OPNsense 1 “routes” the packet back out onto the same network, re-fragmenting it in the process. The source MAC address is set to that of bridge br1.
  4. Events 1 through 3 happen on OPNsense 2.

This quirk turned a single connection between 2 bridges into a de facto network loop.2

Let’s step away from the firewall, then.

Since the quirk causing the issue was primarily a BSD thing, I tried moving the entire highly available VXLAN endpoint architecture onto a set of Linux machines instead, using VRRP to share the virtual IP address. Because the Linux machines did not have to do any packet filtering, there would be no packet reassembly, and therefore nothing to trigger the loop.
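
With keepalived (one common way to run VRRP on Linux), the virtual IP sharing boils down to a config stanza along these lines; the interface, router ID, and address below are placeholders:

    vrrp_instance vxlan_vip {
        # the peer runs state BACKUP with a lower priority, e.g. 100
        state MASTER
        interface eth0
        virtual_router_id 51
        priority 200
        advert_int 1
        virtual_ipaddress {
            # the address the VTEPs bind to
            192.168.100.4/24
        }
    }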

While this approach did fix the packet storms behind the outages, it introduced another problem: none of the IP cameras at the barn were accessible from the house. The cause was a combination of the following (see the MTU probe sketch after this list):

  • VXLAN tunnels have overhead: the outer Ethernet, IP, UDP, and VXLAN headers add roughly 50 bytes, so the MTU inside a tunnel is smaller than that of its transport.
  • There is no way to configure the IP cameras to use a smaller MTU, and no way to increase the MTU of the Wi-Fi link to allow a larger MTU inside the tunnel.
  • Layer 2 does not perform fragmentation. Frames larger than the MTU are silently dropped.
  • Fragmentation at Layer 3 is performed by routers (or by the sending host), not by bridges or tunnel endpoints.
  • With the new setup, the tunnel endpoint was no longer on the same machine as the router, so nothing in the path would fragment the cameras’ packets down to fit the tunnel.
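
A quick way to see the mismatch is to probe the path with DF-bit pings from a Linux host on the house side toward a device at the barn. The sizes below assume a 1500-byte transport MTU and the usual ~50 bytes of VXLAN-over-IPv4 overhead, and 192.168.20.50 is a stand-in for one of the cameras:

    # 1422 bytes of payload + 8 (ICMP) + 20 (IPv4) = a 1450-byte packet,
    # small enough to fit inside the tunnel, so this should get replies
    ping -M do -s 1422 192.168.20.50

    # 1472 + 8 + 20 = 1500 bytes: fine on a normal Ethernet segment, but
    # about 50 bytes too big for the tunnel, so this one just disappears
    ping -M do -s 1472 192.168.20.50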

It was at this point that I decided VXLAN was not the way.

What now?

In my search for something I could use to extend multiple VLANs across a Wi-Fi link without reducing the MTU, I found B.A.T.M.A.N. advanced (batman-adv). Commonly used as part of mesh Wi-Fi products, batman-adv is a transport-agnostic Layer 2 mesh protocol.

It encapsulates and forwards all traffic until it reaches the destination, hence emulating a virtual network switch of all nodes participating. Therefore all nodes appear to be link local and are unaware of the network’s topology as well as unaffected by any network changes.

The easiest way to think about a batman-adv mesh network is to imagine it as a distributed switch. Each system (often called “node”) running batman-adv is equal to a switch port.

The best part is, batman-adv understands that the underlying transport may not always have an MTU large enough to carry full-sized Layer 2 frames after encapsulation, so it implements transparent Layer 2 fragmentation.
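
I didn’t end up deploying batman-adv itself (keep reading), but for a sense of how little ceremony a node needs, the Linux setup is roughly this, assuming wlan0 is the radio carrying the mesh:

    # Load the module and attach the radio as a batman-adv hard interface;
    # this creates the virtual mesh interface bat0
    modprobe batman-adv
    batctl if add wlan0

    # Layer 2 fragmentation is on by default; this just makes it explicit.
    # Frames that don't fit the transport MTU get split and reassembled.
    batctl fragmentation 1

    # bat0 now behaves like a port on one big distributed switch;
    # bridge VLANs onto it as you would onto any other interface
    ip link set dev wlan0 up
    ip link set dev bat0 up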

Wait a minute… mesh?

At the house, I use UniFi access points, which support wireless meshing. The firmware on UniFi APs is based on OpenWRT; so is the firmware on the EnStation5-AC at the barn. Finding batman-adv got me thinking… can I configure the EnStation5-AC to play nice with UniFi’s wireless meshing solution? This would quite literally be exactly what I need: a wireless link that can push multiple networks to a remote access point.

UniFi allows root access to their access points if you enable SSH in the controller, and it’s not hard to get root access to most EnGenius access points (including the EnStation5-AC). This made it easy to poke around and figure out how configuration changes in the UI affected the underlying OpenWRT system.

The solution

Once wireless uplink (a.k.a. Mesh Parent) is enabled on the UniFi access point, /etc/hostapd/vwire*.cfg contains the details of the “mesh”3 network. Turns out, all that’s needed after that is to configure the EnStation5-AC to operate in “WDS Station” mode and fill in the SSID and passphrase from the config file on the UniFi access point.
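
On a stock OpenWRT system, the equivalent of “WDS Station” mode is a client (sta) interface with 4-address mode enabled. The EnGenius UI presumably generates something along these lines under the hood; the stanza below is a generic OpenWRT sketch with placeholder values, and the layout on EnGenius firmware may well differ:

    config wifi-iface 'wds_sta'
            # Join the UniFi AP as a client, with 4-address (WDS) framing
            option device 'radio0'
            option mode 'sta'
            option wds '1'
            # SSID and passphrase copied from /etc/hostapd/vwire*.cfg on the
            # UniFi access point (placeholders here)
            option ssid 'vwire-xxxxxx'
            option encryption 'psk2'
            option key 'passphrase-from-vwire-cfg'
            option network 'lan'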

Not quite as simple as fiber, but definitely much closer.

Figure 3. WDS-based network architecture

Some final thoughts

So far, the new setup has worked perfectly for both tagged VLANs. However, untagged traffic does not pass properly, and the whole thing stops working if the EnStation5-AC is configured with a management VLAN other than the default. I suspect this has something to do with the way the bridge on the EnStation5-AC gets configured, but I somehow also managed to break SSH access in the process of debugging it. Oh well, at some point I’ll have the time to factory reset it and try again… ∎

Footnotes

  1. Why it does this, I have no idea. Seems conceptually similar to Gratuitous ARP.
  2. A “loop” that STP had no chance of solving, I might add…
  3. It’s not actually a mesh (no 802.11s), it’s just WDS.