Building redundant WAN routing and security services mitigates a critical single point of failure in mission-critical networks. If a single gateway fails, a standby unit can seamlessly continue servicing clients without disruption.
Many proprietary and standards-based gateway redundancy protocols have been developed over the years to solve this problem. HSRP, VRRP, and GLBP are all common examples.
The Cisco Meraki MX security appliance offers a similar HA solution called warm spare mode. Enabling this option provides a seamless way to create a highly-available pair of MX appliances with automatic configuration, gateway, and VPN peer syncing.
Warm spare mode can be enabled in just a few clicks and removes the complications found in traditional redundancy protocols by eliminating the possibility of mismatched keepalive or timeout values. Cable, enable, and the MXs sync.
Warm Spare in NAT Mode
MX has two different posture options – NAT mode (default) and VPN concentrator (or transparent) mode. The most common implementation is NAT mode, where internet or MPLS uplinks are connected to the WAN1/2 ports and the internal network is connected to the LAN ports. NAT mode is where we’ll start for the purposes of this article.
NAT Mode WAN Connectivity
WAN uplinks should be mirrored across both security appliances in an identical fashion. If you are using a single ISP, then WAN1 on both MX units should be connected to the ISP directly. If you are using two different ISP uplinks, it is recommended to use both WAN interfaces across the primary and standby MXs. Doing so enables seamless failover and uplink parity across appliances.
The diagrams below illustrate the WAN uplink port mirror recommendation when using a single provider (top) or when using multiple (bottom).
Both MX appliances require individual WAN uplink IP addresses for independent cloud access and uplink monitoring. Using a /29 or larger WAN IP mask allows for three addresses on the shared segment to each provider (ISP gateway, MX1-WAN1, MX2-WAN1).
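To make the address arithmetic concrete, here is a minimal sketch using Python's standard `ipaddress` module. The function names and the 203.0.113.0 documentation prefix are illustrative assumptions, not anything from Dashboard. It checks whether a provider subnet has room for the three addresses listed above.

```python
import ipaddress

# Addresses needed on the shared segment: ISP gateway, MX1-WAN1, MX2-WAN1.
REQUIRED_ADDRESSES = 3


def usable_hosts(cidr: str) -> int:
    """Return the number of assignable host addresses in an IPv4 subnet."""
    net = ipaddress.ip_network(cidr, strict=False)
    if net.prefixlen >= 31:
        # /31 and /32 have no separate network/broadcast addresses.
        return net.num_addresses
    # Subtract the network and broadcast addresses.
    return net.num_addresses - 2


def fits_warm_spare(cidr: str, required: int = REQUIRED_ADDRESSES) -> bool:
    """True if the subnet can hold the gateway plus one uplink IP per MX."""
    return usable_hosts(cidr) >= required


if __name__ == "__main__":
    for prefix in ("203.0.113.0/30", "203.0.113.0/29"):
        print(prefix, usable_hosts(prefix), fits_warm_spare(prefix))
```

A /30 leaves only two usable addresses, which is one short, while a /29 leaves six.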
If two physical Ethernet handoffs are available from your provider, then each can be directly connected to the appropriate WAN interface on each MX. If only a single physical Ethernet handoff is available from your provider, there are two uplink cabling options to solve this, which we will review in detail below. Both solutions require using switch ports to bridge the MX WAN and service provider handoffs into an isolated L2 segment. One provider requires three bridged ports and two providers require six bridged ports. The key difference between the solutions is whether dedicated WAN breakout switching is used or if securing ports on the existing downstream switching is preferred.
Option 1: WAN Breakout Switch
Using a dedicated breakout switch is a simple way to split the WAN uplinks between the PE and CPE devices. An MS220-8 is a great solution. While easy to deploy, consider that this also introduces additional hardware to the solution as well as an added point of failure in the redundant WAN path.
Option 2: Use 3/6 Downstream Switch Ports
An alternative option is to configure non-routed WAN-transit VLANs in the downstream LAN switching to serve the same function. This simply creates isolated ports on the local switching that bridge the provider handoff to the WAN interface on each MX. If you have a pair of switches, this can offer a fully redundant solution by consuming either three or six ports in total.
The diagram below shows dual provider connectivity for both single and redundant switch options. This is a slightly more complicated cabling design but provides the highest level of resiliency. If only a single provider is used, one VLAN would be required and would reduce the switch ports consumed down to three.
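The port math can be sketched as a small planning helper. This is only an illustration: the VLAN IDs (starting at 900), the provider labels, and the function names are all hypothetical, and the sketch assumes one isolated VLAN per provider containing the provider handoff plus the matching WAN port on each MX.

```python
def wan_transit_plan(providers, first_vlan=900):
    """Map a hypothetical WAN-transit VLAN ID to the ports bridged in it."""
    if not 1 <= len(providers) <= 2:
        # The MX exposes WAN1 and WAN2, so this sketch allows one or two providers.
        raise ValueError("expected one or two providers")
    plan = {}
    for offset, provider in enumerate(providers):
        wan_port = f"WAN{offset + 1}"
        plan[first_vlan + offset] = [
            f"{provider} handoff",
            f"MX1 {wan_port}",
            f"MX2 {wan_port}",
        ]
    return plan


def ports_consumed(plan):
    """Total switch ports used across all WAN-transit VLANs."""
    return sum(len(members) for members in plan.values())


if __name__ == "__main__":
    plan = wan_transit_plan(["ISP-A", "ISP-B"])
    for vlan, members in plan.items():
        print(vlan, members)
    print("ports consumed:", ports_consumed(plan))
```

With one provider the plan holds a single VLAN and three ports. With two providers it holds two VLANs and six ports.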
Now that we have a better understanding of the WAN uplink cabling options, let’s explore the MX LAN-side connectivity and how the warm spare’s VRRP mechanics work behind the scenes.
Warm Spare Details
Warm spare on MX uses the Virtual Router Redundancy Protocol, abbreviated VRRP, for sharing uplink health and connectivity status information between appliances. VRRP heartbeats are sent across the LAN interfaces on each VLAN every second. If no VRRP keepalives are heard by the secondary MX on any VLAN after three seconds, the dead timer will expire, triggering a failover event.
Once higher priority heartbeats are seen again by the secondary MX, it immediately relinquishes the gateway response role back to the primary.
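The timer behavior described above can be modeled in a few lines. This is a simplified sketch, not Meraki's implementation. It tracks a single heartbeat stream rather than one per VLAN, and the class and method names are invented for illustration. Only the one-second hello interval, the three-second dead timer, and the immediate failback come from the description above.

```python
from dataclasses import dataclass

HELLO_INTERVAL = 1.0  # seconds between heartbeats from the primary
DEAD_TIMER = 3.0      # seconds of silence before the secondary takes over


@dataclass
class SecondaryMX:
    last_heard: float = 0.0
    is_master: bool = False

    def on_heartbeat(self, now: float) -> None:
        """A higher-priority heartbeat arrived from the primary."""
        self.last_heard = now
        # Failback is immediate once the primary is heard again.
        self.is_master = False

    def tick(self, now: float) -> None:
        """Periodic check: take over once the dead timer has expired."""
        if now - self.last_heard >= DEAD_TIMER:
            self.is_master = True


if __name__ == "__main__":
    spare = SecondaryMX()
    for t in (0.0, 1.0, 2.0):   # primary healthy, then it goes silent
        spare.on_heartbeat(t)
    for t in (3.0, 4.0, 5.0):
        spare.tick(t)
        print(f"t={t}: secondary master = {spare.is_master}")
    spare.on_heartbeat(8.0)      # primary returns
    print(f"t=8.0: secondary master = {spare.is_master}")
```

In this run the last heartbeat arrives at t=2, so the secondary takes over at t=5 and steps down as soon as the heartbeat at t=8 is heard.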
Now that we’ve reviewed the basic mechanics of warm spare VRRP on the MX LAN interfaces, we can turn our attention to cabling options.
Warm Spare VRRP Keepalive Connectivity: Direct-attached
Using Warm Spare, a pair of MX appliances can be cabled together using two different methods:
- Directly connected using a dedicated LAN port on each MX
- Indirectly connected using the LAN switching for MX-to-MX communication
Since the first, direct-attached method is generally the preferred method, that’s where we’ll begin.
Using this approach, a dedicated heartbeat VLAN interface is created for direct communication. That VLAN is then pruned from the downstream switch ports to avoid an L2 loop. Once the configuration is in place, the appliances can be connected to each other using the LAN ports designated specifically for VRRP heartbeat communication (blue line above).
Features of direct-attached warm spare:
- Fewer false-positive failover triggers. Since the devices are directly attached, a downstream switch error would not erroneously initiate a failover event.
- Faster failover/failback. Each MX is one hop from its peer leading to rapid convergence.
Warm Spare VRRP Keepalive Connectivity: Switch Interconnect
Alternatively, a warm spare pair of MX appliances can be cabled together indirectly using the existing LAN switching for MX-to-MX communication.
In this arrangement there is no dedicated VRRP VLAN defined or MX-to-MX cabling. Instead, VRRP heartbeats are transmitted to each MX as broadcast messages through the LAN switches and existing VLANs.
While this is a supported warm spare design for NAT-mode MXs, I would recommend the direct-attached implementation for the reasons below.
Features of switch-interconnect warm spare:
- Triggers failover if LAN cabling or downstream switch ASIC fails. Any downstream connectivity failure will trigger a failover event.
- Reduced failure control. Since downstream problems like Spanning Tree topology errors, broadcast storms, or switch configuration mistakes can cut communication between MXs, a failover can occur for events outside of MX hardware failure (which could be undesirable).
- Simpler warm spare configuration. Using the switch fabric for heartbeat communication means no additional cabling or dedicated VLANs are required. This makes the initial configuration very easy, but problems are operationally more difficult to diagnose since there are more devices between the two MXs.
Warm Spare For MX Concentrator Mode
Now that we’ve covered the NAT-mode HA deployment details, let’s move on to concentrator mode MX warm spare design.
VPN Concentrator mode was specifically developed for data center deployments where the MX appliance is positioned behind existing DC firewall infrastructure. This leads to some unique data center advantages, like the ability to decouple WAN transport providers (public internet, private MPLS, etc.) from physical interfaces on the MX and instead use the appliance as a high-speed WAN aggregation concentrator.
In a concentrator deployment model, only a single interface is connected – namely WAN1. WAN1 is used to terminate both incoming flows from remote MX peers as well as process outbound flows.
Simple one-armed connectivity means that the MX appliance will only have a single IP address configured (and optionally a virtual IP, which we’ll discuss below). The diagram below shows the concentrator mode deployment upgraded to an HA configuration.
Troubleshooting Warm Spare Sync Issues
Issues related to warm spare syncing usually stem from one of two problems. Either one of the appliance WAN uplinks isn’t successfully registering to the cloud Dashboard or (more likely) the LAN-side VRRP heartbeats aren’t being bidirectionally received by both units.
Symptoms can include flapping master roles, active/active master roles, or one or more appliances not detected by Dashboard. This can lead to intermittent traffic drop behavior or poor throughput performance from LAN clients.
- Verify that each MX appliance is successfully connected to Dashboard. A solid white LED is a good indication for MX64/65 models. Connecting to the local status page takes a little more work, but ultimately is the best way to confirm cloud connectivity from each appliance.
- Review your MX-to-MX heartbeat connectivity. If the MXs are cabled together directly, verify that a dedicated heartbeat VLAN is used, the LAN interface mode is set to access, and the heartbeat VLAN is removed from any other MX LAN interfaces – including all trunk ports.
- If the MXs are using the downstream LAN switching to communicate, remove any direct MX-to-MX cables. Also verify that the switch uplinks to the MX are not in an error condition or Spanning Tree blocking/discarding state. This is the most common cause for warm spare sync issues.
- If all of the above checks pass and you are seeing a dual-master condition on the Appliance status page, then perform a packet capture on the secondary MX LAN interface.
If you see the appliance sending VRRP packets but not receiving any response, then the layer two switch path between appliances needs to be re-examined.
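The checks above follow a fixed order, which can be sketched as a small decision helper. Everything here is hypothetical scaffolding: the field names and the returned messages are mine, and the inputs are observations you gather by hand. The code does not query Dashboard or the appliances.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    both_in_dashboard: bool                 # each MX is connected to Dashboard
    direct_attached: bool                   # MXs cabled together directly
    heartbeat_vlan_isolated: bool = True    # dedicated VLAN, access port, pruned elsewhere
    stray_direct_cable: bool = False        # leftover MX-to-MX cable in a switch design
    switch_uplinks_forwarding: bool = True  # no error or STP blocking on switch uplinks
    dual_master: bool = False               # both units show as master


def next_step(obs: Observations) -> str:
    """Return the first troubleshooting action that applies, in list order."""
    if not obs.both_in_dashboard:
        return "Restore Dashboard connectivity on the offline MX"
    if obs.direct_attached and not obs.heartbeat_vlan_isolated:
        return "Fix the heartbeat VLAN: dedicated, access mode, removed from other LAN ports"
    if not obs.direct_attached:
        if obs.stray_direct_cable:
            return "Remove the direct MX-to-MX cable"
        if not obs.switch_uplinks_forwarding:
            return "Clear the error or STP blocking state on the switch uplinks"
    if obs.dual_master:
        return "Take a packet capture on the secondary MX LAN interface"
    return "No issue found by these checks"


if __name__ == "__main__":
    obs = Observations(both_in_dashboard=True, direct_attached=False,
                       switch_uplinks_forwarding=False)
    print(next_step(obs))
```

Returning only the first failing check mirrors the order of the list, so cloud connectivity is ruled out before any time is spent on captures.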
The Wireshark output below shows healthy warm spare LAN behavior. Packet captures investigating improper warm spare behavior should always be taken from the secondary appliance LAN interface. Notice the bidirectional UDP packets. Both the source and destination addresses use our heartbeat VLAN IP (220.127.116.11), which is shared by both appliances. More importantly, the source MAC addresses show both appliances participating in the exchange.
Another indication of a working VRRP process is the VRRP priority seen in a packet capture on the secondary appliance LAN interface. The source MAC address will be that of the virtual VRRP group and the priority should be 255. A priority value of 255 represents the master role and indicates the primary appliance hellos are being seen by the secondary MX.
If the MX appliances are unable to communicate across their LAN ports, a dual-active condition will occur. The packet capture below is taken from the secondary appliance and shows the broken behavior.
Notice that the VRRP priority is set to 235 – indicating the secondary is unable to see the primary’s packets which would be marked as 255. Also notice the missing UDP exchange we observed in the working model above. This indicates the devices are not on a shared layer 2 broadcast segment, filtering is occurring downstream, or an error condition (like switch port STP blocking) is at play.
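If you export the captured VRRP payloads, the priority check can be scripted. The sketch below assumes the advertisements follow the standard VRRPv2 header layout from RFC 3768 (version/type, virtual router ID, priority, and so on). I have not verified that the MX frames match that layout byte for byte, so treat the parser as an assumption. The 255 and 235 values come from the captures discussed above, and the function names are illustrative.

```python
import struct

PRIMARY_PRIORITY = 255    # seen when the primary's hellos reach the secondary
SECONDARY_PRIORITY = 235  # seen when the secondary only sees its own adverts


def parse_vrrp_header(payload: bytes) -> dict:
    """Decode the fixed 8-byte VRRPv2 header (RFC 3768 layout, assumed here)."""
    if len(payload) < 8:
        raise ValueError("VRRP header needs at least 8 bytes")
    ver_type, vrid, priority, count, auth_type, adver_int, checksum = struct.unpack(
        "!BBBBBBH", payload[:8]
    )
    return {
        "version": ver_type >> 4,   # high nibble
        "type": ver_type & 0x0F,    # low nibble, 1 = advertisement
        "vrid": vrid,
        "priority": priority,
        "ip_count": count,
        "adver_int": adver_int,
        "checksum": checksum,
    }


def interpret(priorities_seen) -> str:
    """Classify a capture taken on the secondary MX LAN interface."""
    if PRIMARY_PRIORITY in priorities_seen:
        return "healthy: primary hellos are reaching the secondary"
    if SECONDARY_PRIORITY in priorities_seen:
        return "dual-active: secondary is not hearing the primary"
    return "inconclusive: no known priority observed"


if __name__ == "__main__":
    sample = bytes([0x21, 1, 235, 1, 0, 1, 0, 0])  # hand-built example payload
    header = parse_vrrp_header(sample)
    print(header["priority"], interpret({header["priority"]}))
```

The sample payload is hand-built rather than taken from a real capture, so it only demonstrates the parsing and classification steps.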
My goal in this writeup on the details of MX warm spare design is to document proper HA design principles and explain the underlying mechanics. The reality is that enabling hardware redundancy with Meraki MX is very simple, but the details around your appliance mode and cabling certainly matter.
Staying within the recommended deployment guidelines laid out will save you time and future frustration. If you do find problems with your setup, following the troubleshooting steps should expedite the recovery process and put you back on the path to a reliable, redundant Meraki MX architecture.