Building redundant WAN routing and security services mitigates a critical single point of failure in mission-critical networks. If a single gateway fails, a standby unit can seamlessly continue servicing clients without disruption.
Many proprietary and standards-based gateway redundancy protocols have been developed over the years to solve this problem. HSRP, VRRP, and GLBP are all common examples.
The Cisco Meraki MX security appliance offers a similar HA solution called warm spare mode. Enabling this option provides a seamless way to create a highly-available pair of MX appliances with automatic configuration, gateway, and VPN peer syncing.
Warm spare mode can be enabled in just a few clicks and removes the complications found in traditional redundancy protocols by eliminating the possibility of mismatched keepalive or timeout values. Cable, enable, and the MXs sync.
Warm Spare in NAT Mode
MX has two different posture options – NAT mode (default) and VPN concentrator (or transparent) mode. The most common implementation is NAT mode, where internet or MPLS uplinks are connected to the WAN1/2 ports and the internal network is connected to the LAN ports. NAT mode is where we’ll start for the purposes of this article.
NAT Mode WAN Connectivity
WAN uplinks should be mirrored across both security appliances in an identical fashion. If you are using a single ISP, then WAN1 on both MX units should be connected to the ISP directly. If you are using two different ISP uplinks, it is recommended to use both WAN interfaces on the primary and standby MXs. Doing so enables seamless failover and uplink parity across appliances.
The diagrams below illustrate the WAN uplink port mirror recommendation when using a single provider (top) or when using multiple (bottom).
Both MX appliances require individual WAN uplink IP addresses for independent cloud access and uplink monitoring. Using a /29 or larger WAN IP mask allows for three addresses on the shared segment to each provider (ISP gateway, MX1-WAN1, MX2-WAN1).
This is an important consideration – you will need a minimum of two usable IP addresses from your provider (one per MX). If you have a /30 (which leaves only a single usable IP address once the ISP gateway is accounted for), then you may need to request a /29 mask from your provider to accommodate terminating both CPE devices.
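The address math above is easy to check with Python's standard ipaddress module. This is a minimal sketch, assuming a conventional IPv4 subnet where the network and broadcast addresses are unusable and the provider keeps one address for its gateway. The function names and the 203.0.113.0 documentation range are my own placeholders, not anything from Dashboard.

```python
import ipaddress


def usable_for_cpe(cidr: str, isp_gateways: int = 1) -> int:
    """Host addresses left for customer equipment after the ISP gateway(s)."""
    net = ipaddress.ip_network(cidr, strict=False)
    # Assumption: conventional IPv4 subnet, so network and broadcast are unusable.
    hosts = net.num_addresses - 2
    return max(hosts - isp_gateways, 0)


def fits_warm_spare(cidr: str, addresses_needed: int = 2) -> bool:
    """True if the subnet can hold one WAN address per MX (two by default)."""
    return usable_for_cpe(cidr) >= addresses_needed


if __name__ == "__main__":
    for cidr in ("203.0.113.0/30", "203.0.113.0/29"):
        print(cidr, usable_for_cpe(cidr), fits_warm_spare(cidr))
```

A /30 leaves one address for your equipment, which is why it cannot terminate both MXs, while a /29 leaves five.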
If two physical Ethernet handoffs are available from your provider, then each can be directly connected to the appropriate WAN interface on each MX. If only a single physical Ethernet handoff is available from your provider, there are two uplink cabling options that solve this, which we will review in detail below. Both solutions require using switch ports to bridge the MX WAN and service provider handoffs into an isolated L2 segment. One provider requires three bridged ports (the provider handoff plus one WAN port per MX) and two providers require six. The key difference between the solutions is whether dedicated WAN breakout switching is used or if securing ports on the existing downstream switching is preferred.
Option 1: WAN Breakout Switch
Using a dedicated breakout switch is a simple way to split the WAN uplinks between the PE and CPE devices. An MS220-8 is a great solution. While easy to deploy, consider that this also introduces additional hardware to the solution as well as an added point of failure in the redundant WAN path.
Option 2: Use 3/6 Downstream Switch Ports
An alternative option is to configure non-routed WAN-transit VLANs in the downstream LAN switching to serve the same function. This simply creates isolated ports on the local switching that bridge each provider handoff to the matching WAN port on both MXs. If you have a pair of switches, this can offer a fully redundant solution by consuming either three or six ports in total.
The diagram below shows dual provider connectivity for both single and redundant switch options. This is a slightly more complicated cabling design but provides the highest level of resiliency. If only a single provider is used, one VLAN would be required and would reduce the switch ports consumed down to three.
Now that we have a better understanding of the WAN uplink cabling options, let’s explore the MX LAN-side connectivity and how the warm spare’s VRRP mechanics work behind the scenes.
Warm Spare Details
Warm spare on MX uses the Virtual Router Redundancy Protocol, abbreviated VRRP, for sharing uplink health and connectivity status information between appliances. VRRP heartbeats are sent across the LAN interfaces on each VLAN every second. If no VRRP keepalives are heard by the secondary MX on any VLAN after three seconds, the dead timer will expire, triggering a failover event.
Once higher priority heartbeats are seen again by the secondary MX, it immediately relinquishes the gateway response role back to the primary.
WAN interfaces use a series of health checks via the connection monitor service independently and do not use VRRP for warm spare communication.
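To make the LAN-side timer behavior concrete, here is a small simulation of the secondary's view. It is a sketch under stated assumptions, not Meraki's implementation: it uses the one-second interval and three-second dead timer quoted above, collapses all VLANs into a single heartbeat stream, and samples once per second. The class and function names are hypothetical.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL_S = 1  # heartbeat interval quoted in the article
DEAD_TIMER_S = 3          # dead timer quoted in the article


@dataclass
class SpareState:
    """What the secondary MX believes about its own role (simplified)."""
    last_heard_s: int = 0
    is_master: bool = False

    def on_heartbeat(self, now_s: int) -> None:
        # A higher-priority heartbeat was seen: step down immediately.
        self.last_heard_s = now_s
        self.is_master = False

    def tick(self, now_s: int) -> None:
        # Silence for the full dead interval: take over the gateway role.
        if now_s - self.last_heard_s >= DEAD_TIMER_S:
            self.is_master = True


def simulate(heartbeat_times, end_s):
    """Return the secondary's role at each whole second from 0 to end_s."""
    state = SpareState()
    beats = set(heartbeat_times)
    roles = []
    for t in range(0, end_s + 1, HEARTBEAT_INTERVAL_S):
        if t in beats:
            state.on_heartbeat(t)
        else:
            state.tick(t)
        roles.append("master" if state.is_master else "passive")
    return roles


if __name__ == "__main__":
    # Primary is heard at t=0..2, goes silent, then returns at t=8.
    print(simulate([0, 1, 2, 8, 9], end_s=9))
```

In this timeline the secondary stays passive through t=4, takes over at t=5 (three seconds after the last heartbeat at t=2), and steps down again as soon as the primary is heard at t=8.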
Now that we’ve reviewed the basic mechanics of warm spare VRRP on the MX LAN interfaces, we can turn our attention to cabling options.
Warm Spare VRRP Keepalive Connectivity: Direct-attached
Using Warm Spare, a pair of MX appliances can be cabled together using two different methods:
- Directly connected using a dedicated LAN port on each MX
- Indirectly connected using the LAN switching for MX-to-MX communication
Since the first, direct-attached method is generally preferred, that’s where we’ll begin.
Using this approach, a dedicated heartbeat VLAN interface is created for direct communication. That VLAN is then pruned from the downstream switch ports to avoid an L2 loop. Once the configuration is in place, the appliances can be connected to each other using the LAN ports designated specifically for VRRP heartbeat communication (blue line above).
Features of direct-attached warm spare:
- Fewer false-positive failover triggers. Since the devices are directly attached, a downstream switch error would not erroneously initiate a failover event.
- Faster failover/failback. Each MX is one hop from its peer leading to rapid convergence.
Before Configuring Warm Spare, First Verify the Following
- The primary and secondary MX are the same model. Mixing models (even similar form factor like MX64/65) is not supported.
- Both MXs are powered on and have successfully connected to the Cisco Meraki Dashboard. Green status is good.
- Both MXs are running the same firmware version. The only instance where they wouldn’t be running the same level of firmware is if you have elected to run beta firmware previously or one of the units has been promoted to a Stable Release Candidate firmware by Cisco Support. To verify, see Organization > Firmware updates.
- The secondary MX appliance isn’t bound to an existing network. You can check this under Organization > Inventory and verify that the S/N shows a “-” in the Network column. If the secondary appliance is already assigned to a network, remove it.
- The primary MX appliance is bound to an existing network. If not, create a network, then add the primary MX to it. This is the network that will be used to create the warm spare pair.
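If you track appliances in a spreadsheet or script, the checklist above can be expressed as a simple pre-flight function. This is only an illustrative sketch: the Appliance record and its fields are hypothetical and are filled in by hand here, not pulled from any Dashboard API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Appliance:
    """Hypothetical hand-entered record for one MX."""
    serial: str
    model: str
    firmware: str
    online: bool
    network: Optional[str]  # None mirrors the "-" shown in Inventory


def warm_spare_precheck(primary: Appliance, secondary: Appliance) -> List[str]:
    """Return a list of checklist violations; an empty list means ready."""
    problems = []
    if primary.model != secondary.model:
        problems.append("models differ")
    if not (primary.online and secondary.online):
        problems.append("both units must be connected to Dashboard")
    if primary.firmware != secondary.firmware:
        problems.append("firmware versions differ")
    if secondary.network is not None:
        problems.append("secondary is already bound to a network")
    if primary.network is None:
        problems.append("primary is not bound to a network")
    return problems


if __name__ == "__main__":
    # Placeholder serials, firmware labels, and network name.
    primary = Appliance("PRIMARY-SN", "MX64", "fw-A", True, "Branch")
    spare = Appliance("SPARE-SN", "MX65", "fw-A", True, None)
    print(warm_spare_precheck(primary, spare))
```

The sample pair fails on the model check, which matches the MX64/65 caveat in the list.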
Direct-Attached MX Warm Spare Configuration
Step 1: Configure the Warm Spare Heartbeat VLAN
- Dashboard > Select your primary MX network > Security Appliance > Addressing & VLANs page.
- Under the VLANs section, click Add a Local VLAN.
- Assign a name, subnet (can be anything that doesn’t conflict with another local route – I used 18.104.22.168/30 in this example), MX IP, VLAN ID and select Update. Save changes.
Step 2: Assign Switch and Warm Spare LAN Ports
- Security Appliance > Addressing & VLANs page, navigate to the Per-port VLAN configuration section.
- Select the MX LAN port you will use for downstream switch connectivity on both MXs (LAN 3 in this example). If it is a trunk, remove the new heartbeat VLAN from the allowed list. The MX doesn’t participate in STP, so pruning the heartbeat VLAN from the switch-connected trunk ports will prevent unintended BPDU flooding after the heartbeat interconnect is added.
- Select the MX LAN port you will use on each device to directly cable the appliances together (LAN 4 in this example). Change the type to access and the VLAN to the new heartbeat VLAN (1111 in this example).
- Save changes.
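The port rules in this step are easy to get wrong, so here is a rough sanity check written against a hand-built description of the per-port settings. The dictionary layout is my own assumption for illustration, not a Dashboard export format. The LAN 3, LAN 4, and VLAN 1111 values come from the example above, and the other VLAN IDs are made up.

```python
def heartbeat_port_problems(ports, heartbeat_vlan, heartbeat_port):
    """Check that the heartbeat VLAN lives only on the direct MX-to-MX port.

    ports maps a LAN port number to a dict with "type" ("trunk" or "access"),
    plus "allowed" (set of VLAN IDs) for trunks or "vlan" for access ports.
    """
    problems = []
    if heartbeat_port not in ports:
        problems.append(f"LAN {heartbeat_port}: heartbeat port is not defined")
    for number, cfg in sorted(ports.items()):
        if number == heartbeat_port:
            if cfg["type"] != "access" or cfg.get("vlan") != heartbeat_vlan:
                problems.append(
                    f"LAN {number}: must be an access port on VLAN {heartbeat_vlan}")
        elif cfg["type"] == "trunk":
            if heartbeat_vlan in cfg.get("allowed", set()):
                problems.append(
                    f"LAN {number}: prune VLAN {heartbeat_vlan} from the trunk")
        elif cfg.get("vlan") == heartbeat_vlan:
            problems.append(
                f"LAN {number}: access port must not use VLAN {heartbeat_vlan}")
    return problems


if __name__ == "__main__":
    ports = {
        3: {"type": "trunk", "allowed": {1, 10, 20, 1111}},  # forgot to prune
        4: {"type": "access", "vlan": 1111},
    }
    print(heartbeat_port_problems(ports, heartbeat_vlan=1111, heartbeat_port=4))
```

An empty result means the heartbeat VLAN is confined to the direct link, which is the condition that avoids the loop described earlier.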
Step 3: Cable Secondary MX
- Before pairing the secondary MX appliance, verify that its WAN uplink is up and the status LED shows it as connected to the cloud controller. The local status page can also be used to confirm cloud connectivity health.
- Now cable the MX appliances directly together using the heartbeat LAN port provisioned above.
- Do not cable the secondary MX down to the LAN switch infrastructure quite yet. We want the secondary unit to first inherit its new configuration (including VLANs) before connecting the downstream infrastructure.
Step 4: Configure the Secondary Warm Spare MX
- From the Appliance status page, select the Configure warm spare button.
- Click Enable, then select the secondary MX under the Device serial dropdown. If you don’t see it listed, verify that it is in your Organization inventory and is not already bound to an existing network.
- Choose the MX uplink IP option you prefer and click Update. Finally, save your changes. For more information on adding a virtual IP, see details below.
After saving the warm spare configuration it may take 1-2 minutes for the sync process to complete between the two devices. You may notice the roles and status switching between the two devices in Dashboard during that period (as can be seen in the example below). Once the VRRP handshake and config sync process is complete, you should see the primary in the Current master status and the spare in the Passive, ready status.
Warm Spare VRRP Keepalive Connectivity: Switch Interconnect
Alternatively, a warm spare pair of MX appliances can be cabled together indirectly using the existing LAN switching for MX-to-MX communication.
In this arrangement there is no dedicated VRRP VLAN defined or MX-to-MX cabling. Instead, VRRP heartbeats are transmitted to each MX as broadcast messages through the LAN switches and existing VLANs.
While this is a supported warm spare design for NAT-mode MXs, I would recommend the direct-attached implementation for the reasons below.
Features of switch-interconnect warm spare:
- Triggers failover if LAN cabling or downstream switch ASIC fails. Any downstream connectivity failure will trigger a failover event.
- Reduced failure control. Since downstream processes like Spanning Tree topology errors, broadcast storms, or switch configuration mistakes can cut communication between MXs, the failover will occur for events outside of MX hardware failure (which could be undesirable).
- Simpler warm spare configuration. Using the switch fabric for heartbeat communication means no additional cabling or dedicated VLANs are required. This makes the initial configuration very easy, but problems are operationally more difficult to diagnose since there are more devices between the two MXs.
Switch-Interconnect MX Warm Spare Configuration
To configure a warm spare pair using this method, make sure the WAN and LAN ports are connected on both appliances then follow the directions in Step 4: Configure the Secondary Warm Spare MX above.
Warm Spare For MX Concentrator Mode
Now that we’ve covered the NAT-mode HA deployment details, let’s move onto concentrator mode MX warm spare design.
VPN Concentrator mode was specifically developed for data center deployments where the MX appliance is positioned behind existing DC firewall infrastructure. This leads to some unique data center advantages, like the ability to decouple WAN transport providers (public internet, private MPLS, etc.) from physical interfaces on the MX and instead use the appliance as a high-speed WAN aggregation concentrator.
In a concentrator deployment model, only a single interface is connected – namely WAN1. WAN1 is used to terminate both incoming flows from remote MX peers as well as process outbound flows.
Simple one-armed connectivity means that the MX appliance will only have a single IP address configured (and optionally a virtual IP, which we’ll discuss below). The diagram below shows the concentrator mode deployment upgraded to an HA configuration.
Concentrator MX Warm Spare Configuration
- Using the local status page on each appliance, configure a WAN1 IP address, gateway IP, and DNS server(s).
- Verify that the local status page is reporting healthy cloud connectivity. It’s not uncommon for data center appliances to require upstream firewall rules to be added to allow outbound cloud connectivity.
- From the Dashboard, navigate to primary MX network > Security appliance > Appliance status page, and select the Configure warm spare button.
- Click Enable, then choose the secondary MX under the Device serial dropdown. If you don’t see it listed, verify that it is in your Organization inventory and is not already bound to an existing network.
- Choose the MX uplink IP option you prefer and click Update. Finally, save your changes.
Should I Use a Virtual IP?
When configuring warm spare uplink IPs, an administrator has the option to simply use the existing WAN IPs on each uplink or to create a new virtual IP (vIP) to be shared by both units for each uplink.
By configuring a vIP, VPN traffic is sent to and received from the vIP rather than the physical WAN IP addresses of the individual appliances. In the event of an MX failure, the IPsec SA does not need to be reestablished, which results in rapid VPN convergence after a local or remote peer drop.
The vIP address has two requirements:
- It must be in the same subnet as the WAN uplinks
- It must be unique (it cannot be the same as either the primary or secondary unit’s IP address)
The unique vIP address requirement means that a total of four IP addresses would be required per uplink subnet:
- Primary MX WAN interface IP address
- Secondary MX WAN interface IP address
- Shared vIP interface IP address
- Gateway IP address
General guidance is to use a vIP when possible. The extra IP address required is well worth the operational benefit provided.
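The two vIP requirements can be checked ahead of time with Python's ipaddress module. This is a minimal sketch with hypothetical names, using the 203.0.113.0/29 documentation range as a stand-in for a real uplink subnet. The gateway collision check is my own addition, implied by the four-address count above rather than stated as a rule.

```python
import ipaddress


def vip_problems(subnet, primary_ip, secondary_ip, gateway_ip, vip):
    """Return reasons the proposed vIP is unusable; empty list means OK."""
    net = ipaddress.ip_network(subnet)
    addr = ipaddress.ip_address(vip)
    problems = []
    # Requirement 1: the vIP must sit in the same subnet as the WAN uplinks.
    if addr not in net:
        problems.append("vIP is outside the uplink subnet")
    # Requirement 2: the vIP must not reuse either MX's own WAN address.
    if addr in (ipaddress.ip_address(primary_ip),
                ipaddress.ip_address(secondary_ip)):
        problems.append("vIP duplicates an MX WAN address")
    # Extra sanity check (assumption): the vIP must not be the gateway either.
    if addr == ipaddress.ip_address(gateway_ip):
        problems.append("vIP duplicates the gateway address")
    return problems


if __name__ == "__main__":
    print(vip_problems("203.0.113.0/29", "203.0.113.2", "203.0.113.3",
                       "203.0.113.1", "203.0.113.4"))
```

With a gateway on .1, the two MXs on .2 and .3, and the vIP on .4, the check comes back clean and accounts for all four addresses in the list above.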
Troubleshooting Warm Spare Sync Issues
Issues related to warm spare syncing usually stem from one of two problems. Either one of the appliance WAN uplinks isn’t successfully registering to the cloud Dashboard or (more likely) the LAN-side VRRP heartbeats aren’t being bidirectionally received by both units.
Symptoms can include flapping master roles, active/active master roles, or one or more appliances not detected by Dashboard. This can lead to intermittent traffic drop behavior or poor throughput performance from LAN clients.
- Verify that each MX appliance is successfully connected to Dashboard. A solid white LED is a good indication for MX64/65 models. Connecting to the local status page takes a little more work, but ultimately is the best way to confirm cloud connectivity from each appliance.
- Review your MX-to-MX heartbeat connectivity. If the MXs are cabled together directly, verify that a dedicated heartbeat VLAN is used, the LAN interface mode is set to access, and the heartbeat VLAN is removed from any other MX LAN interfaces – including all trunk ports.
- If the MXs are using the downstream LAN switching to communicate, remove any direct MX-to-MX cables. Also verify that the switch uplinks to the MX are not in an error condition or Spanning Tree blocking/discarding state. This is the most common cause for warm spare sync issues.
- If all of the above checks pass and you are seeing a dual-master condition on the Appliance status page, then perform a packet capture on the secondary MX LAN interface.
If you see the appliance sending VRRP packets but not receiving any response, then the layer 2 switch path between the appliances needs to be re-examined.
The Wireshark output below shows healthy warm spare LAN behavior. Packet captures investigating improper warm spare behavior should always be taken from the secondary appliance LAN interface. Notice the bidirectional UDP packets. The source and destination addresses both use our heartbeat VLAN IP (22.214.171.124), which is shared by both appliances. More importantly, the source MAC addresses show both appliances participating in the exchange.
Another indication of a working VRRP process is the VRRP priority seen in a packet capture on the secondary appliance LAN interface. The source MAC address will be that of the virtual VRRP group and the priority should be 255. A priority value of 255 represents the master role and indicates the primary appliance hellos are being seen by the secondary MX.
If the MX appliances are unable to communicate across their LAN ports, a dual-active condition will occur. The packet capture below is taken from the secondary appliance and shows the broken behavior.
Notice that the VRRP priority is set to 235 – indicating the secondary is unable to see the primary’s packets which would be marked as 255. Also notice the missing UDP exchange we observed in the working model above. This indicates the devices are not on a shared layer 2 broadcast segment, filtering is occurring downstream, or an error condition (like switch port STP blocking) is at play.
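If you export the raw VRRP payload from a capture, the priority can be read without Wireshark. The sketch below relies on the standard VRRP header layout, where the first byte carries version and type, the second the virtual router ID, and the third the priority. The meaning attached to 255 and 235 comes from the captures described above. The function names and the sample bytes are made up for illustration.

```python
def vrrp_priority(payload: bytes) -> int:
    """Extract the priority field from a raw VRRP advertisement payload."""
    if len(payload) < 4:
        raise ValueError("payload too short to be a VRRP header")
    msg_type = payload[0] & 0x0F  # low nibble of byte 0 is the message type
    if msg_type != 1:
        raise ValueError("not a VRRP advertisement")
    return payload[2]  # byte 0: version/type, byte 1: VRID, byte 2: priority


def interpret_on_secondary(priority: int) -> str:
    """Classify a priority seen on the secondary's LAN, per the article."""
    if priority == 255:
        return "primary-seen"    # primary hellos are reaching the secondary
    if priority == 235:
        return "secondary-only"  # only the spare's own advertisements appear
    return "unknown"


if __name__ == "__main__":
    # Fabricated header bytes: version/type, VRID 1, priority, address count.
    healthy = bytes([0x21, 1, 255, 1])
    broken = bytes([0x21, 1, 235, 1])
    for sample in (healthy, broken):
        print(interpret_on_secondary(vrrp_priority(sample)))
```

Seeing only the "secondary-only" result across a capture is the scripted equivalent of the dual-active symptom shown above.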
My goal in this writeup on the details of MX warm spare design is to document proper HA design principles and explain the underlying mechanics. The reality is that enabling hardware redundancy with Meraki MX is very simple, but the details around your appliance mode and cabling certainly matter.
Staying within the recommended deployment guidelines laid out here will save you time and future frustration. If you do find problems with your setup, following the troubleshooting steps should expedite the recovery process and put you back on the path to a reliable, redundant Meraki MX architecture.