In the last blog post I spent a lot of words (and funky formulas) just to say that the DCI circuits between two Data Centers need to be larger than the biggest DC WAN link plus the inter-DC traffic (which increases your cost). When stretching layer 2 across DCs, there is not much that can be done to force DC ingress traffic to come in via the WAN link of the DC where the destination workload (VM) is running. However, there are some things you can do to force egress traffic to leave from the DC where the VM is located (and avoid using the DCI circuits), which reduces some of the cost associated with the DCI links. For this blog post I’m going to assume that we have Active/Active DCs.
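To make that sizing rule concrete, here is a minimal Python sketch. The function name and the Gbps figures are made up for illustration; the only thing taken from the rule above is "biggest DC WAN link plus inter-DC traffic".

```python
def dci_floor_gbps(dc_wan_links_gbps, inter_dc_traffic_gbps):
    """Return the bandwidth (in Gbps) that the DCI circuits must exceed.

    dc_wan_links_gbps: sizes of the DC WAN links (hypothetical inputs).
    inter_dc_traffic_gbps: expected DC-to-DC traffic (hypothetical input).
    """
    return max(dc_wan_links_gbps) + inter_dc_traffic_gbps


# Hypothetical example: WAN links of 10 and 40 Gbps, 5 Gbps of inter-DC traffic.
floor = dci_floor_gbps([10, 40], 5)
print(f"DCI circuits need to be larger than {floor} Gbps")
```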
Dual Default Gateways
When you stretch layer 2 across DCs, the default gateway for the stretched layer 2 segments could be physically located in just one of the two DCs. We don’t care about that use case (that would probably require a standalone blog post to talk about the cons of this design). Instead let’s assume that we have default gateway services in both DCs and that, to provide redundancy, each DC has two routers acting as default gateways and running an FHRP (something like VRRP), as shown in the diagram.
In this design, the VM will forward traffic to its local default gateway, which in turn will forward the traffic out of its local DC WAN. For this design to work there must be a mechanism to (1) stop each pair of DC default gateways from seeing each other (otherwise you won’t get both pairs of FHRP routers to be Active with the same virtual MAC) and (2) prevent a VM in one DC from receiving ARP replies for its default gateway from the other DC’s pair of default gateways. You could achieve this with access lists (too manual) or you could stretch the layer 2 with something like Cisco OTV, which has a built-in mechanism (less manual) to isolate FHRP in each DC.
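As a rough illustration of what that isolation has to block, here is a small Python sketch that classifies frames at the DCI edge for the VRRP case. The Frame type and the allowed_across_dci function are hypothetical names for this post, not how OTV or any vendor implements it. The VRRP constants (multicast group 224.0.0.18, IP protocol 112, IPv4 virtual MAC prefix 00:00:5e:00:01) come from the VRRP specification.

```python
from dataclasses import dataclass

VRRP_MULTICAST_IP = "224.0.0.18"      # VRRP advertisement group (IPv4)
VRRP_IP_PROTOCOL = 112                # IP protocol number for VRRP
VRRP_VMAC_PREFIX = "00:00:5e:00:01:"  # IPv4 VRRP virtual MAC prefix


@dataclass(frozen=True)
class Frame:
    """Hypothetical, heavily simplified view of a frame at the DCI edge."""
    src_mac: str
    dst_ip: str = ""
    ip_protocol: int = 0
    arp_sender_mac: str = ""  # set only for ARP frames


def allowed_across_dci(frame: Frame) -> bool:
    """Return False for frames that would break FHRP isolation."""
    # (1) Advertisements that would let the two gateway pairs see each other.
    is_vrrp_advert = (
        frame.dst_ip == VRRP_MULTICAST_IP
        and frame.ip_protocol == VRRP_IP_PROTOCOL
    )
    # (2) Anything sourced from the gateway virtual MAC, including ARP
    #     replies that carry it as the sender hardware address.
    from_vmac = (
        frame.src_mac.lower().startswith(VRRP_VMAC_PREFIX)
        or frame.arp_sender_mac.lower().startswith(VRRP_VMAC_PREFIX)
    )
    return not (is_vrrp_advert or from_vmac)
```

A real access list would also have to cover whatever FHRP you actually run (HSRP and VRRP use different groups and MAC ranges), which is part of why the manual approach gets tedious.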
This design does have some potential issues that must be taken into account. If each pair of default gateways uses different virtual MAC addresses when replying to ARP, a VM that moves DCs will lose connectivity (until it re-ARPs for its default gateway). Also, if both members of a pair of default gateways go down, you may have to remove the FHRP isolation to allow the impacted VMs to reach the default gateways in the other DC.
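With VRRP specifically, the IPv4 virtual MAC is derived from the VRID, so configuring the same VRID on both pairs is one way to keep the virtual MAC identical in both DCs. A quick sketch of that relationship (the function names are mine, for illustration only):

```python
def vrrp_virtual_mac(vrid: int) -> str:
    """IPv4 VRRP virtual MAC: 00:00:5e:00:01:<VRID as two hex digits>."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be between 1 and 255")
    return f"00:00:5e:00:01:{vrid:02x}"


def vm_keeps_gateway_after_move(vrid_old_dc: int, vrid_new_dc: int) -> bool:
    """The VM's cached gateway MAC stays valid only if both DCs
    answer ARP with the same virtual MAC."""
    return vrrp_virtual_mac(vrid_old_dc) == vrrp_virtual_mac(vrid_new_dc)


print(vrrp_virtual_mac(10))                 # 00:00:5e:00:01:0a
print(vm_keeps_gateway_after_move(10, 10))  # True
print(vm_keeps_gateway_after_move(10, 20))  # False
```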
Distributed Default Gateway (Top of Rack)
An alternative to dual default gateways is to stretch the layer 2 using VXLAN (or another tunneling protocol) from the Top of Rack (ToR) switches. In this design every ToR is a Layer 3 boundary and the default gateway for its own rack. Every time a ToR gets an ARP request for the default gateway, it responds to it locally (which also provides local FHRP isolation), as shown in the diagram below. Two examples of this are Arista’s DCI with VXLAN and VARP, and Brocade’s IP Fabric with Anycast gateway (for the time being, until Broadcom decides what to do with Brocade’s Network business).
One advantage of this design over the previous one is that there are a lot more routers (the ToRs) acting as default gateways, with built-in FHRP isolation. If a ToR dies, only the rack where it resides will be impacted, as opposed to the entire DC. Also, since all ToRs use the same virtual MAC, when a VM moves DCs it continues to have uninterrupted Layer 3 connectivity. One disadvantage is that you would need to fiddle with route advertisements to ensure the ToRs forward traffic straight up the local DC WAN; this may not be as easily done as it sounds.
Side note: there is a variation of this design where the ToRs are strictly layer 2 (let’s call them Leafs) and the distribution switches (henceforth Spines) do the Layer 3, thus providing the default gateway services.
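Here is a toy Python model of why the VM move is painless in this design. Every ToR answers ARP for the gateway locally with the same virtual MAC, so the VM’s cached entry stays valid no matter which rack (or DC) it lands in. The ToR class, the gateway IP and the virtual MAC below are all hypothetical values picked for the example, not any vendor’s implementation.

```python
from dataclasses import dataclass
from typing import Optional

GATEWAY_IP = "10.10.10.1"           # hypothetical shared gateway IP
GATEWAY_VMAC = "00:00:5e:00:01:01"  # hypothetical shared virtual MAC


@dataclass
class ToR:
    name: str
    dc: str

    def answer_arp(self, target_ip: str) -> Optional[str]:
        """Answer ARP for the gateway IP locally; this sketch does not
        model ARP for any other address."""
        if target_ip == GATEWAY_IP:
            return GATEWAY_VMAC
        return None


tor_dc1 = ToR("tor-a1", "dc1")
tor_dc2 = ToR("tor-b7", "dc2")

# The VM learns its gateway MAC from the ToR in DC1...
vm_arp_cache = {GATEWAY_IP: tor_dc1.answer_arp(GATEWAY_IP)}

# ...then moves to DC2, where the local ToR would give the same answer.
still_valid = vm_arp_cache[GATEWAY_IP] == tor_dc2.answer_arp(GATEWAY_IP)
print(still_valid)  # True
```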
Distributed Default Gateway (software)
Just like the physical version, you stretch the layer 2 using a tunneling protocol (like VXLAN or GENEVE), but you have a layer 3 process in each hypervisor that serves as the default gateway (e.g. a virtual router). Each virtual router will have the same IP and virtual MAC (thus VMs can move between DCs at will) and will locally respond to ARP requests. And like the physical version, you must manipulate routes to force each virtual router to send traffic to its local DC WAN, as shown in the diagram.
VMware’s NSX-v (distributed logical router) achieves this functionality. Each logical router is the “same” in each hypervisor except for their routing tables. Each logical router in each DC will get routes only relevant to it. This way, each logical router is “forced” to forward traffic using its local WAN.
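The "routes only relevant to it" idea can be sketched in a few lines of Python. This is only a conceptual model: the Route type, the site tag and the addresses are hypothetical, and it does not describe how NSX actually distributes routes.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    site: str  # hypothetical tag saying which DC the route belongs to


# Both DCs offer a default route out of their own WAN (documentation addresses).
ALL_ROUTES = [
    Route("0.0.0.0/0", "192.0.2.1", "dc1"),
    Route("0.0.0.0/0", "198.51.100.1", "dc2"),
]


def routes_for_site(routes: List[Route], site: str) -> List[Route]:
    """Keep only the routes tagged for the router's own DC."""
    return [r for r in routes if r.site == site]


def next_hop(routes: List[Route], destination: str) -> Optional[str]:
    """Longest-prefix match over the (already filtered) routing table."""
    addr = ipaddress.ip_address(destination)
    matches = [r for r in routes if addr in ipaddress.ip_network(r.prefix)]
    if not matches:
        return None
    best = max(matches, key=lambda r: ipaddress.ip_network(r.prefix).prefixlen)
    return best.next_hop


# Same logical router, different table per DC, so egress stays local.
print(next_hop(routes_for_site(ALL_ROUTES, "dc1"), "203.0.113.9"))  # 192.0.2.1
print(next_hop(routes_for_site(ALL_ROUTES, "dc2"), "203.0.113.9"))  # 198.51.100.1
```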
Elver’s Opinion: This blog post should (mostly) conclude my thoughts on stretching Layer 2 across the DCI (think hard before doing it). At first I thought I would use this post to also talk about local egress in NSX (to wrap up my thoughts on the matter), but as I wrote I realized I would need more space than I thought, so I’ll be writing another blog post just on local egress in NSX.