Thank you, 1970s, for giving us two great things: yours truly and TCP/IP. One thing TCP/IP assumes is that a subnet resides in a single location (you only have one gateway, and it must reside somewhere). However, developers love(d) to code so their application components reside in the same subnet (and same layer 2, so they don’t have to worry about default gateways and whatnot).
During DR (Disaster Recovery) scenarios it was typical to migrate an application to the backup DC without re-IPing it. So far so good; the subnet still resides in “one location” at a time. However, DR evolved to BC (Business Continuity - think about it, why drop a bunch of money on gear, space, and such only to not use it?) and Active/Active DCs, and our good friends the developers decided to make it an infrastructure technical requirement to stretch the layer 2 their applications were using across multiple DCs (heaven forbid they re-architect their applications or you suggest GLB, Global Load Balancing). TCP/IP is not happy. Elver neither.
All this presents a problem (sorry, an opportunity) to network designers. It is probably better to first illustrate it, and then explain it.
In the diagram, a user wants to reach the presentation layer of some application that serves requests out of two DCs. If the user wants to reach a VM that happens to be in DC2, there is no native way for the network to know where the VM resides and thus forward the traffic directly to DC2. It is a 50/50 chance of which DC will receive (ingress) the traffic (more on that below). This is because the network knows about subnets, not individual IPs. When a router does a lookup in its routing table to decide the next hop for a packet in transit, it looks for the most specific subnet (the longest matching prefix) in its routing table that matches the destination IP. If the router has two or more equal-cost next hops for the matching subnet, it selects one (mostly based on some hashing of the header of the packet in transit) and forwards the packet to the selected next hop.
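To make the lookup concrete, here is a minimal Python sketch of what a router does with a stretched subnet. Everything in it is an assumption for illustration: the prefixes, the next-hop names (`dc1-edge`, `dc2-edge`, `isp`), and the use of SHA-256 over the flow fields. Real routers use their own vendor-specific hash, so treat this as a model of the idea, not of any particular box.

```python
import hashlib
import ipaddress

# Hypothetical routing table: prefix -> list of equal-cost next hops.
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): ["isp"],
    ipaddress.ip_network("10.1.0.0/16"): ["dc1-edge"],
    # The stretched subnet is advertised out of both DCs.
    ipaddress.ip_network("10.1.20.0/24"): ["dc1-edge", "dc2-edge"],
}


def longest_prefix_match(dst):
    """Return the most specific route (largest prefix length) containing dst."""
    addr = ipaddress.ip_address(dst)
    candidates = [net for net in ROUTES if addr in net]
    return max(candidates, key=lambda net: net.prefixlen)


def pick_next_hop(src, dst, src_port, dst_port, proto="tcp"):
    """Pick one next hop by hashing the flow's header fields (ECMP-style)."""
    net = longest_prefix_match(dst)
    hops = ROUTES[net]
    flow = f"{src}|{dst}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(flow).digest()
    index = int.from_bytes(digest[:4], "big") % len(hops)
    return net, hops[index]
```

Calling `pick_next_hop("198.51.100.7", "10.1.20.5", 40000, 443)` matches the /24 and returns one of the two edges. Which one depends only on the header hash, not on where the VM at 10.1.20.5 actually lives, and that is the whole problem.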
If the user happens to be “closer” to DC2 than DC1, then it is most likely that the user’s traffic will ingress via DC2. However, “closer” is not about physical proximity but about network path cost and other variables. Also, the network is not a static entity; there are changes happening frequently enough that may affect the “closeness” of the user to the DC/VM.
Why am I telling you all this? Because recently I got into a lively conversation while discussing x-vCenter NSX. x-vCenter NSX allows for layer 2 to be stretched across multiple DCs while providing gateway/FHR (First Hop Router/Routing). There is nothing in NSX that can force the user’s ingress traffic via the DC where the destination VM is. If anyone ever tells you otherwise, whatever solution they provide is not unique to NSX but rather a general networking trick.
So what are those networking tricks? Here are some (not all-inclusive) of them with their potential impacts:
Active/Passive Ingress – Allow the layer 2 to be stretched across both DCs, but advertise the subnet out of only one of the two DCs. If this feels like cheating, it is because it is cheating. You only solve the ingress problem for some of the VMs, and not the others. You also really don’t have BC here because in case of the “Active” DC going down, some intervention will be required to advertise the subnet out of the “Passive” DC; there will be an outage for the application.
Active/"Active" Ingress – Here you advertise the subnet out of both DCs, but you make one DC look “really farther away” than the other by manipulating the cost of the subnet in the routing protocol (like BGP AS path prepending). You would have BC since network failover is automated, but again there is cheating here because you are (mostly) solving the problem for some of the VMs and not the others. Also, you could have users that are “so close” to the "backup" Active DC that no feasible amount of cost manipulation would affect them.
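A rough Python sketch of the prepending trick, and of why it fails for users that are “so close”. It is heavily simplified: it compares AS path length only, while real BGP best-path selection checks several attributes before it gets that far. The AS numbers and function names are made up for the example.

```python
DC1_AS, DC2_AS = 65001, 65002  # hypothetical private AS numbers


def advertise(origin_as, prepends=0):
    """AS path as it leaves the DC: the origin AS, repeated once per prepend."""
    return [origin_as] * (1 + prepends)


def received_path(transit_ases, advertised):
    """AS path as seen by the user's router after crossing the transit ASes."""
    return list(transit_ases) + advertised


def preferred_dc(paths):
    """Shortest AS path wins; ties go to the first name alphabetically.

    This ignores every other BGP tie-breaker (a deliberate simplification).
    """
    return min(sorted(paths), key=lambda dc: len(paths[dc]))


# User A sits one transit AS away from both DCs: prepending steers it to DC1.
user_a = {
    "DC1": received_path([64500], advertise(DC1_AS)),
    "DC2": received_path([64500], advertise(DC2_AS, prepends=3)),
}

# User B peers directly with DC2 and is five ASes away from DC1:
# three prepends are not enough to make DC2 look farther away.
user_b = {
    "DC1": received_path([64501, 64502, 64503, 64504, 64505], advertise(DC1_AS)),
    "DC2": received_path([], advertise(DC2_AS, prepends=3)),
}
```

Here `preferred_dc(user_a)` picks DC1, while `preferred_dc(user_b)` still picks DC2, regardless of which DC hosts the VM the user is after.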
Advertise Host Routes – There is nothing that prevents you from turning a VM’s IP into a /32 subnet and injecting that into the routing process. You can achieve this by adding a static route for each VM IP (/32) in the presentation layer and redistributing the routes into the routing process. Since you can’t get a subnet that is more specific than /32, there would never be a router (outside the DCs) with two equal-cost paths to the /32 pointing to different DCs. You truly get ingress traffic to the DC where the destination VM is. But before I continue explaining this one, let me just note that the burning sensation you are feeling right now on the back of your neck is the Operations Manager giving you the evil look. With this solution you SUBSTANTIALLY increase the size of the routing table and the complexity in the network. And this solution breaks down when a VM changes DCs, as there is no automated way to update where the /32 is being injected into the routing table.
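Here is a small Python sketch of the host-route trick, under the same kind of assumptions as before: the subnet, the VM IPs, and the `dc1-edge`/`dc2-edge` next-hop names are all invented for illustration. It shows the /32 winning the lookup, the table growing by one entry per VM, and the route going stale when a VM moves.

```python
import ipaddress

STRETCHED = ipaddress.ip_network("10.1.20.0/24")  # hypothetical stretched subnet


def build_table(vm_locations):
    """Build a routing table: the /24 from both DCs plus one /32 per VM."""
    table = {STRETCHED: ["dc1-edge", "dc2-edge"]}
    for ip, dc in vm_locations.items():
        table[ipaddress.ip_network(f"{ip}/32")] = [f"{dc}-edge"]
    return table


def lookup(table, dst):
    """Longest prefix match: a /32 always beats the covering /24."""
    addr = ipaddress.ip_address(dst)
    best = max((net for net in table if addr in net), key=lambda n: n.prefixlen)
    return table[best]


vms = {"10.1.20.11": "dc1", "10.1.20.12": "dc2"}
table = build_table(vms)

# The VM moves to DC1, but nobody updates the static /32: the table is stale.
vms["10.1.20.12"] = "dc1"
stale_hops = lookup(table, "10.1.20.12")
```

After the move, `stale_hops` still points at `dc2-edge` until someone rebuilds the table. An address with no host route, such as 10.1.20.99, falls back to the /24 and its two next hops. And with only two VMs the table already holds three entries; scale that to every VM in the presentation layer and you can feel the Operations Manager’s stare.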
Cisco LISP – To wrap it up, it is worth mentioning Cisco LISP (Locator/ID Separation Protocol). LISP attempts to solve the ingress situation by leveraging the /32 trick but restricting where the /32s are sent. The idea is to create a network “bubble” around the DCs and place LISP routers at the edge of the bubble. All users must reside outside of the bubble so all ingress traffic goes through the LISP routers. The LISP routers in turn communicate directly with the FHRs for the subnet in question (the stretched layer 2 in both DCs) to find out where each VM (IP) resides. When the user traffic reaches the LISP router, the LISP router looks up where the destination IP is located and forwards the traffic to the FHR (via a tunnel). If a VM moves DCs, the FHRs would update the LISP routers with the new VM (IP) location. The problem with this solution is the bubble. Where do you place the LISP routers? And what do you do in a brownfield deployment? It can get expensive and very complicated to achieve.
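The lookup-then-tunnel idea can be modeled in a few lines of Python. To be clear, this is a toy, not the LISP protocol: there are no real control-plane messages, caches, or encapsulation headers here. It only borrows LISP’s terms EID (the VM’s IP) and RLOC (the address of the FHR that can reach it), and the class, function, and address values are hypothetical.

```python
class MappingTable:
    """Toy EID-to-RLOC database kept by the routers at the edge of the bubble."""

    def __init__(self):
        self._eid_to_rloc = {}

    def register(self, eid, rloc):
        """Called on behalf of an FHR when a VM (EID) shows up behind it."""
        self._eid_to_rloc[eid] = rloc

    def resolve(self, eid):
        """Return the RLOC for this EID, or None if the VM is unknown."""
        return self._eid_to_rloc.get(eid)


def ingress_forward(mapping, packet):
    """Look up the destination VM and wrap the packet in a tunnel to its FHR."""
    rloc = mapping.resolve(packet["dst"])
    if rloc is None:
        return None  # no mapping: this toy model simply drops the packet
    return {"outer_dst": rloc, "inner": packet}


DC1_FHR, DC2_FHR = "192.0.2.1", "192.0.2.2"  # hypothetical FHR addresses

mapping = MappingTable()
mapping.register("10.1.20.12", DC2_FHR)
packet = {"src": "198.51.100.7", "dst": "10.1.20.12"}

before_move = ingress_forward(mapping, packet)

# The VM moves to DC1 and the mapping is refreshed, so the tunnel follows it.
mapping.register("10.1.20.12", DC1_FHR)
after_move = ingress_forward(mapping, packet)
```

The contrast with static host routes is that the location update is part of the design: `before_move` is tunneled to the DC2 FHR and `after_move` to the DC1 FHR, with no change visible to routers outside the bubble.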
Elver’s Opinion: As Developers continue to better understand the impact to infrastructure of their design decisions (DevOps), they are building applications that work within the constraints of infrastructure protocols (Cloud Native Apps). So the need to stretch layer 2 across DCs is becoming less and less of an infrastructure technical requirement.