Thank you 1970s for giving us two great things: yours truly
and TCP/IP. One thing TCP/IP assumes is that a subnet resides in a single
location (you only have one gateway, and it must reside somewhere). However,
developers love(d) to code so their application components reside in the same
subnet (and same layer 2 so they don’t have to worry about default gateways and
whatnot).
During DR (Disaster Recovery) scenarios it was typical to
migrate an application to the backup DC without re-IPing it. So far so good;
subnet still resides in “one location” at a time. However, DR evolved to BC
(Business Continuity - think about it, why drop a bunch of money on gear, space,
and such not to use it?) and Active/Active DCs, and our good friends the
developers decided to make it an infrastructure technical requirement to
stretch the layer 2 their applications were using across multiple DCs (heaven
forbid they would re-architect their applications or that you suggest GLB). TCP/IP is not happy. Elver neither.
All this presents a problem (an opportunity) for network
designers. It is probably better to first illustrate it, and then explain it.
In the diagram, a user wants to reach the presentation layer
of some application that serves requests out of two DCs. If the user wants to
reach a VM that happens to be in DC2, there is no native way for the network to
know where the VM resides and thus forward the traffic directly to DC2. It is a
50/50 chance of which DC will receive (ingress) the traffic (more on that
below). This is because the network knows about subnets, not individual IPs.
When a router does a lookup in its routing table to decide the next hop for a
packet in transit, it looks for the most specific subnet (the longest prefix) in
its routing table that matches the destination IP. If the router has two or more
equal-cost next hops for the matching subnet, it selects one (mostly based on a
hash of the header of the packet in transit) and forwards the packet to the
selected next hop.
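To make the lookup concrete, here is a minimal Python sketch of it. The prefixes, the next-hop names, and the use of CRC32 as the hash are all made up for illustration; real routers use their own hardware-specific hashing over header fields.

```python
import ipaddress
import zlib

# Hypothetical routing table: prefix -> list of equal-cost next hops.
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): ["isp"],
    ipaddress.ip_network("10.1.0.0/16"): ["core"],
    # The stretched subnet is advertised out of both DCs.
    ipaddress.ip_network("10.1.20.0/24"): ["dc1-edge", "dc2-edge"],
}


def lookup(dst_ip, flow_key=""):
    """Return (matching prefix, chosen next hop) for a destination IP."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [net for net in ROUTES if dst in net]
    if not matches:
        return None
    # Longest prefix match: the most specific subnet wins.
    best = max(matches, key=lambda net: net.prefixlen)
    hops = ROUTES[best]
    # With several equal-cost next hops, hash the flow to pick one.
    # flow_key stands in for the header fields a real router would hash.
    index = zlib.crc32(flow_key.encode()) % len(hops)
    return best, hops[index]
```

Calling lookup("10.1.20.15", "some-flow") matches the /24 and returns either dc1-edge or dc2-edge depending only on the hash. Nothing in the lookup knows which DC actually hosts 10.1.20.15.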
If the user happens to be “closer” to DC2 than DC1, then it
is most likely that the user’s traffic will ingress via DC2. However, “closer”
is not about physical proximity but about network path cost and other
variables. Also, the network is not a static entity; there are changes
happening frequently enough that may affect the “closeness” of the user to the
DC/VM.
Why am I telling you all this? Because recently I got into a
lively conversation while discussing x-vCenter NSX. x-vCenter NSX allows for
layer 2 to be stretched across multiple DCs while providing gateway/FHR (First Hop Router/Routing). There is
nothing in NSX that can force the user’s ingress traffic via the DC where the
destination VM is. If anyone ever tells you otherwise, whatever solution they
provide is not unique to NSX but rather a general networking trick.
So what are those networking tricks? Here are some (not
all-inclusive) of them with their potential impacts:
Active/Passive Ingress – Allow the layer 2 to be stretched
across both DCs, but advertise the subnet out of only one of the two DCs. If
this feels like cheating, it is because it is cheating. You only solve the ingress
problem for some of the VMs, and not the others. You also really don’t have BC
here because in case of the “Active” DC going down, some intervention will be
required to advertise the subnet out of the “Passive” DC; there will be an
outage for the application.
Active/"Active" Ingress – Here you advertise the subnet out
of both DCs, but you make one DC look “really farther away” than the other by
manipulating the cost of the subnet in the routing protocol (like BGP AS path
prepending). You would have BC since network failover is automated, but again
there is cheating here because you are (mostly) solving the problem for some of the
VMs and not the others. Also, you could have users that are “so close” to the "backup" Active DC that no feasible amount of cost manipulation would affect them.
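A rough way to see why cost manipulation only goes so far is the Python sketch below. It assumes, as a deliberate simplification, that the only thing a remote router compares is AS path length; real BGP evaluates other attributes first (local preference, for one). The function name and the hop counts are hypothetical.

```python
def preferred_dc(as_hops_dc1, as_hops_dc2, prepends_on_dc2):
    """Pick the DC a remote router would prefer, by AS path length only.

    Simplification: real BGP best-path selection checks other attributes
    before AS path length. All inputs are illustrative numbers.
    """
    # Prepending makes the DC2 advertisement look farther away.
    effective_dc2 = as_hops_dc2 + prepends_on_dc2
    if as_hops_dc1 < effective_dc2:
        return "DC1"
    if effective_dc2 < as_hops_dc1:
        return "DC2"
    return "tie"
```

A user equally far from both DCs is pushed to DC1 by three prepends: preferred_dc(3, 3, 3) returns "DC1". A user sitting one AS away from DC2 but eight from DC1 still lands in DC2: preferred_dc(8, 1, 3) returns "DC2".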
Advertise Host Routes – There is nothing that prevents you from turning a
VM’s IP into a /32 subnet and injecting that into the routing process. You can
achieve this by adding a static route for each VM IP (/32) in the presentation
layer and redistributing the routes into the routing process. Since you can’t
get a subnet that is more specific than /32, there would never be a router
(outside the DCs) with two equal-cost paths to the /32 pointing to
different DCs. You truly get ingress traffic to the DC where the destination VM
is. But before I continue explaining this one, let me just note that the
burning sensation you are feeling right now on the back of your neck is the Operations
Manager giving you the evil look. With this solution you SUBSTANTIALLY increase
the size of the routing table and the complexity of the network. This solution also breaks down when a VM changes DCs, as there is no automated way to update where the /32 is being injected into the routing table.
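Here is a small Python sketch of the host route trick and its two costs. The subnet, the VM addresses, and the edge names are invented for illustration, and the static dictionary stands in for routes you would have to maintain by hand.

```python
import ipaddress


def build_table(subnet, vm_locations):
    """Build a toy routing table: the stretched subnet plus one /32 per VM."""
    # The subnet itself is still advertised out of both DCs.
    table = {ipaddress.ip_network(subnet): ["dc1-edge", "dc2-edge"]}
    for ip, dc in vm_locations.items():
        # One extra route per VM, pointing only at the DC that hosts it.
        table[ipaddress.ip_network(f"{ip}/32")] = [f"{dc}-edge"]
    return table


def next_hops(table, dst_ip):
    """Longest prefix match: the /32, when present, always beats the /24."""
    dst = ipaddress.ip_address(dst_ip)
    best = max((net for net in table if dst in net),
               key=lambda net: net.prefixlen, default=None)
    return table[best] if best is not None else []


# Hypothetical VM placement.
vms = {"10.1.20.15": "dc2", "10.1.20.16": "dc1"}
table = build_table("10.1.20.0/24", vms)
```

With this table, traffic for 10.1.20.15 has exactly one next hop (dc2-edge), while an address without a /32 falls back to the /24 and both DCs. The table also grew from one route to three for two VMs, and if 10.1.20.15 moves to DC1 the /32 keeps pointing at dc2-edge until somebody changes it.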
Cisco LISP – To wrap it up it is worth mentioning Cisco LISP
(Locator ID Separation Protocol). LISP attempts to solve the ingress situation
by leveraging the /32 trick but restricting where the /32 are sent. The idea is
to create a network “bubble” around the DCs and place LISP routers at the edge of
the bubble. All users must reside outside of the bubble so all ingress traffic
goes through the LISP routers. The LISP routers in turn communicate directly
with the FHRs for the subnet in question (the stretched layer 2 in both DCs) to find out where each VM (IP)
resides. When the user traffic reaches the LISP router, the LISP router looks
up where the destination IP is located and forwards the traffic to the FHR (via
a tunnel). If a VM moves DCs, the FHRs would update the LISP routers with the
new VM (IP) location. The problem with this solution is the bubble. Where do you place the LISP routers? And what do you do in a brownfield deployment? It can get expensive and very complicated to achieve.
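The idea of looking up where an IP lives, and updating that answer when the VM moves, can be sketched in a few lines of Python. This is a toy model of the concept only, not the LISP protocol or any Cisco API; the class, its method names, and the FHR names are all hypothetical.

```python
import ipaddress


class MappingSystem:
    """Toy stand-in for the VM-to-location lookup a LISP router relies on."""

    def __init__(self):
        # VM IP -> the FHR (and therefore the DC) currently hosting it.
        self._location = {}

    def register(self, vm_ip, fhr):
        """Called when an FHR learns a VM, including after a move."""
        self._location[ipaddress.ip_address(vm_ip)] = fhr

    def tunnel_endpoint(self, dst_ip):
        """Return the FHR to tunnel user traffic to, or None if unknown."""
        return self._location.get(ipaddress.ip_address(dst_ip))


mapping = MappingSystem()
mapping.register("10.1.20.15", "dc1-fhr")
before_move = mapping.tunnel_endpoint("10.1.20.15")
# The VM moves to DC2; the new FHR registers it and the answer changes.
mapping.register("10.1.20.15", "dc2-fhr")
after_move = mapping.tunnel_endpoint("10.1.20.15")
```

Unlike the static /32 approach, the location follows the VM, but only for traffic that passes through a router consulting this mapping, which is exactly why the bubble is needed.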
Elver’s Opinion: As Developers
continue to better understand the impact to infrastructure of their design decisions
(DevOps), they are building applications that work within the constraints of infrastructure
protocols (Cloud Native Apps). So the need to stretch layer 2 across DCs is
becoming less and less of an infrastructure technical requirement.

 