Wednesday, April 20, 2016

A vCenter, NSX Manager, Multiple DCs...and I Can't Reach Them

Note: I wrote this post in somewhat of a rush and didn’t have time to do diagrams; I had been on a bit of a hiatus and wanted to add something to the blog, so I planned to add the diagrams at a later date. Update: the diagrams have been added. Also, a question asked by a good friend inspired this post.

You conceded the point, and your team will be using a single vCenter to manage multiple physical Data Centers. All right, not the end of the world; you’ll be fine. But a few developers are requiring Layer 2 across some of those Data Centers. Again, not the end of the world; besides, that’s what NSX is for. However, do you understand what the impact on the Virtual Workloads in the stretched Layer 2 (VXLAN) would be if one of those physical Data Centers loses network connectivity to the Management Plane (vCenter, NSX Manager, etc.) and the Control Plane (NSX Controllers, Logical Router Control VMs, etc.)? To keep the focus on the network impact, we will assume that the Virtual Workloads are using Storage local to their Data Center.

Figure 1: DC isolation

Elver’s Opinion: Since we have Logical Switches distributed among multiple physical Data Centers, I’ll make the very safe assumption that you won’t be doing Layer 2 Bridging. If you are trying to do Layer 2 Bridging, call me so I can talk you out of it.

To get the obvious out of the way: if you don’t have the Management Plane, you can forget about vMotion, Storage vMotion, and any NSX configuration changes.

Let’s tackle the slightly less obvious impact within Layer 2. Virtual Workloads within the impacted Data Centers, and in the same Logical Switch, will be able to talk to each other via VXLAN (Overlay), as well as to Virtual Workloads running in VTEPs in other Data Centers they can still reach. NSX is built so that the VTEPs (ESXi hosts) continue to communicate with each other even when the NSX Controllers are not reachable. There will be an uptick in Broadcasts (specifically ARP Requests) and Unknown Unicast traffic being replicated by the VTEPs, but the uptick shouldn’t have much of an impact. At the Control Plane, assuming the NSX Controllers are still operational, they will remove the "isolated" VTEPs, and their associated entries, from all their tables (Connection, VTEP, MAC, ARP) and tell the "reachable" VTEPs to remove the "isolated" VTEPs from their VTEP Tables.

Figure 2: Intra-Logical Switch
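The table cleanup described above can be sketched roughly as follows. This is a hypothetical illustration of the pruning logic only, not the actual NSX Controller implementation; the table layout and VTEP names are made up:

```python
# Illustrative model of the Controller tables: a connection list, plus
# per-VNI VTEP, MAC, and ARP tables mapping each learned entry to the
# VTEP that reported it. (Purely a sketch; not the real NSX schema.)

def prune_isolated_vteps(tables, isolated):
    """Drop isolated VTEPs, and everything learned behind them, from all tables."""
    # Connection table: drop the isolated VTEPs' control-plane sessions.
    tables["connection"] = [v for v in tables["connection"] if v not in isolated]
    # VTEP table per VNI: drop the isolated tunnel endpoints.
    for vni in tables["vtep"]:
        tables["vtep"][vni] = [v for v in tables["vtep"][vni] if v not in isolated]
    # MAC and ARP tables: drop entries learned behind an isolated VTEP.
    for name in ("mac", "arp"):
        for vni in tables[name]:
            tables[name][vni] = {entry: vtep
                                 for entry, vtep in tables[name][vni].items()
                                 if vtep not in isolated}
    return tables
```

The Controller would then push the updated VTEP Table to the remaining, reachable VTEPs.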

If two Virtual Workloads within the impacted Data Centers are in different Logical Switches (VXLANs) and those Logical Switches connect to the same Logical Router, the Virtual Workloads will be able to talk to each other; from the Logical Router’s perspective both subnets are directly connected.

Figure 3: Inter-Logical Switch - Same Logical Router
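A minimal way to picture why no Edge is needed in this case: both subnets appear as connected routes on the Logical Router, so a forwarding lookup resolves locally. The subnets and LIF names below are made up for illustration:

```python
import ipaddress

# Hypothetical DLR forwarding table: both Logical Switch subnets are
# directly connected LIFs, so routing between them needs neither an
# Edge appliance nor control-plane reachability.
CONNECTED = {
    ipaddress.ip_network("172.16.10.0/24"): "LIF-Web",
    ipaddress.ip_network("172.16.20.0/24"): "LIF-App",
}

def lookup(dst):
    """Return the egress LIF for a destination, or None if not connected."""
    addr = ipaddress.ip_address(dst)
    for net, lif in CONNECTED.items():
        if addr in net:
            return lif
    return None
```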

The not-so-obvious part (because of the dependencies involved) is the impact on Layer 3 traffic that does not stay confined within the same Logical Router. The impact can be narrowed down to two types of traffic flows: one where the Source and Destination Workloads hang off different Logical Routers as their default gateways, and another where the second Workload is not connected to a Logical Switch (think a Physical Workload or a Virtual Machine in a VLAN):

Elver’s Opinion: Type 2 flows are basically Layer 3 traffic between Virtual and Physical networks.

Type 1:
If the Source and Destination Workloads hang off different Logical Routers, then you need an NSX Edge or another NFV Appliance to do the routing (two Logical Routers can’t connect to the same Logical Switch nor to the same dvPortgroup). Is this Appliance within the impacted Data Centers? If not, the two Workloads won’t be able to talk to each other because there would be no logical path for the flow to reach the Appliance that does the routing (remember, the impacted Data Centers have some sort of "isolation").

Figure 4: Inter-Logical Switch - Different Logical Routers

If the Appliance is within the impacted Data Centers, then the two Workloads may reach each other. I say "may" because it all depends on whether there is a routing protocol running between the Logical Routers and the Appliance. If you are using static routes, then yes, the two Workloads can talk to each other. But if you are running a routing protocol, can the Logical Routers’ Control VMs reach the Appliance to exchange routing control traffic? If the Appliance loses its adjacency to one or both of the Logical Routers’ Control VMs, the Appliance will remove the routes to those Workloads’ subnets from its routing table, making those subnets unreachable to it.
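The static-versus-dynamic distinction boils down to one line. A sketch with made-up prefixes; `adjacency_up` stands in for whether the Appliance still has a routing-protocol session with the Control VM:

```python
def appliance_routes(static_routes, dynamic_routes, adjacency_up):
    """Routes the Edge/NFV Appliance retains toward the Logical Router subnets.

    Static routes survive a control-plane partition; dynamically learned
    routes are withdrawn once the adjacency to the Control VM drops.
    """
    return list(static_routes) + (list(dynamic_routes) if adjacency_up else [])
```

With static routes in place the Workload subnets stay in the Appliance’s routing table no matter what happens to the Control VM; with only a routing protocol, losing the adjacency empties them out.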

While still on Type 1 flows, it is worth pointing out that if two Workloads in the impacted Data Centers, but behind different Logical Routers, can talk to each other, then Virtual Workloads in the non-impacted Data Centers hanging off those same Logical Routers will NOT be able to communicate with each other, because they won’t be able to reach the NSX Edge or NFV Appliance.

Type 2:
If the second Workload is not connected to a Logical Switch (a Physical Workload or a VM in a VLAN), we definitely need a Perimeter Edge or an NFV Appliance with a connection to a VLAN dvPortgroup. We will assume that we are running a routing protocol. This is similar to the Layer 3 Type 1 flow, but with a few variants.

Figure 5: Virtual to Physical

Variant 1: The Appliance can reach the Logical Router Control VM AND the second Workload is in one of the impacted Data Centers. In this instance, communication between the Workloads will happen. However, a Virtual Workload hanging off the same Logical Router but not within the impacted Data Centers will NOT be able to talk to the second Workload, because it won’t be able to reach the Appliance.

Variant 2: The Appliance can reach the Logical Router Control VM AND the second Workload is not in one of the impacted Data Centers. In this case the two won’t be able to communicate; remember the impacted Data Centers are “isolated” thus no traffic can come in or go out.

Variant 3: The Appliance can’t reach the Logical Router Control VM. It doesn’t matter where the second Workload is, because the Appliance will remove the route to the Virtual Workload’s network from its routing table, making that subnet unreachable to it; thus the Workloads won’t be able to talk to each other. However, if you are using static routes, refer back to Variants 1 and 2.

Variant 4: The Appliance is not in the impacted Data Centers. In this case there is no way for the Logical Router to reach the Appliance, thus the Workloads won’t be able to talk to each other.
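The four variants above collapse into a small decision function. This is just a hypothetical condensation of the prose, with each branch labeled by the variant it encodes:

```python
def type2_reachable(appliance_in_impacted_dc, control_vm_reachable,
                    second_workload_in_impacted_dc, static_routes=False):
    """Can the Virtual Workload and the second (Physical/VLAN) Workload talk?"""
    if not appliance_in_impacted_dc:
        return False                 # Variant 4: no path to the Appliance at all
    if not control_vm_reachable and not static_routes:
        return False                 # Variant 3: routes withdrawn from the Appliance
    # Variants 1 and 2 (and Variant 3 falling back to static routes):
    # traffic can only flow if the second Workload sits inside the
    # "isolated" Data Centers too.
    return second_workload_in_impacted_dc
```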

To wrap it up, please note that if the Perimeter Edge or NFV Appliance is located in one of the impacted Data Centers, no Virtual-to-Physical network traffic in the non-impacted Data Centers will be possible.
