Monday, February 13, 2017

DC Egress Traffic with Stretched Layer 2

In the last blog I spent a lot of writing (and funky formulas) just to say that the DCI circuits between two Data Centers need to be larger than the biggest DC WAN link plus the inter-DC traffic (which increases your cost). When stretching layer 2 across DCs, there is not much that can be done to force DC ingress traffic to come in via the DC WAN link where the destination workload (VM) is running. However, there are some things you can do to force the egress traffic to go out the DC where the VM is located (and avoid using the DCI circuits) to reduce some of the cost associated with the DCI links. For this blog post I’m going to assume that we have Active/Active DCs.

Dual Default Gateways
When you stretch the layer 2 across the DCs, the default gateway for the stretched layer 2 segments could be physically located in one of the two DCs. We don’t care about that use case (that would probably require a standalone blog post to talk about the cons of this design). Instead let’s assume that we have default gateway services in both DCs, and to provide redundancy, we will have two routers as default gateways in each DC, running an FHRP (something like VRRP), as shown in the diagram.

In this design, the VM will forward traffic to its local default gateway, which in turn will forward the traffic out of its local DC WAN. For this design to work there must be a mechanism (1) to stop each pair of DC default gateways from seeing each other (otherwise you won’t get both pairs of FHRP routers to be Active with the same virtual MAC) and (2) to prevent a VM in one DC from receiving ARP replies for its default gateway from the other DC’s pair of default gateways. You could achieve this with access lists (too manual) or you could stretch the layer 2 with something like Cisco OTV, which has a built-in mechanism (less manual) to isolate FHRP in each DC.

This design does have some potential issues that must be taken into account. If each pair of default gateways uses different virtual MAC addresses when replying to ARP, a VM that moves DCs will lose connectivity (until it re-ARPs for its default gateway). Also, if both members of a pair of default gateways go down, you may have to remove the FHRP isolation to allow the impacted VMs to reach the default gateways in the other DC.

Distributed Default Gateway (Top of Rack)
An alternative to dual default gateways is to stretch the layer 2 using VXLAN (or another tunneling protocol) from the Top of Rack (ToR). In this design all ToRs will have the Layer 3 boundary and be the default gateways for their own racks. Every time a ToR gets an ARP request, it will respond to it (and provide local FHRP isolation), as shown in the diagram below. Two examples of this are Arista’s DCI with VXLAN and VARP, and Brocade’s IP Fabric with Anycast gateway (for the time being, until Broadcom decides what to do with Brocade’s Network business).

One advantage of this design over the previous one: there are a lot more routers (the ToRs) acting as default gateways, with built-in FHRP isolation. If a ToR dies, only the rack where it resides will be impacted as opposed to the entire DC. Also, since all ToRs use the same virtual MAC, when a VM moves DCs it continues to have uninterrupted Layer 3 connectivity. One disadvantage is that you would need to fiddle with route advertisements to ensure the ToRs forward traffic straight up the local DC WAN; this may not be as easily done as it sounds.
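
The VM-move behavior of the two designs can be sketched in a few lines of Python. This is a toy model, not vendor behavior: the `Gateway` and `VM` classes, the DC names and the MAC addresses are all made up for illustration. The VM caches its gateway's MAC from ARP, and after a move its traffic only gets routed if the local gateway answers to that cached MAC.

```python
from dataclasses import dataclass


@dataclass
class Gateway:
    """A default gateway (FHRP pair or ToR) that owns one virtual MAC."""
    dc: str
    virtual_mac: str

    def accepts(self, dst_mac: str) -> bool:
        # A gateway only routes frames addressed to its own virtual MAC.
        return dst_mac == self.virtual_mac


@dataclass
class VM:
    """A VM that caches the gateway MAC it learned via ARP."""
    cached_gw_mac: str = ""

    def arp(self, local_gw: Gateway) -> None:
        self.cached_gw_mac = local_gw.virtual_mac

    def can_route(self, local_gw: Gateway) -> bool:
        return local_gw.accepts(self.cached_gw_mac)


def survives_move(gw_dc1: Gateway, gw_dc2: Gateway) -> bool:
    """True if a VM that ARPed in DC1 still has a working gateway in DC2."""
    vm = VM()
    vm.arp(gw_dc1)               # VM boots in DC1 and learns the gateway MAC
    return vm.can_route(gw_dc2)  # VM moves to DC2 with a stale ARP cache


# Hypothetical MACs: gateways in each DC configured with different virtual MACs
different_macs = survives_move(Gateway("DC1", "00:00:5e:00:01:01"),
                               Gateway("DC2", "00:00:5e:00:01:02"))
# Hypothetical MAC: every gateway answers with the same virtual MAC
same_mac = survives_move(Gateway("DC1", "00:00:5e:00:01:01"),
                         Gateway("DC2", "00:00:5e:00:01:01"))
print(different_macs, same_mac)
```

With different virtual MACs the moved VM is stuck until it re-ARPs; with a shared virtual MAC the stale cache entry is still valid in the new DC.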

Side note: there is a variation of this design where the ToR are strictly layer 2 (let’s call them Leafs) and the distribution switches (henceforth Spines) do the Layer 3, thus providing the default gateway services.

Distributed Default Gateway (software)
Just like the physical version, you stretch the layer 2 using a tunneling protocol (like VXLAN or GENEVE) but you have a layer 3 process in each hypervisor that serves as the default gateway (e.g. virtual router). Each virtual router will have the same IP and virtual MAC (thus VMs can move between DCs at will) and locally respond to ARP requests. And like the physical version, you must manipulate routes to force each virtual router to send traffic to its local DC WAN, as shown in the diagram.

VMware’s NSX-v (distributed logical router) achieves this functionality. Each logical router is the “same” in each hypervisor except for their routing tables. Each logical router in each DC will get routes only relevant to it. This way, each logical router is “forced” to forward traffic using its local WAN.

Elver’s Opinion: This blog post should (mostly) conclude my thoughts on stretching Layer 2 across the DCI (think hard before doing it). At first I thought I would use this blog to also talk about local egress in NSX (to wrap up my thoughts on the matter), but as I wrote I realized I would need more space than I thought, so I’ll be writing another blog post just on local egress in NSX.

Friday, January 13, 2017

Impact of Stretched Layer 2 on DCI

I was not clear in my DC Ingress blog post as to why it matters which is the entry/exit point for flows coming from/going outside the DCs for the application that is using the stretched layer 2 in an infrastructure supporting BC with an Active/Active WAN architecture. One word can summarize why it matters: cost. The moment you allow traffic that is not sourced from or destined to the DCs to go over the links between the Data Centers, that link becomes a transit segment and you must increase its speed to accommodate the additional traffic.

Let me put back up the $1 diagram I used last time, but now showing the connection between the Data Centers (the Data Center Interconnect, or DCI).

The DCI between the two DCs needs to be big enough to handle all inter-DC traffic (traffic with source and destination of the DCs; doesn’t include transit traffic coming from/going outside the DCs). Let’s call traffic from DC1 to DC2 DCI1 and DC2 to DC1 DCI2. The speed of your DC1 WAN circuit must be as big as the amount of ingress traffic in DC1. Same goes for DC2. If we call the DC ingress traffic DCi1 and DCi2, and we are not doing any sort of route manipulation, then some DCi1 traffic will transit the DCI to reach VMs in DC2 and some DCi2 traffic will transit the DCI to reach VMs in DC1.

Since we don’t know how much “some” is going to be, we should architect for worst-case scenario, like a WAN disruption changing flow patterns, or risk having some traffic dropped before it goes over the DCI. So this is how much traffic the DCI would have to handle:

If DCI1 + DCi1 > DCI2 + DCi2 then DCI1 + DCi1, else DCI2 + DCi2

What this little formula says is that the speed of the DCI link must be as big as the larger of traffic from DC1 or from DC2 (I’m making the assumption the DCI is symmetrical; none of that asymmetrical bandwidth you get from your home ISP).

But this formula is not complete. You see, the VMs will be sending traffic back to the user (egress traffic). Let’s pretend the traffic flow goes back the same way the ingress traffic came (worst-case again, as we can't predict what would happen in the WAN). Using DCe1 to represent the VMs in DC1 replying back to the user and DCe2 to represent the VMs in DC2 replying back to the user, the formula becomes this:

If DCI1 + DCi1 + DCe1 > DCI2 + DCi2 + DCe2 then DCI1 + DCi1 + DCe1, else DCI2 + DCi2 + DCe2
This formula is a bit long, so let’s do some thinking and see if we can simplify this. Since we are architecting for worst-case scenario and we are thinking BC, we can use the larger of DCi1 or DCi2 and call it DCiB. DCiB will be coming in one of the WAN circuits of the DCs. Let’s give the same treatment to DCe1 and DCe2, and call it DCeB. DCeB will be going out of one of the WAN circuits in the DCs.

Elver’s Opinion: Since flow patterns are never static, it is a good idea to make the WAN circuits in both DCs the same size, and be the larger of DCiB or DCeB.

For sizing our DCI, we actually care about the larger of DCiB or DCeB; let’s call it DCB. The reason for this is that in the event of WAN failure at DC2 all ingress traffic comes in DC1 and transits over the DCI to DC2, and all egress traffic will go from DC2 and transit the DCI to DC1 (and the following week, flow patterns reverse). This allows us to replace DCiX + DCeX with DCB.

We now make some substitutions to get this:

If DCI1 + DCB > DCI2 + DCB then DCI1 + DCB, else DCI2 + DCB

Which can be rewritten as:

If DCI1 > DCI2 then DCI1 + DCB, else DCI2 + DCB

All of this writing and formulas just to say that the DCI speed must be at least as big as your largest DC WAN circuit plus the largest inter-DC traffic. Or put another way, the DCI circuit speed will be the inter-DC traffic plus the transit traffic…and transit traffic we established is the ingress/egress traffic in support of the application that is using the stretched layer 2.
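
For anyone who would rather plug numbers in than read formulas, here is a minimal Python sketch of the final sizing rule. The function and parameter names are mine (the post only uses DCI1/DCI2, DCi1/DCi2 and DCe1/DCe2), and the sample figures are invented, so treat it as an illustration of the arithmetic rather than a sizing tool.

```python
def dci_size(inter_dc1, inter_dc2, ingress1, ingress2, egress1, egress2):
    """Return the worst-case DCI bandwidth, in the same unit as the inputs.

    inter_dc1, inter_dc2 : inter-DC traffic, DC1->DC2 and DC2->DC1 (DCI1, DCI2)
    ingress1, ingress2   : DC ingress traffic (DCi1, DCi2)
    egress1, egress2     : DC egress traffic (DCe1, DCe2)
    """
    dcib = max(ingress1, ingress2)  # DCiB: the larger ingress
    dceb = max(egress1, egress2)    # DCeB: the larger egress
    dcb = max(dcib, dceb)           # DCB: the larger of the two
    # "If DCI1 > DCI2 then DCI1 + DCB, else DCI2 + DCB"
    return max(inter_dc1, inter_dc2) + dcb


# Made-up figures, all in Gbps
required = dci_size(inter_dc1=4, inter_dc2=6,
                    ingress1=3, ingress2=5,
                    egress1=8, egress2=2)
print(required)  # 6 (largest inter-DC) + 8 (DCB) = 14
```

The intermediate value `dcb` is also the number the earlier opinion suggests for sizing both WAN circuits.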

The higher the speed of the DCI circuit(s), the higher the cost. It might not be as obvious, but the higher cost is not just for the actual circuit. It is also for the hardware that is needed at both ends of the circuit to support it and the intra-DC hardware required to support any other higher-speed links that will have to carry the transit traffic.

I’ll write another post to discuss how to minimize the egress traffic becoming DCI transit traffic. It is quite straightforward nowadays to accomplish, with most major network vendors providing solutions for it. I will give special placement to NSX, as it has to achieve it doing something different from what the other vendors do.

Elver’s Opinion: Yes there are traffic pattern schemes that would leverage a smaller size DCI than the last formula above. However those cases don’t occur much in the wild when you are tasked to provide an infrastructure that supports BC with Active/Active WAN and stretched layer 2 for applications.

Wednesday, January 11, 2017

DC Ingress Traffic with Stretched Layer 2

Thank you 1970s for giving us two great things: yours truly and TCP/IP. One thing TCP/IP assumes is that a subnet resides in a single location (you only have one gateway, and it must reside somewhere). However, developers love(d) to code so their application components reside in the same subnet (and same layer 2 so they don’t have to worry about default gateways and what not).

During DR (Disaster Recovery) scenarios it was typical to migrate an application to the backup DC without re-IPing it. So far so good; subnet still resides in “one location” at a time. However, DR evolved to BC (Business Continuity - think about it, why drop a bunch of money on gear, space, and such not to use it?) and Active/Active DCs, and our good friends the developers decided to make it an infrastructure technical requirement to stretch the layer 2 their applications were using across multiple DCs (heaven forbid they would re-architect their applications or that you suggest GLB). TCP/IP is not happy. Elver neither.

All this presents a problem, or rather an opportunity, to network designers. It is probably better to first illustrate it, and then explain it.

In the diagram, a user wants to reach the presentation layer of some application that serves requests out of two DCs. If the user wants to reach a VM that happens to be in DC2, there is no native way for the network to know where the VM resides and thus forward the traffic directly to DC2. It is a 50/50 chance of which DC will receive (ingress) the traffic (more on that below). This is because the network knows about subnets, not individual IPs. When a router does a lookup in its routing table to decide the next hop for a packet in transit, it looks for the most specific subnet (longest prefix) in its routing table that matches the destination IP. If the router has two or more next hops as options for the matching subnet, it selects one (mostly based on some hashing of the header of the packet in transit) and forwards the packet to the selected next hop.
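
The lookup described above can be sketched with Python's standard `ipaddress` module. The routing table, the next-hop names and the choice of CRC32 as the hash are all assumptions for illustration; real routers use their own hash functions and header fields.

```python
import ipaddress
import zlib

# Hypothetical routing table: prefix -> list of equal-cost next hops
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): ["isp"],
    ipaddress.ip_network("10.0.0.0/8"): ["core"],
    ipaddress.ip_network("10.1.1.0/24"): ["dc1-wan", "dc2-wan"],
}


def longest_prefix_match(dst):
    """Return the most specific prefix in ROUTES that contains dst."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    return max(matches, key=lambda net: net.prefixlen)


def next_hop(src, dst, src_port, dst_port):
    """Pick one of the equal-cost next hops by hashing the flow's headers."""
    hops = ROUTES[longest_prefix_match(dst)]
    flow = f"{src}|{dst}|{src_port}|{dst_port}".encode()
    return hops[zlib.crc32(flow) % len(hops)]


# The stretched subnet is reachable via both DCs, so the hash decides.
print(longest_prefix_match("10.1.1.20"))
print(next_hop("198.51.100.7", "10.1.1.20", 51000, 443))
```

Nothing in the lookup knows which DC actually hosts 10.1.1.20; the same flow always hashes to the same next hop, but which DC that is has nothing to do with where the VM lives.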

If the user happens to be “closer” to DC2 than DC1, then it is most likely that the user’s traffic will ingress via DC2. However, “closer” is not about physical proximity but about network path cost and other variables. Also, the network is not a static entity; there are changes happening frequently enough that may affect the “closeness” of the user to the DC/VM.

Why am I telling you all this? Because recently I got into a lively conversation while discussing x-vCenter NSX. x-vCenter NSX allows for layer 2 to be stretched across multiple DCs while providing gateway/FHR (First Hop Router/Routing). There is nothing in NSX that can force the user’s ingress traffic via the DC where the destination VM is. If anyone ever tells you otherwise, whatever solution they provide is not unique to NSX but rather a general networking trick.

So what are those networking tricks? Here are some (not all-inclusive) of them with their potential impacts:

Active/Passive Ingress – Allow the layer 2 to be stretched across both DCs, but advertise the subnet out of only one of the two DCs. If this feels like cheating, it is because it is cheating. You only solve the ingress problem for some of the VMs, and not the others. You also really don’t have BC here because in case of the “Active” DC going down, some intervention will be required to advertise the subnet out of the “Passive” DC; there will be an outage for the application.

Active/"Active" Ingress – Here you advertise the subnet out of both DCs, but you make one DC look “really farther away” than the other by manipulating the cost of the subnet in the routing protocol (like BGP AS path prepending). You would have BC since network failover is automated, but again there is cheating here because you are (mostly) solving the problem for some of the VMs and not the others. Also you could have users that are “so close” to the "backup" Active DC that no feasible amount of cost manipulation would affect them.

Advertise Host Routes – There is nothing that prevents turning a VM’s IP into a /32 subnet and injecting that into the routing process. You can achieve this by adding a static route for each VM IP (/32) in the presentation layer and redistributing the routes into the routing process. Since you can’t get a subnet that is more specific than /32, there would never be a router (outside the DCs) with two equal-cost paths to the /32 pointing to different DCs. You truly get ingress traffic to the DC where the destination VM is. But before I continue explaining this one, let me just note that the burning sensation you are feeling right now on the back of your neck is the Operations Manager giving you the evil look. With this solution you SUBSTANTIALLY increase the size of the routing table and complexity in the network. And this solution breaks down when a VM changes DCs, as there is no automated way to change where the /32 is being injected into the routing table.

Cisco LISP – To wrap it up it is worth mentioning Cisco LISP (Locator/ID Separation Protocol). LISP attempts to solve the ingress situation by leveraging the /32 trick but restricting where the /32s are sent. The idea is to create a network “bubble” around the DCs and place LISP routers at the edge of the bubble. All users must reside outside of the bubble so all ingress traffic goes through the LISP routers. The LISP routers in turn communicate directly with the FHR with the subnet in question (the stretched layer 2 in both DCs) to find out where each VM (IP) resides. When the user traffic reaches the LISP router, the LISP router looks up where the destination IP is located and forwards the traffic to the FHR (via a tunnel). If a VM moves DCs, the FHRs would update the LISP routers with the new VM (IP) location. The problem with this solution is the bubble. Where do you place the LISP routers? And what do you do in a brownfield deployment? It can get expensive and very complicated to achieve.
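
The lookup-then-tunnel idea behind LISP can be shown with a toy model in Python. To be clear, this is not LISP's actual control plane or message format; the `MappingTable` class, the FHR names and the IP address are hypothetical and only mirror the register/lookup/update steps described above.

```python
class MappingTable:
    """Toy stand-in for the VM-location database the LISP routers consult."""

    def __init__(self):
        self._location = {}  # VM IP -> name of the FHR that currently hosts it

    def register(self, vm_ip, fhr):
        # Called when an FHR reports that a VM (IP) sits behind it,
        # including after the VM moves to the other DC.
        self._location[vm_ip] = fhr

    def lookup(self, vm_ip):
        # Called by the edge router for the destination of ingress traffic.
        return self._location.get(vm_ip)


def forward(table, dst_ip):
    """Return the tunnel an edge router would pick for dst_ip."""
    fhr = table.lookup(dst_ip)
    if fhr is None:
        return "unknown-location"  # placeholder: no mapping registered yet
    return f"tunnel-to-{fhr}"


table = MappingTable()
table.register("10.1.1.20", "dc1-fhr")
before_move = forward(table, "10.1.1.20")
table.register("10.1.1.20", "dc2-fhr")   # the VM moved to DC2
after_move = forward(table, "10.1.1.20")
print(before_move, after_move)
```

The per-VM state still exists, as in the host-route approach, but it lives only on the routers at the edge of the bubble instead of in everyone's routing table.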

Elver’s Opinion: As Developers continue to better understand the impact to infrastructure of their design decisions (DevOps), they are building applications that work within the constraints of infrastructure protocols (Cloud Native Apps). So the need to stretch layer 2 across DCs is becoming less and less of an infrastructure technical requirement.

Monday, November 7, 2016

vRNI - Initial Setup

We now have vRNI installed and ready to go. The first thing you probably want to do is to change the default passwords. You can either set up LDAP/AD or create a local admin account to log in to the Platform. Either way, you don’t want to have to use the default admin@local account.

To set up AD or create a local user account, scoot over to the top right of the screen, click the cog and choose Settings (where we will spend most of the time in this post). I didn’t get around to setting up my LDAP server so I’ll be skipping that part (and you can always google how to configure an LDAP server if you don’t already know). So I just created a new user (where it says User Management), elver@piratas.caribe, and gave it a role of administrator (the user must be in the form of an email address). I then logged off admin@local and logged back in with the new user. Returning to User Management you now have the option of deleting the admin@local account.

Elver’s Opinion: You also want to change the CLI user password but I couldn’t figure out how to do it. I reached out to some folks at VMware and will put an update here once I hear back from them.

Next you want to add some Data Sources. vRNI’s purpose for its existence is to gather data from different Data Center infrastructure entities, such as vCenter, NSX Manager (the main vRNI selling point), physical servers and network devices (another vRNI selling point) and do some wizardry on that data. Collectively these guys are referred to as Data Sources. Two Data Sources you really want to add are vCenter and NSX Manager. There does not seem to be a limit on how many of each you can add; however, every NSX Manager must be linked to an existing vCenter Data Source (so vCenter must always be added first).

When adding a Data Source you select the type of Data Source you want and then populate the required fields. For vCenter, you must provide:

  • The vRNI Proxy to use (if the Platform has two or more Proxies associated with it. More on that in a future post)
  • The IP or FQDN of vCenter
  • Admin credentials for vCenter

Once vRNI validates it can authenticate with vCenter, you have the option to enable IPFIX (or Netflow if you prefer to use Cisco’s terminology) in any vDS that exists in vCenter. If you do enable IPFIX in the vDS, you will have an option to enable it per dvPortgroup. Then give your vCenter a vRNI nickname and save it (submit). Btw, enabling IPFIX will cause vRNI to configure IPFIX for you in the vDS using the Proxy’s IP as the collector. If your Proxy is behind a NAT, you will need to go to vCenter and manually edit the collector’s IP to the NATted IP AND punch a hole in the NAT router to allow IPFIX traffic to get back to the Proxy (UDP default port 2055).
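
If you want to check that IPFIX datagrams actually make it through the NAT hole, a throwaway UDP listener is enough. The sketch below is my own, not part of vRNI: it assumes the exporter sends standard IPFIX messages (16-byte header, version field 10) to UDP 2055, and it should run on a test host rather than on the Proxy itself, where that port would already be in use by the collector. The function names are made up.

```python
import socket
import struct

# IPFIX message header: version, length, export time, sequence number,
# observation domain ID (16 bytes, network byte order)
IPFIX_HEADER = struct.Struct("!HHIII")


def parse_ipfix_header(datagram):
    """Decode the 16-byte IPFIX message header at the start of a datagram."""
    if len(datagram) < IPFIX_HEADER.size:
        raise ValueError("datagram too short for an IPFIX header")
    version, length, export_time, sequence, domain_id = IPFIX_HEADER.unpack_from(datagram)
    return {
        "version": version,          # 10 for IPFIX
        "length": length,            # total message length in bytes
        "export_time": export_time,  # seconds since the Unix epoch
        "sequence": sequence,
        "domain_id": domain_id,
    }


def listen(port=2055, count=5):
    """Print the sender and header of the next `count` datagrams on UDP `port`."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", port))
        for _ in range(count):
            datagram, (src_ip, _src_port) = sock.recvfrom(65535)
            print(src_ip, parse_ipfix_header(datagram))
```

Calling `listen()` and seeing headers with version 10 arrive from your ESXi hosts tells you the collector IP and the NAT rule are right; a different version value means the sender is not framing its export as IPFIX.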

Elver’s Opinion: Be careful with enabling IPFIX/Netflow in a production environment as it may tax the ESXi hosts. Only enable it if there is business value in doing so AND your ESXi hosts are not currently burdened with production workloads.

The steps to add NSX Manager are similar to those for vCenter, but you need to select the vCenter that is associated with NSX Manager (otherwise how would vRNI correlate NSX Manager’s data with that of vCenter?). In addition, you can have vRNI connect to the NSX Controllers to collect control plane data from them, and to the NSX Edges (directly via SSH to the NSX Edges or via NSX Manager’s central CLI, which requires NSX 6.2).

Elver’s Opinion: I added a Brocade VDX as a source but I couldn’t get SNMP collection to work. Seriously, it is SNMP; that should work just because. I’ll keep trying and put up something in a future post if I’m successful. I’m also going to add my UCS once I get my mobile AC up and running in the server room.

And speaking of data, what exactly is vRNI collecting from vCenter? For starters, it collects a list of all VMs under vCenter’s management as well as compute, storage, VM placement (what host/cluster the VM is on) and network information (basically the same info you get when using vCenter’s native monitor view). From NSX Manager, it collects info such as what Logical Switches the VMs connect to and who is their default gateway (this is where the NSX Manager to vCenter correlation comes in).

Now the last paragraph is no reason to go buy vRNI. Hell, there are a million and one tools/collectors that can do this, many that are free or low cost. However, what vRNI can do (enter Platform) is correlate all the data and events collected from all the sources, something that would in the past take an operations team hours to do (which is why the Platform appliance has such a BIG CPU/Memory footprint). It has built-in modules that can link vCenter and NSX data, and present nice pictures and charts to help identify problems in the environment (in particular, the network infrastructure). This is a time saver (and for a business, higher uptime with less negative reputational/financial impact).

I’ll see about writing the next post on how to use some of the operations and troubleshooting goodies of vRNI. I can’t promise when I will get around to doing it, but I do promise that I will.

Elver's Opinion: Do you see the Topology chart in the last picture? I don't like it. It is a poor attempt to put unrelated information (storage, network, hosts, etc...) for the VM into one picture. Luckily, you can drag charts around and move them somewhere where they bother you less.

Friday, October 28, 2016

vRealize Network Insight - Installation

I really dislike software bugs. I spent hours trying to deploy the ova for vRealize Network Insight Platform only to have vCenter tell me that the operation was cancelled by a user. Forgive me, but last I checked I wasn’t hacked so there is no other user, I’m THE ONLY USER in my lab and I'm not cancelling the operation. Thank god that vSphere 6.5 will have actionable logs. Anyway, after updating vCenter 6 from U1 to U2 and reinstalling the client integration plugin I was able to deploy vRNI Platform 3.1.

I’m not going to cover how to deploy an ova, but to deploy vRNI Platform you will need a static IP and all the goodness that comes with it. Oh, and you will also need to know if you want to deploy the appliance medium size or large size. Regardless of the size, the size on disk can be substantial (even thin provisioned I think the thing is still big). You can go here to get the official instructions on how to install it.

After deploying vRNI31 (that’s what I named the Platform), I couldn’t ping the thing. Turns out that for whatever reason the network information was not populated in the appliance (the boot logs showed “configure virtual network devices” failed). When I opened the console of vRNI31 to reach the CLI I discovered that I needed a username/password to get in (of course I would need one). The default credentials of admin@local/admin didn’t work, nor did every permutation of root and admin I could think of (I’m a terrible hacker). So I decided to just walk away from the laptop and come back later.

When I returned I remembered that there is a CLI Guide for vRNI and in it was the CLI credential of consoleuser/ark1nc0ns0l3 (which made me wonder how long before they change the password to vmw@arec0ns0l3). In the CLI I typed setup and re-entered the network information I provided during the ova deployment and presto. I was now able to reach the Platform’s login page, https://vRNI31/ (I updated my lab DNS server), to proceed with the installation.

As you can see from the above figure, you need a license key to do anything with vRNI. Enter the license key and press Validate, followed by Activate. Of course, if the key can’t be validated you will be told the key is invalid. After activation, you get this window below.

Here you need to create a secret key (by pressing Generate) that will be used to establish communications with the vRNI Proxy. Before continuing I probably should take a paragraph or two to do a high-level explanation of what vRNI is.

vRealize Network Insight is a product (since renamed) that VMware acquired by buying a company named Arkin. vRNI helps Operations and Security teams manage and troubleshoot some virtual (vSphere and NSX) and physical Networks and Security. vRNI has a decent list of vendors and devices that it supports. vRNI works by polling the sources (vCenter or Cisco UCS for example) every so often (defaults to 10 minutes) and using some jujitsu white magic to help identify issues that might exist in the environment (like a host with a misconfigured MTU for example).

vRNI comes as two appliances, Platform and Proxy. The first one you install, which I installed above, is the Platform. The Platform does all the computations and smart stuff. The second one (which is what I’m installing next) is the Proxy, the one that does the data polling (and can be configured as an IPFIX collector). vRNI supports some scalability by allowing you to cluster multiple Platforms and Proxies. I’ll cover in subsequent posts some of the things vRNI can help you do but for now, back to installing the Proxy.

vRNI Proxy is the second ova that you need for vRNI to work. The only differences between deploying the Platform and Proxy appliances are the shared secret (in the Proxy), the definition of medium and large, and the size of the disk.

By the way, I had the same problem with the Proxy: neither the network information nor the shared secret was populated (why does this keep happening to me?). So I added the network information via the CLI (console) and the shared secret via a new ssh connection:

Sure enough, a few seconds after adding the shared secret, the Proxy reached out to the Platform and was detected by it. Back on the Platform login page (https://vRNI31/), I clicked finished and was prompted to enter the login credentials (admin@local/admin), after which it sent me to the vRNI home page.

I’ll do my very best to write a follow-up soon with a post (or posts) on how to add data sources and what to do with the information gathered. In the meantime, ta ta for now.

Elver’s Opinion: For a while VMware tried to position vRealize Log Insight as a Network and Security operations tool, but it is not. vRLI was built primarily to handle virtual compute, not Network and Security. As much lipstick as VMware put on it (via Management Packs), it just wasn’t enough. vRNI is by no means the ultimate N&S operations tool but it is way better than vRLI ever was for the job.