Monday, November 7, 2016

vRNI - Initial Setup

We now have vRNI installed and ready to go. The first thing you probably want to do is change the default passwords. You can either set up LDAP/AD or create a local admin account to log in to the Platform. Either way, you want to stop relying on the default admin@local account.

To set up AD or create a local user account, scoot over to the top right of the screen, click the cog and choose Settings (where we will spend most of the time in this post). I didn’t get around to setting up my LDAP server so I’ll be skipping that part (and you can always google how to configure an LDAP server if you don’t already know). So I just created a new user (where it says User Management), elver@piratas.caribe, and gave it a role of administrator (the user must be in the form of an email address). I then logged off admin@local and logged back in with the new user. Returning to User Management, you now have the option of deleting the admin@local account.
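The email-address requirement for local usernames is easy to trip over, so here is a minimal sketch of the kind of check involved. This is purely illustrative (the pattern and helper name are mine, not a vRNI API):

```python
import re

# vRNI requires local usernames in email form (e.g. elver@piratas.caribe).
# This regex is a simple illustrative approximation, not vRNI's actual rule.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def valid_vrni_username(name: str) -> bool:
    """Return True if the username looks like the email form vRNI expects."""
    return bool(EMAIL_RE.match(name))

print(valid_vrni_username("elver@piratas.caribe"))  # True
print(valid_vrni_username("admin"))                 # False
```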

Elver’s Opinion: You also want to change the CLI user password but I couldn’t figure out how to do it. I reached out to some folks at VMware and will put an update here once I hear back from them.

Next you want to add some Data Sources. vRNI’s entire reason for existence is to gather data from different Data Center infrastructure entities, such as vCenter, NSX Manager (the main vRNI selling point), physical servers and network devices (another vRNI selling point), and do some wizardry on that data. Collectively these guys are referred to as Data Sources. Two Data Sources you really want to add are vCenter and NSX Manager. There does not seem to be a limit on how many of each you can add; however, every NSX Manager must be linked to an existing vCenter Data Source (so vCenter must always be added first).
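The ordering rule above can be sketched in a few lines of code. This is only a model of the dependency (the class and names are made up; vRNI's internals are not public):

```python
# Illustrative sketch of the rule: every NSX Manager data source must
# reference a vCenter data source that was added first. Not vRNI code.

class DataSourceRegistry:
    def __init__(self):
        self.vcenters = set()
        self.nsx_managers = {}  # NSX Manager name -> linked vCenter name

    def add_vcenter(self, name: str) -> None:
        self.vcenters.add(name)

    def add_nsx_manager(self, name: str, linked_vcenter: str) -> None:
        # Reject the NSX Manager if its vCenter isn't registered yet.
        if linked_vcenter not in self.vcenters:
            raise ValueError("add the vCenter data source before its NSX Manager")
        self.nsx_managers[name] = linked_vcenter

registry = DataSourceRegistry()
registry.add_vcenter("vcenter01.piratas.caribe")
registry.add_nsx_manager("nsxmgr01.piratas.caribe", "vcenter01.piratas.caribe")
```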

When adding a Data Source you select the type of Data Source you want and then populate the required fields. For vCenter, you must provide:

  • The vRNI Proxy to use (if the Platform has two or more Proxies associated with it. More on that in a future post)
  • The IP or FQDN of vCenter
  • Admin credentials for vCenter

Once vRNI validates it can authenticate with vCenter, you have the option to enable IPFIX (or NetFlow if you prefer Cisco’s terminology) in any vDS that exists in vCenter. If you do enable IPFIX in the vDS, you will have an option to enable it per dvPortgroup. Then give your vCenter a vRNI nickname and save it (Submit). Btw, enabling IPFIX will cause vRNI to configure IPFIX for you in the vDS using the Proxy’s IP as the collector. If your Proxy is behind a NAT, you will need to go to vCenter and manually edit the collector’s IP to the NATted IP AND punch a hole in the NAT router to allow IPFIX traffic to get back to the Proxy (UDP port 2055 by default).
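If the NAT device in front of the Proxy happens to be a Linux router, the "punch a hole" step might look something like the fragment below. This is only a sketch; the interface-facing addresses are made-up examples and your NAT gear will have its own syntax:

```shell
# Forward IPFIX (UDP 2055 by default) arriving at the NAT router to the
# vRNI Proxy's private IP. 192.168.10.50 is a made-up example address.
iptables -t nat -A PREROUTING -p udp --dport 2055 \
         -j DNAT --to-destination 192.168.10.50:2055
iptables -A FORWARD -p udp -d 192.168.10.50 --dport 2055 -j ACCEPT
```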

Elver’s Opinion: Be careful with enabling IPFIX/Netflow in a production environment as it may tax the ESXi hosts. Only enable it if there is business value in doing so AND your ESXi hosts are not currently burdened with production workloads.

The steps to add NSX Manager are similar to vCenter’s, but you need to select the vCenter that is associated with NSX Manager (otherwise how would vRNI correlate NSX Manager’s data with vCenter’s?). In addition, you can have vRNI connect to the NSX Controllers to collect control plane data from them, and to the NSX Edges (directly via SSH to the NSX Edges or via NSX Manager’s central CLI, which requires NSX 6.2).
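For reference, the central CLI that vRNI can lean on (NSX 6.2 and later) is reached by SSHing to NSX Manager itself. A couple of the commands in that family look like this (the edge ID below is just an example; substitute your own):

```shell
# From an SSH session to NSX Manager (NSX 6.2+ central CLI):
show edge all       # list all NSX Edges, including logical router Control VMs
show edge edge-7    # details for a specific edge (example ID)
```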

Elver’s Opinion: I added a Brocade VDX as a source but I couldn’t get SNMP collection to work. Seriously, it is SNMP; that should work just because. I’ll keep trying and put up something in a future post if I’m successful. I’m also going to add my UCS once I get my mobile AC up and running in the server room.

And speaking of data, what exactly is vRNI collecting from vCenter? For starters, it collects a list of all VMs under vCenter’s management as well as compute, storage, VM placement (what host/cluster the VM is on) and network information (basically the same info you get from vCenter’s native monitoring views). From NSX Manager, it collects info such as what Logical Switches the VMs connect to and what their default gateway is (this is where the NSX Manager to vCenter correlation comes in).

Now the last paragraph is no reason to go buy vRNI. Hell, there are a million and one tools/collectors that can do this, many free or low cost. However, what vRNI can do (enter Platform) is correlate all the data and events collected from all the sources, something that would in the past take an operations team hours to do (which is why the Platform appliance has such a BIG CPU/Memory footprint). It has built-in modules that can link vCenter and NSX data, and present nice pictures and charts to help identify problems in the environment (in particular, the network infrastructure). This is a time saver (and for a business, higher uptime with less negative reputational/financial impact).

I’ll see about writing the next post on how to use some of the operations and troubleshooting goodies of vRNI. I can’t promise when I will get around to doing it, but I do promise that I will.

Elver's Opinion: Do you see the Topology chart in the last picture? I don't like it. It is a poor attempt to put unrelated information (storage, network, hosts, etc...) for the VM into one picture. Luckily, you can drag charts around and move them somewhere they bother you less.

Friday, October 28, 2016

vRealize Network Insight - Installation

I really dislike software bugs. I spent hours trying to deploy the ova for vRealize Network Insight Platform, only to have vCenter tell me that the operation was cancelled by a user. Forgive me, but last I checked I wasn’t hacked so there is no other user, I’m THE ONLY USER in my lab and I'm not cancelling the operation. Thank god vSphere 6.5 will have actionable logs. Anyway, after updating vCenter 6 from U1 to U2 and reinstalling the client integration plugin I was able to deploy vRNI Platform 3.1.

I’m not going to cover how to deploy an ova, but to deploy vRNI Platform you will need a static IP and all the goodness that comes with it. Oh, and you will also need to know whether you want to deploy the appliance in the medium or large size. Regardless of the size, the footprint on disk can be substantial (even thin provisioned, I think the thing is still big). You can go here to get the official instructions on how to install it.

After deploying vRNI31 (that’s what I named the Platform), I couldn’t ping the thing. Turns out that for whatever reason the network information was not populated in the appliance (the boot logs showed “configure virtual network devices” failed). When I opened the console of vRNI31 to reach the CLI, I discovered that I needed a username/password to get in (of course I would need one). The default credentials of admin@local/admin didn’t work, nor did any permutation of root and admin I could think of (I’m a terrible hacker). So I decided to just walk away from the laptop and come back later.

When I returned I remembered that there is a CLI Guide for vRNI, and in it were the CLI credentials consoleuser/ark1nc0ns0l3 (which made me wonder how long before they change the password to vmw@arec0ns0l3). In the CLI I typed setup, re-entered the network information I provided during the ova deployment, and presto. I was now able to reach the Platform’s login page, https://vRNI31/ (I updated my lab DNS server), to proceed with the installation.
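For anyone hitting the same wall, the console session goes roughly like this (a sketch from memory; the exact wizard prompts may differ by version):

```shell
# At the appliance console (or SSH), log in with the documented CLI account:
login: consoleuser
password: ark1nc0ns0l3

# Re-run the network setup wizard; it prompts for the same values you gave
# during the ova deployment (IP address, netmask, gateway, DNS, domain).
setup
```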

As you can see from the above figure, you need a license key to do anything with vRNI. Enter the license key and press Validate, followed by Activate. Of course, if the key can’t be validated you will be told the key is invalid. After activation, you get this window below.

Here you need to create a secret key (by pressing Generate) that will be used to establish communications with the vRNI Proxy. Before continuing I probably should take a paragraph or two to do a high-level explanation of what vRNI is.

vRealize Network Insight is a product VMware acquired by buying a company named Arkin (the product has since been renamed). vRNI helps Operations and Security teams manage and troubleshoot some virtual (vSphere and NSX) and physical Networks and Security. vRNI has a decent list of vendors and devices that it supports. vRNI works by polling the sources (vCenter or Cisco UCS, for example) every so often (defaults to 10 minutes) and using some jujitsu white magic to help identify issues that might exist in the environment (like a host with a misconfigured MTU, for example).

vRNI comes as two appliances, Platform and Proxy. The first one you install, which I installed above, is the Platform. The Platform does all the computations and smart stuff. The second one (which is what I’m installing next) is the Proxy, the one that does the data polling (and can be configured as an IPFIX collector). vRNI supports some scalability by allowing you to cluster multiple Platforms and Proxies. I’ll cover in subsequent posts some of the things vRNI can help you do, but for now, back to installing the Proxy.

vRNI Proxy is the second ova that you need for vRNI to work. The only differences between deploying the Platform and Proxy appliances are the shared secret (in the Proxy), the definition of medium and large, and the size of the disk.

By the way, I had the same problem of the network information and the shared secret not being populated in the Proxy (why does this keep happening to me?), so I added the network information via the CLI (console), as well as the shared secret (which I did via a new SSH connection):

Sure enough, a few seconds after adding the shared secret, the Proxy reached out to the Platform and was detected by it. Back on the Platform login page (https://vRNI31/), I clicked Finish and was prompted to enter the login credentials (admin@local/admin), after which it sent me to the vRNI home page.

I’ll do my very best to write a follow up soon with a post (or posts) on how to add data sources and what to do with the information gathered. In the meantime, ta ta for now.

Elver’s Opinion: For a while VMware tried to position vRealize Log Insight as a Network and Security operations tool, but it is not. vRLI was built primarily to handle virtual compute, not Network and Security. As much lipstick as VMware put on it (via Management Packs), it just wasn’t enough. vRNI is by no means the ultimate N&S operations tool but it is way better than vRLI ever was for the job.

Friday, September 16, 2016

Restoring NSX from Old Backup - With Control VM

Ok, yesterday I posted Restoring NSX from Old Backup - Impact on Distributed Network where I said I was 5 sigma sure the Control VM wouldn’t make a difference to the restore. 5 sigma is probably not as good as 6 sigma (whatever that is) so this post is to show NSX Manager’s recovery being done when the Logical Router is deployed with its Control VM.

Here is the logical view of the network with a Global Logical Router that has a Control VM:

That’s right, this diagram is the same diagram from yesterday, with no Control VM. That’s because the logical diagram depicts the Data Plane, not the Control Plane (gotcha). However, the Control VM does have a connection for its HA interface (formerly known as the Management Interface), which I dropped in the dvPortgroup COM_A1-VMMGT. Below is a diagram of the vDS after deploying the Control VM (this time I showed the Uplinks so you can see the two ESXi hosts…sorry for missing that yesterday).

So, I removed the logical router (default+edge-6) that was there, made a backup (Backup4) of NSX Manager, deployed a new logical router (piratas+edge-7) with its Control VM (that’s how I got the above vDS screenshot) and did one more backup (Backup5) so I could easily return to the end-state. Below is a screenshot of the new logical router.

Bonus Point 1: What is the Tenant name of the new Logical Router? Answer provided at the end of this post… Now back to the show.

And here is what com-a1-esxi01 sees:

Bonus Point 2: Why does the output for Edge Active say No? Answer provided at the end of the post.

And here is the same output (with some additional show commands to find the host-id of com-a1-esxi01), but taken from NSXMGR:

And here is me consoling into the Control VM and showing its routing table:

After restoring Backup4 (no logical router), here is what the com-a1-esxi01 host sees.

Even NSX Manager forgot about it:

However, vCenter still sees the Control VM (it is a VM after all):

We can also console in to the Control VM (or if we had bothered to put an IP in the HA interface and enabled SSH, we could've gone in-band) and show the routing table:

Are you surprised the Control VM still shows the LIFs as connected? Let’s ponder on this for a bit. The Control VM doesn’t communicate directly with the ESXi hosts, so it has no clue that all of them dropped the Logical Router. It receives its information (configuration wise, like the LIFs and IPs) from NSX Manager. NSX Manager has not told the Control VM (since it forgot about the Control VM's existence) that the Logical Router is no longer around, thus the Control VM continues to believe all is good and the LIFs are still connected (up/up)...even after a few hours of not “hearing” from NSX Manager.

After a few hours, I restored from Backup5 (the end-state), the logical router came back, and NSX Manager remembered about the Control VM.

Elver’s Opinion: I don’t think I have an opinion today (something all wise married men know how to do too well)…but I would brag a little that I was right when I said yesterday that the restore would have the same impact to the logical router whether it has a Control VM or not.

Bonus Points Answers: Gotcha again (actually, I lied this time). Instead of giving you the answers, how about you tweet the answers to me, @ElverS_Opinion? The first person to tweet both answers will get a signed copy, in two languages mind you, of the VCP6-NV Official Cert Book1. Just make sure you follow me so you can send me your mailing address via private IM.

1 Offer only valid for those that can locate the Seven Kingdoms on a map, agree with the fact that Citizen Kane is the best movie EVER, and know what Bachata is.

Thursday, September 15, 2016

Restoring NSX from Old Backup - Impact on Distributed Network

I’ve been slacking (from writing) for a few months now, but at VMworld 2016 @LuisChanu reminded me of a blog I had promised him. My first ever blog was NSX Manager Backup and Restore, but he wanted to know about a few “what-ifs”, like what would happen if you restore NSX Manager using an old backup. So this post is to fulfill (better late than never) my promise to Luis and write about what happens to the distributed network when you use old backups to restore NSX Manager.

To get us started, below is a logical diagram of the NSX setup. We have one Global Logical Router and two Global Logical Switches. Logical Switch 1 has VM ServerWeb01 and Logical Switch 2 has VMs ServerApp01 and ServerApp02. ServerWeb01 and ServerApp02 are running in the same ESXi host, com-a1-esxi01 (not shown in the diagram).

I used a single cluster with ESXi hosts com-a1-esxi01 and com-a1-esxi02, both members of the same vDS. The initial (no logical switches deployed yet) vDS topology is shown below.

I made two backups (actually three backups) of NSX Manager to an FTP server. Backup1 does not have any logical switches or the logical router. Backup2 has the logical switches but not the logical router. Backup3 is my end-state with all configurations (I made it so I could quickly go back to a working state during testing).

Elver’s Opinion: I’ve used the built-in backup feature of NSX to do this lab. I’m 5 sigma confident that the same result would’ve been obtained if you used another method that backs up the NSX Manager Appliance. Btw, if writing-slacking is really out of me, I’ll soon do a follow up post to cover the impact to NSX Security when restoring NSX Manager from an old backup.

So we have our vSphere/NSX environment working the way we want it (end-state) when a gamma ray hits the right (or wrong, depending how you look at it) chip in one of the DIMMs that happened to be hosting the memory pages of NSX Manager, corrupting its database and rendering it useless (yes, it could happen, especially if your ESXi host is onboard the International Space Station).

Elver’s Opinion: Instead of restoring NSX in this ET event, you could call VMware support. They have some tricks up their sleeves to recover from some types of database corruptions.

Just before the gamma ray hit the RAM, this is what our vDS looked like:

And the deployed Logical Switches:

And the deployed Logical Router:

And what com-a1-esxi01 saw:

Good to know: A quick detour to point out something about the CLI output. Notice that both logical switches (VXLAN 32000 and 32001) have a Port Count of 2. Each logical switch has one connected VM running in com-a1-esxi01 plus one LIF from the logical router.
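If you want to pull that output yourself, the Port Count comes from the VXLAN namespace of esxcli on the host (the vDS name below is an example; use your own):

```shell
# On the ESXi host; lists each VXLAN logical switch with its VXLAN ID,
# multicast address, Port Count and MAC/ARP entry counts.
esxcli network vswitch dvs vmware vxlan network list --vds-name=vds-01
```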

Now back on the road, we did some ping tests (from ServerApp01 to ServerWeb01) to show that traffic is flowing between the two logical switches, via the logical router.

Let’s go ahead and restore from Backup2, the one that has the logical switches but not the logical router. After NSX Manager finishes the restore, we log back in to the Web Client and see the logical router missing from the Network and Security view (which is what would be expected since we restored from a backup that didn’t have a logical router).

One thing NSX Manager does after reestablishing the connection to vCenter is reach out to the ESXi hosts (vCenter has nothing to do with this) and ask them (politely) to get rid of any logical router that it does not know about (actually, NSX Manager pushes the logical routers that it does know about to the ESXi hosts and the hosts purge everything else). Below is a CLI output from com-a1-esxi01 showing the logical switches with the Port Count field down to 1 and no logical router present.
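That push-and-purge reconciliation boils down to a set intersection: the host keeps only the routers the manager still knows about. A minimal sketch (illustrative names, obviously not NSX code):

```python
# Illustrative model of the reconciliation: NSX Manager pushes the logical
# routers it knows about, and each ESXi host purges everything else.

def reconcile_host_routers(host_routers: set, manager_routers: set) -> set:
    """Return the routers the host keeps after NSX Manager pushes its list."""
    return host_routers & manager_routers

# After restoring Backup2, the manager no longer knows about default+edge-6:
host = {"default+edge-6"}
manager = set()  # Backup2 had no logical routers
print(reconcile_host_routers(host, manager))  # set()
```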

If we try to ping from the VMs to the default gateway (remember the LIFs are gone), the pings fail.

Just for kicks and giggles, I restored Backup3 and the logical router returned. I was able to ping between Layer 2 segments via the logical router.

Elver’s Opinion: I deployed the logical router without a Control VM as I expect (again, with 5 sigma certainty, for which I expect to be nominated for a Nobel Prize) that the results would be the same as if I had deployed the Control VM.

Now to restore from Backup1, with no logical switches and no logical router. After the usual routine of waiting for NSX Manager to finish the restore and logging back in to the Web Client, I confirmed there were no logical switches in the Network and Security view.

However (and this should’ve been expected by you), the dvPortgroups representing the logical switches remained. dvPortgroups are owned by vCenter and vCenter was not part of the restore process. Looking at the ESXi host, it still had the information for the logical switches:

Again, this should’ve been expected because the difference between a VLAN dvPortgroup and a VXLAN dvPortgroup is the Opaque Network fields (VXLAN ID, multicast address) in the VXLAN dvPortgroup, which were pushed by vCenter to each of the ESXi hosts in the vDS. NSX Manager gave the Opaque Network field values to vCenter. When NSX Manager is restored from the old backup, it is not aware of the VXLAN dvPortgroups and thus has no way of telling vCenter to clean up (which is a good thing, by the way). You won’t be able to make any changes to those logical switches (VXLAN dvPortgroups), but the Data Plane will continue to run.

A quick ping between ServerApp01 and ServerApp02 (which were running in different hosts) proved VXLAN was working between the VTEPs.

Elver’s Opinion: So we have a split verdict on the impact to the distributed network of restoring NSX from old backups. The Layer 3 (logical routers) would get affected (this is bad) while the Layer 2 (this is good) would not. As an aside, I didn’t test with the NSX Edge appliance as once it is deployed (configs pushed by NSX Manager) the Edge goes about its business in the Control/Data planes irrespective of what happens to NSX Manager.