Thursday, September 15, 2016

Restoring NSX from Old Backup - Impact on Distributed Network

I’ve been slacking (from writing) for a few months now, but at VMworld 2016 @LuisChanu reminded me of a blog post I had promised him. My first ever blog post was NSX Manager Backup and Restore, but he wanted to know about a few “what ifs”, like what would happen if you restore NSX Manager using an old backup. So this post is to fulfill (better late than never) my promise to Luis and write about what happens to the distributed network when you use old backups to restore NSX Manager.

To get us started, below is a logical diagram of the NSX setup. We have one Global Logical Router and two Global Logical Switches. Logical Switch 1 has VM ServerWeb01 and Logical Switch 2 has VMs ServerApp01 and ServerApp02. ServerWeb01 and ServerApp02 are running on the same ESXi host, com-a1-esxi01 (not shown in the diagram).


I used a single cluster with ESXi hosts com-a1-esxi01 and com-a1-esxi02, both members of the same vDS. The initial (no logical switches deployed yet) vDS topology is shown below.


I made two backups (actually three) of NSX Manager to an FTP server. Backup1 does not have any logical switches or the logical router. Backup2 has the logical switches but not the logical router. Backup3 is my end-state with all configurations (I took it so I could quickly go back to a working state during testing).

Elver’s Opinion: I’ve used the built-in backup feature of NSX to do this lab. I’m 5 sigma confident that the same result would’ve been obtained if you used another method that backs up the NSX Manager Appliance. Btw, if the writing-slacking is really out of me, I’ll soon do a follow-up post to cover the impact to NSX Security when restoring NSX Manager from an old backup.

So we have our vSphere/NSX environment working the way we want it (end-state) when a gamma ray hits the right (or wrong, depending on how you look at it) chip in one of the DIMMs that happened to be hosting the memory pages of NSX Manager, corrupting its database and rendering it useless (yes, it could happen, especially if your ESXi host is onboard the International Space Station).

Elver’s Opinion: Instead of restoring NSX in this ET event, you could call VMware support. They have some tricks up their sleeves to recover from some types of database corruptions.

Just before the gamma ray hit the RAM, this is what our vDS looked like:


And the deployed Logical Switches:


And the deployed Logical Router:


And what com-a1-esxi01 saw:


Good to know: A quick detour to point out something about the CLI output. Notice that both logical switches (VXLAN 32000 and 32001) have a Port Count of 2. Each logical switch has one connected VM running in com-a1-esxi01 plus one LIF from the logical router.
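The Port Count arithmetic can be sketched in a few lines of Python. This is only a toy model of the counting described above, not an NSX API or data structure. The mapping of VXLAN 32000 to Logical Switch 1 and 32001 to Logical Switch 2 is an assumption, as are all the names in the code.

```python
# Toy model of the Port Count arithmetic; not an NSX API or data structure.
# Assumption: VXLAN 32000 backs Logical Switch 1 and 32001 backs Logical Switch 2.
VMS_ON_HOST = {
    "com-a1-esxi01": {"ServerWeb01": 32000, "ServerApp02": 32001},
    "com-a1-esxi02": {"ServerApp01": 32001},
}


def port_count(host, vni, lifs_per_vni):
    """Ports a host reports for a logical switch: local VMs plus router LIFs."""
    vm_ports = sum(1 for v in VMS_ON_HOST[host].values() if v == vni)
    return vm_ports + lifs_per_vni.get(vni, 0)


with_router = {32000: 1, 32001: 1}  # one LIF per logical switch
without_router = {}                 # no logical router instance on the host

for vni in (32000, 32001):
    print(vni,
          port_count("com-a1-esxi01", vni, with_router),
          port_count("com-a1-esxi01", vni, without_router))
```

With the router LIFs present, each logical switch counts 2 ports on com-a1-esxi01; take the LIFs away and the count for each drops to 1.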

Now back on the road, we did some ping tests (from ServerApp01 to ServerWeb01) to show that traffic is flowing between the two logical switches, via the logical router.


Let’s go ahead and restore from Backup2, the one that has the logical switches but not the logical router. After NSX Manager finishes the restore, we log back in to the Web Client and see the logical router missing from the Network and Security view (which is what would be expected since we restored from a backup that didn’t have a logical router).




One thing NSX Manager does after reestablishing the connection to vCenter is reach out to the ESXi hosts (vCenter has nothing to do with this) and ask them (politely) to get rid of the logical router that it does not know about (actually, NSX Manager pushes the logical routers that it does know about to the ESXi hosts, and the hosts purge everything else). Below is the CLI output from com-a1-esxi01 showing the logical switches with the Port Count field down to 1 and no logical router present.
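The push-then-purge behavior is easy to picture as a set operation. The sketch below is a conceptual illustration only; it is not how NSX Manager or the hosts are actually implemented, and the router name `dlr-01` and the function name are made up for the example.

```python
# Conceptual sketch of "push what the Manager knows, purge everything else".
# Hypothetical names throughout; this is not NSX code.
def sync_routers(host_routers, manager_routers):
    """Return (routers left on the host, routers purged) after a sync."""
    desired = set(manager_routers)         # what the restored Manager knows about
    purged = set(host_routers) - desired   # the host drops anything else
    return desired, purged


# Restored from Backup2: the Manager knows no logical routers.
remaining, purged = sync_routers(host_routers={"dlr-01"}, manager_routers=set())
print(sorted(remaining), sorted(purged))

# Restored from Backup3: the Manager knows the router, so nothing is purged.
remaining, purged = sync_routers(host_routers={"dlr-01"}, manager_routers={"dlr-01"})
print(sorted(remaining), sorted(purged))
```

The point of the model is that the Manager’s inventory is treated as the source of truth for logical routers, so anything missing from the restored backup disappears from the hosts.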


If we try to ping from the VMs to the default gateway (remember the LIFs are gone), the pings fail.


Just for kicks and giggles, I restored Backup3 and the logical router returned. I was able to ping between Layer 2 segments via the logical router.

Elver’s Opinion: I deployed the logical router without a Control VM as I expect (again, with 5 sigma certainty, for which I expect to be nominated to a Novell Prize) that the results would be the same as if I had deployed the Control VM.

Now to restore from Backup1, with no logical switches and no logical router. After the usual routine of waiting for NSX Manager to finish the restore and logging back in to the Web Client, I confirmed there were no logical switches in the Network and Security view.



However (and this should’ve been expected by you), the dvPortgroups representing the logical switches remained. dvPortgroups are owned by vCenter and vCenter was not part of the restore process. Looking at the ESXi host, it still had the information for the logical switches:


Again, this should’ve been expected because the difference between a VLAN dvPortgroup and a VXLAN dvPortgroup is the Opaque Network fields (VXLAN ID, Multicast address) in the VXLAN dvPortgroup, which were pushed by vCenter to each of the ESXi hosts in the vDS. NSX Manager gave the Opaque Field values to vCenter. When NSX Manager is restored from the old backup, it is not aware of the VXLAN dvPortgroups, so it has no way of telling vCenter to clean them up (which is a good thing, by the way). You won’t be able to make any changes to those logical switches (VXLAN dvPortgroups), but the Data Plane will continue to run.
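To make the ownership split concrete, here is a small Python sketch of the state after restoring Backup1. It is a toy model, not vCenter or NSX code: the `DvPortgroup` class, the portgroup names, and the VXLAN-to-switch mapping are all assumptions for illustration.

```python
# Toy model: vCenter owns the dvPortgroups, the restored Manager owns only its
# own inventory. All names here are hypothetical, not vCenter or NSX objects.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DvPortgroup:
    name: str
    vxlan_id: Optional[int] = None  # None models a plain VLAN dvPortgroup


def orphaned_vxlan_portgroups(vcenter_portgroups, manager_vnis):
    """VXLAN dvPortgroups in vCenter that the restored Manager does not know."""
    return [pg for pg in vcenter_portgroups
            if pg.vxlan_id is not None and pg.vxlan_id not in manager_vnis]


vcenter_portgroups = [
    DvPortgroup("Management"),                # VLAN-backed, unrelated to NSX
    DvPortgroup("LogicalSwitch1", 32000),
    DvPortgroup("LogicalSwitch2", 32001),
]

# Backup1 predates both logical switches, so the Manager knows no VXLAN IDs.
for pg in orphaned_vxlan_portgroups(vcenter_portgroups, manager_vnis=set()):
    print(pg.name, pg.vxlan_id)
```

Both VXLAN dvPortgroups show up as unknown to the Manager, yet nothing in the model removes them from vCenter, which mirrors why the Data Plane keeps running.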

A quick ping between ServerApp01 and ServerApp02 (which were running in different hosts) proved VXLAN was working between the VTEPs.



Elver’s Opinion: So we have a split verdict on the impact to the distributed network of restoring NSX from old backups. Layer 3 (logical routers) is affected (this is bad), while Layer 2 (logical switches) is not (this is good). As an aside, I didn’t test with the NSX Edge appliance because once it is deployed (configs pushed by NSX Manager), the Edge goes about its business in the Control/Data planes irrespective of what happens to NSX Manager.
