I’ve been slacking (from writing) for a few months now, but
at VMworld 2016 @LuisChanu reminded
me of a blog post I had promised him. My first-ever blog post was NSX Manager Backup and Restore, but he wanted to know about a few “what-ifs”, like what
would happen if you restored NSX Manager from an old backup. So this post
fulfills (better late than never) my promise to Luis: it covers what happens to the distributed network
when you use old backups to restore NSX Manager.
To get us started, below is a logical diagram of the NSX setup. We
have one Global Logical Router and two Global Logical Switches. Logical Switch
1 has VM ServerWeb01, and Logical Switch 2 has VMs ServerApp01 and ServerApp02. ServerWeb01
and ServerApp02 are running on the same ESXi host, com-a1-esxi01 (not shown in
the diagram).
I used a single cluster with ESXi hosts com-a1-esxi01 and
com-a1-esxi02, both members of the same vDS. The initial (no logical switches
deployed yet) vDS topology is shown below.
I made two backups (actually three backups) of NSX Manager
to an FTP server. Backup1 does not have any logical switches or the logical
router. Backup2 has the logical switches but not the logical router. Backup3 is my end state with all configurations (I took it so I could
quickly go back to a working state during testing).
Elver’s Opinion: I used the built-in backup feature of
NSX for this lab. I’m 5 sigma confident that the same result would be
obtained with any other method that backs up the NSX Manager appliance.
Btw, if I really am over my writing slump, I’ll soon do a follow-up post to
cover the impact on NSX Security of restoring NSX Manager from an old backup.
So we have our vSphere/NSX environment working the way we want
it (end state) when a gamma ray hits the right (or wrong, depending on how you look
at it) chip in one of the DIMMs that happens to be hosting the memory pages of NSX
Manager, corrupting its database and rendering it useless (yes, it could happen,
especially if your ESXi host is onboard the International Space Station).
Elver’s Opinion: Instead of restoring NSX in this ET event, you could call
VMware support. They have some tricks up their sleeves to recover from some
types of database corruptions.
Just before the gamma ray hit the RAM, this is what our vDS
looked like:
And the deployed Logical Switches:
And the deployed Logical Router:
And what com-a1-esxi01 saw:
Good to know: A quick detour to point out something about the CLI output.
Notice that both logical switches (VXLAN 32000 and 32001) have a Port Count of 2. On com-a1-esxi01, each logical switch has
one connected VM running on that host plus one LIF from the logical router.
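The Port Count arithmetic can be sketched in a few lines of Python. This is a toy model, not NSX code: the function name, the LIF names, and the mapping of VXLAN 32000 to Logical Switch 1 are assumptions made for illustration, and it assumes Port Count is simply the local VM ports plus the logical router LIFs on that switch, as described above.

```python
def port_count(local_vms, router_lifs):
    """Ports a host reports for one logical switch: local VM ports plus router LIFs."""
    return len(local_vms) + len(router_lifs)

# View from com-a1-esxi01 (ServerApp01 runs on the other host, so it is not counted here).
# The VNI-to-switch mapping and the LIF names are hypothetical.
host_view = {
    32000: {"local_vms": ["ServerWeb01"], "router_lifs": ["LIF-1"]},
    32001: {"local_vms": ["ServerApp02"], "router_lifs": ["LIF-2"]},
}

counts = {vni: port_count(**view) for vni, view in host_view.items()}
print(counts)
```

Both switches come out at 2 in this model, which matches the Port Count described for the CLI output.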
Now back on the road, we did some ping tests (from
ServerApp01 to ServerWeb01) to show that traffic is flowing between the two
logical switches, via the logical router.
Let’s go ahead and restore from Backup2, the one that has
the logical switches but not the logical router. After NSX Manager finishes the
restore, we log back in to the Web Client and see the logical router missing
from the Network and Security view (which is what would be expected since we
restored from a backup that didn’t have a logical router).
After reestablishing the connection to vCenter, NSX Manager
reaches out to the ESXi hosts (vCenter has nothing to
do with this) and asks them (politely) to get rid of the logical router that it
does not know about (actually, NSX Manager pushes the logical routers that it does know about to the ESXi hosts and
the hosts purge everything else). Below is the CLI output from com-a1-esxi01 showing the logical
switches with the Port Count field down to 1 and
no logical router present.
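The push-and-purge behavior described above can be modeled roughly as follows. This is only a sketch of the idea in Python, not how NSX implements it: the function names, the router name, and the data layout are all hypothetical.

```python
def reconcile_routers(host_routers, pushed_by_manager):
    """Keep only the routers the manager pushed; return (kept, purged names)."""
    kept = {name: lifs for name, lifs in host_routers.items()
            if name in pushed_by_manager}
    purged = sorted(set(host_routers) - set(kept))
    return kept, purged

def port_counts(local_vm_ports, routers):
    """Port Count per VNI: local VM ports plus one port per router LIF."""
    counts = dict(local_vm_ports)
    for lifs in routers.values():
        for vni in lifs:
            counts[vni] = counts.get(vni, 0) + 1
    return counts

# Host state before the restore (hypothetical names): one router with a LIF on each switch.
host_routers = {"router-1": {32000: "LIF-1", 32001: "LIF-2"}}
local_vm_ports = {32000: 1, 32001: 1}

# Backup2 has no logical router, so the restored manager pushes an empty set.
kept, purged = reconcile_routers(host_routers, pushed_by_manager=set())
print(purged)
print(port_counts(local_vm_ports, kept))
```

In this model the router is purged and each switch drops from two ports to one, which is the same change seen on com-a1-esxi01.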
If we try to ping from the VMs to the default gateway
(remember the LIFs are gone), the pings fail.
Just for kicks and giggles, I restored Backup3 and the
logical router returned. I was able to ping between Layer 2 segments via the
logical router.
Elver’s Opinion: I deployed the logical router without a
Control VM because I expect (again, with 5 sigma certainty, for which I expect to be nominated for a Novell Prize) that the results would
be the same had I deployed the Control VM.
Now to restore from Backup1, with no logical switches and no
logical router. After the usual routine of waiting for NSX Manager to finish
the restore and logging back in to the Web Client, I confirmed there were no
logical switches in the Network and Security view.
However (and this should’ve been expected by you), the
dvPortgroups representing the logical switches remained. dvPortgroups are owned
by vCenter and vCenter was not part of the restore process. Looking at the ESXi
host, it still had the information for the logical switches:
Again, this should’ve been expected because the difference
between a VLAN dvPortgroup and a VXLAN dvPortgroup is the set of Opaque Network
fields (VXLAN ID, Multicast address) in the VXLAN dvPortgroup, which were pushed
by vCenter to each of the ESXi hosts in the vDS. NSX Manager gave the Opaque Field values to vCenter. When NSX Manager is restored
from the old backup, it is not aware of the VXLAN dvPortgroups, so it has no
way of telling vCenter to clean up (which is a good thing, by the way). You
won’t be able to make any changes to those logical switches (VXLAN
dvPortgroups) but the Data Plane will continue to run.
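One way to picture the ownership split is as two separate inventories. The Python below is a toy model under stated assumptions: the dvPortgroup names are made up, and keying NSX Manager's inventory by dvPortgroup name is a simplification for illustration, not how NSX stores it.

```python
# Toy inventories after restoring Backup1; all names here are hypothetical.
vcenter_portgroups = {
    "vxw-dvs-ls1": {"vxlan_id": 32000},
    "vxw-dvs-ls2": {"vxlan_id": 32001},
}
manager_switches = {}  # Backup1 predates both logical switches

def manageable(manager_switches, vcenter_portgroups):
    """Portgroups the manager can change or clean up: only those it has a record of."""
    return sorted(pg for pg in vcenter_portgroups if pg in manager_switches)

def orphaned(manager_switches, vcenter_portgroups):
    """Portgroups still in vCenter (and on the hosts) that the manager no longer tracks."""
    return sorted(pg for pg in vcenter_portgroups if pg not in manager_switches)

print(manageable(manager_switches, vcenter_portgroups))
print(orphaned(manager_switches, vcenter_portgroups))
```

Nothing is manageable and both dvPortgroups are orphaned, yet their VXLAN IDs are still in vCenter's inventory, which is why the hosts keep forwarding.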
A quick ping between ServerApp01 and ServerApp02 (which were
running on different hosts) proved VXLAN was working between the VTEPs.
Elver’s Opinion: So we have a split verdict on the impact to
the distributed network of restoring NSX from old backups. Layer 3 (logical
routers) is affected (this is bad) while Layer 2 (logical switches) is not
(this is good). As an aside, I didn’t test with the NSX Edge appliance because, once it is
deployed (configs pushed by NSX Manager), the Edge goes about its business in
the Control/Data planes irrespective of what happens to NSX Manager.