Saturday, August 24, 2013

VMware Datacenters and My Stupidity == Learning a New Trick

So I decided in a demo environment that I would try to move an AD domain controller (AD DC) from one datacenter to another, without realizing that you can't do it directly. Oh sure, there's a slightly convoluted way of doing so by off-lining the VM and doing a cold migration between datacenters, but as I found out very quickly, that path can be filled with some very nasty gotchas.

Some background on this VM: it's just the test authentication server for a proof of concept (POC) system, so it does double duty as the RAS host as well. VPN in, do my work, happy as a clam. Note: this box is the VPN server.

Did I mention it's the VPN server yet? Very important point, and one that makes me damn glad I had an out-of-band management setup.

Anyway, I start the preparations to move this VM by... shutting down the VM. Makes sense, it's a cold migration, right? Yeah, guess who winds up disconnected? So, without a hint of panic, I try to log in to the host itself, thinking that AD going offline just knocked over vCenter. Can't reach the host, and I very quickly realize I can't reach any of the hosts. Now the panic briefly sets in, but it doesn't last long, as I realize I can just log in at the console.

And discover another mild gotcha.

The ESXi shell is disabled by default.

Mind you, I haven't played with 5.1 under the hood that much, so after a quick tour of Google I find out how to access the shell once again: press F2 at the console, log in, navigate to "Troubleshooting Mode Options", enable the ESXi Shell, and exit out. Finally, I've got shell!
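Side note for future me: flip the shell and SSH on ahead of time so I'm not stuck walking to a console again. If I'm remembering my PowerCLI right, something along these lines should do it from vCenter (the hostname is a placeholder, so double-check it and the service keys on your own build before trusting me):

    # Hostname is made up for the example
    $vmhost = Get-VMHost "esx01.lab.local"

    # "TSM" is the ESXi Shell service, "TSM-SSH" is SSH
    $shellSvcs = Get-VMHostService -VMHost $vmhost | Where-Object { $_.Key -eq "TSM" -or $_.Key -eq "TSM-SSH" }

    # Start them now, and set the policy so they come back after a reboot
    $shellSvcs | Start-VMHostService
    $shellSvcs | Set-VMHostService -Policy "On"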

So I log in and run a quick vim-cmd vmsvc/getallvms to locate the vmid of the VM in question, followed by vim-cmd vmsvc/power.on with that vmid. No real panic as yet, but that's because I figured I was already fired anyway, so how much worse could it get? Less than five minutes later, I'm able to log in to the VPN and restore my connection to vCenter.
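For the record, the whole recovery from the ESXi shell boils down to something like this (the vmid of 42 below is just an example; use whatever getallvms reports for your VM):

    # List every VM registered on this host and note the vmid of the one you need
    vim-cmd vmsvc/getallvms

    # Optional sanity check that it really is powered off (42 is an example vmid)
    vim-cmd vmsvc/power.getstate 42

    # Power it back on
    vim-cmd vmsvc/power.on 42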

Once I'm back in, I notice the error logs make no mention of vCenter ever having had an issue. The only event on the whole system was the ESXi shell being enabled on one of the hosts. Then it dawns on me: my original thought that a short bit of downtime wouldn't have any effect on the network was correct; in my raging stupidity, however, I'd forgotten where the VPN lived. That issue gets fixed today.

However, in looking around, I had an "AH HA!" moment that I needed to test (in my home lab... even if it is a POC, it's not there for my tests, just the customer's). What if I join the new host to the current datacenter, do a live migration of the VM to the new host, then remove the host and rejoin it to the correct datacenter? So I ran one quick test with a host carrying one half of my local AD (yes, in the home test lab I run two DCs minimum... I hate rebuilding them. That'll teach me to cut corners), and here's what I found:

Removing a host from the datacenter does not delete the VMs on that host, and joining a host that already has VMs on it doesn't delete them either (but I already knew that). So what's to keep this from working? As near as I can tell, nothing. Just make sure the VMs you're trying to move aren't in a cluster, as the cluster will more than likely try to bring them back after you've removed the host.
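For my own notes, here's roughly what that dance looks like scripted out in PowerCLI. Every name below (vCenter, hosts, datacenters, the VM) is a placeholder out of my head rather than the POC, I'm going from memory on the parameters, and it assumes shared storage between the hosts so the vMotion has somewhere to land, so sanity-check the cmdlets before trusting any of it:

    # Connect to vCenter -- server name is a placeholder
    Connect-VIServer -Server vcenter.lab.local

    # 1. Join the destination host to the *source* datacenter first
    Add-VMHost -Name esx02.lab.local -Location (Get-Datacenter "Datacenter-A") -User root -Password 'notmyrealpassword' -Force

    # 2. vMotion the running VM onto that host (shared storage assumed)
    Move-VM -VM (Get-VM "dc01") -Destination (Get-VMHost "esx02.lab.local")

    # 3. Disconnect the host and remove it from the source datacenter --
    #    the VM keeps running on it, it just drops out of this inventory view
    Set-VMHost -VMHost (Get-VMHost "esx02.lab.local") -State "Disconnected" -Confirm:$false
    Remove-VMHost -VMHost (Get-VMHost "esx02.lab.local") -Confirm:$false

    # 4. Re-add the host, running VM and all, to the *destination* datacenter
    Add-VMHost -Name esx02.lab.local -Location (Get-Datacenter "Datacenter-B") -User root -Password 'notmyrealpassword' -Force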

In conclusion: if you can remove hosts from one VMware datacenter and join them to another, you can effectively live-migrate VMs between datacenters. It just takes a bit of forethought and planning.

And being smart enough to remember where your VPN lives.