Wednesday, December 3, 2014

VNXe- proof of how to do storage configuration wrong... very wrong.

~This article was originally to be published in February of 2014~

For the last few weeks, I've been dealing with an environment where storage has been massively misconfigured. Dual VNXe 3100s, only one in use. Storage network running on a single, incredibly slow HP 1410 switch, single NIC, single path... pretty much everything you can do wrong and still have it work.

So I've been trying to fix this in order to get backups to run in a reasonable amount of time- added dedicated switching, trying to get the second VNXe available to the local VMware cluster, etc. The first thing I run into is how incredibly slow this system is- anything I do literally takes 160 seconds to complete. How do I know? I sat there with a stopwatch and clocked several of my actions, because I couldn't believe how slow it seemed- I must have been spoiled! I found to my horror that it's literally 2 minutes, 40 seconds per action. Dear god, I hate this thing.

Going over the VM hosts, I come to find that the datastore mappings aren't consistent- each host has most datastores in common, but there are several that are only on one host, or two hosts. So I spend some time correcting this when I come across another horrible discovery- there are duplicate entries for hosts, and some are flat out wrong (or in one case, completely identical!).
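For anyone facing the same audit, here is a minimal sketch of the kind of check I was doing by hand: given which datastores each host has mounted, list what every host is missing. The host and datastore names are made up for illustration, and how you collect the inventory (vSphere client export, script, whatever) is left out entirely.

```python
# Hypothetical inventory: host name -> set of datastore names mounted on it.
# These names are illustrative only, not from the real environment.
mounted = {
    "esx01": {"ds-files", "ds-sql", "ds-exchange"},
    "esx02": {"ds-files", "ds-sql"},
    "esx03": {"ds-files", "ds-exchange", "ds-scratch"},
}


def missing_datastores(mounted):
    """Return {host: sorted datastores that other hosts have but this one lacks}."""
    # Union of every datastore seen on any host.
    all_datastores = set().union(*mounted.values())
    report = {}
    for host, datastores in mounted.items():
        gap = all_datastores - datastores
        if gap:
            report[host] = sorted(gap)
    return report


for host, gap in sorted(missing_datastores(mounted).items()):
    print(f"{host} is missing: {', '.join(gap)}")
```

Hosts that already see everything simply don't show up in the report.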

This is where I made the critical mistake- I had gotten the switching in place, I had gotten the SANs moved over, but I hadn't cleaned up the access lists. I started on that, with the rational thought that it would work in a sensible manner. In my mind, I would simply change access over from IP allow lists to IQNs (simpler to manage, right?)- oh so wrong.

This thing had been set up to talk to vCenter, and so talk it did. For some ungodly reason, it decided that since vCenter knew about the datastore associated with the LUN, it could not change the access method (IQN vs IP)- and threw up this beautiful error:

"The changes could not be applied the following error was encountered:
 The datastore name is already in use on the ESX server
error code: 0x600d50"

WTF? Proper MPIO would allow multiple connections to the same LUN from the same initiator without error, so why on earth would this matter in the least? So, being the trusting, happy-go-lucky admin that I am, I click OK.

And developed an ulcer. That instant.

Two of the three hosts were kicked off the VNXe immediately.

OK, I think to myself, easy fix. I'll just go back in and give myself permission again, undoing my changes.

Oh no. No no no- it's not that easy. It's nowhere near that easy. Nothing I do is letting me reattach these LUNs. _Nothing_. Now I'm getting spooked- is the data even still there? Did I somehow just obliterate the customer's data? I dig through emails and documents, finding the credentials I need to get on EMC's support site, where I run into the first hurdle.

Error code 0x600d50 is apparently not something customers need to know about, so if you get that error, you're pretty well screwed. It also doesn't help that every reference to it is in regards to renaming datastores on the VMware side without doing it on the storage side, which apparently does bad things. But this doesn't concern me, right? I made no such change!

There's apparently a lot more to this error than one condition- but it's so piss poorly documented that one will never find out. Now I'm really panicked, so I click on the chat with support button. I fill in all the details, and even manage to find the serial number for the device I'm having a headache with- and then click "submit".

And promptly get told support's not available via web chat.


Calling in to EMC's support line, I get told immediately by a very friendly recording that I'll get quicker support... if I use the chat client. Yep, this was going to be one of those calls. After navigating the menu system, I get to a young man who is completely lost by the gibberish coming out of my mouth- but he does get me in touch with a woman who understands me perfectly (I wish I could say the same about what she was saying)- and I manage to secure a callback promise from her.

I wish I could say the nightmare ends here. The customer is down, and I've notified them that I'm working on it. They're mostly OK with that. At this point I'm pretty freaked, as I'm waiting on a callback that I'm not even sure is coming. 30 minutes later, the callback finally happens. I walk the tech through what's going on, and he immediately starts trying to do all the things I had done.

Which is about when I really start to shiver- the tech was actually expecting it to work too. He gets the senior tech involved, whom I overhear in the background saying that I need to remove all the datastores from the VM hosts.

Can't do it- VMware refuses to unmount the datastores as long as there's a VM on them. So after arguing with support and realizing I'm not going to get anywhere otherwise, I power off every VM and rescan the HBAs...

Which changes nothing.
At all.

And now I'm freaked because the support techs want me to remove the VMs from inventory- now, when I can't see the datastores in order to record which VMs go where (why is this a problem, you ask? Because I just took over the environment!).

I flat out refuse- and we go on to the next step. Which involves resetting the SAN. Needless to say, I refused to do that too.

What did eventually work, you ask? Going into the VNXe where the problem originally existed and removing every host entry. Letting the SAN sit and fiddle with itself for a while, while rescanning the HBAs on the hosts. This at least cleared out the datastore lists- not a reassuring thing in the least, mind you. Then I re-added the hosts one by one until I had them back in using the access methods I wanted them to use.

And I attached the first LUN.

And waited forever, or so it seemed. This was when I pulled out the stopwatch again to figure out how long this was taking. Each LUN I reattach is taking ~3 minutes. 44 LUNs.

44 LUNs at 3 minutes per. 132 minutes to reattach these datastores. 132 minutes before I can even attempt to get the customer back online. 132 minutes of mind-numbing, nail-biting, customer-frustrating hell.
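If you want to run the same back-of-the-envelope math for your own array, a tiny sketch follows. The 44 LUNs and ~3 minutes per LUN are the numbers above; the 160-second figure is the per-action time from my earlier stopwatch runs, included only as the optimistic case. The function name is just something I made up.

```python
def reattach_estimate(lun_count, seconds_per_lun):
    """Return (minutes, seconds) to reattach LUNs one at a time, no bulk option."""
    total_seconds = lun_count * seconds_per_lun
    return divmod(total_seconds, 60)


# Rounded figure: ~3 minutes (180 s) per LUN.
print(reattach_estimate(44, 180))  # (132, 0)

# Optimistic figure: the 160 s per action clocked earlier.
print(reattach_estimate(44, 160))  # (117, 20)
```

Either way you are looking at roughly two hours of clicking and waiting.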

So, I ask... Is there some way I can do these in bulk? "Nope"

I finally managed to thank the tech for his time and get off the phone. Whereupon I've been stewing for over 2 hours, thinking about how utterly stupid this is. Fuming that I never had issues like this with EqualLogic, HUS, or even open source Linux iSCSI targets.

Why had I never run into these problems? Because none of those care one iota about what's accessing the volume. None of them try to do any screwed up LUN-per-datastore mapping, or try to enforce single-host access or anything like it- why? Because they assume the storage admin knows what he's doing.