Just Because You Can, Doesn’t Mean You Should

First, I haven’t been ignoring my duties to this blog. The demands of my job have kept me from posting on a regular schedule, but they haven’t stopped me from collecting some great topics for future articles, this being one of them.

ESXi 5.5 Purple Screen of Death = Oh My!

In our quest to migrate our environment to ESXi 5.5 and View 5.3, we had to do some maintenance on some of our file servers. The quick and dirty approach was to build massive multi-TB VMDKs for robocopy jobs as we migrated to newer file servers. Part of this process included kicking off Veeam backups of these temporary file servers. During the course of the reverse incremental job (multiple rounds of robocopy!), we encountered some PSODs (purple screens of death) on the temporary cluster where the file servers were located.

Seeing as this was my first experience with a PSOD (yes, I know, I’m so lucky!), I proceeded to open an SR (support request) with VMware and Veeam. Then I began retracing my steps, trying to understand whether it was a misconfiguration, something I didn’t enter correctly, etc. The cause of the host failures surprised me, considering it involves such a big selling feature of ESXi 5.5. What we discovered is that when a VMDK is larger than 1.9TB, its snapshots are created in the SESparse format, and ESXi 5.5 has a memory heap issue with SESparse that can cause a host failure. So when we kicked off our Veeam backups (Veeam uses the VMware snapshot model for backups), the snapshot files were in the SESparse format, and after 45 minutes the hosts failed.
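
To make the risk concrete, here is a quick back-of-the-napkin Python sketch that flags which disks in an inventory would get SESparse snapshots. The disk list is hypothetical, and the ~1.9TB cutoff is just the threshold VMware support gave us, not a value read from a live host:

    # Minimal sketch: flag VMDKs whose snapshots would land in the SESparse
    # format on ESXi 5.5. The inventory below is hypothetical; the ~1.9TB
    # cutoff is the threshold VMware support quoted to us.

    SESPARSE_THRESHOLD_GB = 1.9 * 1024  # ~1.9TB expressed in GB

    # Hypothetical inventory: (vm name, vmdk name, provisioned size in GB)
    vmdks = [
        ("FILESRV-TEMP01", "FILESRV-TEMP01.vmdk",   4096),
        ("FILESRV-TEMP02", "FILESRV-TEMP02_1.vmdk", 2048),
        ("VIEW-GOLD01",    "VIEW-GOLD01.vmdk",        60),
    ]

    for vm, vmdk, size_gb in vmdks:
        fmt = "SESparse" if size_gb > SESPARSE_THRESHOLD_GB else "VMFSsparse"
        flag = "  <-- snapshot/backup risk on 5.5" if fmt == "SESparse" else ""
        print(f"{vm:15s} {vmdk:25s} {size_gb:6.0f} GB  snapshots: {fmt}{flag}")

Anything the script flags is a disk where a Veeam job (or any snapshot-based operation) would push you into SESparse territory.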

The resolution from VMware was to reduce all VMDKs to below the 1.9TB threshold and to wait patiently for an upcoming patch to ESXi 5.5 due in July. Which brings me to my final thought, complete with flashbacks of Jerry Springer: just because you can, doesn’t mean you should.

Just Because You Can, Doesn’t Mean You Should

One of the big selling points of ESXi 5.5 for me was support for VMDKs larger than 2TB. Think of the possibilities that could bring an organization: large file servers, Exchange datastores, SQL databases, etc. But why? Why would you want to subject yourself and your company to the risk of placing all of your important files on one big drive? Why not spread that risk across multiple datastores, servers, etc.? It flies in the face of KISS (keep it simple, stupid), which my friend Brad Christian constantly reminds me of!

So going forward, tread cautiously with each new feature a piece of software comes out with. It may be great on paper, but does it really fit your organization, your initiatives, your systems?

SSDs saved our View Pod

Synology DS3612xs SSDs

I’ve talked with several colleagues in the virtualization arena, and one of the things they all say is “VDI is tough, it’s always changing, there is nothing harder than virtualizing desktops!” I have learned this lesson the hard way. Two years ago our company deployed VMware’s VDI solution, View (now Horizon View), as a proof of concept (POC) to a group of test users. These users ranged from task workers to advanced users running CPU- and graphics-intensive applications. That test group was roughly 10 people; 6 months later we deployed VDI in waves to various departments and grew to over 50 users.

Now, before I go any further, I want to give you some background on the equipment we used to deploy the POC:

  • Dell PowerEdge R620 – Intel Xeon E5-2690 2.9 GHz, 128GB RAM, (6) 1Gb NICs
  • HP ProCurve 5412zl L2/L3 Switch
  • Dual Dell PowerConnect 24 Port Gigabit Managed Switches (SAN Network)
  • Dell EqualLogic PS6100 (48TB Raw) – Total IOPS: 1,300

The POC had been deployed before I joined the company, and at the time the VDI experience was very good. But as we continued into production, we started seeing performance hits at random times. I started in April of 2012 and was working in another area of IT, but was quickly attracted to the allure of VDI and everything VMware. So in my spare time I started researching VDI performance issues; I learned about PCoIP offloading, CPU and RAM issues, sizing Gold Images properly, etc. I threw out everything I knew and started over with new Gold Images, only to hit the same performance issues. This all happened over 15 months.

The problem was right in front of us…

Then it occurred to me (read: Google, forums, talking with vExperts) that storage was our issue. I started reading everything about total and peak IOPS and how they relate to VDI. I started scoring our various Gold Images and discovered that some of our images had peak IOPS of over 150! Do the math: the EqualLogic we were running had a peak of 1,300 IOPS, and at this point we had over 180 users. So do the easy math: 180 users x 25 IOPS (average) = 4,500 IOPS! Houston, we have a problem.
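
If you want to sanity-check your own environment, the same math is trivial to script. Here is a quick Python sketch using the numbers above; the 25 IOPS/user average and the 150 IOPS peak per image are our rough estimates, not vendor figures:

    # Quick back-of-the-napkin IOPS check, using the numbers from above.
    # 25 IOPS/user is our rough steady-state average; our worst Gold Images
    # scored peaks of 150+ IOPS.

    array_peak_iops   = 1300   # EqualLogic PS6100 ceiling
    users             = 180
    avg_iops_per_user = 25
    peak_iops_per_img = 150

    steady_state = users * avg_iops_per_user
    print(f"Steady-state demand: {steady_state} IOPS vs. array peak {array_peak_iops}")
    print(f"Oversubscribed by:   {steady_state / array_peak_iops:.1f}x")

    # A login/boot storm looks even worse if every desktop bursts at once:
    print(f"Worst-case burst:    {users * peak_iops_per_img} IOPS")

Even at the friendly steady-state average, we were asking the array for more than three times what it could deliver.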

The Solution…sorta

So what did we do? It’s simple, but not easy! We realized that as we grew our VDI environment, we improved everything except storage. We upgraded to bigger, more powerful hosts, improved our core switch architecture, expanded to larger SAN switches, and upgraded our power and environmental systems. We did every upgrade except storage. This is not a slight towards our team or myself; we just didn’t have the knowledge and experience to truly understand what we were dealing with in VDI. Getting back to the solution (that is the title of this article, right?), we started meeting with various vendors and sizing solutions, and in the meantime I got the idea to buy a Synology NAS, load it up with some SSDs, and give us a fairly inexpensive band-aid until we could properly implement a permanent storage solution.

In the left corner….Synology DS3612xs


So let’s talk about the Synology DS3612xs, because this thing is a beast! I chose this model specifically for its 12-bay capacity and its ease of transition into our test lab environment (I’m begging my boss to buy it for my home lab!). The specs for this thing are really impressive:

  • 12 Drive Bays (Expandable to 36 with Add On Chassis)
  • Intel Core i3 CPU
  • 8GB RAM
  • (4) 1Gb NICs
  • Available PCIe slot (did someone say 10Gb?)
  • vSphere 5 support with VAAI
  • SSD TRIM Support
  • Synology Awesomesauce DSM operating system

In the right corner….Intel 520 Series SSD and 10Gb Fiber

I went with Intel 520 Series 480GB solid state drives because of their reliability, cost, and total IOPS count (42,000 read / 50,000 write). Because of peak IOPS bursts, and the horror stories I have heard about running SSDs over 1Gb links, I wanted a nice big pipe to our SAN network, so I went with an Intel SFP+ card that supports 10Gb fiber. It fit perfectly into our SAN switches, and I was excited to get everything put together!
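
For the curious, here is a rough Python sketch of the pipe math that pushed me toward 10Gb. The block sizes and the 4,500 IOPS figure are assumptions for illustration, not measurements from our arrays:

    # Rough sketch of why a 1Gb uplink worried me: at common VDI block sizes,
    # a few thousand IOPS can saturate a single gigabit link.

    def throughput_gbps(iops, block_size_kb):
        """Convert an IOPS figure at a given block size to gigabits per second."""
        return iops * block_size_kb * 1024 * 8 / 1e9

    for block_kb in (4, 16, 32, 64):
        gbps = throughput_gbps(4500, block_kb)
        verdict = "fits in 1Gb" if gbps < 1.0 else "needs 10Gb"
        print(f"4500 IOPS @ {block_kb:2d}KB blocks = {gbps:4.2f} Gb/s  ({verdict})")

Small random reads fit comfortably, but once the blocks get bigger (think recomposes and boot storms), a single gigabit link becomes the bottleneck long before the SSDs do.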

Did it fix the IOPS issue?

Yes it has! But then, that was its intention all along. We took the time, did the research, and assembled a reasonable budget and solution that could solve an immediate crisis for our end users. Is it a permanent solution? Absolutely not! But we have seen an immediate performance improvement across the board, from recomposes and pool creation to end-user UI responsiveness. It has been really nice to finally not just know about the problem, but to understand it.

The next steps?

Now that we have our band-aid, we can focus on our permanent storage solution. I am really excited to start working with various vendors and stand up some POCs to see how the various solutions work with our systems and processes. Until then, I get a lot of joy watching the performance metrics every morning as login storms go smoothly, and watching VMs clone in seconds as opposed to 90 minutes! I will update this article as I can with some specific performance charts. But for now I am getting ready for our next set of problems after storage: virtualizing graphics. But isn’t that why we are doing this, to learn, understand, solve problems, and make things better? I know I am!