Partly Cloudy with a Chance to Fill Up Disk Space: Storage Analytics for Datastore and Guest Disk Capacity Management

By Krishna Raj Raja (@esxtopGuru)

If you follow tech news, you’ve probably heard the buzz about predictive analytics. Facebook can predict when couples will break up, Amazon can ship items you’re likely to purchase before you click the buy button, and Target can predict pregnancy based on shopping patterns. These organizations now use big data not only to understand but also to predict behavior.

Given these advances in the consumer world, you’d expect to see the same predictive power at work in the world of datacenter operations. The ability to accurately forecast behavior, especially for expensive resources like storage, could translate to tremendous gains in efficiency and cost savings. Surprisingly - and unfortunately - even a simple storage metric such as “disk space used by the virtual machine” is difficult to predict and forecast.

In this blog I’ll highlight today’s reactive approach to storage capacity management, explain why it’s so difficult to predict storage capacity behavior and needs in the virtualized datacenter, and share what CloudPhysics is doing to make it much, much easier.

Today’s storage capacity management: reactive, not predictive

In virtualized infrastructure there are two levels of storage space needs: the space required by the guest operating systems and the space required by the hypervisor to store the virtual machines. Running out of disk space in either of these could lead to costly outages, but most administrators today don’t know when they will reach their capacity limits.

As a band-aid, admins set up monitoring solutions at the guest and hypervisor layers to track disk space usage, using static (and often arbitrary) thresholds such as “% free space” to trigger alerts. For example, a 1TB disk drive that’s 90% full still has 100GB of free space and is unlikely to cause an immediate outage, so an alert triggered by this threshold is somewhat meaningless. In many cases disk space usage also remains fairly constant, making static threshold-based alerting useless. A case in point: thick provisioned virtual disks don’t grow and don’t require much disk space headroom, so even at “90% full” there is no disk space capacity risk.
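To make that concrete, here’s a minimal sketch (hypothetical numbers, and certainly not any monitoring product’s actual logic) showing why a “% free” threshold alerts on the wrong disk:

```python
# A "% free" threshold fires on the static disk and stays silent on the
# one that is actually about to fill. Headroom and growth rate matter more.
disks = [  # (name, capacity_gb, used_gb, growth_gb_per_day) -- hypothetical
    ("thick-archive", 1000, 900, 0.0),  # 90% full, static: no real risk
    ("small-logs", 50, 40, 4.0),        # only 80% full, yet fills in ~2.5 days
]

for name, cap, used, growth in disks:
    pct_alert = used / cap >= 0.90                   # classic static threshold
    days_left = (cap - used) / growth if growth else float("inf")
    print(f"{name}: 90%-threshold alert={pct_alert}, days to full={days_left:.1f}")
```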

As a result, most alerts are considered noise or false alarms, and routinely ignored or disabled. In the rare case that an alert points to an actual capacity issue, notification is sent after the problem occurs. So instead of predicting behavior, the admin is put in the position of reacting to behaviors and fighting fires.

Overprovisioning: preventive, not predictive (and expensive too)

In response to the ineffectiveness of capacity monitoring and alerting, overprovisioning storage has become standard accepted practice among administrators - largely as a preventive measure. According to CloudPhysics’ global data set, most organizations maintain an ongoing buffer of 35% more storage capacity than they need. But the cost of prevention is very high, not just in real dollars but in opportunity cost. Take a look at this chart:

The longer you wait, the lower your storage cost


Storage cost (price per GB) is on a steady decline for both spinning and NAND drives, while the performance, reliability and feature sets of enterprise storage have been steadily increasing. For instance, several years ago storage vendors either did not provide dedupe functionality or charged a premium for it. Today most storage vendors provide dedupe and use SSDs in some form or another. The number of storage choices has also increased tremendously over time. So delaying storage purchases not only defers capital expenditure but also gives you the opportunity to buy the latest and coolest technology.

Clearly, the ability to accurately predict storage capacity needs would let you delay storage purchases, avoid buying more than what’s truly necessary to support your virtual datacenter, and at the same time preempt capacity-induced downtime.

Why predictive is hard - and how CloudPhysics makes it easy

We all know virtualized datacenters are very dynamic. VMs are provisioned and deleted, snapshots are created and deleted, and VMs can be Storage vMotioned from one datastore to another by Storage DRS to balance performance or space usage, or when you put a datastore into maintenance mode. In addition, virtual machines may use thick provisioned virtual disks, thin provisioned virtual disks, or linked clones, and each of these virtual disk types has a different storage usage pattern. In such a dynamic environment, predicting future storage usage is complex, since changes in disk usage tend to be very bursty. As a result, simple linear interpolation based on models of disk usage growth rates doesn’t work very well.
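Here’s a small illustration of the problem, with made-up numbers: a straight-line fit over bursty usage produces a days-to-full estimate that is inflated by past bursts yet blind to the next one.

```python
import numpy as np

capacity_gb = 1000.0

# Hypothetical daily space usage (GB) for one datastore: a slow steady
# creep plus short-lived bursts (snapshots, a clone that was later removed).
usage = np.array([500, 501, 502, 502, 503, 700, 701, 520, 521, 522,
                  523, 523, 524, 820, 821, 540, 541, 542, 543, 544])
days = np.arange(len(usage))

# Naive approach: fit a straight line and extrapolate to 100% full.
slope, intercept = np.polyfit(days, usage, 1)
days_to_full = (capacity_gb - usage[-1]) / slope if slope > 0 else float("inf")
print(f"linear fit: {slope:.1f} GB/day -> 'full' in ~{days_to_full:.0f} days")

# The fitted slope is inflated by transient bursts, yet the same model is
# blind to the next burst: the steady creep is only ~2 GB/day, while a
# single ~300 GB burst would consume most of the remaining headroom at once.
```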

With CloudPhysics, all of that dynamism and complexity is captured over time and run through simulation techniques to predict future behavior, enabling admins to anticipate and plan proactively, not reactively - and eliminating the need to overprovision.

Forecasting space usage

For example, to forecast storage space usage, we take frequent snapshots of space usage and use them to determine usage patterns and future behavior. Our predictive analytics can then give you a forecast with a probability, much like a weather forecast (see chart), one that’s always updated based on the latest inputs.
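As a rough intuition for why simulation beats a point estimate, here’s a toy Monte Carlo sketch (hypothetical numbers; the real analytics are far more involved) that resamples observed space-usage deltas to estimate fill probability over several horizons:

```python
import random

capacity_gb, used_gb = 1000.0, 820.0

# Hypothetical day-over-day usage deltas (GB) observed via frequent
# space-usage snapshots: mostly small changes, occasionally bursty.
deltas = [0.5, 1.0, 0.0, 2.0, -1.0, 0.5, 40.0, 0.0, 1.5, -20.0, 3.0, 0.5]

def fill_probability(horizon_days, trials=10_000):
    """Estimate the chance the datastore fills within horizon_days by
    replaying randomly resampled historical deltas."""
    hits = 0
    for _ in range(trials):
        level = used_gb
        for _ in range(horizon_days):
            level += random.choice(deltas)
            if level >= capacity_gb:
                hits += 1
                break
    return hits / trials

for days in (7, 30, 90):
    print(f"{days:>2}-day fill risk: {fill_probability(days):.0%}")
```

Like a weather forecast, the output is a probability per horizon rather than a single predicted date.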


Guest Disk Capacity

At a glance, admins can see how many VMs are at risk of running out of space in the guest.


Likewise, our predictive analytics algorithm monitors guest space usage information (as reported by VMware Tools) and uses it to predict which VMs are at risk of running out of space in the guest. At one quick glance you get the count of VMs at risk at different time intervals, without having to monitor and manage each guest individually.
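Conceptually, the at-a-glance view rolls per-VM predictions up into time-interval buckets. A toy version (hypothetical data):

```python
# Hypothetical per-VM predictions: days until some guest partition fills.
vm_days_to_full = {
    "web-01": 3, "web-02": 45, "db-01": 12, "db-02": 400,
    "build-01": 25, "mail-01": 80,
}

# Roll the per-VM estimates up into the at-a-glance interval counts.
for label, horizon in [("1 week", 7), ("1 month", 30), ("3 months", 90)]:
    at_risk = sorted(vm for vm, d in vm_days_to_full.items() if d <= horizon)
    print(f"at risk within {label}: {len(at_risk)} VM(s) {at_risk}")
```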

And you can drill down to an individual VM to see which partitions are likely to run out of disk space, and even identify unpartitioned space that you can leverage for expansion.

Admins can examine individual VMs and their fill-risk predictions for individual disk partitions.


Datastore Capacity Management

Today, if you have hundreds of datastores you have to click through each of them to find out their space usage. Wouldn’t it be convenient to get the total space usage across your entire datacenter? The datastore space card provides exactly this view. With one click you get the total space usage, broken down into space used by VMs and by non-VM files. You also get an overview of total reclaimable space, broken down by its components.

See datastore space usage datacenter wide, and how much disk space can be reclaimed.

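The rollup itself is conceptually simple; here’s a minimal sketch over hypothetical per-datastore inventories (not our actual data model) of the totals and reclaimable-space breakdown the card computes for you:

```python
# Hypothetical per-datastore inventory (GB).
datastores = [
    {"name": "ds-01", "vm_gb": 800, "non_vm_gb": 40,
     "reclaimable": {"powered-off VMs": 120, "stale snapshots": 60, "orphaned files": 15}},
    {"name": "ds-02", "vm_gb": 450, "non_vm_gb": 25,
     "reclaimable": {"powered-off VMs": 30, "stale snapshots": 90, "orphaned files": 5}},
]

total_vm = sum(d["vm_gb"] for d in datastores)
total_non_vm = sum(d["non_vm_gb"] for d in datastores)
print(f"total used: {total_vm + total_non_vm} GB "
      f"(VM files {total_vm} GB, non-VM files {total_non_vm} GB)")

# Break total reclaimable space down by component across the datacenter.
reclaim = {}
for d in datastores:
    for component, gb in d["reclaimable"].items():
        reclaim[component] = reclaim.get(component, 0) + gb
for component, gb in sorted(reclaim.items(), key=lambda kv: -kv[1]):
    print(f"reclaimable via {component}: {gb} GB")
```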

And for each datastore you can drill down to its individual risk profiles for different fill levels. You can also get a history of past space usage and the top 10 files currently hogging disk space in that datastore.

Drill down into each datastore for more specific storage analytics.

Drill down into each datastore for more specific storage analytics.

Predictive analytics is powerful and has the potential to radically change datacenter operations management. Our mission at CloudPhysics is to eliminate the tedium and deliver analytics that are simple and easy for virtualization administrators to use. Check out our CloudPhysics Storage Analytics (free trial) to see how they can help you be more efficient, effective and proactive in storage capacity management.

Click here to sign up for your 30-day free trial.

Additional information:

Our Storage Analytics run on VMware vSphere 4.1 and higher.

Information about Premium and Community Editions is here.


Noisy Neighbor, Where Art Thou? - Performance Culprit and Victim Analysis Using CloudPhysics Storage Analytics

By Krishna Raj Raja @esxtopGuru

In my last blog post I talked about the free trial of our storage analytics offering. Here, I want to focus on one of the core features of this offering -- the Datastore Contention card.

We all know that virtualization helps drive up utilization of physical resources, but it also makes resource contention almost inevitable. VMware uses the term "noisy neighbor" to denote these resource contention issues. One of the commonly cited barriers to virtualization is the lack of visibility into noisy neighbor issues. Over the years, VMware administrators have trained themselves to spot CPU and memory contention. Detecting datastore contention, however, has always been really hard.

Why is datastore contention difficult to detect?

You can readily detect CPU contention by monitoring CPU ready time, and memory contention by monitoring memory ballooning or swapping metrics on a single host. But what would you monitor to track datastore contention? Unlike CPU or memory resources, datastore contention does not happen within a host. A datastore is a shared resource accessed by multiple physical hosts at the same time. To determine contention hotspots, one first needs to look at the I/O metrics from every host and VM connected to the datastore, then aggregate and correlate them into a unified view. To achieve this today, one has to be both a PowerCLI guru and an Excel savant - and all the while you are likely to put a lot of stress on your vCenter pulling all that performance data.
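To see why this is awkward to do by hand, here’s a stripped-down sketch of just the aggregation step (hypothetical samples; in practice you’d first have to pull these counters host by host via PowerCLI or the vSphere API). Per-(host, VM) I/O samples for a shared datastore must be merged, and latency must be I/O-weighted rather than naively averaged:

```python
# Hypothetical per-(host, VM) samples for one shared datastore:
# (host, vm, iops, avg_latency_ms).
samples = [
    ("esx-01", "db-01",    1200, 18.0),
    ("esx-01", "web-01",    150, 22.0),
    ("esx-02", "web-02",    140, 25.0),
    ("esx-03", "batch-01",  900, 21.0),
]

total_iops = sum(iops for _, _, iops, _ in samples)
# Weight latency by I/O count; a plain per-host average would let a
# quiet host's numbers mask a busy host's pain.
weighted_latency = sum(iops * lat for _, _, iops, lat in samples) / total_iops
print(f"datastore-wide: {total_iops} IOPS at {weighted_latency:.1f} ms avg latency")
```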

Even if one managed to pull all the data, it is still a very tedious job to visually correlate and analyze all the metrics. Storage performance metrics such as IOPS, outstanding IOs, latency and throughput have very intricate relationships, and only a trained eye can spot issues in them. In the end, VMware administrators are faced with analysis paralysis. When the analysis is overwhelming, one might be tempted to simply throw more hardware at the problem in the form of disk spindles, memory or SSD cache - or even to reduce consolidation, undermining the core value of virtualization.

This is specifically why we built our Datastore Contention card. With it you can:

  • Quickly find which datastores are experiencing contention 

  • Find out the overall health and performance of all the datastores

  • Determine when contention is occurring

  • Identify culprit VM(s) and the victims

Which datastores are experiencing contention?

When you have hundreds of datastores, how do you find out which ones are experiencing contention and which are not? In our Datastore Contention card we simplify this by automatically classifying datastores into those that need attention and those that are merely noteworthy.

Need Attention, Noteworthy Tabs


If you want to ignore some parts of your infrastructure, such as test and development environments, we’ve got you covered. Using the filter menu you can select your vCenter, datacenter, compute cluster or storage cluster, and even search for a single datastore name. If you use specific naming conventions for your datastores, you can also leverage our support for regular expressions in the search filter.

Filter Options to Select Inventory

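To make the regular-expression support concrete, here’s the kind of pattern you might use, assuming a hypothetical “&lt;env&gt;-&lt;purpose&gt;-&lt;tier&gt;” naming convention:

```python
import re

# Hypothetical datastore names following an "<env>-<purpose>-<tier>" convention.
names = ["prod-web-ssd01", "prod-db-sas02", "test-web-ssd01", "prod-db-ssd03"]

# Keep only production datastores on SSD-backed tiers.
pattern = re.compile(r"^prod-.*-ssd\d+$")
print([n for n in names if pattern.match(n)])  # ['prod-web-ssd01', 'prod-db-ssd03']
```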


Overall health and performance of the datastores

We provide useful aggregated summary metrics such as overall throughput, average latency and peak latency for all the datastores. These metrics auto-update as you change your datastore selection. Administrators often wonder what normal and abnormal values for these metrics look like. Some operations management tools claim they can do anomaly detection, but those tools are limited by the dataset they have access to: if your baseline is already bad, an anomaly detection algorithm is not much use for spotting existing problems. One of the other powerful aspects of CloudPhysics is access to performance metrics from a wide variety of infrastructures. Using this global dataset we can spot performance outliers much more easily, even if you don’t have a good baseline in your own infrastructure. For instance, if the metric values in your infrastructure exceed not only your baseline but also those of many other similar infrastructures, we can indicate that you have a problem with much greater certainty.

Screenshot showing aggregated performance metrics, highlighting outliers compared to the CloudPhysics global dataset

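Here’s a toy illustration (hypothetical numbers; not our production analytics) of that baseline-versus-global idea: a metric sitting in the tail of the global distribution is a strong signal even when your own baseline is bad.

```python
from bisect import bisect_left

# Hypothetical global distribution of peak datastore latency (ms) drawn
# from many comparable infrastructures.
global_peaks_ms = sorted([4, 6, 7, 8, 9, 10, 12, 14, 15, 18,
                          20, 22, 25, 30, 38, 45, 60, 80, 120, 200])

def global_percentile(value):
    """Fraction of the global sample that sits below this value."""
    return bisect_left(global_peaks_ms, value) / len(global_peaks_ms)

your_peak_ms = 95.0
pct = global_percentile(your_peak_ms)
if pct >= 0.9:
    print(f"peak latency {your_peak_ms} ms exceeds {pct:.0%} of comparable "
          "infrastructures -- likely a real problem, not just a bad baseline")
```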


When does contention occur?

Our analytics continuously monitor the storage performance metrics and identify hotspots within the last 24 hours. Our hotspot detection is not based on a simple thresholding approach; instead, we run complex analytics in our cloud backend to identify issues. Once we identify one or more performance hotspots, we highlight the likely problematic time periods in the performance chart.

Datastore Performance Chart - Highlighting Contention Hotspot

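For a flavor of threshold-free detection, here’s a deliberately simple stand-in (our analytics are more sophisticated than this): flag samples that deviate strongly from a robust baseline instead of crossing a fixed line.

```python
import statistics

# Hypothetical datastore latency samples (ms); three samples form a hotspot.
latency_ms = [8, 9, 7, 8, 10, 9, 8, 45, 52, 48, 9, 8, 7, 9, 8, 10]

median = statistics.median(latency_ms)
mad = statistics.median(abs(x - median) for x in latency_ms)

# Robust z-score: unlike a fixed threshold, this adapts to each
# datastore's own normal behavior and is not skewed by the outliers.
hotspots = [i for i, x in enumerate(latency_ms)
            if mad and abs(x - median) / (1.4826 * mad) > 3.5]
print(f"baseline {median} ms; hotspot sample indices: {hotspots}")
```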


Which VMs are the culprits and which are the victims?

The most important feature of the Datastore Contention card is its ability to automatically find culprit and victim VMs. In the past, this sort of analysis would require a storage performance guru spending hours scouring all the performance data. I have done this analysis myself in the VMware performance group, and I know how painful and tedious it is. This is why I’m really proud of the way we have simplified this analysis and automated it down to a few simple clicks. Now all you have to do is select the datastore and click on the hotspot to identify culprits and victims.

Culprit and Victim List

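For intuition, here’s a toy classification rule (hypothetical data; not our actual model): within a contention window, culprits push a disproportionate share of the I/O, while victims see latency climb despite modest demand.

```python
# Hypothetical per-VM stats during one contention window:
# (vm, iops, outstanding_ios, avg_latency_ms).
window = [
    ("batch-07", 2400, 64, 12.0),
    ("db-01",      90,  4, 55.0),
    ("web-03",     60,  2, 48.0),
]

total_iops = sum(iops for _, iops, _, _ in window)
for vm, iops, oio, lat in window:
    share = iops / total_iops
    if share > 0.5 or oio > 32:    # dominates the datastore's shared queue
        print(f"culprit: {vm} ({share:.0%} of IOPS, {oio} outstanding IOs)")
    elif lat > 30:                 # suffering despite modest demand
        print(f"victim:  {vm} ({lat:.0f} ms latency at {iops} IOPS)")
```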

Two weeks ago we launched our storage analytics product, and the feedback so far has been tremendous. I’m really excited about the direction we’re heading. I’d love to hear more of your feedback and suggestions -- please leave them in the comments section below or send me a tweet. And just a reminder: you can get access to the free product trial by registering on the signup page here.

Who's minding your storage zoo? Try CloudPhysics' new storage analytics.

By Krishna Raj Raja @esxtopGuru

What I love about being a product manager at a startup is that you get to listen to customers every day and apply what you learn to directly influence product and company direction. Our new Storage Analytics, now available as a free trial, is a great example of how this works.

My experience at CloudPhysics over the past few years has amplified a recurring theme I saw while working with customers during my previous 10 years at VMware: virtualization has forever changed the enterprise storage scenario. Gone are the days when storage meant local spinning disks; shared storage and flash are the new normal in datacenters. Similarly, the number of enterprise storage vendors has grown from a handful to several dozen - and the number of different storage arrays you can buy today is mind-boggling.

Storage has truly become a zoo, and the admins managing these storage arrays are the new zookeepers :-). It’s messy - and beyond messy, it ends up being downright wasteful. As counterintuitive as it sounds, with great server consolidation comes great waste, especially storage waste.

This is why I’m really excited about our new Storage Analytics offering, which focuses on two key - and otherwise tedious - management problems that haunt every virtualized datacenter:

  • Storage capacity

  • Storage performance

Storage capacity

There are many fast-and-easy paths to storage waste in a virtualized datacenter, but the path to storage reclamation is typically slow and complicated. Take VM sprawl, for example: it takes just seconds to spin up a new VM, but figuring out if and when it can be deleted takes hours, if not weeks. You can more easily reclaim CPU and memory resources by powering off VMs, but powered-off VMs still take up disk space - and over time, you may forget about them. That’s just one example of space waste; there are many more.
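As one concrete example, here’s a minimal sketch (hypothetical inventory data) of the kind of reclamation check this implies - flagging VMs that have been powered off long enough to be forgotten:

```python
from datetime import date, timedelta

# Hypothetical inventory: (vm, powered_off_since, provisioned_gb);
# None means the VM is still running.
vms = [
    ("web-01",   None,              80),
    ("demo-old", date(2013, 11, 2), 200),
    ("test-42",  date(2014, 1, 20), 60),
]

cutoff = date.today() - timedelta(days=30)
for name, off_since, gb in vms:
    if off_since and off_since < cutoff:
        print(f"{name}: powered off since {off_since}, {gb} GB potentially reclaimable")
```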

CloudPhysics specifically addresses storage capacity problems by providing unique, powerful insights into where and how your storage space is being consumed, along with specific recommendations on how to reclaim it. We’ve been working hard to develop the algorithms that solve this problem while hiding all that complexity from users. The screenshot below, which shows how we address the problem of unused VMs, is just a sample of what you’ll find in our storage analytics.

Storage performance

There’s a strong relationship between storage waste and storage performance. Why? Many virtualization users simply overprovision the number of spindles, read/write cache, flash storage and so on to avoid the pain of troubleshooting storage performance issues. After all, who wants to spend hours and hours combing through performance charts to understand correlations and do root cause analysis? Nobody - but we have figured out how to leverage big data analytics to do it for you. For example, our fantastic new datastore contention analytics (see below) tell you when and where you are experiencing contention, and automatically identify which VMs in your datastore are performance culprits (and which are victims). You can now solve performance issues in literally a few clicks - a lot quicker and more efficient than overprovisioning your storage.

Obviously there’s a lot more to discover in our new Storage Analytics, so be sure to check for more blogs from me in the coming weeks. More importantly, check it out for yourself. Just click here to sign up for your 30-day free trial.

Additional information:

Our Storage Analytics run on VMware vSphere 4.1 and higher.
Information about Premium and Community Editions is here.

CloudPhysics at Virtualization Field Day on March 5th!

UPDATE March 6, 2014:

Videos from yesterday's Tech Field Day are now available (see below), so if you missed the event or want to listen again, here you go! Questions? Feel free to contact @cloudphysics, @virtualirfan or @esxtopGuru on Twitter.

---

Tomorrow, the Tech Field Day crew will be at CloudPhysics HQ here in Mountain View for Virtualization Field Day 3. We’re really looking forward to having all the delegates—Alastair, Andrea, David, Eric, Eric, James, Jeff, Marco, Paul, Rick and Scott—as well as the TFD team—Claire, Stephen and Tom—here with us!!

We’ll be talking about how CloudPhysics uses big data analytics to take all of the data coming out of virtualized systems and turn it into actionable intelligence on how the infrastructure is behaving, where the problems are, how to fix them, where you can save space, where you can improve performance, and more. But we won’t just be talking. :) We’ll be demoing what CloudPhysics can do, live and livestreamed, for everyone.

Join the CloudPhysics team for our VFD3 presentation! 

You can follow along on Twitter with the hashtag #VFD3 (follow us @CloudPhysics and the Tech Field Day team @TechFieldDay), and watch the livestream here or on the Virtualization Field Day site tomorrow, March 5th, from 9:30 - 11:30 AM PST (GMT-8):

---

Who is CloudPhysics? CEO John Blumenthal provides an intro to Silicon Valley startup CloudPhysics, which brings a fresh, SaaS-based approach to virtualization management, combining Big Data science, patent-pending datacenter simulation and modeling, and resource management techniques to deliver actionable analytics for better managing the virtual datacenter.

CTO Irfan Ahmad (@virtualirfan) explains CloudPhysics’ unique approach to virtualization management. Leveraging a daily stream of 120+ billion samples of configuration, performance, failure and event data from its global user base, and utilizing patent-pending datacenter simulation and unique resource management techniques, CloudPhysics empowers enterprise IT to drive Google-like operations excellence using actionable analytics from a large, relevant, continually refreshed data set.

@virtualirfan takes a look under the hood of the company’s predictive analytics technology, starting with the handling of the firehose of data collected from datacenters around the world, followed by an examination of the overall architecture, simulation techniques, discovering causality, predicting potential outages, and supported use cases.

Krishna Raj Raja (@esxtopGuru) gives a live demo of how to use CloudPhysics to pinpoint hidden operational hazards in VMware vSphere environments.

Krishna gives a live demo of how to use CloudPhysics to troubleshoot storage performance and configuration issues in your virtual datacenter, including a demo of custom analytics capabilities (with CloudPhysics Card Builder).

Many IT teams are considering SSDs to improve datacenter performance. Krishna and Irfan explain how CloudPhysics Cache Benefits Analysis helps IT teams understand if - and exactly where - SSD will deliver performance advantages in their particular datacenter.

You can have your performance data and graph it too

One of the things we’re known for at CloudPhysics is the mashup: taking data from different sources and mashing it together. Instead of requiring users to hunt through multiple operational data sources, and then piece together relevant data by hand into spreadsheets or other tools to generate tables and charts, we pre-package the data together with analytics to automatically generate the charts, tables and - most importantly - answers users need.

As David Davis notes in his recent article on vSphere management tools, this is precisely when it's worth paying for a management tool.

Something new to mash up and visualize: performance data

We’re proud to add vSphere performance data counters to the CloudPhysics mashup mix. With this new capability, you can now access performance data counters as properties in Card Builder (our reports and analytics tool). You can mash up both performance and configuration data from different objects, which we then automatically chart and graph, creating a visualization that makes it easy to see what’s going on. You can specify a time window and toggle each property that’s displayed, enabling further exploration. All without impacting the performance of vCenter or any other part of the infrastructure itself.*

To be concrete, this new capability in CloudPhysics means you can do things like:

  • compare host CPU usage with the CPU ready time for all VMs on the host

  • compare VM network performance to NIC utilization and network configuration

  • compare the storage read and write latencies for a host and VMs (see graph below)

Here you can see read latency consistently spiking to between 700 and 1000 ms, while write latency stays relatively constant under 100 ms. This suggests you’ve got a candidate for caching!

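Under the hood, a mashup like this starts with something conceptually simple: aligning counter streams by timestamp so they share one time axis. A minimal sketch with hypothetical 20-second samples:

```python
# Hypothetical 20-second "real time" samples: seconds-offset -> value (ms).
read_latency = {0: 720, 20: 150, 40: 980, 60: 130, 80: 850}
write_latency = {0: 60, 20: 75, 40: 80, 60: 70, 80: 65}

# Align the two counter streams on one time axis, as the chart does.
for t in sorted(set(read_latency) | set(write_latency)):
    r, w = read_latency.get(t, "-"), write_latency.get(t, "-")
    print(f"t+{t:>3}s  read {r:>4} ms  write {w:>3} ms")
```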


Nearly 100 performance data properties

And that’s just scratching the surface. We’ve exposed almost 100 performance data counters as properties so far, and will bring more online as customers have use for them. Every property you pick gets automatically plotted and overlaid as shown in the screenshot above, letting you choose which properties to display together and over what time period. You can even zoom in on a particular slice of time.

Here’s a sample list of properties available at the host and VM levels:

  • Storage path read average

  • Storage path write average

  • Storage path total read latency

  • Storage path total write latency

  • Memory usage average

  • Memory swap usage average

  • CPU usage average

  • …and over 90 more


Making it real time

Where it really gets interesting is when you’re trying to diagnose or map the real performance of specific hosts or VMs over time. CloudPhysics performance data is as near real time as you can get, down to 20-second granularity (VMware calls that “real time”). And unlike other vSphere analytics or monitoring solutions, including those from VMware itself, we don’t roll the data up after one hour. We maintain that real-time data for you and let you dig into anywhere from a few minutes to one month of performance data - without losing resolution.
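A quick back-of-the-envelope (my arithmetic, not a product spec) shows what keeping full resolution means:

```python
# 20-second samples, kept at full resolution for up to a month:
samples_per_hour = 3600 // 20                   # 180
samples_per_month = samples_per_hour * 24 * 30  # 129,600 points per counter
rollup_per_month = 24 * 30                      # ~720 points if rolled up hourly
print(samples_per_hour, samples_per_month, rollup_per_month)
```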

As a little icing on the performance data cake, you can export both the data and the visualization as PDF, CSV, or image files to put into slides or send around to your team to show them how we’re taking the guesswork out of virtualization management. :)


Getting started with performance data

To help you get started, we’ve got a video tutorial for you. If you’re already a CloudPhysics user and would like to try it out, let us know and we’ll get in touch. And if you’re not, now’s a good time to get started - just let us know you'd like to try out performance data!

Here’s a tutorial to get you up and running using performance data by walking through a simple use case.


*That’s the beauty of the CloudPhysics SaaS platform and Observer combination: it collects the data as it’s generated - no running scripts against every vCenter every hour to grab the granular info before it’s rolled up, then storing the data, merging the data, crunching the numbers, and figuring out how to chart and graph it before finally finding out what’s going on. Doing all of this manually via scripts would cause vCenter to hiccup at the least, and prevent execution of things like provisioning in the worst case.