Noisy Neighbor Where Art Thou? - Performance Culprit and Victim Analysis Using CloudPhysics Storage Analytics

By Krishna Raj Raja @esxtopGuru

In my last blog post I talked about the free trial of our storage analytics offering. Here, I want to focus on one of the core features of this product offering: the Datastore Contention card.

We all know that virtualization helps to drive up utilization of physical resources, but it also makes resource contention almost inevitable. VMware uses the term "noisy neighbor" to denote these resource contention issues. One of the commonly cited barriers to virtualization is the lack of visibility into noisy neighbor issues. Over the years, VMware administrators have trained themselves to spot CPU and memory contention issues. However, detecting datastore contention has always been really hard.

Why is datastore contention difficult to detect?

You can readily detect CPU contention by monitoring CPU ready time and memory contention by monitoring memory ballooning or swapping metrics on a single host. But what would you monitor to track datastore contention? Unlike CPU or memory resources, datastore contention does not happen within a host. A datastore is a shared resource accessed by multiple physical hosts at the same time. To determine contention hotspots, one first needs to look at the I/O metrics from every host and VM connected to the datastore, and then aggregate and correlate them into a unified view. To achieve this today, one has to be both a PowerCLI guru and an Excel savant. All the while, you are likely to put a lot of stress on your vCenter to pull all that performance data.
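To make the aggregation step concrete, here is a minimal Python sketch. It assumes the per-host samples have already been exported as simple records; the `IoSample` layout, the field names, and the sample values are illustrative assumptions, not a vSphere or CloudPhysics format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class IoSample:
    # Hypothetical record: one VM's I/O on one datastore, as seen from one host.
    host: str
    vm: str
    datastore: str
    iops: float
    latency_ms: float


def datastore_view(samples):
    """Merge per-host, per-VM samples into one summary per datastore."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[s.datastore].append(s)

    view = {}
    for ds, rows in grouped.items():
        total_iops = sum(r.iops for r in rows)
        # Weight latency by IOPS so busy VMs count for more than idle ones.
        weighted = sum(r.iops * r.latency_ms for r in rows)
        view[ds] = {
            "hosts": sorted({r.host for r in rows}),
            "total_iops": total_iops,
            "avg_latency_ms": weighted / total_iops if total_iops else 0.0,
        }
    return view


samples = [
    IoSample("esx01", "web-1", "ds-gold", 400.0, 5.0),
    IoSample("esx02", "db-1", "ds-gold", 1600.0, 20.0),
    IoSample("esx02", "test-1", "ds-bronze", 100.0, 8.0),
]
summary = datastore_view(samples)
```

Even this toy version shows why a single-host view is not enough: the load on `ds-gold` only becomes visible once the samples from both hosts are combined.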

Even if one managed to pull all the data, it is still a very tedious job to do visual correlation and analyze all the metrics. Storage performance metrics such as IOPS, outstanding I/Os, latency and throughput have very intricate relationships. Only a trained eye can spot issues from these metrics. In the end, VMware administrators are faced with analysis paralysis. When the analysis is overwhelming, one might be tempted to simply throw more hardware at the problem in the form of disk spindles, memory or SSD cache. Or one may simply reduce the consolidation ratio, undermining the core value of virtualization.

This is specifically why we built our datastore contention analysis card.  With it you can:

  • Quickly find which datastores are experiencing contention 

  • Find out the overall health and performance of all the datastores

  • Determine when contention is occurring

  • Identify culprit VM(s) and the victims

Which datastores are experiencing contention?

When you have hundreds of datastores, how do you find out which ones are experiencing contention and which ones are not? In our Datastore Contention card, we simplified this by automatically classifying the datastores into those that require your attention and those that are noteworthy.

Need Attention, Noteworthy Tabs


If you want to ignore some parts of your infrastructure, such as test and development environments, we've got you covered. Using the filter menu you can select your vCenter, datacenter, compute cluster and storage cluster, and even search for a single datastore name. If you use specific naming conventions for your datastore names, you can also leverage our support for regular expressions in the search filter.
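As a rough illustration of filtering by naming convention, the sketch below uses Python's `re` module. The datastore names and the convention are made up, and the exact regular expression dialect accepted by the card's search filter is not specified here, so treat the pattern as an assumption.

```python
import re

# Hypothetical datastore names following an env-site-tier-number convention.
datastores = [
    "prod-sfo-gold-01",
    "prod-sfo-silver-02",
    "test-sfo-gold-01",
    "dev-nyc-bronze-03",
]

# Keep only production datastores on the gold or silver tier.
pattern = re.compile(r"^prod-.*-(gold|silver)-\d+$")
selected = [name for name in datastores if pattern.search(name)]
```

A pattern like this excludes the test and development datastores without listing each one by name.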

Filter Options to Select Inventory


Overall health and performance of the datastores:

We provide useful aggregated summary metrics such as overall throughput, average latency and peak latency for all the datastores. These metrics update automatically as you change your datastore selection. Administrators often wonder which values are normal and which are abnormal for some of these metrics. Some operations management tools claim that they can do anomaly detection, but these tools are limited by the dataset that they have access to. If you already have a bad baseline, then an anomaly detection algorithm is not that useful for spotting existing problems. One of the other powerful aspects of CloudPhysics is its access to performance metrics from a wide variety of infrastructures. Using this global dataset we can spot performance outliers much more easily, even if you don’t have a good baseline in your infrastructure. For instance, if the metric values in your infrastructure not only exceed your baseline but also exceed those of many other similar infrastructures, we can indicate that you have a problem with much greater certainty.
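The difference between a local baseline and a global dataset can be sketched in a few lines of Python. This is not the CloudPhysics algorithm; the percentile-rank approach, the 0.95 cutoff, the labels, and all the latency values are assumptions chosen only to illustrate the idea.

```python
def percentile_rank(value, population):
    """Fraction of the population that is at or below value."""
    if not population:
        raise ValueError("population must not be empty")
    return sum(1 for p in population if p <= value) / len(population)


def classify(value, local_history, global_peers, cutoff=0.95):
    """Compare one metric value against local history and a peer dataset."""
    local_outlier = percentile_rank(value, local_history) >= cutoff
    global_outlier = percentile_rank(value, global_peers) >= cutoff
    if local_outlier and global_outlier:
        return "problem (high confidence)"
    if global_outlier:
        return "problem hidden by a bad baseline"
    if local_outlier:
        return "unusual for you, normal elsewhere"
    return "normal"


# Made-up average latencies in ms. The local baseline is already bad.
local_history = [38, 40, 41, 42, 43, 44, 45, 45, 46, 47]
global_peers = [3, 4, 5, 5, 6, 7, 8, 9, 12, 15,
                18, 20, 22, 25, 30, 32, 35, 36, 39, 40]
verdict = classify(42, local_history, global_peers)
```

Here a latency of 42 ms looks ordinary against the local history, so a baseline-only detector would stay silent, while the peer comparison still flags it.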

Screenshot showing aggregated performance metrics highlighting outliers compared to the CloudPhysics Global Dataset


When does the contention occur?

Our analytics continuously monitor the storage performance metrics and identify hotspots within the last 24 hours. Our hotspot detection is not based on a simple thresholding approach. Instead, we run complex analytics in our cloud backend to identify issues. Once we identify one or more performance hotspots, we highlight the likely problematic time periods in the performance chart.

Datastore Performance Chart - Highlighting Contention Hotspot


Which VMs are the Culprit and which are the Victims?

The most important feature of the datastore contention card is its ability to automatically find culprit and victim VMs. In the past, this sort of analysis would require a storage performance guru to spend hours scouring through all the performance data. I have done this analysis myself in the VMware performance group and I know how painful and tedious it is. This is why I’m really proud of the way we have simplified this analysis and automated it down to a few simple clicks. Now all you have to do is select the datastore and click on the hotspot to identify culprits and victims.

Culprit and Victim List

Two weeks ago we launched our storage analytics product and we have been receiving tremendous feedback so far. I’m really excited about the direction that we are heading. I would love to hear more of your feedback and suggestions -- please provide them in the comments section below or send me a tweet. And just a reminder: you can get access to this free product trial by registering on the signup page here.

Who's minding your storage zoo? Try CloudPhysics' new storage analytics.

By Krishna Raj Raja @esxtopGuru

What I love about being a product manager in a startup is you get to listen to customers every day and leverage your learnings to directly influence product and company direction. Our new Storage Analytics, now available as a free trial, is a great example of how this works.

My experience at CloudPhysics over the past few years has amplified a recurring theme I saw while working with customers in my previous 10 years at VMware: virtualization has forever changed the enterprise storage scenario. Gone are the days when storage meant local spinning disks; shared storage and flash are now the new normal in datacenters. Similarly, the number of enterprise storage vendors has grown from a handful to several dozen - and the number of different storage arrays you can buy today is mind-boggling.

Storage has truly become a zoo and the admins managing these storage arrays are the new zookeepers :-). It’s messy - and beyond messy, it ends up being downright wasteful. As counterintuitive as it sounds, with great server consolidation comes great waste,  especially storage waste.

This is why I’m really excited about our new Storage Analytics offering, which focuses on two key - and otherwise tedious - management problems that haunt every virtualized datacenter:

  • Storage capacity

  • Storage performance

Storage capacity

There are many fast-and-easy paths to storage waste in a virtualized datacenter. But the path to storage reclamation is typically slow and complicated. Take VM sprawl for example: it takes just seconds to spin up a new VM, but figuring out if and when it needs to be deleted takes hours, if not weeks. You can more easily reclaim CPU and memory resources by powering off the VMs, but powered off VMs still take up disk space. Over time, you may forget about them. That’s just one example of space waste - there are many more.

CloudPhysics is specifically addressing storage-induced capacity problems by providing unique, powerful insights into where and how your storage space is being consumed along with specific recommendations on how to reclaim the space. We’ve been working hard to develop the algorithms that solve this problem, while making all that complexity transparent to users. The screenshot below, which shows how we address the problem of unused VMs, is just a sample of what you’ll find in our storage analytics.

Storage performance

There’s a strong relationship between storage waste and storage performance. Why? Many virtualization users simply overprovision the number of spindles, read/write cache, flash storage etc. to avoid the pain of troubleshooting storage performance issues. After all, who wants to spend hours and hours combing through performance charts to understand correlations and do root cause analysis? Nobody does - but we have figured out how to leverage big data analytics to do it for you. For example, our fantastic new datastore contention analytics (see below) tell you when and where you are experiencing contention, and automatically identify which VMs in your datastore are performance culprits (and which are victims). You can now solve performance issues in just a few clicks - which is a lot quicker and more efficient than overprovisioning your storage.

Obviously there’s a lot more to discover in our new Storage Analytics, so be sure to check for more blogs from me in the coming weeks. More importantly, check it out for yourself. Just click here to sign up for your 30-day free trial.

Additional information:

Our Storage Analytics run on VMware 4.1 and higher.
Information about Premium and Community Editions is here.

CloudPhysics at Virtualization Field Day on March 5th!

UPDATE March 6, 2014:

Videos from yesterday's Tech Field Day are now available (see below), so if you missed the event or want to listen again, here you go! Questions? Feel free to contact  @cloudphysics, @virtualirfan or @esxtopGuru on Twitter.

---

Tomorrow, the Tech Field Day crew will be at CloudPhysics HQ here in Mountain View for Virtualization Field Day 3. We’re really looking forward to having all the delegates—Alastair, Andrea, David, Eric, Eric, James, Jeff, Marco, Paul, Rick and Scott—as well as the TFD team—Claire, Stephen and Tom—here with us!!

We’ll be talking about how CloudPhysics uses big data analytics to take all of the data coming out of virtualized systems and turn it into actionable intelligence on how the infrastructure is behaving, where the problems are, how to fix them, where you can save space, where you can improve performance, and more. But we won’t just be talking. :) We’ll be demo-ing what CloudPhysics can do, live and livestreamed, for everyone.

Join the CloudPhysics team for our VFD3 presentation! 

You can follow along on Twitter with hashtag #VFD3, follow us @CloudPhysics, the Tech Field Day team @TechFieldDay, and watch the livestream here or on the Virtualization Field Day site tomorrow, March 5th from 9:30 - 11:30AM PST (GMT-8):

---

Who is CloudPhysics? CEO John Blumenthal provides an intro to Silicon Valley startup CloudPhysics, which brings a fresh, SaaS-based approach to virtualization management, combining Big Data science, patent-pending datacenter simulation and modeling, and resource management techniques to deliver actionable analytics for better managing the virtual datacenter.

CTO Irfan Ahmad (@virtualirfan) explains CloudPhysics’ unique approach to virtualization management. Leveraging a daily stream of 120+ billion samples of configuration, performance, failure and event data from its global user base, and utilizing patent-pending datacenter simulation and unique resource management techniques, CloudPhysics empowers enterprise IT to drive Google-like operations excellence using actionable analytics from a large, relevant, continually refreshed data set.

@virtualirfan takes a look under the hood of the company’s predictive analytics technology, starting with the handling of the firehose of data collected from datacenters around the world, followed by an examination of the overall architecture, simulation techniques, discovering causality, predicting potential outages, and supported use cases.

Krishna Raj Raja (@esxtopGuru) gives a live demo of how to use CloudPhysics to pinpoint hidden operational hazards in VMware vSphere environments.

Krishna gives a live demo of how to use CloudPhysics to troubleshoot storage performance and configuration issues in your virtual datacenter, including a demo of custom analytics capabilities (with CloudPhysics Card Builder).

Many IT teams are considering SSDs to improve datacenter performance. Krishna and Irfan explain how CloudPhysics Cache Benefits Analysis helps IT teams understand if – and exactly where - SSD will deliver performance advantages in their particular datacenter.

You can have your performance data and graph it too

One of the things we’re known for at CloudPhysics is the mashup, taking data from different sources and mashing it together. Instead of requiring users to hunt through multiple operational data sources, and then piece together relevant data by hand into spreadsheets or other tools to generate tables and charts, we pre-package the data together with analytics to automatically generate the charts, tables and - most importantly - answers users need.

As David Davis notes in his recent article on vSphere management tools--this is precisely when it's worth paying for a management tool.

Something new to mash up and visualize: performance data

We’re proud to add vSphere performance data counters to the CloudPhysics mashup mix. With this new capability, you can now access performance data counters as properties in Card Builder (our reports and analytics tool). You can now mash up both performance and configuration data from different objects, which we then automatically chart and graph, creating a visualization that makes it easy for you to see what’s going on. You can specify a time window and toggle each property that’s displayed, enabling further exploration. All without impacting the performance of vCenter or any other part of the infrastructure itself.*

To be concrete, this new capability in CloudPhysics means you can do things like:

  • compare host CPU usage with the CPU ready time for all VMs on the host

  • compare VM network performance to NIC utilization and network configuration

  • compare the storage read and write latencies for a host and VMs (see graph below)

Here you can see read latency consistently spiking between 700 and 1000 ms, while write latency is relatively constant at under 100 ms. This would suggest that you’ve got a candidate for caching!


Almost 100 performance data properties

And that’s just scratching the surface. We’ve exposed almost 100 performance data counters as properties so far, and will bring more online as customers have use for them. Every property you pick gets automatically plotted and overlaid as shown in the screenshot above, letting you choose which properties to display together and over what time period. You can even zoom in on a particular slice of time.

Here’s a sample list of properties available at the host and VM levels:

  • Storage path read average

  • Storage path write average

  • Storage path total read latency

  • Storage path total write latency

  • Memory usage average

  • Memory swap usage average

  • CPU usage average

  • …and over 90 more

 

Making it real time

Where it can really get interesting is when you’re trying to diagnose or map the real performance for specific hosts or VMs over time. CloudPhysics performance data is as near real time as you can get, down to a 20 second granularity (VMware calls that “real time”). And unlike other vSphere analytics or monitoring solutions, including those from VMware itself, we don’t roll the data up after one hour. We maintain that real time data for you and let you dig into it with anywhere from a few minutes to one month of performance data—without losing resolution.
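To see why keeping the 20-second samples matters, consider this small Python sketch of what a rollup does to a short spike. The latency values are made up, and the 5-minute rollup window is an assumption picked for illustration.

```python
SAMPLE_SECONDS = 20  # the "real time" sampling interval mentioned above


def rollup(samples, window_seconds):
    """Average consecutive samples into coarser windows."""
    per_window = window_seconds // SAMPLE_SECONDS
    rolled = []
    for i in range(0, len(samples), per_window):
        window = samples[i:i + per_window]
        rolled.append(sum(window) / len(window))
    return rolled


# One hour of made-up read latency (ms): steady 10 ms with a one-minute spike.
latency = [10.0] * 180
latency[90:93] = [900.0, 900.0, 900.0]

five_minute = rollup(latency, 300)
peak_raw = max(latency)          # 900.0 ms in the raw samples
peak_rolled = max(five_minute)   # 188.0 ms after averaging
```

The one-minute spike to 900 ms shrinks to 188 ms once it is averaged into a 5-minute window, which is exactly the kind of detail that is lost when resolution is reduced.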

As a little icing on the performance data cake, you can export both the data and the visualization as pdf, csv, or image files to put into slides or send around to your team and show them how we’re taking the guesswork out of virtualization management. :)

 

Getting started with performance data

To help you get started, we’ve got a video tutorial for you. If you’re already a CloudPhysics user and would like to try it out, let us know and we’ll get in touch. And if you’re not, now’s a good time to get started and just let us know you'd like to try out performance data!

Here’s a tutorial to get you up and running using performance data by walking through a simple use case.


*That’s the beauty of the CloudPhysics SaaS platform and Observer combination: the data is collected as it’s generated, so there is no need to run scripts against every vCenter every hour to get the granular info before it’s rolled up, store the data, merge the data, crunch the numbers, figure out how to chart and graph the data, and then finally find out what’s going on. Doing this manually via scripts would cause vCenter to hiccup at the least and, in the worst case, prevent execution of things like provisioning.


Do Hybrid Clouds Make Cents? Free Cost Calculator for AWS

It is no secret that organizations are increasingly looking into hybrid cloud solutions to get the best of both private and public clouds. And judging by recent announcements, service providers are paying attention. Last week Amazon augmented its existing AWS offering with a new desktop-as-a-service, while VMware this year launched its vCloud Hybrid Service (vCHS) and announced the acquisition of Desktone, a leader in desktop-as-a-service.

Making the decision to migrate your VMs from a private cloud to a public cloud involves many factors such as cost, performance, utilization, and sizing. But even just understanding whether or not it’s economical to run a particular set of VMs in a public cloud can be difficult. To get a cost estimate you have to manually import details of all the VMs and their resource consumption and then map the details onto the pricing model of the cloud service provider(s) you’re looking at. Depending on how many VMs you’re looking at and how much you like Microsoft Excel, you are looking at days to weeks’ worth of work.

Introducing CloudPhysics Cost Calculators for Hybrid Clouds

10-cost-calculator-for-aws.png
11-cost-calculator-for-vchs.png

CloudPhysics now offers a way for administrators and managers to get pricing estimates for hybrid clouds in just seconds. We do the data collection, private-to-public cloud VM mapping, and cost calculation for you. You can even experiment with parameters that impact pricing for the target public cloud to figure out the best fit. This information is available for any part of the inventory: an entire vCenter, a single cluster, a single datastore, even down to a single virtual machine.

We have developed two new cards [What is a Card?] -- Cost Calculator for AWS and Cost Calculator for vCHS. Both are absolutely free for all CloudPhysics users as part of our Community edition.

(If you’re not already using CloudPhysics, it takes less than 5 minutes to sign up and get started with our free trial and use these cards!)

 

In this post, I’m going to dig into the Cost Calculator for AWS, and I’ll talk about vCHS and doing comparisons across clouds in my next two posts.

Cost Calculator for AWS

Matching your existing VMs to one of the available EC2 instance types is the first step toward understanding what it might cost to run that VM in EC2.

Amazon provides EC2 reserved instances and spot instances. EC2 reserved instances provide fixed and discounted hourly pricing based on reserving those instances for a fee ahead of time. However, pricing of Amazon EC2 reserved instances depends on a number of factors such as instance type (which come with pre-defined sizes and performance capabilities), resource requirements, expected utilization, location, reservation length, type of storage, type of the guest operating system and more.

You need to understand:

  1. Which EC2 instance types are the right matches for my VMs?

  2. What are the resource tradeoffs when there isn’t an exact match?

  3. Where should the instances be located?

  4. What if no match is found?

  5. What happens if my mix of VMs changes?

  6. How does the pricing change if I change location, storage type, or any of the other options?

1. Which EC2 instance types are the right matches for my VMs?

Since the configuration of EC2 instance types is fixed, you need to map your VMs to one of the available instance types. The Cost Calculator for AWS Card lets you choose the criteria for the initial mapping, and you have four options.

instance-match-selection.png

Match by lowest price: This option finds the lowest priced instance from all regions that matches either by vCPU count or vRAM size.

Match by vCPU Count: This option matches by vCPU count and then finds the lowest priced instance from the available choices.

Match by vRAM Size: This option matches by vRAM size and then finds the lowest priced instance from the available choices.

Match vCPU and vRAM Size: This option matches by both vCPU count and vRAM and then finds the lowest priced instance from the available choices.
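The four options can be sketched in Python as four filters over an instance catalog, each followed by a lowest-price pick. This is an illustration only: the catalog entries and prices are made up (not real AWS instance types or rates), regions are left out, and "matches" is assumed to mean "offers at least as much as the VM has".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    ram_gb: float
    hourly_usd: float  # made-up price, not a real AWS rate


# Hypothetical catalog; names and numbers are placeholders.
CATALOG = [
    InstanceType("small", 1, 2.0, 0.05),
    InstanceType("cpu-heavy", 4, 4.0, 0.20),
    InstanceType("ram-heavy", 2, 16.0, 0.25),
    InstanceType("large", 4, 16.0, 0.40),
]


def match(vcpus, ram_gb, mode):
    """Return the cheapest catalog entry satisfying the chosen criterion."""
    def cpu_ok(i):
        return i.vcpus >= vcpus

    def ram_ok(i):
        return i.ram_gb >= ram_gb

    criteria = {
        "lowest_price": lambda i: cpu_ok(i) or ram_ok(i),
        "vcpu": cpu_ok,
        "vram": ram_ok,
        "vcpu_and_vram": lambda i: cpu_ok(i) and ram_ok(i),
    }
    candidates = [i for i in CATALOG if criteria[mode](i)]
    # None signals that no instance type fits the VM.
    return min(candidates, key=lambda i: i.hourly_usd, default=None)


vm = {"vcpus": 4, "ram_gb": 8.0}
results = {
    mode: match(vm["vcpus"], vm["ram_gb"], mode)
    for mode in ("lowest_price", "vcpu", "vram", "vcpu_and_vram")
}
```

For this 4 vCPU, 8 GB VM, matching on vCPU alone lands on an instance with only 4 GB of RAM, and matching on vRAM alone lands on one with only 2 vCPUs. Only the combined option avoids a shortfall, at a higher price.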

2. What are the resource tradeoffs when there isn’t an exact match?

Depending on the match criteria that you have chosen, the matched instance may have a resource tradeoff, either vCPU count or vRAM. If there is such a resource tradeoff, you’ll see it as shown below.

vram-tradeoff.png

3. Where should the instances be located?

Depending on the match criteria and pricing, the Card may match instances in any AWS region. All the matches are overlaid on a map so you can easily find out their geographical distribution. The Card also gives you the option to override the region. This is useful if you want your matches to be found from only one specific region either for performance or compliance reasons.

instance-location.png

4. What if no match is found?

For some specific VM configurations or guest OS types, there may not be a matching EC2 instance type. In these scenarios, the Card will show which VMs could not be matched and why.

5. What happens if my mix of VMs changes?

pricing-summary.png

The Card will show the cost for each individual VM’s matched instance, a summary for all the VMs that you selected to match against, and a breakout of storage costs. This is automatically updated whenever you modify pricing parameters and whenever the VMs’ configurations change.

You can also see how many of each kind of EC2 instance type your VMs were matched against.

instance-type-count.png

6. How does the pricing change if I change location, storage type, or any of the other options?

The three main parameters that control instance pricing beyond instance type and location are: term lengths, storage type, and expected utilization.

term-length.png

Term Length: There are two options, 1 year or 3 years. Amazon offers a bigger pricing discount for the 3-year term, and you can view the discount in the pricing when you select the 3-year option.

storage-type.png

Storage Type: If the instance will be used primarily for compute and there is no need for persistent storage, then Instance Storage is the cheapest and it is included by default. But for most cases you’ll probably want persistent storage with Amazon EBS (you can find out more about EBS here). The Card will factor your selection into the cost.

vm-uptime.png

Expected Utilization: If you don't need your virtual machine up and running 24/7, you could save a lot of money by powering off the virtual machine when it is not in use. Amazon provides three instance utilization categories (light, medium and heavy), which you can think of in terms of total desired uptime for the instance. The Card will let you change the VM uptime and automatically select the utilization category that meets that uptime at the lowest price.
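Picking the cheapest utilization category for a given uptime comes down to comparing an upfront fee plus hourly charges across the three categories. The Python sketch below uses made-up fees and rates (not real AWS prices) and assumes, as a simplification, that every category bills only for the hours the instance is running.

```python
HOURS_PER_YEAR = 8760

# Made-up (upfront fee, hourly rate) pairs in USD; not real AWS prices.
CATEGORIES = {
    "light": (50.0, 0.040),
    "medium": (120.0, 0.025),
    "heavy": (200.0, 0.012),
}


def yearly_cost(category, uptime_fraction):
    """Upfront fee plus hourly charges for the hours the VM is running."""
    upfront, hourly = CATEGORIES[category]
    return upfront + hourly * HOURS_PER_YEAR * uptime_fraction


def cheapest_category(uptime_fraction):
    """Pick the category with the lowest yearly cost at this uptime."""
    return min(CATEGORIES, key=lambda c: yearly_cost(c, uptime_fraction))


choices = {u: cheapest_category(u) for u in (0.2, 0.6, 1.0)}
```

With these placeholder numbers, low uptime favors the small upfront fee of the light category, while a VM that runs around the clock is cheapest in the heavy category.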

 

 

Conclusion

The convenience and benefit of using a public cloud service such as Amazon Web Services is undeniable. However, it is non-trivial to figure out how your systems map to Amazon EC2 instance types and how much it would cost to run your VMs in AWS. With the CloudPhysics Cost Calculator for AWS you can get a quick estimate for any part of your existing virtualized infrastructure and easily play around with different pricing parameters and options. We hope this makes your decision a bit easier.

That's the CloudPhysics way: taking the guesswork out of virtualization management.

P.S. If you are an existing CloudPhysics user, you can log in today and start using these cards immediately. If you are new, here is the link again to create an account with CloudPhysics, and you can get started in 5 minutes!

Krishna Raj Raja
@esxtopGuru