VM whatcha m’hoozit

I had an interesting conversation with a colleague of mine about, “what is vmware?”, “how does a VM access hardware features, like disk encryption (OPAL, TCG)?”, “what happens if two VMs want to use the same hardware, is that SRIOV or MRIOV allowing that?”, “if I add feature Z to the hardware, how much better will the VM be able to do Y?”, and so on…

The list of questions was quite long, and until I understood the perspective and experience base of my colleague, nothing I was saying made any sense to him. After muddling through it I realized that for some folks, hardware folks in particular, the things we virtualization lovers live and breathe every day are completely foreign. It is not enough to simply say things like “hardware is abstracted”, or “it’s software defined”. A hardware engineer understands that there are special page tables, registers and instructions in the CPU for memory management, DMA between devices, interrupts to signal events, and caches to optimize performance. They understand there are AHCI controllers for SATA and RAID controllers for SAS and that NVMe will change the future.

What a hardware engineer struggles with is how hardware can be abstracted to the point that a guest operating system running in a VM can’t see the hardware, yet still depends on the hardware features to do its normal job as it would if it were on physical hardware. VMs have to see a chipset, a memory controller, a graphics card, a storage device, right? More importantly, some don’t understand why hardware should be abstracted at all. For them, hardware is a beautiful creation with millions and billions of transistors that can push electrical pulses at blazing speeds from point A to point B. The moment I mention that the VM can’t see any specialized hardware feature unless the hypervisor exposes it, the wheels start turning and even more questions pop up. Questions like, “what about SRIOV and how it allows VMs to share the hardware?”, and “what about the isolation that the CPU, MMU and IOMMU provide to VMs?” What is missing is how the hardware is used by the hypervisor and how the hardware is ultimately presented to the virtual machine.

There is no easy or quick answer to these questions, but I do have advice for anyone reading this. If you are a software engineer trying to explain to a hardware engineer what virtualization is, don’t focus on buzzwords like software defined, or paravirtual device. Instead, focus on why things like hardware abstraction exist, and why software defined anything is simpler to manage for the 100’s or 1000’s of servers and applications that are running on your virtual infrastructure. Also, make sure to explain why virtualization is around in the first place: to utilize hardware resources to their fullest and get the most for your investment without sacrificing performance or manageability.

If you are a hardware engineer trying to make sense of what your hardware can do to improve, benefit or add value, focus on the hypervisor rather than the VM. The hypervisor is the OS, and it has drivers just like any other OS. The hypervisor uses the hardware, and presents the VM with a view of the hardware. Neither the VM nor its applications are interested in doing something proprietary to take advantage of your new bits and bytes. Rather, the VM wants to take advantage of the hardware features (performance, scalability, reliability) without knowing what the hardware is, or how it does its magic. It should just work so that regardless of the host the VM is running on, it continues to run, and continues to be a reliable service to the enterprise.

Why S.M.A.R.T. is so d.u.m.b?

This article is intended to provide insight into problems encountered by management applications that use the S.M.A.R.T. reporting capabilities of SATA SSD’s, and of other device types that provide a translation, such as the Micron P420m/P320h devices. SSD’s and HDD’s have supported S.M.A.R.T. reporting for many years, and it is still a critical part of any storage solution. SSD’s introduce new requirements to an old standard: specifically lifetime, and how to know when your drive should be replaced.

I’m going to cover the problem with SMART reporting from three different perspectives: 1) the “standard”, 2) a device manufacturer, and 3) system software (driver).

S.M.A.R.T. stands for Self-Monitoring, Analysis, and Reporting Technology. The last version of this specification can be found on the T13.org web site at: http://www.t13.org/documents/UploadedDocuments/docs2007/D1699r4a-ATA8-ACS.pdf

Lots of intricate details and other opinions around SMART reporting can be found online. I found Wikipedia’s description to be very informative: http://en.wikipedia.org/wiki/S.M.A.R.T.

Topics covered in this article:

  • SMART Attributes – what are they?
  • PCIe SSD’s and SMART
  • Software Interfaces for SMART Reporting
  • How do we fix this?

SMART Attributes – what are they?

In its simplest form, SMART reporting is a mechanism to request from an ATA storage device, such as a SATA SSD, a 512-byte data structure that contains a collection of SMART attributes. There are a finite number of attributes available in 512 bytes, and a SMART attribute is a three-field structure representing the following:

  1. Attribute Id:  which attribute (from the ATA specification) is being reported in the Value and Threshold fields.
  2. Value: a value from 0 to 255 representing a range of possible values reported by the device.
  3. Threshold: an OPTIONAL value from 0 to 255 representing an event threshold, where if exceeded, the device will change its operating specification according to the rules of the ATA specification for SMART reporting.
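
The three fields above can be sketched as a small Python type. This is only a model of the simplified description in this article, not the byte layout of the 512-byte structure. The names SmartAttribute, attribute_id, value and threshold are illustrative choices, the sample reading is made up, and the range check simply enforces the 0 to 255 bounds stated in the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SmartAttribute:
    """Simplified three-field model from the text (not the on-device layout)."""
    attribute_id: int                 # which attribute is being reported
    value: int                        # 0..255, reported by the device
    threshold: Optional[int] = None   # 0..255, or None if the vendor omits it

    def __post_init__(self) -> None:
        # Enforce the 0..255 range the text gives for Value and Threshold.
        for name in ("value", "threshold"):
            field_value = getattr(self, name)
            if field_value is not None and not 0 <= field_value <= 255:
                raise ValueError(f"{name} must be in 0..255, got {field_value}")

    @property
    def has_threshold(self) -> bool:
        # True only when the device vendor chose to implement a threshold.
        return self.threshold is not None


# Hypothetical reading: attribute 194 with a value of 40 and no threshold.
temperature = SmartAttribute(attribute_id=194, value=40)
print(temperature.has_threshold)  # False: no threshold was supplied
```

Making the threshold optional in the type forces any management code that consumes it to handle the "vendor did not implement it" case explicitly.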

The existence of a Threshold field implies a mechanism to change the threshold and a mechanism to signal a driver when a SMART event occurs, but the specification describes the Threshold as an optional feature for device manufacturers to implement.

You may find this hard to believe but SMART reporting is NOT a standard.  It is a specification for doing something but it doesn’t specify the what or the how in a way that ensures device manufacturers and OEM’s report these SMART attributes in a consistent form with consistent behaviors.  To make matters worse OEMs have had to define their own standards for device vendors to conform to their management specifications. Some OEMs have adopted open standards on how to collate, present and maintain management data; of which SMART Attributes are a tiny fraction of the total management data set.

It’s important to note that SMART was introduced for spinning media in the 1990’s and adopted by SSD manufacturers in the mid 2000’s to support new problems that arise with Flash memory as the storage technology. Also, the last known edit to the ATA/ATAPI-8 specification was made in 2007, and if you read the first few lines of the specification, the challenges facing a device manufacturer should become even more apparent:

This is a draft proposed American National Standard of Accredited Standards Committee INCITS. As such this is not a completed standard. The T10 Technical Committee may modify this document as a result of comments received during public review and its approval as a standard. Use of the information contained here in is at your own risk.

You may be asking yourself, “what drives a device manufacturer’s requirements for SMART reporting?” The answer is simple: OEM storage specifications and requirements.

The problem is twofold. First, OEM’s have defined their own proprietary definitions of what attribute identifiers and their values mean. Oh, and between OEMs the attribute id values can be different. In reality there are only a few attributes we care about that have differing attribute identifiers. For better illustration of what I mean, see the example below. As a result, device vendors must then SKU their products (usually firmware) for specific OEMs for the sole purpose of supporting SMART reporting in different management tools. The second problem is that device vendors must also decide what level of the SMART Reporting feature set they support; should they support all Thresholds, or only some? Typically decisions as simple as SMART attributes are made very late in product development, such that it is too late to map “internal” flash statistics properly, sometimes leading to short cuts for these types of features.

Example of some of the more interesting SSD attribute mappings:

  • PROGRAM FAIL COUNT: these are errors corrected by the device.
    • Specification ID: #171
    • OEM A ID: #181
  • ERASE FAIL COUNT: this indicates that blocks are being or will soon be retired.
    • Specification ID: #172
    • OEM A ID: #182
  • TEMPERATURE: current operating temperature, good to know if your chassis has enough cooling.
    • Specification ID: #194
    • OEM A ID: (same id)
  • PERCENT LIFETIME USED: amount of lifetime calculated by the drive, not guessed by an end user.
    • Specification ID: #202
    • OEM A ID: #204
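
To make the cost of those differing ids concrete, here is a minimal Python sketch of the kind of translation table a management tool or OEM-specific firmware SKU ends up carrying. The id numbers are the ones listed above; the table names, the translate_id helper and the upper-case attribute keys are hypothetical and only for illustration.

```python
from typing import Dict, Optional

# Attribute ids as listed in the example above.
SPEC_IDS: Dict[str, int] = {
    "PROGRAM_FAIL_COUNT": 171,
    "ERASE_FAIL_COUNT": 172,
    "TEMPERATURE": 194,
    "PERCENT_LIFETIME_USED": 202,
}

# "OEM A" is the anonymized OEM from the example above.
OEM_A_IDS: Dict[str, int] = {
    "PROGRAM_FAIL_COUNT": 181,
    "ERASE_FAIL_COUNT": 182,
    "TEMPERATURE": 194,
    "PERCENT_LIFETIME_USED": 204,
}


def translate_id(attr_id: int,
                 source: Dict[str, int],
                 target: Dict[str, int]) -> Optional[int]:
    """Map an attribute id from one numbering scheme to another.

    Returns None when the id is unknown in the source scheme or has
    no counterpart in the target scheme.
    """
    for name, known_id in source.items():
        if known_id == attr_id:
            return target.get(name)
    return None


print(translate_id(181, OEM_A_IDS, SPEC_IDS))  # 171
print(translate_id(202, SPEC_IDS, OEM_A_IDS))  # 204
print(translate_id(250, SPEC_IDS, OEM_A_IDS))  # None (unknown id)
```

Every additional OEM means another table like OEM_A_IDS, which is exactly the per-OEM SKU burden described above.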


PCIe SSD’s and SMART

What is a PCIe SSD? A PCIe SSD is an SSD in which the HBA and storage device are integrated together onto a single board, exposed on a PCI Express bus, directly connected to the host CPU or main IO bus. I will clarify what I mean by storage device after I describe what I mean by HBA. The key point here is that a PCIe SSD is two devices in one, and requires a device driver to

What is a Host Bus Adapter (HBA)? The most familiar HBA example in our everyday computers and laptops is based on the main chipset implementation of AHCI (Advanced Host Controller Interface). AHCI is Intel’s gift to cheap storage for us all. You must keep in mind that even though we don’t see it or install its hardware in our computers, it is actually a PCIe device that we are plugging our SATA drives into. In enterprise computing there are a few more players than Intel (though Intel is of course one of them), and their architectures include SCSI devices as well as SATA (Serial ATA) devices. Enterprise HBA implementations are all proprietary to each device manufacturer, and how each manufacturer exposes SMART reporting is unique to their device drivers more so than hardware.

What is a storage device? A SATA SSD is one example of a storage device. A SATA SSD consists of a SATA controller and a Flash controller operating together to store and retrieve data on Flash media, instead of spinning (HDD) media. In the case of a PCIe SSD, the storage device consists of a single component instead of two: a Flash controller. The Flash controller performs the low-level Flash read/write and management operations that are common in all SSD’s.

The first standard for PCIe SSD’s is called NVMe (Non-Volatile Memory Express). Until version 1.0 of the NVMe specification, all PCIe SSD’s were proprietary. The Micron P420m and P320h PCIe SSD’s are a prime example of a custom and proprietary HBA and Flash controller based storage device. Since neither SATA nor SAS is part of the design of the PCIe SSD, the implementation of SMART reporting must be provided by and translated by the device driver and/or management software.

Non-standard PCIe SSD’s have the additional problem of translating internal device details into something meaningful to management interfaces that are designed for SATA SMART reporting. Sometimes, the device implementation doesn’t even remotely have an equivalent mechanism for reporting a SMART attribute. One example would be total bytes written, which is a common lifetime measurement for SSD endurance. In the case of the Micron P420m and P320h SSD’s the lifetime calculation of the device is a much more complex answer than a simple “total bytes written”. It turns out that you can, depending on your write workload and frequency of writes, write to some SSD’s beyond their rated “total bytes written”. If a drive were to report “total bytes written” as its lifetime indicator, it could be misleading and costly if the management software uses only that statistic to determine when a drive should be replaced. For that reason the P420m and P320h SMART reporting reports a percentage of lifetime value. Once it reaches 100%, the drive will enter a write protect mode to allow users to read data off the drive safely, and will disallow writes to prevent corruption of user data due to stress on the already strained flash subsystem.
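
As a rough sketch of how management software might act on a percentage-of-lifetime value like this, consider the Python function below. The 100% write protect behavior comes from the description above; the lifetime_status name, the status strings and the 90% early-warning level (warn_at) are assumptions made up for this example, not anything the drive or its driver defines.

```python
def lifetime_status(percent_lifetime_used: int, warn_at: int = 90) -> str:
    """Classify a drive from its reported percentage of lifetime used.

    warn_at is a hypothetical site policy for ordering a replacement
    early; only the 100% rule is taken from the article.
    """
    if percent_lifetime_used >= 100:
        # The drive is read-only from here on: data can still be copied off.
        return "write-protected"
    if percent_lifetime_used >= warn_at:
        return "plan-replacement"
    return "ok"


print(lifetime_status(42))   # ok
print(lifetime_status(93))   # plan-replacement
print(lifetime_status(100))  # write-protected
```

The point of keying off the drive’s own percentage is that the policy never has to guess endurance from “total bytes written”.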

Software Interfaces for SMART Reporting

Device drivers are responsible for presenting user space management applications with access to device data. Operating systems dictate the mechanism for kernel drivers to expose user space functions. In Windows and Linux it is commonly made available via IOCTL, a single entry point with a command code and a memory buffer for command parameters, with no standard for the parameters or the IOCTL codes. Linux has some nice facilities for creating “proc” and “sysfs” nodes that can be queried by command line tools and scripts through standard file read / write operations. VMware has two mechanisms depending on the version of the vSphere hypervisor. Versions 4.1 through 5.1 (a.k.a. the VMK Linux kernel) use IOCTL and “proc”. For version 5.5 and beyond, IOCTL and “proc” are removed and a special management API is introduced. The interface is much different from IOCTL: it can be queried by device, and it has provisions for making sure there is a version match with an interface specification. Out of sync driver and user space versions can be detected at runtime with this mechanism.
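
The version match idea can be illustrated with a few lines of Python. This is not VMware’s management API; InterfaceVersion, versions_compatible and the “same major, driver minor at least as new” rule are all assumptions chosen to show how an out of sync driver and user space tool could be caught at runtime.

```python
from typing import NamedTuple


class InterfaceVersion(NamedTuple):
    """Hypothetical version stamp shared by a driver and its user space tool."""
    major: int
    minor: int


def versions_compatible(driver: InterfaceVersion,
                        client: InterfaceVersion) -> bool:
    # Assumed rule: a major bump breaks compatibility, and the driver
    # must be at least as new (minor) as the client that calls it.
    return driver.major == client.major and driver.minor >= client.minor


driver = InterfaceVersion(major=2, minor=1)
print(versions_compatible(driver, InterfaceVersion(2, 0)))  # True
print(versions_compatible(driver, InterfaceVersion(3, 0)))  # False
```

With a plain IOCTL there is no such handshake: a mismatched tool simply passes a buffer the driver interprets differently.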

VMware is doing good things with the management API with respect to defining a standard around SMART reporting and it has great potential for device vendors to comply with. The biggest challenge will be getting device vendors on board with the appropriate partner licensing to deliver plugins that can remap or adapt their device to the SMART management interface.

How do we fix this?

The problem is NOT likely to be fixed for ATA variants, specifically SATA. To my knowledge the SATA specification is not evolving beyond SATA-3, and the focus for SSD technology going forward is on the NVMe standard. I’ve not (yet) evaluated the NVMe specification for how well it has standardized SMART reporting (which is documented as the NVMe log). However, in browsing its definitions, they are much more suited to finer grained detail and predictability in understanding the statistics for “behind the scenes” tasks that a drive may be doing in any given SSD. The question I still have to answer is “what kind of challenges are device manufacturers facing when it comes to interpreting the specification?” How much room is there for error or simple misunderstanding?

As for software interfaces, there are still no standards around NVMe for how to gain access to the NVMe log. This means that it is up to operating system vendors to own the “standardization” of how a management application requests and interprets the NVMe log data, while still co-existing with the existing SAS / SATA equivalents. VMware is getting there with its SMART management interface, and it would be nice to see the other two big OS vendors getting on board with something similar.

IMHO, without an NVMe “software” standard, we could end up heading right where we left off with ATA SMART Reporting… The best answer is for operating system vendors to get together and define a software standard around NVMe to bridge the gap between device manufacturer “interpretation”, OEM device differentiation, and NVMe as a specification rather than a full standard!
