Fifty Shades of Storage, Part 7

— Storage Horizons Blog —

What kind of data access makes up an application, and why is this important when thinking about tiering (for performance) vs. basic HSM?

Application Data Attributes

Steve Sicola

One of the key points about tiering is "reaction time." Another way to look at this is "Dynamic Data Placement": can the data an application uses be in the right tier at the right time, so that I/O is consistent for all portions of the application?

All applications have data whose access patterns fall into the following categories (a rough classification sketch follows the list):

  1. Tight Random—frequent R/W across a small range of the data set
  2. Random—seldom, non-localized access
  3. Sequential

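To make these patterns concrete, here is a minimal sketch of how a window of recent I/O might be bucketed into one of the three categories. It is purely illustrative: the thresholds, window size, and function names are assumptions for this example, not the classifier inside any shipping array.

```python
from collections import deque

def classify_window(lbas, seq_gap=8, tight_span=1_000_000):
    """Roughly classify a window of logical block addresses (LBAs).

    seq_gap:    max LBA gap still counted as sequential (in blocks)
    tight_span: span (in blocks) below which random I/O is 'tight'
    These thresholds are illustrative assumptions, not tuned values.
    """
    if len(lbas) < 2:
        return "unknown"
    gaps = [abs(b - a) for a, b in zip(lbas, lbas[1:])]
    if sum(g <= seq_gap for g in gaps) / len(gaps) > 0.9:
        return "sequential"
    if max(lbas) - min(lbas) <= tight_span:
        return "tight-random"     # frequent R/W across a small range
    return "random"               # seldom, non-localized access

# Sliding-window usage over a stream of I/O requests
window = deque(maxlen=256)
for lba in [0, 8, 16, 24, 32, 40, 48, 56]:   # toy trace
    window.append(lba)
print(classify_window(list(window)))          # -> "sequential"
```

In practice an array would track many such windows per region of a volume and over time, which is where the "reaction time" question above comes in.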
In the history of arrays, these aspects of application data sets have been covered by up-front hand tuning of LUNs per data type, pinning data in RAM, short-stroking many HDDs, or SSDs (old and new). With hybrid tiering arrays, it's all going to be about dynamic data placement, if it can be done without causing inconsistent performance.

Application data sets are rarely the size of the array's storage capacity, so customers naturally want to run multiple applications on the array to drive efficiency. However, "the devil is in the details": most arrays can't handle multi-tenancy for applications, because it makes their performance inconsistent.

If storage for an application were like an ice cream parfait and the cost were right, then everything would be simple . . . but as noted above, it is not.

In the world of virtualization, VDI, and plain old small-to-medium businesses trying to make the most of their IT storage purchases, the need for true multi-tenant storage that can adapt dynamically and provide automatic QoS for all volumes, across multiple applications, is VERY HIGH.

The problem to be solved is being able to ingest the entire data stream from multiple applications simultaneously, and to properly analyze and characterize the attributes of the data based upon the usage patterns above. It's the ultimate big data problem, and X-IO has solved it with its unique Continuous Adaptive Data Placement (CADP) algorithms, layered on top of the unique ISE technology. Because the fundamental problems of reliability, availability, and capacity utilization are solved first, CADP becomes simpler and more effective for up-tiering, consistent high performance, and application multi-tenancy.

Much as in big data, or even remote sensing of weather for accurate forecasting, attributes such as application I/O density, the size of the region with a given density, and its location, among many others, play the key roles in determining where to place data for optimal and consistent I/O, whether one application or many are running against a storage device.

So, X-IO effectively solves the dynamic data placement problem for multiple applications running simultaneously against a Hyper ISE hybrid. Data placement for the right time and the right place is done across the entire capacity of the hybrid array known as Hyper ISE, and multiple applications can be provided with consistent I/O and throughput. This means more VMs, more VDI users, more databases, etc. It's all about being able to do more with less.

X-IO Hyper ISE has proven this with benchmarks such as Temenos and Redknee running on Microsoft SQL Server 2012, beating out larger arrays with traditional tiering; and with Best of Microsoft TechEd wins two years in a row, X-IO is seen as the best storage on the planet.


Fifty Shades of Storage, Part 6

— Storage Horizons Blog —

Tiering and Hybrid Arrays: What’s the truth about these?

Steve Sicola

The basic performance differences between DRAM, SSD, HDD, and nearline HDD (in X-IO terms) mean vastly different opportunities for data tiering, whether for Hierarchical Storage Management/Information Lifecycle Management (HSM or ILM) of data and backups, or for performance.

Performance tiering is very different from lifecycle management: in performance tiering, data is actively tiered up for the application, while lifecycle management is about getting data at rest (typically backups) down to the lowest storage cost over time. However, many vendors are trying to mix the two during normal application operation, which leads to highly non-linear performance for applications, headaches, overprovisioning, and a large amount of human intervention.

Data tiering has gone on for years, so let's talk about the norm, as well as the new storage tiers that exist today.

Since about 1990, an array has basically had two or three tiers. The first tier was the DRAM within the array. The second tier was the enterprise drives placed in enclosures (JBODs). In the early 2000s, a new tier using SATA or nearline drives (there are differences) was added. Arrays already tiered between controller-based DRAM cache and enterprise drives; but with the addition of SATA or nearline drives, some storage companies decided to place HSM software within the arrays and, with some rudimentary migration, not only "down-tier" but also "up-tier" chunks of data within the array's volumes based on average activity.

HSM has been around a long, long time. Hierarchical Storage Management (or lifecycle management) of data was developed largely for "down-tiering" containers of application data over time. It was initially meant for moving old copies of data sets from applications to slower, more archival storage; when first developed, HSM was basically disk to tape. A few years ago, arrays added "tiering" or "fluid data" and basically distorted this original meaning. Why do I say distorted? Because these arrays were actually mixing portions of the data set volumes between enterprise and nearline disk—"live." Instead of just storing copies (clones, snapshots, etc.) made from the enterprise data set volume(s) for an application on nearline storage, the volumes now spread portions of the data across the different tiers while the application runs.

This has many effects, some with merit, but most dubious and potentially detrimental. On the surface, it is a surefire marketing play to say you can save money by placing live data on two tiers of storage while the application is running. It sounds great. However, the issues involved are many:

  1. The DRAM (and the dollars and compute time) it takes to represent the different tiers and manage the bookkeeping of what is where, and how much, makes the system complex. This leads to the potential for more bugs and potential availability issues. That aside, the additional DRAM to hold the bookkeeping data, and the compute time to act on tiering at the block level, SLOW down overall operation. This means the need for more expensive controllers, more storage, etc.
  2. The reliability differences between the drive types mean that different RAID levels will most likely be used on the different tiers: RAID 1 or 5 on the enterprise drives and RAID 6 on the nearline drives. This means not only different performance characteristics but also a good deal of overprovisioning, with drives of different types in different JBOD enclosures, and more enclosures to cover single points of failure. This is very inefficient.
  3. The ability to "up-tier" based on activity for a given volume has non-linear performance effects if attempted dynamically, because of the hysteresis involved in moving the data and its accesses during the transition (see the sketch after this list). This is why, in many cases with tiered storage, manual intervention is required; and in many cases, to preserve performance, volumes are locked into a tier, defeating the purpose of tiering. Reaction time for up-tiering is essential for a system that wants to actively use multiple tiers.
  4. The complexity of the entire multi-tier environment is now greatly increased, which makes it very difficult for software testing to cover all of the cases involved. This drives down performance, drives up the potential for software bugs that hurt availability or data integrity, AND means more expensive processors that eat up $ and power.
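As a rough illustration of the hysteresis problem in point 3, here is a minimal sketch of an up-tier/down-tier decision that uses separate promote and demote thresholds plus a minimum dwell time, so a chunk does not bounce between tiers on every activity blip. The thresholds and names are assumptions for this example, not any vendor's algorithm.

```python
import time

class TierPolicy:
    """Toy promote/demote policy for a chunk of data.

    promote_iops / demote_iops form a hysteresis band; min_dwell_s keeps
    a chunk in place long enough to amortize the cost of moving it.
    All values are illustrative assumptions.
    """
    def __init__(self, promote_iops=500, demote_iops=100, min_dwell_s=300):
        self.promote_iops = promote_iops
        self.demote_iops = demote_iops
        self.min_dwell_s = min_dwell_s

    def next_tier(self, chunk_iops, current_tier, last_move_ts, now=None):
        now = time.time() if now is None else now
        if now - last_move_ts < self.min_dwell_s:
            return current_tier                      # still settling: don't thrash
        if current_tier == "hdd" and chunk_iops >= self.promote_iops:
            return "ssd"                             # hot enough to pay the move cost
        if current_tier == "ssd" and chunk_iops <= self.demote_iops:
            return "hdd"                             # cold enough to give the SSD back
        return current_tier                          # inside the hysteresis band

policy = TierPolicy()
print(policy.next_tier(chunk_iops=800, current_tier="hdd", last_move_ts=0, now=1000))  # -> "ssd"
print(policy.next_tier(chunk_iops=300, current_tier="ssd", last_move_ts=0, now=1000))  # -> "ssd" (inside the band)
```

The wider the band and the longer the dwell time, the less thrashing, but also the slower the reaction time that, as noted above, is essential for a system that wants to actively use multiple tiers.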

Fifty Shades of Storage, Part 5

Steve Sicola

Hybrid and SSD Arrays

  • What are they good for, and what are they not?
  • What are enterprise hybrids?
  • Which are just "SANs in a can" for SMBs doing HSM/ILM?
  • What about all-SSD arrays?
  • Are they worth it?

All-HDD Arrays, Hybrids, and All-SSD Arrays: An Introduction

All-HDD arrays are still the mainstay of the world today, but that may change as SSDs come down in price and the world's demand for quick business decisions increases.

HDD arrays such as the original ISE-1 and now the ISE-2 provide multiples of the performance of traditional arrays, with the best price/performance and lowest Total Cost of Ownership (TCO) on earth. This has been proven time and again, all over the world and in real benchmarks. Our focus is on the fundamentals of storage that have been ignored for 20+ years! The industry has ignored them, but at X-IO we strive for performance across all the capacity, reliability that is 100x better than all others, half the power, half (or less) the rack space, and acquisition costs that are competitive with any mainstream enterprise storage solution!

The hype around all-SSD arrays has really been the focus of the media today, but are the benefits real, and is the TCO really less? I believe that in some cases they are, but the number of such use cases is small. Why? There are many reasons.

If SSD were the same price as HDD, then I'd be all over SSD because of its overall speed advantage. However, that's not the case, so why is all-SSD being pushed by so many companies? Enterprise SSD is 10-20x the $/GB of enterprise or nearline HDD, and these prices will not converge anytime soon; enterprise HDD will also keep coming down in price, which keeps the delta between the technologies significant.

The wild thing is that some all-SSD vendors are playing in the space where they should—high-end niches such as trading-floor applications. The problem is that many start-ups are trying to play in the basic enterprise, where multi-application, high-capacity, and consistent I/O performance are required, not to mention availability and reliability. The numbers are just not there when it comes to cost, let alone TCO, once power, space, etc., are factored in.

When it comes to overcoming the TCO argument, some all-SSD vendors have added "features" such as deduplication and compression in order to claim that their $/GB is the same as or better than HDD. However, these features come with a price tag: very high-powered servers to run the array, which drives up cost and really drives up power! And the sheer use of these features drives application performance down, even with a bunch of SSD. The end result is dubious to the user, because most tier-one applications either have dedupe or compression built in, or simply don't benefit because there is not that much duplicate data to begin with.
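A back-of-the-envelope illustration of why the "same $/GB as HDD" claim hinges entirely on the data-reduction ratio (the prices and ratios below are assumptions for the sake of the example, not quoted vendor figures):

```python
def effective_cost_per_gb(raw_cost_per_gb, reduction_ratio):
    """Effective $/GB after dedupe/compression at a given reduction ratio."""
    return raw_cost_per_gb / reduction_ratio

hdd_raw = 0.10   # assumed enterprise HDD $/GB
ssd_raw = 1.50   # assumed enterprise SSD $/GB (10-20x HDD, per the text)

for ratio in (2, 4, 8):
    print(f"{ratio}:1 reduction -> SSD effective ${effective_cost_per_gb(ssd_raw, ratio):.2f}/GB "
          f"vs HDD raw ${hdd_raw:.2f}/GB")
# Only at high, workload-dependent ratios does the SSD figure approach raw HDD,
# and the CPU cycles spent deduplicating are not free.
```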

The other aspect of adding such features to an all-SSD or tier-one array is complexity. The complexity of the software goes up by magnitudes, driving the reliability of the software down while slowing the system down even further. There is no free lunch; and in this world of marketing, where ideas like these keep coming, what matters to IT managers in the end is good storage that is always available, delivers consistently good performance, and rarely needs service.

I believe that, at this point in time, deduplication and compression are for data at rest: backups and archived data. Performing these operations on tier-one applications seems like a waste, because the benefit always depends on the application. But when the data is at rest, these features are in their perfect environment: performance is not an issue there, and the cost savings from driving down the capacity consumed, freeing it up for more data at rest, IS!

In the end, my thoughts on all-SSD arrays, today, go like this:

  • If SSD were the price of enterprise HDD and supply were able to meet demand . . .
  • And if applications could use all the IOPS . . .
  • Then SSD would replace HDD.

Until then, in most cases, all-SSD arrays are either a niche or a shill on TCO, with dubious value.

So going forward, I’m going to leave SSD arrays for now and talk about tiering and hybrid arrays, as to me, they offer the most benefit, for years to come, in both price/performance and overall TCO reduction.

Fifty Shades of Storage, Part 4

Steve Sicola

What are the specific aspects that an array can and should have to be efficient and TCO-friendly? How does ISE meet and exceed them by making the "whole greater than the sum of the parts"? I will deal with these questions across my next few blogs, but first consider the design of a storage product from the ground up. A storage array is a specialized computer system. It has a clear focus on data storage, but it's also much more than that. A storage array has a few laws it must live by:

  1. It must protect data from at least a single failure
  2. It must never lose data after a power failure
  3. It must withstand a failure as a result of a power failure (see number 1)
  4. Reads and writes should be expected, and must be able to be performed, at the proper duty cycle for the tier of storage (e.g., ISE is a Tier 0/1 device, meaning the duty cycle should be 100%: anything at any time, all the time, with low latency and high IOPS/throughput).

So, what makes up an array that meets these “laws” in such a way that it’s not just a small server or even a PC with a bunch of Band-Aids on top (or “perfume on a pig”!)?

Array Hardware and Its Effect on TCO

Given that a storage array typically has two controllers, aspects that make or break TCO include:

1.  Are both controllers active at the same time for access to the same data volumes? If not, the system is typically active/passive, or each controller is active only for a subset of the volumes, which causes availability and/or software reliability issues and drives up cost. An active/passive system will most likely throw heftier hardware at each controller, driving up power, to make up for the performance lost in the normal case of both controllers operating. Also, where active-active is not provided within an array, multi-pathing driver software must be put into play, which adds complexity, sometimes costs extra money, and drives the overall solution cost up—either way, with storage companies seeking to recover development and support costs by hiding them inside high warranty costs.

2.  Do both controllers have a communication link with near-zero latency? This makes a difference in case 1, above, when failover occurs; but most importantly, it serves an application's write workload at the lowest latency and overall cost. Mirroring write data between controllers is the best method to ensure data integrity in the case of failure, and it also gives the lowest latency across the widest range of host access patterns. True active-active operation with a dual-controller array is possible when this communication link is fast enough. Not only does this allow faster failover in the event of a controller reboot or failure, it also adds the performance of both controllers to all volumes when both are operational. In addition, servers no longer need special drivers to control multiple paths to the storage.

3.  Related to case 2 is how the dynamic random access memory (DRAM) cache is used for writes and how it is protected. A good write-back cache can smooth out most application I/O "outliers" from the standpoint of overall access to the application's data set. A small amount of DRAM with non-volatility, plus a very fast inter-controller communication link, allows I/O latency to be reduced to the first order. Remember, DRAM is 1000x faster than SSD, which in turn is much faster than HDD (for random I/O). Using DRAM in the proper quantities can reduce TCO, but throwing a large amount at the problem without intelligence just drives up cost and power usage.

4.  Good cache algorithms that can aggregate I/O, prefetch, perform full RAID stripe writes, atomic writes, parity caching, etc., are what make a small amount of DRAM cost-effective in front of all the back-end storage devices, which in the end must have I/O performed to and from them in the most efficient way possible for each back-end device type (a rough sketch of full-stripe write coalescing follows this list).

5.  What kind of back-end device types should be considered? Nearline HDD (SATA or SAS), enterprise HDD (10K or 15K), or SSD in drive or plug-in card form factor? It all depends on the mission of the array. If it is price/performance and TCO, then my mind goes to how to use the 10K HDD, along with MLC SSD for some applications, for the job. Nearline HDD has its place in very low-performance or sequential I/O environments, mainly backup and archive use cases, because its extremely low I/O density makes it impossible to efficiently utilize the full capacity of these typically high-capacity drives. Remember, too, that low-cost, high-capacity drives have a different duty cycle than enterprise drives. For example, throwing multiple sequential workloads at high-capacity drives looks just like a random workload and will kill these drives prematurely, resulting in more service events, slower performance during long rebuilds, potential data loss, and sub-optimal performance.

6.  Does the array have the ability to drive I/O to all of the attached capacity? This is a key metric in effective TCO versus the old adage of $/GB. If an array can utilize ALL the capacity under load, that efficiency drives down TCO. The ability to utilize all the capacity is a function of the data layout, effective utilization of the back-end devices, and also how the caching and controller cooperation work. All of this can drive TCO way down or way up, depending on how well it's done.

7.  Does the array have a warranty greater than three years? If so, it's either because the technology reduces service events OR it's a sales tactic. If it's the former, then it truly drives TCO down as more storage is purchased. If it's not, then it's "pay me now or pay me later." Technology that requires less service is based on a design for reliability and availability that goes far beyond just dealing with errors as they occur. It's a system approach, similar to Six Sigma, that reduces variation in the system and therefore the chance of failure. In an array, that means how the devices are packaged, how the removable pieces are grouped together, and how the software deals with potential faults while keeping the application running without loss of QoS. A system that can do this drives TCO down because customers no longer have to design for failure, or in other words, design around the shortcomings of the array by overprovisioning (as many cloud vendors do). Many cloud providers have designed for failure with mass amounts of over-provisioned storage, n-way mirroring, etc. The industry has been trained around the shortcomings of array design and error recovery, so those who build their own datacenters just go for the cheapest design with the cheapest parts. In contrast, a storage system that really does provide magnitudes-greater reliability, availability, capacity utilization, and performance across that capacity can actually change this mindset. However, it takes belief that a design of this nature is possible . . . and it has been done with the ISE from X-IO.

8.  Does the array provide real-time tiering that maintains a consistent I/O stream for multiple applications across the largest amount of capacity possible? An array that can do this effectively, with the highest I/O and largest capacity at the lowest cost, wins the TCO battle. Beware of marketing fear, uncertainty, and doubt (FUD) that sounds the same; the architecture and design of the product, and the results, are what matter.

9.  Does an array add features that, under the right circumstances, reduce capacity footprint via de-dupe or compression? If so, I smell snake oil, because in most tier-one applications compression and de-dupe just drive up the cost of the controller while giving dubious results. On paper it might look good for $/GB, but other aspects, like space, power, and utilization, suffer. And if it's done with all SSD in order to artificially claim the cost is less, so much the worse.
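To illustrate the full-stripe write idea from point 4, here is a minimal sketch of a write-back cache that coalesces dirty blocks and destages a stripe only once every data block in it is present, so parity can be computed without a read-modify-write. The stripe geometry and names are assumptions for this example, not ISE's actual cache code.

```python
from collections import defaultdict
from functools import reduce

STRIPE_BLOCKS = 8          # assumed data blocks per RAID-5 stripe
BLOCK_BYTES = 4096

class WriteBackCache:
    """Toy write-back cache that prefers full-stripe destage."""
    def __init__(self):
        self.dirty = defaultdict(dict)   # stripe_id -> {offset: data}

    def write(self, lba, data):
        stripe_id, offset = divmod(lba, STRIPE_BLOCKS)
        self.dirty[stripe_id][offset] = data
        if len(self.dirty[stripe_id]) == STRIPE_BLOCKS:
            self.destage_full_stripe(stripe_id)

    def destage_full_stripe(self, stripe_id):
        blocks = [self.dirty[stripe_id][i] for i in range(STRIPE_BLOCKS)]
        # XOR the data blocks to build parity; no reads from disk are needed.
        parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)
        print(f"destage stripe {stripe_id}: {STRIPE_BLOCKS} data blocks + {len(parity)}-byte parity")
        del self.dirty[stripe_id]

cache = WriteBackCache()
for lba in range(8):                      # host writes cover stripe 0 completely
    cache.write(lba, bytes(BLOCK_BYTES))  # -> "destage stripe 0: 8 data blocks + 4096-byte parity"
```

Partial stripes would still need read-modify-write or parity caching; the point is simply that aggregating writes in a small amount of DRAM lets the back end see the most efficient I/O for its device type.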

Why am I harping on the way that arrays are designed? It is because all of this drives the TCO up or down based on architecture and methods used to drive up performance, capacity utilization, reliability, and availability . . . or NOT!

Most arrays today are very wasteful when it comes to the:

  • amount of compute power inside the array
  • amount of actual usable capacity
  • overall reliability (or aversion to service events)
  • availability of the array to the application

Also, adding features such as those noted above, as well as many kinds of replication, makes the performance of the array inconsistent, causing IT architects to over-provision their gear and "work around the SAN." SANs got a bad name for bloated, framed architectures with big iron, big license fees for every feature on the planet, poor performance, poor reliability, poor capacity utilization, etc., etc., etc. A SAN was originally meant to just put storage on a private network that servers could share. Oh, how things get polluted over time when vendor greed takes over.

As noted before, putting the right amount of compute, against the right amount of storage, will drive costs down in power, space, and application efficiency.

Most arrays also have a "when in doubt, throw it out" mindset when it comes to replaceable components within the system, also known as Field Replaceable Units (FRUs). This leads to more service events and higher warranty costs, as well as potential and real performance loss at the application, and even downtime.

What Makes ISE Tick?

X-IO is now in its second generation of ISE, a balanced storage system that breaks all the molds of the traditional storage system.  Unique aspects of ISE and its second generation are:

1.  All the things the original ISE already solves, including two to three times the I/O per HDD of any other array manufacturer.

2.  Dual super-capacitor subsystems, in order to always be able to hold up both controllers for up to 8 minutes and flush the mirrored write-back cache on both controllers to a small SSD on each controller. This ENDS the reliance on batteries or a UPS to hold up the cache or the entire array while write-back cache is written out to a set of log disks. Reliability goes up exponentially over a battery (which was already good); it not only keeps the price the same, but also makes the data readily available for server use when power comes back on. (Note: Two super-caps are in each ISE, but only one is necessary for hold-up. Two are provided for high availability and no single point of failure.)

3.  Reliability that is increased tenfold over the first-generation ISE for the back-end devices in the DataPacs of the new Hyper ISE 7-series (with additional groupings of HDDs). This extends the art of ISE service deferral and comes with the 5-year hardware warranty that X-IO extends to all its ISE systems.

4.  Unique performance tiering in the Hyper ISE hybrid that allows full use of the HDD capacity with a small percentage of SSD. The new 7-series extends this capability with varying Hyper ISE capacities, as well as varying SSD capacity for application acceleration.

5.  No features that are not necessary for application performance. ISE does NOT do deduplication: it's not necessary if the application does it (and most do), and moreover, since we are the only company in the world that allows full utilization of the storage purchased, deduplication/compression is relegated to where it belongs: data at rest, NOT tier-one storage. Furthermore, features like thin provisioning are not necessary, as mainstream OSes such as Windows and Linux, let alone VMware, already allow proper growing and shrinking of volumes, which ISE does support.

50 Shades of Storage, Part 3

— Storage Horizons Blog —

Steve Sicola

When it comes to storage systems, the cost to build the product and the subsequent acquisition cost are only two aspects of the overall cost of owning and operating the storage. The $/GB argument no longer holds up as the only important point in storage, because the enterprise and this world demand much, much more. Price/performance is very important, but other aspects play equal roles in most cases these days. Aspects like how the storage array is designed, how much capacity can be utilized (i.e., getting I/Os to all of it), and then how the software is layered on it make all the difference in the world to the total cost of operation (TCO) of storage. Many aspects make up a good array that provides performance, reliability, availability, and capacity utilization; what matters is not just each specific aspect, but how they play together. It's all about making "the whole greater than the sum of its parts." Environmental aspects make up a huge part of the cost of owning and managing storage in a datacenter, and many times they are "invisible" costs because of departmental silos.

The aspects to consider in a storage system, today, when buying and then owning the system are:

  1. Cost of acquisition
  2. Cost of warranty service
  3. Cost of power
  4. Cost of space
  5. Cost of features with licenses, etc.
  6. Cost for managing and attaching the storage to the system/application

How the array is designed—from mechanicals and electronics to the software that runs it all—plays a key role to drive TCO up or down.

When I consider building storage, I look at what gives the biggest bang for the buck when it comes to performance, at the lowest cost. I also look at reliability, availability, and the usable capacity. Stan Zaffos of Gartner coined the term, “Available Performance,” which seems to sum it up pretty well. Can you make the storage available, all the time, with a consistent amount of performance? That ties together price/performance, reliability, availability, and usable capacity.

TCO is not just about $ per GB anymore, nor has it been for some time, but many storage companies still seem to focus on it. Then there are others that now seem to focus only on $ per I/O, which is like fishing with dynamite when using all RAM or SSD! It's also NOT about putting every feature on the planet inside the storage system, because today most applications provide features that obviate the need for features within the array. Focusing on the wrong things drives TCO up, not down. Our online whitepaper about common mistakes made in storage purchases, "How to Minimize Data Storage Costs and Avoid Expensive Mistakes," puts this all in a business perspective. Putting all of the data management/protection features inside the storage reduces the scalability of the storage, locks customers to a vendor, and drives down the efficiency of the storage in terms of consistent performance and capacity utilization, let alone reliability and availability.

When considering the build of a storage array, I look at multiple factors:

  1. Processor Speed and Capability: If a processor can provide speed as well as RAID acceleration without the need for multiple additional components or custom chips, it is the winner. New x64 processors, from Intel's Jasper Forest to the new Sandy Bridge, provide that capability. Choosing the right processor is important, because too often the recommended processor is more than what is necessary, which needlessly drives up power costs.
  2. Memory Capability: Dynamic random-access memory (DRAM) is still the fastest. It's 1000x the speed of flash, but of course it's much more expensive. Using the right amount for the job of buffering and caching is key to cost containment, as well as to array efficiency.
  3. Write-back Cache: This feature is amazingly effective if the algorithms used smooth out the accesses to the back-end devices, whether they are HDD or SSD.
  4. Non-volatility and Mirrored Cache: For most applications, this is a key feature: it makes a storage subsystem appear faster than it really is, and it provides data integrity and availability as first- and second-order benefits.
  5. Back-end Storage Device Choice (Enterprise HDD, Nearline/High-Cap HDD, and SSD): Each of these choices has ramifications for every aspect of the array: cost, reliability, performance, and availability.
  6. Storage Tiering: Tiering has been around for a long time. It was initially coined as Hierarchical Storage Management (HSM), then lifecycle management, etc., and now tiering. But tiering can mean different things depending on the goal. Is it performance, or some all-in-one desire to mix tier-one storage with tier-x storage? Is it within the array or across arrays?
  7. Design for Reliability and Availability: These are subtly different and relate to things such as how many different pieces there are in the solution, and how the intelligent parts of the array preserve availability in the event of failures (fault tolerance). Packaging the devices and the various components without cables, and with fewer replaceable components, is key to driving up reliability and availability, as well as driving TCO down. In the end, design for reliability is about reducing service events that affect the storage consumer in one way or another, while availability is about making sure the storage is accessible all the time.

The world of computing is complicated enough. We do not need to see so many start-ups confusing the basics of computing and storage with statements like “SSD for the cost of HDD,” or “one tier for all,” or “automatic QoS,” or “no caching,” even to the extreme of “No more HDDs.” It’s all a game to try and sell people on how price/performance could be, not what it SHOULD be!

Basically, architecture is everything. Brute force only goes so far, "hiding the cheese" and adding features that mask the overall cost with dubious claims about cost savings (via de-dupe and compression). They are like the wares of the old "snake oil" salesmen of the 19th century. Are these the aspects of a storage array that really make a difference?

Read more on http://stevesicola.com.

Fifty Shades of Storage, Part Two

As I began to go down the path of describing the "50 Shades of Storage" in my last blog, I noticed a blog post from David Black (see http://www.blackliszt.com/2013/03/in-storage-there-is-x-io-and-then-there-are-all-of-the-others.html). He talks about X-IO in a unique way, which is ripe for me to discuss as we delve into the storage products that work well—and those that don't work well—in today's world of Cloud and IT datacenters in general. What I mean by "work and don't work" speaks to fundamentals like reliability, availability, capacity utilization, and performance—all key metrics in the mission to drive overall costs down, not just the acquisition cost.

David makes the point that “In storage, there is X-IO, and there are all the others . . . ” It is true because the X-IO ISE and open storage management with RESTful web services, application integration, etc., are the perfect complements for what has transpired in the industry. It is no longer about what your array can do for your system. It is what the system can already do with the fundamentals of a good array!

The big players, as David puts it, are just turning the crank, assuming that everyone still wants a general-purpose array with every feature on the planet in it. As a matter of fact, most of the start-ups are doing this as well. These products work, but in a world that has seen the maturation of Windows and Linux, as well as all the applications and virtualization software, the NEED for features in the array is JUST GOING AWAY!

David's note that "ISE is a simple building block" is to the point, and very much what I was after when ISE was being developed at Seagate. The goal was to make a building block of storage that reduced the Total Cost of Operation (TCO) so significantly that service would be the exception rather than the norm. Performance is delivered efficiently and consistently, availability is almost at five 9s, and all the capacity can be used instead of being stranded as it is on most arrays.

When ISE is compared to the others in the industry, X-IO comes up short in some people's minds because it does NOT have the features that storage analysts look for. But when one looks at the system (as David's blog does with the Cloud), features do NOT matter. A normal Cloud uses mass replication (RAID is not dead, it's just RAID-1 on steroids!). ISE can eliminate the extra copies while making the overall solution work better and giving it a much longer useful life.

Cloud models, whether private clouds or new clouds designed to run enterprise applications, need to operate with as low a TCO as possible in order to make money. That's why ISE is so important in this world: it raises the bar in all the pertinent areas—capacity utilization, reliability, availability, and performance. For backup or content retrieval, some of these points are not as important; but for virtualization, database/business intelligence, and VDI in the Cloud or the private datacenter, ALL of these points are important in order to make money.

I highly recommend that my readers take a look at David Black's blog. It is not only spot-on about the point we make at X-IO, it also speaks to the lack of system awareness in the world. Without system awareness, redundant features—in operating systems, applications, and storage—are everywhere, and the efficiency of datacenter operations falls well short of the goals set by CIOs and CFOs, let alone by Cloud providers that want to make money selling Cloud-based services other than backup.

X-IO is focused on TCO, from the ground up, by working on the fundamentals of storage. This “50 Shades of Storage” blog series will clearly define those things that really do matter and those which are, well . . . just SPIN! ISE is built from all commodity parts and is just put together in a different way to make the “whole greater than the sum of the parts,” as well as taking storage to its rightful place—the trustworthy depository of customer data!

50 Shades Of Storage

— Storage Horizons Blog —

HDD, Hybrid HDD/SSD, or All-SSD Storage: So Many "Shades of Grey" That It Is No Wonder People Are Confused!

There is a significant amount of hype in the marketplace today around all-SSD solutions, with the likes of Violin, Pure Storage, Nimbus, etc. There is also a lot of hype about start-ups that put SSD in as cache, along with SATA drives, to be an "all-in-one" box that includes every feature on the planet. But in the end, when building a storage array and then using it, it comes down to price/performance and how people think about storage with respect to the rest of the computing application.

While hybrids offer a hedge against decreased quality of service when application usage gets heavy, and all-SSD solutions seem to be overkill for all but the most demanding applications because of the cost difference between flash and HDD capacity, good HDD solutions still fit the bill for most applications in this day and age. After all, what is the center of the universe—storage or the system/application? (Hint: It is the system/application!)

Applications such as databases, virtualization with multiple applications, VDI, and many others have I/O "signatures" made up of various access types, ranging from sequential read/write, to localized random I/O (tight random across a relatively small range of capacity), to very non-localized random I/O (for application metadata or seldom-touched pieces of data in the app). These applications have not changed in years and, in some instances, can be tuned to do more or less of each type of I/O. However, using just SATA HDD or just SSD technology to cover these applications seems folly. Anyone can throw lots of HDDs at an application to get enough performance to drive it—while, on the other hand, lots of performance can be thrown at an application with lower capacity and at higher cost. The optimum answer is somewhere in the middle. There is a good reason why enterprise HDDs are still used, from 7.2K to 10K to 15K RPM drives, and why enterprise SSD exists, in drive form or as a plug-in card. The point is to use them for the right workload and in the right mix.

My conclusion is that a good hybrid storage system, mixing HDD with SSD, provides the best price/performance as well as the lowest total cost of ownership (TCO). Done right, it also provides the most predictable and consistent I/O regardless of the application—and it will work across the entire purchased capacity.

However, if someone wants to buy a "SAN in a can" like most of the other hybrids, which basically put consumer-grade flash in with high-capacity, zero-I/O drives (along with every feature on the planet), that's an SMB play, not an enterprise play. These kinds of boxes are very much like large SAN arrays (Ethernet or Fibre Channel) and include features that the applications and operating systems already have as well.

Then there are the all-SSD arrays. We liken them to "fishing with dynamite"; in some workloads they are very necessary, just not for the broader market of today. The variation between these vendors, as well as their features, gives me pause. Once again, the actual design and implementation of these arrays ranges from scary to insane (adding features to validate their existence), claiming they are the same price as disk thanks to subjective de-dupe and compression capabilities. Once again: many shades of grey, with features that already exist within operating systems and applications.

This new series of blogs, starting in April 2013, will give some insight into hybrids and all-flash arrays, and why ISE, with its linear scalability, matched storage management for IT and Cloud, and incredible performance and TCO, should be the product against which all other products on the market are compared first.


Cloud Computing and Datacenter Automation in an Era of Efficiencies

What’s a cloud? To me, it is an automated datacenter that can provide compute, application, and backup services. It can be a private cloud for a company, or it can be public where the cloud provider wants to make money on the services it provides.

In both private and public clouds, efficiency is the key to a successful and profitable business. From the number of humans who have to administer it to the efficiency of compute, infrastructure, and storage—it is all about the money. In old terms, it's about total cost of ownership (TCO). Somehow fear, uncertainty, and doubt (FUD) have overshadowed this most important metric of all, and it's time it made a comeback.

TCO is NOT just about how much it costs to put something into the cloud operation. Sure, that's a component, but what about how much it can do for the cloud provider from the standpoint of work units per hour, power, space, and cooling costs? After these basics, there are the big metrics in computing, like availability and reliability (how much service does it need?); and for storage, it's about utilization of the capacity that was purchased. The last aspect is how much is charged for the service contracts on each technical component of the system. It is amazing that this probably drives 20-40% of the costs for a cloud, or for any datacenter. The three-year life cycle is artificial; it exists to drive pure money to the vendors through forklift upgrades, when components can and do last longer. Why can't we just make stuff that is reliable and requires little service, like ISE with its five-year hardware warranty? We put a man on the moon over 40 years ago, so why can't we make products that just work, like ISE? The answer lies partly in the desire for service revenues, and partly in a lack of innovation.

So much noise was made about $/GB over the years that it seems the entire world has been brainwashed into thinking SATA drives are the panacea for storage. Recently, the big hype has been around flash storage, whether in some type of card installed in a computer, as SSDs, or as a big box of flash. That goes the other way on $/GB, even though it is as fast as can be.

Both SATA and flash today have a TCO that is higher than people think. They have good aspects, but they also have bad aspects that bring the TCO story back down to earth. SATA drives may be cheap, but they are unreliable and have almost no performance, and when you push them hard they die quicker. SATA drives are also unable to access all of their capacity in any reasonable time, which means that for anything but backup or content they are a bad choice for real applications: capacity utilization ends up well below 50%. That stranded capacity is wasted and causes people to buy more storage, waste more storage, waste more space, and waste more power! For those who just buy more drives, power, space, and cooling become issues, as do drive failure rates, repair frequency, and the potential for data loss. And for those in the cloud who then just say, "Buy many of them and keep many copies of the data," capacity utilization and effective $/GB continue to dwindle while all the other cost metrics soar.
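As a rough illustration of the stranded-capacity point (the drive specs below are generic assumptions, not measurements of any particular product), compare the I/O density of a high-capacity SATA drive with that of a smaller enterprise drive:

```python
def iops_per_gb(iops, capacity_gb):
    """I/O density: random IOPS available per GB of capacity."""
    return iops / capacity_gb

sata_4tb  = iops_per_gb(80, 4000)    # assumed ~80 random IOPS, 4 TB nearline drive
ent_600gb = iops_per_gb(200, 600)    # assumed ~200 random IOPS, 600 GB 10K drive

print(f"nearline SATA: {sata_4tb:.3f} IOPS/GB")    # ~0.020
print(f"enterprise:    {ent_600gb:.3f} IOPS/GB")   # ~0.333
# If an application needs ~0.1 IOPS/GB, the SATA drive's capacity can only be
# ~20% filled before it runs out of IOPS; the rest is stranded.
```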

For all-SSD solutions today, the $/GB is prohibitive. I love flash as an option for the future, along with other up-and-coming non-volatile storage options. However, most device makers currently using flash waste a lot of space because of the lifetime and failure modes of flash, and when you add this up, it increases the $/GB. Stay tuned, because if prices come down significantly relative to enterprise HDD, things may change quickly, but all the other aspects of TCO must still be addressed, no matter the storage type. SSD in ISE is already here with Hyper ISE, the most efficient use of SSD with HDD in the industry, and when SSDs become more cost-effective, ISE will be ready for them.

So TCO is about the cost of storage procurement, the cost of service contracts, and power, cooling, capacity utilization, overall reliability (the need for repair events), availability to applications, and performance. Those all-SSD suppliers that use de-dupe and compression to market a lower $/GB really add power and cost to the solution, for a subjective amount of capacity savings versus the 100% capacity utilization that ISE provides.

ISE density today allows 40 2.5" devices to be housed in every 3U of rack space. While that is possibly not the densest packaging, ISE is all about making sure density does not cause reliability problems from vibration and heat. Remember, environmental factors are key metrics for all storage devices, whether HDD or SSD.

By design, ISE experiences 100 times fewer service events than other enterprise arrays because of its technology. My team (some of whom have been with me for 10-30 years) and I developed it while working at Seagate between 2002 and 2007. This has been proven by the first generation, first shipped in 2008 and running in most countries of the world. ISE also has unprecedented availability, with straightforward test metrics enabled by an architecture that allows the software to be adequately tested. Architecture wins in this world, even though brute force can hide the facts for a while.

ISE performance has been shown, in industry benchmarks, to lead in efficient application performance. ISE gets three to four times the I/O out of an HDD compared with all competition, and with Hyper ISE, a unique fusion of HDD and SSD (NOT a flash cache), applications just scream. Recent Redknee and Temenos benchmarks, for mobile billing and banking applications that use Microsoft SQL Server 2012, show that ISE is 25 times more efficient for applications than traditional enterprise storage. This is an incredible savings of space and power, along with the fact that ISE allows full capacity utilization while maintaining full performance.

ISE was built with the cloud in mind first and foremost, back when a cloud was basically an automated or autonomic datacenter. We have done the work with many datacenters, and the numbers are clear about ISE efficiency for enterprise applications versus traditional storage:

  1. ISE reduces space costs by 3X
  2. ISE reduces power costs in datacenters by 10X
  3. ISE reduces service events by 100X with proven ISE packaging and self-healing technology

In this world, where costs and human resources are always where every CFO looks for savings, ISE should be considered for all clouds and datacenters. Gartner shows X-IO as the ONLY innovator in storage, and we are executing on this vision for all customers.

In previous blogs, I’ve written about the myths and truths of storage. I’ve written about how frequently drives fail, or look like they fail, and cause service events for the rest of the industry. I’ve written about performance and its need to cover all of the capacity purchased for enterprise applications. I’ve written about availability and how it relates to efficient use, 24-hours-per-day, for datacenter applications. Space and cooling also played a part in my writings, because if you can use what you buy, it takes less space and less cooling to run your datacenter.

All of these essays apply to Cloud datacenters, whether private or public. ISE is all about enterprise applications and private clouds, as well as the new "application" public clouds. It is purpose-built to be efficient, easy to use, and able to last five years or more. There is nothing like ISE in today's industry, regardless of marketing and FUD from the competition. ISE is needed because it is the most efficient storage on the planet—and that is what TCO is all about.


The Truth About Storage, Part Five

Another recent myth, the third in the "Five Myths and Half-Truths" series, is that RAID is dead, or that RAID-6 is the answer to everything. Where did doing the math go? What about the economics of wastage, or doing the right thing for the right reasons? Looking at distributed file systems that keep copies of files in multiple places, it's nice to see that 1980s-style RAID-1 server replication still works.

It's just that with this method the work is pushed out to the network, and in general the cheapest SATA drives are used in these environments. Compare that with replicated RAID-5 sets, which the more disaster-tolerance-minded would deploy not only more economically, but with more predictable performance and availability.

The other part of the myth is that RAID-6 is the answer. RAID-6 helps with two drive failures, but has anyone looked at how failure and repair rates are really calculated, to see what is necessary where, before adding a disaster-tolerant copy elsewhere? I believe in the right RAID level for the UER (uncorrectable error rate) of the drive or drives involved in the data protection scheme. This is modeled together with the repair rate, which can vary and which pushes some arrays toward RAID-6 based on the speed of their software and/or hardware design. Performance suffers as the protection level increases, and so does $/IO/GB, or I/O density, the key metric in my mind for application performance.
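As an illustration of the kind of math I mean (the drive capacity, UER, and RAID width below are assumptions chosen for the example, not recommendations), here is a small sketch estimating the chance of hitting an unrecoverable read error while rebuilding a RAID-5 set:

```python
import math

def p_ure_during_rebuild(data_drives, capacity_tb, uer_bits):
    """Rough probability of at least one unrecoverable read error
    while rebuilding one failed drive in a RAID-5 set.
    data_drives: surviving drives that must be read end to end
    capacity_tb: capacity per drive in TB
    uer_bits:    drive UER, e.g. 1e15 means 1 error per 10^15 bits read
    """
    bits_read = data_drives * capacity_tb * 1e12 * 8  # TB -> bits
    return 1 - math.exp(-bits_read / uer_bits)

# Example: 7+1 RAID-5 of 4 TB nearline drives with a 1-in-10^15 UER
print(round(p_ure_during_rebuild(7, 4.0, 1e15), 2))   # ~0.20
# Same geometry with enterprise drives rated at 1-in-10^16
print(round(p_ure_during_rebuild(7, 4.0, 1e16), 3))   # ~0.022
```

The point is not the exact numbers, since drive specs and rebuild behavior vary, but that the appropriate protection level falls out of this kind of math rather than out of a blanket "RAID-6 everywhere" rule.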

The bottom-line answer is to use the right data protection technique based upon the desired UER or MTBDL (mean time between data loss). Combinations of RAID-1, RAID-5, or even RAID-6 can be employed, but all have costs that must be factored into each IT decision. ISE uses RAID-1 and RAID-5, along with ISE Mirroring, to provide proper redundancy for enterprise HDDs, as well as extended data protection for those customers who require it for disasters, etc.

Another half-truth is that SANs are basically "available." Availability is different from reliability, because it has nothing to do with the repair rate of the SAN; availability is about keeping storage access to the servers maintained without disruption. The problem here is that the honest answer is "kinda" . . .

Availability means "always on" and accessible. Without it, server applications can't run and the entire business stops; I call these severity-1 issues. To avoid them, many aspects of the day-to-day operation of storage systems must be examined. Even though some storage systems do "OK" on "always on," various forms of reduced availability occur based on weaknesses in those systems.

Most SAN arrays (IP or FC) have their weaknesses, and which weakness(es) they have varies from company to company. They range across:

  1. Back-end device weakness
  2. Caching weakness
  3. Failover/recovery weakness
  4. Code efficiency/open-source weakness
  5. Maintenance weakness, as related to risk and loss of availability

Back-end device weaknesses manifest as "slow" drive performance that limits access to storage for extended periods of time, causing dissatisfaction among application owners. This is typically due to a lack of attention, in the storage software, to back-end control and observation of the back-end infrastructure as well as the attached storage devices. The infrastructure today is rapidly moving to SAS (Serial Attached SCSI), which is very similar to a Fibre Channel network, albeit reduced in function. SAS is very powerful but requires knowledge of how to handle networks, even on the back end and even with a small number of devices. Bus resets, accessing of enclosure information, etc., must be done properly in order not to get in the way of normal I/O. The other aspect of back-end weakness comes in dealing with the storage devices themselves: getting data on and off the devices efficiently either stalls applications or makes them fly. Brute force can work with an overkill pile of flash or DRAM, but money talks here . . . what price glory?

Caching weaknesses in SANs still plague users, with the only visible indication being how much they have to pay to get performance. Efficient caching methods can be overcome by having too much cache, with searching dominating the workload; it seems amazing that caches in many SANs are huge when they don't really need to be. Efficiency is key here once again, as caching is all about knowing "when to hold 'em" and "when to fold 'em"—cache it or flush it. "Sensing" knowledge of the data is key to continuously available performance for an application. Another aspect of cache weakness is the method by which cache mirroring is performed to enable safe write-back caching. Many SANs use an external bus to mirror the write-back cache, and most have latency issues because network interfaces such as FC, IB, or SAS are used. To avoid write-back caching penalties from mirroring, the speed and latency of the mirror "bus" should approach or match the internal memory bus speed of the processor in the SAN. This weakness causes a ripple effect on performance, but also on availability when an actual failover occurs. That weakness comes from a lack of full active-active access through all controllers in the SAN to the same volumes.

Failover and recovery, when a redundant storage system on the SAN loses part of its processing power due to faults in hardware or software, are a key component of availability and a common weakness in storage systems. The desire is to have zero-time failover to the surviving parts of the system after such a fault. However, due to weaknesses in most storage systems, the time to perform such a failover, or the recovery when the failed part returns, can cause downtime for servers if it takes too long, or can cause severe performance degradation that is noticeable to the applications. The process of failover or recovery is similar to a vMotion activity in VMware, in that the entire state of the failing component is taken over by the survivor. How long this takes relates directly to the complexity at the point of failure or recovery: the more work that has to be done to complete the failover or recovery, the longer it takes. Many SANs are indeterminate as to how long the process will take, based on weaknesses not only in the failover/recovery code but in all the other parts of the system, which add up to cause issues here. Some arrays can take minutes to fail over, periodically causing downtime for applications, lengthening timeouts, and producing the fateful "pause" in computing that slows businesses down.

Overall, weaknesses in SANs are due to old software, patchwork software, open-source software, software stacked feature upon feature, a lack of design versus "seat of the pants" development, and so on. This is the single biggest reason that storage systems do NOT deliver the outward efficiency they should. Typical systems get 1/3 to 1/2 of the performance they could, while wasting processing power and storage to compensate. This affects the cost of storage, the quality of the storage from a user perspective, and the cost of available storage. For many customers it's one charge after another to prop up these weaknesses, each of which the customer is told is an enhancement or a new service offering. I have seen array generations get a whopping 5-10% increase in performance from new processors that are 2-4 times as fast. There is nothing like bad software to ruin your plans; and as quality suffers, so does availability. With the mix of open-source, home-written, and glued-on software, it's a wonder many startups can get into an enterprise operation, let alone stay there.

The last weakness, and the one most overlooked when customers buy storage, is the maintenance of the system over its life. A storage system is only as good as its last upgrade in terms of stability, performance, and availability. Many customers ignore this because they have been burned in the past by failed upgrades or planned downtimes that cause datacenter outages that seem to go on forever. Many storage companies don't even talk about it on their websites, and even the big companies have "planned downtimes" for upgrades of attached storage devices. This is crazy in a world of non-stop computing, and, as above, it causes more money to be spent to cover up weaknesses in specific parts of the system. In many cases, customers who figure this out end up putting in DR solutions just to cover the maintenance operations of the SAN. What a waste! Upgrades should be just like updating a PC or a Mac: non-intrusive and, with respect to redundant systems, performed one controller at a time so as NOT to lose any availability or cause any downtime!

ISE, on the other hand, with its extreme focus on performance, reliability, AND availability, is now at about 66 years between severity-1 events that cause downtime for servers. This is based on a strong focus on all aspects of the storage software stack as well as on the hardware design of ISE. Our performance is 2-4 times that of any other HDD device, and with Hyper ISE, which combines HDD and SSD, we have set benchmark records across the world with our ability to "sense" application loads and adapt accordingly for random, sequential, and mixed loads. Efficiency is the key to cost-effective datacenters, and ISE is the building block with the most efficiency on earth.


The Truth About Storage, Part Four

Another myth, the second in the "Five Myths and Half-Truths" series, is that the "storage shelf" or "drive tray," or the new "massively dense drive packages" and servers with many HDDs, are actually decent at preventing vibration and providing enough cooling for all the drives in the enclosure. THEY ARE NOT. Vibration across drives, hot spots in the packaging, vibration across entire packages in the rack, and more cause wastage in the datacenter, from performance loss to significant increases in failure rate and drive replacement. Adding to this is the possible exposure to data loss when drives fail under these circumstances, with parts of the data set left without redundancy, or the many hours required to restore redundancy given the size of today's drives. Failure rates in this case go well past the design specification of 1%, up to 3-5%, which is untenable for most businesses; many storage vendors try band-aids such as fancier algorithms or more copies of data while increasing service costs to the user. The underlying problem should be addressed first.

Placing drives in environments with excessive heat and vibration directly leads to real drive failures, while the combination of environment and software indirectly leads to drive replacements by causing "No Trouble Founds," or NTFs. NTFs are the bane of the computer industry: they waste time and money and put a customer's data set at risk through the recovery techniques they trigger, in storage or elsewhere. What causes NTFs is the lack of thorough error recovery in almost every storage device, or in the host drivers that communicate with storage devices. Excessive heat and vibration can and do cause extended read or write completion times, retries, missed sectors, drive re-syncs/reboots, etc. Software that is not well designed or thought through then simply marks the drive bad and goes into "rebuild," assuming it's all right to slow things down for the application, pay more for a service call with higher service costs, and put the customer's data in jeopardy while rebuilding the data from the "failed" drive. Service contracts for storage devices are high-priced because of this lack of attention to detail; money is made whether or not drives actually fail, and the customer always pays.

The design of ISE places drives in an extremely low-vibration, low-heat environment and includes state-of-the-art intelligent error recovery, with 4-plus years of field experience across over 5,000 ISEs deployed worldwide, as well as 6 years of hundreds of units in the lab. This has given X-IO proof that good packaging and recovery algorithms (not just RAID) drastically reduce drive failure rates and the repair frequency of the entire system, and maximize the performance of the drives and the storage system. X-IO uses patented DataPac technology, housing 10-20 drives, as well as patented Managed Reliability software, allowing X-IO to provide a no-charge 5-year hardware maintenance warranty on ISE. The peace of mind of a storage system that has NO NTFs, an extremely low failure rate, and performance across all the purchased capacity is unheard of in the storage industry. When has buying less been better than buying more? Efficiency and simplicity with little or no human intervention have always been important, and ISE solves the foundational issues that have been plaguing the industry for 20+ years. SSDs require similar attention as they are now part of the mix in IT environments, and I'll cover them shortly.
