
Getting the hang of IOPS

Created: 28 Mar 2011 • Updated: 28 Jun 2012 | 48 comments
ianatkin's picture

The newly updated version is now available: Getting The Hang Of IOPS v1.3

If you are an Altiris Administrator, take it from me that IOPS are important to you. What I hope to do in today's article is to help you understand what IOPS are and why they are important when sizing your disk subsystems. In brief I cover the following,

  • Harddisk basics -how harddisks work!
  • Drive response times
  • Interpreting drive throughputs -what these figures actually mean 
  • What IOPS are and why they are so important
  • IOPS calculations and disk arrays

I should state now that I do not consider myself an expert on this topic. However, every so often I find myself benchmarking disks, and I know the curve I had to climb to interpret all the various vendor stats -the information overload can be overwhelming. What I'm going to attempt in this article is to herd together all the salient pieces of information I've gathered over time. With luck, this will help you engage in a meaningful dialog with your storage people to get the performance you need from your storage.

Introduction
Disk Performance Basics
   Hard Disk Speeds - It's more than just RPM...
   The Response Time
   Disk Transfer Rates aka the 'Sequential Read'
Zone Bit Recording
Understanding Enterprise Disk Performance
   Disk Operations per Second - IOPS
   IOPS and Data
   IOPS and Partial Stroking
   How Many IOPS Do We Need?
   IOPS, Disk Arrays & Write Penalties
Summary
Further Reading

Introduction 

If you are looking at IT Management Suite (ITMS) one of the underpinning technologies which needs to be considered in earnest is Microsoft SQL Server. Specifically, you want to be sure that your SQL server is up to the job. There are many ways to help SQL Server perform well. Among them are,

  • Move both the server OS and the SQL Server application to 64-bit
  • Ensure you've got enough RAM chips to load your entire SQL database into memory
  • Ensure you've got enough processing power on-box
  • Ensure the disk subsystem is up to the task
  • Implement database maintenance plans
  • Performance monitoring

One of the most difficult of the line items to get right in the above list is ensuring the disk subsystem is up to the task. This is important -you want to be sure that the hardware you are considering is suitable from the outset for the loads you anticipate placing on your SQL Server.

Once your hardware is purchased, you can of course tweak how SQL server utilises the disks it's been given. For example, to reduce contention we can employ different spindles for the OS, databases and log files. You might even re-align your disk partitions and tune your volume blocksizes when formatting.

But specifying the disk subsystem initially leads to a lot of tricky questions,

  1. How fast really are these disks?
  2. Okay I now know how fast they are. Err... Is that good?
  3. Is the disk configuration suitable for SQL requirements of ITMS 7.1?

Before we can begin to answer these questions, we really need to start at the beginning...

Disk Performance Basics

Disk performance is an interesting topic. Most of us tend to think of this in terms of how many MegaBytes per second (MB/s) we can get out of our storage. Our day-to-day tasks like computer imaging and copying files between disks teach us that this MB/s figure is indeed an important benchmark.

It is however vital to understand that these processes belong to a specific class of I/O which we call sequential. For example, when we are reading a file from beginning to end in one continuous stream we are actually executing a sequential read. Likewise, when copying large files the write process to the new drive is called a sequential write.

When we talk about rating a disk subsystem's performance, the sequential read and write operations are only half the story. To see why, let's take a look into the innards of a classic mechanical harddisk.

Hard Disk Speeds - It's more than just RPM...
A harddisk essentially consists of some drive electronics, a spinning platter and a number of read/write heads which can be swung across the disk on an arm. Below I illustrate in gorgeous powerpoint art the essential components of a disk drive. Note I am focusing on the mechanical aspects of the drive as it is these which limit the rate at which we can read data from (and write data to) the drive. 

 

The main items in the above figure are,

  1. The Disk Platter
    The platter is the disk within the drive housing upon which our information is recorded. The platter is a hard material (i.e. not floppy!) which is usually either aluminium, glass or a ceramic. This is coated with a magnetic surface to enable the storage of magnetic bits which represent our data. The platter is spun at incredible speeds by the central spindle (up to 250kmph on the fastest disks) which has the effect of presenting a stream of data under the disk head at terrific speeds.

    In order to provide a means to locate data on the disk, these platters are formatted with thousands of concentric circles called tracks. Each track is subdivided into sectors which each store 512 bytes of data.

    As there is a limit to the density with which vendors can record magnetic information on a platter,  manufacturers will often be forced to make disk drives with several platters in order to meet the storage capacities their customers demand.
     

  2. The Drive Head
    This is the business end of the drive. The heads read and write information bits to and from the magnetic domains that pass beneath them on the platter surface. There are usually two heads per platter which are sited on either side of the disk.
     
  3. The Actuator Arm
    This is the assembly which holds the heads and ensures (through the actuator) that the heads are positioned over the correct disk track.

When considering disk performance one of the obvious players is the platter spin speed. The drive head will pick up far more data per second from a platter which spins at 1000 Rotations Per Minute (RPM) when compared with one that spins just once per minute! Simply put, the faster the drive spins the more sectors the head can read in any given time period. 

Next, the speed with which the arm can be moved between the disk tracks will also come into play. For example, consider the case where the head is hovering over say track 33 of a platter. An I/O request then comes in for some data on track 500. The arm then has to swing the head across 467 tracks in order to reach the track with the requested data. The time it takes for the arm to move that distance will fundamentally limit the number of random I/O requests which can be serviced in any given time. For the purposes of benchmarking, these two mechanical speeds which limit disk I/O are provided in the manufacturer's specification sheets as times,

  1. Average Latency
    This is the time taken for the platter to undergo half a disk rotation. Why half? Well at any one time the data can be either a full disk rotation away from the head, or by luck it might already be right underneath it.  The time taken for a half rotation therefore gives us the average time it takes for the platter to spin round enough for the data to be retrieved.
     
  2. Average Seek Time
    Generally speaking, when the I/O request comes in for a particular piece of data, the head will not be above the correct track on the disk. The arm will need to move so that the head is directed over the correct track where it must then wait for the platter spin to present the target data beneath it. As the data could potentially be anywhere on the platter, the average seek time is the time taken for the head to travel half way across the disk.

So, whilst disk RPM is important (as this yields the average latency above) it is only half the story. The seek time also has an important part to play.
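
As a quick check on the latency definition above, here is a minimal Python sketch that derives the average latency from the spin speed alone. The function name is made up for illustration; the only input is the RPM figure from a spec sheet.

```python
def average_latency_ms(rpm: float) -> float:
    """Average rotational latency: the time for half a platter rotation."""
    ms_per_rotation = 60_000.0 / rpm  # 60,000 ms in a minute
    return ms_per_rotation / 2.0


# Common spindle speeds, slowest to fastest.
for rpm in (5400, 7200, 10_000, 15_000):
    print(f"{rpm:>6} RPM -> {average_latency_ms(rpm):.2f} ms average latency")
```

For a 7200RPM drive this works out at about 4.17ms, which is the figure vendors quote for drives of that speed.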
 

The Response Time
Generally speaking, the time taken to service an individual (and random) I/O request will be limited by the combination of the above defined latency and seek times. Let's take for example a fairly mainstream retail laptop harddisk -a Seagate Momentus. From the Seagate website its specifications are,

Spin Speed (RPM) .................. 7200 RPM
Average latency .......................4.17ms
Seek time (Read) .....................11ms
Seek time (Write) .....................13ms
I/O data transfer rate ................300MB/s

Returning to our special case of a sequential read, we can see that the time taken to locate the start of our data will be the sum of the average latency and the average seek times. This is because once the head has moved over the disk to the correct track (the seek time) it will still have to wait (on average) for half a platter rotation to locate the data. The total time taken to locate and read the data is called the drive's response time,

Response Time = (Average Latency) + (Average Seek Time)

I've heard people question this formula on the grounds that these two mechanical motions occur concurrently -the platter is in motion whilst the arm is tracking across the disk. The thinking then is that the response time is whichever is the larger of seek and latency. This thought experiment however has a flaw -once the drive head reaches the correct track it has no idea what sector is beneath it. The head only starts reading once it reaches the target track and thereafter must use the sector address marks to orient itself (see figure below). Once it has the address mark, it knows where it is on the platter and therefore how many sector gaps must pass before the target sector arrives.

The result is that when the head arrives at the correct track, we will still have to wait on average for half a disk rotation for the correct sector to be presented. The formula which sums the seek and latency to provide the drive's response time is therefore correct.
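
The argument can also be tested numerically. The sketch below is a toy simulation, not a model of any real drive: it assumes the target sector starts at a uniformly random angle, lets the platter keep spinning during the seek, and then measures how long the head still has to wait once it arrives on the track. The function name and trial count are arbitrary choices.

```python
import random


def mean_wait_after_seek_ms(rpm, seek_ms, trials=100_000, seed=1):
    """Average rotational wait remaining once the head arrives on the track."""
    rng = random.Random(seed)
    rotation_ms = 60_000.0 / rpm
    total_wait = 0.0
    for _ in range(trials):
        # Time (measured from the start of the seek) at which the target
        # sector would first pass under the head's final position.
        first_pass_ms = rng.random() * rotation_ms
        # The platter keeps turning during the seek, so on arrival the
        # remaining wait is whatever is left until the sector next passes.
        total_wait += (first_pass_ms - seek_ms) % rotation_ms
    return total_wait / trials


for seek_ms in (2.0, 11.0, 13.0):
    wait = mean_wait_after_seek_ms(7200, seek_ms)
    print(f"seek {seek_ms:4.1f} ms -> mean extra wait {wait:.2f} ms")
```

Whatever seek time is used, the mean extra wait stays close to half a rotation (about 4.17ms at 7200RPM), which is why the two times are added rather than overlapped.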

Digression aside, the response time for our Seagate Momentus is therefore,

 (Response Time) = 11ms + 4.17ms 

                = 15.17ms. 

So the drive's response time is a little over 15 thousandths of a second. Well that sounds small, but how does this compare with other drives and in what scenarios will the drive's response time matter to us?

To get an idea of how a drive's response time impacts on disk performance, let's first see how this comes into play in a sequential read operation. 

Disk Transfer Rates aka the 'Sequential Read'
Most disk drive manufacturers report both the response time, and a peak transfer rate in their drive specification. The peak transfer rate typically refers to the best case sequential read scenario.

 
Let's assume the OS has directed the disk to perform a large sequential read operation. After the initial average overhead of 15.17ms to locate the start of the data, the actuator arm need now move only fractionally with each disk rotation to continue the read (assuming the data is contiguous). The rate at which we can read data off the disk is now limited by the platter RPM and how much data the manufacturer can pack into each track.
 
Well, we know the RPM speed of the platter, but what about the data density on the platter? For that we have to dig into the manufacturer's spec sheet,
 

This tells us that the number of bits per inch of track is 1,490,000. Let's now use this data to work out how much data the drive could potentially deliver on a sequential read.

Noting this is a 2.5inch drive, the maximum track length is going to be the outer circumference of the drive (pi * d) = 3.14 * 2.5 = 7.85 inches. As we have 1490kb per inch data density, this means the maximum amount of data which can be crammed onto a track is about,

(Data Per Track)  = 7.85 * 1490 k bits

                = 11,697 k bits

                = 1.43MB

Now a disk spinning at 7200RPM is actually spinning 120 times per second. Which means that the total amount of data which can pass under the head in 1 second is a massive 172MB (120 * 1.43MB).

Taking into account that perhaps about 87% of a track is data, this gives a maximum disk throughput of about 150MB/s which is surprisingly in agreement with Seagate's own figures.

Note that this calculation is best case -it assumes the data is being sequentially read from the outermost tracks of the disk and that there are no other delays between the head reading the data and the operating system which requested it. As we start populating the drive with data, the tracks get smaller and smaller as we work inwards (don't worry -we'll cover this in Zone Bit Recording below). This means less data per track as you work towards the centre of the platter, and therefore the less data passing under the head in any given time frame.

To see how bad the sequential read rate can get, let's perform the same calculation for the smallest track which has a 1 inch diameter. This gives a worst case sequential read rate of 60MB/s! So when your users report that their computers get progressively slower with time, they might not actually be imagining it. As the disk fills up, retrieving the data from the end of a 2.5inch drive will be 2.5 times slower than retrieving it from the start. For a 3.5 inch desktop harddisk the difference is 3.5 times.
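
The outer and inner track estimates can be reproduced with a few lines of Python. This is a rough sketch under the same assumptions as the text: a constant linear density of 1,490,000 bits per inch on every track, about 87% of each track holding user data, and 1MB taken as one million bytes. The function names are illustrative only.

```python
import math


def track_bytes(diameter_in: float, bits_per_inch: float) -> float:
    """Raw capacity of one track: circumference times linear density."""
    return math.pi * diameter_in * bits_per_inch / 8.0


def sequential_mb_per_s(diameter_in, bits_per_inch, rpm, data_fraction=0.87):
    """Best-case sequential rate for a track of the given diameter."""
    rotations_per_s = rpm / 60.0
    usable = track_bytes(diameter_in, bits_per_inch) * data_fraction
    return usable * rotations_per_s / 1e6


outer = sequential_mb_per_s(2.5, 1_490_000, 7200)
inner = sequential_mb_per_s(1.0, 1_490_000, 7200)
print(f"outer track: {outer:.0f} MB/s")
print(f"inner track: {inner:.0f} MB/s")
print(f"outer/inner: {outer / inner:.1f}x")
```

Because the rate scales directly with track diameter, the ratio between the outermost and innermost tracks is simply the ratio of their diameters.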

The degradation which comes into play as a disk fills up aside, the conclusion to take away from this section is that a drive's response time does not impact on the sequential read performance. In this scenario, the drive's data density and RPM are the important figures to consider.

Before we move onto a scenario where the response time is important, let's look at how drives manage to store more data on their outer tracks than they do on their inner ones.

Zone Bit Recording
As I stated in the above section, the longer outer tracks contain more data than the shorter inner tracks. This might seem obvious, but this has not always been the case. When harddisks were first brought to market their disk controllers were rather limited. This resulted in a very simple and geometric logic in the way tracks were divided into sectors as shown below. Specifically, each track was divided into a fixed number of sectors over which the data could be recorded. On these disks the number of sectors-per-track was a constant quantity across the platter.

As controllers became more advanced, manufacturers realised that they were finally able to increase the complexity of the platter surface. In particular, they were able to increase the numbers of sectors per track as the track radius increased.

The optimum situation would have been to record on each track as many sectors as possible into its length, but as disks have thousands of tracks this presented a problem - the controller would have to keep a table of all the tracks with their sector counts so it would know exactly what track to move the head to when reading a particular sector. There is also a law of diminishing returns at play if you continue to attempt to fit the maximum number of sectors into each and every track.

A compromise was found. The platter would be divided into a small number of zones, each zone being a logical grouping of tracks with a specific sectors-per-track count. This had the advantage of increasing disk capacities by using the outer tracks more effectively. Importantly, this was achieved without introducing a complex lookup mechanism on the controller when it had to figure out where a particular sector was located.

The diagram above shows an example where the platter surface is divided into 5 zones. Each of these zones contains a large number of tracks (typically thousands), although this is not illustrated in the above pictures for simplicity. This technique is called Zone Bit Recording, or ZBR for short.
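
To illustrate why a handful of zones keeps the lookup simple, here is a toy Python sketch. The zone table is entirely hypothetical (5 zones of 1000 tracks each, with invented sector counts) and real controller firmware will differ; the point is only that locating a sector needs a table with one entry per zone rather than one per track.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical layout: (tracks in zone, sectors per track), outermost first.
ZONES = [(1000, 900), (1000, 800), (1000, 700), (1000, 600), (1000, 500)]

# First logical sector number of each zone.
ZONE_STARTS = [0] + list(accumulate(t * s for t, s in ZONES))[:-1]
TOTAL_SECTORS = sum(t * s for t, s in ZONES)


def locate(sector: int):
    """Return (zone index, track within zone, sector within track)."""
    if not 0 <= sector < TOTAL_SECTORS:
        raise ValueError("sector outside the disk")
    zone = bisect_right(ZONE_STARTS, sector) - 1
    _, sectors_per_track = ZONES[zone]
    track, sector_in_track = divmod(sector - ZONE_STARTS[zone], sectors_per_track)
    return zone, track, sector_in_track


print(locate(0))          # first sector of the outermost zone
print(locate(900_000))    # first sector of the second zone
print(locate(3_499_999))  # last sector on the disk
```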

On some harddisks, you can see this zoning manifest very clearly if you use a disk benchmarking tool like HD Tune. This tool tests the disk's sequential read speed working from the outermost track inwards. In the particular case of one of my Maxtor drives, you can see quite clearly that the highest disk transfer rates are obtained on the outer tracks.  As the tool moves inwards, we see a sequence of steps as the read head crosses zones possessing a reduced number of sectors per track. In this case we can see that the platter has been divided into 16 zones.


This elegant manifestation of ZBR is sadly hard to find on modern drives -the stairs are generally replaced by a spiky mess. My guess is that other trickery is at play with caches and controller logic which results in so many data bursts as to obscure the ZBR layout.

Understanding Enterprise Disk Performance

Now that we've covered the basics of how harddisks work, we're ready to take a deeper look into disk performance in the enterprise. As we'll see, this means thinking about disk performance in terms of response times instead of the sustained disk throughputs we've considered up to now.

Disk Operations per Second - IOPS
What we have seen in the above sections is that the disk's response time has very little to do with a harddisk's transfer rate. The transfer rate is in fact dominated by the drive's RPM and linear recording density (the maximum number of sectors-per-track).

This raises the question: exactly when does the response time become important?

To answer this, let's return to where this article started -SQL Servers.  The problem with databases is that database I/O is unlikely to be sequential in nature. One query could ask for some data at the top of a table, and the next query could request data from 100,000 rows down. In fact, consecutive queries might even be for different databases.
 
If we were to look at the disk level whilst such queries are in action, what we'd see is the head zipping back and forth like mad -apparently moving at random as it tries to read and write data in response to the incoming I/O requests.

In the database scenario, the time it takes for each small I/O request to be serviced is dominated by the time it takes the disk heads to travel to the target location and pick up the data. That is to say, the disk's response time will now dominate our performance. The response time now reflects the time our storage takes to service an I/O request when the request is random and small. If we turn this new benchmark on its head, we can invert this to give the number of Input/Output oPerations per Second (IOPS) our storage provides.

So, for the specific case of our Seagate Drive with a 15.17ms response time, it will take on average at least 15.17ms to service each I/O. Turning this on its head to give us our IOPS yields (1 / 0.01517) which is 66 IOPS.

Before we take a look and see whether this value is good or bad, I must emphasise that this calculation has not taken into account the process of reading or writing data. An IOPS value calculated in these terms is actually referring to zero-byte file transfers. As ludicrous as this might seem, it does give a good starting point for estimating how many read and write IOPS your storage will deliver as the response time will dominate for small I/O requests.
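
The zero-byte IOPS estimate is simple enough to express as a one-line function. The sketch below uses the Momentus figures quoted earlier; the function name is illustrative.

```python
def zero_byte_iops(avg_latency_ms: float, avg_seek_ms: float) -> float:
    """IOPS implied by the response time alone (no data transfer time)."""
    response_ms = avg_latency_ms + avg_seek_ms
    return 1000.0 / response_ms


# Seagate Momentus: 4.17 ms average latency, 11 ms read seek.
print(f"{zero_byte_iops(4.17, 11.0):.0f} IOPS")
```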

In order to gauge whether my Seagate Momentus IOPS figure of 66 is any good or not, it would be useful to have a feeling for the IOPS values that different classes of storage provide. Below is an enhancement to a table inspired by Nick Anderson's efforts where he grouped various drive types by their RPM and then inverted their response times to give their zero-byte read IOPS,

As you can see, my Seagate Momentus actually sits in the 5400RPM bracket even though it's a 7200RPM drive. Not so surprising as this is actually a laptop drive, and compromises are often made in order to make such mobile devices quieter. In short -your mileage will vary.

IOPS and Data
Our current definition of a drive's IOPS is based on the time it takes a drive to retrieve a zero-sized file. Of immediate concern is what happens to our IOPS values as soon as we want to start retrieving/writing data. In this case, we'll see that both the response time and sequential transfer rates come into play.

To estimate the I/O request time, we need to sum the response time with the time required to read/write our data (noting that a write seek is normally a couple of ms longer than a read seek to give the head more time to settle). The chart below therefore shows how I'd expect the IOPS to vary as we increase the size of the data block we're requesting from our Seagate Momentus drive.
 

So our 66 IOPS Seagate drive will in a SQL Server scenario (with 64KB block sizes) actually give us 64 IOPS when reading and 56 IOPS when writing.
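
The read and write figures can be approximated as follows. This sketch assumes the block is transferred at the drive's 150MB/s sustained rate (with 1MB taken as one million bytes) and simply adds that transfer time to the seek and latency; the function name and defaults are illustrative.

```python
def iops_for_block(block_bytes, seek_ms, latency_ms=4.17, transfer_mb_per_s=150.0):
    """Estimated IOPS when each request moves block_bytes of data."""
    transfer_ms = block_bytes / (transfer_mb_per_s * 1e6) * 1000.0
    return 1000.0 / (seek_ms + latency_ms + transfer_ms)


BLOCK = 64 * 1024  # 64KB, the SQL Server block size used in the text

print(f"read : {iops_for_block(BLOCK, seek_ms=11.0):.1f} IOPS")
print(f"write: {iops_for_block(BLOCK, seek_ms=13.0):.1f} IOPS")

# How the read IOPS falls away as the block size grows.
for kb in (0.5, 4, 64, 512, 4096):
    print(f"{kb:>7} KB -> {iops_for_block(kb * 1024, seek_ms=11.0):.1f} read IOPS")
```

With these assumptions the 64KB case lands at roughly 64 read IOPS and between 56 and 57 write IOPS, in line with the chart values quoted above.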

The emphasis here is that when talking about IOPS (and of course comparing them), it is important to confirm the block sizes being tested and whether we are talking about reading or writing data. This is especially important for drives where the transfer times start playing a more significant role in the total time taken for the IO operation to be serviced.

As real-world IOPS values are detrimentally affected when I/O block sizes are considered (and also of course if we are writing instead of reading), manufacturers will generally quote a best case IOPS. This is taken from the time taken to read the minimum amount from a drive (512 bytes). This essentially yields an IOPS value derived from the drive's response time.

Cynicism aside, this simplified way of looking at IOPS is actually fine for ball-park values. It is just worth bearing in mind that these quoted values are always going to be rather optimistic.

IOPS and Partial Stroking
If you recall, our 500GB Seagate Momentus has the following specs,

Spin Speed (RPM) .................. 7200 RPM
Average latency .......................4.17ms
Seek time (Read) .....................11ms
Internal I/O data transfer rate .....150MB/s
IOPS........................................66

On the IOPS scale, we've already determined that this isn't exactly a performer. If we wanted to use this drive for a SQL database we'd likely be pretty disappointed. Is there anything we can do once we've bought the drive to increase its performance? Technically of course the answer is no, but strangely enough we can cheat the stats by being a little clever in our partitioning.

To see how this works, let's partition the Momentus drive so that only the first 100GB is formatted. The rest of the drive, 400GB worth, is now a dead-zone to the heads -they will never go there. This has a very interesting consequence for the drive's seek time. The heads are now limited to a small portion of the drive's surface, which means the time to traverse from one end of the formatted drive to the other is much smaller than the time it would have taken for the head to cross the entire disk. This reflects rather nicely on the drive's seek time over that 100GB surface, which has an interesting effect on the drive's IOPS.

To get some figures, let's assume that about 4ms of a drive's seek time is taken up with accelerating and decelerating the heads (2ms to accelerate, and 2ms to decelerate). The rest of the drive's seek time can then be said to be attributed to its transit across the platter surface.

So, by reducing the physical distance the head has to travel now to a fifth of the drive's surface, we can estimate that the transit time is going to be reduced likewise. This results in a new seek time of (11-4)/5 + 4 = 5.4ms.

In fact, as more data is packed into the outside tracks due to ZBR this would be a conservative estimate. If the latter four fifths of the drive were never going to be used, the drive stats would now look as follows,

Spin Speed (RPM) .................. 7200 RPM
Average latency .......................4.17ms
Seek time (Read) .....................5.4ms (for 0-100GB head movement restriction)
Internal I/O data transfer rate .....150MB/s
IOPS........................................104

The potential IOPS for this drive has increased by more than 50%. In fact, it's pretty much comparable now to a high-end 7200RPM drive! This trick is called partial stroking, and can be quite an effective way to ensure slower RPM drives perform like their big RPM brothers. Yes, you do lose capacity but in terms of cost you can save overall.

To see if this really works, I've used IOMETER to gather a number of response times for my Seagate Momentus using various partition sizes and a 512 byte data transfer.

Here we can see that the back of envelope calculation wasn't so bad -the average I/O response time here for a 100GB drive worked out to be 11ms and the quick calculation gave about 9.6ms. Not bad considering a lot of guess work was involved -my figures for head acceleration and deceleration were plucked out of the air. Further I didn't add a settling time for the head before it started reading the data to allow the vibrations in the actuator arm to settle down. In truth, I likely over-estimated the arm acceleration and deceleration times which had the effect of absorbing the head settle time.

But, as a rough calculation I imagine this wouldn't be too far off for most drives.
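
The back of envelope estimate for partial stroking can be written down as a small function. It carries the same guesswork as the text: a fixed 4ms for head acceleration and deceleration, with the remaining transit time scaling linearly with the fraction of the drive in use. The function names are illustrative.

```python
def stroked_seek_ms(full_seek_ms, used_fraction, accel_decel_ms=4.0):
    """Estimated seek time when only used_fraction of the drive is formatted."""
    if not 0.0 < used_fraction <= 1.0:
        raise ValueError("used_fraction must be in (0, 1]")
    transit_ms = full_seek_ms - accel_decel_ms  # time spent crossing tracks
    return accel_decel_ms + transit_ms * used_fraction


def iops(seek_ms, latency_ms=4.17):
    """Zero-byte IOPS for the given seek time and rotational latency."""
    return 1000.0 / (seek_ms + latency_ms)


full = iops(11.0)
short = stroked_seek_ms(11.0, used_fraction=100 / 500)
print(f"full stroke : 11.0 ms seek -> {full:.0f} IOPS")
print(f"100GB of 500: {short:.1f} ms seek -> {iops(short):.0f} IOPS")
```

Changing `accel_decel_ms` shows how sensitive the estimate is to that guessed figure, which is worth bearing in mind when comparing against measured response times.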

Your mileage will of course vary across drive models, but if for example you are looking at getting a lot of IOPS for a 100GB database, I'd expect that a 1TB 7200RPM Seagate Barracuda with 80 IOPS could be turned into a 120 IOPS drive by partitioning it for such a purpose. This would take the drive into the 10K RPM ballpark on the IOPS scale for less than half the price of a 100GB 10K RPM disk.

As you can see, this technique of ensuring most of the drive's surface is a 'dead-zone' for the heads can turn a modest desktop harddisk into an IOPS king for its class. And the reason for doing this is not to be petty, or prove a point -it's cost. Drives with large RPMs and quoted IOPS tend to be rather expensive.

Having said that, I don't imagine that many vendors would understand you wanting to effectively throw the bulk of your drive's capacity out of the window. Your boss either...

 

How Many IOPS Do We Need?

Whilst enhancing our IOPS with drive stroking is interesting, what we're missing at the moment is where in the IOPS spectrum we should be aiming to target our disk subsystem infrastructure.
 
The ITMS 7.1 Planning and Implementation Guide has some interesting figures for a 20,000 node setup where SQL I/O was profiled for an hour at peak time,
 

The conclusion was that the main SQL Server CMDB database required on average 240 write IOPS over this hour window. As we don't want to target our disk subsystem to be working at peak, we'd probably want to aim for a storage system capable of 500 write IOPS.

 
This IOPS target is simply not achievable through a single mechanical drive, so we must move our thinking to drive arrays in the hope that by aggregating disks we can start multiplying up our IOPS. As we'll see, it is at this point things get murky.....
 
 

IOPS, Disk Arrays & Write Penalties
A quick peek under the bonnet of most enterprise servers will reveal a multitude of disks connected to a special disk controller called a RAID controller. If you are not familiar with RAID, there is plenty of good online info available on this topic, and RAID's Wikipedia entry isn't such a bad place to start.

To summarise, RAID stands for Redundant Array of Independent Disks. This technology answers the need to maintain enterprise data integrity in a world where harddisks have a life expectancy and will someday die. The RAID controller abstracts the underlying physical drives into a number of logical drives. By building fault-tolerance into the way data is physically distributed, RAID arrays can be built to withstand a number of drive failures before data integrity is compromised.

Over the years many different RAID schemes have been developed to allow data to be written to a disk array in a fault tolerant fashion. Each scheme is classified and allocated a RAID level. To help in the arguments that follow concerning RAID performance, let's review now some of the more commonly used RAID levels,

  • RAID 0
    This level carves up the data to be written into blocks (typically 64K) which are then distributed across all the drives in the array. So when writing a 640KB file through a RAID 0 controller with 5 disks it would first divide the file into 10 x 64KB blocks. It would then write the first 5 blocks simultaneously, one to each of the 5 disks, and once that was successful proceed to write the remaining five blocks in the same way. As data is written in layers across the disk array this technique is called striping, and the block size above is referred to as the array's stripe size. Should a drive fail in RAID 0, the data is lost -there is no redundancy. As the striping concept used here is the basis of other RAID levels which do offer redundancy, it is hard to omit RAID 0 from the official RAID classification.

    RAID 0's great benefit is that it offers a much improved I/O performance as all the disks are potentially utilised when reading and writing data.
     

  • RAID 1
    This is the simplest to understand RAID configuration. When a block of data is written to a physical disk in this configuration, that write process is exactly duplicated on another disk. For that reason, these drives are often referred to as mirrored pairs. In the event of a drive failure, the array can continue to operate with no data loss or performance degradation.
     
  • RAID 5
    This is a fault tolerant version of RAID 0. In this configuration each stripe layer contains a parity block. The storing of a parity block provides the RAID redundancy as should a drive fail, the information the now defunct drive contained can be rebuilt on-the-fly using the rest of the  blocks in the stripe layer. Once a drive fails, the array is said to operate in a degraded state. A single read can potentially require the whole stripe to be read so that the missing drive's information can be rebuilt. Should a further drive fail before the defunct drive is replaced (and rebuilt) the integrity of the array will be lost.
     
  • RAID 6
    As with RAID 5 above, but now two drives store parity information which means that two drives can be lost before array integrity is compromised. This extra redundancy comes at the cost of losing the equivalent of two drives worth of capacity in the RAID 6 array (whereas in RAID 5 you lose the equivalent of one drive in capacity).
     
  • RAID 10
    This is what we refer to as a nested RAID configuration -it is a stripe of mirrors and is as such called RAID 1 + 0 (or RAID 10 for short). In this configuration you have a stripe setup as in RAID 0 above, but now each disk has a mirrored partner to provide redundancy. Protection against drive failure is very good as the likelihood of both drives failing in any mirror simultaneously is low. You can potentially lose up to half of the total drives in the array with this setup (assuming a one disk per mirror failure).

    With RAID 10 your array capacity is half the total capacity of your storage.  
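
To make the RAID 5 parity idea above more concrete, here is a minimal Python sketch. It assumes the parity block is the byte-wise XOR of the data blocks in a stripe layer, which is the usual single-parity scheme; the four 4-byte blocks are invented for illustration and real stripes are far larger.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe layer spread over four data disks, plus its parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# The disk holding data[2] dies: rebuild its block from the survivors.
survivors = data[:2] + data[3:] + [parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # the block that lived on the failed disk
```

Note that the rebuild has to read every surviving block in the stripe layer, which is why reads on a degraded array are so much more expensive.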
     

Below I show graphically examples of RAID 5 and RAID 10 disk configurations. Here each block is designated by a letter and a number. The letter designates the stripe layer, and the number designates the block index within that stripe layer. Blocks with the letter p index are parity blocks. 

As stated above, one of the great benefits that striping gives is performance.

Let's take again the example of a RAID 0 array consisting of 5 disks. When writing a file, all the data isn't simply written to the first disk. Instead, only the first block will be written to the first disk. The controller directs the second block to the second disk, and so on until all the disks have been written to. If there is still more of the file to write, the controller begins again from disk 1 on a new stripe layer.  Using this strategy, you can simultaneously read and write data to a lot of disks, aggregating your read and write performance.
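
The round-robin placement described above can be sketched in a few lines of Python. The function name is illustrative, and the example reuses the 640KB file, 64KB stripe size and 5 disks from the RAID 0 description earlier.

```python
def stripe_layout(file_bytes: int, stripe_bytes: int, disks: int):
    """Return (block index, disk index, stripe layer) for each block of a file."""
    blocks = -(-file_bytes // stripe_bytes)  # ceiling division
    return [(i, i % disks, i // disks) for i in range(blocks)]


KB = 1024
for block, disk, layer in stripe_layout(640 * KB, 64 * KB, disks=5):
    print(f"block {block} -> disk {disk + 1}, stripe layer {layer + 1}")
```

The ten blocks fall into two stripe layers of five, so each layer can be written to (or read from) all five disks at once.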

This can powerfully enhance our IOPS. In order to see how IOPS are affected by each RAID configuration, let's now discuss each of the RAID levels in turn and think through what happens for both incoming read and write requests.

  • RAID 0
    For the cases of both read and write IOPS to the RAID controller, one IOPS will result on the physical disk where the data is located.
     
  • RAID 1
    For the case of a read IOPS, the controller will execute one read IOPS on one of the disks in the mirror. For the case of a write IOPS to the controller, there will be two write IOPS executed -one to each disk in the mirror.
     
  • RAID 5
    For the case of a read IOPS, the controller does not need to read the parity data -it just directs the read to the disk which holds the data in question, resulting again in 1 IOPS at the backend. For the case of a disk write we have a problem -we also have to update the parity information in the target stripe layer. The RAID controller must therefore execute two read IOPS (one to read the block we are about to write to, and the other to obtain the parity information for the stripe). We must then calculate the new parity information, and then execute two write IOPS (one to update the parity block and the other to update the data block). One write IOPS therefore results in 4 IOPS at the backend!
     
  • RAID 6
    As above, one read IOPS to the controller will result in one read IOPS at the backend. One write IOPS will now however result in 6 IOPS at the backend to maintain the two parity blocks in each stripe (3 read and 3 write).
     
  • RAID 10
    One read IOPS sent to the controller will be directed to the correct stripe and to one disk of the mirrored pair -so again only one read IOPS at the backend. One write IOPS to the controller however will result in two IOPS being executed in the backend to reflect that both drives in the mirrored pair require updating.

What we therefore see when utilising disk arrays is the following,

  1. For disk reads, the IOPS capacity of the array is the number of disks in the array multiplied by a single drive IOPS. This is because one incoming read I/O results in a single I/O at the backend.
     
  2. For disk writes with RAID, the number of IOPS executed at the backend is generally not the same as the number of write IOPS coming into the controller. As a result, the total number of effective write IOPS that an array is capable of is generally much less than what you might assume by naively aggregating disk performance.

The number of I/O operations imposed on the backend by one incoming write request is often referred to as the RAID write penalty. Each RAID level suffers from a different write penalty as described above, though for easier reference the table below is useful,

    RAID Level    Write Penalty
    RAID 0        1
    RAID 1        2
    RAID 5        4
    RAID 6        6
    RAID 10       2

Knowing the write penalty each RAID level suffers from, we can calculate the effective IOPS of an array using the following equation,

    Effective IOPS = (n × IOPS) / (R + F × W)

where n is the number of disks in the array, IOPS is the single drive IOPS, R is the fraction of reads taken from disk profiling, W is the fraction of writes taken from disk profiling, and F is the write penalty (or RAID Factor).
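
As a rough Python sketch of this calculation: it assumes each incoming read costs one backend I/O and each incoming write costs F backend I/Os, so every front-end I/O consumes (R + F × W) backend I/Os on average. The function and variable names are illustrative, and the 8-drive array is a made-up example.

```python
# Write penalties (F) as described in the text for each RAID level.
WRITE_PENALTY = {"RAID 0": 1, "RAID 1": 2, "RAID 5": 4, "RAID 6": 6, "RAID 10": 2}

def effective_iops(n, drive_iops, read_fraction, write_fraction, penalty):
    """Front-end IOPS an array can sustain when each write costs `penalty` backend I/Os."""
    return n * drive_iops / (read_fraction + penalty * write_fraction)

# Hypothetical array: 8 drives of 120 IOPS each, under a 100% write workload.
for level, f in WRITE_PENALTY.items():
    print(f"{level}: {effective_iops(8, 120, 0.0, 1.0, f):.0f} effective IOPS")
```

For a pure read workload (R = 1, W = 0) the penalty drops out and the result is simply n multiplied by the single drive IOPS.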

If we know the number of IOPS we need from our storage array, but don't know the number of drives we need to supply that figure, then we can rearrange the above equation as follows,

    n = Effective IOPS × (R + F × W) / IOPS

So in our case of a SQL Server requiring 500 write IOPS (i.e. 0% READ pretty much) let's assume we are offered a storage solution of 10K SAS drives capable of 120 IOPS a piece. How many disks would we need to meet this write IOPS requirement? The table below summarises the results.
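
The drive counts for this example can be worked through with a short Python sketch. It assumes the backend load is the target IOPS multiplied by (R + F × W), and it rounds up to whole drives. That rounding convention is my assumption, so the figures may differ slightly from the table; mirrored levels also need an even number of drives in practice.

```python
import math

# Write penalties (F) as described in the text for each RAID level.
WRITE_PENALTY = {"RAID 0": 1, "RAID 1": 2, "RAID 5": 4, "RAID 6": 6, "RAID 10": 2}

def drives_needed(target_iops, drive_iops, read_fraction, write_fraction, penalty):
    """Whole drives needed to deliver `target_iops` at the front end."""
    backend_iops = target_iops * (read_fraction + penalty * write_fraction)
    return math.ceil(backend_iops / drive_iops)

# 500 write IOPS (0% reads) on 10K SAS drives of 120 IOPS each.
for level, f in WRITE_PENALTY.items():
    print(f"{level}: {drives_needed(500, 120, 0.0, 1.0, f)} drives")
```

Under these assumptions the count ranges from 5 drives for RAID 0 up to 25 drives for RAID 6.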

What we see here is a HUGE variation in the number of drives required depending on the RAID level. So, your choice of RAID configuration is very, very important if storage IOPS is important to you.

I should say that most RAID 5 and RAID 6 controllers do understand this penalty, and will consequently cache as many write IOPS as possible, committing them during an idle window where possible. As a result, in real-world scenarios these controllers can perform slightly better than you'd anticipate from the table above. However, once these arrays become highly utilised the idle moments become fewer, which edges the performance back toward the limits defined above.
 

Summary

This finally concludes today's article. I hope it's been useful and that you now have a better understanding of IOPS. The main points to take away from this article are,

  1. Get involved with your server/storage guys when it comes to spec'ing your storage
     
  2. The important measure for sequential I/O is disk throughput
     
  3. The important measure for random I/O is IOPS
     
  4. Database I/O is generally random in nature and in the case of the Altiris CMDB the SQL profile is also predominantly write biased.
     
  5. Choosing your storage RAID level is critical when considering your IOPS performance. By selecting RAID 6 over RAID 1 or RAID 10 you can potentially drop your total write IOPS by a factor of 3.

I should finish with an emphasis that this article is a starter on the disk performance journey. As such, this document should not be considered in isolation when benchmarking and spec'ing your systems. Note also that at the top of the reading list below is a *great* Altiris KB for SQL Server which will help you configure your SQL Server appropriately.

Next in the article pipeline (with luck) will be "Getting the Hang of Benchmarking" which will aim to cover more thoroughly what you can do to benchmark your systems once they are in place.

Good Luck!

Ian./

Further Reading

SQL Server 2005 and 2008 Implementation Best Practices and Optimization - A great Symantec KB article on improving SQL Server performance

http://www.pcguide.com/ref/hdd - A great reference covering everything you'd ever want to know about how harddisks work

http://www.zdnet.com/blog/ou/how-higher-rpm-hard-drives-rip-you-off/322 - The blog entry which got me interested in drive stroking

http://www.seagate.com/staticfiles/support/disc/manuals/notebook/momentus/XT/100610268b.pdf - The Seagate Momentus specification sheet

http://vmtoday.com/2009/12/storage-basics-part-i-intro/ - A nice series of articles by Joshua Townsend on storage

http://www.seagate.com/docs/pdf/whitepaper/tp613_transition_to_4k_sectors.pdf - An interesting Seagate whitepaper on the transition from 512 Byte sectors to 4K sectors

http://www.techrepublic.com/blog/datacenter/calculate-iops-in-a-storage-array/2182 - Scott Lowe's great TechRepublic article on IOPS and storage arrays

http://oss.oracle.com/~mkp/docs/ls-2009-petersen.pdf - Martin Petersen's paper on I/O topology

48 Comments

Tenacious Geo's picture

Ian (AKA the IT Juggernaut),

I am so grateful that you are an active part of this Altiris/Symantec community. Your brilliance and ingenuity contribute greatly to the ability of learners like me to keep growing. I am putting your disk knowledge immediately to use in my planning for ITMS 7.1.

Kind regards,

Geo

-Geo

Pascal KOTTE's picture

Thanks Ian.

~Pascal @ Kotte.net~ Do you speak French? And use Altiris? Come and join us at the GUASF

Darren Collins's picture

Ian,

I know a lot of work went into this: superb job.

Many thanks,
Darren.

Darren Collins
Applications Packaging and Deployment for IT Services,
Oxford University, UK.

Seamless's picture

An excellent and useful article. Thanks, Ian

jessek's picture

I was suffering from insomnia last night when I came across this article.  Reading this made it worse, it was so compelling.  Well done!

Jesse Kozikowski
Aspirus, Inc.

ianatkin's picture

To be honest Jessek, I found it compelling to write as well. For me too it resulted in some late nights... ;-)

Ian Atkin, IT Services, Oxford University, UK

Connect Etiquette: "Mark as Solution" those posts which assist you most in resolving your problem, and give a thumbs up to useful articles and downloads

rweiss77's picture

Nice work on this article!

Rick Grigg at Matrix's picture

Excellent article, nicely articulated, great job all around.  My own clients have been looking for something like this.  Thanks very much!!

 

Rick Grigg

awgtek variq's picture

two words: Fusion-IO and OCZ :)

which raises (but does not 'beg') the question :) : what do you think of the SSD revolution and its implications for database technologies?

Insightful article btw. I was looking for citations to support RAID 10 over RAID 5. This works well. thanks.

ianatkin's picture

I think SSD technology is superb, but out of my price range to test!! I had hoped to persuade work to get me some in to test, but that did not work out. :-(

The main show stopper is price. Single Level Cell (SLC) technology is the one you want for mission-critical applications, but it is still much more expensive than the Multi Level Cell (MLC) equivalent. SLC is faster and has a longer lifespan.

There was a rather scary article on ZNet a few months ago which cited that all the team who had bought SSD drives in the last 18 months had suffered failures. Despite these problems, they were still excited about SSD, and would not consider going back. Mechanical drives were just too painful for them by comparison!

So in short, I think for databases SSD is FANTASTIC. Depending on the intensity of the I/O you expect, you might want to build into your budget a more regular drive replacement regime than you currently have with traditional technologies though....

Ian Atkin, IT Services, Oxford University, UK

momkvi's picture

The key is not to throw everything at SSD but what makes sense.  Run some AWR's or whatever reports your database can produce regarding I/O usage and put your highest hit data on SSD's.  Keep the rest of the data on spindles.  This can save you money on extra spindles (or cache) trying to support increasing number of I/Os.

Nowadays, there are Enterprise class MLC's that have life spans much longer than the first batch of SSD's did.

And if you don't want to buy SSD's, put a bunch of memory into the server and increase your buffer cache.  More than one way to skin a cat - or improve performance.  Skinning a cat is not fun from what I hear, improving performance is.

momkvi's picture

Awesome Post!

Fusion_Technology's picture

There is so little information out there, especially when it pertains to the ITMS environments.

Thank you for putting all this effort into the article.

Jack

SSE, STS, ACP, ACE, AAC, ACI

jonfleck's picture

IOPS are also very important for normal desktop use, not just databases. I upgraded my laptop to an SSD a couple of weeks ago and it has been the best real-world performance boost I've ever gotten from a hardware upgrade. I've even upgraded my Windows 7 HTPC to an SSD. Even though media doesn't really need a fast drive, it's nice to have an HTPC that can boot up and shut down in seconds and perform system updates without causing frame rate drops in the video.

ianatkin's picture

 

Indeed -everyone wants their computers to boot faster, and react faster. And SSD if you can afford it is certainly the way to go.

But, being pedantic -remember SSD is blazingly fast on both random and serial I/O. So, without profiling your boot I/O you can't say for sure whether it's the massive throughput of SSD which is the swinger for you here, or the massive IOPS.

But heck -as long as our desktops are fast do we really care?? ;-)

Ian Atkin, IT Services, Oxford University, UK

AlexP's picture

Marvelous work Ian!

SSD is indeed the best storage choice for a database.

Ding Honghui's picture

Best article I have read about the disk performance!

And I have some notes:

A RAID 5 write is not always 2 reads and 2 writes.

For example, a full-stripe write does not incur them.

From wikipedia: RAID 5 implementations suffer from poor performance when faced with a workload which includes many writes which are *smaller than the capacity of a single stripe*.

Actually, the exact behaviour depends on the available stripe space and the size of the data to be written.

ianatkin's picture

Hi Ding -thanks for the high praise. This article has been much more useful to the community than I initially gauged. I should have written it sooner!

To your point on RAID5 stripe commits, this is absolutely correct. If a whole stripe needs to be written then the initial read to confirm the parity in advance is totally redundant.

However for the purposes of IOPS calculations (where we are predominantly dealing with the smallest writes possible) this penalty remains.

This is not to say vendors don't take advantage. Several vendors will temporarily cache RAID5 segment writes, the aim here being that should the sweet spot of a full-stripe write emerge then they can commit this in one go.

Clever stuff, and (as you point out) another important factor when thinking about profiling your application against your storage.
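
As a rough illustration of the difference between a small write and a full-stripe write, here is a simplified Python sketch that counts backend I/Os for a RAID 5 stripe layer. It assumes a plain read-modify-write for partial writes and ignores caching and other controller optimisations (for example, reading the untouched blocks instead when that is cheaper). The function name is hypothetical.

```python
def raid5_backend_ios(blocks_written, data_disks):
    """Estimate backend I/Os for writing `blocks_written` blocks in one RAID 5 stripe layer."""
    if not 1 <= blocks_written <= data_disks:
        raise ValueError("blocks_written must be between 1 and data_disks")
    if blocks_written == data_disks:
        # Full-stripe write: parity is computed from the new data alone,
        # so nothing is read; write every data block plus the parity block.
        return data_disks + 1
    # Partial write: read the old data blocks and the old parity,
    # then write the new data blocks and the new parity.
    return 2 * blocks_written + 2

# Hypothetical 5-disk RAID 5 array (4 data blocks + 1 parity block per stripe layer).
print(raid5_backend_ios(1, data_disks=4))  # 4 I/Os for a single block
print(raid5_backend_ios(4, data_disks=4))  # 5 I/Os for four blocks
```

In this simplified model a single-block write costs 4 backend I/Os per block, while a full-stripe write on the same array costs 1.25.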

 

Ian Atkin, IT Services, Oxford University, UK

ricodjs's picture

Good article, Ian, the level of detail and determination to get to the facts you show here are inspiring. Almost good enough to get you a job at Cambridge :)

R

ianatkin's picture

That Be Fighting Talk Mister.

Barcelona. The Hilton at Dawn. May the best SQL Query win... 

Ian Atkin, IT Services, Oxford University, UK

EdT's picture

Do you want me as your second?  Or at least to capture the event on memory card?

If your issue has been solved, please use the "Mark as Solution" link on the most relevant thread.

ianatkin's picture

A second kinda makes it harder to chicken out.... or at least more official when I do... ;-)

Ian Atkin, IT Services, Oxford University, UK

Alim Shaikh's picture

Great Job Ian, awesome document...

michael cole's picture

Your passion for writing isn't hidden here. I love that you use the first person as opposed to a 'dry third', which is typically against convention for a technical document. Your writing is always comfortable and personal and I can see you took care to come across to the reader like that.

I think your diagrams are really neat. I'm not sure if you were being sarcastic with saying your powerpoint art was 'gorgeous', but I thought the Fisher-Price blue hard disk was very friendly looking and far preferable to the real thing!

It was really worth while for you to commit this to the forums, I will be directing people to here if they ask about IOPS.

Michael Cole

Principal Business Critical Engineer

Business Critical Services

DentargNOT's picture

Got one question.

Is write penalty constant?

For example for raid1 the write penalty here is 2, but I guess it changes with the number of disks. Cause getting 500 IOPS in raid1 with 8 disks seems unreal (in example above).

Not sure about raid5, cause it depends on implementation, some could get a bigger write penalty with a larger number of disks.

ianatkin's picture

It's best to assume that the write penalty is constant for any given RAID configuration. This gives you the worst case scenario figure at high loads when vendor optimisations no longer apply. Although in truth, at high loads vendor solutions can exhibit other unpleasant issues I guess.... ;-)

And yes -the figures can be surprising. This is because vendors will always quote, where they can, the best-case IOPS that an array is capable of. They might for this reason quote read IOPS or even write IOPS (but for striping with no redundancy).

The reason why I started doing this type of testing a few years back is because I was always surprised that my arrays were never achieving the vendor figures. It pays to look a bit deeper into how those figures were obtained, but also to bear in mind of course the disk fundamentals I've tried to cover here.

Ian Atkin, IT Services, Oxford University, UK

Lery's picture

This gets interesting when you put a SAN into play. I see a lot of people just throwing the Symantec CMDB blindly into the SAN world without a thought on how it will impact performance.

thawk's picture

Good work. I saw that fusion-io plug in one of the comments. It is definitely worth looking into; "the Woz" and co. seem to be doing this right. Of course, in the interest of full disclosure, I went to a fusion-io presentation earlier this week... the koolaid was good for those people in need of iops.

ianatkin's picture

FusionIO is great, and am still holding out hope to get this in (at some point in the murky future...).

I did get an OCZ RevoDrive for my desktop and have been fairly impressed I have to say. I have dumped my server VMDKs on it, and the response times & throughputs are brill. Would be curious to gauge the difference between this and an SSD -the response times are meant to be superior when we plug direct into the bus..... but still would be nice to see a response time histogram for both!

Ian Atkin, IT Services, Oxford University, UK

Sebastien77's picture

Thanks for this great article! I am a complete newbie and I like the approach but there are a couple of things I do not quite get...

1) I do not understand where the performance difference between purely read IOPS and purely write IOPS stems from (IOPS graphics in the article) given that there should be no difference in terms of mechanical movements (arm and disk speed) in both scenarios.

2) Talking about read/write differences I understand the concept of random reads (they might happen anywhere on the disk and the controller cannot choose) but the random writes are puzzling me... purely random writes are somewhat very unlikely to happen for a well designed system and an empty disk since the data coming in can be written sequentially. Only when random reads are interleaved the head may be sent somewhere else (but then the write cache should help a lot)... as such what do pure write IOPS really mean in practice? Could the simple performance approximations be simply adapted to account for it, at least in some very unbalanced scenarios (e.g. many writes, very few reads)?

3) Talking about the read cache: is it really fundamental in a hard drive? For memory access I understand that data proximity is highly correlated to the likelihood of accessing it within a given amount of time, but for a hard drive how does this happen in practice (is it sector based, meaning the full sector is always fetched to the cache if accessed? Does it then really help?)

I am not trying to sacrifice the simplicity of the article but these thoughts might be interesting points to complement it and I would like to have your point of view!

Seb

ianatkin's picture

Hi Seb. I'll try to answer each point as clearly as I can.

  1. Difference Between Read and Write IOPS
    For a single drive, there is actually a small difference between a read IOPS and a write IOPS. And it is a mechanical difference -a couple of extra milliseconds of seek time are required to allow the head to settle and the data to be written. In short, you can be a bit more gung-ho reading data from a disk than you can be when writing it.

    For multiple drives connected through a RAID controller, when writing you have the overhead of extra reads and writes to satisfy the RAID data duplication and parity requirement. This is the RAID write penalty and is much more significant than the small percentage overhead you get from the extra head settling time described above.
     

  2. Random Writes Should be rare?
    Remember here I am talking about enterprise storage -not just a single disk. Today's world of virtualised workloads acts as an IO blender on storage, so many workloads which you think of as being sequential on a per app or per server basis actually get randomised.

    If you want more specifics, take a look at the graph below,

    Here I illustrate three applications which demand good storage performance -Virtual Desktop Infrastructures (20-80% random writes), Microsoft Exchange server (30% random writes) and of course Altiris ITMS (98% random writes). Yes, caches can help but these only work to alleviate spikes. If you have consistently high random write workloads, you'll in the end come to the limit of the drive mechanics.

    You ask what random write IOPS mean. In practice, if we remove the virtual I/O blender from the argument, random write IOPS tend to mean disparate and disconnected I/O operations. For example, a database backend to an online banking system: lots of people randomly requesting different pieces of data. Returning to the case of virtualised systems, it means lots of servers writing I/O blocks simultaneously, which results in an apparently random spattering of writes to the storage as the governor attempts to serve the requests in a timely fashion.
     

  3. Hard disk Read Cache -Important?
    From an IOPS point of view -probably not. But from a normal disk operation point of view with sequential I/O then certainly yes. It is this that allows the drive to read ahead and deliver cached bytes rather than having to read mechanically exactly what is asked for, when it's asked for. Fabulous when assisting sequential I/O and allowing high burst data rates.

 

Hope that's helped a little? Don't forget to vote up the article -your thumb could make it hit 50 votes!

Ian Atkin, IT Services, Oxford University, UK

Sebastien77's picture

Hi, thanks for your answer!

Point 1: Crystal clear.

Point 2: I overlooked that the perspective was more on database storage... let's now consider a single hard drive (or a single RAID0) on which a single client may either perform many read accesses or many write accesses (not mixed). My point was that many write accesses (even of small files) can a priori always be serialized to a long sequential write (provided the disk is close to empty and with the help of the write cache) so that the bottleneck should be the write bandwidth whereas random reads of far apart small files will hit the IOPS bottleneck. In essence for an empty disk random writes should be avoided by the controller if the write locations have not to be specific. Do you confirm this and shouldn't it be taken into account in the system planning?

Point 3: Following the previous comment, reading data ahead of time to make it accessible through the read cache is only successful when there are many cache hits. For random reads of many files spread all over the disk this is not likely to happen, and for sustained sequential reads the limiting factor is the sequential bandwidth (the cache cannot help), so I do not clearly see in which situation the read cache can bring a real boost - could you elaborate? Is it really a process where the cache is filled up with nearby data or are the read accesses queued up to be eventually grouped together?

Seb

ianatkin's picture

Hi Seb,

Point 2
I think you are mixing the sequential and random I/O arguments here. To be clear, when considering IOPS we are always thinking about random IO by definition. Applications will perform a mix of serial and random IO and it is up to the implementer to ensure that the storage is capable of both requirements.

Your point about writing lots of files to a disk is valid, and indeed will most likely be a serial write. So here we are limited by disk RPM and fall into the sequential read/write I/O class I discussed at the beginning of the article.

Reading files from a disk will also likely fall into this serial I/O class. Generally, reading and writing whole files is not overly restricted by IOPS unless they've been through some kind of I/O blender.

With regards to system planning, you'll need to profile your application (or get detailed spec sheets from the vendor). If your vendor is talking to you about IOPS, then they are talking about random I/O requirements. They will also (if you are lucky) advise you on the read/write fraction too as this is critical when choosing your RAID level.

But please don't get confused and think that you can fundamentally cheat the read/write arguments with caches. You can't. Caches give you breathing space and allow you to spread your I/O load a little. In the end, the data has to be written. And the higher your array utilisation, the less effective the cache becomes.

Point 3
I think you are right here -the disk cache is only effective for sequential I/O. More than that, it is only useful when you have plenty of idle windows in your IO pattern. In a sustained or random I/O scenario it can't help. 

To explain, let's imagine a 'cacheless scenario' where an application requests a 512K block from disk. The disk head navigates to the track, reads the sector, and sends that back up the bus. Now imagine that the application requests the next 1MB worth of blocks. The disk head no longer has to move (assuming no activity in the meantime), but has to wait a half revolution (on average) before it finds the requested start sector. It now needs to wait for the disk rotation to allow the next 1MB to be read.

Now consider the same situation with a disk cache. On the first I/O request the disk reads the first block and sends this up the bus as requested. It does not stop here though, but proceeds to read the entire track into cache. The application then requests the next MB and this request does not have to touch the disk -it comes straight from cache. This pre-fetch cache strategy gives great burst rates, but relies on idle windows to preload the cache with whatever data the pre-fetch algorithm decides is sensible.
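
To put a number on the half-revolution wait in the cacheless scenario, here is a small Python sketch. It assumes only that the average rotational latency is half of one full revolution; the function name and the drive speeds chosen are illustrative.

```python
def average_rotational_latency_ms(rpm):
    """Average wait for the start sector: half of one full revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

for rpm in (7200, 10_000, 15_000):
    print(f"{rpm} RPM: {average_rotational_latency_ms(rpm):.2f} ms")
```

A 7200 RPM drive therefore waits a little over 4 ms on average, which is exactly the delay a cache hit avoids.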

Tough going isn't it?

 

 

 

Ian Atkin, IT Services, Oxford University, UK

Sebastien77's picture

Hi again,

No, I was not saying that the write cache solves everything but you are right to point out that write IOPS are random by definition. I was just trying to say that in some practical situations the random write IOPS might just not be the right figure to consider to estimate the expected write performance and that this performance will also depend on the state of the disk (empty or crowded with many files and few sparse empty locations). I fully agree that in the situation of many random writes of small files the write cache will not help.

Cheers,

Seb

ianatkin's picture

Hi Seb,

Whether the random write IOPS figure is the best performance benchmark or not will simply depend on the application. Remember also that the quoted IOPS figure is an average. It assumes the whole drive is available.

So, yes -in some practical situations this does depend on the state of the disk. If the disk is new the writes will all be at the beginning of the disk and your head seek travel will be small. This will effectively inflate your observed IOPS figure for the drive (think of it as a disk lifetime partial stroking). As the number of files grows though and the disk becomes more highly utilised, your IOPS will indeed converge to the theoretical IOPS figure.

Hence the argument for keeping system boot files at the start of a disk, in a separate partition. This strategy gives you the benefit of the high sequential read (as you're on the outer rim of the platter) as well as an inflated IOPS for boot time operations.

Are we on the same wavelength now?

Ian Atkin, IT Services, Oxford University, UK

Sebastien77's picture

Hi!

Yes, thank you for the explanations!

For the sake of completeness what would be the expected bandwidths (sequential read and sequential write) in RAID5 configuration (assuming the disks are empty and that we know the read and write bandwidths of a single disk)?

In this specific case the previous data and parity do not need to be read before writing them back, as whole stripes could be written anew. As such I would expect read and write bandwidths to be quite close and around (n-1)*BW where BW is the bandwidth of a single disk and n is the number of disks in the array. However in practice it turns out that there is a large difference and that writing is slower... where does this difference come from?

Seb

ianatkin's picture

Hi Seb,

Hmm...The controller is simply performing an algorithm at the block level of the storage. The filesystem level  (where you are targeting your argument) exists a level above that and the RAID controller simply does not get involved at that level.

In short, the RAID Controller does not know that the block data is all zero before the write (unless it's previously cached this block). It must read it first to know the value to perform the write and parity operation.

You are also on uncertain ground trying to benchmark a system using some initial writes at the beginning of the disk. The controller cache will certainly come into play at this level in a major way and will delay the write commit for a more opportune moment.

One rule when benchmarking is to always try to avoid scenarios where the cache comes into play, as you cannot easily extrapolate such results to yield your ultimate array performance. 

Ian Atkin, IT Services, Oxford University, UK

Sebastien77's picture

Yes it gets very difficult to predict the performance when the caches come into play. One thing I noticed in practice is that whereas the read and write bandwidths are usually quite close for a single disk system the write bandwidth often lags quite behind for RAID0, any clue?

Seb

ianatkin's picture

No clue -that's interesting. What RAID controllers are you analysing?

Ian Atkin, IT Services, Oxford University, UK

Sebastien77's picture

I unfortunately did not perform these tests myself... I just noticed it from reading several reviews of different NAS (QNAP and Synology). I do not really understand why they get this difference and why they are not questioning it. In terms of expected sequential read and write bandwidths for RAID (0, 1 and 5) of n disks (as compared to the bandwidths of a single one of these disks): how would you do the maths (I did it but I would like your opinion)? I read different discussions on the topic but I could not find a clear answer.

ianatkin's picture

This is where I can't help you Seb.

When you are talking about box throughputs and benchmark tests the limitations aren't the ones I'm talking about in this article. Here I'm talking about IOPS, which focuses purely on the limitations due to harddisk mechanics. This is the bottleneck when dealing with tiny random I/O.

When talking about bandwidths through these NAS devices the mechanical disk attributes are no longer the bottleneck. At this end of the scale, the box's bus speeds, processors and architecture come into play a lot more, i.e. the bits that implement the RAID.

A long time ago there was an article written about not putting fast drives into your NAS boxes: the host processor couldn't take advantage of the drive speeds because it was already fully utilised managing the RAID.

There are other limitations too with NAS. Let's take a look at some speeds...

  • Modern 7200 RPM disk = 125MB/s sustained transfer rate
  • 1Gb/s LAN = 125MB/s
  • 3Gb/s SATA interface = 300MB/s

So three modern disks will more than saturate the bus at peak (even assuming the RAID could cope). And of course you'll ultimately be limited by your 1Gb/s LAN.
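As a rough sketch of that arithmetic, the snippet below derives the interface figures from their line rates and reports which stage caps the throughput. It treats the SATA link as a single shared stage, ignores Ethernet protocol overhead and assumes 8b/10b encoding on the 3Gb/s SATA link. The function names are illustrative.

```python
def mb_per_s(line_rate_gbit, payload_fraction=1.0):
    """Convert a link rate in Gb/s into MB/s of payload (decimal units)."""
    return line_rate_gbit * 1000 / 8 * payload_fraction


DISK_MB_S = 125.0                    # sustained rate of one 7200 RPM disk
LAN_MB_S = mb_per_s(1.0)             # 1Gb/s Ethernet, overhead ignored
SATA_MB_S = mb_per_s(3.0, 8 / 10)    # 3Gb/s SATA, 8b/10b encoded


def bottleneck(n_disks):
    """Return (stage name, MB/s) of the slowest stage in the chain."""
    stages = {
        "disks": n_disks * DISK_MB_S,
        "sata": SATA_MB_S,
        "lan": LAN_MB_S,
    }
    name = min(stages, key=stages.get)
    return name, stages[name]


print(bottleneck(3))   # three disks outrun the SATA link, but the LAN caps it
```

With three disks the aggregate disk rate (375MB/s) exceeds the SATA figure, yet the 1Gb/s LAN is still the tightest stage.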

The RAID bandwidth problems across NAS boxes mean that putting the same disks into different NAS devices will likely yield huge differences in throughput.

My gut feeling is that the disks' sustained rates aren't the problem -it's how much resource/cash the vendor is willing to throw at the RAID for each NAS price point in the market.

 

Ian Atkin, IT Services, Oxford University, UK

Sebastien77's picture

Hi,

Don't worry I got the bandwidth ("number of disk platters + RPM" limiting regime) vs IOPS ("seek+access time" limiting regime) well hammered in my head :) and yes it is the first case I was interested in.

The NAS boxes in question were tested over 10 GbE links, but in any case the link limitation should not really come into play when only comparing read vs write bandwidths. I agree that for an array of 8 disks (7200 RPM) the NAS computing power might be the bottleneck (only for RAID 5 writes though), but I wanted to derive the theoretical maximum read-to-write bandwidth ratio, assuming that the bottleneck is the disk RPM (and making all other necessary simplifications), for RAID 0, 1 and 5.

Cheers,

Sébastien

 

JeanLoupDaix's picture

To ianatkin ...and everybody else interested.

I am using a sizing tool that, for a given IOPS level, read/write ratio, read cache hit %, write cache hit % and "Write Gather Cache %", calculates the number of RAID groups needed based on HDD type and RAID type. I cannot find a simple explanation of WGC except at http://jes.ece.wisc.edu/papers/isca07_nesbit.pdf, which shows a middle-of-the-road Write Gather Cache figure of 80%. Has anybody found or written a tool, or got a trick, to deduce from other collected metrics what WGC is likely to be (a ballpark figure or a range)?
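For readers unfamiliar with this kind of tool, the following is a minimal sketch of the sort of calculation it might perform. It is not the formula of the tool mentioned above. It assumes cache hits never reach the disks, that each remaining write costs a fixed RAID write penalty in back-end I/Os, and it leaves the Write Gather Cache out entirely because its definition is not settled in this thread. All parameter names and example values are hypothetical.

```python
import math


def raid_groups_needed(front_end_iops, read_fraction, read_hit, write_hit,
                       write_penalty, disks_per_group, iops_per_disk):
    """Estimate how many RAID groups a workload needs (simplified model)."""
    reads = front_end_iops * read_fraction
    writes = front_end_iops - reads
    # Only cache misses reach the disks; each write miss costs `write_penalty` I/Os.
    back_end_iops = reads * (1 - read_hit) + writes * (1 - write_hit) * write_penalty
    return math.ceil(back_end_iops / (disks_per_group * iops_per_disk))


# Hypothetical workload: 1000 IOPS, 70% reads, 20% read hits, 10% write hits,
# write penalty of 4, groups of 8 disks at 150 IOPS each.
print(raid_groups_needed(1000, 0.7, 0.2, 0.1, 4, 8, 150))
```

Treating a write cache hit as an I/O that never reaches disk is optimistic, since cached writes are normally committed eventually. A real tool may well model that differently.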

 

Thank you. JL

ianatkin's picture

Hi Jean,

This is the first I've even heard of the Write Gather Cache! A quick Google reveals it's an Oracle Database performance enhancer -it allows the DB to buffer up to 4MB to increase the performance of large object data writes.

From an IOPS point of view, you'd probably need to know what percentage of write data was likely to fall into the LOB (large object) category before you can deduce the performance enhancement this would provide.

I'll take a look at the PDF link on my commute tomorrow and see how they've done this. My revised IOPS article (see link at the top) has an added section where I talk about IOPS more from an application point of view (more specifically the read/write profile). That approach will help give you the hardware limit, but does not of course take into account the various caches that you are interested in.

Kind Regards,
Ian./

 

 

Ian Atkin, IT Services, Oxford University, UK

ianatkin's picture

Hi Jean,

Sorry for the lateness of response here -I did look at this PDF on "Virtual Private Caches", but that didn't really help clarify anything on the Write Gather Cache for me. If you want to talk more about this, it might be best to drop me a PM.

Kind Regards,
Ian./

 

Ian Atkin, IT Services, Oxford University, UK

Bzia's picture

Love this one, well written.

AttilaKovacs's picture

Ian,

It's a really great topic, but I have some more questions after reading this.
In your sample you have a SQL server with 1+238+593=832 IOPS required at the same time.
Is it normal to plan your system for 500 IOPS? In my opinion, if you build an array from 8 disks in RAID 10, as you describe later, to achieve 500 IOPS, then in peak hours it will definitely be the bottleneck of the system.

From another angle: if I have 8 disks, is it better to build two arrays of 4 disks in RAID 10 (~250 IOPS each) and split the I/O between them ('separate spindle for LOG files'), or to build one RAID 10 array (~500 IOPS) and pray that the I/O loads don't pile up on each other?
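The peak-load arithmetic behind this question can be sketched as follows. It uses only the figures quoted in the comment and compares peak demand with nominal capacity, nothing more. It says nothing about access-pattern isolation, which is the usual argument for separate log spindles. The function names are illustrative.

```python
from itertools import product

DEMANDS = [1, 238, 593]   # IOPS figures quoted in the comment


def shortfall(demand, capacity):
    """IOPS that cannot be served when demand exceeds capacity."""
    return max(0, demand - capacity)


def best_split_shortfall(demands, capacities):
    """Smallest total shortfall over every assignment of streams to arrays."""
    best = None
    for assignment in product(range(len(capacities)), repeat=len(demands)):
        load = [0] * len(capacities)
        for stream, array in zip(demands, assignment):
            load[array] += stream
        total = sum(shortfall(l, c) for l, c in zip(load, capacities))
        best = total if best is None else min(best, total)
    return best


single = shortfall(sum(DEMANDS), 500)               # one 8-disk array, ~500 IOPS
split = best_split_shortfall(DEMANDS, [250, 250])   # two 4-disk arrays, ~250 each
print(single, split)
```

With these figures the single array falls 332 IOPS short at a simultaneous peak, and the best possible split falls 343 short, so on this measure neither layout covers the combined peak.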

Best Regards

Attila

ianatkin's picture

Hi Attila,

It is not 'normal' at all to plan your system for 500 IOPS. The values quoted here are for a specific environment size in a specific configuration which Symantec have provided for illustration and comparison purposes.

You really need to think about your application requirements first and foremost. This will allow you to identify where your bottlenecks are likely to be. For example, from an Altiris point of view, I often recommend that SMB admins configure their databases for 'simple' recovery. This prevents transaction log file issues and allows us to focus a bit more on the performance of the actual DB access, which starts us on the path of optimising random small-write performance, where I'd generally go for RAID-10.

There is no quick fix figure for IOPS; you'll need to look at how you intend to configure the application and from there see what repercussions that has for storage.

Kind Regards,
Ian./

 

Ian Atkin, IT Services, Oxford University, UK
