Thursday, 9 June 2011

SQL Server I/O Bottlenecks ?


I/O Bottlenecks

SQL Server performance depends heavily on the I/O subsystem. Unless your database fits into physical memory, SQL Server constantly brings database pages in and out of the buffer pool. This generates substantial I/O traffic. Similarly, the log records need to be flushed to the disk before a transaction can be declared committed. And finally, SQL Server uses tempdb for various purposes such as to store intermediate results, to sort, to keep row versions and so on. So a good I/O subsystem is critical to the performance of SQL Server.
Access to log files is sequential except when a transaction needs to be rolled back while access to data files, including tempdb, is randomly accessed. So as a general rule, you should have log files on a separate physical disk than data files for better performance. The focus of this paper is not how to configure your I/O devices but to describe ways to identify if you have I/O bottleneck. Once an I/O bottleneck is identified, you may need to reconfigure your I/O subsystem.
If you have a slow I/O subsystem, your users may experience performance problems such as slow response times, and tasks that abort due to timeouts.
You can use the following performance counters to identify I/O bottlenecks. Note, these AVG values tend to be skewed (to the low side) if you have an infrequent collection interval. For example, it is hard to tell the nature of an I/O spike with 60-second snapshots. Also, you should not rely on one counter to determine a bottleneck; look for multiple counters to cross check the validity of your findings.
PhysicalDisk Object: Avg. Disk Queue Length represents the average number of physical read and write requests that were queued on the selected physical disk during the sampling period. If your I/O system is overloaded, more read/write operations will be waiting. If your disk queue length frequently exceeds a value of 2 during peak usage of SQL Server, then you might have an I/O bottleneck.
Avg. Disk Sec/Read is the average time, in seconds, of a read of data from the disk. Any number
Less than 10 ms - very good
Between 10 - 20 ms - okay
Between 20 - 50 ms - slow, needs attention
Greater than 50 ms – Serious I/O bottleneck
Avg. Disk Sec/Write is the average time, in seconds, of a write of data to the disk. Please refer to the guideline in the previous bullet.
Physical Disk: %Disk Time is the percentage of elapsed time that the selected disk drive was busy servicing read or write requests. A general guideline is that if this value is greater than 50 percent, it represents an I/O bottleneck.
Avg. Disk Reads/Sec is the rate of read operations on the disk. You need to make sure that this number is less than 85 percent of the disk capacity. The disk access time increases exponentially beyond 85 percent capacity.
Avg. Disk Writes/Sec is the rate of write operations on the disk. Make sure that this number is less than 85 percent of the disk capacity. The disk access time increases exponentially beyond 85 percent capacity.
When using above counters, you may need to adjust the values for RAID configurations using the following formulas.
Raid 0 -- I/Os per disk = (reads + writes) / number of disks
Raid 1 -- I/Os per disk = [reads + (2 * writes)] / 2
Raid 5 -- I/Os per disk = [reads + (4 * writes)] / number of disks
Raid 10 -- I/Os per disk = [reads + (2 * writes)] / number of disks

For example, you have a RAID-1 system with two physical disks with the following values of the counters.
Disk Reads/sec            80
Disk Writes/sec           70
Avg. Disk Queue Length    5

In that case, you are encountering (80 + (2 * 70))/2 = 110 I/Os per disk and your disk queue length = 5/2 = 2.5 which indicates a border line I/O bottleneck.
You can also identify I/O bottlenecks by examining the latch waits. These latch waits account for the physical I/O waits when a page is accessed for reading or writing and the page is not available in the buffer pool. When the page is not found in the buffer pool, an asynchronous I/O is posted and then the status of the I/O is checked. If I/O has already completed, the worker proceeds normally. Otherwise, it waits on PAGEIOLATCH_EX or PAGEIOLATCH_SH, depending upon the type of request. The following DMV query can be used to find I/O latch wait statistics.
Select  wait_type,
        waiting_tasks_count,
        wait_time_ms
from    sys.dm_os_wait_stats 
where  wait_type like 'PAGEIOLATCH%' 
order by wait_type

wait_type       waiting_tasks_count  wait_time_ms   signal_wait_time_ms
-----------------------------------------------------------------------
PAGEIOLATCH_DT  0                    0                    0
PAGEIOLATCH_EX  1230                 791                  11
PAGEIOLATCH_KP  0                    0                    0
PAGEIOLATCH_NL  0                    0                    0
PAGEIOLATCH_SH  13756                7241                 180
PAGEIOLATCH_UP  80                   66                   0

Here the latch waits of interest are the underlined ones. When the I/O completes, the worker is placed in the runnable queue. The time between I/O completions until the time the worker is actually scheduled is accounted under the signal_wait_time_ms column. You can identify an I/O problem if your waiting_task_counts and wait_time_ms deviate significantly from what you see normally. For this, it is important to get a baseline of performance counters and key DMV query outputs when SQL Server is running smoothly. These wait_types can indicate whether your I/O subsystem is experiencing a bottleneck, but they do not provide any visibility on the physical disk(s) that are experiencing the problem.
You can use the following DMV query to find currently pending I/O requests. You can execute this query periodically to check the health of I/O subsystem and to isolate physical disk(s) that are involved in the I/O bottlenecks.
select
    database_id,
    file_id,
    io_stall,
    io_pending_ms_ticks,
    scheduler_address
from    sys.dm_io_virtual_file_stats(NULL, NULL)t1,
        sys.dm_io_pending_io_requests as t2
where  t1.file_handle = t2.io_handle

A sample output is as follows. It shows that on a given database, there are three pending I/Os at this moment. You can use the database_id and file_id to find the physical disk the files are mapped to. The io_pending_ms_ticks represent the total time individual I/Os are waiting in the pending queue.

Database_id   File_Id io_stall io_pending_ms_ticks  scheduler_address
----------------------------------------------------------------------
6                  1                  10804           78                          0x0227A040
6                  1                  10804           78                          0x0227A040
6                  2                  101451          31                          0x02720040



Resolution
When you have identified an I/O bottleneck, you can address it by doing one or more of the following:
Check the memory configuration of SQL Server. If SQL Server has been configured with insufficient memory, it will incur more I/O overhead. You can examine following counters to identify memory pressure
Buffer Cache hit ratio
Page Life Expectancy
Checkpoint pages/sec
Lazywrites/sec

Increase I/O bandwidth.
Add more physical drives to the current disk arrays and/or replace your current disks with faster drives. This helps to boost both read and write access times. But don't add more drives to the array than your I/O controller can support.
Add faster or additional I/O controllers. Consider adding more cache (if possible) to your current controllers.
Examine execution plans and see which plans lead to more I/O being consume. It is possible that a better plan (for example, index) can minimize I/O. If there are missing indexes, you may want to run Database Engine Tuning Advisor to find missing indexes

The following DMV query can be used to find which batches/requests are generating the most I/O. You will notice that we are not accounting for physical writes. This is ok if you consider how databases work. The DML/DDL statements within a request do not directly write data pages to disk. Instead, the physical writes of pages to disks is triggered by statements only by committing transactions. Usually physical writes are done by either by Checkpoint or by the SQL Server lazy writer. A DMV query like the following can be used to find the top five requests that generate the most I/Os. Tuning those queries so that they perform fewer logical reads can relieve pressure on the buffer pool. This allows other requests to find the necessary data in the buffer pool in repeated executions (instead of performing physical I/O). Hence, overall system performance is improved.

No comments:

Post a Comment