File System and disk speed issues



Warning! This page needs to be rewritten. Some of the procedures described here should be used only by experienced users who know exactly what effect they will have on the computer.

Several areas can be looked at to troubleshoot or optimize file I/O performance. The order was initially "what came to mind", but I then tried to organize it by complexity and the likelihood of getting useful information or results.

1) You might want to test different file systems -- "xfs" is usually faster for large file reads/writes (but sucks at "rm"-ing large numbers of files), while reiserfs is supposed to be better at handling large numbers of small files, but I haven't seen any recent statistics; a rough comparison is sketched below. 1a) Note that file system parameters and defaults may have changed between releases if you are using newly formatted partitions.
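For a rough comparison you can format a scratch partition with each candidate file system and time the same write on it. This is only a sketch; /dev/sdb1 and /mnt/test are placeholder names, and mkfs destroys everything on the partition:

# WARNING: destroys all data on /dev/sdb1 (placeholder device)
mkfs.xfs -f /dev/sdb1
mount /dev/sdb1 /mnt/test
sync; time bash -c "dd if=/dev/zero of=/mnt/test/bigfile bs=8k count=500000; sync"
umount /mnt/test
# repeat with mkfs.reiserfs /dev/sdb1 (or another file system) and compare the times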

2) To compare raw device READ performance, try using "hdparm" (in the package of the same name). I don't know if hdparm will work on pseudo devices like "raid disks", but in the simple case it does a good job of benchmarking maximum read speed independent of the file system.

Example:

hdparm -t /dev/sda

(output looks like:

/dev/sda:
Timing buffered disk reads:  60 MB in  3.00 seconds = 20.00 MB/sec)

For a single partition you can use:

hdparm -t /dev/sda1

With IDE disks, hdparm will tell you what transfer mode is being used with your disks. With SCSI drives, the drivers show the maximum transfer speed (bus transfer rate) during the boot process. Theoretically, that shouldn't have changed, but it is an unlikely (though possible) source of performance differences.
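To see the transfer mode an IDE disk is currently using, for example (the device name is just an illustration):

hdparm -i /dev/hda

Look at the "DMA modes" / "UDMA modes" lines; the active mode is marked with a "*".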

3) In your "write" tests, you should ensure that the "writes" are completed before they return -- i.e. use a "sync" command after the "dd" to ensure that the contents are actually written to disk, e.g.:

"sync;time bash -c "(dd if=/dev/zero of=bf bs=8k count=500000; sync)"

The 1st sync makes sure any previous writes are flushed before "dd" starts; the 2nd makes sure that the blocks written by "dd" are actually on disk before the time is measured.


4) Related to disk writes finishing, you might look at mount options. The biggie is the "async" option. I seem to remember that the default options used to mount disks as "async" devices, allowing buffered I/O and I/O optimizations. At some point the default may have changed to "sync" out of concern about data loss in the default configuration -- it was felt to be safer, by default, to write data "synchronously", i.e. to finish writing to the disk before returning from the "write" system call. But this really slows down performance and takes away the kernel's ability to temporarily buffer writes in memory and optimize how they go out to disk.

It's usually a difference of "seconds", but the Linux kernel tries to optimize how it accesses the disk to minimize disk-head movement. I.e. when set for asynchronous operation, blocks may not be written to the physical disk in the order the application wrote them; they can be reordered to minimize head movement, allowing faster overall disk performance. Some file systems, like XFS, also contribute to this buffering by "waiting a bit" to see whether an application is writing a big file rather than a small one, so that the file system can allocate a big, contiguous space on disk for it if possible, rather than writing a big file into the first small empty holes on disk -- that can add a considerable amount to file read and write speed.
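To see which options a file system is actually mounted with, and to switch it back to asynchronous writes, something like the following should work (the device and mount point here are only placeholders):

cat /proc/mounts
mount -o remount,async /dev/sda1 /data

To make the change permanent, the "async" (or "sync") keyword goes in the options column of the corresponding /etc/fstab line; see mount(8) and fstab(5).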

5) Ideally you should perform these tests on empty partitions located at the same place on the disk (and on the same drive type). Ideally you also want to reuse the same block positions for your writes on each device.

Blocks and tracks at different locations on the disk will usually show some speed difference because of how many sectors fit on a track: few at the innermost tracks, many at the outermost. I _think_ (I would have to look it up to be sure) that lower-numbered tracks and blocks are toward the outside of the disk, since that is where the disk has the highest number of sectors per track -- modern disks use "zoned" recording with different numbers of sectors per track (because each sector takes up a roughly fixed length, so more of them fit on the outermost tracks than on the innermost).
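One way to see this effect is to time raw reads from different offsets on the same disk (the device name and offsets are only illustrations; pick offsets that fit your disk's size, and drop "iflag=direct" if your dd does not support it, keeping cache effects in mind):

# read 512 MB from near the start of the disk, bypassing the page cache
dd if=/dev/sda of=/dev/null bs=1M count=512 skip=0 iflag=direct
# read 512 MB from much further in (here ~100 GB) and compare the rate dd reports
dd if=/dev/sda of=/dev/null bs=1M count=512 skip=100000 iflag=direct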

6) I don't believe it would be an issue in your RAID-1 configuration, but thought I'd mention it in case you change to another RAID config at some point. When using some RAID configurations (1, 5), some filesystems allow you to specify the "stripe" size being used across the RAID disks. This lets the filesystem take the stripe size into account when laying out space on disk, so that files can be read and written at an optimal speed for that RAID device. XFS is one such file system (see the references to RAID in the mkfs.xfs man page).
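As a sketch only (the device name and geometry below are made up -- match them to your actual RAID layout):

# 64 KB stripe unit across 4 data disks: tell XFS the RAID geometry at mkfs time
mkfs.xfs -d su=64k,sw=4 /dev/md0

The su/sw (or sunit/swidth) values are described under the -d option in mkfs.xfs(8).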

7) Disk health. As I understand RAID 1, it primarily uses one disk as a mirror of the other. Read speed can theoretically be faster, since either disk can be read from, but that is implementation dependent. Write speed will be slowed down to the speed of the slowest disk written to -- i.e. the program (and kernel) has to wait for writes to complete on two separate devices that may not perform a given write at the same speed. That's why it's recommended to use disks of the same model number in RAID arrays, to keep write times as close to each other as possible. However, even if you have the same model of disk in your systems, a problem can develop because each individual drive has its own unique "bad sector remapping" list.

Regardless of the OS used, modern IDE and SCSI drives have the ability to automatically remap bad sectors and bad tracks. This is hidden from the normal OS interface but can be read using the "smart" tools (package name "smartmontools") for IDE disks and the tools in the "scsi" package (scsiinfo, sg_info) for SCSI disks. Some devices also support self-diagnostics and reporting. See man pages for usage.
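For example, with smartmontools (the device name is just an illustration):

# show SMART identity, health, attribute, and error data
smartctl -a /dev/hda
# start the drive's built-in short self-test, then check the result later
smartctl -t short /dev/hda
smartctl -l selftest /dev/hda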

The reason the defect mapping matters for performance testing is that if one or both of the disks has one or more sectors or tracks "remapped", then even though you may be writing to the same "logical" sector or track on both disks, one disk may have remapped that "physical" sector or track to a different physical location. This means that while drive 0 might be writing to logical track (or sector) 300 at physical track (or sector) 300, a second drive might have remapped the physical track or sector for logical track (or sector) 300 to, say, track or sector 499 (where the "spare" sectors and tracks live varies by model). What this means for performance is that while disk 0 can write tracks 299, 300, 301 as one big write, disk 1 has to write physical tracks 299, 499, 301 -- i.e. seek out to track 499 and then back to 301, with each seek costing anywhere from under 1 ms to several ms. This can slow down writes considerably, not only on RAID arrays but on single disks as well.

The remapping is, to some extent, "normal" and increases with disk age. It's possible for drives "new" from the factory to already have a few "defects" (remapped sectors), depending on the standards set by the OEM and the "quality" of the disk being sold. SCSI drives have historically had better manufacturing standards.

Tools like "smartd" (the monitoring daemon in smartmontools) can monitor disk health and tell you when a drive is getting close to exhausting its store of spare tracks/sectors and when a drive is beginning to fail. It supports some SCSI devices as well.
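A minimal monitoring setup might look like this in /etc/smartd.conf (the device and mail address are placeholders):

# monitor all SMART attributes on /dev/hda and mail root if problems are found
/dev/hda -a -m root@localhost

See smartd.conf(5) for the full directive syntax.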

8) Make sure no other processes are running during testing. It could be that in 9.3 you have some new process enabled by default that accesses the disk frequently, or something like "syslogd" could have had its configuration changed (it can be set to "sync" after each write or not; see the syslog.conf(5) manpage). The same goes for several other automatic background processes -- some may have had their options changed from asynchronous to synchronous writes, or a new background process might be accessing the disk more frequently on your newer 9.3 setup. That could throw off disk tests as well. To rule out their interference, you might boot into single-user mode and make sure you have as few "optional" daemon processes running as possible.
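A quick check for competing disk activity before and during a test (standard ps and vmstat commands; the 10-second window is arbitrary):

# list everything that is currently running
ps aux
# watch block I/O once per second for 10 seconds; unexpected activity shows up in the "bi"/"bo" columns
vmstat 1 10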


9) Different kernels can use different disk-scheduling algorithms, and the defaults change over time based on new research and new algorithms. Which algorithms are used, and with what parameters, can be selected by the vendor (SuSE) when building their kernel. The different algorithms also have tweakable settings that can be adjusted at run time through parameters under /proc (or, on 2.6 kernels, /sys), and possibly through YaST using the System panel and either "/etc/sysconfig" or "Powertweak Configuration".

That stuff can get quite complicated, and deciding what is "best" depends on your needs. You can also make your system perform worse, or not at all, if you make random changes with those tools.

But suffice it to say that some block-I/O (disk I/O) scheduling algorithms and parameters are tuned more for interactive desktop usage, whereas others are better suited to machines used as "servers", where interactive desktop performance is not a priority (they may not even run a desktop).

To read about two of the documented algorithms: if you have the Linux kernel source installed in /usr/src/linux, then under that directory, in the subdirectory "Documentation/block/", you can read the files "as-iosched.txt" and "deadline-iosched.txt".
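On a 2.6 kernel you can also see which scheduler a disk is using, and switch it per device (the device name is just an example; this assumes the alternative schedulers were built into your kernel):

cat /sys/block/sda/queue/scheduler
echo deadline > /sys/block/sda/queue/scheduler

The currently active scheduler is shown in square brackets; the change lasts until the next reboot unless you also pass an "elevator=" option at boot time.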

They do assume a fair amount of technical knowledge; however, if one really wants to "tweak" and "customize" their kernel, one might(?!) want to become familiar with such things.


That's about as complete a list as I can come up with off the top of my head -- hope you find the source of your slowdown...

Several of the above sections could be put on separate pages.

(c) 2005 Linda A. Walsh AND (c) 2005 Gnu Open Documentation, with the request that my name appear on the documentation page as the original author.