Sunday, July 7, 2013

What exactly are O_DIRECT, O_SYNC Flags, Buffers & Cached in Linux Storage I/O?

Feels good to post after a long time. I always hear HPC systems people flapping their mouths about Direct I/O in the context of I/O performance measurements on distributed file systems like Lustre or CXFS, so I thought, let's dig into this. You might have seen the dd command used on my blog with the oflag parameter, but for those who don't know it, I have briefly revised it here.
dd if=/dev/zero of=/tmp/testfile bs=2M count=250 oflag=direct
oflag here tells dd to open the output file with the given flags, i.e. direct. What exactly direct means we will see a little later in the post.
dd if=/tmp/testfile of=/dev/null bs=2M iflag=direct
Similar to the write operation, the read operation takes iflag as a parameter with the same flags; the one of particular interest to us is direct.
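
If you want to try the comparison yourself, a minimal sketch along these lines runs the same write twice, once buffered and once with oflag=direct. The helper name write_test and the 8 MB size are my own choices for illustration, and it assumes GNU dd. Direct I/O is not supported on every filesystem (tmpfs, for example, may reject it), so the sketch simply prints dd's last line, which is either the throughput summary or the error message.

```shell
#!/bin/sh
# write_test FILE [FLAG]: write 8 MB of zeros to FILE, optionally with a dd oflag.
# Hypothetical helper for illustration; prints the last line dd reports
# (the throughput summary, or the error if the flag is not supported).
write_test() {
    file=$1
    flag=$2
    if [ -n "$flag" ]; then
        dd if=/dev/zero of="$file" bs=2M count=4 oflag="$flag" 2>&1 | tail -n 1
    else
        dd if=/dev/zero of="$file" bs=2M count=4 2>&1 | tail -n 1
    fi
}

# Run both variants in a scratch directory and clean up afterwards.
tmpdir=$(mktemp -d)
echo "buffered: $(write_test "$tmpdir/buffered.img")"
echo "direct:   $(write_test "$tmpdir/direct.img" direct)"
rm -rf "$tmpdir"
```

With such a small file the numbers are only indicative; use a size closer to the 500 MB of the commands above for a meaningful comparison.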

If you fire these commands with and without oflag/iflag, you will notice a significant difference in the I/O performance statistics reported by dd. This is basically the effect of the caches employed by modern storage systems and the Linux kernel.
These caches can be multilevel, going from the operating system's buffer cache to the storage controller cache to the hard drive cache, so cache effects will show up differently depending on the underlying system architecture. Filesystem software also plays a huge role in caching behavior. A traditional distributed filesystem might leverage multiple caches on multiple LUNs distributed across multiple storage controllers. An object-based filesystem such as Lustre will have multiple OSSs (Object Storage Servers), each of which leverages its own independent OS buffer cache to enhance performance. I am going to do a separate detailed post shortly about the Lustre performance impact of the OSS cache. My point is that cache-affected benchmarks of one HPC system as a whole are not comparable to those of another system unless all the granular details are known and acting in the same direction. Cache effects cannot be completely removed in today's complex systems; we can only tell the underlying components not to use their cache, if they are configured to accept such requests.

When you open a disk file with none of the flags mentioned below, a call to write() on that file returns as soon as the data is copied into a buffer in kernel address space, and read() is served from that buffer whenever the data is already there. The actual disk operation happens later, at a time chosen by the operating system. The buffer usually defaults to 2.5% of physical memory, but this is subject to change between Linux kernel trees. We will also see the difference between the "buffers" and "cached" columns of the free command in a later section of this post.

Linux tends to cache as many I/O requests as possible in memory, so that subsequent reads can be served right from main memory and writes are performed on main memory, giving the application process the illusion of higher read/write performance. The data in main memory is committed to disk a little later; you can call it a delayed write. Meanwhile the process is informed that the I/O is complete and it can continue its execution. Linux does all the dirty work of pushing bits to the hard drives in the background. Memory pressure can also cause Linux to shrink this main-memory buffer.

However, sometimes you really need to make sure that the data you read is actually coming from disk. This might be the case where distributed/clustered file systems have to share the same data across cluster nodes instantaneously, because the data on disk will be stale if caches are not committed properly. Direct I/O is also useful for performance testing of hard drives. Direct I/O lets a process ask for its I/O to be performed without going through the OS cache layer. This makes I/O slower, as the actual spindle read/write has to happen and the Linux kernel's caching optimizations are bypassed.

Notice that with the O_DIRECT flag we are telling the open(2) call to eliminate or minimize the cache layer. This doesn't mean that the I/O is written to the cache and the cache contents are then immediately written to disk; Direct I/O performs the I/O directly on the disk. The O_SYNC flag, by contrast, is used to commit cache contents to disk at the user's request, and we will discuss it in the next section. An O_DIRECT read() or write() is synchronous: control does not return until the disk operation is complete, although on its own O_DIRECT does not give the data-plus-metadata guarantee of O_SYNC.

Direct I/O is sometimes advised for data that will not be accessed again in the near future, because it avoids the double copy of data from the application buffer to the kernel buffer and then to disk. This leaves the kernel buffer free for caching more frequently accessed data. Hence, if you are in control of the complete system and the software layer is developed in-house, Direct I/O is a good choice.

O_SYNC Flag :
When this flag is used, control returns only once the I/O has actually been performed on the disk, by way of the buffer cache, so changes are made persistent on disk to prevent losses from an unexpected crash. This flag is generally used for checkpointing, when a process needs to record that it has processed a certain amount of data. Notice that there is a very subtle difference between O_DIRECT and O_SYNC: with O_SYNC the data flows through the buffer cache to disk, while with O_DIRECT the kernel cache layer is eliminated or minimized. The O_SYNC flag commits all data including metadata to disk, while O_DSYNC commits the data plus only the metadata needed to read it back. The bdflush/pdflush kernel daemons are responsible for background data commits. Basically, the O_SYNC flag commits the data and removes the dirty flag from the "cached" data, so under memory pressure the kernel can easily drop SYNC'ed data, freeing RAM for other "cached" data.
The fsync() call, or the sync command, is used in Linux to write any remaining uncommitted data to disk; it does not return until the operation is done. The usual way to make sure that data is always committed to disk is to do Direct I/O followed by fsync(): DMA is used to transfer the data to disk, and fsync() takes care of committing all metadata. We will verify this in a little experiment below. There is a catch: all modern hard drives implement a small on-device cache, so O_SYNC'ed data does not always reach the disk platters right away. The on-device cache can commit the write later on, although failure of this cache is very rare.
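
These flags can be tried from the shell too. The sketch below assumes GNU dd, which accepts oflag=dsync and oflag=sync (opening the output with O_DSYNC or O_SYNC) and conv=fsync (one fsync() on the output before dd exits). The helper name durable_write and the 4 MB size are made up for this example.

```shell
#!/bin/sh
# durable_write FILE MODE: write 4 MB of zeros to FILE and print dd's summary.
# MODE is one of:
#   dsync  open with O_DSYNC, every write() waits for the data
#   sync   open with O_SYNC, every write() waits for data and metadata
#   fsync  buffered writes, then a single fsync() at the end
# Hypothetical helper for illustration only.
durable_write() {
    case $2 in
        dsync|sync) dd if=/dev/zero of="$1" bs=1M count=4 oflag="$2" 2>&1 | tail -n 1 ;;
        fsync)      dd if=/dev/zero of="$1" bs=1M count=4 conv=fsync 2>&1 | tail -n 1 ;;
        *)          echo "unknown mode: $2" >&2; return 1 ;;
    esac
}

# Compare the three modes on a scratch file.
scratch=$(mktemp)
for mode in dsync sync fsync; do
    echo "$mode: $(durable_write "$scratch" "$mode")"
done
rm -f "$scratch"
```

The per-write modes usually report lower throughput than conv=fsync, since they wait on the device for every block instead of once at the end.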

A very good article to read more about O_SYNC & hard drive cache behavior: HDD, FS, O_SYNC : Throughput vs. Integrity

LWN article: Ensuring data reaches disk

Now let's play with these concepts in Linux. I have fired up a CentOS Minimal virtual machine in VirtualBox just to test this, with 1024MB of memory. How much memory you have in the system determines how much memory is going to be used for dirty pages/buffers/caches. I have kept all the "vm" kernel parameters at their defaults for the first test, to show you that when you write an amount of data almost equal to the size of main memory, the kernel daemon pdflush is forced to write the data back to disk. This reduces the sync time at the cost of slightly lower write performance, as the application process is writing to the kernel cache while the pdflush daemon is committing it to disk. I am not going to explain the dirty-related parameters in detail here, as that would lose the focus of this post and you can easily Google them.

Default dirty parameters in CentOS 6.4
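
To look at these values on your own machine, something like the following sketch should do. It only reads files under /proc/sys/vm, so no root is needed. The function name show_dirty_params is made up here, and vm.highmem_is_dirtyable exists only on kernels built with highmem support, so the sketch reports missing entries instead of failing.

```shell
#!/bin/sh
# Print the writeback-related vm tunables (hypothetical helper for illustration).
show_dirty_params() {
    for p in dirty_background_ratio dirty_ratio \
             dirty_expire_centisecs dirty_writeback_centisecs \
             highmem_is_dirtyable; do
        f=/proc/sys/vm/$p
        if [ -r "$f" ]; then
            printf 'vm.%s = %s\n' "$p" "$(cat "$f")"
        else
            printf 'vm.%s is not available on this kernel\n' "$p"
        fi
    done
}

show_dirty_params
```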

Behavior of I/O ops large enough considering dirty ratios 

The second snapshot reveals that for a write operation of 524MB the sync operation didn't take much time, as the dirty pages had already been committed by pdflush, so that the cache can be flushed immediately if memory pressure increases or a new process demands memory. Note that even though the dirty pages of the write operation were committed to disk, pdflush didn't empty the cache; the benefit is that a subsequent read of that file will be faster. In the same manner, the drop_caches operation also takes very little time, as the data is already committed by pdflush. Before we go further, let's have a quick look at the drop_caches types.

echo 1 > /proc/sys/vm/drop_caches
This frees Pagecache from memory.
echo 2 > /proc/sys/vm/drop_caches
This frees Dentries (Directory Lists) and Inodes.
echo 3 > /proc/sys/vm/drop_caches
This frees Pagecache, Dentries & Inodes.

These operations are non-destructive, i.e. if an object is dirty, drop_caches will not free it, and it will not write it back either. If a sync is performed first, drop_caches can free the inactive memory immediately. I suggest monitoring all of this in /proc/meminfo. The Dirty and Writeback fields in meminfo show the current dirty pages and the writeback operations queued up. Writeback operations are very transient: you will see them only when a sync has queued dirty pages for commit but hasn't written them to disk yet.
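
A small sketch for pulling those two fields out of /proc/meminfo is given below. The helper name meminfo_kb is my own; the optional second argument exists only so the parsing can be tried against a saved copy of the file.

```shell
#!/bin/sh
# meminfo_kb FIELD [FILE]: print the value in kB of one /proc/meminfo field.
# Hypothetical helper; FILE defaults to /proc/meminfo.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

echo "Dirty:     $(meminfo_kb Dirty) kB"
echo "Writeback: $(meminfo_kb Writeback) kB"
```

Run it in a loop while a large dd is in flight and you should see Dirty grow and then fall again once the data is committed.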

Remember that writing to drop_caches is a one-time command to the kernel. As soon as you drop the caches, the kernel carries on caching new requests irrespective of the value in drop_caches. The drop_caches file simply holds the value that was written last; it has no ongoing effect. So caching is the kernel's permanent behavior unless you use the O_DIRECT flag.

Difference between buffers & cached in free command :
From the 2.6.x kernel series on, buffers are used to cache file system metadata, i.e. buffers remember dentries and file permissions and keep track of which memory region is being written to or read from for a particular block device. Cached only contains the contents of files, whether to be written or read. Another way to put it: cached only holds the page cache from which files are eventually read. Just to clarify, the "-/+ buffers/cache" line in free output shows how much used and free memory there is from an application's point of view. It doesn't count buffers/cached memory as used, since that memory will be freed by the flushing daemons when required.
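
The arithmetic behind that "-/+ buffers/cache" line can be sketched from /proc/meminfo as follows. This assumes the calculation used by the older free shown here (used minus buffers minus cached, and free plus buffers plus cached); the function name buffers_cache_line and the output wording are invented for this example.

```shell
#!/bin/sh
# buffers_cache_line [FILE]: recompute the "-/+ buffers/cache" figures in kB.
# Hypothetical helper; FILE defaults to /proc/meminfo.
buffers_cache_line() {
    awk '
        $1 == "MemTotal:" { total = $2 }
        $1 == "MemFree:"  { free = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached = $2 }
        END {
            used = total - free
            # Memory that applications really hold, and what they could still get.
            printf "used by applications: %d kB\n", used - buffers - cached
            printf "free for applications: %d kB\n", free + buffers + cached
        }
    ' "${1:-/proc/meminfo}"
}

buffers_cache_line
```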

I ran a little experiment to demonstrate how buffers cache metadata, by stressing the directory structure with the find command. I drew my conclusions from the output of /proc/meminfo. A small shell script to create a massive directory structure is given below. It was executed 4 times to create a humongous number of directories, with a small file touched inside each innermost directory.
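#!/bin/bash
# Create nested directories (800 x 50 x 10) with an empty file in each innermost one.
for i in {1..800}; do
        mkdir "$i"
        cd "$i" || exit 1
        for j in {1..50}; do
                mkdir "$j"
                cd "$j" || exit 1
                for k in {1..10}; do
                        mkdir "$k"
                        cd "$k" || exit 1
                        touch test.file
                        cd ..
                done
                cd ..
        done
        cd ..
done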

Buffers impact on performance & preservation of buffers.
Note - The dirty ratio parameters don't affect buffers, as they hold purely metadata; hence the kernel flushing daemons will not touch them unless there is memory pressure. I have just highlighted the new dirty ratio parameters for my next experiment.

As you can see from the above snapshot, buffers greatly improve performance. For 108989 nested directories, 478MB of cached metadata is present in buffers. This results in a much shorter traversal time on the next run of the find command: notice the 1.5 seconds versus 1 minute 41 seconds. Buffers are kept as long as possible unless explicitly flushed by drop_caches or reclaimed under memory pressure. Also notice that this metadata didn't go to cached, which is only showing around 7 MB (usually kernel thread caches).
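
The measurement itself can be reproduced with a sketch like the one below. The function name timed_count is hypothetical, and it uses date +%s for whole-second timing so that it does not depend on a time builtin.

```shell
#!/bin/sh
# timed_count ROOT: count the directories under ROOT (ROOT included) and
# report how many seconds the traversal took. Hypothetical helper.
timed_count() {
    start=$(date +%s)
    n=$(find "$1" -type d | wc -l | tr -d ' ')
    end=$(date +%s)
    echo "$n directories in $((end - start)) s"
}

# Run from the top of the test tree, twice in a row.
timed_count .   # first run: metadata has to come from disk
timed_count .   # second run: served from memory if it is still cached
```

To repeat the cold run, the metadata has to be evicted first, for example with the drop_caches values described earlier (as root).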

Now let's move to the next section. The changes I made to the dirty parameters are given below. I needed them specifically to demonstrate the dirty cache behavior, as the total memory in my virtual machine is only 1024 MB: to keep more dirty data in cache I had to increase these parameters. Remember that on production servers such values have to be chosen very carefully, depending on the type of workload and the amount of memory present. The values here are for experiment only and shouldn't be played with, particularly the centisecs-related parameters, which are best left untouched.

Parameters tampered & explained :
  1. vm.dirty_background_ratio - New value - 95% : The percentage of memory filled with dirty pages at which pdflush starts to flush dirty pages to disk.
  2. vm.dirty_ratio - New value - 95% : The percentage of memory filled with dirty pages at which the writing process itself has to write dirty data back to disk. With this value, a process will not be forced to write back until 95% of the file system cache is dirty.
  3. vm.dirty_writeback_centisecs - New value - 7000 : Time interval between pdflush daemon wake-ups, in hundredths of a second. Setting this to zero disables periodic writeback completely. With the new value, pdflush wakes up every 70 seconds to flush dirty data.
  4. vm.highmem_is_dirtyable - New value - 1 : The ratios above are calculated against lowmem, which comes to a small value with 1024MB of RAM, so I have marked highmem as dirtyable so that more memory can be kept dirty.
    1. Memory Zones on my 32 bit x86 virtual machine -
      • DMA ~  16380KB
      • Normal ~ 890872KB
      • HighMem ~ 141256KB
Memory Zones on x86 system with 1024MB RAM
Note - The values above can be obtained by multiplying the spanned number of pages of each zone by the kernel PAGESIZE.
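
As a rough sketch of that multiplication, the following reads the spanned page count of each zone from /proc/zoneinfo and converts it to kB. The function name zone_sizes and its optional arguments are my own additions; the page size defaults to whatever getconf PAGESIZE reports.

```shell
#!/bin/sh
# zone_sizes [FILE] [PAGESIZE]: print "<zone> <size> kB" for every memory zone,
# computed as spanned pages * page size. Hypothetical helper for illustration.
zone_sizes() {
    awk -v ps="${2:-$(getconf PAGESIZE)}" '
        $1 == "Node" && $3 == "zone" { zone = $4 }
        $1 == "spanned"              { printf "%s %d kB\n", zone, $2 * ps / 1024 }
    ' "${1:-/proc/zoneinfo}"
}

zone_sizes
```

For example, a DMA zone spanning 4095 pages of 4096 bytes comes out at 16380 kB, which matches the figure in the list above.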

non-destructiveness of drop_caches
Even though drop_caches is executed, the dirty pages are not dropped, as pdflush hasn't flushed the data yet.

On-Demand Sync afterwards
Following on from the previous operation, if we execute the sync command, which commits dirty data on demand, drop_caches can then remove the cache immediately.

Hope you have enjoyed the article. Let me know if you guys have any thoughts on this.


  1. Thanks for an informative article. I have a question though. Why is it necessary to sync when using O_DIRECT, as opposed to dropping caches?

    Shouldn't O_DIRECT essentially invalidate the cache (data may be old)? Sync seems odd. I assume the data I wrote isn't in cache, so why flush it to disk?

  2. Hi. First of all many thanks for writing a detailed post.
    I have some Lustre specific queries. My code runs faster on my workstation (with an ext4 filesystem) than on the compute cluster (with Lustre). I/O seems to be the bottleneck in my case. I gained some ~10% speedup using the "iobuf" library. Curious to know whether "iflag=direct" can help the read() part of the code execute faster compared to using the "iobuf" library?

    Secondly, will increasing the size of the data read during each RPC call help?

    Thanks in advance.