“Cyberspace. A consensual hallucination experienced daily by billions of legitimate operators, in every nation, by children being taught mathematical concepts... A graphic representation of data abstracted from banks of every computer in the human system. Unthinkable complexity. Lines of light ranged in the nonspace of the mind, clusters and constellations of data. Like city lights, receding..."
- William Gibson, Neuromancer

Monday, September 30, 2013

Catching a segmentation fault, its analysis & a bunch of other stuff

We all know what a segmentation fault (SIGSEGV) is. A one-liner definition: when a process accesses a memory location that isn't mapped into its address space (or violates the permissions of a mapping), a segmentation fault occurs. This is actually a SIGSEGV signal sent to the offending process by the kernel. To analyze and debug it, a core dump file is needed; if for some reason a core file is not being generated, you can use the catchsegv script provided on Linux to analyze the violation. This nifty little script gives us the exact details when the SIGSEGV signal is received.

Important excerpt from catchsegv script which does all the magic :
LD_PRELOAD=${LD_PRELOAD:+${LD_PRELOAD}:}/\$LIB/libSegFault.so \
SEGFAULT_USE_ALTSTACK=1 \
SEGFAULT_OUTPUT_NAME=$segv_output \
"$prog" ${1+"$@"} 2>&3 3>&-)
exval=$?
libSegFault.so itself generates a segmentation fault if executed directly, but if it is preloaded using LD_PRELOAD then the SIGSEGV signal can be caught for whatever executable follows it. This library is provided by default in any standard Linux distribution.

libSegFault.so is part of the glibc package. When preloaded, it basically takes over the responsibility of signal handling before the main user program is invoked; if the program doesn't handle the signal itself, libSegFault.so catches it and prints debugging information on the console.
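
For contrast, here is a minimal, hedged sketch of what it looks like when a program installs its own SIGSEGV handler instead of relying on a preloaded libSegFault.so; the handler and the deliberate NULL write are purely illustrative and not part of my program.
//Hedged sketch : a program handling SIGSEGV itself (illustrative only)
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void segv_handler (int sig)
{
        (void) sig;
        //write() is async-signal-safe, printf() is not
        static const char msg[] = "SIGSEGV caught by the program's own handler\n";
        write (STDERR_FILENO, msg, sizeof msg - 1);
        _exit (1);
}

int main (void)
{
        struct sigaction sa;
        memset (&sa, 0, sizeof sa);
        sa.sa_handler = segv_handler;
        sigaction (SIGSEGV, &sa, NULL);

        int *p = NULL;
        *p = 42;        //deliberate fault, handled above instead of by libSegFault.so
        return 0;
}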

I wrote a simple program to demonstrate this using the mprotect() call. Luckily, when the idea for this post came to mind I was already writing some code to understand memory-mapped I/O in Linux and its performance implications, so I decided to use it for generating the segmentation fault. Following is the flow of the code; a minimal sketch of this flow appears right after the list.
  1. A file is present in the current directory to which text content needs to be appended.
  2. Fix the string to be appended to the file contents.
  3. Optionally query and print the kernel page size, as mapped pages will be aligned to this page boundary.
  4. Open the file with read/write permission in append mode.
  5. Initialize the fstat() structure to get the original file size.
  6. Calculate the new length for the ftruncate() call by adding the original size and the length of the string to be appended.
  7. Truncate the file to accommodate the new contents.
  8. Initialize the fstat() structure again with the new file size.
  9. Memory-map the file from the beginning, with the new file size, as read/write.
  10. Put the program to sleep so the process memory mappings can be analyzed using pmap or the /proc filesystem.
  11. If a segfault is wanted, call mprotect() and change the previously read/write memory pages to read-only.
  12. Put the program to sleep again to analyze the mappings with the new permissions.
  13. Append the string contents to the memory-mapped file using memcpy() at the offset calculated from the initial length of the file. This operation fails if mprotect() was executed, i.e. memcpy() is about to write to a read-only memory page, causing the kernel to send SIGSEGV; otherwise it continues to the next step.
  14. An msync() call flushes all new changes back to disk with synchronous I/O, i.e. an actual disk commit, unlike asynchronous I/O where you only write to the kernel buffer layer. Refer to my earlier post for more details.
    1. What are exactly O_DIRECT, O_SYNC Flags, Buffers & Cached in Linux-Storage I/O?
  15. The msync() call doesn't generate SIGSEGV even after mprotect() is executed, as it only reads the memory pages.
  16. Unmap the whole region.
  17. Close the file descriptor.
  18. Optionally measure timing at each stage.
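
The following is a minimal sketch of that flow, assuming a fixed path /tmp/dst_file and the same "No Segfault" string; it is not my full program (downloadable below), just the skeleton of steps 4 to 17 with the timing and most of the printing removed.
//Minimal sketch of the flow above (assumed path /tmp/dst_file, error handling trimmed)
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main (int argc, char *argv[])
{
        const char *append_str = "No Segfault\n";
        size_t add_len = strlen (append_str);
        int want_segv = (argc > 1 && atoi (argv[1]) == 1);

        int fd = open ("/tmp/dst_file", O_RDWR | O_APPEND);            //Step 4
        if (fd == -1) { perror ("open"); return 1; }

        struct stat st;
        fstat (fd, &st);                                                //Step 5
        off_t old_size = st.st_size;

        ftruncate (fd, old_size + add_len);                             //Steps 6-7
        fstat (fd, &st);                                                //Step 8

        char *map = mmap (NULL, st.st_size, PROT_READ | PROT_WRITE,    //Step 9
                          MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror ("mmap"); return 1; }

        if (want_segv)                                                  //Step 11
                if (mprotect (map, st.st_size, PROT_READ) == -1)
                { perror ("mprotect"); return 1; }

        memcpy (map + old_size, append_str, add_len);                   //Step 13 : faults here if pages are read-only
        msync (map, st.st_size, MS_SYNC);                               //Step 14
        munmap (map, st.st_size);                                       //Step 16
        close (fd);                                                     //Step 17
        return 0;
}
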
Code excerpt from my program for mprotect() call :
//If the user wants to generate a segmentation fault, then and only then execute this mprotect() section
if (atoi(argv[1]) == 1)
{
        //Convert the mapping protection to read-only
        //mmap_start          : Starting address of the mapping
        //myfilestat2.st_size : Length over which the protection needs to be changed, so whole pages are changed
        //PROT_READ           : New protection is read-only
        //FYI                 : Any write operation on these pages after this call will be aborted with a SIGSEGV signal
        //Check the return value rather than errno alone; errno is not reset on success
        if (mprotect (mmap_start, myfilestat2.st_size, PROT_READ) == -1)
        {
                printf ("Error from mprotect(): %s\n", strerror (errno));
                return 1;
        }
        else
        {
                printf ("Mprotect() successful, mmap'ed converted to PROT_READ, SIGSEGV Imminent on memcpy()\n");
        }

        //Sleep so the new memory map can be inspected with pmap or /proc
        system ("sleep 60");
}
Full source code can be downloaded from here.
Now let's execute this code and see it under catchsegv, gdb, valgrind, pmap & /proc/self/maps.
Both sleep timers give us plenty of time to watch the memory mapping of the file dst_file change from rw-s to r--s.

catchsegv :

I ran the segmentation fault generator under the catchsegv utility and the output is given below. Here you can see that the memcpy() call after the mprotect() call generated the SIGSEGV because the mapping was read-only. A quick scan through the excerpt below reveals that "/tmp/dst_file" is mapped read-only (r) and shared (s), not writable at all; this is the result of the mprotect() call. For those who don't know what the shared (s) flag on mapped pages means, here is a one-liner: it indicates that changes made to these pages are committed to the mapped file and are visible to all other processes sharing the mapping. Register names starting with R indicate that this is a 64-bit process. The 64-bit general purpose register R15 holds the starting address of the dst_file mapping, and the accumulator register holds an address that falls within the dst_file address range. These are the register values at the time of the segmentation fault. RAX is used in string load & store operations, and RDI points to the destination in string operations. Explaining all the registers is altogether a different topic. Hopefully you got the idea of debugging with this nifty utility.
[root@testsrv ~]# catchsegv ./sigsegv_generator 1
Appending initial file contents from src_file to dst_file using system() call
[+]cat redirection time: 0.002880 sec
Opened /tmp/dst_file for reading/writing
file desc = 3
Actual string length str=13 content to append with new line = "No Segfault
"
Original File size before truncate = 221
Successfully truncated file to digest new contents, new file size = 234
Mmap() successful
[+]mmap time: 0.000005 sec
Mprotect() successful, mmap'ed converted to PROT_READ, SIGSEGV Imminent on memcpy()
*** Segmentation fault
Register dump:

 RAX: 00002b12ead980dd   RBX: 000000000040146a   RCX: 000000000000004e
 RDX: 000000000000000d   RSI: 000000000040146a   RDI: 00002b12ead980dd
 RBP: 0000000000000003   R8 : 000000000040146a   R9 : 53202c444145525f
 R10: 00007fffe9b94a40   R11: 00002b12eae24530   R12: 00002b12eb12df08
 R13: 00007fffe9b94df0   R14: 00007fffe9b94de0   R15: 00002b12ead98000
 RSP: 00007fffe9b94cb8

 RIP: 00002b12eae24541   EFLAGS: 00010202

 CS: 0033   FS: 0000   GS: 0000

 Trap: 0000000e   Error: 00000006   OldMask: 00000000   CR2: ead980dd

 FPUCW: 0000037f   FPUSW: 00000000   TAG: 00000000
 RIP: 00000000   RDP: 00000000

 ST(0) 0000 000000000000003b   ST(1) 0000 0000000000000005
 ST(2) 0000 0000000000000005   ST(3) 0000 0000000000001d79
 ST(4) 0000 0000000000000000   ST(5) 0000 0000000000000000
 ST(6) 0000 0000000000000000   ST(7) 0000 0000000000000000
 mxcsr: 1fa0
 XMM0:  000000000000000000000000ff000000 XMM1:  000000000000000000000000ff000000
 XMM2:  000000000000000000000000ff000000 XMM3:  000000000000000000000000ff000000
 XMM4:  000000000000000000000000ff000000 XMM5:  000000000000000000000000ff000000
 XMM6:  000000000000000000000000ff000000 XMM7:  000000000000000000000000ff000000
 XMM8:  000000000000000000000000ff000000 XMM9:  000000000000000000000000ff000000
 XMM10: 000000000000000000000000ff000000 XMM11: 000000000000000000000000ff000000
 XMM12: 000000000000000000000000ff000000 XMM13: 000000000000000000000000ff000000
 XMM14: 000000000000000000000000ff000000 XMM15: 000000000000000000000000ff000000

Backtrace:
/lib64/libc.so.6(memcpy+0x11)[0x2b12eae24541]
??:0(main)[0x400cc6]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b12eadbac9d]
??:0(_start)[0x400929]

Memory map:

00400000-00402000 r-xp 00000000 00:5a 25958017 /root/sigsegv_generator
00601000-00602000 rw-p 00001000 00:5a 25958017 /root/sigsegv_generator
00b56000-00b7b000 rw-p 00000000 00:00 0 [heap]
2b12ea970000-2b12ea990000 r-xp 00000000 00:5a 25300304 /lib64/ld-2.12.so
2b12ea990000-2b12ea991000 rw-p 00000000 00:00 0
2b12eab8f000-2b12eab90000 r--p 0001f000 00:5a 25300304 /lib64/ld-2.12.so
2b12eab90000-2b12eab91000 rw-p 00020000 00:5a 25300304 /lib64/ld-2.12.so
2b12eab91000-2b12eab92000 rw-p 00000000 00:00 0
2b12eab92000-2b12eab96000 r-xp 00000000 00:5a 25300249 /lib64/libSegFault.so
2b12eab96000-2b12ead95000 ---p 00004000 00:5a 25300249 /lib64/libSegFault.so
2b12ead95000-2b12ead96000 r--p 00003000 00:5a 25300249 /lib64/libSegFault.so
2b12ead96000-2b12ead97000 rw-p 00004000 00:5a 25300249 /lib64/libSegFault.so
2b12ead97000-2b12ead98000 rw-p 00000000 00:00 0
2b12ead98000-2b12ead99000 r--s 00000000 00:5a 25957962 /tmp/dst_file
2b12ead9c000-2b12eaf23000 r-xp 00000000 00:5a 25300327 /lib64/libc-2.12.so
2b12eaf23000-2b12eb123000 ---p 00187000 00:5a 25300327 /lib64/libc-2.12.so
2b12eb123000-2b12eb127000 r--p 00187000 00:5a 25300327 /lib64/libc-2.12.so
2b12eb127000-2b12eb128000 rw-p 0018b000 00:5a 25300327 /lib64/libc-2.12.so
2b12eb128000-2b12eb12f000 rw-p 00000000 00:00 0
2b12eb134000-2b12eb14a000 r-xp 00000000 00:5a 25300207 /lib64/libgcc_s-4.4.5-20110214.so.1
2b12eb14a000-2b12eb349000 ---p 00016000 00:5a 25300207 /lib64/libgcc_s-4.4.5-20110214.so.1
2b12eb349000-2b12eb34a000 rw-p 00015000 00:5a 25300207 /lib64/libgcc_s-4.4.5-20110214.so.1
7fffe9b82000-7fffe9b97000 rw-p 00000000 00:00 0 [stack]
7fffe9bfe000-7fffe9c00000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
[root@testsrv ~]#

gdb :

To debug the above code under gdb, compile it with the command below so that DWARF debugging symbols are included in the binary.
gcc -O2 <<source file>> -ggdb -o <<executable name>>
I have modified the source string to be appended to a single NULL character for ease of offset calculation in the memcpy() call; the source file also contains a single character, so the offset happens to be 2. Now execute the code under gdb.
gdb sigsegv_generator
Now run the executable inside gdb with argument 1.
(gdb) run 1
Now you can either use breakpoints to control the flow of execution, or simply use CTRL + C to stop at an early stage and then step through using n. When SIGSEGV is caught after memcpy(), gdb shows the following message.
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaad55541 in memcpy () from /lib64/libc.so.6
(gdb)
The above address 0x00002aaaaad55541 is the instruction on which the segfault happened. The same address is also held by the 64-bit instruction pointer RIP shown in the register dump below. Now let's see what happened before and after the memcpy() call.
(gdb) x 0x00002aaaaad55541
0x2aaaaad55541 <memcpy+17>:     0xff480f88
(gdb) x/10i 0x00002aaaaad55530
   0x2aaaaad55530 <memcpy>:     cmp    $0x20,%rdx
   0x2aaaaad55534 <memcpy+4>:   mov    %rdi,%rax
   0x2aaaaad55537 <memcpy+7>:   jae    0x2aaaaad555b0 <memcpy+128>
   0x2aaaaad55539 <memcpy+9>:   test   $0x1,%dl
   0x2aaaaad5553c <memcpy+12>:  je     0x2aaaaad55549 <memcpy+25>
   0x2aaaaad5553e <memcpy+14>:  movzbl (%rsi),%ecx
=> 0x2aaaaad55541 <memcpy+17>:  mov    %cl,(%rdi)
   0x2aaaaad55543 <memcpy+19>:  inc    %rsi
   0x2aaaaad55546 <memcpy+22>:  inc    %rdi
   0x2aaaaad55549 <memcpy+25>:  test   $0x2,%dl
(gdb)
Note that the above assembly instructions are in AT&T syntax. I got a little confused about Intel vs. AT&T with respect to the MOV instruction; it turns out AT&T uses the MOV SRC, DST operand order. As I am a noob at assembly programming I asked the above query on Stack Overflow and got a convincing response; the link is at the end of this section. The arrow marks the faulting instruction inside memcpy().
(gdb) info proc mappings
Related excerpt from output :
Mapped address spaces:

Start Addr      End Addr        Size    Offset  objfile
0x2aaaaaacf000  0x2aaaaaad0000  0x1000  0       /tmp/dst_file
Let's dump the registers (x86_64 machine, so expect register names starting with R).
(gdb) info registers
rax            0x2aaaaaacf002   46912496267266
rbx            0x40146a 4199530
rcx            0x69     105
rdx            0x1      1
rsi            0x40146a 4199530
rdi            0x2aaaaaacf002   46912496267266
rbp            0x7      0x7
rsp            0x7fffffffe468   0x7fffffffe468
r8             0x40146a 4199530
r9             0x53202c444145525f       5989836176067220063
r10            0x7fffffffe1f0   140737488347632
r11            0x2aaaaad55530   46912498914608
r12            0x2aaaab05eac8   46912502098632
r13            0x7fffffffe5a0   140737488348576
r14            0x7fffffffe590   140737488348560
r15            0x2aaaaaacf000   46912496267264
rip            0x2aaaaad55541   0x2aaaaad55541 <memcpy+17>
eflags         0x10202  [ IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
(gdb)
Now we have all the inputs needed to conclude what went wrong. As you can see from the mapped address space output, the dst_file mapping starts at 0x2aaaaaacf000; including the offset it becomes 0x2aaaaaacf002. This offset address is present in the RDI register, which is used as the destination index. The RSI (source index) register holds the address of the fixed string that gets appended at the offset location. The instruction "movzbl (%rsi),%ecx", executed just before the faulting instruction, loads the byte at the source address in RSI into ECX, zero-extended. What "mov %cl,(%rdi)" then does is store that byte from the CL register into the memory pointed to by RDI; it uses register-indirect addressing to do so. CL is the low 8 bits of ECX, which in turn is the low 32 bits of the 64-bit RCX register, and memcpy() comes from the string library family, so these are exactly the registers doing the byte copy here. So, because the pages at the RDI address are read-only due to the mprotect() call, the write is an illegal memory access and causes the kernel to send SIGSEGV to the process.

Note 1 : To switch to Intel assembly syntax, use the following in gdb.
set disassembly-flavor intel
Stackoverflow link - http://stackoverflow.com/questions/19158620/memcpy-segmentation-fault-debugging-exercise-on-prot-read-in-mmap?

Note 2 : A very good way to analyze assembly code in gdb is to view it alongside the C code, so that each C statement is correlated with its assembly. To do this, use the following command in gdb just after the memcpy() call.
(gdb) disassemble /m

pmap :

I have sleep() calls in the program to give us time to analyze the process mappings using the pmap command. After executing the code, it sleeps for 60 seconds after the mmap() call; in this window we can observe that before the mprotect() call the dst_file mapping is rw, and that it changes to r afterwards. Use the following command to see the process mappings.
pmap <<PID>> -x
RW mapping :
[root@testsrv ~]# pmap 1997 -x
1997:   ./sigsegv_generator 1
Address           Kbytes     RSS   Dirty Mode   Mapping
[Excerpt]
00002b6c7ad08000       4       0       0 rw-s-  dst_file
[/Excerpt]
[root@testsrv ~]#
Same mapping, now read-only :
[root@testsrv ~]# pmap 1997 -x
1997:   ./sigsegv_generator 1
Address           Kbytes     RSS   Dirty Mode   Mapping
[Excerpt]
00002b6c7ad08000       4       0       0 r--s-  dst_file
[/Excerpt]
[root@testsrv ~]#

Note : "S" indicates that the map is shared, "P" will indicate it is private.

An extension to the above section, for more detailed reporting, is to use /proc/<<pid>>/smaps. Here we can see the Dirty fields and a bunch of other relevant stuff related to the performance of memcpy().

RW mapping :
2b8571cf0000-2b8571cf1000 rw-s 00000000 00:5a 25957962                   /tmp/dst_file
Size:                  4 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
For the read-only map, the above information is the same except for the permissions field, which becomes r--s.
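
As a side note, the program can also dump its own mappings instead of relying on an external pmap run; a minimal, hedged sketch (the helper name dump_own_maps is mine, not from the original program) would be:
//Hedged sketch : dumping the process's own mappings from inside the program
#include <stdio.h>

static void dump_own_maps (void)
{
        FILE *f = fopen ("/proc/self/maps", "r");
        if (!f)
        {
                perror ("fopen(/proc/self/maps)");
                return;
        }
        char line[512];
        while (fgets (line, sizeof line, f))
                fputs (line, stdout);
        fclose (f);
}
Call it during either sleep window to watch the dst_file line flip from rw-s to r--s.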

Valgrind :

Valgrind may not be the most useful tool for debugging the illegal memory access in our case, but it does report the call on which the segfault occurred. Command line to use:
valgrind -v ./<<executable>> <<args>>
Output at Segfault :
[Excerpt]
--2064-- REDIR: 0x4eb5530 (memcpy) redirected to 0x4c28ec0 (memcpy)
==2064==
==2064== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==2064==  Bad permissions for mapped region at address 0x4022005
==2064==    at 0x4C2914F: memcpy (mc_replace_strmem.c:628)
==2064==    by 0x400CC5: main (mprotect_sigsegv.c:173)
--2064-- REDIR: 0x4ea77f0 (free) redirected to 0x4c26890 (free)
[/Excerpt]
The output of Valgrind is pretty much self-explanatory. There are no leaks detected.

So that's it for now; I learned a ton of new stuff. It's been a really long post, which I didn't intend, but no worries, this is the good stuff. Keep debugging.

Sunday, September 15, 2013

Compiling a strict x86 executable on x86_64 architecture kernel

A quick post. Recently I needed to compile a strictly 32-bit binary executable to test some abnormal behavior of CUDA C source code. I wanted the CUDA host code to be compiled for the x86 architecture on a RHEL 6.0 x86_64 OS. The CUDA compiler nvcc uses gcc in the back-end to compile and link with added inputs; you can easily verify this by looking at the common.mk file in the Nvidia Computing SDK.

First make sure to create object files for the elf32-i386 architecture. An x86_64 system uses 64-bit ELF (Executable and Linkable Format) by default. GCC on x86_64 targets the x86_64 machine architecture if nothing is specified, i.e. there is an implicit "-m64" flag. You can verify this by using the verbosity flag in gcc while compiling: COLLECT_GCC_OPTIONS doesn't indicate "-march=i686" as it does with an explicit "-m32" flag. GCC provides the "-m32" and "-m64" options for specifying the architecture.

"gcc -v test.c" results
"gcc -m32 -v test.c" results
Remember, if you are going to generate object files first and link them later, those object files must be in the elf32-i386 architecture too. If you try to link elf64-x86-64 object files in an elf32-i386 compilation, the following error will occur.

Linking x86_64 object files to x86 linker
The ELF architecture of an object file can be verified using either the objdump or the file command.

ELF 32-bit format of object file
ELF64-x86-64 format of object file
You might have noticed the additional ELF section .eh_frame in the x86_64 ELF header of the object file. GCC doesn't seem to include this section when creating object files for the x86 machine architecture. However, if you go ahead and create executables out of them, objdump reveals that both the x86 & x86_64 executables have .eh_frame & .eh_frame_hdr in them. This is strange & I want to dig into it more when I have time. Anyway, .eh_frame contains information for frame unwinding during exception handling & .eh_frame_hdr contains a pointer to the .eh_frame section. The concept of call stack unwinding is hard to describe in this short post, though I will try to explain it briefly in the next section. For detailed information refer to the following Linux Foundation documentation & supporting links, which I found useful.

Linux Foundation - Linux Standard Base Core Specification 4.1

SCO SYSTEM V Application Binary Interface

ELF Specification for x86

ELF64 Object File Format

Good Call Stack Reference for C

A call stack is mainly used for keeping track of the point to which control should return after execution of a subroutine, or in general a called function. If you are familiar with push and pop operations, adding a subroutine's entry to the call stack (a push operation) is called winding, & removing entries from the call stack is, obviously, called unwinding. Unwinding transfers control out of a nested subroutine to a previous subroutine, which in turn restores the proper context after popping many stack frames, to assist error handling. In other words, if exception handling is enabled, then when a run-time exception is thrown the stack is unwound until a matching exception handler is found. So, as we can see, elf64-x86-64 by default adds exception handling headers to the object file; however, the specification document doesn't clearly state this & I assumed it to be so. Remember that exception handling for C & C++ is not the same. I would suggest you read more about this on your own. One more important tool on Linux for digging around ELF is readelf.
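
Just to make the call-stack idea concrete, here is a small hedged sketch using glibc's backtrace() family (this has nothing to do with .eh_frame generation itself, and the function names outer/inner are mine):
//Hedged sketch : printing the current call stack with glibc's backtrace()
//Compile with -rdynamic so backtrace_symbols_fd() can resolve function names
#include <execinfo.h>

static void show_stack (void)
{
        void *frames[16];
        int depth = backtrace (frames, 16);
        backtrace_symbols_fd (frames, depth, 1);        //1 = stdout
}

static void inner (void) { show_stack (); }
static void outer (void) { inner (); }

int main (void)
{
        outer ();       //dump shows main -> outer -> inner -> show_stack
        return 0;
}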

On a side note, if you get the following error while compiling an x86 executable on an x86_64 system, you may need to install the glibc-devel-i386 package compatible with your existing glibc version.

Sunday, July 7, 2013

What are exactly O_DIRECT, O_SYNC Flags, Buffers & Cached in Linux-Storage I/O?

Feels good to post after a long time. I always hear HPC systems people going on about Direct I/O in the context of I/O performance measurements on distributed file systems like Lustre or CXFS, so I thought let's dig into this. You might have seen usage of the dd command on my blog with the oflag parameter, but for those who don't know, I have briefly revised it here.
dd if=/dev/zero of=/tmp/testfile bs=2M count=250 oflag=direct
oflag here tells dd how to perform the write operation, according to the provided symbols, i.e. direct. What exactly direct means we will see a little later in the post.
dd if=/tmp/testfile of=/dev/null bs=2M iflag=direct
Similar to the write operation, the read operation takes iflag as a parameter with the same symbols; of particular interest to us is direct.

If you run these commands with or without oflag/iflag, you will notice a significant I/O performance difference in the statistics reported by dd. This is basically the effect of the caches employed by modern-day storage systems and the Linux kernel.
These caches can be multilevel, going right from the operating system's buffer cache to the storage controller cache to the hard drive cache, so the cache effects you see depend on the underlying system architecture. Filesystem software also plays a huge role in caching behavior. A traditional distributed filesystem might leverage multiple caches on multiple LUNs distributed across multiple storage controllers. An object-based filesystem such as Lustre has multiple OSSes (Object Storage Servers), each of which leverages its own independent OS buffer cache to enhance performance. I am going to do a separate, detailed post shortly about the Lustre performance impact of the OSS cache. My point is that cache-effect benchmarks of one specific HPC system as a whole are not comparable to another system unless all the granular details are known and acting in the same direction. The cache effect cannot be completely removed in today's complex systems; we can only tell the underlying components not to use their caches, if they are configured to accept such requests. When you open a disk file with none of the flags mentioned below, a call to read() or write() for that file returns as soon as the data is copied into a kernel address space buffer; the actual operation happens later, depending on the operating system. The buffer usually defaults to 2.5% of physical memory, but this is subject to change between Linux kernel trees. We will also see the difference between the "buffers" and "cached" sections of the free command later in this post.

O_DIRECT Flag :
Linux usually tends to cache as many I/O requests as possible in memory, so that subsequent reads can be served right from main memory, and write operations are also performed on main memory, giving the application process the illusion of higher read/write performance. The data in main memory is committed to disk a little later; you can call it a delayed write. The process is meanwhile informed that the I/O is complete and can continue its execution, while Linux performs all the dirty work of pushing bits to the hard drives in the background. Memory pressure can also cause Linux to shrink this main memory buffer.

However, sometimes you really need to make sure that the data you read is actually coming from disk. This might be the case where distributed/clustered file systems have to share the same data with cluster nodes instantaneously, and stale data would be served if the caches are not committed in a proper manner. Direct I/O is also useful for performance testing of hard drives. Direct I/O allows a process to perform I/O without going through the OS cache layer. This makes I/O slower, as actual spindle writes/reads happen and Linux kernel optimizations may be bypassed.

Notice here that we are telling the open(2) call to eliminate or minimize the cache layer using the O_DIRECT flag. This doesn't mean that the I/O writes to the cache and the cache contents are then immediately written to disk; direct I/O performs the I/O directly against the disk. The O_SYNC flag is what is used to commit cache contents to disk on the user's request, which we will discuss in the next section of the post. An O_DIRECT read() or write() is synchronous: control does not return until the disk operation is complete.

Sometimes it is advised to use direct I/O for data which will not be accessed in the near future, to avoid the double copy of data from the application buffer to the kernel buffer and then to disk. This leaves the kernel buffer free for caching more frequently accessed data. Hence, if you are in control of the complete system and the software layer is developed in-house, it can be a good idea to use direct I/O.
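
A minimal, hedged sketch of an O_DIRECT write is below; the path and the 4096-byte size are my assumptions (and some filesystems, e.g. tmpfs, may not support O_DIRECT at all). The important detail is that O_DIRECT generally requires the user buffer, file offset and transfer length to be aligned, typically to the device's logical block size, hence the posix_memalign().
//Hedged sketch : writing one aligned block with O_DIRECT (path and sizes are illustrative)
#define _GNU_SOURCE             //needed for O_DIRECT on Linux
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main (void)
{
        int fd = open ("/tmp/direct_test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd == -1) { perror ("open(O_DIRECT)"); return 1; }

        void *buf;
        size_t len = 4096;                              //assumed block-aligned length
        if (posix_memalign (&buf, 4096, len) != 0)      //aligned buffer is mandatory for direct I/O
        {
                fprintf (stderr, "posix_memalign failed\n");
                return 1;
        }
        memset (buf, 'A', len);

        //Bypasses the page cache; returns when the transfer to the device is done
        if (write (fd, buf, len) == -1)
                perror ("write");

        free (buf);
        close (fd);
        return 0;
}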

O_SYNC Flag :
This particular flag, when used, returns control only when the I/O has actually been performed on the disk through the buffer cache, so changes are made persistent on disk to prevent losses from an unexpected crash. This flag is generally used when data needs to be checkpointed once a process has worked through a certain amount of it. Please notice that there is a very subtle difference between O_DIRECT & O_SYNC: O_SYNC data flows through the buffer cache to disk, while with O_DIRECT the kernel cache layer is eliminated or minimized. The O_SYNC flag commits everything including metadata to disk, while O_DSYNC only commits the data part. The bdflush()/pdflush() kernel daemons are responsible for data commits. Basically, the O_SYNC flag commits the data and removes the dirty flag from the "cached" pages, so under memory pressure the kernel can easily drop the SYNC'ed data, freeing the RAM for other "cached" data.
The fsync() call, or the sync command, is used in Linux to write any remaining uncommitted data to disk; it does not return until the operation is done. A usual way to always make sure that data is committed to disk is to do direct I/O followed by fsync(): DMA is used to transfer the data to disk, and fsync() takes care of all the metadata commits. We will verify this in the little experiment below. There is a catch: all modern hard drives implement a small on-device cache, so O_SYNC'ed data will not always hit the disk platters immediately; the on-device cache can commit the write later on, although failure of this cache is very rare.
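
A hedged sketch contrasting the two approaches (the paths are illustrative, not from any real workload):
//Hedged sketch : O_SYNC on every write() vs. buffered writes plus an explicit fsync()
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main (void)
{
        const char msg[] = "checkpoint record\n";

        //Variant 1 : O_SYNC - each write() returns only after the data has been
        //pushed through the buffer cache to the device
        int fd1 = open ("/tmp/sync_test", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
        if (fd1 == -1) { perror ("open(O_SYNC)"); return 1; }
        if (write (fd1, msg, sizeof msg - 1) == -1)
                perror ("write");
        close (fd1);

        //Variant 2 : normal buffered write - data sits dirty in the page cache
        //until we ask for a commit with fsync()
        int fd2 = open ("/tmp/fsync_test", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd2 == -1) { perror ("open"); return 1; }
        if (write (fd2, msg, sizeof msg - 1) == -1)
                perror ("write");
        if (fsync (fd2) == -1)                  //blocks until data & metadata are on disk
                perror ("fsync");
        close (fd2);
        return 0;
}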

A very good article to read more about O_SYNC & Hard Drive Cache behavior

HDD, FS, O_SYNC : Throughput vs. Integrity

LWN Article

Ensuring data reaches disk

Now let's play with these concepts in Linux. I have fired up a CentOS Minimal virtual machine in VirtualBox just to test this. I am testing with 1024MB of memory; how much memory you have in the system determines what percentage of memory is going to be used for dirty pages/buffers/caches. I have kept all the "vm" layer kernel parameters at their defaults for the first test, to show you that when you do a write operation with data almost equal to the size of main memory, the kernel daemon pdflush is forced to write the data back to disk. This reduces the subsequent sync time, at the cost of slightly lower performance, as the application process is writing to the kernel cache while the pdflush daemon is committing it to disk. I am not going to explain the dirty-related parameters in depth here, as that would lose the focus of this post, and you can easily Google them.

Default dirty parameters in CentOS 6.4

Behavior of I/O ops large enough considering dirty ratios 

The second snapshot reveals that for a write operation of 524MB the sync operation didn't take much time, as the dirty pages had already been committed by pdflush, which keeps the cache ready to be flushed immediately if memory pressure increases or to be allocated to a new demanding process. Take into consideration that even though the dirty pages of the write operation were committed back to disk, pdflush didn't empty the cache; the benefit of this is that a subsequent read operation on that file will be faster. In the same manner, the drop_caches operation also takes very little time, as the data has already been committed by pdflush. Before we go further, let's have a quick brief on the drop_caches types.

echo 1 > /proc/sys/vm/drop_caches
This frees Pagecache from memory.
echo 2 > /proc/sys/vm/drop_caches
This frees Dentries (directory entries) and Inodes.
echo 3 > /proc/sys/vm/drop_caches
This frees Pagecache, Dentries & Inodes.

These operations are non-destructive, i.e. if an object is dirty, drop_caches will not free it; you should run a sync first so that it gets written back. Once a sync operation is performed, drop_caches can free the inactive memory immediately. I suggest monitoring all of this in /proc/meminfo. The Dirty & Writeback fields in meminfo show the current dirty pages and the writeback operations queued up. Writeback activity is very transient, and you will see it only when a sync operation has queued data for commit from the dirty pages but hasn't written it to disk yet.

Remember, these drop_caches operations are one-time commands to the kernel. As soon as you drop the caches, the kernel carries on caching new requests irrespective of the value of drop_caches; the drop_caches file just holds the value that was written last time and is no longer in effect. So the caching behavior is permanent in the kernel unless you use the O_DIRECT flag.

Difference between buffers & cached in free command :
From the 2.6.x kernel series, buffers are used to cache file system metadata, i.e. buffers remember dentries and file permissions, and keep track of what memory region is being written to or read from for a particular block device. Cached only contains the contents of files, whether they are being written or read; another way to say this is that cached holds the page cache from which the file is eventually read. Just to clarify, the "-/+ buffers/cache" line in the free output shows how much used & free memory is projected to applications; this doesn't include buffers/cached memory, as that will be freed when required by the flushing daemons.

I ran a little experiment to demonstrate how buffers cache the metadata, by stressing the directory structure using the find command. I drew conclusions based on the output of /proc/meminfo. A small shell script to create a massive directory structure is given below. It was executed 4 times to create a humongous number of directories, with a small file touched inside each innermost directory.
#!/bin/bash
for i in {1..800}
        do
        mkdir $i
        cd $i
        for j in {1..50}
                do
                mkdir $j
                cd $j
                for k in {1..10}
                        do
                        mkdir $k
                        cd $k
                        touch test.file
                        cd ..
                        done
                cd ..
                done
        cd ..
        done

Buffers impact on performance & preservation of buffers.
Note - The dirty ratio parameters don't affect buffers, as buffers hold pure metadata, so the kernel flushing daemons will not touch them unless there is memory pressure. I have just highlighted the new dirty ratio parameters for my next experiment.

As you can see from the above snapshot, buffers greatly improve performance. For 108989 nested directories, 478MB of cached metadata is present in buffers. This results in a much shorter traversal time on the next run of the find command. Buffers are kept as long as possible unless explicitly flushed by drop_caches or evicted under memory pressure. Notice the 1.5 seconds versus 1 minute 41 seconds. Also notice that this metadata didn't go to cached, which is only showing around 7 MB (usually kernel thread caches).

Now let's move to the next section. The changes I made to the dirty parameters are given below; these are specifically needed for me to demonstrate the dirty cache behavior, as the total memory in my virtual machine is only 1024 MB. To keep more dirty data in the cache I had to increase these parameters. Remember, the values chosen here are supposed to be finalized very carefully for production servers, depending upon the type of workload and the amount of memory present. These values are for experimentation only and shouldn't be played with, specifically the centisecs-related parameters, which it is recommended not to touch.

Parameters tampered & explained :
  1. vm.dirty_background_ratio - New value - 95% : The percentage of memory that can be filled with dirty pages before pdflush starts flushing them to disk.
  2. vm.dirty_ratio - New value - 95% : The percentage at which the writing process itself starts writing dirty data back to disk. In other words, until 95% of the file system cache is filled with dirty data, the process will not be forced to write it out itself.
  3. vm.dirty_writeback_centisecs - New value - 7000 : The interval between pdflush daemon wake-ups, in hundredths of a second. Setting this to zero disables periodic writeback completely. With the new value, dirty data will be flushed automatically by pdflush every 70 seconds.
  4. vm.highmem_is_dirtyable - New value - 1 : The above ratios are calculated against the lowmem value, which comes to a small number considering the 1024MB of RAM, so I have made highmem dirtyable so that more memory can be kept dirty.
    1. Memory Zones on my 32 bit x86 virtual machine -
      • DMA ~  16380KB
      • Normal ~ 890872KB
      • HighMem ~ 141256KB
Memory Zones on x86 system with 1024MB RAM
Note - The above values can be obtained by multiplying the spanned number of pages by the kernel PAGESIZE.

non-destructiveness of drop_caches
Even though drop_caches is executed, the dirty pages are not dropped, as pdflush hasn't flushed the data yet.

On-Demand Sync afterwards
Further to the previous operation, if we execute the sync command, which commits the dirty data on demand, drop_caches can then immediately remove the cache.

Hope you have enjoyed the article. Let me know if you guys have any thoughts on this.