Wednesday, April 18, 2012

Transparent Huge Pages Performance Implications

Recently I have observed performance improvements for memory-intensive applications on RHEL 6.x systems due to THP (Transparent Huge Pages). THPs are different from statically reserved huge pages (HugeTLB): the kernel manages them automatically. THP is enabled by default in RHEL 6.x and is available upstream from kernel 2.6.38 onwards. Let's clear up some basic points:
  • Page size on x86 arch = 4 KiB.
  • Default page size on x86_64 arch is also 4 KiB.
  • Huge page size = 2 MiB => 512 x 4 KiB pages = 2 MiB.
  • A physical page is called a page frame.
  • Virtual-to-physical address translations are stored in page tables.
  • Traversing these page tables to resolve an address is called a page walk.
  • Page tables are linked hierarchically to cover the address space.
  • Every memory access needs an address translation; if the translation is not already cached, a page walk is required.
  • The operating system decides whether to back process memory with normal pages or huge pages when memory is allocated via a brk() or mmap() call (see the madvise() sketch after this list).
  • Completed translations are cached in the TLB (Translation Lookaside Buffer).
  • TLB cache hits or misses occur depending on the page table size and the associated access pattern.
  • The TLB cache typically needs to be flushed on every context switch.
  • A TLB can typically hold around 8 to 1024 entries.
  • With huge pages, each TLB entry covers as much memory as 512 normal pages, increasing the cache-hit ratio.
  • Huge pages may cause fragmentation overhead, since allocations get rounded up to the 2 MiB huge-page size (e.g. when the heap is extended via brk()).
  • cat /proc/meminfo reports THP usage under "AnonHugePages". As of now, THPs are used only for anonymous memory allocations.
  • No application configuration is required to exploit THPs.
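Besides the system-wide policy, a process can ask for THP on a specific mapping with madvise(MADV_HUGEPAGE) (mainly useful when the policy is set to "madvise"). A minimal sketch, assuming a 2.6.38+ kernel with THP support; it is not part of the benchmark below:

/* Minimal sketch - hint the kernel to back one anonymous mapping with THP */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 1024UL * 1024 * 1024;   /* 1 GiB */

    /* Anonymous private mapping; with the "always" policy the kernel
       may already use huge pages here without any hint */
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

#ifdef MADV_HUGEPAGE
    /* Explicit per-mapping hint (kernel 2.6.38+) */
    if (madvise(p, size, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");
#endif

    /* Touch the memory so pages are actually faulted in;
       AnonHugePages in /proc/meminfo should grow if THP kicks in */
    memset(p, 0, size);
    printf("touched %zu bytes\n", size);

    munmap(p, size);
    return 0;
}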
Enabling THP :
echo always > /sys/kernel/mm/redhat_transparent_hugepage/enabled
Disabling THP :
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
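To verify the current policy and actual THP usage, read back the same sysfs entry (the active mode is shown in brackets) and watch the AnonHugePages counter mentioned above:
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
grep AnonHugePages /proc/meminfo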
I have written a small program to stress the TLB cache and force page walks with normal pages and with huge pages. It is a single-process, single-threaded test that performs random operations on arrays, utilizing 1 GB to 50 GB of memory.

C code for 1GB memory :
/*Purpose - HugePages Benchmark on RHEL 6.x
Coder - Subodh Pachghare (www.thesubodh.com)
Date Modified - 16-4-12
Compilation - gcc -O2 HugePages_test_rhel6_svp.c*/
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <strings.h>

#define access 500000000UL
int gar1[access];          /* pre-computed random offsets (~2 GB static array) */
clock_t time_start;

int main()
{
    /* Modify size1 to change the memory utilization of the code */
    size_t size1 = 1024UL * 1024 * 1024 * 1;

    time_start = clock();
    char *p = malloc(size1);
    if (p == NULL)
    {
        perror("malloc");
        exit(1);
    }
    /* Zero p, touching every byte so the pages are actually faulted in */
    bzero(p, size1);
    printf("[+]mem_alloc: %f sec\n",
           (double)(clock() - time_start) / CLOCKS_PER_SEC);

    /* Pre-compute random offsets; random() returns at most 2^31-1,
       so offsets stay within the first ~2 GB of p for larger sizes */
    long i;
    for (i = 0; i < access; i++)
    {
        gar1[i] = random() % size1;
    }

    time_start = clock();
    /* Each array access needs an address translation (page walk on a TLB miss) */
    for (i = 0; i < access; i++)
    {
        p[gar1[i]] = 2 * gar1[i];
    }
    printf("[+]random_op: %f sec\n",
           (double)(clock() - time_start) / CLOCKS_PER_SEC);
    exit(0);
}
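
A side note on the timing: clock() measures CPU time, not wall-clock time. A minimal sketch of a wall-clock timer using clock_gettime(CLOCK_MONOTONIC) would look like this (link with -lrt on RHEL 6); it is an alternative, not what the benchmark above uses:

#include <time.h>

/* Wall-clock seconds since an arbitrary fixed point; CLOCK_MONOTONIC is
   not affected by system clock adjustments */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Usage:
   double t0 = now_sec();
   ... phase under test ...
   printf("[+]random_op: %f sec\n", now_sec() - t0);
*/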

Performance Graph :
THP Performance of Code
  • In the first phase, the code reports the combined time of memory allocation (the malloc() call) and the memory-touch operation (the bzero() call).
  • The time reported in the second phase is for the random operations performed on the arrays.
  • Each array access goes through the TLB and triggers a page walk on a miss.
  • From the graph, the memory allocation & memory-touch phase improves with THP enabled. The improvement grows with higher memory usage.
  • The execution phase shows a significant performance improvement with THP enabled. This indicates fewer TLB cache misses while traversing the page tables during the array operations (see the perf note after this list).
  • As I mentioned earlier, this is a single-process test accessing memory on a shared-memory system with ccNUMA architecture. Results may vary for larger numbers of processes/threads, depending on the access pattern.
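To check the TLB behaviour directly rather than inferring it from run time, the benchmark can be run under perf, for example:
perf stat -e dTLB-loads,dTLB-load-misses ./a.out
(These are perf's generic data-TLB cache events; whether they are supported depends on the CPU.)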
THPs for Parallels Virtuozzo Users :
  • THPs on the base node are made available to all containers automatically; there is no need for any container-specific configuration. THP in Virtuozzo is container-template agnostic, i.e. a RHEL 6.x base node can provide THPs to SLES-template-based containers.
My Take - There is no point in disabling this feature if you have RHEL 6.x running memory-intensive applications. It will give you a performance boost without any modification to the application.

1 comment:

  1. random() only generates 31-bit numbers, so regardless of the total allocated size your
    random accesses will only be touching 2 GByte of p[]. But you'll also be doing sequential
    accesses to the gar1[] array, which is 500M x 8 = 4 GByte. So that will confuse the issue with
    an extra cache miss for every 8 iterations.
