Virtual memory problem on Tru64 (May05-Sep05)

What's the problem? (in short)

We own a GS1280 from HP, which has 128GB of memory. If one uses more than about 15GB of memory (depending on file I/O), the machine becomes very slow; some processes can hang or get killed because the swap fills up, although there is still a lot of free memory and borrowed UBC (30-80GB).

The problem is fixed by the latest kernel patch. Only some minor problems are left, and those can be handled. Thanks to Han (Sep05).

What's this page for?

First, it's for users asking me why they are disturbed while using our "super"-computer. Second, it's for our colleagues who just want to know what game HP is playing with its customers. Third, it's a hopefully helpful report for Tru64 users or other interested people with similar problems. Last but not least, it's for me, to remember how frustrating it is to communicate with HP's software support. Feel free to correct me if I am completely wrong.

News (May 24, 2005)

First we set vm_page_prewrite_target to 32K pages to avoid other effects.
We also want UBC to have the lowest priority (ubc_borrowpercent=ubc_minpercent=0).
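
For reference, this is how such attributes can be changed at runtime with sysconfig; a minimal sketch, assuming all three attributes live in the vm subsystem as the sys_attrs_vm man page suggests (runtime changes are lost on reboot; sysconfigdb makes them permanent):

  sysconfig -r vm vm_page_prewrite_target=32768
  sysconfig -r vm ubc_borrowpercent=0 ubc_minpercent=0
  sysconfig -q vm    # query the current vm attribute values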

What should happen if (free_pages < 2*vm_page_prewrite_target)?

 1. get pages from ubc                   (very fast)
 2. steal free pages from other RADs          (fast)
 3. get borrowed pages from ubc of other RADs (fast)
 4. swap out pages                            (slow)

But what actually happens:
 1. steal free pages from other RADs (fails without the vm_overflow patch if >3 RADs)
 2. get pages from ubc, but too slowly (not quite sure; fixed with vm_overflow?)
 3. _no_ page stealing from the ubc of other RADs -- that's the big problem!
 4. swapping (starts already at step 2)
   => swapping while a lot of free memory (incl. borrowed ubc)
      is still available on the other RADs
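
The following small C program is only an illustrative model of the two orderings above, not kernel code; all names and numbers in it are invented for illustration:

#include <stdio.h>

enum src { UBC_LOCAL, FREE_REMOTE, UBC_REMOTE, SWAP };

static long available (enum src s)
{
  /* toy numbers for the 16GB scenario: local ubc and remote free
     lists exhausted, plenty of borrowed ubc left on remote RADs */
  switch (s) {
  case UBC_LOCAL:   return 0;
  case FREE_REMOTE: return 0;
  case UBC_REMOTE:  return 1L << 20;  /* borrowed ubc pages */
  default:          return 1L << 30;  /* swap is always "available" */
  }
}

static enum src reclaim (int skip_remote_ubc)
{
  enum src order[] = { UBC_LOCAL, FREE_REMOTE, UBC_REMOTE, SWAP };
  int i;
  for (i = 0; i < 4; i++) {
    if (skip_remote_ubc && order[i] == UBC_REMOTE)
      continue;                       /* the observed bug */
    if (available (order[i]) > 0)
      return order[i];
  }
  return SWAP;
}

int main (void)
{
  /* prints "intended: 2  buggy: 3" -- 2=remote ubc, 3=swap */
  printf ("intended: %d  buggy: %d\n", reclaim (0), reclaim (1));
  return 0;
}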

Workaround, but with slow file I/O:
 set ubc to its minimum by setting ubc_maxpercent=1 (not desired)
 set vm_page_prewrite_target large enough (32K)

Solution:
 kernel patch

Details

C program to reproduce the misbehavior. Log file for our GS1280, Tru64 V5.1B-PK4 with the UBC patch installed. Please see below for a newer and better version. Swap (36GB) is configured in lazy mode, because providing more swap than memory (128GB), as the eager mode requires, would be a waste of resources. If you have questions, ask Joerg Schulenburg, who detected these problems and reported them to HP. You need a C compiler, vmubc and Tru64 to reproduce the kernel bug. Here are the facts:

Updated: All together I think the bug is in the page-steal algorithm. If the free memory of a RAD runs low, for some unknown reason pages cannot be stolen from the other RADs which still have free pages, and the vm_rss_block_target limit triggers blocking of the process. Some pages of that RAD marked inactive are paged out. This happens again and again, letting the swap grow. Together with the weakness of Tru64 of never moving a swapped page back to free memory, you can end up with a swapping machine and lots of free memory as the worst case. That way the buggy RAD system makes an HP NUMA machine with 128GB memory and 32 CPUs as slow as a PC with 4GB memory and 128GB swap for big-memory jobs. I found no useful documentation and no system variables concerning page stealing from other RADs. HP's support did fix this first problem, but there are more: a problem with giving back borrowed UBC still causes the machine to swap when it is not necessary. Any help is welcome.

History of the bug

I wrote this report because communication with HP's support is a nightmare: buying an expensive machine, paying a lot of money for support, and getting insufficient help. This is an excerpt of the events which may be connected with the kernel bug.

Explanations

RAD (Resource Affinity Domain)
To make Tru64 aware of the NUMA architecture, there is one RAD per CPU, with 4GB of memory belonging to that CPU. Each RAD has its own paging scheduler. If the memory of a RAD is consumed, memory pages are stolen from the neighbouring RADs. The application should not notice this, apart from a slightly slower access time to the nonlocal memory. So, from the user's point of view, memory appears flat, as expected on NUMA machines just like on real SMP machines (HP's support told me otherwise, but I am sure they are wrong).
UBC (Unified Buffer Cache)
To speed up (slow) disk operations, (faster) memory can be used as a buffer cache. Part of the UBC is called borrowed, so that otherwise unused memory can help to speed up disk operations. Borrowed UBC is given back under low-memory conditions. Paging should not occur as long as there are borrowed UBC pages, as stated in the sys_attrs_vm man page. Because the RAD implementation is buggy, this fails on NUMA systems.
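The free, inactive and UBC values in the logs and figures below were watched with the following tools (the options are sketched from memory; check the man pages):
  vmstat -R 5    # per-RAD memory statistics, every 5 seconds
  vmubc          # graphical monitor for virtual memory and UBC usage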
nmadvise
is a routine within the numa library. It allows telling the kernel to distribute the allocated memory pages across a list of given RADs. It's the only way to work around the kernel bug. You also have to prevent other processes from stealing memory from one of these RADs, so all processes have to care about the NUMA specifics. I think it's fine that this fine-tuning is possible, but telling the user that he has to use this specific interface all the time is not amusing.
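A rough sketch of how such a call could look; this is written from memory, so the memalloc_attr_t field names, the MPOL_* policy constant and the exact nmadvise() signature are assumptions that must be checked against nmadvise(3) and numa_intro(3):

#include <string.h>  /* memset */
#include <numa.h>    /* radset_t, memalloc_attr_t; link with -lnuma */

/* HYPOTHETICAL sketch: spread an already allocated buffer over
   RADs 24..27; the mattr_* fields and constants are assumptions. */
void spread_over_rads (void *buf, size_t len)
{
  radset_t rs;
  memalloc_attr_t attr;
  int rad;

  radsetcreate (&rs);
  rademptyset (rs);
  for (rad = 24; rad <= 27; rad++)
    radaddset (rs, rad);              /* the RADs the job may use */

  memset (&attr, 0, sizeof (attr));
  attr.mattr_policy = MPOL_STRIPPED;  /* assumption: stripe pages over rs */
  attr.mattr_radset = rs;             /* assumption */

  nmadvise (buf, len, MADV_CURRENT, &attr);  /* assumption: check the return code */
  radsetdestroy (&rs);
}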
Tru64
is, as far as I know, developed by DEC as Digital UNIX. DEC was assimilated by Compaq in 1998, and in 1999 the UNIX was renamed by Compaq. Compaq was acquired by HP in 2002.
GS1280
is a multiprocessor system based on Alpha chips which have a very good interprocessor connection, making nonlocal memory access very fast. Every Alpha chip has 4GB of local memory and four links to its neighbouring processors and their memory. This is a 2D NUMA concept. 32 CPUs are connected in a 4x8 torus and can access the whole 128GB of memory (that's the theory).
pitfalls
Hunting the bug, I did some stupid things. I added system() calls to the test program, calling date, ps, vmstat and swapon after writing every GB. That's a bad idea, because system() calls fork and exec, and fork duplicates the process including its memory. I saw the system calls becoming very slow, and of course that influenced the measurements. You also have to define __mingrowfactor (see man malloc) to prevent libc from allocating 10% of the actual process memory instead of the wanted size, which can have additional side effects.
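A safer pattern for the timing part, as a minimal sketch: take wall-clock timestamps in-process instead of spawning external tools. This also catches time the process spends blocked, which clock() in the test program below (CPU time only) does not:

#include <stdio.h>
#include <sys/time.h>

/* wall-clock seconds; no fork/exec side effects */
static double wallclock (void)
{
  struct timeval tv;
  gettimeofday (&tv, NULL);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main (void)
{
  double t0 = wallclock ();
  /* ... write one GB here ... */
  printf ("elapsed %.3f s\n", wallclock () - t0);
  return 0;
}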
Test program vmbug4.c:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>  /* memset */
#include <assert.h>
#include <time.h>

#define ALLOC_SIZE (1024 * 1024 * 1024)  /* 1GB chunk */

int main (int argc, char *argv[]) {
  int index, chunks, cycle;
  char **buf;   
  clock_t t1, t2;

  if (argc != 2) { printf(" usage: %s memory[GB]\n", argv[0]); exit (1); }
  chunks = atoi (argv[1]);  setvbuf (stdout, NULL, _IONBF, 0);
  buf = malloc (chunks * sizeof (char *));  assert(buf);

  printf (" alloc memory:");
  for (index = 0; index < chunks; index++) {
    buf[index] = (char *) malloc (ALLOC_SIZE); assert (buf[index]);
    printf (" %dGB", index);
  }

  for (cycle = 0; cycle < 200 ; cycle ++ )  {
     printf ("\n fill memory cycle=%d:",cycle);
     for (index = 0; index < chunks; index++) {
       t1 = clock();
       /* zero-fill the chunk to force allocation of its pages */
       memset (buf[index], 0, ALLOC_SIZE);
       t2 = clock();
       /* clock() counts CPU time (not wall time) in CLOCKS_PER_SEC units */
       printf (" %dGB %dms", index, (int) ((t2 - t1) / (CLOCKS_PER_SEC / 1000)));
     }
  }
  printf ("\n");
}
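To reproduce, compile the program and pin it to one RAD with runon, as used for the logs below (RAD number and size are just examples):

  cc -o vmbug4 vmbug4.c
  runon -r 26 ./vmbug4 16    # touch 16GB, pinned to RAD 26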
Example logs (April 14th):
Here are the results from before the vm_overflow patch was installed. After a reboot, the test was started using 1, 3, 4, 5, 6, 8, 12, 13, 14, 15, 16, 20 and 60GB. The problem becomes hard from 16GB upward. I started the program with runon -r 26 and observed the activity using xmesh, vmstat -R, top and ps aux.
 
CPU array, neighbours of RAD 26:
   17  19  21
   24 (26) 28
   25  27  29

FreePagesDiagram
Fig 1: Free pages of involved RADs over time during the experiment, sources: gnuplot-file and vmstat-R-log
Fig 1 shows a 16GB test. First, RAD26 hits the limit of 2*vm_page_prewrite_target=3072(?), where page stealing from RAD27 (lower neighbour) starts. But further pages are taken from the RAD26 free list, which probably causes the filling of the inactive list (purpose not clear). When RAD27 has given its free pages to RAD26, free pages from RAD24 (left neighbour) get stolen. After that, pages are stolen from RAD28 (right neighbour). The inactive lists of RAD26, 27 and 24 start above zero because the previous runs left them populated. After 40 seconds, four RADs of 4GB each are filled, but the inactive lists still grow for an unknown reason. 25K pages were stolen from RAD19 (upper neighbour). Only 133 pages are left in the free page list of RAD26. After 100s I killed the memeater program. So why is the inactive list populated? Does it have an important influence on the misbehavior shown below?
FreePagesDiagram
Fig 2: Free pages of involved RADs over time during the experiment, sources: gnuplot-file and vmstat-R-log
After the 4th RAD (RAD28) is filled, the number of free pages on RAD26 drops below 40 pages and paging starts (4*4GB=16GB touched). Until here, every GB is written in about 1-3 seconds. Now the writing speed goes down dramatically, and 40s later the program is nearly blocked (pcpu is less than 2%). After 5 minutes I started a dumpsys (Thu Apr 14, 13:28, vmzcore.4 666MB + 18MB vmunix.4). No further vmstat output is available because that terminal was hanging. It seems that only 5 RADs are involved, which would be an explanation for having no such trouble on a GS160, which has only 4 RADs.
Terribly slow; ps aux shows over 90% pcpu for kernel activity.
13:16    cycle0: 16th GB stalls, top<2%  vsz=61.2GB
14:05    cycle0: 27th GB pcpu=2.4% rss=29G kernel=87%cpu26 free=90GB
14:49    cycle0: 37th GB pcpu=1.8% rss=39G kernel=97%cpu26 free=80GB
14:56    Ctrl-C -> swap=249
Example logs (April 19th):
Did some tests with the patch, see vmbug4.log.
Example logs (Jun 11th), vm_overflow=1:
Working condition, 68GB stated as free. According to the sys_check recommendations, ubc_borrowpercent=20 (pages = 105K/RAD, 3355K total), vm_page_free_target=2048 and vm_page_prewrite_target=4096 were set. Run with runon on rad26.
FreePagesDiagram
Fig 3: Free pages of involved RADs over time during the experiment, sources: gnuplot-file and vmstat-R-log; values above 1K are rounded to kilopages.
20GB OK (not shown). 40GB OK (not shown). 70GB runs until the free pages of RAD2 (the most distant RAD) are used; actu.rad26 is ok; after 64GB paging starts; stopped with ^C. On logfile line 9440 RAD26.free jumps from 2K to 3.8K. Why? Transferred to RAD5? Only the actu value of RAD26 (the home RAD) goes down after all available free pages are consumed. The largest part of the ubc pages appears as inactive pages. The other RADs do not give up ubc pages. Not giving borrowed ubc back is not expected on usual shared-memory systems. Please keep in mind that the first page-outs appear before borrowed ubc pages are given back. This is in contrast to the sys_attrs_vm man page. Also notice that paging is remarkably high when RAD26.actu goes below ubc_borrowpercent.
Paging (not swap, but mmap?) of other processes doesn't stop! (pcpu still 2%..60% for some users after 30min; another bug?)
Example logs (Jun 15th):
After a reboot, on a clean machine, the UBC was filled by reading files from disk using dd (slow, because fragmented files were read in parallel).
Tru64-ubc-problem
Fig 4: filling the ubc and starting a 20GB memeater causes nonregular paging, sources: gnuplot-file and vmstat-log

Tru64-ubc-problem
Fig 5: right part of Fig. 4; ubc filled, starting a 20GB memeater causes nonregular paging, sources: gnuplot-file and vmstat-log


Workaround: kernel now patched, switched to 32 CPUs/RAD (1 RAD only), no UBC problems expected.
Tru64-32CPUs-1RAD
Fig 6: filling the ubc and starting a 20GB, an 80GB and an additional 20GB memeater without excessive paging, sources: gnuplot-file and vmstat-log

Tru64-32CPUs-1RAD-paging
Fig 7: forcing regular paging just for fun; pin is much bigger than pout, sources: gnuplot-file and vmstat-log

Example logs (Jul 04th):
Tru64-32RADs-paging
Tru64-32RADs-paging excerpt
Fig 8: after the new patch was applied, paging happens later, but the reason is unclear to me because vm_page_free_min=20 was not hit (RAD16.free was 2418 on the first page-out; other RADs had 19K..275K free) and there was still enough free remote memory, sources: gnuplot-file and vmstat-log

Criticism

Support
1st: Level 3 support (sounds great) was reached, but email communication still goes via Level 1 (?) support. L1 support forwards modified emails without sending a copy to me, so I don't know what information L3 is getting and sending. Everything is filtered. 2nd: One kernel bug has already been found, yet the customer still seems to be categorized as a PEBKAC; support still assumes that the problem is just a tuning problem. Blind confidence in HP software (sys_check).
ITRC Forum
No responses from the development team. Proposals which would improve Tru64 are not taken seriously.
Tru64 Kernel
Undocumented or badly documented RADs (sys_attrs_vm is only valid for 1 RAD). Undocumented kernel parameters (ubc_overflow, cpus_in_rad). Nonsensical default configuration (vm_swap_eager=1). Unnecessary complexity of the kernel (vm_swap_eager, ubc_overflow); less complexity means fewer bugs and easier support. No sources available, which would not be a problem if customers were taken seriously. Swap can be added at runtime but not released at runtime, which makes debugging expensive (reboots).

Joerg Schulenburg, 2005, Administrator (inquire for more details if you want)