Virtual memory problem on Tru64 (May05-Sep05)

What's the problem? (in short)

We own a GS1280 from HP, which has 128GB of memory. If one uses more than about 15GB of memory (depending on file I/O), the machine becomes very slow; some processes can hang or get killed because the swap fills up, although there is still a lot of free memory and borrowed UBC (30-80GB).

The problem is fixed by the latest kernel patch. Only some minor problems are left, and those can be handled. Thanks to Han (Sep05).

What's this page for?

First, it's for users asking me why they are disturbed while using our "super"-computer. Second, it's for our colleagues who just want to know what game HP is playing with its customers. Third, it's a hopefully helpful report for Tru64 users or other interested people with similar problems. Last but not least, it's for me, to remember how frustrating it is to communicate with HP's software support. Feel free to correct me if I am completely wrong.

News (May 24, 2005)

First we set vm_page_prewrite_target to 32K pages to avoid other effects.
We also want UBC to have the lowest priority (ubc_borrowpercent=ubc_minpercent=0).
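
For reference, this is how such attributes can be changed at runtime with sysconfig; a minimal sketch, assuming all three attributes live in the vm subsystem as the sys_attrs_vm man page suggests (runtime changes are lost on reboot; sysconfigdb makes them permanent):

  sysconfig -r vm vm_page_prewrite_target=32768
  sysconfig -r vm ubc_borrowpercent=0 ubc_minpercent=0
  sysconfig -q vm    # query the current vm attribute values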

What should happen if (free_pages < 2*vm_page_prewrite_target)?

 1. get pages from ubc                   (very fast)
 2. steal free pages from other RADs          (fast)
 3. get borrowed pages from ubc of other RADs (fast)
 4. swap out pages                            (slow)

But what actually happens:
 1. steal free pages from other RADs (fails without the vm_overflow patch if >3 RADs)
 2. get pages from ubc, but too slowly (not quite sure; fixed with vm_overflow?)
 3. _no_ page stealing from the ubc of other RADs -- that's the big problem!
 4. swapping (starts already at step 2)
   => swapping while a lot of free memory (incl. borrowed ubc)
      is still available on the other RADs
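
The following small C program is only an illustrative model of the two orderings above, not kernel code; all names and numbers in it are invented for illustration:

#include <stdio.h>

enum src { UBC_LOCAL, FREE_REMOTE, UBC_REMOTE, SWAP };

static long available (enum src s)
{
  /* toy numbers for the 16GB scenario: local ubc and remote free
     lists exhausted, plenty of borrowed ubc left on remote RADs */
  switch (s) {
  case UBC_LOCAL:   return 0;
  case FREE_REMOTE: return 0;
  case UBC_REMOTE:  return 1L << 20;  /* borrowed ubc pages */
  default:          return 1L << 30;  /* swap is always "available" */
  }
}

static enum src reclaim (int skip_remote_ubc)
{
  enum src order[] = { UBC_LOCAL, FREE_REMOTE, UBC_REMOTE, SWAP };
  int i;
  for (i = 0; i < 4; i++) {
    if (skip_remote_ubc && order[i] == UBC_REMOTE)
      continue;                       /* the observed bug */
    if (available (order[i]) > 0)
      return order[i];
  }
  return SWAP;
}

int main (void)
{
  /* prints "intended: 2  buggy: 3" -- 2=remote ubc, 3=swap */
  printf ("intended: %d  buggy: %d\n", reclaim (0), reclaim (1));
  return 0;
}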

Workaround, but with slow file I/O:
 set ubc to its minimum by setting ubc_maxpercent=1 (not desired)
 set vm_page_prewrite_target large enough (32K)

Solution:
 kernel patch

Details

C program to reproduce the misbehavior. Log file for our GS1280, Tru64 V5.1B-PK4 with the UBC patch installed. Please see below for a newer and better version. Swap (36GB) is configured in lazy mode, because providing more swap than memory (128GB), as the eager mode requires, would be a waste of resources. If you have questions, ask Joerg Schulenburg, who detected these problems and reported them to HP. You need a C compiler, vmubc and Tru64 to reproduce the kernel bug. Here are the facts:

Updated: All together I think the bug is in the page-steal algorithm. If the free memory of a RAD runs low, for some unknown reason pages cannot be stolen from the other RADs which still have free pages, and the vm_rss_block_target limit triggers blocking of the process. Some pages of that RAD marked inactive are paged out. This happens again and again, letting the swap grow. Together with the weakness of Tru64 of never moving a swapped page back to free memory, you can end up with a swapping machine and lots of free memory as the worst case. That way the buggy RAD system makes an HP NUMA machine with 128GB memory and 32 CPUs as slow as a PC with 4GB memory and 128GB swap for big-memory jobs. I found no useful documentation and no system variables concerning page stealing from other RADs. HP's support did fix this first problem, but there are more: a problem with giving back borrowed UBC still causes the machine to swap when it is not necessary. Any help is welcome.

History of the bug

I wrote this report because communication with HP's support is a nightmare: buying an expensive machine, paying a lot of money for support, and getting insufficient help. This is an excerpt of the events which may be connected with the kernel bug.

Explanations

RAD (Resource Affinity Domain)
To make Tru64 aware of the NUMA architecture, there is one RAD per CPU, with 4GB of memory belonging to that CPU. Each RAD has its own paging scheduler. If the memory of a RAD is consumed, memory pages are stolen from the neighbouring RADs. The application should not notice this, apart from a slightly slower access time to the nonlocal memory. So, from the user's point of view, memory appears flat, as expected on NUMA machines just like on real SMP machines (HP's support told me otherwise, but I am sure they are wrong).
UBC (Unified Buffer Cache)
To speed up (slow) disk operations, (faster) memory can be used as a buffer cache. Part of the UBC is called borrowed, so that otherwise unused memory can help to speed up disk operations. Borrowed UBC is given back under low-memory conditions. Paging should not occur as long as there are borrowed UBC pages, as stated in the sys_attrs_vm man page. Because the RAD implementation is buggy, this fails on NUMA systems.
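The free, inactive and UBC values in the logs and figures below were watched with the following tools (the options are sketched from memory; check the man pages):
  vmstat -R 5    # per-RAD memory statistics, every 5 seconds
  vmubc          # graphical monitor for virtual memory and UBC usage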
nmadvise
is a routine within the numa library. It allows telling the kernel to distribute the allocated memory pages across a list of given RADs. It's the only way to work around the kernel bug. You also have to prevent other processes from stealing memory from one of these RADs, so all processes have to care about the NUMA specifics. I think it's fine that this fine-tuning is possible, but telling the user that he has to use this specific interface all the time is not amusing.
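A rough sketch of how such a call could look; this is written from memory, so the memalloc_attr_t field names, the MPOL_* policy constant and the exact nmadvise() signature are assumptions that must be checked against nmadvise(3) and numa_intro(3):

#include <string.h>  /* memset */
#include <numa.h>    /* radset_t, memalloc_attr_t; link with -lnuma */

/* HYPOTHETICAL sketch: spread an already allocated buffer over
   RADs 24..27; the mattr_* fields and constants are assumptions. */
void spread_over_rads (void *buf, size_t len)
{
  radset_t rs;
  memalloc_attr_t attr;
  int rad;

  radsetcreate (&rs);
  rademptyset (rs);
  for (rad = 24; rad <= 27; rad++)
    radaddset (rs, rad);              /* the RADs the job may use */

  memset (&attr, 0, sizeof (attr));
  attr.mattr_policy = MPOL_STRIPPED;  /* assumption: stripe pages over rs */
  attr.mattr_radset = rs;             /* assumption */

  nmadvise (buf, len, MADV_CURRENT, &attr);  /* assumption: check the return code */
  radsetdestroy (&rs);
}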
Tru64
is, as far as I know, developed by DEC as Digital UNIX. DEC was assimilated by Compaq in 1998, and in 1999 the UNIX was renamed by Compaq. Compaq was acquired by HP in 2002.
GS1280
is a multiprocessor system based on Alpha chips which have a very good interprocessor connection, making nonlocal memory access very fast. Every Alpha chip has 4GB of local memory and four links to its neighbouring processors and their memory. This is a 2D NUMA concept. 32 CPUs are connected in a 4x8 torus and can access the whole 128GB of memory (that's the theory).
pitfalls
Hunting the bug, I did some stupid things. I added system() calls to the test program, calling date, ps, vmstat and swapon after writing every GB. That's a bad idea, because system() calls fork and exec, and fork duplicates the process including its memory. I saw the system calls becoming very slow, and of course that influenced the measurements. You also have to define __mingrowfactor (see man malloc) to prevent libc from allocating 10% of the actual process memory instead of the wanted size, which can have additional side effects.
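A safer pattern for the timing part, as a minimal sketch: take wall-clock timestamps in-process instead of spawning external tools. This also catches time the process spends blocked, which clock() in the test program below (CPU time only) does not:

#include <stdio.h>
#include <sys/time.h>

/* wall-clock seconds; no fork/exec side effects */
static double wallclock (void)
{
  struct timeval tv;
  gettimeofday (&tv, NULL);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main (void)
{
  double t0 = wallclock ();
  /* ... write one GB here ... */
  printf ("elapsed %.3f s\n", wallclock () - t0);
  return 0;
}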
Test program vmbug4.c:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>  /* memset */
#include <assert.h>
#include <time.h>

#define ALLOC_SIZE (1024 * 1024 * 1024)  /* 1GB chunk */

int main (int argc, char *argv[]) {
  int index, chunks, cycle;
  char **buf;   
  clock_t t1, t2;

  if (argc != 2) { printf(" usage: %s memory[GB]\n", argv[0]); exit (1); }
  chunks = atoi (argv[1]);  setvbuf (stdout, NULL, _IONBF, 0);
  buf = malloc (chunks * sizeof (char *));  assert(buf);

  printf (" alloc memory:");
  for (index = 0; index < chunks; index++) {
    buf[index] = (char *) malloc (ALLOC_SIZE); assert (buf[index]);
    printf (" %dGB", index);
  }

  for (cycle = 0; cycle < 200 ; cycle ++ )  {
     printf ("\n fill memory cycle=%d:",cycle);
     for (index = 0; index < chunks; index++) {
       t1 = clock();
       /* zero-fill the chunk to force allocation of its pages */
       memset (buf[index], 0, ALLOC_SIZE);
       t2 = clock();
       /* clock() counts CPU time (not wall time) in CLOCKS_PER_SEC units */
       printf (" %dGB %dms", index, (int) ((t2 - t1) / (CLOCKS_PER_SEC / 1000)));
     }
  }
  printf ("\n");
}
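To reproduce, compile the program and pin it to one RAD with runon, as used for the logs below (RAD number and size are just examples):

  cc -o vmbug4 vmbug4.c
  runon -r 26 ./vmbug4 16    # touch 16GB, pinned to RAD 26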
Example logs (April 14th):
Here are the results from before the vm_overflow patch was installed. After a reboot, the test was started using 1, 3, 4, 5, 6, 8, 12, 13, 14, 15, 16, 20 and 60GB. The problem becomes hard from 16GB upward. I started the program with runon -r 26 and observed the activity using xmesh, vmstat -R, top and ps aux.
 
CPU array, neighbours of RAD 26:
   17  19  21
   24 (26) 28
   25  27  29

FreePagesDiagram
Fig 1: Free pages of involved RADs over time during the experiment, sources: gnuplot-file and vmstat-R-log
Fig 1 shows a 16GB test. First, RAD26 hits the limit of 2*vm_page_prewrite_target=3072(?), where page stealing from RAD27 (lower neighbour) starts. But further pages are taken from the RAD26 free list, which probably causes the filling of the inactive list (purpose not clear). When RAD27 has given its free pages to RAD26, free pages from RAD24 (left neighbour) get stolen. After that, pages are stolen from RAD28 (right neighbour). The inactive lists of RAD26, 27 and 24 start above zero because the previous runs left them populated. After 40 seconds, four RADs of 4GB each are filled, but the inactive lists still grow for an unknown reason. 25K pages were stolen from RAD19 (upper neighbour). Only 133 pages are left in the free page list of RAD26. After 100s I killed the memeater program. So why is the inactive list populated? Does it have an important influence on the misbehavior shown below?
FreePagesDiagram
Fig 2: Free pages of involved RADs over time during the experiment, sources: gnuplot-file and vmstat-R-log
After the 4th RAD (RAD28) is filled, the number of free pages on RAD26 drops below 40 pages and paging starts (4*4GB=16GB touched). Until here, every GB is written in about 1-3 seconds. Now the writing speed goes down dramatically, and 40s later the program is nearly blocked (pcpu is less than 2%). After 5 minutes I started a dumpsys (Thu Apr 14, 13:28, vmzcore.4 666MB + 18MB vmunix.4). No further vmstat output is available because that terminal was hanging. It seems that only 5 RADs are involved, which would be an explanation for having no such trouble on a GS160, which has only 4 RADs.
Terribly slow; ps aux shows over 90% pcpu for kernel activity.
13:16    cycle0: 16th GB stalls, top<2%  vsz=61.2GB
14:05    cycle0: 27th GB pcpu=2.4% rss=29G kernel=87%cpu26 free=90GB
14:49    cycle0: 37th GB pcpu=1.8% rss=39G kernel=97%cpu26 free=80GB
14:56    Ctrl-C -> swap=249
Example logs (April 19th):
Did some tests with the patch, see vmbug4.log.
Example logs (Jun 11th), vm_overflow=1:
Working condition, 68GB stated as free. According to the sys_check recommendations, ubc_borrowpercent=20 (pages = 105K/RAD, 3355K total), vm_page_free_target=2048 and vm_page_prewrite_target=4096 were set. Run with runon on rad26.
FreePagesDiagram
Fig 3: Free pages of involved RADs over time during the experiment, sources: gnuplot-file and vmstat-R-log; values above 1K are rounded to kilopages.
20GB OK (not shown). 40GB OK (not shown). 70GB runs until the free pages of RAD2 (the most distant RAD) are used; actu.rad26 is ok; after 64GB paging starts; stopped with ^C. On logfile line 9440 RAD26.free jumps from 2K to 3.8K. Why? Transferred to RAD5? Only the actu value of RAD26 (the home RAD) goes down after all available free pages are consumed. The largest part of the ubc pages appears as inactive pages. The other RADs do not give up ubc pages. Not giving borrowed ubc back is not expected on usual shared-memory systems. Please keep in mind that the first page-outs appear before borrowed ubc pages are given back. This is in contrast to the sys_attrs_vm man page. Also notice that paging is remarkably high when RAD26.actu goes below ubc_borrowpercent.
Paging (not swap, but mmap?) of other processes doesn't stop! (pcpu still 2%..60% for some users after 30min; another bug?)
Example logs (Jun 15th):
After a reboot, on a clean machine, the UBC was filled by reading files from disk using dd (slow, because fragmented files were read in parallel).
Tru64-ubc-problem
Fig 4: filling the ubc and starting a 20GB memeater causes nonregular paging, sources: gnuplot-file and vmstat-log

Tru64-ubc-problem
Fig 5: right part of Fig. 4; ubc filled, starting a 20GB memeater causes nonregular paging, sources: gnuplot-file and vmstat-log


Workaround: kernel now patched, switched to 32 CPUs/RAD (1 RAD only), no UBC problems expected.
Tru64-32CPUs-1RAD
Fig 6: filling the ubc and starting a 20GB, an 80GB and an additional 20GB memeater without excessive paging, sources: gnuplot-file and vmstat-log

Tru64-32CPUs-1RAD-paging
Fig 7: forcing regular paging just for fun; pin is much bigger than pout, sources: gnuplot-file and vmstat-log

Example logs (Jul 04th):
Tru64-32RADs-paging
Tru64-32RADs-paging excerpt
Fig 8: after the new patch was applied, paging happens later, but the reason is unclear to me because vm_page_free_min=20 was not hit (RAD16.free was 2418 on the first page-out; other RADs had 19K..275K free) and there was still enough free remote memory, sources: gnuplot-file and vmstat-log

Criticism

Support
1st: Level 3 support (sounds great) was reached, but email communication still goes via Level 1 (?) support. L1 support forwards modified emails without sending a copy to me, so I don't know what information L3 is getting and sending. Everything is filtered. 2nd: One kernel bug has already been found, yet the customer still seems to be categorized as a PEBKAC; support still assumes that the problem is just a tuning problem. Blind confidence in HP software (sys_check).
ITRC Forum
No responses from the development team. Proposals which would improve Tru64 are not taken seriously.
Tru64 Kernel
Undocumented or badly documented RADs (sys_attrs_vm is only valid for 1 RAD). Undocumented kernel parameters (ubc_overflow, cpus_in_rad). Nonsensical default configuration (vm_swap_eager=1). Unnecessary complexity of the kernel (vm_swap_eager, ubc_overflow); less complexity means fewer bugs and easier support. No sources available, which would not be a problem if customers were taken seriously. Swap can be added at runtime but not released at runtime, which makes debugging expensive (reboots).

Joerg Schulenburg, 2005, Administrator (inquire for more details if you want)