Memory limits for jobs sharing nodes


When running jobs on the Helium cluster, it is important that they fit within the memory of the machines. This is especially important when jobs share compute nodes with other jobs, so that contention between jobs is kept to a minimum. There have been many cases where jobs exceeded the memory guidelines and caused other people's jobs to be aborted because the compute node ran out of memory.

As such, the Helium technical team has determined that per-process memory limits are needed for jobs in the UI and all.q queues that do not request whole compute nodes. For the 'smp/smp1' parallel environments, the limit will be determined by the number of job slots requested. For the 'orte' parallel environment and for serial job submissions, it will be determined by the memory per core of the machine. The value is also affected by a request for the 144G memory machines via '-l big_mem=true'. The initial limit will be set to a value that should work on all node types, but it may need to be adjusted later.
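The rule described above can be sketched in a few lines of Python. This is only an illustration, not the scheduler's actual configuration: it assumes the 'smp/smp1' limit scales as requested slots times memory per core, and the function name, the 12 cores per node, and the 24G standard node size are hypothetical values chosen for the example. Only the 144G figure comes from this notice.

```python
def per_process_limit_gb(node_mem_gb, cores_per_node, slots=1, parallel_env=None):
    """Estimate a per-process memory limit in GB (hypothetical helper).

    Assumption: for 'smp'/'smp1' the limit grows with the number of
    requested slots; for 'orte' and serial jobs it is one core's share.
    """
    mem_per_core = node_mem_gb / cores_per_node
    if parallel_env in ("smp", "smp1"):
        return mem_per_core * slots
    # 'orte' and serial submissions: memory per core only
    return mem_per_core


# Hypothetical node shapes; only the 144G size is taken from the notice.
STANDARD_NODE_GB = 24   # assumed value for illustration
BIG_MEM_NODE_GB = 144   # requested with '-l big_mem=true'
CORES_PER_NODE = 12     # assumed value for illustration

serial_limit = per_process_limit_gb(STANDARD_NODE_GB, CORES_PER_NODE)
smp_limit = per_process_limit_gb(BIG_MEM_NODE_GB, CORES_PER_NODE,
                                 slots=4, parallel_env="smp")
print(serial_limit, smp_limit)
```

Under these assumed numbers, a serial job would be limited to one core's share of a standard node, while a 4-slot 'smp' job on a big-memory node would get four cores' share.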

This change will take effect on March 22, 2013. If a job process exceeds this limit, the job will be aborted. This is being done to protect the compute nodes as well as the jobs running on them. For jobs that already fit the memory constraints of the Helium nodes, nothing will change. The SGE parallel environments that request whole nodes will not be affected by this change. Jobs that do not fit the memory constraints of the Helium nodes will need different job requests so that they ask for the proper resources. The technical team believes this will improve system stability and increase overall productivity.
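For a job that does not fit, one way to pick a different request is to estimate how many 'smp' slots would cover the process's memory need. The sketch below is a rough aid only: it assumes the limit equals requested slots times memory per core, and the function name and node shapes (12 cores, 24G standard nodes) are hypothetical. Check the actual node specifications before relying on it.

```python
import math


def min_smp_slots(process_mem_gb, node_mem_gb, cores_per_node):
    """Return the smallest slot count whose assumed limit covers the
    process, or None if the process cannot fit on this node type.

    Assumption: limit = slots * (node memory / cores per node).
    """
    mem_per_core = node_mem_gb / cores_per_node
    slots = max(1, math.ceil(process_mem_gb / mem_per_core))
    if slots > cores_per_node:
        return None  # needs more than the whole node offers
    return slots


# Hypothetical example: a process needing 30G of memory.
on_standard = min_smp_slots(30, node_mem_gb=24, cores_per_node=12)
on_big_mem = min_smp_slots(30, node_mem_gb=144, cores_per_node=12)
print(on_standard, on_big_mem)
```

With these assumed values, the process does not fit on a standard node at all, which suggests adding '-l big_mem=true' to the request rather than asking for more slots.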

For more information, see this Wiki entry for a discussion of the memory requirements on Helium nodes.


If you have any questions or concerns about this, please contact us at: