Several changes to the current queuing system policies have been proposed and approved by the HPC Policy Committee. The HPC team plans to implement these changes at the next maintenance, tentatively scheduled for November 6th, unless significant concerns are raised by the community. The changes are:
1. Qlogin will default to a 24-hour wall clock limit, which may be overridden by the user.
2. Sandbox queues will have a hard 24-hour wall clock limit.
3. Total slot (processor core) limits will become permanent in UI queues. On Helium this will remain at 516 cores and on Neon it will be implemented at 320 cores.
4. The maximum number of running jobs in UI-HM on Neon will be reduced to two.
5. Jobs requesting scarce resources (such as Xeon Phi and K20 accelerators) will receive a modest increase in job priority to make them easier to schedule.
6. Sandbox queue will have access to Xeon Phi and Nvidia K20 accelerator cards.
7. ***Postponed*** Jobs in the UI queue that use multiple nodes will have a 72-hour wall clock limit. This change will not be implemented at this time; the HPC operations team will be soliciting additional feedback from the user community.
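For change 1, the qlogin default can be overridden at submission time. A minimal sketch, assuming standard Grid Engine syntax; the exact resource name and the maximum value the scheduler will honor depend on the site's queue configuration:

```shell
# Request a 12-hour interactive session instead of the 24-hour default.
# -l h_rt is the standard Grid Engine hard wall clock resource; whether a
# given value is accepted depends on the site's queue configuration.
qlogin -l h_rt=12:00:00
```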
Why are these changes being implemented?
1. Qlogin – It is common for individuals to forget to log out of qlogin sessions, tying up resources that they are not using. This change limits how long an idle session can hold resources.
2. Sandbox queue – These queues are designed for prototyping, and long-running jobs should not be run there. This change will prevent any job from running for more than 24 hours.
3. Total slot count – This change prevents any single individual from using more than ~25% of the total available cores in the high priority UI queue. This is designed to improve fairness among users.
4. Neon UI-HM – There are only four nodes in this queue. To keep one individual from monopolizing the queue, the maximum number of simultaneous running jobs is being reduced to two.
5. Scarce Resources – The Neon system includes various scarce resources such as Xeon Phi accelerators, Nvidia K20 accelerators, and high memory configurations. To encourage these nodes to be used by jobs that need these special resources, a modest increase in scheduling priority is being given to jobs that request them.
6. Accelerator Access – There has recently been increased interest in accelerator cards and in prototyping with these resources. To facilitate more rapid access, one node with a Xeon Phi and one node with an Nvidia K20 are being placed into the sandbox queue.
7. UI Wall Clock Limit – This change is being postponed until additional community feedback has been collected. We have seen an increasing number of users running large core count jobs for long periods of time, which degrades turnaround time for other users. Additionally, users with multicore jobs generally have the ability to checkpoint and restart their jobs; this is a best practice, and users will be required to use such mechanisms on other supercomputing resources. Single node jobs are exempted because most of those applications do not support checkpoint/restart. Single node jobs are also limited to 25 running jobs on Helium and 10 on Neon, which helps balance node availability between large and small core count jobs. If implemented, this change will be carefully monitored to ensure that it does not lead to an increase in the number of high throughput (low core count) jobs running at the expense of large core count jobs.
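The checkpoint/restart practice described above can be sketched in a job script: work is broken into units, and progress is recorded after each unit so that a job killed at the wall clock limit can resume where it left off. A minimal sketch; the state file name and the simulated work loop are illustrative, not a site-provided mechanism:

```shell
#!/bin/sh
# Illustrative checkpoint/restart pattern for a wall-clock-limited job.
STATE=checkpoint.dat

# Resume from the last recorded step, or start fresh.
if [ -f "$STATE" ]; then
    step=$(cat "$STATE")
else
    step=0
fi

# Simulated work loop: record progress after each unit of work, so a job
# killed at the wall clock limit loses at most one unit.
while [ "$step" -lt 5 ]; do
    step=$((step + 1))
    echo "$step" > "$STATE"
done

echo "completed step $step"
```

In practice this would be paired with resubmission (for example, a dependent job held with qsub -hold_jid) so the next segment starts automatically after the previous one ends.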
Frequently Asked Questions
What is a wall clock limit? – A wall clock limit is the maximum amount of real time that a job may run. For example, if you start a job at 1 PM on October 1st under a 24-hour wall clock limit, the job will be allowed to run until 1 PM on October 2nd, at which point it will be killed.
How will these changes affect me if I run small core count jobs? – Unless you are using qlogin (which is not recommended for most production jobs) or are running in the sandbox queue, you are not likely to notice these changes.
How will these changes affect me if I run large core count jobs (more than ~24 cores)? – Our goal is that you will see the amount of time before your job starts decrease in the UI queue. The proposed 72-hour wall clock limit for multi-node jobs has been postponed pending additional community feedback; if it is adopted, implementing a checkpoint/restart mechanism in your code will be important if you wish to run for longer. Depending on the size of your jobs, you may also not be able to run as many simultaneous jobs in the UI queue due to the new core count limit on the Neon system.
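Where a wall clock limit does apply, one common pattern is to split a long run into limit-sized segments and chain them with job dependencies, so each segment starts when the previous one finishes. A hedged sketch assuming standard Grid Engine options (-terse, -hold_jid, -l h_rt); segment.sh is a hypothetical checkpoint-aware job script, and the function falls back to a dry run where qsub is unavailable:

```shell
#!/bin/sh
# Sketch: chain dependent batch jobs, each within a 72-hour wall clock.
# segment.sh is a hypothetical, checkpoint-aware job script.
submit_chain() {
    segments=$1
    if ! command -v qsub >/dev/null 2>&1; then
        echo "dry-run: would submit $segments dependent segments"
        return 0
    fi
    # -terse prints only the job ID; -hold_jid makes each segment wait
    # for the previous one to finish before it becomes eligible to run.
    jid=$(qsub -terse -l h_rt=72:00:00 segment.sh)
    i=2
    while [ "$i" -le "$segments" ]; do
        jid=$(qsub -terse -hold_jid "$jid" -l h_rt=72:00:00 segment.sh)
        i=$((i + 1))
    done
    echo "submitted $segments segments, last job id $jid"
}

submit_chain 4
```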
Are you changing all.q? – At this time no changes are being implemented in all.q. We continue to encourage high throughput users to leverage this queue and at present there are no limits on the number of jobs that can be run. The policy committee has requested that an investigation of a fair share scheduling algorithm occur to determine if this could help improve fairness in this queue.
If you have questions or feedback on these queuing policy changes, please contact firstname.lastname@example.org.