SGE down again due to HTC job overload

The SGE system is currently unresponsive because it is overloaded by several continuous streams of job submissions from scripts.

We will have to find and shut down the scripts responsible and will be reaching out to their owners. If you have high-throughput computing (HTC) jobs, please try to submit them as array jobs rather than as individual job submissions. That is not always possible, but where it is, using array jobs helps everyone.

Update: Job submissions have been disabled until the scheduler can clear some of the backlog.

Update (05/24/2021, 3:17 PM): Job submissions have been enabled and we will continue to monitor the situation. If you can convert your HTC jobs to array jobs, please do so.

Update (05/27/2021, 10:48 AM): SGE has been stable for a few days now, although the conditions that are believed to have triggered the overload on 05/24/2021 have not reappeared. A slight change was made to the scheduler triggers, which might help to stave off an overload situation. We are also working on mitigation strategies and will be communicating those in the near future.

Update (06/02/2021): A new tool has been added to Argon to help create array jobs, along with expanded documentation. See Array Jobs for more information.

Update (06/14/2021): The issue has popped up again, albeit more sporadically.

Without going into technical details, what is happening is that SGE has more to do than it can handle in the time that it has for processing jobs in each scheduling cycle. This has been triggered by an uptick in the number of jobs being submitted via scripts, presumably in a loop. While it is natural to write a script to submit 100 or 1,000 jobs, when several such scripts are running at the same time, the rate of job submission overwhelms SGE. SGE will spend most of its cycles handling the incoming submissions, but eventually it has to break from that to schedule jobs to run on the system. A high rate of job submissions also produces a large number of pending jobs that have to be processed, which makes the scheduler thread take longer to complete. Meanwhile, more jobs are coming in at a high rate, and a snowball effect begins, eventually leading to timeouts and failed commands, including qsub. SGE prioritizes scheduling over other events, so jobs are still being scheduled even though commands are timing out.

The only solution to this is to reduce the rate of job submission. The best way to do that is to use array jobs, as those can reduce, for example, 1,000 job submissions to a single job submission. If the jobs do not have complex dependencies, then it is usually possible to create a task file that contains the list of computations and to submit the task file as an array job. To make it easier to convert scripts that submit many jobs into scripts that create and submit task files, we have added a new tool, called "qbatch", and expanded the documentation for array jobs, including usage of qbatch.
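As a rough illustration of the task-file approach described above, here is a minimal sketch of an array job script. It does not use qbatch. It relies only on the standard SGE "-t" option and the SGE_TASK_ID environment variable. The file name tasks.txt, the TASK_FILE variable, the helper functions, and the range 1-1000 are assumptions made for this example. The range must match the number of lines in your own task file.

```shell
#!/bin/bash
# Sketch of an SGE array job driven by a task file.
# Assumptions (hypothetical, adjust to your case): tasks.txt holds one
# command per line, and the -t range below matches its line count.
#$ -cwd
#$ -t 1-1000

# Print line number $2 of task file $1.
get_task() {
    sed -n "${2}p" "$1"
}

# Run the command found on line $2 of task file $1.
run_task() {
    local cmd
    cmd=$(get_task "$1" "$2")
    if [ -z "$cmd" ]; then
        echo "no task at line $2 of $1" >&2
        return 1
    fi
    bash -c "$cmd"
}

# SGE sets SGE_TASK_ID to the task index for each task of an array job
# (and to "undefined" for non-array jobs), so only run inside an array task.
if [ -n "${SGE_TASK_ID:-}" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
    run_task "${TASK_FILE:-tasks.txt}" "$SGE_TASK_ID"
fi
```

If this were saved as array_job.sh (a hypothetical name), a single "qsub array_job.sh" would take the place of 1,000 separate qsub calls, with each task running one line of the task file.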

If you are submitting a large number of jobs through a script, we ask that you please review the above documentation and convert your submission scripts to submit a task file as an array job. It will make job submission much faster for you, as well as help keep the system responsive for everyone else. We can help with the conversion and answer any questions that you may have.

Update (06/25/2021): A small amount of jitter has been introduced to the submission pipeline to help spread out jobs that are submitted in large numbers concurrently.