Update: 2018-11-15, 11:00 A.M.
All compute nodes except those with existing hardware issues have been returned to production. Jobs appear to be running normally, but we are continuing to monitor. Thank you for your patience.
Update: 2018-11-14, 6:32 P.M.
Queued jobs have started to run. 32 nodes are still out of production.
Update: 2018-11-14, 6:16 P.M.
We are continuing to work to restore the cluster to service. We anticipate that most of the cluster will be restored by 7:00 P.M. Thank you for your patience, and we apologize for the inconvenience.
Update: 2018-11-14, 2:54 P.M.
All jobs that were running earlier will be lost because of the QMaster restart. Jobs that were waiting in the queue will remain queued. All queues are disabled until the problem is solved. Again, we apologize for the inconvenience to our users.
Update: 2018-11-14. The SGE service is currently down. This is the same issue we encountered on November 4, 2018. We are working to restore service.
Update: 2018-11-04. SGE service was restored at 12:14 P.M. The SGE database became corrupted when the head node crashed and had to be repaired. Some jobs that were running may have finished without recording anything to the accounting file. Other jobs may have been lost because the queue master lost track of them.
The SGE service is currently down. We are working to restore service.