At 11:18 this morning, argon-login-1 experienced a kernel panic and rebooted. Users who were logged in to this node had their sessions disconnected. The login node is back in service, and the cause is being investigated.
Update - 7:18 P.M. We believe we have isolated the issue. Queues have been re-enabled and jobs are running normally at this time.
We are currently seeing issues with the scheduler on Neon. For the time being, all queued jobs are being held.
Update - Maintenance on the Argon and Neon HPC systems is now complete, and the clusters are available for use. Thank you for your patience.
The next scheduled maintenance window for the HPC systems runs from 8 A.M. on August 16, 2017 to 8 A.M. on August 17, 2017.
Update - 10:21 A.M. Service to the Neon cluster has been restored and jobs are running normally. The datacenter operations team has confirmed that a brief power outage earlier this morning caused the issue.
Thank you for your patience.
The root cause of the slowness has been identified, and a fix is being pushed out. Unfortunately, the fix requires a reboot of the GPU nodes. The reboots are in progress via high-priority jobs, each of which reboots a node once it has no more jobs on it.