Argon head node down

The Argon HPC system is back online after an outage caused by corruption of the main filesystems on the head node. We apologize for the outage. If you have any questions, please contact research-computing@uiowa.edu.

January 16, 2019 - A root cause analysis of the issues involved has been completed. The findings of the analysis are:

  • The root cause was filesystem corruption caused by unexpected reboots of the head node. The reboots are due in part to faulty hardware and in part to bugs in the version of Linux we are currently running. Neither issue has been mitigated yet, but we have a plan to address both.
  • During the February 7th, 2019 maintenance we will upgrade to a newer software version that is reported to fix a related bug.
  • We have ordered new hardware, which is expected to arrive in 6-8 weeks, and we will install it during the Spring 2019 maintenance.
  • We are also reviewing our restoration processes and have identified several opportunities to speed up restoration if a similar scenario occurs in the future.

 

Outage Updates:

  • 10:00PM January 3, 2019 - The Argon system is back online and login capability has been restored.
  • 8:30PM January 3, 2019 - The system restore is nearly complete and most compute nodes have been reimaged. However, a significant number of compute nodes are experiencing issues. Work continues to resolve these problems and finalize the system restoration.
  • 5:00PM January 3, 2019 - Significant progress has been made on the restoration, but testing is revealing a few remaining issues. At this time we expect the system to be restored tomorrow, January 4th. An email will be sent to all users once the system is restored.
  • 12:00PM January 3, 2019 - Login node reboots were required at this time to allow restoration work to continue.
  • 9:00AM January 3, 2019 - Restoration work is progressing at a slow pace. At this time we do not expect the system to be restored before end of day.
  • 8:00AM January 3, 2019 - A plan is in place to restore the system, but due to the widespread file corruption, we do not yet have an ETA for system restoration.
  • 8:30PM January 2, 2019 - Another important filesystem on the head node appears to be corrupt. We are diagnosing this issue and will likely also need to restore this filesystem.
  • 4:00PM January 2, 2019 - The filesystem was not recoverable and we are in the process of restoring the system from backups. At this time we do not expect the system restore to complete today. We will provide more updates as they become available.
  • 9:00AM January 2, 2019 - The filesystem on the Argon head node has been damaged. Evaluation of the damage is in progress.
  • 7:00AM January 2, 2019 - The head node of the Argon cluster is currently down. This will affect your ability to interact with the queue system as well as software installations. The effect on running jobs, other than on job accounting, is unknown at this time. Updates will be posted here as more information becomes available.