We are aware that some people are experiencing code crashes after the update to CentOS-7.5 during the May 2019 HPC maintenance. We are currently trying to identify the issue and determine corrective actions. Updates will be posted here as they become available.
Update, 05/21/2019 4:07 PM: The code crashes are directly related to the update to CentOS-7.5. The exact cause of the problem remains unknown. Not all code is crashing, but more codes are probably susceptible than we are currently aware of. Reverting a node to CentOS-7.4 does mitigate the issue, and this is currently being explored as a solution.
Update, 05/22/2019 8:23 PM: To mitigate the problem with certain codes crashing after the update to CentOS-7.5, we are rolling compute nodes back to the image used just before the maintenance, which is CentOS-7.4. This is being done via opportunistic reboot jobs, so no running jobs will be affected. Until all nodes are rebooted, codes may still crash if a job lands on a node that has not completely freed up to allow the reboot job to take hold, i.e., shared nodes. Note that a node may be unavailable for up to an hour after its reboot while all of the configuration processes run.
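While the rollback is in progress, it may help to record which OS release a job actually ran on. The following is a minimal sketch, not an official tool: it assumes the node exposes the standard /etc/centos-release file with a line such as "CentOS Linux release 7.5.1804 (Core)", and the function names are hypothetical, chosen only for illustration.

```python
import re
from pathlib import Path


def parse_centos_release(text):
    """Return (major, minor) parsed from a CentOS release string, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def node_status(text):
    """Classify a node by its release string (labels are illustrative)."""
    version = parse_centos_release(text)
    if version is None:
        return "unknown"
    if version == (7, 5):
        return "not yet rolled back"
    if version == (7, 4):
        return "rolled back"
    return "other"


if __name__ == "__main__":
    # Assumption: the release file lives at the standard CentOS location.
    release_file = Path("/etc/centos-release")
    text = release_file.read_text() if release_file.exists() else ""
    print(node_status(text))
```

Running this at the start of a job and saving the output alongside the job log would make it easier to tell whether a crash happened on a node still running CentOS-7.5.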