Issues Encountered During Maintenance - See below for updates
- 6:00PM November 22, 2016 - The data transfer node and Globus endpoint re-entered production about 4:00PM. Note that the present maintenance is intended to allow this system to remain available after Helium is decommissioned.
- 5:00PM November 22, 2016 - The Neon system was put back in production around 4:15PM on November 22, 2016. This was past the normal maintenance window, and the delay was caused by an unexpected issue in the tool we use for deploying images to compute nodes. We try to minimize transfers where possible by simply updating the existing image, but in this case it was necessary to deploy a new image to each node. That deployment is supposed to work in a scalable way intended for large HPC systems. Instead, the scalability mechanism failed completely and all transfers were funneled through a central server, which made it impossible to deploy the images in a timely fashion. The tool is used on a regular basis, but not at full scale, so the problem did not appear until the process was run across the whole system. The extra maintenance time was used to engineer a workaround for this scalability issue. We appreciate everyone's patience.
- 12:30PM November 22, 2016 - Progress is being made on restoring the system. At this time we continue to anticipate that the system will reenter production by 5:00PM today. Depending on the pace of restoration the system may come up without all compute nodes available.
- 12:20AM November 22, 2016 - More details of the compute node issues have been determined but restoration of service continues to be a slow process. At this time we continue to anticipate an extension of the maintenance until 5:00PM today.
- 11:00PM November 21, 2016 - The Helium HPC system is back in production.
- 9:10PM November 21, 2016 - Issues with restoring compute nodes to service are occurring on the Neon HPC system. At this time we are extending the maintenance until 5:00PM November 22, 2016. We apologize for the inconvenience. At this time we anticipate Helium will be available prior to the end of scheduled maintenance.
- 7:30PM November 21, 2016 - Issues have been encountered restoring the Helium Data Transfer node/Globus Online, and we do not expect service to be restored until after the standard maintenance window is complete. Note: This is not a heavily used service, so most users will not be impacted by this issue.
The next scheduled maintenance window for the HPC systems is from 7:00AM November 21, 2016 to 7:00AM November 22, 2016.
During this time the Helium and Neon systems will be offline for routine maintenance. All jobs running at the beginning of the maintenance will be killed.
The following packages will be removed from Neon during this time unless concerns are raised by the user community:
- cfitsio-3.370 - This was a dependency of root and gdal, both of which now use cfitsio-3.390.
- pythia-8.210 - This was a dependency of root, which now uses pythia-8.219.
- nco-4.2.3 - This is an old version with known issues. The newer 4.4.9 and 4.6.1 versions are available.
- Platypus/gcc48_0.8.1 - This does not have a proper python installation and is replaced by Platypus/0.8.1_python-2.7.12_intel-composer_xe_2015.3.187.
- boost-1.60.0 - This version is not compatible with the stock CentOS-6 compiler, gcc 4.4.7, and was built with gcc48 as a workaround. There are now patches for 1.60.0, but removing this version and keeping the set of 5 builds of boost-1.61.0 is a better option.
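
Users who want to check whether their job scripts still refer to any of the packages listed above could use something like the following minimal Python sketch. It is only an illustration: the function name `find_removed_packages` is hypothetical, the suggested replacements are copied from the list above, and the exact strings that appear in a given job script (for example in a `module load` line) are an assumption that may not match how the packages are actually loaded on Neon.

```python
# Removed package -> suggested replacement, taken from the removal list above.
# ASSUMPTION: job scripts mention packages by exactly these names; adjust the
# keys if your scripts refer to them differently.
REMOVED = {
    "cfitsio-3.370": "cfitsio-3.390",
    "pythia-8.210": "pythia-8.219",
    "nco-4.2.3": "nco 4.4.9 or 4.6.1",
    "Platypus/gcc48_0.8.1": "Platypus/0.8.1_python-2.7.12_intel-composer_xe_2015.3.187",
    "boost-1.60.0": "one of the boost-1.61.0 builds",
}


def find_removed_packages(script_text):
    """Return a list of (line_number, removed_name, suggestion) tuples.

    Each tuple marks a line of script_text that mentions a package
    scheduled for removal. Line numbers start at 1.
    """
    hits = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        for name, suggestion in REMOVED.items():
            if name in line:
                hits.append((lineno, name, suggestion))
    return hits
```

Passing the contents of a job script to `find_removed_packages` returns the line numbers that would need updating before the maintenance window. A plain substring match is used to keep the sketch short, so it will also flag names that appear in comments.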