All HPC compute nodes are purchased with a five-year warranty. Compute nodes will be allowed to run for up to seven years within the following parameters.
The final two years of a compute node's life are outside warranty, and support during that period is on a best-effort basis.
If the software or operating system can no longer support compute equipment before the end of its seven-year life, the HPC team may, in consultation with the HPC Policy Committee, determine that the life of that equipment is shorter than seven years. Should this occur, the HPC team will strive to give the HPC community at least six months' notice before the equipment is decommissioned.
Compute nodes that fail outside of warranty will not be repaired. The HPC team will attempt to keep investor queues at their purchased capacity, to the extent possible, based on the following process and guidelines.
Investor compute nodes that fail out of warranty will be replaced with compute nodes from the UI queue within the same hardware generation.
- When possible, compute nodes will be replaced with hardware of the same or higher specification. This will not be possible in all cases; where it is not, the HPC team will contact the investor with the available options.
- Transfer of compute nodes from the UI queue to investor queues will occur in the order in which failures occur.
- The UI queue's pool of compute nodes is finite and is unlikely to sustain all investor queues at full capacity for a full seven-year life. Investors should therefore not assume that their queue will remain at full capacity for the duration of the two years outside of warranty.
What does this mean in the context of Neon hardware?
Hardware purchased as part of Neon will be retired on or about January 1, 2021.
Approximately 30% of Neon hardware is owned by the UI queue, and the failure rate on Neon hardware has been roughly 20% per year. We therefore estimate that the UI queue pool will be depleted by approximately June 2020, at which point investor queues will no longer remain at full capacity.
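The depletion estimate above can be sketched with a simple month-by-month model. The node counts below are hypothetical (a 1,000-node fleet split 30/70 between the UI queue and investors, per the percentages above); only the 30% ownership share and the ~20% annual failure rate come from this document, and the result is an illustration, not the actual Neon projection.

```python
# Illustrative sketch only: fleet size is assumed, not an actual Neon figure.
ANNUAL_FAILURE_RATE = 0.20          # ~20% of nodes fail per year (from the text)
MONTHLY_RATE = ANNUAL_FAILURE_RATE / 12

investor = 700.0    # assumed investor-owned nodes (70% of a hypothetical 1,000-node fleet)
ui_spares = 300.0   # assumed UI-queue nodes (30%), which serve as the replacement pool

months = 0
while ui_spares > 0:
    investor_failures = investor * MONTHLY_RATE   # failed investor nodes, replaced from the UI pool
    ui_failures = ui_spares * MONTHLY_RATE        # UI nodes also fail and are not repaired
    ui_spares -= investor_failures + ui_failures  # pool shrinks from both draws
    months += 1

print(f"UI pool depleted after ~{months} months")
```

Under these assumptions the pool lasts a bit under two years, which is broadly consistent with a mid-2020 depletion for hardware still in service in 2018.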
What is the impact of this policy on UI queue compute capacity?
The UI queues are the most highly utilized queues on the campus HPC clusters. This policy does mean that generally available compute capacity will decrease over time, compared with an approach in which investor queues were not kept whole. The University, however, has budgeted for periodic additions of new hardware, so we will work to mitigate this capacity constraint as budget allows. Additionally, this is not a significant departure for the UI queue from the previous model, in which an HPC cluster was shut down after approximately five years.
What is the impact of this policy on the HPC team?
Since the policy does not call for repairing compute nodes that fail outside of warranty, it does not significantly increase the burden on the HPC team in terms of hardware repairs or reallocation of nodes between queues. The larger aggregate number of nodes and the greater diversity of hardware architectures do have an impact on the HPC team, but this was expected as part of the HPC model change.
Questions about this policy may be directed to firstname.lastname@example.org.