Change in HPC Model

Based on feedback from the community, the HPC team has determined that we will be changing the University of Iowa HPC model. If you have questions, please contact us at hpc-sysadmins@iowa.uiowa.edu.

Short Overview

  • Today – We have built independent cluster systems (Helium, Neon, and Argon) in isolation from one another and have had very specific hardware requirements and strict purchase cycles for the HPC environment. There have also been times when it was not possible to buy into the HPC environment because of these strict cycles.
  • Future – Maintain the Argon system “indefinitely” through a continuous buy-and-refresh model that specifies new hardware models every 6-12 months. This model will provide a user experience similar to today, but there will be only one campus HPC cluster, containing a variety of hardware of different ages that is requested via scheduler resources so that users can still get optimal performance for their jobs. Neon hardware will be shut down and reinstalled as part of Argon after the completion of the Fall 2018 semester. The Neon hardware is end of life, but we will run it off warranty for up to two years or until the systems fail, whichever comes first.
  • When will the change happen? – We have already started this change and are now offering a new type of compute node in Argon (https://hpc.uiowa.edu/user-services/buy-compute-nodes). The currently expected major milestones are as follows (though this is subject to change):
    • May 1, 2018 - HPC Model Change Decision Made & Consumer GPU Nodes Announced
    • July 2018 - First Phase 2 Argon GPU Nodes Installed
    • November 1, 2018 - No New Neon Accounts Created
    • January 7, 2019 - Neon System Shutdown (Scratch and Home Accounts Remain Accessible for Data Migration until March 1, 2019; see the data copy sketch after this list)
    • January 21, 2019 - Neon Nodes Reinstalled and Accessible in Argon
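
For users who need to act on the data-migration window above, the sketch below shows one way home-directory data might be copied from Neon to Argon before the March 1, 2019 deadline. This is a minimal illustration only: the login host names, HawkID, and paths are assumptions used for the example rather than confirmed values, so please check the HPC documentation for the actual hosts and storage locations.

    # Sketch: copy a project directory from a Neon home account to Argon.
    # Host names, HawkID, and paths below are placeholders for illustration.
    ssh hawkid@neon.hpc.uiowa.edu        # step 1: log in to Neon
    # step 2: from the Neon login node, push the data to Argon
    rsync -av --progress ~/my_project/ hawkid@argon.hpc.uiowa.edu:~/my_project/
    # Re-running the same rsync later copies only files that have changed.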

Long Overview

The HPC team spent the beginning of 2018 reviewing options for the Neon system. We have also seen major interest in GPU computing (data center and consumer cards) and have been experiencing growing pressure to make continuous buying available in the HPC systems as the user base grows. Taken together with other factors, this caused the HPC team to review our campus cluster model and, after engaging the community for feedback, to make this change in the HPC model.

The New Model

In the new model we will have a single integrated campus cluster that, for continuity, will be built on top of the Argon system. This new system will live on “indefinitely,” will contain multiple generations of hardware, and will span the ITF and LC data centers. The HPC team will work with the community to determine new node types every 6-12 months, making continuous buying an option and improving the flexibility of the system as hardware diversity increases. The target is to refresh the cluster software stack (operating system, scheduler, etc.) at approximately 4-5 year intervals; if the need arises, we will work with the HPC policy committee to adopt an off-cycle change, with a target of giving the community at least one year of notice for major software changes. User software will continue to be updated periodically, as it is today, and will be delivered through environment modules so that multiple versions are available. The integrated Argon system will also continue to support Singularity container technology.
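
As a concrete illustration of the module-based software delivery described above, the commands below show how a user might choose between installed versions on the integrated cluster, and how a Singularity container could be run. This is a minimal sketch; the module names, version numbers, and image name are placeholders rather than a list of what will actually be installed.

    # Sketch: selecting one of several installed versions via environment modules.
    module avail python                          # list available builds (placeholder name)
    module load python/3.6.3                     # load a specific version (placeholder version)
    module switch python/3.6.3 python/2.7.14     # swap to a different version in the same session

    # Singularity containers remain supported; run a tool from a user-provided image.
    singularity exec my_container.sif python analysis.py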

What are the major changes?

  • Instead of having multiple distinct clusters, we will have a single integrated cluster.
  • All compute nodes will continue to be purchased with a five-year warranty and will be maintained for five years. Nodes will be allowed to run for up to seven years, or as determined feasible by the HPC team, but nodes that fail outside of warranty will not be repaired. (Note: Exact details are still to be finalized with the HPC policy committee. An alternative is for investors to purchase an “extended warranty” for out-of-warranty equipment for some amount, on the order of $500-$1000/year or a percentage of machine cost. Feedback on this is strongly encouraged.)
  • A more rapid technology refresh cycle for compute nodes. Infrastructure will be refreshed on an as-needed basis.

What won’t change?

  • With the exception of increased cluster heterogeneity and the move to a single cluster environment, we expect the user experience to remain the same as it is today.
  • We will work to provide the same level of support that you experience from our team today.
  • The way nodes are purchased and allocated to queues will stay the same.
  • All software currently working on Argon will continue to operate.

What are the benefits?

  • Single HPC Account - Users will no longer need separate accounts on Neon and Argon, nor will they have to apply for new accounts on future systems.
  • Run Compute Nodes Off Warranty - This model makes it easier for us to run compute nodes off warranty for a period of time, helping to address the increasing cost of computing hardware.
  • Fewer Data Storage Islands – Users will no longer have two separate home accounts that require copying data back and forth.
  • Neon Software Upgrade Path - Provides a straightforward upgrade path for Neon hardware. After the Fall 2018 semester, Neon will be turned off and then reinstalled as part of Argon. For Neon nodes that are still functional off warranty, we will create investor queues on the newly integrated Argon using that hardware.
  • One HPC Software Stack – Reduces the number of HPC software stacks the HPC team has to maintain from two to one.
  • Integrated Environment – Provides an integrated environment where multiple generations or diverse types of hardware can be used together in the same workflow more easily.

What are the downsides?

  • Increased Cluster Heterogeneity - The cluster will become more heterogeneous, which will mean a proliferation of resource types available in our scheduling environment. Users will need to understand these resources and request them properly for optimal performance (see the sketch after this list), and this will require a significant investment in education. On the upside, this is similar to what users must deal with in the cloud, so in some sense it better prepares people to work with cloud systems.
  • Changes Required - There are many details that will need to be worked out, and some things will likely break along the way, since this is a new way of doing things.
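
To give a sense of what requesting a specific hardware type could look like, the job script below uses SGE-style directives as one possible form. The queue name, parallel environment, and resource string are hypothetical placeholders, and the actual resource names will be published as node types are added, so treat this only as a sketch of the pattern rather than as working values.

    #!/bin/bash
    # Sketch: batch job requesting a particular hardware generation via scheduler resources.
    # Queue, parallel environment, and resource names below are placeholders.
    #$ -q UI                  # target queue (placeholder name)
    #$ -pe smp 16             # 16 slots on one node (PE name is site-defined)
    #$ -l gpu_titan=1         # example resource string for a specific node/GPU type

    module load python/3.6.3  # placeholder module from the cluster software stack
    python analysis.py

Submitted in the usual way (for example, with "qsub myjob.sh"), the scheduler would then place the job only on nodes that advertise the requested resource.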