Change in HPC Model

Based on feedback from the community, the HPC team has decided to change the University of Iowa HPC model. If you have questions, please contact us at hpc-sysadmins@iowa.uiowa.edu.

Short Overview (For more background please see the Long Overview at the bottom of this page)

  • Today – We have built independent cluster systems (Helium, Neon, and Argon) in isolation from one another and have had very specific hardware requirements for the HPC environment. There have also been times when it was not possible to buy into the HPC environment because of the strict purchasing cycles this approach imposes.
  • Future - Maintain the Argon system “indefinitely” through a continuous buy-and-refresh model that specifies new hardware models every 6-12 months. The user experience will be similar to today's, but there will be only one campus HPC cluster. It will contain a variety of hardware of different ages, which users request through scheduler resources so that jobs still get the performance that is optimal for them. Neon hardware will be shut down and reinstalled as part of Argon after the completion of the Fall 2018 semester. Neon hardware is end of life, but we will run it off warranty for up to two years or until the systems fail, whichever is sooner.
  • When will the change happen? - We have already started this change and are now offering a new type of compute node in Argon (https://hpc.uiowa.edu/user-services/buy-compute-nodes). The currently expected major milestones are as follows (subject to change):
    • May 1, 2018 - HPC Model Change Decision Made & Consumer GPU Nodes Announced
    • July/August 2018 - First Phase 2 Argon GPU Nodes Installed
    • November 1, 2018 - No New Neon Accounts Created
    • January 7, 2019 - Neon System Shutdown
    • January 21, 2019 - Neon nodes reinstalled and accessible in Argon.
    • March 1, 2019 - Last day to transfer data off Neon /home and /nfsscratch.

Frequently Asked Questions

  • Will Neon accounts be migrated to Argon? - Not all Neon accounts will be migrated to Argon automatically. Investors will have the option to have individuals migrated automatically. Non-investors who do not currently have an Argon account will need to apply for an account here.
  • What will happen to data I have stored on Neon?
    • Home Accounts - Data will not be automatically migrated. Access to data will remain through March 1, 2019. Access instructions here
    • /nfsscratch - Data will not be automatically migrated. Access to data will remain through March 1, 2019. Access instructions here
    • /localscratch - Data will not be migrated and will be permanently unavailable as of January 7, 2019. Access instructions here
    • Paid Shared Storage (Large Scale Storage - /dedicated and /shared) - No data migration is necessary. This storage should already be accessible from Argon.
  • Current Memory Resource Request Syntax is Deprecated - Due to the increased heterogeneity of the new Argon cluster, the syntax for memory requests had to be changed. The old syntax will continue to work until Neon nodes are integrated with Argon in January 2019. For details, please visit the Argon wiki.
  • What CPU architectures exist in the Argon system? - At present the system contains Intel CPUs from two different generations: Broadwell (Argon Phase 1) and Skylake (Argon Phase 2 - Consumer GPU). In January 2019 we will add Neon nodes to the system, which will add Sandy Bridge and Ivy Bridge processors. The new scheduler resources for selecting different architectures are documented on the wiki.
  • Will central software packaging change on the new system? - Yes. The increased hardware heterogeneity of the integrated Argon system will require changes to the way packages are built and provided centrally. Details will be shared on this page as they become available. 
  • Will Xeon Phi Accelerator cards be supported in the new Argon system? - No, Xeon Phi accelerator cards will not be supported after Neon nodes are integrated with the Argon system.
  • When is the last day for creating new accounts on the Neon system? - November 1, 2018
  • When is the last day for new central software installs on the Neon system? - November 1, 2018
  • How do I choose the right fabric (High Speed Network) for MPI jobs? - Once Neon is integrated with Argon, the system will contain more than one fabric. Neon nodes will continue to be connected with InfiniBand, Argon Phase 1 nodes are connected with Omni-Path, and Argon Phase 2 - Consumer GPU nodes are currently connected only with Ethernet. The new scheduler resources for selecting different fabrics are documented on the wiki.
  • How do I ensure that my jobs stay within a certain data center? - The Argon system spans two data centers that are approximately ten miles from one another. For single node/high throughput jobs this does not present a problem. For tightly coupled MPI jobs, it is important to keep workloads colocated in a single data center for best performance. Argon Phase 1 nodes are housed in the LC data center, while Neon and Argon Phase 2 nodes currently reside in the ITF data center. The new scheduler resources for selecting different data centers are documented on the wiki.
  • Will applications that I built on Neon continue to work on Argon after the integration? - Maybe. Some applications may continue to work, but they will likely not perform optimally on new processor architectures. Additionally, Neon is running CentOS 6 while the integrated Argon system will run CentOS 7. Because of the difference in operating system, it is likely that a number of applications will need to be recompiled to function.
  • How long will Neon nodes be available in the integrated system? - For the detailed policy on compute node lifetime, visit the Buy Compute Nodes page.
    • Hardware purchased as part of Neon will be retired on or about January 1, 2021.
    • Approximately 30% of Neon hardware is owned by the UI queue. The failure rate on Neon hardware has been about 20% per year. At that rate, we estimate that by approximately June 2020 the UI queue will be depleted and investor queues will no longer remain at full capacity.
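
As a rough illustration of the June 2020 estimate above, the following Python sketch shows one way the arithmetic could work out. It is not the HPC team's actual calculation. It assumes that cumulative hardware failures are effectively absorbed by the UI queue's 30% share before investor queues shrink, and that the clock starts when Neon nodes join Argon in January 2019. The function names are hypothetical and exist only for this example.

```python
import math

# Figures quoted in the FAQ above.
UI_QUEUE_SHARE = 0.30        # fraction of Neon hardware owned by the UI queue
ANNUAL_FAILURE_RATE = 0.20   # fraction of Neon hardware failing per year


def years_to_deplete_linear(ui_share: float, failure_rate: float) -> float:
    """Years until cumulative failures equal the UI queue's share,
    assuming a fixed fraction of the ORIGINAL fleet fails each year."""
    return ui_share / failure_rate


def years_to_deplete_compound(ui_share: float, failure_rate: float) -> float:
    """Same question, assuming a fixed fraction of the SURVIVING fleet
    fails each year."""
    # The surviving fraction after t years is (1 - failure_rate) ** t.
    # Solve for the t at which it drops to (1 - ui_share).
    return math.log(1.0 - ui_share) / math.log(1.0 - failure_rate)


linear_years = years_to_deplete_linear(UI_QUEUE_SHARE, ANNUAL_FAILURE_RATE)
compound_years = years_to_deplete_compound(UI_QUEUE_SHARE, ANNUAL_FAILURE_RATE)

print(f"linear model:   {linear_years:.1f} years")
print(f"compound model: {compound_years:.1f} years")
```

Both variants give roughly a year and a half. Counted from January 2019, that lands in mid-2020, which is in line with the estimate of approximately June 2020.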

Long Overview

The HPC team spent the beginning of 2018 reviewing options for the Neon system. We have also seen major interest in GPU computing (data center and consumer cards) and have been experiencing more pressure to make continuous buying available in the HPC systems as the user base grows. Taken together with other factors, this caused the HPC team to review our campus cluster model and, after engaging the community for feedback, to change the HPC model.

The New Model

In the new model we will have a single integrated campus cluster that, for continuity, will be built on top of the Argon system. This new system will live on “indefinitely”, contain multiple generations of hardware, and span the ITF and LC data centers. The HPC team will work with the community to determine new node types every 6-12 months, which makes continuous buying an option and improves the flexibility of the system as hardware diversity increases. The target is to refresh the cluster software stack (operating system, scheduler, etc.) at approximately 4-5 year intervals. If an earlier change is needed, we will work with the HPC policy committee to adopt an off-cycle change, with a target of giving the community at least one year of notice for major software changes. User software will continue to be updated periodically, as it is today, and will be delivered through modules so that multiple versions are available. The integrated Argon system will also continue to support Singularity container technology.

What are the major changes?

  • Instead of having multiple distinct clusters, we will have a single integrated cluster.
  • The lifespan of compute nodes will increase to up to seven years, within the limitations outlined on the Buy Compute Nodes page.
  • Compute nodes will follow a more rapid technology refresh cycle. Infrastructure will be refreshed on an as-needed basis.

What won’t change?

  • With the exception of increased cluster heterogeneity and a single cluster environment we expect the user experience to remain the same as it is today.
  • We will be working to provide similar levels of support that you experience from our team today.
  • The way nodes are purchased and allocated in queues.
  • All software currently working on Argon will continue to operate.

What are the benefits?

  • Single HPC Account - Users will no longer need separate accounts on Neon and Argon, nor will they have to apply for new accounts on future systems.
  • Run Compute Nodes Off Warranty - This model makes it easier for us to run compute nodes off warranty for a period of time, helping to address the increasing cost of computing hardware.
  • Fewer Data Storage Islands – Users will no longer have two separate home accounts that require copying data back and forth.
  • Neon Software Upgrade Path - Provides a straightforward upgrade path for Neon hardware. We recommend that after the Fall 2018 semester we turn off Neon and then reinstall it as part of Argon. For Neon nodes that are still functional off warranty, we will create investor queues on the newly integrated Argon using the Neon hardware.
  • One HPC Software Stack – Reduces the number of HPC software stacks that the HPC team has to maintain from two to one.
  • Integrated Environment – Provides an integrated environment where multiple generations or diverse types of hardware can be used together in the same workflow more easily.

What are the downsides?

  • Increased Cluster Heterogeneity - The heterogeneity of the cluster will increase, which means a proliferation of resource types in our scheduling environment. Users will need to understand these resources and request them properly for optimal performance, and this will require a significant investment in education. On the upside, this is similar to what users must deal with in the cloud, so in some sense it better prepares people to work with cloud systems.
  • Changes Required - There are many details that will need to be worked out and some things will likely break along the way since this is a new way of doing things.