OmniPath fabric is down on Argon

At approximately 3:15 PM on February 4, 2021, the OmniPath fabric of the Argon HPC cluster went down. This affects jobs that run across multiple nodes connected to that fabric, such as MPI jobs. The cause is currently being investigated.

Note that the InfiniBand fabric is not affected, so if your nodes were purchased recently, this issue will not affect you. When submitting to the UI queue, you can select the InfiniBand fabric explicitly with 'qsub -l fabric=infiniband'.

Update (2/5/2021): The problem with the OmniPath fabric has been isolated to a single failed switch. That switch has been powered down, which allowed the fabric to be restarted, and the process of replacing the switch has begun. Because only this one switch failed, the impact is limited to a single rack of servers, affecting 35 compute nodes. Owners of those nodes will be notified separately.

Update (2/9/2021): The faulty switch is scheduled to be replaced tomorrow, 2/10/2021, in the afternoon.

Update (2/10/2021): The faulty switch has been replaced, and the OmniPath fabric is restored.