September 7, 2018
While the Spectre/Meltdown-related changes implemented during the August 8, 2018 maintenance have improved performance, we are now experiencing stability issues. These issues were manageable until yesterday, when over 40% of Argon compute nodes went offline. Mitigating them requires rebooting the affected compute nodes. From a user perspective, jobs may go into a "dr" state despite users' attempts to delete them, or jobs may fail. We are finalizing plans to begin rolling updates to attempt to address the stability issues; machines will be rebooted as they become available, with no full system outage or maintenance. In the meantime, we are monitoring for problematic nodes.
August 8, 2018
Patches to mitigate the performance impact of Spectre/Meltdown were implemented on the HPC system, and initial results indicate that most performance hits have been mitigated.
July 20, 2018
Security patches for the Meltdown and Spectre vulnerabilities continue to be issued by OS vendors and hardware manufacturers. The HPC team continues to apply these patches where applicable to maintain system security. Unfortunately, we have identified a real-world example of a 50% performance decrease due to one of these BIOS patches, which was installed in May 2018. We are currently evaluating options to mitigate the situation. If your jobs are compute intensive and are performing poorly, we encourage you to contact the HPC team.
January 25, 2018
The HPC team has concluded initial evaluation of the Meltdown and Spectre processor vulnerabilities. All HPC systems are vulnerable, as are the vast majority of computing devices. These are serious security vulnerabilities that can lead to the loss of sensitive data. While the vulnerabilities are currently difficult to exploit, they do require action. Please read on for details as they relate to the HPC environment.
Do you plan to patch the HPC systems? – Yes, we plan to patch the Argon system at the next maintenance and the Neon system will be patched soon after. While patches for these exploits continue to evolve we plan to apply the current patches that are available. Not patching is not an option as all future updates will be dependent on installation of the patches for these vulnerabilities.
Will patching affect system performance? – Yes, we do expect some decrease in system performance. Based on the testing we have performed, we expect the average performance decrease to be in the 5-10% range. However, the actual impact depends heavily on the type of workload you are running. For heavily compute-intensive jobs, we have seen many cases where there is nearly no performance impact. I/O-intensive jobs, including jobs that write large amounts of data to disk or send large numbers of MPI messages, are likely to be more affected, with some workloads seeing performance hits closer to the 30% range.
How can I minimize the performance impact on my work? – The biggest thing you can do to minimize performance impact is to limit your I/O to the greatest extent possible. We also recommend that you optimize your I/O to decrease the number of small file operations you are executing. In many respects optimizing in this new environment is similar to traditional optimization and mirrors many of our recommendations for high throughput jobs.
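As one illustration of reducing small file operations, the sketch below sends many small records through a single large write buffer rather than issuing one tiny write per record. It is a minimal example, not a recommendation specific to our systems: the function names, the 1 MiB buffer size, and the one-record-per-line format are all assumptions chosen for illustration.

```python
import os
import tempfile


def write_results_batched(path, records, buffer_size=1024 * 1024):
    """Write records through one large buffer (hypothetical 1 MiB default).

    Python accumulates the small write() calls in memory and hands data to
    the operating system in large chunks, which reduces the number of
    system calls compared with unbuffered or frequently flushed writes.
    """
    with open(path, "w", buffering=buffer_size) as handle:
        for record in records:
            handle.write(f"{record}\n")


def read_results(path):
    """Read the records back, one per line, without trailing newlines."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle]


if __name__ == "__main__":
    # Usage example: write 100,000 small records to a scratch file.
    scratch_dir = tempfile.mkdtemp()
    output_path = os.path.join(scratch_dir, "results.txt")
    write_results_batched(output_path, range(100_000))
    print(len(read_results(output_path)), "records written to", output_path)
```

The same idea applies in other languages: collect output in memory and write it in a few large operations, and prefer a small number of larger files to many tiny ones.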
Can you help me minimize my performance impact? – The HPC team is happy to engage in these conversations but unfortunately has limited time available. As such, all consults in this area will be handled on a first-come, first-served basis as time allows. We also strongly encourage you to do an initial performance evaluation and optimize your I/O as best you can before contacting us.
Do you expect stability issues? – There have been reports of stability issues in the media, but to date we have not experienced any. We will carefully evaluate the patches we apply for stability issues before they are rolled out across the clusters.