Argon Performance Issues

During the last few months, Argon system performance has at times been poorer than expected. At a high level, these issues are largely the result of four factors:

  • The increasing size, scale, and performance of the Argon system (the system is now over 28,000 slots and ~150 GPU cards).
  • Increasing utilization of the system.
  • The I/O patterns (size of data sets, number of files) of specific types of jobs that users are running.
  • Potential network congestion, both between data centers and within the system.

 

What actions is the HPC team taking to mitigate these issues?

  • In one-on-one consultations with members of the HPC community, the HPC team is looking for opportunities to suggest better-performing, less impactful ways to structure I/O. - Ongoing
  • Adding new instrumentation that allows us to better measure network performance within the system. - Timing TBD
  • Alternative storage architecture test bed systems (leveraging distributed filesystems) will be installed to begin testing of more performant storage technologies. - September 2019
  • Investigating additional software tuning of data storage servers in the Argon system.
  • Investigating adding larger SSD cache to Argon storage systems.
  • Continuing migration of home accounts to new home account servers.

 

What actions have already been taken to mitigate these issues?

  • The HPC team has increased the number of home account servers from one to three. New user accounts are being distributed across the new systems. We moved some active users to the new systems during the May 16, 2019 maintenance (transparent to those migrated).
  • The HPC team has been monitoring performance and, when degraded performance occurs, actively investigates the root causes and contacts users whose I/O patterns may be contributing to the degradation.
  • The HPC team is in the early stages of exploring alternative storage architectures that may provide improved performance and better scaling. The team continues to look for opportunities to optimize and improve the current architecture, but options to do this are decreasing. An architecture review/retreat was held May 3, 2019.
  • At one point, /nfsscratch filling up was also contributing to performance problems. This has been resolved. https://hpc.uiowa.edu/system-news/nfsscratch-cleaning-changes
  • Modifications to the way that file color coding operates, which may decrease load on storage systems, particularly when listing large numbers of files.
    Color attributes for ownership and permissions have been removed. This minimizes the number of expensive lstat calls that were needed to colorize those attributes. The change takes effect at future logins; to apply it in a current session, run 'source /etc/profile'. See the listing sketch after this list for another way to reduce metadata load when listing large directories.
  • The Lmod module system will use caching to speed up listing of a large set of modules. By default, this caching is at the user level and the cache files are stored in user home directories. This is fine for speeding up the module listing itself but the cache location in homes will add to the load on that server. A system cache has been created to help reduce some of the NFS load of Lmod caching in user home directories.
  • Added bandwidth for data transfers in LC data center.
  • Added bandwidth for data transfers in ITF data center.
  • Added a second /nfsscratch server, called /nfsscratch_itf, in the ITF data center.
  • Increased the available bandwidth between data centers in the Argon system.
  • Restored 10G bandwidth for internal interface of head node.
  • CPU-intensive processes should not be run on the login nodes, but in case they are, a CPU quota has been put in place to prevent those processes from consuming too many resources.
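
For users who regularly list directories containing very large numbers of files, a minimal sketch of a lower-impact listing is shown below. The directory path is only a placeholder, and the exact savings depend on the ls implementation on Argon's login nodes:

    # Listing without color avoids the per-file metadata lookups
    # (lstat calls) that colorized output requires.
    ls --color=never /path/to/large/directory

    # 'ls -1 -f' additionally skips sorting, which further reduces the
    # work required when a directory holds many thousands of entries.
    ls -1 -f /path/to/large/directory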

       

What can you do?

  • Volunteer to be migrated to a new home account server and let us know if it has an impact on the performance you experience. To do so, contact research-computing@uiowa.edu.
  • If you are a high-throughput user, follow the best practices on this page: https://wiki.uiowa.edu/display/hpcdocs/Best+Practices+for+High+Throughpu...
  • Avoid creating large numbers of very small files where possible; bundling related files into a single archive helps (see the sketch after this list).
  • Avoid performing I/O-intensive computations in your home account wherever possible.
  • Do not run long-running, CPU-intensive processes on the login nodes; submit them as batch jobs instead (also shown in the sketch below).
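
Two of the items above can be illustrated with a short sketch. The archive name, directory, and job script below are placeholders, and the qsub command assumes Argon's SGE-style batch scheduler:

    # Bundle many small files into a single archive before copying or
    # transferring them, rather than moving thousands of tiny files.
    tar -czf results.tar.gz results/

    # Submit long, CPU-intensive work to the batch system instead of
    # running it on a login node; myjob.sh stands in for your job script.
    qsub myjob.sh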

       

Past Issues

August 1-2, 2019 - The internal interface was saturated by large downloads performed as part of jobs. Extra bandwidth will be added.

July 8, 2019 - Argon's /nfsscratch filesystem was under very heavy load, causing commands that access it to return very slowly. HPC staff investigated the problem and explored mitigations.

April 19, 2019 - Performance issues with Argon /nfsscratch occurred again, resulting in degraded system performance. The issue was investigated and mitigated. We are sorry for any inconvenience this may have caused. Please reach out to us at research-computing@uiowa.edu if you have any questions/concerns.

March 25, 2019 - Performance issues with Argon /nfsscratch occurred again, resulting in degraded system performance. The issue was investigated and mitigated. We are sorry for any inconvenience this may have caused. Please reach out to us at research-computing@uiowa.edu if you have any questions/concerns.

March 24, 2019 - Argon was running slowly due to storage-related issues; the issue was mitigated and performance returned to normal. We are sorry for any inconvenience this may have caused.

If you have questions, please contact research-computing@uiowa.edu.