Slowdown of /scratch file system 03/21/2012

Between the afternoon of March 20, 2012 and the morning of March 21, 2012, the /scratch file system was overloaded. It remained functional but was very slow. The main cause was that many compute nodes had gone deep into swap. In that condition the nodes responded poorly, and the Lustre server evicted them as clients of the file system. Because the clients were not actually dead, only slow, they reconnected. This evict-and-reconnect cycle repeated continuously and drove the load on at least two of the Object Storage Servers (OSSs) extremely high. Since these OSSs are used by all compute nodes, the slowdown was seen cluster-wide.
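
As a rough illustration of the condition described above, the sketch below shows one way a node that is deep in swap might be flagged from the contents of /proc/meminfo. It is not the monitoring used on this cluster. The function names and the 50% threshold are assumptions chosen for the example; only the SwapTotal and SwapFree fields come from the standard Linux /proc/meminfo format.

```python
def swap_used_fraction(meminfo_text):
    """Return the fraction of swap in use (0.0 to 1.0).

    meminfo_text is the text of /proc/meminfo. Returns 0.0 when no
    swap is configured.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])  # values are reported in kB

    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    if total <= 0:
        return 0.0
    return (total - free) / total


def is_deep_in_swap(meminfo_text, threshold=0.5):
    """Hypothetical check: True when swap usage is at or above threshold."""
    return swap_used_fraction(meminfo_text) >= threshold
```

On a compute node, the text could be obtained with open("/proc/meminfo").read() and passed to is_deep_in_swap. A suitable threshold depends on the node's memory and swap configuration.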