We’ve continued to offload the problematic file server, and loads across the cluster have looked quite good (almost all servers under a load of 5, many around 2-3, and only one ever at or above 10), so we had hoped that we could restore snapshot backups for users on that server. Unfortunately, doing so increased the disk usage on that file server again to the point that it has been displaying the same symptoms as before. The good news is that the fix is simply a matter of disabling the backup snapshots again and then dropping that data. We’ve already disabled the snapshots and are in the process of dropping the data. This is resulting in some very inflated loads across the cluster, but as soon as the drop has completed I’ll be issuing soft reboots (easier on the hardware) that will fix the loads you’re seeing, and I’ll add an update here. This should eliminate the performance issues that have been reported (we’ll just have to get disk usage on that particular file server even lower before we restore those snapshots again). My apologies for this and for the concerns that it has caused.
Update: Utilization is already down from 93% to 92%, but there are multiple TB of data at issue, so it may take a few hours to finish dropping the data. I’ll be keeping an eye on it and rebooting any servers that appear to need extra attention (changes like this often need that extra step to stabilize things). There will be another update once we’re satisfied that things are fixed.
Update: We’re now back down to 89%. I am still seeing load issues, so we’re not out of the woods yet, but once the process completes we’ll know whether that did it (I will continue to update).
Update: Utilization is at 87% and loads have improved, so I am updating the severity to medium. I am not marking it resolved, as we still need to see the loads and performance return to what they were a week ago.
Update Tuesday, May 13th: Utilization crept back up. It appears that there were still some rules on the file server that were creating the snapshots we deleted; these have been removed and the deletion process is underway again. The admin team is also looking into completely removing this file server from the system (for now we intend to keep offloading it non-stop to ensure that no more issues crop up).
Update Thursday, May 15th: The snapshot data is gone and loads have dropped back to what they should be. Even so, we’re going to be moving people off of that file server (we have new hardware coming in that should allow us to completely offload it and eventually scrap it).