ITS UPDATE: Email System Problems Today
As you know, we migrated students to their own new server over interim and copied all of their emails to the new, faster server. Next on our list of projects was moving faculty and staff email to a new server, as the current server is over 5 years old and slow. Prep work for that move started today. All of the old student emails that were no longer needed were being deleted to free up space for the faculty/staff post office and make the move to a new server easier.
Unfortunately, the volume that houses all of those files was configured in such a way that when you think you are deleting a file, you are not. The file is only tagged for deletion, much like the Recycle Bin in Windows. Since we were deleting millions of files, the old server became overwhelmed with the task of managing a recycle bin that large and froze in its tracks around 10 AM (obviously a day where it can really happen). The server then rebooted itself.
On startup the server automatically mounts the volume that houses all of the emails, but when it first loads this volume it has to determine which files really should be deleted. Again it saw millions of files that were set to be deleted, and it took the server about 45 minutes to parse through the log file that listed all of them. During this time faculty and staff email inboxes were unavailable. At about 11 AM the server finished parsing the log and normal email activity was restored. It then took about 5-10 minutes to deliver all of the email that had been queued up during the roughly hour-long outage. No emails were lost during the downtime or the recovery.
The good news is that the old server is now back up and functioning normally. Based on this incident, we are going to change the setting on the new faculty/staff email server so that once a file is deleted, it is deleted immediately rather than tagged to be deleted later. Also, once the faculty and staff post office is moved to the new server, it will have a faster connection to the volume and should recover more quickly if a similar problem were to occur again.