Early this morning a device on our storage area network failed. As a result, disk storage for Rmail, my.ryerson, Blackboard, and other systems went offline. (Other services like RAMSS and eHR are not reachable when my.ryerson is down.) We are investigating the cause and working to restore service. Updates to follow.
Update 8:20 AM: A storage appliance that manages the storage connections to the Rmail, my.ryerson and other servers failed. A second storage appliance is available as a backup, but for reasons we have not yet determined, the servers were not able to tolerate the storage connectivity failures before the second device took over.
Update 9:36 AM: The problem seems to be related to a failed controller on one of the storage appliances that provide storage virtualization. The vendors are working on the problem and replacement parts are being sent to us.
In the meantime, we’re also working on bringing up services on the backup appliance, with the idea that we will be able to “fail over” to the other appliance (in a controlled way) when it is working properly.
Update 12:37 PM: With the help of the vendor, whose staff are on site, we are in the process of rebuilding the failed storage appliance. We estimate that the earliest we can be back online is some time after 2:40 PM. It may take significantly longer. In parallel with the work to restore the failed device, we are also preparing another appliance which we can reconfigure to replace the failed unit.
Update 2:21 PM: We have run into other problems rebuilding the storage appliance that failed, and we now think we will have a replacement appliance ready before we can repair the failed unit. This pushes the earliest possible time for a fix back to after 5 PM. Once again, please accept my apologies for the prolonged outage.
Update 5:40 PM: The new storage appliance is in place but it will be many hours before all services are restored. The appliance must go through a diagnostic process, test its connections to the storage devices, and be reconfigured to provide storage for each server. Initial estimates of how long that would take are proving to be optimistic. Once everything is working again, we must start the process of validating that all files and databases have been recovered correctly. From what we have seen in the past, this will also take hours for some services like Rmail. Based on what we’ve seen over the last hour, we can’t provide a good estimate of how long it will take to get back online this evening. CCS staff will work as long as necessary to recover all services.
Update 9:30 PM: At roughly 7:30 PM we experienced a setback while restoring the configuration of the appliance. As a result, we do not have network access to the device, and we are working with the vendor to continue recovering the device’s configuration. We have lost at least two hours because of this.
I know many people would like to know when the system will be restored. We are still working towards it being back online in the late evening or early hours of the morning, but at this time we cannot predict when service will be restored.
-Brian
Update 10:54 PM: With the help of the vendor’s third-level support staff we have finally completed the automatic recovery process and are beginning the final configuration steps of the storage appliance. Network connectivity has also been established.
Update 11:17 PM: Both storage appliances are now restored and working correctly. Staff are now checking the databases and file systems that were impacted by the outage.
Update 00:40 AM Tuesday, Aug. 28: A few systems have now been recovered and services are available. CAS and uPortal services were restored around 00:20 AM, and Blackboard was restored around 00:25 AM. Rmail was restored around 00:45 AM. The GroupWise web interface had to be restarted and was unavailable from about 11:00 PM to 11:40 PM. The Cold Fusion services were also restored around 00:35 AM. We continue the recovery of other systems.
Update 01:30 AM Tuesday, Aug. 28: Alfresco services were restored at approximately 01:10 AM. We continue working on the recovery of the RAMSS and HR systems.
Update 02:20 AM Tuesday, Aug. 28: The RAMSS services were restored at approximately 02:00 AM. We continue working on the recovery of the HR system and we expect it to be available shortly.
Update 02:45 AM Tuesday, Aug. 28: The HR system is available as of 02:45 AM. All services have been restored.
Update 11:25 AM Tuesday, Aug. 28: During the outage, the front-end Rmail servers that receive email continued to function while the back-end systems that store mail for each email account were offline. Before service was restored, the front-end servers queued over 100,000 messages for later delivery. This morning the queues cleared and all pending mail was delivered. We do not believe any mail was lost during the outage.
We apologize for the service disruption.
-Computing and Communications Services