Tuesday, December 21, 2004

Friday inspection to check array record

Last Friday we worked on a TP9100 failure. Below is a brief report.

1. Problem description
As the customer reported, a TP9100 was failing with disk errors (amber LEDs), and the filesystems on this device were not available. After power cycling the host system (Octane) and the TP9100, all TP9100 disks showed amber LEDs and the filesystems were still not available.

2. Work Procedure

2.1 On site, checked the TP9100 with the management tool, WAM. Some RAID disks were marked as offline, some were unconfigured, and the LUN was in an unusable state;
2.2 Power cycled the TP9100. All disks showed green LEDs and the RAID controller was OK, but the LUN / filesystem was still unavailable;
2.3 Rebuilt the COD information on the RAID device. The LUN was rebuilt and the filesystem could be mounted as normal, but xfs_repair and xfs_check failed with a fatal error, which caused a system panic and hang. At this stage, I can say that all data on the TP9100 was corrupted, or that the RAID 5 could not be recovered;
2.4 Formatted the filesystems; they mounted OK.

3. Review

A RAID 5 device has fault tolerance: it can withstand the failure of one disk without interrupting regular access, and a hot spare disk gives you extra protection. At the CACT Group site there are 8 disks in the TP9100: 6 disks are combined into a RAID 5 device with 2 LUNs built on it, one disk is assigned as a hot spare, and the last disk is unused and marked as 'not stable'. Unfortunately, we now know that the hot spare disk was unstable or defective; with the GAM tool you can see that this drive's capacity is 0 MB. So, with 2 disks failed, the RAID 5 device was corrupted and the data on it was lost. In the end we had to reformat the filesystems to make them usable.
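
To illustrate why one failed disk is survivable but two are not, here is a minimal sketch of RAID 5 style XOR parity over a single stripe. It is not the TP9100 controller's actual implementation: the block contents, the function names xor_blocks and rebuild, and the fixed parity position are all assumptions made for the example (a real RAID 5 rotates parity across the disks).

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe: five data blocks plus one parity block,
# mirroring the 6-disk RAID 5 set described above.
data = [bytes([value] * 4) for value in (0x11, 0x22, 0x33, 0x44, 0x55)]
parity = xor_blocks(data)
stripe = data + [parity]

def rebuild(stripe, missing):
    """Rebuild the single missing block of a stripe from the survivors."""
    if len(missing) > 1:
        # With two blocks gone there is one equation and two unknowns.
        raise ValueError("RAID 5 cannot rebuild more than one lost block per stripe")
    survivors = [block for i, block in enumerate(stripe) if i not in missing]
    return xor_blocks(survivors)

# Losing disk 2 alone is recoverable from the other five blocks.
recovered = rebuild(stripe, {2})
```

A hot spare only helps if it is healthy: it gives the controller somewhere to write the rebuilt blocks after the first failure. With a defective spare, the array stays one failure away from data loss.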

5. Recommendation

5.1 Take care of the temperature in your machine room. Although the operating temperature range for the TP9100 is 10 °C to 40 °C, bear in mind that hard disks are much more likely to fail at high temperatures.
5.2 Check the system logs regularly, not only on the Octane but also on the TP9100. Even a visual check is good for spotting hardware failures.
5.3 Replace the failing disk.
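
The regular log check in 5.2 could be partly automated. Below is a rough sketch only: the keyword list and the sample lines are invented for illustration and are not actual Octane or TP9100 log output, so the pattern would need to be adapted to the messages your systems really produce.

```python
import re

# Hypothetical keywords; real log messages may use different wording.
PATTERN = re.compile(r"offline|failed|i/o error", re.IGNORECASE)

def suspicious_lines(lines):
    """Return the log lines that mention a possible disk problem."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]

# Made-up sample lines, used only to exercise the function.
sample_log = [
    "disk 3 marked OFFLINE\n",
    "nightly backup finished\n",
    "I/O error on device\n",
]

hits = suspicious_lines(sample_log)
```

In practice the lines would be read from the system log file and the result mailed to the administrator, so that a disk going offline is noticed before a second one fails.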

