Storage Hardware Failure on `jay` for groups in storage volumes `1, 2, 3, umgc` Wednesday 13th July 2022 10:08:00


@11:50PM, 07-12-2022: A hardware failure occurred on one of our networked storage nodes. During the node's reconstruction process, reconstruction of some storage volumes failed to complete.

Groups in the following volumes may be affected by this hardware failure: 1, 2, 3, umgc. Users in these groups may experience slow access times on primary storage.

This hardware failure may also affect other services, such as NICE and NoMachine, because those services read and write files in a user's home directory.

There is currently no estimated time of resolution. We are in contact with our storage vendor to resolve the issue.

UPDATES:

@10:50AM, 07-13-2022: We are on a call with our storage vendor, Panasas, and have started the node reconstruction process. The reconstruction process has stalled and has been unable to complete.

@08:00AM, 07-14-2022: The volume reconstruction process was able to begin again late last night. As of this morning, the reconstruction was still running, which was unexpected because it had been predicted to complete much sooner. We are again in contact with our vendor to look into the issues with the reconstruction process.

The list of affected volumes has been expanded to: 0, 1, 2, 3, 4, 5, 6, umgc, software, risdb, archives_tmp. These volumes remain available; however, accessing some portions of them will hang because those blocks of data have not yet completed reconstruction.

@01:00PM, 07-15-2022: Volumes 6, umgc**, and risdb have fully recovered and are online. The other volumes are still under reconstruction, and the process is still running into problems.

**umgc is recovered and online, but will require additional repairs that will have to be performed at the next maintenance.

All volumes are now back online.