Watch this page in the wiki to subscribe to automatic updates to this status page.
Current Status
All Systems UP | |
---|---|
Include the keyword "issue" in updates above to trigger actions.
Report a problem
Upcoming Scheduled Maintenance
Start | End | What is happening? | What will be affected? |
---|---|---|---|
2017-03-25 10:15PM | scratch MDT failover | scratch is unavailable, df hangs, paused the scheduler and sent user email etc. | |
2017-03-25 4pm | 2017-0325 8pm | login node ps hang | rebooted h1-h3, lost bw/h2ologin DNS record, had neteng recreate the record. Had to rotate login in and out of round-robins until all rebooted. User email sent (2). |
2017-03-23 (1000) | 2017-03-23 (1500) | NCSA Nebula Outage | Nebula will take an outage to balance and build a more stable setup for the file system. This will require a pause of all instances, and Horizon being unavailable. |
2017-03-28 (0000) | 2017-03-29 (1600) | Blue Waters maintenance | Due to maintenance of cooling infrastructure at NPCF, Blue Waters will have full system unavailable during the maintenance. Cray will also take this maintenance window to perform some system updates at the same time. |
Previous Outages
Start | End | What happened? | What was affected? | Outcome |
---|---|---|---|---|
2017-03-16 (0630) | 2017-03-16 (1130) | LSST monthly maintenance | GPFS filesystems will go offline for entire duration of outages. Some systems may be rebooted, especially those that mount one or more of the GPFS filesystems. | |
2017-03-15 | 2017-03-15 16:01 | Failure on cabinet c9-7, affecting HSN. | Filesystem hung for several minutes. | Scheduler was paused for 50 minutes. Warmswap cabinet c9-7. Nodes on c9-7 are reserved for further diagnosis. |
2017-03-15 09:00 | 2017-03-15 12:47 | UPS work at ACB. Reshuffling electrical drops on 10k controllers, storage IB switches and some servers. | Scheduler will be paused for regular jobs. MWT2 and DES will continue run on their nodes. | UPS work at ACB - incomplete (required additional parts) Redistributing power work done. Scheduler was paused for 3hrs 50 mins. |
2017-03-10 13:00 | 2017-03-10 18:00 | ICCP - We lost 10K controllers due to some type of power disturbance at ACB. | ICCP - Lost all filesystem and its a cluster wide outage. | Recovered missing LUNs and rebooted the cluster. Cluster was back in service at 18:00. |
2017-03-09 0900 | 2017-03-09 1500 | ROGER planned PM | batch, hadoop, data transfer services & Ambari | system out for 6hrs, DT services out until 0000 |
2017-03-08 19:41 | 2017-03-08 22:41 | XDP powered off that served the four cabinets (c16-10, c17-10, c18-10, c19-10). | scheduler paused, four rack power cycled. moab required a restart, too many down nodes and itterations were stuck. | Scheduler paused three hours |
2017-03-03 1700 | 2017-03-03 2200 | BW hpss emergency outage to clean up db2 database | ncsa#nearline, stores are failing with cache full | Resolved cache full errors |
2017-02-28 1200 | 2017-02-28 1250 | ICC Resource Manager down | User can't submit new jobs or start new jobs | Remove corrupted job file |
2017-02-22 1615 | 2017-02-221815 | Nebula Gluster Issues | All Nebula instances paused while gluster repaired | Nebula is available. |
2017-02-11 1900 | 2017-02-11 2359 | NPCF Power Hit | BW Lustre was down, xdp heat issues. | RTS 2017-02-11 2359 |
2017-02-15 0800 | 2017-02-15 1800 | ICC Scheduled PM | Batch jobs and login nodes access |