Watch this page in the wiki to subscribe to automatic updates to this status page.
Current Status
NCSA Condo is down. |
---|
Include the keyword "issue" in updates above to trigger actions.
Report a problem
Upcoming Scheduled Maintenance
SWO
Start | End | What System/Service is affected? | What is happening? | What will be affected? |
---|---|---|---|---|
Previous Outages
Start | End | What System/Service was affected? | What happened? | What was affected? | Outcome |
---|---|---|---|---|---|
2017-04-19 16:54 | 2017-04-20 08:45 | gpfs01, iforge, cforge | Filled-up metadata disks on I/O servers caused failures on gpfs01. | iforge and cforge clusters, including all currently running jobs. | Scheduling on iForge and cForge was paused for the duration of the incident. Running jobs were killed. 13% metadata space was freed. Clusters were rebooted and scheduling resumed. |
2017-04-19 08:00 | 2017-04-19 13:00 | Campus Cluster | Merging xpacc data and /usr/local back to data01 (April PM) | Resource manager and Scheduler were unavailable during the maintenance. | Once again, /usr/local, /projects/xpacc and /home/<xpacc users> are mounting from data01. No more split cluster. |
2017-04-04 (1330) | 2017-04-04 (1600) | Networking | Some fiber cuts caused a routing loop inside the network of one of the campus ISPs. | Certain traffic that traversed this ISP would never reach its final destination. Some DNS lookups would have also failed. | Campus was able to route around the problem, and the ISP also corrected their internal problem. The cut fiber was restored last night. |
2017-03-28 (0000) | 2017-03-29 (1600) | LSST | NPCF Chilled Water Outage | LSST - Slurm cluster nodes will be offline during the outage. All other LSST systems are expected to remain operational. | No issues. Slurm nodes restarted. |
2017-03-28 (0000) | 2017-03-29 (0230) | Blue Waters | NPCF Chilled Water Outage | Full system shutdown on Blue Waters (except Sonexion, which is needed for fsck) | FSCK done on all Lustre file systems, XDP piping work done (no leakage found), software updates (PE, darshan) completed. |
2017-03-25 10:15PM | 2017-03-26 00:08AM | Blue Waters | BW scratch MDT failover, df hangs | BW scratch MDT failover; a load of 500+ on the MDS delayed the failover. Post-failover issues delayed RTS. | Scheduler was paused. |
2017-03-25 4pm | 2017-03-25 8pm | Blue Waters | BW login node ps hang | Rebooted h1-h3, lost bw/h2ologin DNS record, had neteng recreate the record. Had to rotate login nodes in and out of the round-robin until all were rebooted. User email sent (2). | Login nodes rebooted; DNS round-robin changes. |
2017-03-23 (1000) | 2017-03-23 (1500) | Nebula | NCSA Nebula Outage | Nebula will take an outage to balance and build a more stable setup for the file system. This will require a pause of all instances, and Horizon being unavailable. | File system online and stable. At this time all blocks were balanced and healed. |
2017-03-16 (0630) | 2017-03-16 (1130) | LSST | LSST monthly maintenance | GPFS filesystems will go offline for entire duration of outages. Some systems may be rebooted, especially those that mount one or more of the GPFS filesystems. | |
2017-03-15 15:11 | 2017-03-15 16:01 | Blue Waters | Failure on cabinet c9-7, affecting HSN. | Filesystem hung for several minutes. | Scheduler was paused for 50 minutes. Cabinet c9-7 was warm-swapped. Nodes on c9-7 are reserved for further diagnosis. |
2017-03-15 09:00 | 2017-03-15 12:47 | Campus Cluster | UPS work at ACB. Reshuffling electrical drops on 10k controllers, storage IB switches and some servers. | Scheduler will be paused for regular jobs. MWT2 and DES will continue to run on their nodes. | UPS work at ACB - incomplete (required additional parts). Redistributing power work done. Scheduler was paused for 3hrs 50 mins. |
2017-03-10 13:00 | 2017-03-10 18:00 | Campus Cluster | ICCP - We lost 10K controllers due to some type of power disturbance at ACB. | ICCP - Lost all filesystems, resulting in a cluster-wide outage. | Recovered missing LUNs and rebooted the cluster. Cluster was back in service at 18:00. |
2017-03-09 0900 | 2017-03-09 1500 | Roger | ROGER planned PM | batch, hadoop, data transfer services & Ambari | system out for 6hrs, DT services out until 0000 |
2017-03-08 19:41 | 2017-03-08 22:41 | Blue Waters | XDP powered off that served the four cabinets (c16-10, c17-10, c18-10, c19-10). | Scheduler paused, four racks power cycled. Moab required a restart: too many down nodes, and iterations were stuck. | Scheduler paused three hours |
2017-03-03 1700 | 2017-03-03 2200 | Blue Waters | BW hpss emergency outage to clean up db2 database | ncsa#nearline, stores are failing with cache full | Resolved cache full errors |
2017-02-28 1200 | 2017-02-28 1250 | Campus Cluster | ICC Resource Manager down | Users can't submit new jobs or start new jobs | Removed corrupted job file |
2017-02-22 1615 | 2017-02-22 1815 | Nebula | Nebula Gluster Issues | All Nebula instances paused while gluster repaired | Nebula is available. |
2017-02-11 1900 | 2017-02-11 2359 | NPCF | NPCF Power Hit | BW Lustre was down, xdp heat issues. | RTS 2017-02-11 2359 |
2017-02-15 0800 | 2017-02-15 1800 | Campus Cluster | ICC Scheduled PM | Batch jobs and login node access | |