| Start | End | What System/Service was affected? | What happened? | What was affected? | Outcome |
|---|---|---|---|---|---|
| 2017-04-19 08:00 | 2017-04-19 13:00 | Campus Cluster | Merging xpacc data and /usr/local back to data01 (April PM) | Resource manager and scheduler were unavailable during the maintenance. | Once again, /usr/local, /projects/xpacc and /home/<xpacc users> are mounting from data01. No more split cluster. |
| 2017-04-04 (1330) | 2017-04-04 (1600) | Networking | Some fiber cuts caused a routing loop inside the network of one of the campus ISPs. | Certain traffic that traversed this ISP would never reach its final destination. Some DNS lookups would also have failed. | Campus was able to route around the problem, and the ISP also corrected their internal problem. The cut fiber was restored last night. |
| 2017-03-28 (0000) | 2017-03-29 (1600) | LSST | NPCF Chilled Water Outage | LSST - Slurm cluster nodes will be offline during the outage. All other LSST systems are expected to remain operational. | No issues. Slurm nodes restarted. |
| 2017-03-28 (0000) | 2017-03-29 (0230) | Blue Waters | NPCF Chilled Water Outage | Full system shutdown on Blue Waters (except Sonexion, which is needed for fsck) | FSCK done on all Lustre file systems, XDP piping work done (no leakage found), software updates (PE, darshan) completed. |
| 2017-03-25 10:15PM | 2017-03-26 00:08AM | Blue Waters | BW scratch MDT failover, df hangs | BW scratch MDT failover; load on the MDS was 500+, which delayed failover. Post-failover there were some issues that delayed RTS. | Scheduler was paused |
| 2017-03-25 4pm | 2017-03-25 8pm | Blue Waters | BW login node ps hang | Rebooted h1-h3, lost bw/h2ologin DNS record, had neteng recreate the record. Had to rotate login nodes in and out of round-robins until all were rebooted. User email sent (2). | Login nodes rebooted. DNS round-robin changes. |
| 2017-03-23 (1000) | 2017-03-23 (1500) | Nebula | NCSA Nebula Outage | Nebula will take an outage to balance and build a more stable setup for the file system. This will require a pause of all instances, and Horizon will be unavailable. | File system online and stable. At this time all blocks were balanced and healed. |
| 2017-03-16 (0630) | 2017-03-16 (1130) | LSST | LSST monthly maintenance | GPFS filesystems will go offline for the entire duration of the outage. Some systems may be rebooted, especially those that mount one or more of the GPFS filesystems. | |
| 2017-03-15 15:11 | 2017-03-15 16:01 | Blue Waters | Failure on cabinet c9-7, affecting HSN. | Filesystem hung for several minutes. | Scheduler was paused for 50 minutes. Warmswap cabinet c9-7. Nodes on c9-7 are reserved for further diagnosis. |
| 2017-03-15 09:00 | 2017-03-15 12:47 | Campus Cluster | UPS work at ACB. Reshuffling electrical drops on 10k controllers, storage IB switches and some servers. | Scheduler will be paused for regular jobs. MWT2 and DES will continue to run on their nodes. | UPS work at ACB - incomplete (required additional parts). Power redistribution work done. Scheduler was paused for 3hrs 50 mins. |
| 2017-03-10 13:00 | 2017-03-10 18:00 | Campus Cluster | ICCP - We lost 10K controllers due to some type of power disturbance at ACB. | ICCP - Lost all filesystems; it's a cluster-wide outage. | Recovered missing LUNs and rebooted the cluster. Cluster was back in service at 18:00. |
| 2017-03-09 0900 | 2017-03-09 1500 | Roger | ROGER planned PM | batch, hadoop, data transfer services & Ambari | System out for 6hrs, DT services out until 0000 |
| 2017-03-08 19:41 | 2017-03-08 22:41 | Blue Waters | The XDP that served four cabinets (c16-10, c17-10, c18-10, c19-10) powered off. | Scheduler paused, four racks power cycled. Moab required a restart: too many nodes were down and iterations were stuck. | Scheduler paused for three hours |
| 2017-03-03 1700 | 2017-03-03 2200 | Blue Waters | BW hpss emergency outage to clean up db2 database | ncsa#nearline, stores are failing with cache full | Resolved cache full errors |
| 2017-02-28 1200 | 2017-02-28 1250 | Campus Cluster | ICC Resource Manager down | Users can't submit new jobs or start new jobs | Removed corrupted job file |
| 2017-02-22 1615 | 2017-02-22 1815 | Nebula | Nebula Gluster Issues | All Nebula instances paused while gluster was repaired | Nebula is available. |
| 2017-02-11 1900 | 2017-02-11 2359 | NPCF | NPCF Power Hit | BW Lustre was down, xdp heat issues. | RTS 2017-02-11 2359 |
| 2017-02-15 0800 | 2017-02-15 1800 | Campus Cluster | ICC Scheduled PM | Batch jobs and login node access | |