...

Start	End	What is happening?	What will be affected?
2017-03-16 (0630)	2017-03-16 (1130)	LSST monthly maintenance	GPFS filesystems will go offline for entire duration of outages. Some systems may be rebooted, especially those that mount one or more of the GPFS filesystems.
2017-03-23 (1000)	2017-03-23 (1500)	NCSA Nebula Outage	Nebula will take an outage to balance and build a more stable setup for the file system. This will require a pause of all instances, and Horizon will be unavailable.

...

Start	End	What happened?	What was affected?	Outcome
2017-03-16 (0630)	2017-03-16 (1130)	LSST monthly maintenance	GPFS filesystems will go offline for entire duration of outages. Some systems may be rebooted, especially those that mount one or more of the GPFS filesystems.


2017-03-15 15:11	2017-03-15 16:01	Failure on cabinet c9-7, affecting HSN.	Filesystem hung for several minutes.	Scheduler was paused for 50 minutes. Warm-swapped cabinet c9-7. Nodes on c9-7 are reserved for further diagnosis.
2017-03-15 09:00	2017-03-15 12:47	UPS work at ACB. Reshuffling electrical drops on 10K controllers, storage IB switches and some servers.	Scheduler will be paused for regular jobs. MWT2 and DES will continue to run on their nodes.	UPS work at ACB incomplete (required additional parts). Power redistribution work done. Scheduler was paused for 3 hrs 50 mins.
2017-03-10 13:00	2017-03-10 18:00	ICCP - We lost 10K controllers due to some type of power disturbance at ACB.	ICCP - Lost all filesystems, resulting in a cluster-wide outage.	Recovered missing LUNs and rebooted the cluster. Cluster was back in service at 18:00.
2017-03-09 0900	2017-03-09 1500	ROGER planned PM	batch, hadoop, data transfer services & Ambari	system out for 6hrs, DT services out until 0000
2017-03-08 19:41	2017-03-08 22:41	XDP powered off that served the four cabinets (c16-10, c17-10, c18-10, c19-10).	Scheduler paused, four racks power cycled. Moab required a restart; too many down nodes and iterations were stuck.	Scheduler paused three hours
2017-03-03 1700	2017-03-03 2200	BW hpss emergency outage to clean up db2 database	ncsa#nearline, stores are failing with cache full	Resolved cache full errors
2017-02-28 1200	2017-02-28 1250	ICC Resource Manager down	Users can't submit new jobs or start new jobs	Removed corrupted job file
2017-02-22 1615	2017-02-22 1815	Nebula Gluster Issues	All Nebula instances paused while Gluster repaired	Nebula is available.
2017-02-11 1900	2017-02-11 2359	NPCF Power Hit	BW Lustre was down, XDP heat issues.	RTS 2017-02-11 2359
2017-02-15 0800	2017-02-15 1800	ICC Scheduled PM	Batch jobs and login nodes access
