Watch this page in the wiki to subscribe to automatic updates to this status page.

Current Status

Condo Issue

NFS partitions for the condo are currently marked down.

Impact: Nebula, UofI library

Problem: A UofI library admin reported an I/O error on a file; an online FSCK was started at 4pm on Friday, May 12. The online FSCK reported too many errors in a snapshot, and GPFS requested that an offline FSCK be done. That was started at 11pm on May 12, and the NFS partitions were taken offline to do this. The offline FSCK failed with too many errors, as did 5 others that were run throughout the night. The current errors are in a replicated partition, so a replication restripe was started on Saturday morning, May 13. Stay tuned for more information, including when the condo will be brought back online.

Include the keyword "issue" in updates above to trigger actions.

Report a problem

Upcoming Scheduled Maintenance

SWO

Start	End	What System/Service is affected	What is happening?	What will be affected?
5/11/2017	5/12/2017	ESnet 100G connection	NCSA and ESnet will be moving their 100G connection to a different location in Chicago.	We have several diverse high-speed paths to ESnet and DOE; traffic will be redirected to a secondary path.
5/20/2017 07:00	5/20/2017 13:00	Campus cluster and Active Data Storage (ADS)	Entire building power outage for ACB	All campus cluster nodes and storage. Entire ADS service and all connections to ACB

Previous Outages

Start	End	What System/Service was affected?	What happened?	What was affected?	Outcome
2017-05-11 06:45	2017-05-11 07:33	NCSA Jabber upgrade	Upgraded Openfire XMPP Jabber software	NCSA Jabber was unavailable during the upgrade.	Jabber was upgraded to the latest version of Openfire.
2017-05-09 07:00	2017-05-09 18:15	iForge, cForge, GPFS, License Servers	iForge/cForge Planned Maintenance	iForge/cForge systems, including the ability to submit/run jobs.	PM was completed early at 18:15.
2017-05-06 22:00	2017-05-06 23:00	NCSA Open Source	Upgrades of Atlassian software	NCSA Open Source BitBucket	BitBucket is upgraded.
2017-05-06 09:00	2017-05-06 10:00	NCSA Open Source	Upgrade of Atlassian Software	Most services hosted at NCSA Open Source were down for 5 minutes during rolling upgrades.	The following services were upgraded: HipChat, Bamboo, JIRA, Confluence, FishEye and CROWD.
2017-05-05 17:43	2017-05-05 20:02	ITS vSphere	A VM node panicked	Several VMs died when the node panicked and were restarted on other VM nodes. This included LDAP, JIRA, Help/RT, SMTP, Identity, and others.	All affected VMs were restarted on other VM nodes. Most restarted automatically.
2017-04-27 18:10	2017-04-27 18:55	Campus Cluster	Another GPFS interruption	Both Resource Manager and Scheduler went down, along with a handful of compute nodes.	Restarted the RM and Scheduler and rebooted all down nodes.
2017-04-27 13:11	2017-04-27 14:20	Nebula	glusterfs crashed due to this bug, so no instances could access their filesystems	All instances running on Nebula	Needed to reboot the node that systems were mounting from, but took the opportunity to upgrade all gluster clients on other systems while waiting for a reboot. Version 3.10.1 fixes the bug. All instances with errors in their logs were restarted.
2017-04-27 11:20	2017-04-27 12:45	Campus Cluster	GPFS interruption	Both Resource Manager and Scheduler went down.	Torque serverdb file was corrupted. Restored the file from that morning's snapshot and modified the data to match the current state.
2017-04-26 12:00	2017-04-26 18:30	Condo	A bug in deleting a disk partition from GPFS (a problem within GPFS).	DES, Condo partitions, and UofI Library.	Partitions had been up for 274 days, with many changes. The delete partition bug caused us to stop ALL operations on the condo and repair each disk through GPFS. Must have quarterly maintenance. Just too complicated to go a year without resetting things.
2017-04-19 16:54	2017-04-20 08:45	gpfs01, iforge, cforge	Filled-up metadata disks on I/O servers caused failures on gpfs01.	iforge and cforge clusters, including all currently running jobs.	Scheduling on iForge and cForge was paused for the duration of the incident. Running jobs were killed. 13% of metadata space was freed. Clusters were rebooted and scheduling resumed.
2017-04-19 08:00	2017-04-19 13:00	Campus Cluster	Merging xpacc data and /usr/local back to data01 (April PM)	Resource manager and Scheduler were unavailable during the maintenance.	Once again, /usr/local, /projects/xpacc and /home/<xpacc users> are mounting from data01. No more split cluster.
2017-04-04 (1330)	2017-04-04 (1600)	Networking	Some fiber cuts caused a routing loop inside the network of one of the campus ISPs.	Certain traffic that traversed this ISP would never reach its final destination. Some DNS lookups would also have failed.	Campus was able to route around the problem, and the ISP also corrected their internal problem. The cut fiber was restored last night.
2017-03-28 (0000)	2017-03-29 (1600)	LSST	NPCF Chilled Water Outage	LSST - Slurm cluster nodes will be offline during the outage. All other LSST systems are expected to remain operational.	No issues. Slurm nodes restarted.
2017-03-28 (0000)	2017-03-29 (0230)	Blue Waters	NPCF Chilled Water Outage	Full system shutdown on Blue Waters (except Sonexion, which is needed for fsck)	FSCK done on all Lustre file systems, XDP piping work done (no leakage found), software updates (PE, darshan) completed.
2017-03-25 10:15PM	2017-03-26 00:08AM	Blue Waters	BW scratch MDT failover, df hangs	BW scratch MDT failover; load on the MDS was 500+, which delayed failover. Post-failover there were some issues that delayed RTS.	Scheduler was paused.
2017-03-25 4pm	2017-03-25 8pm	Blue Waters	BW login node ps hang	Rebooted h1-h3, lost bw/h2ologin DNS record, had neteng recreate the record. Had to rotate login nodes in and out of the round-robin until all were rebooted. User email sent (2).	Login nodes rebooted; DNS round-robin changes.
2017-03-23 (1000)	2017-03-23 (1500)	Nebula	NCSA Nebula Outage	Nebula will take an outage to balance and build a more stable setup for the file system. This will require a pause of all instances, and Horizon being unavailable.	File system online and stable. At this time all blocks were balanced and healed.
2017-03-16 (0630)	2017-03-16 (1130)	LSST	LSST monthly maintenance	GPFS filesystems will go offline for entire duration of outages. Some systems may be rebooted, especially those that mount one or more of the GPFS filesystems.
2017-03-15 15:11	2017-03-15 16:01	Blue Waters	Failure on cabinet c9-7, affecting HSN.	Filesystem hung for several minutes.	Scheduler was paused for 50 minutes. Warmswap cabinet c9-7. Nodes on c9-7 are reserved for further diagnosis.
2017-03-15 09:00	2017-03-15 12:47	Campus Cluster	UPS work at ACB. Reshuffling electrical drops on 10k controllers, storage IB switches and some servers.	Scheduler will be paused for regular jobs. MWT2 and DES will continue to run on their nodes.	UPS work at ACB - incomplete (required additional parts). Redistributing power work done. Scheduler was paused for 3hrs 50 mins.
2017-03-10 13:00	2017-03-10 18:00	Campus Cluster	ICCP - We lost 10K controllers due to some type of power disturbance at ACB.	ICCP - Lost all filesystems, resulting in a cluster-wide outage.	Recovered missing LUNs and rebooted the cluster. Cluster was back in service at 18:00.
2017-03-09 0900	2017-03-09 1500	Roger	ROGER planned PM	batch, hadoop, data transfer services & Ambari	system out for 6hrs, DT services out until 0000
2017-03-08 19:41	2017-03-08 22:41	Blue Waters	XDP powered off that served the four cabinets (c16-10, c17-10, c18-10, c19-10).	Scheduler paused, four racks power cycled. Moab required a restart; too many nodes were down and iterations were stuck.	Scheduler paused for three hours.
2017-03-03 1700	2017-03-03 2200	Blue Waters	BW hpss emergency outage to clean up db2 database	ncsa#nearline, stores are failing with cache full	Resolved cache full errors
2017-02-28 1200	2017-02-28 1250	Campus Cluster	ICC Resource Manager down	Users can't submit new jobs or start new jobs	Removed corrupted job file
2017-02-22 1615	2017-02-22 1815	Nebula	Nebula Gluster Issues	All Nebula instances paused while gluster repaired	Nebula is available.
2017-02-11 1900	2017-02-11 2359	NPCF	NPCF Power Hit	BW Lustre was down, xdp heat issues.	RTS 2017-02-11 2359
2017-02-15 0800	2017-02-15 1800	Campus Cluster	ICC Scheduled PM	Batch jobs and login nodes access
