Include the keyword "issue" in updates above to trigger actions.
Upcoming Scheduled Maintenance
|Start||End||What System/Service is affected?||What is happening?||What will be affected?|
Recent Outages and Completed Maintenance
|Start||End||What System/Service was affected?||What happened?||What was affected?||Outcome|
|2017-07-25 02:36||2017-07-25 18:00||Campus Cluster / Scheduler down||A blip on mgmt1 caused a GPFS drop and the scheduler to crash||Scheduler offline||The scheduler is still taking a long time to initialize, but jobs can start and run as usual. Opened a case with Adaptive.|
|2017-07-20 09:00||2017-07-20 17:00||ROGER Ambari and OpenStack||Updates to the OpenStack control node and the Ambari cluster||Ambari nodes (cg-hm08 - cg-hm18), OpenStack instances and servers||OpenStack was back in service on time. Ambari had issues mounting HDFS and was held out of service; HDFS was remounted on 25 July.|
|2017-07-20 06:00||2017-07-20 10:00||All NCSA hosted LSST resources||Monthly OS patches (addressing issues including CESA-2017:1615 and CESA-2017:1680). Roll out updated puppet modules. Batch nodes received firmware updates.||All nodes in NCSA 3003 and NPCF (batch nodes) will reboot.||Overall success. Exceptions: verify-worker31 failed a firmware update and is out of commission (LSST-914), and there are connectivity issues for some VMs used by the NCSA DM team (IHS-365). adm01, backup01, and test[09-10] will be patched in the near future.|
|2017-07-19 08:00||2017-07-19 14:44||Campus Cluster||July maintenance (applied security patch)||Cluster wide, except mwt2 nodes||Applied new kernel, glibc, and bind patches and the newest NVIDIA driver.|
|2017-06-30 0000||Blue Waters||Emergency maintenance to apply a security patch addressing the Stack Guard security vulnerability.||Compute, Login, and Scheduler were offline.||Kernel and glibc library patched on all affected systems.|
|2017-06-22 0800||2017-06-22 1200||All NCSA hosted LSST resources||CRITICAL kernel and package updates to address the Stack Guard Page security vulnerability.||Systems will be patched and rebooted.||The outage was extended past 1000, until 1200. Systems were successfully patched as planned except for qserv-db12 and qserv-db27, which will not boot. We will follow up on those with a ticket.|
|2017-06-22 0800||2017-06-22 0930||LSST cluster nodes (verify-worker*, qserv*, sui*, bastion01, test*, backup01)||Deploy Unbound (local caching DNS resolver)||DNS resolving may have a short (~30 mins) delay.||Successfully deployed and all tests (including reverse DNS and intra-cluster SSH) pass.|
|Blue Waters||XDP shut down, causing an EPO on cabinets c1-7 and c2-7.||The scheduler was paused to isolate the failing components, then resumed.||Warmswapped the failing components and returned them to service.|
|NCSA Open Source||Security upgrade needed for Bamboo; will also update the following components: Bamboo, JIRA, Confluence, BitBucket, FishEye||Most of the subcomponents of NCSA Open Source will be down for a short time while the software is updated.||Upgraded Bamboo, JIRA, Confluence, BitBucket, and FishEye to the latest versions.|
|ROGER OpenStack NFS backend failed and was restarted||The primary CES server for the OpenStack backend failed and tried to fail over to the secondary server, which also failed. SET was notified and had the CES NFS service back up by 1100.||The ROGER OpenStack dashboard went down and needed a restart. Several VMs experienced "virtual drive errors" and will need to be restarted.||SET is still investigating the cause of the GPFS CES service failover. CyberGIS is working with their users to get the affected VMs restarted.|
|2017-06-15 0800||2017-06-15 0930||LSST cluster nodes (verify-worker*, qserv*, sui*, bastion01, test*, backup01)||Deploy Unbound (local caching DNS resolver)||DNS resolving may have a short (~30 mins) delay.||Updates deployed successfully via a new puppet module; all tests passed. EDIT 2017-06-15 1500: Reverse DNS was not working, which broke SSH to qserv* nodes. Disabled Unbound.|
|Network Core Switch||Network Engineering will be replacing a line card in one of our core switches due to a hardware issue.||All services should remain active. Any affected switch will have a second redundant link to the other core to pass traffic.||The line card was successfully replaced.|
|2017-06-08 12:00||2017-06-11 22:20||Campus Cluster (scheduler paused)||Disk Enclosure 3 failure on DDN 10K.||Lost redundancy, which forced us to drain the cluster.||Repair/replacement of the controller can be time-consuming, so we took action to rebalance data out of the failed enclosure. The scheduler was resumed as of 22:00.|
|2017-06-07 12:42||NCSA LDAP||The NCSA LDAP service crashed||NCSA LDAP service was unavailable||LDAP software and OS were updated and server rebooted. LDAP is working normally.|
|2017-05-31 20:06||2017-05-31 20:36||NCSA LDAP||The NCSA LDAP service was timing out||NCSA LDAP service was unavailable||The root cause of LDAP timeouts is still being investigated.|
|2017-05-22||2017-05-26||Campus Cluster VMs||Network issue with ESXi (hypervisor) boxes after maintenance||Could no longer log in to start VMs. The license server, Nagios, and all MWT2 VMs were down.||The issue was fixed on 5/24; license and Nagios services were restored the same day. Moved MWT2 VMs to the Campus Farm. All VMs returned to service as of noon 5/26.|
|2017-05-12||2017-05-18||Condo/NFS partitions only||The NFS partition for the condo became extremely unstable after a replication (normal daily maintenance) completed. Many iterations of FSCK, with IBM on the phone, got it resolved, followed by 1.5 days of restoring files that had been placed in lost+found.||The UofI Library was switched to the read-only version on the ADS during this time.||The root cause is still being investigated.|
|2017-05-23 14:05||2017-05-23 14:13||NCSA LDAP||The NCSA LDAP service was timing out||NCSA LDAP service was unavailable||The issue is still being investigated, but the service has been stable since the incident.|
|2017-05-22 15:41||2017-05-22 15:51||idp.ncsa.illinois.edu||Apache Tomcat out of memory||InCommon/SAML IdP and OIDC authentication services were unavailable.||Service restored by failing over to the secondary server while memory is being increased on the primary server.|
|DES nodes on Campus Cluster||Could not communicate outside the switch||All nodes connected to the switch in POD22 Rack2 @ACB||Upgrading the code on the switch resolved the issue.|
|2017-05-20 05:00||2017-05-20 21:09||Campus Cluster and Active Data Storage (ADS)||Total power outage at ACB||All systems currently reside at ACB||Power was restored around 13:00. We rotated the ADS rack to align with the Campus Cluster storage rack and changed a couple of VLAN IDs to match campus for the future merger. ESXi boxes were down due to a configuration error after reboot. No major issues in the FSCK output from scratch02.|
|05/17/2017 02:00||05/17/2017 10:45||Internet2 WAN connectivity||Intermittent WAN connectivity. The outage was a result of Tech Services' DWDM system, which provides us with our physical optical path up to Chicago via the ICCN. Specifically, the Adva card that our 100G wave is on was seeing strange errors, which was causing input framing errors for traffic coming in on this interface.||General WAN connectivity to XSEDE sites, certain commodity routes, and other I2 AL2S connections.||The Adva card was rebooted and we stopped seeing the input framing errors. Tech Services is working with Adva to find the root cause of the issues on the card.|
|2017-05-11||2017-05-12||ESnet 100G connection||NCSA and ESnet will be moving their 100G connection to a different location in Chicago.||We have several diverse high-speed paths to ESnet and DOE; traffic will be redirected to a secondary path.|
|NCSA Jabber upgrade||Upgraded the Openfire XMPP jabber software||NCSA Jabber was unavailable during the upgrade.||Jabber was upgraded to the latest version of Openfire.|
|iForge, cForge, GPFS, License Servers||iForge/cForge planned maintenance||iForge/cForge systems, including the ability to submit/run jobs.||PM was completed early, at 1815.|
|2017-05-06 22:00||2017-05-06 23:00||NCSA Open Source||Upgrades of Atlassian software||NCSA Open Source BitBucket||BitBucket is upgraded.|
|2017-05-06 09:00||2017-05-06 10:00||NCSA Open Source||Upgrade of Atlassian Software||Most services hosted at NCSA Open Source were down for 5 minutes during rolling upgrades.||The following services were upgraded: HipChat, Bamboo, JIRA, Confluence, FishEye and CROWD.|
|2017-05-05 17:43||2017-05-05 20:02||ITS vSphere||A VM node panicked||Several VMs died when the node panicked and were restarted on other VM nodes. This included LDAP, JIRA, Help/RT, SMTP, Identity, and others.||All affected VMs were restarted on other VM nodes. Most restarted automatically.|
|2017-04-27 18:10||2017-04-27 18:55||Campus Cluster||Another GPFS interruption||Both the Resource Manager and Scheduler went down, along with a handful of compute nodes.||Restarted the RM and Scheduler and rebooted all down nodes.|
|2017-04-27 13:11||2017-04-27 14:20||Nebula||glusterfs crashed due to a known bug, so no instances could access their filesystems||All instances running on Nebula||Needed to reboot the node that systems were mounting from, and took the opportunity to upgrade all gluster clients on other systems while waiting for the reboot. Version 3.10.1 fixes the bug. All instances with errors in their logs were restarted.|
|2017-04-27 11:20||2017-04-27 12:45||Campus Cluster||GPFS interruption||Both the Resource Manager and Scheduler went down.||The Torque serverdb file was corrupted. Restored the file from this morning's snapshot and modified the data to match the current state.|
|2017-04-26 12:00||2017-04-26 18:30||Condo||A bug within GPFS in the deletion of a disk partition.||DES, Condo partitions, and the UofI Library.||The partitions had been up for 274 days, through many changes. The delete-partition bug forced us to stop ALL operations on the condo and repair each disk through GPFS. Quarterly maintenance is a must; it is simply too complicated to go a year without resetting things.|
|2017-04-19 16:54||2017-04-20 08:45||gpfs01, iForge, cForge||Filled-up metadata disks on I/O servers caused failures on gpfs01.||iForge and cForge clusters, including all currently running jobs.||Scheduling on iForge and cForge was paused for the duration of the incident, and running jobs were killed. 13% of metadata space was freed; the clusters were rebooted and scheduling resumed.|
|2017-04-19 08:00||2017-04-19 13:00||Campus Cluster||Merging xpacc data and /usr/local back to data01 (April PM)||Resource manager and Scheduler were unavailable during the maintenance.||Once again, /usr/local, /projects/xpacc and /home/<xpacc users> are mounting from data01. No more split cluster.|
|2017-04-04 (1330)||2017-04-04 (1600)||Networking||Fiber cuts caused a routing loop inside one of the campus ISPs' networks.||Certain traffic that traversed this ISP would never reach its final destination. Some DNS lookups would also have failed.||Campus was able to route around the problem, and the ISP corrected their internal problem. The cut fiber was restored last night.|
|2017-03-28 (0000)||2017-03-29 (1600)||LSST||NPCF Chilled Water Outage||LSST - Slurm cluster nodes will be offline during the outage. All other LSST systems are expected to remain operational.||No issues. Slurm nodes restarted.|
|2017-03-28 (0000)||2017-03-29 (0230)||Blue Waters||NPCF Chilled Water Outage||Full system shutdown on Blue Waters (except Sonexion which is needed for fsck)||FSCK done on all lustre file systems, XDP piping works done (no leakage found), Software updates (PE, darshan) completed.|
|Blue Waters||BW scratch MDT failover; df hangs||Load on the MDS was 500+, which delayed the failover. Post-failover issues further delayed return to service.||The scheduler was paused.|
|Blue Waters||BW login node ps hang||Rebooted h1-h3; lost the bw/h2ologin DNS record and had NetEng recreate it. Had to rotate login nodes in and out of the round-robins until all were rebooted. User email sent (2).||Login nodes rebooted; DNS round-robin changes made.|
|2017-03-23 (1000)||2017-03-23 (1500)||Nebula||NCSA Nebula outage||Nebula will take an outage to rebalance and build a more stable setup for the file system. This requires pausing all instances, and Horizon will be unavailable.||File system online and stable. All blocks were balanced and healed.|
|2017-03-16 (0630)||2017-03-16 (1130)||LSST||LSST monthly maintenance||GPFS filesystems will go offline for entire duration of outages. Some systems may be rebooted, especially those that mount one or more of the GPFS filesystems.|
|Blue Waters||Failure on cabinet c9-7, affecting the HSN.||Filesystem hung for several minutes.||The scheduler was paused for 50 minutes. Cabinet c9-7 was warmswapped; its nodes are reserved for further diagnosis.|
|2017-03-15 09:00||2017-03-15 12:47||Campus Cluster||UPS work at ACB: reshuffling electrical drops on 10K controllers, storage IB switches, and some servers.||The scheduler will be paused for regular jobs. MWT2 and DES will continue to run on their nodes.||UPS work at ACB was incomplete (required additional parts). The power redistribution work was done. The scheduler was paused for 3 hrs 50 min.|
|2017-03-10 13:00||2017-03-10 18:00||Campus Cluster||ICCP - We lost the 10K controllers due to some type of power disturbance at ACB.||ICCP - Lost all filesystems; a cluster-wide outage.||Recovered the missing LUNs and rebooted the cluster. The cluster was back in service at 18:00.|
|2017-03-09 0900||2017-03-09 1500||ROGER||ROGER planned PM||Batch, Hadoop, data transfer services & Ambari||System out for 6 hrs; data transfer services out until 0000.|
|2017-03-08 19:41||2017-03-08 22:41||Blue Waters||The XDP serving four cabinets (c16-10, c17-10, c18-10, c19-10) powered off.||Scheduler paused; the four racks were power cycled.||Moab required a restart: too many nodes were down and iterations were stuck.|
|2017-03-03 1700||2017-03-03 2200||Blue Waters||BW HPSS emergency outage to clean up the DB2 database||ncsa#nearline stores were failing with cache-full errors||Resolved the cache-full errors.|
|2017-02-28 1200||2017-02-28 1250||Campus Cluster||ICC Resource Manager down||Users could not submit or start new jobs||Removed a corrupted job file.|
|2017-02-22 1615||2017-02-22 1815||Nebula||Nebula Gluster issues||All Nebula instances were paused while Gluster was repaired||Nebula is available.|
|2017-02-11 1900||2017-02-11 2359||NPCF||NPCF power hit||BW Lustre was down; XDP heat issues.||RTS 2017-02-11 2359|
|2017-02-15 0800||2017-02-15 1800||Campus Cluster||ICC scheduled PM||Batch jobs and login node access|