Start | End | What System/Service is affected? | What is happening? | What will be affected?
2018-01-18 13:00 | 2018-01-18 17:00 | ISDA Hypervisors, NCSA Open Source | Hypervisor updates. If this update is not possible on Monday, it will happen on Thursday. | All VMs will be shut down for a short time (postponed to Thursday).


Previous Outages or Maintenance

Start | End | What System/Service was affected? | What happened? | What was affected? | Outcome
2018-01-18 00:00 | 2018-01-18 24:00 | Campus Cluster | Copying all data to a new filesystem. Deploying new storage (14K). Dividing the cluster into two (IB & Ethernet). Upgrading GPFS to 4.2.3.6. Deploying a new management node and a new image server (if time permits). Applying security patches to compute nodes (no firmware update at this time). | All systems unavailable. | The new storage system was brought online; additional capacity and performance were added.
2018-01-18 13:00 | 2018-01-18 17:00 | ISDA Hypervisors, NCSA Open Source | Hypervisor updates. If this update is not possible on Monday, it will happen on Thursday. | All VMs shut down for a short time (postponed to Thursday).


2018-01-18 18:40 | 2018-01-18 23:00 | LSST | LSST firewall outage in NPCF. Both pfSense firewalls were accidentally powered off. | PDAC (Qserv & SUI) and the verification clusters were inaccessible, and GPFS issues were introduced across many services, e.g. lsst-dev01. | The pfSense firewall appliances were power cycled and services restored.
2018-01-18 12:58 | 2018-01-18 14:10 | Code42 CrashPlan backup system | The Code42 CrashPlan servers were upgraded to the latest JDK and Code42 6.5.2. | Clients were unable to perform restores or push files into the backup archive from roughly 13:35 to 13:55. | The Code42 servers are now running the latest security updates to the CrashPlan service.
2018-01-18 08:00 | 2018-01-18 10:00 | LSST | Monthly OS updates, network switch updates, firmware updates, etc. | All dev systems unavailable. Qserv and SUI nodes will remain available. | Complete
2018-01-17 10:35 | 2018-01-17 13:00 | RSA Authentication Manager servers | Upgraded to Authentication Manager 8.1 SP1 P7. | No systems should have seen any impact. | Latest security patches are applied.
2018-01-12 06:00 | 2018-01-12 10:00 | Decommission NCSA Rocket.chat | The old NCSA Rocket.chat service was shut down. | Any archived conversations or content are no longer available to users. | The NCSA Rocket.chat service was shut down and redirected to NCSA @ Illinois Slack.
Friday, Jan 12th, 0000 CST | Friday, Jan 12th, 0600 CST | Internet2 | Engineers from Internet2 will be migrating our BGP peering with I2's Commercial Peering Service (CPS) to a new location. Small disruptions may occur with the maintenance for the CPS service, but no user traffic disruptions should occur. | None; alternative routes are present. | Maintenance was completed successfully.
2018-01-11 08:00 | 2018-01-11 13:30 | LSST | Critical patches on lsst-dev systems (incl. kernel updates). | All systems unavailable. | Complete
Thursday, Jan 11th, 0000 CST | Thursday, Jan 11th, 0400 CST | Connectivity to Internet2 and backup LHCONE peerings (ICCP and MWT2, respectively) | Engineers from Internet2 performed maintenance that affected certain BGP peerings on CARNE, the upstream router for ICCP/MWT2. Specifically, both the 100G Internet2 peering and the Internet2 LHCONE peering on CARNE were disrupted during this timeframe. MWT2 currently reaches LHCONE through CARNE's ESnet peering, which was fully functional, and was able to reach UChicago through CARNE's OmniPoP 100G peering. For ICCP, traffic to/from Internet2 routes rerouted through the ICCN. | Neither ICCP nor MWT2 reported any service impact from this maintenance. | Maintenance was completed successfully.
2018-01-08 10:47 | 2018-01-08 11:30 | Nebula | Storage nodes lost networking. | All Nebula instances. | Storage nodes were brought back online and instances were rebooted.
2018-01-02 09:00 | 2018-01-05 17:00 | Nebula | Nebula was shut down for hardware and software maintenance from January 2, 2018 at 9am until January 5, 2018 at 5pm. Spectre and Meltdown patches were applied, as well as all firmware updates, OS/distribution updates, and a filesystem upgrade. | All systems were unavailable. | Faster system that is now homogeneous, so OpenStack upgrades are now possible.
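Since the entry above covers Spectre and Meltdown patching, a minimal sketch of how one might verify the mitigations on a patched node follows; it assumes a kernel new enough to expose the sysfs vulnerability files, which is not guaranteed on older builds.

```python
#!/usr/bin/env python3
"""Minimal sketch: report Spectre/Meltdown mitigation status on a patched node.

Assumes a kernel that exposes /sys/devices/system/cpu/vulnerabilities (present on
kernels that received the 2018 fixes); older kernels simply lack this directory.
"""
import os

VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"

def mitigation_report() -> None:
    if not os.path.isdir(VULN_DIR):
        print("kernel does not expose vulnerability status (likely unpatched or too old)")
        return
    for name in sorted(os.listdir(VULN_DIR)):
        # Each file holds a one-line status such as "Mitigation: PTI" or "Vulnerable".
        with open(os.path.join(VULN_DIR, name)) as fh:
            print(f"{name}: {fh.read().strip()}")

if __name__ == "__main__":
    mitigation_report()
```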
2018-01-04 17:00 | 2018-01-05 20:00 | Blue Waters | One OST hosting the home file system had three drives fail simultaneously. | The portion of the home file system with data on the affected OST was not accessible. | Repair work was carried out on the failed OST. The scheduler continued to operate but allowed only jobs not affected by the failed OST to start. Full operation resumed after successful recovery of the failed OST.
2017-12-20 08:00 | 2017-12-20 10:00 | LSST | (1) Firewall maintenance (08:00-09:00) and (2) migration of NFS services (08:00-10:00). | Firewall maintenance: no noticeable effect expected, but the scope includes most systems at NPCF (including PDAC, SUI, and Slurm/batch/verify nodes). Migration of NFS services: SUI and lsst-demo* nodes. | Maintenance completed without issues.
2017-12-14 06:00 | 2017-12-14 20:30 | LSST | Monthly OS updates, network switch updates, firmware updates, etc. | All systems unavailable. | All systems back online. We ran into issues with policy-based routing on the LSST aggregate switches in NPCF that extended the outage longer than planned.
2017-12-13 09:00 | 2017-12-13 11:00 | JIRA upgrade | Upgraded JIRA from version 7.0 to 7.6. | NCSA JIRA. | Successfully upgraded.
2017-12-13 06:30 | 2017-12-13 07:39 | NCSA Jabber | Attempted to upgrade the Openfire XMPP jabber software. | NCSA Jabber was unavailable during the upgrade. | The upgrade failed. Jabber is available but still running the old version; the upgrade will be rescheduled.
2017-12-11 10:00 | 2017-12-11 16:00 | Unused AFS fileservers | After moving all volumes to the servers updated on 2017-12-07, the now-unused AFS servers were upgraded to OpenAFS 1.6.22. | No impact to other systems, as the servers were unused at the time they were upgraded. | All of NCSA's AFS cell is running OpenAFS 1.6.22.
2017-12-09 03:00 | 2017-12-09 07:42 | Blue Waters portal | The Blue Waters portal software crashed. Automated monitoring processes did not restart it correctly. | The Blue Waters portal website was unavailable. | The Blue Waters portal service was manually restarted and the website is available.
2017-12-09 10:00 | 2017-12-09 14:00 | Globus Online (Globus.org) | The Globus service was unavailable on Saturday, December 9, 2017, between 10:00am and 2:00pm CST for scheduled upgrades. Active file transfers were suspended during this time and resumed when the Globus service was restored. Users trying to access the service at globus.org (or an institution's branded Globus website) saw a maintenance page until the service was restored. | All NCSA Globus endpoints.
2017-12-07 | 2017-12-07 | Unused AFS file servers | Three unused AFS fileservers were upgraded to the latest 1.6.22 release of OpenAFS. | No impact to other systems, as they were unused. | These AFS fileservers can no longer be crashed by malicious clients.
2017-12-07 | 2017-12-07 | AFS database servers | The three database servers were upgraded to the latest 1.6.22 release of OpenAFS. | No modern clients noticed the staggered updates. | These servers can no longer be crashed by malicious clients.
2017-12-05 16:00 | 2017-12-05 16:20 | dhcp.ncsa.illinois.edu | NCSA NetEng migrated the DHCP server VM to the Security team's VMware infrastructure. | Hosts on the NCSAnet wireless network and any activated hosts in roaming range might have been impacted. Illinoisnet and Illinois_Guest wireless remained available at all times, and wired network connections remained available throughout the maintenance window. | Maintenance was completed successfully and services are running as expected.
2017-12-02 09:30 | 2017-12-02 11:45 | NCSA Opensource | Upgrade of Bamboo, JIRA, Confluence, BitBucket, FishEye, and Crowd. | Sub-services of opensource could be down for a short time. | All services upgraded and running as normal.
2017-11-20 18:21 | 2017-11-29 14:30 | ROGER OpenStack cluster | I/O issues highlighted that GPFS CES NFS servers probably shouldn't run 400+ days without a reboot. | ROGER's OpenStack and the various services hosted therein, including the JupyterHub server. | Rebooting all nodes, including the CES servers and all hypervisors, cleared most of the problems (one node required an fsck and a second reboot, and another node/hypervisor is still unavailable). I/O contention was felt as many instances attempted to start/restart simultaneously. Instances housed on the unavailable node are being migrated to another hypervisor.
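The root cause noted above was long-running CES NFS servers; a minimal sketch of an uptime check that could flag such nodes is below. The 400-day threshold is only an example taken from this entry, not an operational policy.

```python
#!/usr/bin/env python3
"""Minimal sketch: warn when a Linux node has been up longer than a reboot window allows."""

MAX_UPTIME_DAYS = 400  # example threshold drawn from the incident above, not a policy

def uptime_days() -> float:
    # /proc/uptime holds seconds since boot as its first field.
    with open("/proc/uptime") as fh:
        seconds = float(fh.read().split()[0])
    return seconds / 86400.0

if __name__ == "__main__":
    days = uptime_days()
    if days > MAX_UPTIME_DAYS:
        print(f"WARNING: node up {days:.0f} days, exceeds {MAX_UPTIME_DAYS}-day reboot window")
    else:
        print(f"OK: node up {days:.0f} days")
```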
2017-11-21 09:00 | 2017-11-22 14:00 | Open Source, ISDA servers | Update of the fileserver that hosts the VMs and of all the XEN servers. | NCSA Open Source unavailable; most ISDA servers unavailable. | Network issues delayed the updates. All hosts updated and everything back to normal.
2017-11-21 16:00 | 2017-11-21 16:40 | Code42 CrashPlan | The Code42 CrashPlan infrastructure was upgraded to version 6.5.1 to apply security and performance improvements. | Clients transparently reconnected to the servers after they restarted. | Now running Code42 version 6.5.1.
2017-11-20 09:00 | 2017-11-20 16:38 | Nebula OpenStack cluster | The Nebula OpenStack cluster was unavailable for emergency hardware maintenance. A failing RAID controller in one of the storage nodes and a network switch were replaced. | Not all instances were impacted. Running Nebula instances that were affected by the outage were shut down, then restarted after maintenance finished. | Nebula is available. No additional maintenance is needed for Tuesday, November 21.
2017-11-16 16:46 | 2017-11-20 12:40 | NCSA JIRA | JIRA wasn't importing some email requests properly after the NCSA MySQL restart. | Some email sent to JIRA via help+ addresses wasn't being imported. | JIRA is now accepting email, and all email sent while it was broken has been imported as expected.
2017-11-16 08:30 | 2017-11-16 13:30 | BW LDAP master (Blue Waters) | Scheduled maintenance: updated LDAP Lustre quotas to bytes and added archive quotas. IDDS will track and drive quota changes with acctd. | Production continued without interruption. | The BW LDAP master was isolated, Lustre quotas were changed to bytes with the addition of archive quotas, and replicas pulled updates without error.
2017-11-16 14:30 | 2017-11-16 16:52 | Internal website (MIS Savanah) | A database table used by MIS tools became corrupted. | The website became unresponsive every time the corrupted database table was accessed. | The OS kernel and packages were updated during debugging. The MIS database table was restored and the website came back online.
2017-11-16 16:46 | 2017-11-16 16:48 | NCSA MySQL | The NCSA MySQL server had to be restarted in order to delete the corrupted table used by MIS. | All services that use MySQL were down during the outage, including Confluence, JIRA, RT, and many websites. | MySQL was restarted successfully.
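For incidents like this, a table-integrity check can help decide whether a restore is needed before restarting MySQL. The sketch below is illustrative only: it uses the PyMySQL package, and the host, credentials, and table name are placeholders since the corrupted MIS table is not named here.

```python
#!/usr/bin/env python3
"""Minimal sketch: run CHECK TABLE against a MySQL table suspected of corruption.

Host, credentials, and the schema/table names are placeholders; requires PyMySQL.
"""
import pymysql

def check_table(schema: str, table: str):
    conn = pymysql.connect(host="mysql.example.org", user="admin",
                           password="***", database=schema)  # placeholder connection details
    try:
        with conn.cursor() as cur:
            # CHECK TABLE returns one row per check with a Msg_type/Msg_text pair.
            cur.execute(f"CHECK TABLE `{schema}`.`{table}`")
            return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for row in check_table("mis", "savanah_data"):  # placeholder schema and table
        print(row)
```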
2017-11-16 08:00 | 2017-11-16 12:00 | LSST | Monthly OS updates, plus the first round of Puppet technical-debt changes (upgrading to best design and coding practices). | All systems unavailable from 0800 to 1000; GPFS unavailable from 0800 to 1000; PDAC systems unavailable from 0800 to 1200. | Completed. OS kernel and package updates; Slurm upgraded to 17.02.
2017-11-15 13:30 | 2017-11-15 15:10 | RSA Authentication Manager | RSA Authentication Manager was patched to fix cross-site scripting vulnerabilities and other issues. | Nothing was affected by the update. | RSA Authentication Manager is running 8.2 SP1 P6. The process worked as expected.
2017-11-15 13:30 | 2017-11-15 14:30 | BW 10.5 firewall upgrade, part 2 | The normally active "A" unit of the NCSA BW 10.5 firewall was upgraded and then normal failover status was re-enabled. | Possible connection resets when the A unit came back from the upgrade and state was being synced. | Completed; the process worked as expected.
2017-11-14 11:27 | 2017-11-14 11:33 | LDAP | LDAP was unresponsive to requests. | Several services hung while authentication was unavailable. | LDAP services were killed and restarted.
2017-11-05 02:15 | 2017-11-06 17:11 | ROGER Hadoop/Ambari | cg-hm12 and cg-hm13 had minor disk failures which crashed the nodes. | Ambari was effectively offline. | The nodes were rebooted and ran fsck as part of their startup sequence; they booted properly.
2017-10-31 17:22 | 2017-11-03 17:00 | ROGER Hadoop/Ambari | Hard drive failures on cg-hm10 and cg-hm17. | Certain Ambari services and HDFS. | cg-hm17 returned to service after a power cycle and reboot; cg-hm10's hard drive did not respond to a reboot.
2017-11-11 16:58 | 2017-11-11 19:09 | Blue Waters | A water leak from XDP4-8 caused high temperatures in c12-7 and c14-7, resulting in an EPO on c12-7 and c14-7. | The scheduler was paused to place system reservations on compute nodes in the affected cabinets, then resumed.
2017-11-10 14:00 | 2017-11-10 14:45 | NCSA Open Source | Upgrade of the following software: Bamboo, JIRA, Confluence, and BitBucket. | Updates happened in place, resulting in minimal downtime of components. | Completed; minimal interruption of service.
2017-11-10 08:00 | 2017-11-10 08:30 | CA firewall upgrade, B unit | The standby "B" unit of the NCSA Certificate Service firewall was upgraded to the same version as the A unit. | No impact to services expected. | Completed; no interruption of service.

2017-11-08 16:30 | 2017-11-08 17:30 | Netdot | netdot.ncsa.illinois.edu was migrated to Security's VMware infrastructure. | During the downtime users were not able to activate or deactivate their network connections via Netact. | Migrated successfully; Netdot is up and running.
2017-11-08 06:00 | 2017-11-08 15:00 | ITS vSphere vCenter | ITS vSphere was upgraded to the latest version of VMware vCenter. New access restrictions were also put into place. | All VMs remained online during the maintenance, but management through vCenter was offline during the upgrade. | Upgrade completed successfully.
2017-11-08 09:30 | 2017-11-08 10:00 | BW 10.5 firewall upgrade, part 1 | The standby "B" unit of the NCSA BW 10.5 firewall was upgraded, and traffic was then redirected through it for load testing before the "A" unit was upgraded. | No impact to services expected. | Upgrade completed successfully. Some states were reset when traffic switched to the B unit.
2017-11-07 07:00 | 2017-11-07 18:37 | iForge (quarterly maintenance) | Update OS image, update GPFS to version 4.2.3-5, redistribute power drops, update TORQUE, BIOS updates. | iForge (and associated clusters). | All production systems are back in service.
2017-11-07 13:30 | 2017-11-07 15:00 | CA firewall upgrade, part 2 | The normally active "A" unit of the NCSA Certificate Service firewall was upgraded and then normal failover status was re-enabled. | Possible connection resets when the A unit came back from the upgrade and state was being synced. | Completed upgrade.
2017-11-06 15:28 | 2017-11-06 15:53 | Blue Waters | An EPO occurred on c12-7 and c14-7; the HSN quiesced. | The scheduler was paused to place system reservations on compute nodes in the affected cabinets, then resumed.
2017-11-03 16:21 | 2017-11-03 16:32 | LDAP | LDAP was unresponsive to requests. | Several services hung while authentication was unavailable. | LDAP services were killed and restarted.
2017-11-02 09:00 | 2017-11-02 16:00 | LSST | LSST had a GPFS server that was down and had failed over to the other server for NFS. | The GPFS clients failed over automatically, and we manually failed over NFS in the morning. | NFS exports were moved to an independent server. IBM was at NCSA and is continuing to debug the problems.
2017-10-31 17:11 | 2017-11-01 11:13 | LSST | GPFS degraded/outage. | Most NCSA-hosted LSST resources experienced degraded GPFS performance; hosts with native mounts (PDAC) experienced an outage. | A deadlock at 17:11 yesterday temporarily caused slow performance. One GPFS server then went offline at 18:21 and services failed over. NFS mounts (qserv/sui) were reported as hanging by a user at 09:12 today but may have been degraded overnight. Affected nodes were rebooted and NFS mounts recovered by 11:13. IBM is onsite diagnosing issues with the GPFS system and ordering repairs (including a network card for one server).
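A recurring symptom in these GPFS/NFS incidents is mounts that hang rather than fail outright; a minimal sketch of a hang probe is below. The mount paths are placeholders, not the actual LSST mount list.

```python
#!/usr/bin/env python3
"""Minimal sketch: flag mount points whose metadata operations hang.

A hung NFS/GPFS mount typically blocks stat() indefinitely, so each probe runs
in a worker thread with a timeout. Mount paths below are placeholders.
"""
import os
from concurrent.futures import ThreadPoolExecutor, TimeoutError

MOUNTS = ["/mnt/gpfs", "/mnt/nfs-share"]   # example paths, not the real LSST mounts
TIMEOUT_SECONDS = 10

def probe(path: str) -> None:
    os.stat(path)      # hangs if the underlying mount is unresponsive
    os.listdir(path)   # a second, slightly heavier metadata operation

if __name__ == "__main__":
    pool = ThreadPoolExecutor(max_workers=len(MOUNTS))
    futures = [(m, pool.submit(probe, m)) for m in MOUNTS]
    hung = False
    for path, future in futures:
        try:
            future.result(timeout=TIMEOUT_SECONDS)
            print(f"OK    {path}")
        except TimeoutError:
            hung = True
            print(f"HUNG  {path} (no response within {TIMEOUT_SECONDS}s)")
        except OSError as exc:
            print(f"ERROR {path}: {exc}")
    pool.shutdown(wait=False)
    if hung:
        os._exit(1)  # stuck worker threads would otherwise keep the process alive
```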
2017-10-31 15:30 | 2017-10-31 16:00 | LSST | GPFS outage. | Most NCSA-hosted LSST resources: native mounts (e.g., lsst-dev01, verify-worker*) and NFS mounts (e.g., PDAC). | All disks in the GPFS storage system went offline temporarily and came back online by themselves. NFS services were restarted. Client nodes all recovered their mounts on their own. Logs have been sent to the vendor for analysis.
2017-10-31 13:30 | 2017-10-31 14:30 | CA firewall upgrade, part 1 | The standby "B" unit of the NCSA Certificate Service firewall was upgraded, and traffic was then redirected through it for load testing before the "A" unit was upgraded. | No impact to services expected. | Upgrade completed successfully. Some states were reset when traffic switched to the B unit.
2017-10-30 18:36 | 2017-10-31 00:46 | LSST | GPFS outage. | Most NCSA-hosted LSST resources: native mounts (e.g., lsst-dev01, verify-worker*) and NFS mounts (e.g., PDAC). | GPFS servers were rebooted. lsst-dev01 and most of the qserv-db nodes were also rebooted. Native GPFS and NFS mounts were recovered. May have been (unintentionally) caused by user processes; we will continue to investigate.
2017-10-25 22:00 | 2017-10-26 11:20 | LSST | Full/partial GPFS outage. | Full outage for GPFS during the 22:00 hour on 2017-10-25; the outage for NFS sharing of GPFS (for qserv, SUI) continued through the night; a full GPFS outage recurred around 08:44 on 2017-10-26. | All GPFS services and mounts have been restored.
2017-10-26 09:04 | 2017-10-26 09:04 | Various buildings across campus, including NPCF and NCSA | An issue with an Ameren line from Mahomet caused a bump/drop/surge in power that lasted 2 ms. | LSST had approximately 20 servers reboot at both the NPCF and NCSA buildings. | A momentary issue with minimal effect on most systems.
2017-10-26 00:00 | 2017-10-26 08:00 | ICCP | gpfs_scratch01 was filled by a very active user. | Additional space in scratch wasn't available. | An out-of-cadence purge was run to free 2 TB; user jobs were held in the scheduler and the user was contacted.
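An out-of-cadence purge like the one above is usually preceded by identifying purge candidates; a minimal dry-run sketch is below. The scratch path and 30-day window are illustrative, not the actual gpfs_scratch01 policy.

```python
#!/usr/bin/env python3
"""Minimal sketch: list scratch files older than a purge window (dry run only).

The root path and age threshold are placeholders for illustration.
"""
import os
import sys
import time

def purge_candidates(root: str, days: int):
    cutoff = time.time() - days * 86400
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_mtime < cutoff:
                total += st.st_size
                yield path, st.st_size
    print(f"# total reclaimable: {total / 1e12:.2f} TB", file=sys.stderr)

if __name__ == "__main__":
    for path, size in purge_candidates("/gpfs/scratch01", 30):  # example path and window
        print(f"{size:>15d}  {path}")
```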
2017-10-25 06:00 | 2017-10-25 14:05 | Blue Waters | Security patching for the CVE-2017-1000253 vulnerability. | Restricted access to logins, scheduler, and compute nodes. HPSS and IE nodes were not affected. | The system was patched. Login hosts were made available at 9am, and the full system was returned to service at 14:05.
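Verifying that every node is actually running a patched kernel is part of closing out an update like this; a minimal sketch is below. The minimum version string is a placeholder; the authoritative fixed build for CVE-2017-1000253 comes from the vendor advisory for the distribution in use.

```python
#!/usr/bin/env python3
"""Minimal sketch: verify a node runs at least a given patched kernel release.

MIN_KERNEL is a placeholder, not the authoritative fixed build for the CVE.
"""
import platform
import re

MIN_KERNEL = "3.10.0-693"   # placeholder minimum release

def version_tuple(release: str):
    # Compare the numeric fields of a release string (e.g. 3.10.0-693.5.2) in order.
    return tuple(int(x) for x in re.findall(r"\d+", release))

if __name__ == "__main__":
    running = platform.release()
    if version_tuple(running) >= version_tuple(MIN_KERNEL):
        print(f"OK: running kernel {running} (>= {MIN_KERNEL})")
    else:
        print(f"NEEDS PATCH: running kernel {running} (< {MIN_KERNEL})")
```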
2017-10-24 09:50 | 2017-10-24 20:10 | LSST | Network outage / GPFS outage. | All LSST nodes in NCSA 3003 (e.g., lsst-dev01/lsst-dev7) and NPCF (verify-worker, PDAC) that connect to GPFS (as GPFS or NFS) lost their connections. All LSST nodes at NPCF lost network during network stack troubleshooting and replacement of a third bad switch. | A third bad switch was discovered and replaced. All nodes have network and GPFS connectivity once again.
2017-10-23 08:00 | 2017-10-24 05:00 | Campus Cluster | Campus Cluster October maintenance. | Total outage of the cluster. | Replaced core Ethernet switches in the shared services pod, ran new Ethernet cables for the shared services pod, moved the DES rack from the shared services pod to the Ethernet-only pod, and deployed a new patched image.
2017-10-21 17:15 | 2017-10-23 17:45 | LSST | First one and then two public/protected network switches went down in racks N76 and O76 at NPCF. | Mostly qserv-db[11-20] and verify-worker[25-48]; there was also a shorter outage for qserv-master01, qserv-dax01, qserv-db[01-10], all of SUI, and the rest of the verify-worker nodes. | Two temporary replacement switches were swapped in. Maintenance and/or longer-term replacement switches are being procured for the original switches.
2017-10-18 13:00 | 2017-10-18 14:00 | Networking | Replaced a linecard in one of our core switches due to hardware failure. | Any downstream switches were routed through the other core switch. | All work was completed successfully.
2017-10-19 08:00 | 2017-10-19 21:30 | LSST | Outage and migration of qserv-master01: provisioning of new hardware, copying of data from the old server to the new one. | qserv-master01 (and any services that depend on it, which may include services provided by qserv-db*, qserv-dax01, and sui*). UPDATE (2017-10-19 15:15): the OS install took much longer than anticipated and completed at 15:00; the data sync has started and the outage is extended until 22:00. | Completed.
2017-10-19 08:00 | 2017-10-19 12:00 | LSST | Routine patching and reboots, pfSense firmware updates (NPCF), Dell server firmware updates (NPCF). | All NCSA-hosted resources except for Nebula. | Maintenance completed successfully. (The qserv-master migration is ongoing; see the separate status entry.)
2017-10-18 14:45 | 2017-10-18 15:35 | Campus Cluster | Restart of the resource manager failed after removing all blocked array jobs. | Job submission. | Opened a case with Adaptive (#25796). Found more array jobs and bad jobs in the jobs directories and removed all of them.
2017-10-15 08:15 | 2017-10-15 08:30 | Open Source | Emergency upgrade of Atlassian Bamboo. | Bamboo was down for a few minutes during this outage window. | Bamboo upgraded to the latest version.
2017-10-14 22:15 | 2017-10-14 23:35 | Campus Cluster | Scheduler crash. | Job submission. | Opened a case with Adaptive, ran diagnostics, and uploaded the output along with the core file. Restarted Moab.
2017-10-14 13:00 | 2017-10-14 15:23 | Campus Cluster | Resource manager crash. | Job submission. | Applied a patch from Adaptive, which helps with faster recovery. Suspended/blocked all current and new array jobs until we have a resolution.
2017-10-06 09:00 | 2017-10-11 01:00 | Nebula | Gluster and network issues: (1) Gluster sync issues continued from the 2017-10-05 Nebula incident. (2) At approximately 16:10 on 2017-10-06, a Nebula networking issue (unrelated to the Gluster issues) caused host network drops within the Nebula infrastructure; this internal networking incident resulted in additional Gluster and iSCSI issues. | Many instances were broken because iSCSI was broken by the Nebula network issues, and any instances that were already broken because of Gluster remained broken. | All instances have been restarted and are in a state for admins to run. Some mounted file systems might require an fsck to verify. If there are other issues, please send a ticket. As the file system continues to heal, we may see slower interaction.
2017-10-10 16:30 | 2017-10-10 19:10 | Campus Cluster | Resource manager crash. | Job submission. | After removing problematic jobs from the queue we were able to restart the RM. Opened a case with Adaptive and forwarded the job scripts and core files.
2017-10-05 14:00 | 2017-10-05 17:00 | Nebula | Gluster sync issues. | One of the Gluster storage servers within Nebula had to be restarted. | Approximately 100 VM instances experienced I/O issues and were restarted.
2017-10-06 08:00 | 2017-10-06 17:00 | NCSA direct peering with ESnet | A fiber cut between Peoria and Bloomington caused our ESnet direct peering to go down. | All traffic that would have taken the ESnet peering rerouted through our other WAN peers, so there were no reported outages of connectivity to resources that users would normally access via this peering. | The fiber cut has been repaired and the peering has been re-established.
2017-10-06 08:00 | 2017-10-06 10:00 | LSST | Kernel and package updates to address various security vulnerabilities, including the PIE kernel vulnerability described in CVE-2017-1000253. This involved an upgrade to CentOS 7.4 and updates to GPFS client software on relevant nodes. | All NCSA-hosted LSST resources except for Nebula (incl. LSST-Dev, PDAC, and verification/batch nodes) were patched and rebooted. | Maintenance completed successfully. Pending updates to a couple of management nodes (adm01 and repos01) and one Slurm node that is draining (verify-worker11).
2017-10-04 07:40 | 2017-10-04 09:55 | Campus Cluster | Resource manager crash. | Job submission. | The initial restart attempt failed. After looking through the core, we decided to try a restart again without any change; this time it worked.
2017-10-03 13:00 | 2017-10-03 19:00 | Campus Cluster | Resource manager crash. | Job submission. | After removing ~30 problematic jobs from the queue we were able to restart the RM. Opened a case with Adaptive and forwarded the job scripts and core files.
2017-09-21 02:57 | 2017-09-21 09:40 | Storage server (AFS, iSCSI, web, etc.) | The parchment storage server stopped responding on the network. | Several websites were down (including www.ncsa.illinois.edu, cybergis.illinois.edu, and nationaldataservice.org); iSCSI storage mounted to the fileserver went offline; several AFS volumes, including some users' home directories, were offline. | Replaced the optical transceiver on the machine and restarted networking. Also updated the kernel and AFS.
2017-09-20 08:00 | 2017-09-20 13:45 | Campus Cluster | September maintenance. | Total cluster outage. | Maintenance completed successfully.
2017-09-20 08:00 | 2017-09-20 11:30 | NCSA Storage Condo | Normal maintenance: firmware upgrade on the NetApps so new disk trays could be attached for DSIL. | Total file system outage. | The quarterly maintenance was completed.
2017-09-18 11:20 | 2017-09-18 13:30 | Active Data Storage | RAID failure in an NSD server and a disk failure on the secondary NSD server. | The ADS service was unavailable. | Recovered the RAID configuration on the NSD server and replaced the failed disk on the secondary NSD. ADS restored.
2017-09-15 06:20 | 2017-09-15 09:28 | public-linux | OpenAFS storage was not running or mounted after rebooting into a new kernel. | AFS storage was not available from this server. | Reinstalled the dkms-openafs package and restarted the openafs-client service. AFS is now working as expected.
2017-09-10 09:45 | 2017-09-10 11:30 | NCSA Open Source | Upgrade of Bamboo, JIRA, Confluence, BitBucket, FishEye, and Crowd. | The services were unavailable for a short amount of time during the upgrade. | All services upgraded successfully.
2017-08-31 11:07 | 2017-08-31 11:11 | NCSA LDAP | NCSA LDAP timeouts. | NCSA LDAP was overloaded and timing out; users were not able to authenticate via NCSA LDAP during that time. | NCSA LDAP stopped timing out at 11:11 am and authentication resumed.
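Given the repeated LDAP timeout incidents in this log, a simple connect-latency probe is one way to catch them in monitoring; a minimal sketch is below. The hostname and port are placeholders, and a real check would also perform an authenticated bind.

```python
#!/usr/bin/env python3
"""Minimal sketch: measure TCP connect latency to an LDAP service.

The hostname is a placeholder; an authenticated bind check is intentionally omitted.
"""
import socket
import time

LDAP_HOST = "ldap.example.org"   # placeholder hostname
LDAP_PORT = 636                  # LDAPS
TIMEOUT = 5.0

def connect_latency(host: str, port: int) -> float:
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=TIMEOUT):
        pass
    return time.monotonic() - start

if __name__ == "__main__":
    try:
        latency = connect_latency(LDAP_HOST, LDAP_PORT)
        print(f"OK: connected in {latency * 1000:.1f} ms")
    except OSError as exc:
        print(f"FAIL: {exc}")
```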
2017-08-28 11:55 | 2017-08-28 12:59 | NCSA GitLab | The NCSA GitLab server ran out of disk space for the OS. | The web interface at https://git.ncsa.illinois.edu wasn't working. | The web interface is now working; space was freed by clearing CrashPlan caches.
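A basic free-space check would have flagged this before the OS filesystem filled; a minimal sketch is below, with example mount points and thresholds rather than the GitLab host's real ones.

```python
#!/usr/bin/env python3
"""Minimal sketch: warn before an OS filesystem fills up.

Mount points and the 90% threshold are illustrative defaults.
"""
import shutil

CHECKS = {"/": 0.90, "/var": 0.90}   # example mount points and fullness thresholds

def check(path: str, threshold: float) -> None:
    usage = shutil.disk_usage(path)
    frac = usage.used / usage.total
    state = "WARN" if frac >= threshold else "OK"
    print(f"{state} {path}: {frac:.0%} used ({usage.free / 2**30:.1f} GiB free)")

if __name__ == "__main__":
    for path, threshold in CHECKS.items():
        check(path, threshold)
```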
2017-08-24 13:00 | 2017-08-24 14:30 | netact.ncsa.illinois.edu | Transient configuration issues from system patching prevented Apache from starting on the netact server. | Network Activation. | The issues were fixed and Network Activation is working again.
2017-08-24 08:00 | 2017-08-24 15:30 | LSST | Rack upgrades in NCSA 3003. | Most LSST developer services were offline during the upgrade. | All LSST systems are back online with new racks and switches.
2017-08-24 08:00 | 2017-08-24 09:30 | LSST | Monthly maintenance for NPCF (includes patching to address CESA-2017:1789 and CESA-2017:1793). | adm01, backup01, bastion01, monitor01, object*, qserv*, sui*, verify-worker*, test0*. | Maintenance was successfully completed.
2017-08-23 09:21 | 2017-08-23 16:50 | aForge/iForge | GPFS failed during an upgrade of GPFS on the iForge storage nodes. There was an IB hiccup at the time, but causality is unclear. | All jobs on iForge were aborted, GPFS clients needed to be upgraded, and all GPFS client nodes were rebooted. | iForge went back to production shortly before 5:12pm; aForge went back to production at ~16:30.
2017-08-22 20:00 | 2017-08-22 30:00 | DHCP service | Patching the OS and services on DHCP1. | The DHCP server needed to be rebooted a few times during this process, and DHCP was unavailable during those reboots. Since this was during the evening, no direct issues were expected. | Patching has been completed.
2017-08-16 08:00 | 2017-08-16 16:00 | Campus Cluster | August maintenance. | Scheduler and resource manager down. | Upgraded to Moab 9.1.1 and Torque 6.1.1.
2017-08-16 08:00 | 2017-08-16 09:15 | None | Replace a line card in a core switch. | All systems connected to this switch are believed to be multihomed and should not experience an outage. | The line card has been successfully replaced.
2017-08-16 00:30 | 2017-08-16 02:30 | Blue Waters | Two cabinets (c10 & c11) had an EPO due to an XDP control valve failure. | The scheduler was paused to isolate the failing parts and resumed at 2:09. | Parts were replaced and the cabinets were returned to service.
2017-08-08 07:00 | 2017-08-09 03:00 | iForge/cfdForge/aForge | 20170808 maintenance for iForge: update OS image to RHEL 6.9, update GPFS to version 4.2.3-2, redistribute power drops. | All four clusters were updated. | All items on the checklist completed.

2017-08-03 06:45 | 2017-08-03 07:35 | NCSA Jabber upgrade | Upgraded the Openfire XMPP jabber software. | NCSA Jabber was unavailable during the upgrade. | Jabber was upgraded to the latest version of Openfire.
2017-07-28 17:00 | 2017-07-31 (evening) | DES old operational database | Migration of the operational database to new hardware over the weekend. Update: all of the production data has been migrated except for the largest object table, which is loading now; the user space will be loaded next, and everything should be done by this evening. | DES old operational database. | Migration completed successfully. Some other maintenance tasks that give DES additional disk space were also done, along with some performance improvements.
2017-07-27 11:00 | 2017-07-28 15:00 | netact.ncsa.illinois.edu | The netact.ncsa.illinois.edu network activation server VM needed to be restored from backup. | Network Activation service. | The service has been fully restored.
2017-07-25 02:36 | 2017-07-25 18:00 | Campus Cluster (scheduler down) | A blip on mgmt1 caused GPFS to drop and the scheduler to crash. | Scheduler offline. | The scheduler still takes a long time to initialize, but jobs can start and run as usual. Opened a case with Adaptive.
2017-07-20 09:00 | 2017-07-20 17:00 | ROGER Ambari and OpenStack | Updates to the OpenStack control node and the Ambari cluster. | Ambari nodes (cg-hm08 - cg-hm18), OpenStack instances and servers. | OpenStack was back in service on time. Ambari had issues mounting HDFS and was held out of service; HDFS was remounted on 25 July.
2017-07-20 06:00 | 2017-07-20 10:00 | All NCSA-hosted LSST resources | Monthly OS patches (addressing issues including CESA-2017:1615 and CESA-2017:1680), roll-out of updated Puppet modules, and firmware updates on batch nodes. | All nodes in NCSA 3003 and NPCF (batch nodes) were rebooted. | Overall success. Exceptions: verify-worker31 failed a firmware update and is out of commission (LSST-914), and there are connectivity issues for some VMs used by the NCSA DM team (IHS-365). adm01, backup01, and test[09-10] will be patched in the near future.
2017-07-19 08:00 | 2017-07-19 14:44 | Campus Cluster | July maintenance (applied security patches). | Cluster-wide, except MWT2 nodes. | Applied new kernel, glibc, and bind patches and the newest NVIDIA driver.
2017-06-29 18:00 | 2017-06-30 00:00 | Blue Waters | Emergency maintenance to apply a security patch addressing the Stack Guard security vulnerability. | Compute, login, and scheduler were offline. | Kernel and glibc library patched on all affected systems.
2017-06-22 08:00 | 2017-06-22 12:00 | All NCSA-hosted LSST resources | CRITICAL kernel and package updates to address the Stack Guard Page security vulnerability. | Systems were patched and rebooted. | The outage was extended past 1000 until 1200. Systems were successfully patched as planned except for qserv-db12 and qserv-db27, which will not boot; we will follow up on those with a ticket.
2017-06-22 08:00 | 2017-06-22 09:30 | LSST cluster nodes (verify-worker*, qserv*, sui*, bastion01, test*, backup01) | Deploy Unbound (local caching DNS resolver). | DNS resolution may see a short (~30 min) delay. | Successfully deployed and all tests (including reverse DNS and intra-cluster SSH) pass.
2017-06-20 09:30 | 2017-06-20 11:00 | Blue Waters | An XDP shutting down caused an EPO on cabinets c1-7 and c2-7. | The scheduler was paused to isolate the failing components, then resumed. | Warm-swapped the failing components and returned them to service.

2017-06-20 09:00 | 2017-06-20 10:00 | NCSA Open Source | A security upgrade was needed for Bamboo; the following components were also updated: Bamboo, JIRA, Confluence, BitBucket, FishEye. | Most of the subcomponents of NCSA opensource were down for a short time while the software was updated. | Upgraded Bamboo, JIRA, Confluence, BitBucket, and FishEye to the latest versions.

2017-06-16 09:00 | 2017-06-16 11:00 | ROGER OpenStack NFS backend | The primary CES server for the OpenStack NFS backend failed and tried to fail over to the secondary server, which also failed. SET was notified and had the CES NFS service back up by 1100. | The ROGER OpenStack dashboard went down and needed a restart. Several VMs experienced "virtual drive errors" and will need to be restarted. | SET is still investigating the cause of the GPFS CES service failover. CyberGIS is working with their users to get the affected VMs restarted.
2017-06-15 08:00 | 2017-06-15 09:30 | LSST cluster nodes (verify-worker*, qserv*, sui*, bastion01, test*, backup01) | Deploy Unbound. | DNS resolution may see a short (~30 min) delay. | Updates deployed successfully via a new Puppet module and all tests passed. EDIT 2017-06-15 1500: Reverse DNS was not working, which broke SSH to the qserv* nodes; Unbound was disabled.
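The failure above is exactly the case a pre-flight reverse-DNS test catches; a minimal sketch of such a check is below, with placeholder hostnames rather than the real qserv node names.

```python
#!/usr/bin/env python3
"""Minimal sketch: verify forward and reverse DNS agree for a set of cluster nodes.

Hostnames below are placeholders for the real node list.
"""
import socket

HOSTS = ["qserv-db01.example.org", "qserv-master01.example.org"]  # placeholder names

def check(host: str) -> None:
    try:
        addr = socket.gethostbyname(host)                          # forward lookup
        rev_name, _aliases, _addrs = socket.gethostbyaddr(addr)    # reverse lookup
    except OSError as exc:
        print(f"FAIL {host}: {exc}")
        return
    match = "ok" if rev_name.rstrip(".").startswith(host.split(".")[0]) else "MISMATCH"
    print(f"{host} -> {addr} -> {rev_name} [{match}]")

if __name__ == "__main__":
    for host in HOSTS:
        check(host)
```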

2017-06-14 08:00 | 2017-06-14 22:00 | Network core switch | Network Engineering replaced a line card in one of our core switches due to a hardware issue. | All services should remain active; any affected switch has a second redundant link to the other core to pass traffic. | The line card was successfully replaced.
2017-06-08 12:00 | 2017-06-11 22:20 | Campus Cluster (scheduler paused) | Disk enclosure 3 failure on the DDN 10K. | Lost redundancy, forcing us to drain the cluster. | Repair/replacement of the controller can be time-consuming, so we rebalanced data out of the failed enclosure. The scheduler was resumed as of 22:00.
2017-06-07 12:07 | 2017-06-07 12:42 | NCSA LDAP | The NCSA LDAP service crashed. | The NCSA LDAP service was unavailable. | LDAP software and the OS were updated and the server rebooted. LDAP is working normally.
2017-05-31 20:06 | 2017-05-31 20:36 | NCSA LDAP | The NCSA LDAP service was timing out. | The NCSA LDAP service was unavailable. | The root cause of the LDAP timeouts is still being investigated.
2017-05-22 | 2017-05-26 | Campus Cluster VMs | Network issue on the ESXi (hypervisor) boxes after maintenance. | We were no longer able to log in to start VMs; the license server, Nagios, and all MWT2 VMs were down. | The issue was fixed on 5/24, and the license and Nagios services were restored on 5/24. MWT2 VMs were moved to the Campus Farm. All VMs returned to service as of noon on 5/26.
2017-05-12 | 2017-05-18 | Condo (NFS partitions only) | The NFS partition for the condo became extremely unstable after a replication (normal daily maintenance) completed. Many iterations with fsck and IBM on the phone got it resolved, followed by 1.5 days restoring files that had been put in lost+found. | The UofI Library was switched to the read-only version on ADS during this time. | The root cause is still being investigated.
2017-05-23 14:05 | 2017-05-23 14:13 | NCSA LDAP | The NCSA LDAP service was timing out. | The NCSA LDAP service was unavailable. | The issue is still being investigated, but the service seems to have been steadily available since the incident.
2017-05-22 15:41 | 2017-05-22 15:51 | idp.ncsa.illinois.edu, oa4mp.ncsa.illinois.edu | Apache Tomcat ran out of memory. | InCommon/SAML IdP and OIDC authentication services were unavailable. | Service was restored by failing over to the secondary server while memory is being increased on the primary server.
2017-05-20 21:09 | 2017-05-20 23:37 | DES nodes on Campus Cluster | The nodes could not communicate outside the switch. | All nodes connected to the switch in POD22 Rack2 at ACB. | Upgrading the code on the switch resolved the issue.
2017-05-20 05:00 | 2017-05-20 21:09 | Campus Cluster and Active Data Storage (ADS) | Total power outage at ACB. | All systems that currently reside at ACB. | Power was restored around 13:00. We rotated the ADS rack to align with the Campus Cluster storage rack and changed a couple of VLAN IDs to match campus for the future merger. The ESXi boxes were down due to a configuration error after reboot. No major issues in the fsck output from scratch02.
2017-05-17 02:00 | 2017-05-17 10:45 | Internet2 WAN connectivity | Intermittent WAN connectivity. The outage was caused by Tech Services' DWDM system, which provides our physical optical path to Chicago via the ICCN. Specifically, the Adva card that our 100G wave is on was seeing strange errors, which caused input framing errors for traffic coming in on this interface. | General WAN connectivity to XSEDE sites, certain commodity routes, and other I2 AL2S connections. | The Adva card was rebooted and the input framing errors stopped. Tech Services is working with Adva to find the root cause of the issues on the card.
2017-05-11 | 2017-05-12 | ESnet 100G connection | NCSA and ESnet will be moving their 100G connection to a different location in Chicago. | We have several diverse high-speed paths to ESnet and DOE; traffic will be redirected to a secondary path.
2017-05-11 06:45 | 2017-05-11 07:33 | NCSA Jabber upgrade | Upgraded the Openfire XMPP jabber software. | NCSA Jabber was unavailable during the upgrade. | Jabber was upgraded to the latest version of Openfire.

2017-05-09 07:00 | 2017-05-09 18:15 | iForge, GPFS, license servers | iForge planned maintenance. | iForge systems, including the ability to submit/run jobs. | The PM was completed early, at 18:15.
2017-05-06 22:00 | 2017-05-06 23:00 | NCSA Open Source | Upgrades of Atlassian software. | NCSA Open Source BitBucket. | BitBucket is upgraded.
2017-05-06 09:00 | 2017-05-06 10:00 | NCSA Open Source | Upgrade of Atlassian software. | Most services hosted at NCSA Open Source were down for 5 minutes during rolling upgrades. | The following services were upgraded: HipChat, Bamboo, JIRA, Confluence, FishEye, and Crowd.
2017-05-05 17:43 | 2017-05-05 20:02 | ITS vSphere | A VM node panicked. | Several VMs died when the node panicked and were restarted on other VM nodes, including LDAP, JIRA, Help/RT, SMTP, Identity, and others. | All affected VMs were restarted on other VM nodes; most restarted automatically.
2017-04-27 18:10 | 2017-04-27 18:55 | Campus Cluster | Another GPFS interruption. | Both the resource manager and the scheduler went down, along with a handful of compute nodes. | Restarted the RM and scheduler and rebooted all down nodes.
2017-04-27 13:11 | 2017-04-27 14:20 | Nebula | glusterfs crashed due to a known bug, so no instances could access their filesystems. | All instances running on Nebula. | Needed to reboot the node that systems were mounting from, and took the opportunity to upgrade all Gluster clients on other systems while waiting for the reboot. Version 3.10.1 fixes the bug. All instances with errors in their logs were restarted.
2017-04-27 11:20 | 2017-04-27 12:45 | Campus Cluster | GPFS interruption. | Both the resource manager and the scheduler went down. | The Torque serverdb file was corrupted. Restored the file from this morning's snapshot and modified the data to match the current state.
2017-04-26 12:00 | 2017-04-26 18:30 | Condo | A bug in deleting a disk partition from GPFS; a problem within GPFS. | DES, Condo partitions, and the UofI Library. | The partitions had been up for 274 days with many changes. The delete-partition bug forced us to stop ALL operations on the condo and repair each disk through GPFS. We must have quarterly maintenance; it is just too complicated to go a year without resetting things.
2017-04-19 16:54 | 2017-04-20 08:45 | gpfs01, iForge | Filled-up metadata disks on the I/O servers caused failures on gpfs01. | The iForge clusters, including all currently running jobs. | Scheduling on iForge was paused for the duration of the incident and running jobs were killed. 13% of the metadata space was freed, the clusters were rebooted, and scheduling resumed.
2017-04-19 08:00 | 2017-04-19 13:00 | Campus Cluster | Merging xpacc data and /usr/local back to data01 (April PM). | The resource manager and scheduler were unavailable during the maintenance. | Once again, /usr/local, /projects/xpacc, and /home/<xpacc users> are mounted from data01; the cluster is no longer split.
2017-04-04 13:30 | 2017-04-04 16:00 | Networking | Fiber cuts caused a routing loop inside one of the campus ISP's networks. | Certain traffic that traversed this ISP would never reach its final destination; some DNS lookups would also have failed. | Campus was able to route around the problem, and the ISP also corrected their internal problem. The cut fiber was restored last night.
2017-03-28 00:00 | 2017-03-29 16:00 | LSST | NPCF chilled water outage. | LSST Slurm cluster nodes were offline during the outage; all other LSST systems were expected to remain operational. | No issues. Slurm nodes restarted.
2017-03-28 00:00 | 2017-03-29 02:30 | Blue Waters | NPCF chilled water outage. | Full system shutdown on Blue Waters (except Sonexion, which was needed for fsck). | fsck done on all Lustre file systems, XDP piping work done (no leakage found), and software updates (PE, Darshan) completed.
2017-03-25 22:15 | 2017-03-26 00:08 | Blue Waters | BW scratch MDT failover; df hangs. The load on the MDS was 500+, which delayed the failover, and post-failover issues delayed the return to service. | The scheduler was paused.
2017-03-25 16:00 | 2017-03-25 20:00 | Blue Waters | BW login node ps hang. | Rebooted h1-h3; the bw/h2ologin DNS record was lost and NetEng recreated it. Logins were rotated in and out of the round-robins until all were rebooted. User email sent (2). | Login nodes rebooted; DNS round-robin changes.
2017-03-23 10:00 | 2017-03-23 15:00 | Nebula | NCSA Nebula outage. | Nebula took an outage to rebalance and build a more stable setup for the file system. This required a pause of all instances, and Horizon was unavailable. | File system online and stable. All blocks were balanced and healed.
2017-03-16 06:30 | 2017-03-16 11:30 | LSST | LSST monthly maintenance. | GPFS filesystems went offline for the entire duration of the outage. Some systems were rebooted, especially those that mount one or more of the GPFS filesystems.
2017-03-15 15:11 | 2017-03-15 16:01 | Blue Waters | Failure on cabinet c9-7, affecting the HSN. | The filesystem hung for several minutes; the scheduler was paused for 50 minutes. | Warm-swapped cabinet c9-7. Nodes on c9-7 are reserved for further diagnosis.
2017-03-15 09:00 | 2017-03-15 12:47 | Campus Cluster | UPS work at ACB; reshuffling electrical drops on the 10K controllers, storage IB switches, and some servers. | The scheduler was paused for regular jobs; MWT2 and DES continued to run on their nodes. | UPS work at ACB was incomplete (additional parts required). The power redistribution work was done. The scheduler was paused for 3 hours 50 minutes.
2017-03-10 13:00 | 2017-03-10 18:00 | Campus Cluster | ICCP: we lost the 10K controllers due to some type of power disturbance at ACB. | ICCP lost all filesystems; it was a cluster-wide outage. | Recovered the missing LUNs and rebooted the cluster. The cluster was back in service at 18:00.
2017-03-09 09:00 | 2017-03-09 15:00 | ROGER | ROGER planned PM. | Batch, Hadoop, data transfer services, and Ambari. | The system was out for 6 hours; DT services were out until 0000.
2017-03-08 19:41 | 2017-03-08 22:41 | Blue Waters | The XDP serving four cabinets (c16-10, c17-10, c18-10, c19-10) powered off. | The scheduler was paused and the four racks were power cycled. Moab required a restart; too many nodes were down and iterations were stuck. | The scheduler was paused for three hours.
2017-03-03 17:00 | 2017-03-03 22:00 | Blue Waters | BW HPSS emergency outage to clean up the DB2 database. | ncsa#nearline; stores were failing with cache-full errors. | Resolved the cache-full errors.
2017-02-28 12:00 | 2017-02-28 12:50 | Campus Cluster | ICC resource manager down. | Users could not submit new jobs or start new jobs. | Removed the corrupted job file.
2017-02-22 16:15 | 2017-02-22 18:15 | Nebula | Nebula Gluster issues. | All Nebula instances were paused while Gluster was repaired. | Nebula is available.
2017-02-11 19:00 | 2017-02-11 23:59 | NPCF | NPCF power hit. | BW Lustre was down; XDP heat issues. | Returned to service 2017-02-11 23:59.
2017-02-15 08:00 | 2017-02-15 18:00 | Campus Cluster | ICC scheduled PM. | Batch jobs and login node access.