status.ncsa.illinois.edu

Watch this page in the wiki to subscribe to automatic updates to this status page.

Please do not refer to any NCSA Industry Partners on this page. Please use the iforge nomenclature for all of the *forge infrastructure.

Current Status

START	END	What System/Service is affected	What is happening?	What will be affected?	Contact Person

Report a problem

Upcoming Scheduled Maintenance

Start	End	What System/Service is affected	What is happening?	What will be affected?	Contact Person

2018-11-19 08:00	2018-11-19 20:00	ICCP	Monthly maintenance Split the filesystem Reformat with new v5 format	Total cluster outage.	help@campuscluster.illinois.edu
2018-11-29 08:00	2018-11-29 14:00	LSST	Monthly maintenance Puppet code changes disable CPU hyperthreading OS/Yum updates code upgrades on select service & management switches NPCF pfSense updates	ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters, and tus-ats01)	lsst-admin@ncsa.illinois.edu
2018-11-15 5.30PM	2018-11-15 7.30PM	NCSA building router in 2045	software upgrade on one of the building routers (2045-br)	Traffic will failover to redundant building router and no impact on network traffic is expected.	help+neteng@ncsa.illinois.edu

Previous Outages or Maintenance

Start	End	What System/Service was affected?	What happened?	What was affected?	Outcome	Contact Person
11/14/18 5:30PM	11/14/18 6:23PM	Blue Waters/Home filesystem	MDS issue	scheduler paused Logins impacted	Home file system RTS	Timothy Bouvet
2018-11-14 10:00am	2018-11-14 11:00am	idp.ncsa.illinois.edu	Upgrade Shibboleth IdP from v3.3.2 to v3.4.1	ECP (command line) Duo authentication is now supported natively by Shib IdP software.	work completed a day early	Terrence Fleury
2018-11-14 10:45 am	2018-11-14 11:20	Blue Waters /Home filesystem	Investigation ongoing- suspect HSN quiesce	/home, and new job starts during the scheduler pause	back in service at 11:20	Timothy Bouvet
2018-11-06 06:00 am	2018-11-06 12:25 pm	Networking NetSure DC Distribution System Tape Library QBERT and DIGDUG iForge racks: Y121, Z121, AA121, CC121, DD121	De-energize distribution power panel DP-6C-020 to install new power panel PPC4	Loss of power to the core network DC Distribution panel (B Side); the network has a 2N power feed, so no impact on the network is expected due to redundancy. Loss of power to two tape libraries; temporary power feeds will be provided. iForge system will be powered down for quarterly maintenance.	work completed as expected	Mohammad Rantissi
2018-11-13 7:30 PM	2018-11-13 8:30 PM	LSST lspdev/Kubernetes	Cluster reboot	Memory performance on most k8s nodes was in degraded state as a result of a power event that occurred over the weekend. Reseating the nodes in their chassis slots resolves the issue.	Systems rebooted and memory performance is back to normal	lsst-admin@ncsa.illinois.edu
2018-11-10 ~04:40	2018-11-10 ~04:45	iForge (select compute nodes)	A power event caused some compute nodes to reboot	Select skylake platform compute nodes, including 7 nodes in the skylake queue. Jobs running on those nodes would have been impacted.	Systems rebooted and brought themselves back online.	iforge-admin@ncsa.illinois.edu
2018-11-10 ~04:40	2018-11-10 ~04:45	LSST (lspdev and select L1 hosts)	A power event caused some hosts to reboot: lspdev kubernetes cluster (3 nodes including master node did not come back on their own and were manually brought online around 09:30) some L1 nodes rebooted as well	lspdev/Kubernetes cluster was unavailable from ~04:40 until ~09:30 select L1 hosts rebooted	Systems should be back online and functioning. Users are asked to create tickets if there are lingering issues.	lsst-admin@ncsa.illinois.edu
2018-11-08 5.30PM	2018-11-08 7.30PM	NCSA building router in basement 07 (ncsa-07-br)	software upgrade on one of the building routers.	Traffic failed over to redundant NCSA building router. No impact on the network was observed	Maintenance was completed successfully without any issues	help+neteng@ncsa.illinois.edu
2018-11-07 16:50	2018-11-07 17:00	NCSA Jira	Jira was rebooted to increase RAM.	NCSA's Jira was offline while its RAM configuration was upgraded.	Upgrade was completed successfully without any issues.	help+its@ncsa.illinois.edu
2018-11-06 06:00	2018-11-06 21:45	iForge / aForge	Quarterly Maintenance (20181106 Maintenance for iForge)	All systems were unavailable during the maintenance.	Maintenance was completed successfully: aForge returned to service at 21:15 iForge returned to service at 21:45 NOTE: OFED was updated to v4 on the clusters during the PM. Some MPI software may need to be recompiled due to changes in libraries (e.g., libpsm_infinipath is no longer present in OFED v4). Frequently used openmpi installations have been updated to accommodate this change. Software compiled against affected MPI software may also need to be recompiled.	iforge-admin@ncsa.illinois.edu
2018-11-06 07:00	2018-11-06 09:00	NCSA VPN Service	The VPN was upgraded.	The NCSA VPN service was down for maintenance	The NCSA VPN has been upgraded	help+neteng@ncsa.illinois.edu
2018-11-01 7.00PM	2018-11-01 7.30PM	wired networking on 4th floor in NCSA building	Software upgrade on network closet switches	Wired network, VOIP phones on 4th floor. NCSAnet Wireless remained available during maintenance window.	upgrade was completed successfully without any issues.	help+neteng@ncsa.illinois.edu
2018-11-02 3:30 AM	2018-11-02 6:10 AM	iforge cluster	GPFS issue. "ls /usr/local" hangs. Direct access to some directories under /usr/local was OK, i.e., "ls /usr/local/modules-3.2.9.iforge" was OK.	iforge login node is currently down. New ssh connections are hanging. There is the potential for issues with running jobs. Scheduler has been paused.	Something odd going on with iforge020 was causing hangs. Once iforge020 was rebooted, access to /usr/local was unlocked.	Jim Long jlong1s@illinois.edu
2018-10-30 9:00 p.m.	2018-10-30 11:00 p.m.	NCSA DHCP	Patches	The DHCP server will be unavailable periodically for reboots and patching. Possible timeouts for DHCP, but generally no interruptions are expected.	help+neteng@ncsa.illinois.edu
2018-10-25 5.30PM	2018-10-25 6.00PM	wired networking on 3rd floor in NCSA building	software upgrade on network closet switches	wired network, VOIP phones. NCSAnet Wireless remained available during maintenance window.	code upgrade completed successfully without any issues.	help+neteng@ncsa.illinois.edu
2018-10-22 12:00pm	2018-10-22 1:00pm	IDDS servers	Patches	XRAS admin/review/submit UIs, XDCDB Admin UI, NAPS	Patches complete	idds-admin@ncsa.illinois.edu
2018-10-18 08:00	2018-10-18 12:00	LSST	Monthly maintenance firmware update and reboot on monitor01 (monitoring collector) OS & Kernel updates on tus-ats01.lsst.ncsa.edu Puppet code changes	monitor01/InfluxDB (and likely the front-end Grafana monitoring, e.g., monitor-ncsa.lsst.org) will be unavailable for a short period of time tus-ats01 will be unavailable for OS & Kernel updates the Puppet changes are intended to be functional "no-ops" and should cause no outage, although we scheduled these changes during our monthly PM window in case something unexpected occurs	maintenance completed successfully	lsst-admin@ncsa.illinois.edu
2018-10-17 08:00	2018-10-17 18:00	ICCP	Monthly Maintenance Deploying new kernel with CVE-2018-14634 fix Switching to MTU9000 across GPFS 5.0.2 upgrade Firmware bug fixes applied to DDN SFA14KX	Total system outage	maintenance completed	help@campuscluster.illinois.edu
2018-10-17 15:40	2018-10-17 23:00	3rd Floor Networking	Portions of the third floor did not have network connectivity due to a switch malfunction.	Portions of the third floor are without network connectivity.	The issue has been resolved.	neteng@ncsa.illinois.edu
2018-10-15 08:00 AM	2018-10-15 08:50 PM	Blue Waters	Maintenance to apply security Patches	All services for Blue Waters will be down except for ncsa#Nearline	Outage extended for 2 hours due to unexpected power loss to 3 rows of equipment	bw-admin@ncsa.illinois.edu
2018-10-16 10:00 AM	2018-10-16 01:00 PM	DUO 2-Factor Auth	DUO Upstream vendor has reported issues with their service. https://status.duo.com/	NCSA systems that use DUO for 2FA might experience intermittent issues	help+security@ncsa.illinois.edu
2018-10-15 7:30 am	2018-10-15 11:00 pm	Nebula, File-server	Power Loss in the NCSA building is causing issues with systems	Nebula web services are turned off, File-server is unavailable	Systems were brought back online and repaired.	help+its@ncsa.illinois.edu
2018-10-15 07:35	2018-10-15 09:15	LSST	Power event -> host outage at NCSA 3003	affected: all physical LSST hosts (and VMs) at NCSA 3003: incl. lsst-dev, lsst-xfer, lsst-l1, lsst-daq, lsst-dev-db	most physical hosts rebooted themselves after the event, although a few L1 systems had to be manually powered on most VMs had to be manually started after the event	lsst-admin@ncsa.illinois.edu
2018-10-11 16:30	2018-10-11 17:00	crashplan backup service	crashplan was upgraded to code42 6.8.4	crashplan service was restarted and clients reconnected	crashplan service has fewer security vulnerabilities now.	crashplan@ncsa.illinois.edu
2018-10-09	2018-10-09	DHCP	Additional DHCP attributes will be passed to clients.	The Security Operations group has requested that the Web Proxy Auto-Discovery Protocol (WPAD) be set to blank via DHCP to better secure client workstations/laptops. This should not impact any user's general network usage.	WPAD has been applied to all user networks at NCSA and NPCF (including wireless).	help+neteng@ncsa.illinois.edu
2018-10-08 17:00	2018-10-08 21:00	Wired networking on 2nd floor in NCSA building	ncsa-2045 Network switch software upgrade	Wired networking for desktop computers and VOIP phones. Wireless network remained available during maintenance	switch stack on second floor was upgraded. There were some issues during upgrade process due to which maintenance ran longer than expected. All networking services are restored back to normal.	help+neteng@ncsa.illinois.edu
2018-10-04 16:35	2018-10-04 16:35	jabber.ncsa.illinois.edu	The Openfire jabber server stopped working correctly and was restarted.	Everyone using jabber reconnected.	Jabber rooms are working like they should again	help+its@ncsa.illinois.edu
2018-10-04 08:00	2018-10-04 09:15	LSST	Critical security patching	ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters) The following systems will remain online and unaffected: tus-ats01	Maintenance was successful.	lsst-admin@ncsa.illinois.edu
2018-10-03 06:00	2018-10-03 07:00	Campus Cluster - Networking	Maintenance was performed on the OmniPoP uplink on ur1carne, which is the upstream router for all ICCP based network traffic. Engineers worked to transition the link from old optical transport gear to new gear that is optically protected with automatic failover.	All traffic that would normally take this OmniPoP link will reroute through other WAN links on ur1carne. Downtime of < 15 min is expected within the hour window while engineers swing the fiber jumpers from the old optical gear to the new optical gear. There should be no impact to DES or any ICCP customers. Please contact NetEng if you notice any unexpected outages.	Maintenance was successful.	help+neteng@ncsa.illinois.edu
2018-10-02 17:00	2018-10-02 20:00	NPCF Networking DC Power System	Testing and maintenance of the DC power system and battery backup will be performed.	No outage.	Tests were completed without issue.	help+neteng@ncsa.illinois.edu
2018-09-26 11:00	2018-09-26 12:00	Campus Cluster - MWT2 Networking	Maintenance was performed on the Internet2 uplink on ur1carne, which is the upstream router for all ICCP/MWT2 based network traffic.	MWT2 lost connectivity to LHC1 but everything else rerouted, all of which was expected.	The maintenance was successful, no issues have been reported	help+neteng@ncsa.illinois.edu
2018-09-20	2018-09-24	OpenAFS servers	OpenAFS file and database servers were upgraded to 1.6.23	The OpenAFS servers were upgraded to the latest code without service interruption	Now running with latest security fixes in place	afs@ncsa.illinois.edu
2018-09-20 08:00	2018-09-22 16:50	LSST Qserv	qserv-master01 is having trouble booting after a motherboard replacement during planned maintenance.	Qserv in general, specifically qserv-master	RESOLVED	lsst-admin@ncsa.illinois.edu
2018-09-20 08:00	2018-09-20 14:40	LSST LSPdev	LSPdev kubernetes is having a gateway error after upgrading	LSPdev kubernetes	RESOLVED	lsst-admin@ncsa.illinois.edu
2018-09-20 08:00	2018-09-20 14:00	LSST	Monthly maintenance (Sep): Network switch firmware updates/reboots Lenovo firmware updates/reboots OS package updates/reboots ESXi hypervisor updates/reboots GPFS client changes and upgrade to 4.2.3-10 GPFS server upgrade to 4.2.3-10	All LSST systems and services will be unavailable for the duration of the maintenance period.	RESOLVED qserv-master01 and LSPdev are still having issues. These will be tracked as separate incidents.	lsst-admin@ncsa.illinois.edu
2018-09-19 08:00	2018-09-19 22:00	Campus Cluster	Monthly maintenance Switching to CentOS 7.5 across cluster Upgrading gpfs to 4.2.3.10 (client only)	All compute and login nodes were down. The filesystems were also unavailable due to issues with the change to gpfs and RH7.5	The cluster was back in service at 2200	help@campuscluster.illinois.edu
2018-09-17 17:30	2018-09-17 19:30	Wired networking on 1st floor (ncsa-1045)	software upgrade on network switch for 1st floor.	Wired networking for users on 1st floor was unavailable as network engineering performed software upgrades on their equipment. Wireless network (NCSAnet) remained available during this time.	Maintenance was completed successfully. Users can contact neteng if they have any issues with their wired network connections.	help+neteng@ncsa.illinois.edu
2018-09-12 06:00	2018-09-12 09:00	DNS1, DNS2	DNS1 and DNS2 will be updated/upgraded	DNS servers will be undergoing routine maintenance. During this maintenance window, system and services will be restarted. One DNS server will always be responsive during the maintenance.	Updates have been applied.	help+neteng@ncsa.illinois.edu
2018-09-11 9:30 a.m.	2018-09-11 11:00 a.m.	Internet2 100G connection	ICCN engineers will be migrating our Internet2 connection to the new ICCN optical equipment.	Traffic will fail over to a secondary peering. We expect minimal impact to users. Direct peering will fall back to normal routing.	The migration has been completed.	help+neteng@ncsa.illinois.edu
2018-09-11 8:30 a.m.	2018-09-11 11:00 a.m.	ESnet 100G direct connection	We will be migrating our ESnet connection to the new ICCN optical equipment.	Traffic will fail over to a secondary peering. We expect no impact during this maintenance.	The migration has been completed.
2018-10-10 09:00	2018-10-10 17:00	netact.ncsa.illinois.edu	Multiple users reported they were unable to delete their activations or change networks within Netact.	netact.ncsa.illinois.edu	Fixed the bug and tested. Issue was resolved.	help+neteng@ncsa.illinois.edu
2018-09-06 11:00	2018-09-06 12:00	MREN Circuit Move	The MREN WAN circuit is being moved to an optical protection switch.	Traffic will be re-routed over an alternate peering during the test period.	neteng@ncsa.illinois.edu
2018-09-06 16:00	2018-09-06 16:40	RSA Authentication Manager	RSA Authentication Manager 8.2 SP 1 P 08 was applied	Both primary and replica servers were updated with the latest security patches	Running 8.2 SP1 P08	otp@ncsa.illinois.edu
2018-08-15 08:00	2018-08-15 20:08	Campus Cluster	Preventative Maintenance FSCK on filesystem Reseat and reset management modules on IB core switch BIOS updates on some nodes Upgrade Carne uplink to 2x100G	Total outage	Corrected bad inode on filesystem. Rebooted IB core switch 2x100G links are working	help@campuscluster.illinois.edu
2018-08-29 09:38	2018-08-29 10:21	Services that utilize Duo 2FA including bastion hosts and VPN.	latency issues with DUO1 as per https://status.duo.com/	any service that uses Duo for authentication including bastion hosts and VPN.	Service appears to be returning to normal as per updates on https://status.duo.com/	help+security@ncsa.illinois.edu
2018-08-20 10:20	2018-08-20 12:00	sslvpn.ncsa.illinois.edu	intermittent login issues with DUO two factor authentication due to an outage on DUO's end.	Two factor authentication to sslvpn service.	Duo identified the issue and resolved the outage. Users can connect to sslvpn over Duo 2FA now.	help+neteng@ncsa.illinois.edu
2018-08-16 11:25	2018-08-16 12:41	Slack	Slack is reporting connectivity issues on their status page (https://status.slack.com/)	Slack reported, "connectivity issues impacting all workspaces"	Slack reported this resolved at 12:41, though NCSA users reported it working around 11:38.	feedback@slack.com
2018-08-15 08:00	2018-08-15 16:00	ISDA VM infrastructure	Upgrade of all VM servers as well as backend storage system	NCSA opensource, NCSA docker hub, ISDA VM servers	Upgrade was successful	kooper@illinois.edu
2018-08-15 0800	2018-08-15 1430	Storage Condo Maintenance	All servers were upgraded to gpfs 4.2.3.10 and the clustered nfs service was implemented as well.	Storage Condo	Upgrade was successful	ckerner@illinois.edu
2018-08-14 05:00	2018-08-14 09:00	NCSA Wiki	wiki.ncsa.illinois.edu will be upgraded to Confluence 6.10.1 and then to 6.10.2.	The wiki will be down intermittently during the upgrade. Read the banner at the top of wiki pages for current status.	Upgrade was successful	help+its@ncsa.illinois.edu
2018-08-07 07:00	2018-08-10 12:00	iForge ifdbpoc server	Hardware issues require migrating to new server; some signs indicate service was impacted prior to 2018-08-07 07:00 but no reports have confirmed	ifdbpoc	Admins migrated data and services to another server. Verification was performed by the apps team.
2018-08-08 -- 1430hrs	2018-08-10 -- 0730hrs	Blue Waters Nearline Endpoint	Due to very high demand for data retrieval from Nearline, a pause rule is in effect to allow manual task scheduling. You may submit tasks as normal and they will be run as quickly as possible.	Data storing and retrieving to/from the Nearline storage system.	Many tasks were manually scheduled and completed to help re-balance the system utilization. The endpoint pause rule was lifted and all tasks are running again.	hpssadmin@ncsa.illinois.edu
2018-08-07 07:00	2018-08-07 22:15	iForge / aForge	Quarterly Maintenance (20180807 Maintenance for iForge)	All systems will be unavailable during the maintenance.	In progress iForge was placed into production at 22:15 aForge was brought back online by 19:45	iforge-admin@ncsa.illinois.edu
2018-08-03 11:30	2018-08-03 13:30	NCSA VPN	A configuration issue caused some VPN users connection problems to some NCSA resources.	Some VPN users reported connectivity problems to some internal NCSA resources.	A configuration change was applied which corrected the routing issue.	help+neteng@ncsa.illinois.edu
2018-07-27 11:45	2018-07-27 13:45	NCSA Wiki	The wiki was being intermittently slow and unresponsive.	wiki.ncsa.illinois.edu	Upgraded several software packages and rebooted wiki server	help+its@ncsa.illinois.edu
2018-07-27 08:00	2018-07-27 08:15	NCSA VPN	The old NCSA VPN (vpn.ncsa.illinois.edu) was decommissioned. All users should be using the new VPN (sslvpn.ncsa.illinois.edu).	VPN was decommissioned.	The old VPN has been decommissioned and all users should be using the new VPN.	neteng@ncsa.illinois.edu
2018-07-26 14:00	2018-07-26 19:00	NCSA RT	The RT help site was being intermittently slow and unresponsive.	help.ncsa.illinois.edu	Upgraded several software packages and rebooted RT server	help+its@ncsa.illinois.edu
2018-07-25 14:30	2018-07-25 14:40	NCSA Wiki	Wiki Restart	Confluence service restarted		help+its@ncsa.illinois.edu
2018-07-24 13:20	2018-07-24 13:55	crashplan	crashplan was upgraded to 6.7.3 for latest feature and security updates. Client updates will push out to system automatically over the next few days.	all client paused backups for about 2 mins as servers restarted with new code.	now running Code42 6.7.3	crashplan@ncsa.illinois.edu
2018-07-22 19:14	2018-07-22 19:45	NCSA GitLab	NCSA GitLab server was updated.	Renewed SSL certificate Upgraded GitLab software Increased CPU & RAM	Completed	help+its@ncsa.illinois.edu
2018-07-19 18:44	2018-07-20 10:45:13	nebula	nebula controller experienced a fatal hardware error on 10gE nic	horizon interface to nebula https://nebula.ncsa.illinois.edu/ and all OpenStack command line tools are non-functional. Keystone authentication services are also off-line. Instances that were running should continue to run but restarting will probably fail until the controller is repaired. Launching new instances will also fail.	Replaced card, nebula.ncsa.illinois.edu is now accessible again.	nebula@ncsa.illinois.edu
2018-07-19 12:00	2018-07-19 12:30	LSST: lsst-dev-db and dependent services, including kubernetes lspdev	Following the July 19 planned maintenance, MariaDB services on lsst-dev-db are unavailable along with dependent services, including: kubernetes lspdev	DB services on lsst-dev-db along with dependent services, including: kubernetes lspdev	Resolved	lsst-admin@ncsa.illinois.edu
2018-07-19 08:00	2018-07-19 12:00	LSST	Monthly maintenance (July): Dell firmware updates/reboots OS package updates/reboots including upgrades to CentOS 7.5 GPFS client changes and upgrade to 4.2.3.9 GPFS server upgrade to 4.2.3.9	ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters) The following systems will remain online and unaffected: lsst-daq lsst-l1-* tus-ats01	Maintenance was successfully completed, although the following resultant issue is being tracked in a separate status event: DB services on lsst-dev-db are unavailable along with dependent services, including: lspdev	lsst-admin@ncsa.illinois.edu
2018-07-16 0900	2018-07-16 1938	Blue Waters	System was upgraded for security issues and to migrate to CUDA 9.1	Blue Waters compute and scheduler	Blue Waters is now updated	bwadmin@ncsa.illinois.edu David King
2018-07-09 – 1130	2018-07-10 – 1700	Campus Cluster Monitoring Webpage	SET is moving set-analytics to https. This should have been a simple change to a host name, but after the change the new value was not picked up.	The monitoring web page gave a loading circle that never resolved to anything.	Set up a Grafana instance for the display of the Campus Cluster monitoring.	help@campuscluster.illinois.edu
2018-06-28	2018-07-09	Nebula	Nebula was taken offline to repair the filesystem	All Nebula services	Nebula is performing well now	nebula@ncsa.illinois.edu
2018-06-29 -- 1300hrs	2018-07-08 – 1400hrs	Blue Waters Nearline Endpoint	Due to very high demand for data retrieval from Nearline, a pause rule is in effect to allow manual task scheduling. You may submit tasks as normal and they will be run as quickly as possible.	Tasks submitted to Globus will start in a paused state but will be released to run, at the earliest possible time, based on resource availability.	Backlog of file stages was cleared and endpoint pause rule removed.	hpssadmin@ncsa.illinois.edu
1800 2018-07-02	0600 2018-07-06	Access to NPCF	For the July 4th UIUC fireworks show, the parking Lots E14 and E14-shuttle will be closed from 6:00 p.m. Monday, July 2nd, through 6:00 a.m. Friday, July 6th. No parking will be allowed in these locations at any time during this period. Please do not park in the NPCF dock area - use the shuttle buses, or park in lot E46 (south on Oak St.).	Parking facilities for NPCF	Parking is back to normal
2018-05-03 14:30	2018-06-28 09:00	iForge gpu queue	both nodes in the general 'gpu' queue were offline due to issues with the GPUs	iForge 'gpu' queue could not be used	Tried driver updates and engaged with vendors; ultimately got one node working with 4 M40 GPUs rather than the previous 2 K80 GPUs; continue engaging with vendors to get the other node working but queue is now available.
0800 2018-07-02	1200 2018-07-02	Blue Waters Nearline	One tape library (of four) will be powered down for hardware maintenance (replacement of tape import/export module).	Access to tapes in the affected library will be blocked until the system returns to service. Users staging data may see delays in accessing data until the library is back online.	Work was completed with some delay (scheduled to complete by 0930) due to a failed SD card (used for storing and loading library geometry).	hpssadmin@ncsa.illinois.edu
2018-06-27 9:00	2018-06-27 1:00	LSST - k8s lspdev	kub001 unplanned reboot and kub004 ran out of memory.	lspdev JupyterHub	Nodes/Services rebooted. Kubernetes pods restarted.	lsst-admin@ncsa.illinois.edu
2018-06-27 08:30	2018-06-27 11:49	Slack	Slack is reporting connectivity issues on their status page (https://status.slack.com/)	Slack	Slack reports, "workspaces should be able to connect again"	feedback@slack.com
2018-06-23 19:44	2018-06-23 19:59	Blue Waters Scratch Filesystem	Top of Rack network switch died in rack 8. Cray onsite and performed a workaround and will replace Monday. Sonexion rack 28 lost mind and was rebooted.	Partial scratch outage of ost169-179	bypassed faulty switch, rack 28 sonexion rebooted. faulty switch replaced Monday 25th.	bw-admin@illinois.edu tbouvet
2018-06-21 -- 1200hrs	2018-06-23 -- 1045hrs	Blue Waters Nearline Endpoint	Due to very high demand for data retrieval from Nearline, a pause rule is in effect to allow manual task scheduling. You may submit tasks as normal and they will be run as quickly as possible.	Tasks submitted to Globus will start in a paused state but will be released to run, at the earliest possible time, based on resource availability.	Many tasks were pushed through the system by manually ordering them to reduce tape drive competition. Endpoint pause rule removed and all tasks resumed.	hpssadmin@ncsa.Illinois.edu
2018-06-21 08:00	2018-06-21 09:35	LSST	Monthly maintenance (June): pfSense firewall update OS package updates/reboots for CentOS 6.9 servers (lsst-web, lsst-xfer, lsst-nagios) Slurm update (lsst-dev01, lsst-verify-worker*) Update host firewalls on GPFS servers iDRAC configuration updates on lsst-dev01 and ESXi hosts	CentOS 6.9 servers: lsst-web lsst-xfer lsst-nagios Slurm/verification cluster Other impact was not expected but unexpected issues could have led to connectivity issues for other hosts or downtime for lsst-dev01 or hosted VMs	Maintenance was completed	lsst-sysadm@ncsa.illinois.edu
2018-06-20 14:00	2018-06-20 19:00	Campus Cluster	Rolling reboot of the core IO servers to move GPFS from 4.2.3.8 to 4.2.3.9 for CentOS 7.5 support; No downtime occurred	Successful Upgrade	Cluster now supports CentOS 7.5 clients	set@ncsa.illinois.edu
2018-06-18	2018-06-20 7pm	Nebula	Nebula was shut down to fix broken filesystems.	All Nebula services	Nebula is up and running again. Please contact nebula@ncsa.illinois.edu if you still see issues.	nebula@ncsa.illinois.edu
2018-06-19 08:00	2018-06-19 12:00	LSST L1 Test Stand	Scheduled Maintenance: BIOS firmware updates Puppet and firewall changes (including support of SAL unicast/multicast traffic) OS package updates (staying with CentOS 7.4)	Level One Test Stand, including: lsst-daq lsst-l1-*	Maintenance completed successfully	lsst-sysadm@ncsa.illinois.edu
2018-06-18 07:00	2018-06-18 09:30	vSphere & Various VMs	Two of our hosts went down with network interface errors.	Multiple VMs hosted on those nodes (incl. Fileserver, ncsa-print, and subversion)	Both hosts are back online as well as all VMs	help+its@ncsa.illinois.edu
2018-06-16 22:18:32	2018-06-17 08:10:00	cforge	PBSPro server was hung on cfsched	Job scheduling and job submission were failing.	restarted PBSPro server on cfsched	Jim Long
2018-06-15 1330hrs	2018-06-15 1530hrs	Blue Waters Nearline	Replacement of a tape robot transporter	This work is not expected to impact operations. The library system will continue to operate with a single transporter but mount times may be somewhat longer until the second unit is returned to service.	hpssadmin@ncsa.illinois.edu
2018-06-12 04:30	10:00	Blue Waters	Thunderstorms have resulted in a power interruption. This outage impacts both the compute nodes and all filesystems. Therefore, a full reboot will be necessary. Return to service is estimated to be approximately 10 am Central time.	Blue Waters in total	Full reboot
2018-06-12 ~03:45	2018-06-12 ~06:00	Campus Cluster	Many compute nodes rebooted. No system on UPS was affected, and some compute nodes remained up. Facilities at ACB report that there were no power events this morning or last night, but this seems the most likely cause.	Many compute nodes, but not all. Jobs on the nodes that rebooted were lost.	Nodes rebooted at a similar time, and many returned in a state unsuitable to run jobs. Rebooting in smaller groups got everything working again.	help@campuscluster.illinois.edu
2018-06-12 ~03:40	2018-06-12 ~06:30	iForge	A storm caused a brief power event which impacted: big_mem queue skylake queue	All nodes in the big_mem and skylake queues were rebooted by the power event.	Nodes rebooted on their own and were marked back online in the scheduler by around ~6:30am.
2018-06-12 ~03:40	2018-06-12 09:00	LSST	Storm caused power event which impacted: Kubernetes Commons / lsst-lspdev 75% of verification cluster compute / Slurm	The following nodes rebooted because of the power event: all kub* nodes (causing outage of Kubernetes Commons / lsst-lspdev) 75% of verify-worker* nodes (partial outage of Slurm / verification cluster compute nodes)	verify-worker nodes were put back online in Slurm around 06:10 Kubernetes Commons resumed service by around 09:00	lsst-sysadm@ncsa.illinois.edu
2018-06-11 08:30	2018-06-11 8:35	Campus Cluster ADS	Vlan changes on campus cluster	campus cluster - Active data storage (ADS)	Maintenance completed successfully	help+neteng@ncsa.illinois.edu
2018-06-07 06:30	2018-06-07 14:00	Blue Waters	The boot node crashed, requiring the system to be rebooted. File system and ESLogins remained up.	All running jobs were lost, and no new jobs were started until the system was returned to service. Torque was updated to ver. 6.1.2.	bw-admin@ncsa.illinois.edu
2018-06-01 00:50	2018-06-01 03:50	Blue Waters	/var space filled up by additional logging in Moab to troubleshoot job slide issue.	PBS server went down due to no space in /var	Zipped and moved old Moab logs to lustre file system to free up /var space, then restarted PBS server.	bw-admin@ncsa.illinois.edu
2018-05-31 14:00	2018-05-31 14:10	NCSA Open Source	Retirement of both HipChat and FishEye/Crucible	Services will be shutdown and archived.	Services are disabled and will be archived in a month.
2018-05-31 08:00	2018-05-31 11:55	NCSA ITS vSphere vCenter	ITS vSphere vCenter server will be upgraded to the latest VMware vCenter 6.7	All VMs will remain online during the maintenance, but management through vCenter will be offline during the upgrade.	Successful upgrade to VMware vCenter 6.7.	help+its@ncsa.illinois.edu
2018-05-23 06:55	2018-05-24, 1900hrs	Campus Cluster File System	A failure of both disk array controllers serving the CC file systems resulted in abrupt loss of access to the underlying storage. One array controller was identified as broken while the storage system was brought back up on the remaining controller for inspection and analysis. A thorough check of the file systems and storage devices was started. At 1100hrs May 24th the replacement array controller arrived and was installed. After further testing to assure system stability, the file systems were brought back online and released to the cluster admins.	All campus cluster file systems	Normal cluster operations were resumed. Investigation into the root cause is ongoing with the cooperation of the system manufacturer.
2018-05-21	DNS1/2	There were a few reports of intermittent DNS lookups failures/slowness		Firewall state tables resources were being exhausted. Limits for those state tables have been increased. This appears to have resolved the problem.	No further reports of the issue, after making the adjustment.	help+neteng@ncsa.illinois.edu
2018-05-24 10:55am	2018-05-24 11:08am	ifsm.ncsa.illinois.edu	System is being upgraded and rebooted	No services should be affected	yum upgrade and reboot
2018-05-17 8:00	2018-05-17 15:00	NPCF-Core-East	The hardware and firmware on the core east router were upgraded	Traffic rerouted through npcf-core-west during the maintenance window. There was an unexpected outage for about 10 mins which impacted network connectivity throughout NCSA.	Upgrade on core-east was completed successfully. No further network outages are expected.
2018-05-09 7:00	2018-05-09 17:40	dns1.ncsa.illinois.edu	Enabling BIND on ipv6 and enabling a firewall on the server	No impact is expected.	Maintenance was completed.
2018-05-17 08:00	2018-05-17 13:30	LSST	Monthly maintenance (May): GPFS server & client updates, plus nosuid mounting Physical firewall changes in NPCF for new vLANs BIOS firmware updates OS updates Update of puppet-stdlibs module	All systems (except lsst-daq, lsst-l1-*, & tus-ats01) were unavailable for maintenance.	Maintenance was extended until 13:30 and then completed. External Grafana monitoring (monitor-ncsa.lsst.org) was offline until 14:25 due to storage rebuild on lsst-monitor01.
2018-05-17 10:13	2018-05-17 10:18	Core Outage	During core router maintenance the incorrect core router was powered off.	Network connectivity across NCSA was affected.	The core router was powered back on, verified and brought back into service.
2018-05-16 08:00	2018-05-16 17:40	Campus Cluster	Monthly maintenance (May) GPFS upgrade to 4.2.3.8 FW upgrade on Juniper switches OS updates Add 4 more 40G cables for ccioe nodes for redundancy	Entire system was unavailable for maintenance.	Maintenance complete, all tasks complete.
2018-05-16 1100hrs	2018-05-16 1300hrs	ADS	Planned Campus Cluster network upgrades also impacted access to ADS	All ADS storage exports became unreachable	Eric has notified us that the networking maintenance is complete and ADS customers are able to access their storage again.
21 Mar 2018	14 May 2018	openxdmod.ncsa.illinois.edu	An update to Torque broke the updates of XDMoD. openxdmod.ncsa.illinois.edu was offline while the system it resided on was updated, all the dependency software was installed, and the latest version of XDMoD was installed. Then all the data had to be re-imported.	Software updating	Service restored with updated software.
2018-05-08 0000	2018-05-09 0015	NCSA Storage Condo	One node ran out of memory, causing a deadlock in GPFS. During deadlock recovery, GPFS shut down on multiple nodes. Upon restart of the cluster, a different metadata server had a check on its PCI bus, forcing another unmount. All file systems but one were recovered. While recovering the last one, one of the Roger NetApp storage arrays started throwing errors, requiring a power cycle of the controller and disks, prompting a final recovery of the last file system.	Condo file systems and services.	All file systems recovered and services restored.
2018-05-08 07:00	2018-05-08 07:40	iForge	Quarterly Maintenance (20180508 Maintenance for iForge)	All systems were unavailable during the maintenance.	Planned maintenance completed successfully
2018-05-08 8:00	2018-05-09 8:00	NPCF-Core-West	The hardware and firmware on the core router will be upgraded	Traffic will be rerouted through npcf-core-east during the maintenance window. No impact is expected.	The hardware and firmware was upgraded on npcf-core-west without incident. Traffic has been successfully failed back.
2018-05-03 08:45	2018-05-03 10:15	NCSA WIKI, JIRA, services that rely on NCSA LDAP	A large number of connections from two particular servers were hitting LDAP, causing a slowdown that in turn caused timeouts for various applications using LDAP authentication. Blocking the culprit servers remedied the situation.	NCSA WIKI, NCSA JIRA, other applications that rely on NCSA LDAP authentication.	Culprit servers were blocked
9:00am	9:25am	syslog-sec.ncsa.illinois.edu	out of cycle patching of Security Syslog collectors to address CVE-2018-1000140	Load balance fail over to secondary collector, RELP will be buffered.	relay-01 was updated and loadbalancer failed back.
4/25 14:00	4/25 15:00	MREN WAN Circuit	WAN circuit testing.	Traffic will be re-routed over an alternate peering during the test period.	The MREN circuit was brought back in to production.
2018-04-24 12:30	2018-04-24 16:00	NCSA jabber service	jabber was down while we repaired its authorization configuration.	jabber.ncsa.illinois.edu wasn't accepting jabber logins	jabber working again.
2018-04-24 09:10	2018-04-24 09:50	LSST	increased LDAP timeout to 60 seconds in sssd.conf to fix problems with long login times and failure to start batch jobs	kub, verify-worker	sssd.conf updated, sssd restarted verify-worker nodes were drained during the change affected nodes may have slow LDAP response times for a short while (due to local cache needing rebuilt)
04/18/2018 10:30	04/18/2018 11:30	ICCP April Maintenance	Replaced 4x10G links from cc-core0 to carne. Updated BIOS on remaining parts of Cluster nodes.	No outage.	Completed without any outage.
04/18/2018 10:30	04/18/2018 11:30	ICCP core switches	One of the 4x10G links from cc-core0 to carne had incrementing errors and has been administratively down to prevent those errors from affecting traffic. There was a scratched fiber that earlier diagnosis had revealed, so we replaced the fiber during this ICCP PM.	Nothing, all traffic rerouted through cc-core1	The errors are still incrementing, but we've narrowed down the remaining options for what might be going on.
4/12 0930	4/12 1830	ADS NFS/Samba	The ESXi Hypervisor server had an error on it: 'A PCI error requiring a reboot has occurred.'.	ADS NFS/Samba/Gridftp	The server was rebooted, the error cleared and all systems/services were restarted.
4/11 03:00 p.m.	4/11 03.15 pm	Netact	Netact code was updated. Going forward new office activation names will have "-ofc" appended to them.	No service impact to Netact.	Change was successfully implemented. Netact remained in service during and after the change.
4/11 9:00	4/11 10:00	LSST NPCF Firewall	Primary firewall will be upgraded to use FRR instead of openBGP.	No impact is expected. The firewalls do not need to be failed over and no interruption in traffic flow is anticipated.	Firewall was successfully migrated. No downtime occurred.
4/10 17:00	4/10 18:00	dns1.ncsa.illinois.edu	OS Patching and BIND updates	dns1 (secondary DNS server) will be rebooted to apply patches. DNS2 will remain up.	DNS1 OS patching is completed. BIND was upgraded to 9.11. BIND is only bound currently to its ipv4 interface.
4/10 15:00	4/10 16:00	dns2.ncsa.illinois.edu	OS Patching and BIND updates	dns2 (secondary DNS server) will be rebooted to apply patches. DNS1 will remain up. An IPv6 address will also be added to system in preparation for a broader IPv6 DNS rollout.	DNS2 OS was patched. BIND was upgraded to 9.11. An IPv6 address was also enabled on the server and BIND is listening on that address.
4/04/2018 16:00	4/04/2018 17:00	MREN WAN Circuit	Port Move	Traffic will be re-routed over an alternate peering during the maintenance.	The port was moved and the circuit was brought back into service without issue.
04/04/2018 16:17:00	04/04/2018 16:42:00	LDAP	LDAP process crashed	Authentication to LDAP-backed services	LDAP was upgraded and restarted
	3/29/2018 17:00	MREN WAN Circuit	WAN circuit testing.	Traffic will be re-routed over an alternate peering during the test period.	Testing was completed and the circuit was brought back into service.
2018-03-21 08:00	2018-03-21 17:30	Campus Cluster manage server and compute nodes except DES and MWT2	Deploying new management server, upgrading to Torque 6.1.2 and Moab 9.1.2. BIOS update. Configuration changes on GPFS servers. Tech Service CARNE code upgrade.	Scheduler down. User access disabled	New management server is up with CentOS 7. Installed Torque 6.1.2 and Moab 9.1.2. BIOS updates are done on most nodes. Configuration changes on GPFS done. Tech services CARNE code upgrade done.
2018-03-16 1:00pm	2018-03-16 5:45pm	ISDA + NCSA OpenSource	Security patches of VM servers as well as backend filesystem Updates of Bamboo, JIRA, Confluence, BitBucket and CROWD	All systems will be unavailable for a brief period of time. During updates of OpenSource services part of OpenSource will be offline for up to an hour.	Updated fileserver (brief struggle with zfs and kernel updates). Updated proxmox servers. Updated JIRA, Confluence, BitBucket and CROWD. Bamboo will be done later this weekend.
2018-03-12 9:00am	2018-03-12 5:00pm	Nebula Openstack cluster	Security and filesystem patches	All instances and Nebula services were unavailable	Filesystem updates and security patches were applied. Filesystem is more responsive, but ~20 instances are repairing from problems that occurred before the outage.
2018-03-15 12:20	2018-03-15 16:20	LSST	Lingering issues on select nodes following March PM	lsst-qserv-master01 - cannot mount local /qserv volume lsst-xfer - issue w/ sshd lsst-dts - issue w/ sshd lsst-l1-cl-dmcs - unknown issue lsst7 - issue w/ sshd	Following resolved by 13:23: lsst-qserv-master01 lsst-xfer lsst-dts lsst-l1-cl-dmcs Resolved by 16:20: lsst7
2018-03-15 08:00	2018-03-15 12:20	LSST	March maintenance: GPFS server updates and configuration of additional NFS/Samba services Urgent Firmware updates Increase size of /tmp on lsst-dev01 Hardware maintenance/memory increases on select servers/VMs Release of refactored Puppet code for NCSA 3003 servers OS updates Recabling servers in NCSA 3003 to new switches	All systems were unavailable for maintenance.	Completed and most systems back online. Lingering issues for lsst-qserv-master01, lsst7, lsst-xfer, lsst-dts, and lsst-l1-cl-dmcs are being tracked in a separate status event.
2018-03-14 12:00	2018-03-14 12:35	Remote Access VPN	An issue with authentication for the VPN has occurred.	Any new connections will not be established. Existing connections are unaffected.	Authentication services were restored.
2018-03-09 10:08am	2018-03-09 11:00am	Campus Cluster	According to IBM, cc-mgmt1 was the culprit in halting communication across the cluster during the GPFS snapshot process.	Users couldn't log in or access the filesystem.	Rebooted cc-mgmt1 and restarted services (RM & Scheduler).
2018-03-09 06:05	2018-03-09 08:00	public-linux, www.ncsa.illinois.edu, & events.ncsa.illinois.edu	A routine kernel upgrade resulted in failure of the OpenAFS client on these servers.	OpenAFS storage was unavailable on these servers, resulting in the website failures.	Resolved. Packages were updated and OpenAFS reinstalled.
2018-03-07 15:00	2018-03-07 16:10	LSST	qserv-db12 had one failed drive in the OS mirror replaced but the other was presenting errors as well so the RAID could not rebuild. The Qserv system would have been unavailable during this maintenance.	qserv-db12	The node was taken down for replacement of the 2nd disk, to rebuild the RAID in the OS volume, and to reinstall the OS.
2018-03-07 14:00	2018-03-07 14:40	ESnet Peering	The connection servicing our direct peering with ESnet will be moved during this window.	Connections will be rerouted over a redundant peering. No service impact is expected.	The connection was successfully migrated and the peering with ESnet was brought back into service without issue.
2018-03-06 0100	2018-03-06 1040	WAN Connectivity Degraded	The router servicing several of our WAN connections is currently in a degraded state.	Traffic has been gracefully rerouted. No user facing connectivity issues have been reported.	Graceful failover to the backup routing engine cleared a fault condition and affected peerings were re-established.
2018-02-27 07:15	2018-02-27 09:10	Campus Cluster scheduler	Scheduler became unresponsive	Job submission & starting new jobs	Rebooted the node, restarted RM & Scheduler.
2018-02-26 06:00	2018-02-27 01:35	All Blue Waters Services	Security Patch CLE, SU26 Lustre patch	All Blue Waters resources are unavailable	Blue Waters returned to service at 1:35AM 27th Feb, with HPSS returned earlier at 10PM 26th Feb.
2018-02-23 16:30	2018-02-23 16:30	Kerberos Admin service	KDC configuration was modified to allow creation of service principals that can create and modify host and service principals.	kadmin service was unavailable for 1 second while new config was read.	We can now delegate to groups or users the ability to create and manage host keys and service principals.
2018-02-23 08:00	2018-02-23 09:00	LSST Puppet Changes	Rolled out significant logic and organization of the Puppet resources in NCSA 3003 data center in order to standardize between LSST Puppet environments at NCSA. We had done extensive testing and did not expect any outages or disruption of services.	No interruption of services. Changes being applied to: lsst-dev01, lsst-dev-db, lsst-web, lsst-xfer, lsst-dts, lsst-demo, L1 test stand, DBB test stand, elastic test stand.	Updated successfully with no interruption of availability or services.
2018-02-21 13:30	2018-02-22 00:39	ESnet 100G Peering Down	There is a suspected fiber cut between Urbana and Peoria on ICCN optical equipment. Our 100G direct WAN path to ESnet rides over this optical path and is thus currently down. The fiber vendor has identified the source of the problem (high water caused the fiber to be pulled out of a splice case)	Nothing. All traffic destined for ESnet or resources that would normally take the ESnet WAN path will reroute through our other WAN paths	Repaired.
2018-02-21 08:00	2018-02-21 20:00	Campus Cluster	Campus Cluster February Maintenance Applying security patches & OFED upgrade Testing/tuning metadata performance Troubleshoot/upgraded code on cc-core switches	All systems were unavailable	Partially completed; the following items are rescheduled for the next maintenance. Deploying new scheduler (due to system stability) Upgrading Torque 6.1.2 and Moab 9.1.2 (not enough time for testing after release) Maintenance on CARNE router (bug in the code)
2018-02-05 10:45	2018-02-21 13:30	ICCP Networking - Outbound	A hardware failure on one of the two core switches for ICCP caused that switch to enter a degraded service mode and eventually fail completely. This was also combined with software bugs that caused looping of packets between the two cores in the MC-LAG. The other core was still functioning properly and was providing connectivity for all ICCP/ADS/DES systems normally for the duration of the degraded service time period. A hardware replacement RMA was initiated. The hardware came in but the hardware alone did not fix the issue. We then waited until an ICCP PM where we could test things without interruption of service and we upgraded the code and put in some bug mitigation configuration changes. These things combined solved the issues.	Nothing as far as production. During the period where cc-core0 was down, aggregate bandwidth outbound was 40Gbps instead of the normal 80Gbps.	As of now the cores are both in production and stable.
2018-02-16 12:00	2018-02-16 12:30	IPSEC VPN	The appliance servicing various IPSEC VPN connections was patched.	Nothing	Patch was successful utilizing the failover capability of the VPN cluster to mitigate any service interruptions
2018-02-15 08:00	2018-02-15 13:00	LSST	February maintenance: Updating GPFS mounts to access new storage appliance Rewire 2 PDUs at NCSA 3003 Switch stack configuration changes at NCSA 3003 Routine system updates Firewall maintenance NPCF Updates to system monitoring	All systems were unavailable.	Completed and all systems back online.
2018-02-13 08:00	2018-02-13 09:00	Certificate System Firewall 2	Upgrade software to current production version. No interruptions to service expected	CA services	FW upgraded - services were interrupted due to failed routing service.
2018-02-13 06:00	2018-02-13 06:30	AnyConnect VPN	Patches are being applied to the AnyConnect VPN appliance	Access to the NCSA AnyConnect VPN will be unavailable.	The VPN has been patched and client connections have been re-established.
2018-02-10 02:00	2018-02-10 10:35	Campus Cluster	GPFS snapshot hung and locked the filesystem	All systems were inaccessible. Lost running jobs.	Gathered information for IBM, bounced the filesystem, and rebooted the cluster
2018-02-06 07:00	2018-02-06 17:35	iForge	Quarterly Maintenance (20180206 Maintenance for iForge)	All Systems were unavailable during the maintenance.	Planned maintenance completed successfully
2018-02-06 08:00	2018-02-06 09:00	Certificate System Firewall 1	Upgrade software to current production version. It is expected that current connections will be interrupted and a retry will be required.	cilogon.org idp.ncsa.illinois.edu idp.xsede.org NCSA TFCA Myproxy XSEDE Myproxy	Completed
2018-02-01 16:30	2018-02-01 16:45	sslvpn.ncsa.illinois.edu	We are rebooting our VPN appliances to mitigate a critical security vulnerability that allows for remote code execution exploits. That vulnerability is described here: https://tools.cisco.com/security/center/content/CiscoSecurityAdvisory/cisco-sa-20180129-asa1	Certain Industry partners' site-to-site VPNs	VPN rebooted without incident. Service was restored at 4:34PM.
2018-02-01 16:30	2018-02-01 16:45	vpn.ncsa.illinois.edu	We are rebooting our VPN appliances to mitigate a critical security vulnerability that allows for remote code execution exploits. That vulnerability is described here: https://tools.cisco.com/security/center/content/CiscoSecurityAdvisory/cisco-sa-20180129-asa1	Certain Industry partners' site-to-site VPNs and the NCSA remote access VPN service will be down during the maintenance. Any users connected to the NCSA VPN at the time of the maintenance will lose connectivity.	VPN rebooted without incident. Service was restored at 4:34PM.
2018-01-29 10:05	2018-01-29 10:10	LSST verify worker nodes and lsst-dev	A network flap on the LSST network caused GPFS ejection of some nodes. Network and security are investigating	a few of the LSST nodes for 2-5 minutes and 2 jobs	Qualys scan time frame changed and investigation continues.
2018-01-29 12:27	2018-01-29 12:31	NCSA Jabber service	Jabber service was restarted to install a new SSL certificate.	NCSA Jabber was down momentarily	NCSA Jabber restarted with new SSL certificate
2018-01-26 13:00	2018-01-26 13:15	LSST NFS service slowdown	A cron for lenovo system cleanup was run, and caused the lenovo box to slow down services. The NFS service was starved.	lsst-dev NFS showed stale mounts	cron deleted, and re-written.
Wed 1/24/2018 13:35	Wed 1/24/2018 14:55	LSST NFS service	We were notified by NCSA security team that there was a stale NFS mount on one of the LSST test nodes. NFS services stopped working	All NFS mounts for LSST systems such as lsst-demo and lsst-SUI were not working	NFS server was rebooted.
Tue 1/23/2018 23:00	Wed 1/24/2018 01:25	Condo storage services	Hit a known bug in GPFS 4.2.0.4 for quota management.	All Condo services from 11pm to 1:25 am	Need to upgrade to a newer level of GPFS, but for now we have lowered frequency of the check_fileset_inodes script
2018-01-22 07:00	2018-01-22 13:05	Blue Waters Compute Nodes	Blue Waters compute nodes were bounced to resolve issues caused by previous home file system outage (due to bad OST)	Compute nodes were down, scheduler was paused.	Compute nodes were bounced successfully and returned to full service.
2018-01-21 08:42	2018-01-21 11:30	Netsec-vc switch stack - FPC 4	Switch member 4 of the netsec switch stack was down. Severe filesystem corruption occurred on the primary partition.	Any hosts connected to member 4 of that switch that were not redundantly connected to other switches in the stack.	The switch was repaired by doing a full reformat/reinstall of JunOS. Everything is back into production.
2018-01-20 22:00	2018-01-21 0300	Condo file systems	Bringing the Roger disk into the condo, commands executed from the Roger GPFS servers caused the cluster to arbitrate for GPFS servers.	All condo file systems mounted on nodes.	The SSH configuration was changed on the Roger GPFS servers to include the Condo GPFS server IP's. All file systems were returned to normal with no other problems and no remounts required.
2018-01-18 17:00	2018-01-19 15:00	ISDA Hypervisors, NCSA Open Source	Hypervisor updates.	All systems were down for short amount of times as hypervisors rebooted	All patches applied.
2018-01-18 00:00	2018-01-18 24:00	Campus Cluster	Copying all data to new filesystem. Deploying new Storage (14K). Dividing cluster into two (IB & Ethernet). Upgrading GPFS to 4.2.3.6. Deploying new management node and new image server (if time permits). Applying security patches to compute nodes (no FW update at this time).	All systems unavailable.	New Storage System was brought online, additional capacity and performance was added.
2018-01-18 18:40	2018-01-18 23:00	LSST	LSST Firewall outage in NPCF. Both pfSense firewalls were accidentally powered off.	PDAC (Qserv & SUI) and verification clusters were inaccessible, as well as introducing GPFS issues across many services, e.g. lsst-dev01.	The pfSense firewall appliances were power cycled and services restored.
2018-01-18 12:58	2018-01-18 14:10	Code42 Crashplan backup system	Code42 Crashplan server were upgraded to latest JDK and Code42 6.5.2.	Clients were unable to perform restores or push files into backup archive from roughly 13:35 - 13:55	Code42 servers are now running latest security updates to the crashplan service.
2018-01-18 08:00	2018-01-18 10:00	LSST	Monthly OS updates, network switch updates, firmware updates, etc.	All dev systems unavailable. Qserv and SUI nodes will remain available.	COMPLETE
2018-01-17 10:35	2018-01-17 13:00	RSA Authentication Manager Servers	Upgraded to Authentication Manager 8.1sp1p7	No systems should have seen any impact	Latest security patches are applied.
2018-01-12 06:00	2018-01-12 10:00	Decommission NCSA Rocket.chat	The old NCSA Rocket.chat service was shut down.	Any archived conversations or content are no longer available to users.	NCSA Rocket.chat service was shut down and redirected to NCSA @ Illinois Slack.
Friday, Jan 12th, 0000-0600 CST	Internet2	Engineers from Internet2 will be migrating our BGP peering with I2's Commercial Peering Service (CPS) to a new location. Small disruptions may occur with the maintenance for the CPS service, but no user traffic disruptions should occur.	None. Alternative routes are present.	none	Maintenance was completed successfully.
2018-01-11 08:00	2018-01-11 13:30	LSST	Critical patches on lsst-dev systems (incl. kernel updates).	All systems unavailable.	COMPLETE
Thursday, Jan 11th, 0000 CST	Thursday, Jan 11th, 0400 CST	Connectivity to Internet2 and backup LHCONE peerings - ICCP and MWT2 respectively	Engineers from Internet2 performed maintenance that affected certain BGP peerings that exist on the device that is ICCP/MWT2's upstream router, CARNE. Specifically, both the 100G Internet2 peering and the Internet2 LHCONE peering on CARNE were disrupted during this timeframe. MWT2 currently gets to LHCONE through CARNE's ESnet peering, which was fully functional. They also were able to get to UChicago through CARNE's OmniPoP 100G peering. As for ICCP, traffic to/from Internet2 based routes rerouted through the ICCN.	Neither ICCP nor MWT2 reported any service impact from this maintenance.	Successful maintenance was completed.
2018-01-08 10:47	2018-01-08 11:30	Nebula	Storage nodes lost networking	All nebula instances	Storage nodes were brought back online, instances were rebooted
2018-01-02 09:00	2018-01-05 17:00	Nebula	Nebula was shut down for hardware and software maintenance from January 2nd, 2018 at 9am until January 5th, 2018 at 5pm. Spectre and Meltdown patches were applied, as well as all firmware updates, OS/distribution updates, and the filesystem was upgraded.	All systems were unavailable.	Faster system that is now homogenous, so OpenStack upgrades are now possible.
2018-01-04 17:00	2018-01-05 20:00	Blue Waters	One OST hosting the home file system had three drives fail simultaneously.	A portion of the home file system (with data on the affected OST) was not accessible.	Repair work was carried out on the failed OST. The scheduler continued to operate but started only jobs not affected by the failed OST. Full operation resumed after successful recovery of the failed OST.
2017-12-20 08:00	2017-12-20 10:00	LSST	(1) Firewall maintenance (08:00-09:00) and (2) migration of NFS services (08:00-10:00).	Firewall maintenance: There should be no noticeable effect but scope of service includes most systems at NPCF (including PDAC, SUI, and Slurm/batch/verify nodes). Migration of NFS services: SUI and lsst-demo* nodes.	Maintenance completed without issues.
2017-12-14 06:00	2017-12-14 20:30	LSST	Monthly OS updates, network switch updates, firmware updates, etc.	All systems unavailable.	All systems back online. We ran into issues with the policy based routing on the LSST aggregate switches in NPCF that caused the outage to be extended longer than planned.
2017-12-13 09:00	2017-12-13 11:00	JIRA Upgrade	Upgraded JIRA to version 7.6 from 7.0	NCSA Jira	Successfully upgraded
2017-12-13 06:30	2017-12-13 07:39	NCSA Jabber	Attempted to upgrade Openfire XMPP jabber software.	NCSA Jabber was unavailable during the upgrade.	The upgrade failed. Jabber is available, but still running the old version. The upgrade will be rescheduled.
2017-12-11 10:00	2017-12-11 16:00	Unused AFS fileservers were upgraded to 1.6.22	After moving all volumes to servers updated on 2017-12-07, the now unused AFS servers were upgraded to OpenAFS 1.6.22.	No impact to other systems as they were unused at the time they were upgraded.	All of NCSA's AFS cell is running on OpenAFS 1.6.22
2017-12-09 03:00	2017-12-09 07:42	BlueWaters Portal	The BlueWaters portal software crashed. Automated monitoring processes did not restart it correctly.	The BlueWaters portal website was unavailable.	The BlueWaters portal service was manually restarted and the website is available.
2017-12-09 1000hrs	2017-12-09 1400hrs	Globus Online (Globus.org)	Please be advised that the Globus service will be unavailable on Saturday, December 9, 2017, between 10:00am and 2:00pm CST while we conduct scheduled upgrades. Active file transfers will be suspended during this time and they will resume when the Globus service is restored. Users trying to access the service at globus.org (or on your institution's branded Globus website) will see a maintenance page until the service is restored.	All NCSA Globus endpoints.
2017-12-07	2017-12-07	Unused AFS file servers were upgraded to 1.6.22	Three unused AFS fileservers were upgraded to the latest 1.6.22 release of OpenAFS	No impact to other systems as they were unused.	These AFS fileservers can no longer be crashed by malicious clients.
2017-12-07	2017-12-07	AFS database servers were upgraded to 1.6.22	The three database servers were upgraded to the latest 1.6.22 release of OpenAFS	No modern clients noticed the staggered updates.	These servers can no longer be crashed by malicious clients.
2017-12-05 16:00	2017-12-05 16:20	dhcp.ncsa.illinois.edu	NCSA Neteng will be migrating the DHCP server VM to Security team's VMware infrastructure.	- Hosts on the NCSAnet wireless network might be impacted. - Any activated hosts that might be on the roaming range might be impacted. + Illinoisnet and Illinois_Guest wireless will be available at ALL times. + Wired network connection will be available throughout the maintenance window.	Maintenance was completed successfully and services are running as expected.
2017-12-02 09:30	2017-12-02 11:45	NCSA opensource	Upgrade of Bamboo, JIRA, Confluence, BitBucket FishEye, and CROWD	Sub services of opensource can be down for a short time.	All services upgraded and running as normal.
2017-11-20 18:21	2017-11-29 14:30	ROGER OpenStack cluster	I/O issues highlighted that GPFS CES NFS servers probably shouldn't run 400+ days without reboot	ROGER's OpenStack and the various services which were hosted therein, including JupyterHub Server	reboot of all nodes, including CES servers as well as the reboot of all hypervisors (with the fallout being one node required fsck and second reboot and another node/hypervisor is still unavailable) cleared most of the problems. I/O contention was felt as many instances were simultaneously attempting to start/restart. instances that were housed on the unavailable node are being migrated to another hypervisor
2017-11-21 9:00	2017-11-22 14:00	Open Source ISDA servers	Update the fileserver that hosts VM's all the XEN servers.	NCSA Open Source unavailable Most of ISDA servers unavailable	Network issues delayed updates All hosts updated and everything back to normal.
2017-11-21 16:00	2017-11-21 16:40	Code42 Crashplan	The Code42 crashplan infrastructure was upgraded to version 6.5.1 to apply security and performance improvements	Clients transparently reconnected to servers after they restarted	Now running on Code42 version 6.5.1
2017-11-20 9:00	2017-11-20 16:38	Nebula Openstack cluster	Nebula OpenStack cluster was unavailable for emergency hardware maintenance. A failing RAID controller from one of the storage nodes and a network switch were replaced.	Not all instances were impacted. Running Nebula instances that were affected by the outage were shut down, then restarted again after we finished maintenance.	Nebula is available. No additional maintenance is needed for Tuesday, November 21.
2017-11-16 16:46	2017-11-20 12:40	NCSA JIRA	JIRA wasn't importing some email requests properly after the NCSA MySQL restart.	Some email sent to JIRA via help+ addresses wasn't being imported.	JIRA is now accepting email and all email sent while it was broken has now been imported as expected.
2017-11-16 08:30	2017-11-16 13:30	BW LDAP Master (Blue Waters)	Scheduled maintenance	Updated LDAP lustre quotas to bytes and add archive quotas. IDDS will track and drive quota changes with acctd.	Production continued w/o interruption. BW LDAP master was isolated, lustre quotas changed to bytes with the addition of archive quotas. Replicas pulled updates w/o error.
2017-11-16 14:30	2017-11-16 16:52	Internal website (MIS Savanah)	A database table used by MIS tools became corrupted.	The website would become unresponsive every time the corrupted database table was accessed.	OS kernel and packages were updated during debugging. The MIS database table was restored and the website came back online.
2017-11-16 16:46	2017-11-16 16:48	NCSA MySQL	The NCSA MySQL server had to be restarted in order to delete the corrupted table used by MIS.	All services that use MySQL were down during the outage. This includes: Confluence, JIRA, RT, and lots of websites	MySQL was restarted successfully.
2017-11-16 0800	2017-11-16 1200	LSST	Monthly OS updates, plus first round of Puppet technical debt changes (upgrading to best design & coding practices)	All systems unavailable from 0800 - 1000 hrs. GPFS unavailable from 0800 - 1000 hrs. PDAC systems unavailable from 0800 - 1200 hrs.	Completed. OS kernel and package updates. Slurm upgrade to 17.02.
2017-11-15 13:30	2017-11-15 15:10	RSA Authentication Manager	RSA Authentication Manager was patched to fix cross-site scripting vulnerabilities and other issues	Nothing was affected by the update	RSA Authentication Manager is running 8.2 SP1 P6. Process worked as expected.
2017-11-15 - 13:30	2017-11-15 - 14:30	BW 10.5 Firewall Upgrade Part 2	The normal active, "A" unit, NCSA BW 10.5 Firewall will be upgraded and then normal fail-over status will be re-enabled.	The possibility of connection resets when the A unit comes back from being upgraded and state is being synced.	Completed, process worked as expected.
2017-11-14 11:27	2017-11-14 11:33	LDAP	LDAP was unresponsive to requests.	Several services hung while authentication was unavailable.	LDAP services were killed and restarted.
2017-11-05 02:15	2017-11-06 17:11	ROGER Hadoop/Ambari	cg-hm12 and cg-hm13 took minor disk failures which crashed the node	Ambari was effectively off-line	rebooted node, and node ran fsck as part of its startup sequence, node booted properly
2017-10-31 17:22	2017-11-03 17:00	ROGER hadoop/ambari	hard drive failures on cg-hm10 and cg-hm17	certain ambari services and HDFS	cg-hm17 returned to service after power cycle and reboot, cg-hm10's hard drive didn't respond to a reboot
2017-11-11 16:58	2017-11-11 19:09	Blue Waters	Water leak from XDP4-8 causing high temperature to c12-7 and c14-7.	EPO on c12-7 and c14-7.	Scheduler was paused to place system reservations on compute nodes in affected cabinets, then resumed.
2017-11-10 14:00	2017-11-10 14:45	NCSA Open Source	Upgrade of the following software: Bamboo, JIRA, Confluence, and BitBucket	Updates will happen in place and will result in minimal downtime of components.	completed, minimal interruption of service
2017-11-10 - 08:00	2017-11-10 - 08:30	CA Firewall Upgrade - B unit	the stand-by, "B" unit, NCSA Certificate Service Firewall will be upgraded to same version as A unit.	Expect no impact to services	completed, no interruption of service
2017-11-08 16:30	2017-11-08 17:30	Netdot	netdot.ncsa.illinois.edu was migrated to Security's VMware infrastructure.	During the downtime users weren't able to activate or deactivate their network connections via Netact.	Migrated successfully. Netdot is up and running.
2017-11-08 06:00	2017-11-08 15:00	ITS vSphere vCenter	ITS vSphere was upgraded to the latest version of VMware vCenter. New access restrictions were also put into place.	All VMs remained online during the maintenance, but management through vCenter was offline during the upgrade.	Upgrade completed successfully.
2017-11-08 09:30	2017-11-08 10:00	BW 10.5 Firewall Upgrade Part 1	the stand-by, "B" unit, NCSA BW 10.5 Firewall will be upgraded and then traffic redirected through it for load testing before the "A" unit is upgraded	Expect no impact to services	Upgrade completed successfully. Some states were reset when traffic switched to the B unit.
2017-11-07 7:00	2017-11-07 18:37	iForge	quarterly maintenance Update OS image. Update GPFS to version 4.2.3-5 Redistribute power drops. Update TORQUE. BIOS updates.	iForge (and associated clusters)	All production systems are back in service
2017-11-07 - 13:30	2017-11-07 - 15:00	CA Firewall Upgrade Part 2	The normal active, "A" unit, NCSA Certificate Service Firewall will be upgraded and then normal fail-over status will be re-enabled.	The possibility of connection resets when the A unit comes back from being upgraded and state is being synced.	Completed upgrade
2017-11-06 15:28	2017-11-06 15:53	Blue Waters	EPO happened to c12-7 and c14-7.	HSN quiesced.	Scheduler was paused to place system reservations on compute nodes in affected cabinets, then resumed.
2017-11-03 16:21	2017-11-03 16:32	LDAP	LDAP was unresponsive to requests.	Several services hung while authentication was unavailable.	LDAP services were killed and restarted.
2017-11-02 09:00	2017-11-02 16:00	LSST	LSST had a GPFS server that was down and had failed over to the other server for NFS.	The GPFS clients failed over automatically, and we manually failed over the NFS in the morning.	NFS exports were moved to an independent server. IBM was at NCSA and is continuing to debug the problems.
2017-10-31 17:11	2017-11-01 11:13	LSST	GPFS degraded/outage	most NCSA-hosted LSST resources experienced degraded GPFS performance hosts with native mounts (PDAC) experienced an outage	A deadlock at 17:11 yesterday temporarily caused slow performance. Then one GPFS server went offline at 18:21 and services failed over. NFS mounts (qserv/sui) were reported as hanging by a user at 09:12 today but may have been degraded over night. Affected nodes were rebooted and NFS mounts recovered by 11:13. IBM is onsite diagnosing issues with the GPFS system and ordering repairs (including a network card on one server).
2017-10-31 15:30	2017-10-31 16:00	LSST	GPFS outage	most NCSA-hosted LSST resources native mounts (e.g., lsst-dev01, verify-worker*) and NFS mounts (e.g., PDAC)	All disks in the GPFS storage system went offline temporarily and came back online by themselves. NFS services were restarted. Client nodes all recovered their mounts on their own. Logs have been sent to the vendor for analysis.
2017-10-31 - 13:30	2017-10-31 - 14:30	CA Firewall Upgrade Part 1	the stand-by, "B" unit, NCSA Certificate Service Firewall will be upgraded and then traffic redirected through it for load testing before the "A" unit is upgraded	Expect no impact to services	Upgrade completed successfully. Some states were reset when traffic switched to the B unit.
2017-10-30 18:36	2017-10-31 00:46	LSST	GPFS outage	most NCSA-hosted LSST resources native mounts (e.g., lsst-dev01, verify-worker*) and NFS mounts (e.g., PDAC)	GPFS servers were rebooted. lsst-dev01 and most of the qserv-db nodes were also rebooted. Native GPFS and NFS mounts were recovered. May have been (unintentionally) caused by user processes but will continue to investigate.
2017-10-25 22:00	2017-10-26 11:20	LSST	full/partial GPFS outage	full outage for GPFS during 22:00 hour on 2017-10-25 outage for NFS sharing of GPFS (for qserv, sui) continued through the night full outage for GPFS recurred 2017-10-26 around 08:44	All GPFS services and mounts have been restored.
2017-10-26 09:04	2017-10-26 09:04	Various buildings across campus, including NPCF and NCSA	Issue with an Ameren line from Mahomet caused a bump/drop/surge in power that lasted 2ms	LSST had approximately 20 servers at both NPCF and NCSA buildings reboot	Was a momentary issue with minimal effect to most systems
2017-10-26 00:00	2017-10-26 08:00	ICCP	gpfs_scratch01 was filled by a very active user	Additional space in scratch wasn't available	Out of cadence purge was run to free 2TB, users jobs held in scheduler; user contacted
2017-10-25 06:00	2017-10-25 14:05	Blue Waters	Security Patching of CVE-2017-1000253 security vulnerability.	Restricted access to logins, scheduler and compute nodes. HPSS and IE nodes are not affected.	System was patched. Login hosts were made available at 9am. The full system was returned to service at 14:05.
2017-10-24 09:50	2017-10-24 20:10	LSST	Network outage / GPFS outage	All LSST nodes from NCSA 3003 (e.g., lsst-dev01/lsst-dev7) and NPCF (verify-worker, PDAC) that connect to GPFS (as GPFS or NFS) lost their connections. All LSST nodes at NPCF lost network during network stack troubleshooting and replacement of 3rd bad switch.	A 3rd bad switch was discovered and replaced. All nodes have network and GPFS connectivity once again.
2017-10-23 08:00	2017-10-24 05:00	Campus Cluster	Campus Cluster October maintenance.	Total outage of the cluster.	Replaced core ethernet switches from share services pod. Ran new ethernet cables for share services pod. Moved DES rack from share services pod to ethernet only pod. Deployed new image with patches.
2017-10-21 17:15	2017-10-23 17:45	LSST	First one then two public/protected network switches went down in racks N76, O76 at NPCF	Mostly qserv-db[11-20] and verify-worker[25-48]; there was also shorter outage for qserv-master01, qserv-dax01, qserv-db[01-10], all of SUI, and the rest of the verify-worker nodes.	Two temporary replacement switches were swapped in. Maintenance and/or longer-term replacement switches are being procured for the original switches.
2017-10-18 13:00	2017-10-18 14:00	Networking	Replaced a linecard in one of our core switches due to hardware failure.	Any downstream switches were routed through the other core switch.	All work was completed successfully.
2017-10-19 08:00	2017-10-19 21:30	LSST	Outage and migration of qserv-master01: provisioning of new hardware, copying of data from old server to new.	qserv-master01 (and any services that depend on qserv-master01, which may include services provided by qserv-db, qserv-dax01, and sui)	UPDATE (2017-10-19 15:15) OS install took much longer than anticipated, completed at 15:00. Data sync is started. Extending outage till 22:00. Completed
2017-10-19 08:00	2017-10-19 12:00	LSST	Routine patching and reboots, pfSense firmware updates (NPCF), Dell server firmware updates (NPCF).	All NCSA-hosted resources except for Nebula.	Maintenance completed successfully. (qserv-master migration is ongoing, see separate status entry)
2017-10-18 14:45	2017-10-18 15:35	Campus Cluster	Restart of resource manager failed after removing all block array jobs.	Job submission	Opened case with Adaptive (#25796). Found more array jobs and bad jobs in jobs directories. Removed all of those.
2017-10-15 08:15	2017-10-15 08:30	Open Source	Emergency upgrade of Atlassian Bamboo.	Bamboo will be down for a few minutes during this outage window.	Bamboo upgraded to the latest version.
2017-10-14 22:15	2017-10-14 23:35	Campus Cluster	Scheduler crash	Job submission	Opened case with Adaptive, ran diag and uploaded the output along with the core file. Restarted Moab.
2017-10-14 13:00	2017-10-14 15:23	Campus Cluster	Resource manager crash	Job submission	Applied patch from Adaptive, which helps with faster recovery. Suspended/blocked all current and new array jobs until we have a resolution.
2017-10-06 09:00	2017-10-11 01:00	Nebula	Gluster and network issues	1) Gluster sync issues continue from 2017-10-05's Nebula incident. 2) At approximately 2017-10-06 16:10, a Nebula networking issue (unrelated to the Gluster issues) occurred resulting in host network drops within the Nebula infrastructure. This internal networking incident resulted in additional gluster and iSCSI issues. Many instances are broken because iSCSI is broken from the Nebula network issues. And any instances that were broken because of gluster are still broken.	All instances have been restarted and are in a state for admins to run. Some mounted file systems might require an fsck to verify. If there are other issues please send a ticket. As the file system continues to heal we may see slower interaction.
2017-10-10 16:30	2017-10-10 19:10	Campus Cluster	Resource manager crash	Job submission	After removing problematic jobs from the queue, we were able to restart the RM. Opened the case with Adaptive and forwarded those job scripts and core files.
2017-10-05 14:00	2017-10-05 17:00	Nebula	Gluster sync issues	One of the gluster storage servers within Nebula had to be restarted.	Approximately 100 VM instances experienced IO issues and were restarted.
2017-10-06 08:00	2017-10-06 17:00	NCSA direct peering with ESnet	A fiber cut between Peoria and Bloomington caused our ESnet direct peering to go down.	All traffic that would have taken the ESnet peering rerouted through our other WAN peers. As such there were no reported outages of connectivity to resources that users would normally access via this peering	The fiber cut has been repaired and the peering has been re-established.
2017-10-06 08:00	2017-10-06 10:00	LSST	Kernel and package updates to address various security vulnerabilities, including the PIE kernel vulnerability described in CVE-2017-1000253. This will involve an upgrade to CentOS 7.4 and updates to GPFS client software on relevant nodes.	All NCSA-hosted LSST resources except for Nebula (incl. LSST-Dev, PDAC, and verification/batch nodes) will be patched and rebooted.	Maintenance completed successfully. Pending updates to a couple of management nodes (adm01 and repos01) and one Slurm node that is draining (verify-worker11).
2017-10-04 07:40	2017-10-04 09:55	Campus Cluster	Resource Manager crash	Job submission	Failure on initial restart attempt. After looking through the core, decided to try a restart again without any change. This time it worked.
2017-10-03 13:00	2017-10-03 19:00	Campus Cluster	Resource Manager crash	Job submission	After removing ~30 problematic jobs from the queue, we were able to restart the RM. Opened the case with Adaptive and forwarded those job scripts and core files.
2017-09-21 02:57	2017-09-21 09:40	Storage server (AFS, iSCSI, web, etc)	The parchment storage server stopped responding on the network.	Several websites were down, including the following: www.ncsa.illinois.edu, cybergis.illinois.edu, nationaldataservice.org, etc iSCSI storage mounted to fileserver went offline. Several AFS volumes, including some users' home directories were offline.	Replaced optical transceiver on the machine and networking restarted. Also updated kernel and AFS.
2017-09-20 08:00	2017-09-20 13:45	Campus Cluster	September Maintenance	Total cluster outage	Maintenance completed successfully.
2017-09-20 08:00	2017-09-20 11:30	NCSA Storage Condo	Normal maintenance: firmware upgrade on NetApps so new disk trays could be attached for DSIL	total file system outage	The quarterly maintenance was completed
2017-09-18 11:20	2017-09-18 13:30	Active Data Storage	RAID Failure in NSD server and disk failure on secondary NSD server.	ADS service was unavailable	Recovered RAID configuration on NSD server and replaced failed disk on secondary NSD. ADS restored.
2017-09-15 06:20	2017-09-15 09:28	public-linux	OpenAFS storage was not running or mounted after rebooting to a new kernel.	AFS storage was not available from this server	Reinstalled the dkms-openafs package and restarted the openafs-client. AFS is now working as expected.
2017-09-10 09:45	2017-09-10 11:30	NCSA Open Source	Upgrade of Bamboo, JIRA, Confluence, BitBucket, FishEye, Crowd	During the upgrade the services will be unavailable for a short amount of time.	All services upgraded successfully.
2017-08-31 11:07	2017-08-31 11:11	NCSA LDAP	NCSA LDAP Timeouts	NCSA LDAP was overloaded and timing out. Users were not able to authenticate via NCSA LDAP during that time.	NCSA LDAP stopped timing out at 11:11 am and authentication resumed.
2017-08-28 11:55	2017-08-28 12:59	NCSA GitLab	NCSA GitLab server ran out of disk space for the OS	The web interface at https://git.ncsa.illinois.edu wasn't working	Web interface is now working. Space freed up by clearing CrashPlan caches.
2017-08-24 13:00	2017-08-24 14:30	netact.ncsa.illinois.edu	Transient config issues from some system patching caused apache to not be able to start on the netact server	Network Activation	The issues were fixed and Network Activation is working again
2017-08-24 08:00	2017-08-24 15:30	LSST	Rack upgrades in NCSA 3003	Most LSST Developer services offline during upgrade	All LSST systems are back online with new racks and switches
2017-08-24 08:00	2017-08-24 09:30	LSST	monthly maintenance for NPCF (includes patching to address CESA-2017:1789 and CESA-2017:1793)	adm01, backup01, bastion01, monitor01, object, qserv, sui, verify-worker, test0*	Maintenance was successfully completed.
2017-08-23 09:21	2017-08-23 16:50	aForge/iForge	GPFS failed during an upgrade of GPFS on the iforge storage nodes. There was an IB hiccup at the time, but causality is unclear	all jobs on iforge were aborted, GPFS clients needed to be upgraded, all GPFS client nodes were rebooted	iForge went production shortly before 5:12pm. aforge went "production" at ~1630
2017-08-22 20:00	2017-08-22 30:00	Patching DHCP service	Patching OS and services on DHCP1.	Will need to reboot DHCP server a few times during this process. During the time dhcp will be unavailable. This is during the evening so I don't expect any direct issues from this.	Patching has been completed.
2017-08-16 08:00	2017-08-16 16:00	Campus Cluster	August Maintenance	Scheduler and resource manager down	Upgraded Moab 9.1.1 and Torque 6.1.1.
2017-08-16 08:00	2017-08-16 09:15	None	Replace Line Card in Core Switch	I believe all systems connected to this switch are multihomed and will not experience an outage.	The line card has been successfully replaced.
2017-08-16 00:30	2017-08-16 02:30	Blue Waters	Two cabinets (c10 & c11) had EPO due to XDP control valve failure.	Scheduler was paused to isolate failing parts, resumed at 2:09.	Parts replaced and cabinets were returned to service.
2017-08-08 7:00	2017-08-09 3:00	iforge/cfdforge/aforge	Update OS image to RH 6.9 Update GPFS to version 4.2.3-2 Redistribute power drops	All four clusters were updated.	All items on checklist completed. 20170808 Maintenance for iforge
2017-08-03 06:45	2017-08-03 07:35	NCSA Jabber upgrade	Upgraded Openfire XMPP jabber software	NCSA Jabber was unavailable during the upgrade.	Jabber was upgraded to the latest version of Openfire
2017-07-28 17:00	2017-07-31 evening		Update - All of the production data has been migrated except for the largest object table. That is loading now, then the user space will be loaded. Should all hopefully be done by this evening. Migration of operational database to new hardware happening during the weekend.	DES old operational database	Migration done successfully. Some other maintenance tasks that will give DES additional disk space were done, too, along with some performance improvements.
2017-07-27 11:00	2017-07-28 15:00	netact.ncsa.illinois.edu	The netact.ncsa.illinois.edu network activation server VM needed to be restored from backup	Network Activation service	The service has been fully restored
2017-07-25 02:36	2017-07-25 18:00	Campus Cluster / Scheduler down	Blip on mgmt1 causing GPFS drop and scheduler to crash	Scheduler offline	Still taking long time for Scheduler to initialize but jobs can start and run as usual. Opened case with Adaptive.
2017-07-20 09:00	2017-07-20 17:00	ROGER Ambari and OpenStack	Updates to OpenStack control node and the Ambari cluster	Ambari nodes (cg-hm08 - cg-hm18), OpenStack instances and servers	OpenStack was back in service on time. Ambari had issues mounting HDFS and was held out of service. HDFS was remounted on 25 July
2017-07-20 06:00	2017-07-20 10:00	All NCSA hosted LSST resources	Monthly OS patches (addressing issues including CESA-2017:1615 and CESA-2017:1680). Roll-out updated puppet modules. Batch nodes updated firmware.	All nodes in NCSA 3003 and NPCF (batch nodes) will reboot.	Overall success. Exceptions: verify-worker31 failed a firmware update and is out of commission (LSST-914) and there are connectivity issues for some VMs used by the NCSA DM team (IHS-365). adm01, backup01, and test[09-10] will be patched in the near future.
2017-07-19 08:00	2017-07-19 14:44	Campus Cluster	July Maintenance (applied security patch)	Cluster wide, except mwt2 nodes	Applied new kernel, glibc, bind patches and newest NVIDIA driver.
2017-06-29 1800	2017-06-30 0000	Blue Waters	Emergency maintenance to apply security patch addressing Stack Guard security vulnerability.	Compute, Login, Scheduler are offline.	Kernel and glibc library patched on all affected systems.
2017-06-22 0800	2017-06-22 1200	All NCSA hosted LSST resources	CRITICAL kernel and package updates to address Stack Guard Page security vulnerability.	Systems will be patched and rebooted.	Outage was extended to last past 1000 until 1200. Systems were successfully patched as planned except for qserv-db12 and qserv-db27, which will not boot. We will follow up on those with a ticket.
2017-06-22 0800	2017-06-22 0930	LSST cluster nodes (verify-worker, qserv, sui, bastion01, test, backup01)	Deploy Unbound (local caching DNS resolver)	DNS resolving may have a short (~30 mins) delay.	Successfully deployed and all tests (including reverse DNS and intra-cluster SSH) pass.
2017-06-20 0930	2017-06-20 1100	Bluewaters	XDP shutting down causing EPO on cabinet c1-7 and c2-7.	Scheduler was paused to isolate the failing components, then resumed.	Warmswap of failing components, and returned them to service.
2017-06-20 0900	2017-06-20 1000	NCSA Open Source	Security upgrade needed for Bamboo, will also update the following components: Bamboo, JIRA, Confluence, BitBucket, FishEye	Most of the subcomponents of NCSA opensource will be down for a short time when the software is updated.	Upgraded Bamboo, JIRA, Confluence, BitBucket, FishEye to latest versions
2017-06-16 0900	2017-06-16 1100	ROGER Openstack nfs backend failed and was restarted	The primary CES server for the openstack backend failed and tried to fail over to the secondary server, which also failed. SET was notified and they had the CES nfs service back up by 1100	The RoGER openstack dashboard went down and needed a restart. Several VM's experienced "virtual drive errors" and will need to be restarted	SET is still investigating the cause of the GPFS CES service failover. CyberGIS is working with their users to get the affected VM's restarted
2017-06-15 0800	2017-06-15 0930	LSST cluster nodes (verify-worker, qserv, sui, bastion01, test, backup01)	Deploy unbound	DNS resolving may have a short (~30 mins) delay.	Updates deployed successfully via new puppet module. All tests passed. EDIT 2017-06-15 1500 - Reverse DNS not working, which broke ssh to qserv* nodes. Disabled unbound.
6/14/2017 8:00 a.m.	6/14/2017 10:00 p.m.	Network Core Switch	Network Engineering will be replacing a line card in one of our Core switches due to hardware issue.	All services should remain active. Any affected switch will have a second redundant link to the other core to pass traffic.	Line card was successfully replaced.
2017-06-08 12:00	2017-06-11 22:20	Campus Cluster (scheduler paused)	Disk Enclosure 3 failure on DDN 10K.	Lost redundancy, which forced us to drain the cluster.	Repair/replacement for controller can be time consuming so we took action to rebalance data out of failed enclosure. Scheduler was resumed as of 22:00.
2017-06-07 12:07	2017-06-07 12:42	NCSA LDAP	The NCSA LDAP service crashed	NCSA LDAP service was unavailable	LDAP software and OS were updated and server rebooted. LDAP is working normally.
2017-05-31 20:06	2017-05-31 20:36	NCSA LDAP	The NCSA LDAP service was timing out	NCSA LDAP service was unavailable	The root cause of LDAP timeouts is still being investigated.
2017-05-22	2017-05-26	Campus Cluster VMs	Network issue on ESXi (hypervisor) boxes after maintenance	Could no longer log in to start VMs. License Server, Nagios, all MWT2 VMs were down	The issue was fixed on 5/24. Restored license and Nagios service on 5/24. Moved MWT2 VMs to Campus Farm. All VMs returned to service as of noon 5/26.
5/12/2017	5/18/2017	Condo/NFS partitions only	the NFS partition for the condo became extremely unstable after a replication (normal daily maintenance) was completed. Many iterations with FSCK and IBM on the phone got it resolved, and then 1.5 days restoring files that had been put in Lost and found.	UofI library was switched to the READONLY version on the ADS during this time	The root cause is still being investigated.
2017-05-23 14:05	2017-05-23 14:13	NCSA LDAP	The NCSA LDAP service was timing out	NCSA LDAP service was unavailable	The issue is still being investigated, but seems to be steadily available since the incident.
2017-05-22 15:41	2017-05-22 15:51	idp.ncsa.illinois.edu oa4mp.ncsa.illinois.edu	Apache Tomcat out of memory	InCommon/SAML IdP and OIDC authentication services were unavailable.	Service restored by failing over to secondary server while memory is being increased on primary server.
05/20/2017 21:09	05/20/2017 23:37	DES nodes on Campus Cluster	Could not communicate outside the switch	All nodes connected to switch in POD22 Rack2 @ACB	Upgrading the code on the switch resolved the issue.
05/20/2017 05:00	05/20/2017 21:09	Campus Cluster and Active Data Storage (ADS)	Total power outage at ACB	All systems currently reside at ACB	Power was restored around 13:00hrs. We rotated ADS rack to align with Campus Cluster Storage Rack. Changed couple of VLAN IDs to reflect campus for future merger. ESXI boxes are down due to a configuration error after reboot. No major issue from output of FSCK from scratch02.
05/17/2017 02:00	05/17/2017 10:45	Internet2 WAN connectivity	Intermittent WAN connectivity. The outage was a result of Tech Services' DWDM system, which provides us with our physical optical path up to Chicago via the ICCN. Specifically, the Adva card that our 100G wave is on was seeing strange errors, which was causing input framing errors for traffic coming in on this interface.	General WAN connectivity to XSEDE sites, certain commodity routes, and other I2 AL2S connections.	The Adva card was rebooted and we stopped seeing the input framing errors. Tech Services is working with Adva to find the root cause of the issues on the card.
5/11/2017	5/12/2017	ESnet 100G connection	NCSA and ESnet will be moving their 100G connection to a different location in Chicago.	We have several diverse high speed paths to ESnet and DOE, traffic will be redirected to a secondary path.
2017-05-11 06:45	2017-05-11 07:33	NCSA Jabber upgrade	Upgraded Openfire XMPP jabber software	NCSA Jabber was unavailable during the upgrade.	Jabber was upgraded to the latest version of Openfire
2017-05-09 07:00	2017-05-09 18:15	iForge, GPFS, License Servers	iForge Planned Maintenance	iForge systems, including the ability to submit/run jobs.	PM was completed early at 1815
2017-05-06 22:00	2017-05-06 23:00	NCSA Open Source	Upgrades of Atlassian software	NCSA Open Source BitBucket	BitBucket is upgraded.
2017-05-06 09:00	2017-05-06 10:00	NCSA Open Source	Upgrade of Atlassian Software	Most services hosted at NCSA Open Source were down for 5 minutes during rolling upgrades.	The following services were upgraded: HipChat, Bamboo, JIRA, Confluence, FishEye and CROWD.
2017-05-05 17:43	2017-05-05 20:02	ITS vSphere	A VM node panicked	Several VMs died when the node panicked and were restarted on other VM nodes. This included LDAP, JIRA, Help/RT, SMTP, Identity, and others.	All affected VMs were restarted on other VM nodes. Most restarted automatically.
2017-04-27 18:10	2017-04-27 18:55	Campus Cluster	Another GPFS interruption	Both Resource Manager and Scheduler went down along with a handful of compute nodes.	Restarted the RM and Scheduler and rebooted all down nodes.
2017-04-27 13:11	2017-04-27 14:20	Nebula	glusterfs crashed due to this bug, so no instances could access their filesystems	All instances running on Nebula	Needed to reboot the node that systems were mounting from, but took the opportunity to upgrade all gluster clients on other systems while waiting for a reboot. Version 3.10.1 fixes the bug. All instances with errors in their logs were restarted.
2017-04-27 11:20	2017-04-27 12:45	Campus Cluster	GPFS interruption	Both Resource Manager and Scheduler went down.	Torque serverdb file was corrupted. Restored the file from this morning's snapshot and modified the data to match the current state.
2017-04-26 12:00	2017-04-26 18:30	Condo	A bug in the deletion of a disk partition from GPFS (a problem within GPFS)	DES, Condo partitions, and UofI Library.	Partitions had been up for 274 days, with many changes. The delete partition bug caused us to stop ALL operations on the condo and repair each disk through GPFS. Must have quarterly maintenance. Just too complicated to go a year without resetting things.
2017-04-19 16:54	2017-04-20 08:45	gpfs01, iforge	Filled-up metadata disks on I/O servers caused failures on gpfs01.	iforge clusters, including all currently running jobs.	Scheduling on iForge was paused for the duration of the incident. Running jobs were killed. 13% metadata space was freed. Clusters were rebooted and scheduling resumed.
2017-04-19 08:00	2017-04-19 13:00	Campus Cluster	Merging xpacc data and /usr/local back to data01 (April PM)	Resource manager and Scheduler were unavailable during the maintenance.	Once again, /usr/local, /projects/xpacc and /home/<xpacc users> are mounting from data01. No more split cluster.
2017-04-04 (1330)	2017-04-04 (1600)	Networking	Some fiber cuts caused a routing loop inside one of the campus ISP's network.	Certain traffic that traversed this ISP would never make the final destination. Some DNS lookups would have also failed.	Campus was able to route around the problem, and the ISP also corrected their internal problem. The cut fiber was restored last night.
2017-03-28 (0000)	2017-03-29 (1600)	LSST	NPCF Chilled Water Outage	LSST - Slurm cluster nodes will be offline during the outage. All other LSST systems are expected to remain operational.	No issues. Slurm nodes restarted.
2017-03-28 (0000)	2017-03-29 (0230)	Blue Waters	NPCF Chilled Water Outage	Full system shutdown on Blue Waters (except Sonexion which is needed for fsck)	FSCK done on all lustre file systems, XDP piping works done (no leakage found), Software updates (PE, darshan) completed.
2017-03-25 10:15PM	2017-03-26 00:08AM	Blue Waters	BW scratch MDT failover, df hangs	BW scratch MDT failover, load on mds was 500+ delayed failover. Post FO had some issues that delayed RTS.	scheduler was paused
2017-03-25 4pm	2017-03-25 8pm	Blue Waters	BW login node ps hang	rebooted h1-h3, lost bw/h2ologin DNS record, had neteng recreate the record. Had to rotate logins in and out of round-robin until all rebooted. User email sent (2).	Login nodes rebooted; DNS round-robin changes
2017-03-23 (1000)	2017-03-23 (1500)	Nebula	NCSA Nebula Outage	Nebula will take an outage to balance and build a more stable setup for the file system. This will require a pause of all instances, and Horizon being unavailable.	File system online and stable. At this time all blocks were balanced and healed.
2017-03-16 (0630)	2017-03-16 (1130)	LSST	LSST monthly maintenance	GPFS filesystems will go offline for entire duration of outages. Some systems may be rebooted, especially those that mount one or more of the GPFS filesystems.
2017-03-15 15:11	2017-03-15 16:01	Blue Waters	Failure on cabinet c9-7, affecting HSN.	Filesystem hung for several minutes.	Scheduler was paused for 50 minutes. Warmswap cabinet c9-7. Nodes on c9-7 are reserved for further diagnosis.
2017-03-15 09:00	2017-03-15 12:47	Campus Cluster	UPS work at ACB. Reshuffling electrical drops on 10k controllers, storage IB switches and some servers.	Scheduler will be paused for regular jobs. MWT2 and DES will continue to run on their nodes.	UPS work at ACB - incomplete (required additional parts). Redistributing power work done. Scheduler was paused for 3hrs 50 mins.
2017-03-10 13:00	2017-03-10 18:00	Campus Cluster	ICCP - We lost 10K controllers due to some type of power disturbance at ACB.	ICCP - Lost all filesystems and it's a cluster wide outage.	Recovered missing LUNs and rebooted the cluster. Cluster was back in service at 18:00.
2017-03-09 0900	2017-03-09 1500	Roger	ROGER planned PM	batch, hadoop, data transfer services & Ambari	system out for 6hrs, DT services out until 0000
2017-03-08 19:41	2017-03-08 22:41	Blue Waters	XDP powered off that served the four cabinets (c16-10, c17-10, c18-10, c19-10).	scheduler paused, four racks power cycled. Moab required a restart; too many down nodes and iterations were stuck.	Scheduler paused three hours
2017-03-03 1700	2017-03-03 2200	Blue Waters	BW hpss emergency outage to clean up db2 database	ncsa#nearline, stores are failing with cache full	Resolved cache full errors
2017-02-28 1200	2017-02-28 1250	Campus Cluster	ICC Resource Manager down	User can't submit new jobs or start new jobs	Remove corrupted job file
2017-02-22 1615	2017-02-22 1815	Nebula	Nebula Gluster Issues	All Nebula instances paused while gluster repaired	Nebula is available.
2017-02-11 1900	2017-02-11 2359	NPCF	NPCF Power Hit	BW Lustre was down, xdp heat issues.	RTS 2017-02-11 2359
2017-02-15 0800	2017-02-15 1800	Campus Cluster	ICC Scheduled PM	Batch jobs and login nodes access
