status.ncsa.illinois.edu


Previous Outages or Maintenance

2022-06-28 0700 | 2022-06-28 1900 | Radiant | Unexpected complications during Radiant Maintenance

Minimally disruptive, brief interruptions to OpenStack services, such as the Horizon dashboard

Longer-than-expected outages of the controller service. Instances that had floating IPs had no network connectivity. The Horizon dashboard and API were down (could not launch new instances, etc.).

radiant-admin@ncsa.illinois.edu

RESOLVED

2022-06-11 1400 | 2022-06-14 1630 | Granite Tape Archive | FS was locked up due to a bug alert setting | Ingest or retrieval of data from the cluster | bdickin2@illinois.edu, slack-id: briandi

RESOLVED

2022-06-02 1800 | 2022-06-07 1830 | NCSA Wiki Service | Due to a critical security vulnerability announced by Atlassian, we have been forced to restrict access to the NCSA Wiki to NCSA internal networks. This restriction will remain in place until Atlassian is able to provide a patch or mitigation for the vulnerability. | No remote access is allowed to the NCSA Wiki. Use the NCSA VPN for remote access. More information about using the VPN can be found here: https://users.ncsa.illinois.edu/clausen/NCSA_VPN_instructions_202206.pdf | help@ncsa.illinois.edu

COMPLETE

2022-06-22 1430 | 2022-06-22 1900 | NCSA LDAP1 (replica is down) | LDAP1 database server has failed. The IAM team is investigating. | Only servers using ldap1 are affected; they should use ldap2 | tbouvet@illinois.edu

RESOLVED

2022-06-22 1430 | 2022-06-22 1600 | NCSA LDAP central replicas (ldap2-3) and any services that rely on them | LDAP database servers have failed. The IAM team is investigating. | Any service that relies on LDAP for user identification data, such as the internal web server and the Jira and Confluence servers, may be affected. | help@ncsa.illinois.edu

RESOLVED

 1700

 1830

Confluence (Wiki) | Patching to address a security flaw | Confluence will not be accessible | help+service@ncsa.illinois.edu

COMPLETE

2022-06-02 0600 | 2022-06-02 0615 | NCSA GitLab | GitLab was updated to latest version | All GitLab services were unavailable for a few minutes. | help+service@ncsa.illinois.edu

COMPLETE

 1700

 1900

Jira | Upgrade | Jira will not be available | help+service@illinois.edu

COMPLETE

2022-06-01 0900 | 2022-06-01 1015 | Facility UPS | Replace two batteries | All systems with UPS feed. The UPS will stay online supporting loads, but at reduced capacity; no outage is expected. | rantissi@illinois.edu

COMPLETE

2022-05-25 2230 | 2022-05-26 16:15 | Delta

3 HSN switches were experiencing problems

switches were updated and reconfigured

  • Slurm scheduler was paused to prevent new jobs from starting
  • Taiga remained unmounted
  • various nodes had no connectivity to the HSN
  • most services were experiencing some amount of degradation
help@ncsa.illinois.edu

COMPLETE

2022-05-25 1800 | 2022-05-25 2230 | Taiga - CenterWide FS | Partial outage. Some projects were asked to temporarily unmount /taiga | delta | Christopher Heller

COMPLETE

2022-05-18 0700 | 2022-05-18 1400 | Nightingale | Nightingale Planned Maintenance | All Nightingale Services | help@ncsa.illinois.edu

COMPLETE

2022-05-12 1700 | 2022-05-12 1800 | Jira & Wiki | Change to puppet configs | Downtime expected on each system for 1 to 5 minutes | help+service@ncsa.illinois.edu

COMPLETE

2022-05-10 0700 | 2022-05-10 1900 | iForge / vForge / license servers | Quarterly Planned Maintenance | All nodes and services will be unavailable | help@ncsa.illinois.edu

COMPLETE

2022-05-10 0800 | 2022-05-10 0815 | cilogon.org | Update to OA4MP v5.2.6 | Improvements in the back-end service | help@cilogon.org

COMPLETE

2022-05-09 1800 | 2022-05-09 2130 | NCSA File & Print Servers | Scheduled Windows Server Maintenance | File & Print Shares were unavailable during maintenance. Users were unable to access shares on Fileserver (e.g. home, busnoff, hr, etc.), and printing was unavailable. | help+service@ncsa.illinois.edu

COMPLETE

2022-05-04 1000 | 2022-05-04 1015 | IDDS Accounting Services | Planned Maintenance | All IDDS services (APIs, acctd, etc.) | help+idds@ncsa.illinois.edu, tolbert@illinois.edu

COMPLETE

2022-05-04 0600 | 2022-05-04 0622 | NCSA GitLab | GitLab was updated to latest version | All GitLab services were unavailable for a few minutes. | help+service@ncsa.illinois.edu

COMPLETE

2022-04-19 12:00 | 2022-04-19 12:01 | Radiant | Restarted the AMQP service to put in some performance changes | New instance or virtual network changes that were submitted during the five-second restart may have failed | radiant-admin@ncsa.illinois.edu

COMPLETE

2022-04-16 0600 | 2022-04-16 0630 | CILogon | Several cilogon.org services will be updated | https://cilogon.org, https://crl.cilogon.org, https://demo.cilogon.org, ldaps://ldap.cilogon.org | help@cilogon.org

COMPLETE

2022-04-14 2100

 0915

Jira | New tickets cannot be created due to the user license limit being reached | Creation of new tickets | https://www.ncsa.illinois.edu/expertise/user-services/user-support/

RESOLVED

2022-04-14 0800 | 2022-04-14 0830 | Wifi, VoIP, CCTV and FS networks at NCSA | Tech Services will be replacing their building router at NCSA. They expect a 10-minute outage. Services may see a temporary interruption as cables are being changed. | help+neteng@ncsa.illinois.edu

SCHEDULED

2022-04-09 0600 | 2022-04-09 0700 | Internet2 / ESnet WAN connections | During an outage of a few minutes, some of our WAN circuits will be migrated. Traffic will be automatically re-routed. | help+neteng@ncsa.illinois.edu

SCHEDULED

2022-03-17 0900 | 2022-04-12 1030 | Jira | LDAP auths have been sporadically failing. This service is being monitored to determine a root cause. | Jira logins break | help+service@ncsa.illinois.edu

RESOLVED

2022-04-12 0900 | 2022-04-12 0930 | vsphere.ncsa.illinois.edu | vCenter security updates are being installed | VM management interface will be unavailable for 15 mins. | help@ncsa.illinois.edu

COMPLETE

2022-04-07 1900 | 2022-04-07 1950 | NCSA VPN | Software Upgrades / SSL Certificate | The appliances hosting the NCSA VPN were patched and received an updated SSL certificate. Users will experience a brief disconnect as load is failed over between the appliances. | neteng@ncsa.illinois.edu

RESOLVED

2022-04-06 2200 | 2022-04-07 0000 | Some office ports on the second floor | One of the switches on the second floor is experiencing a software problem and is currently down. Code updates are being applied. | One of the six switches on the second floor is down. Users who are connected to this switch might not receive link. | help+neteng@ncsa.illinois.edu

RESOLVED

2022-04-06 1530 | 2022-04-07 0630 | All systems which mount/utilize Taiga | A bug involving the multirail functionality caused constant reboots of one of the metadata servers. This resulted in cluster destabilization and loss of function. | All Lustre/NFS mountpoints to Taiga, Globus to Taiga. | help@ncsa.illinois.edu

RESOLVED

2022-04-04 0930 | 2022-04-04 1000 | NCSA LDAP | Instantiation of Delta resource OU branch in the NCSA LDAP database with replication testing. | No impact to properly configured systems or searches is expected. | help@ncsa.illinois.edu

COMPLETE

2022-04-01 0600 | 2022-04-01 0700 | NCSA GitLab | GitLab was updated to latest version | All GitLab services were unavailable for a few minutes. | help+service@ncsa.illinois.edu

COMPLETE

2022-03-23 1000 | 2022-03-23 1600 | Email Lists | Email lists (lists.ncsa.illinois.edu) are not functioning

Ability to send to email lists.

Note: Bounced emails will need to be resent.

help+service@ncsa.illinois.edu

COMPLETE

2022-03-22 0730hrs | 2022-03-22 0915hrs | ldap - NCSA primary server | OS updates and replication changes | NCSA LDAP primary server will be unavailable; replicas should remain accessible

COMPLETE

2022-03-21 0800 | 2022-03-21 0830 | cilogon.org | Migrate CILogon Services to AWS | cilogon.org, demo.cilogon.org, crl.cilogon.org | help@cilogon.org

COMPLETE

2022-03-19 0100 | 2022-03-19 1500 | Campus Cluster | Cooling units at ACB stopped functioning, and temperatures in the datacenter rose high enough that machines powered off. By the time ICI was informed, cooling had resumed at ACB. ICI then restored service. | All of Campus Cluster | help@campuscluster.illinois.edu

RESOLVED

2022-03-17 1100 | 2022-03-17 1123 | ASD and ACHE vsphere clusters and ldap1 and ldap2 | Certs on ldap1 and ldap2 were updated | Logins to ASD and ACHE vsphere were down for 23 minutes. | help@ncsa.illinois.edu

COMPLETE

2022-03-17 09:08 | 2022-03-17 10:01 | Jira | Logins are slow or unsuccessful | Jira login

RESOLVED

2022-03-16 1700 | 2022-03-16 1800 | DNS1 | Hardware replacement on DNS1 server. | DNS lookups will be unavailable on the primary DNS server while the hardware is being swapped. DNS2 will remain up. | help+neteng@ncsa.illinois.edu

COMPLETE

2022-03-14 1800 | 2022-03-15 23:45 | NCSA File & Print Servers | Scheduled Windows Server Maintenance | File & Print Shares were unavailable during maintenance. Users were unable to access shares on Fileserver (e.g. home, busnoff, hr, etc.), and printing was unavailable. | help+service@ncsa.illinois.edu

COMPLETE

2022-03-10 0700hrs | 2022-03-10 1500hrs | Distribution panel DP-5C-1020. Power feed C to the north east corner power panels | De-energizing electrical distribution panel DP-5C-1020 to tie in power cables to Holl-I system

Known resources impacted:

Granite: already planned to be offline for maintenance

iForge: cluster offline for the duration

Radiant: cluster online, without power redundancy

help@ncsa.illinois.edu

COMPLETE


2022-03-09 0700 | 2022-03-09 0810 | linux.ncsa.illinois.edu (aka public-linux) | Upgrade server to RHEL 8 and add NCSA Duo 2FA authentication | Server was unavailable during maintenance. | help+service@ncsa.illinois.edu

COMPLETE

2022-03-02 0930 | 2022-03-07 1715 | ICC

Emergency PM

We are seeing some network issues on the cluster. To resolve these issues, we need to upgrade code on our InfiniBand infrastructure.


UPDATE: We are currently experiencing unforeseen technical issues with the cluster. We are investigating and expect resolution and restoration of all Campus Cluster services by March 3rd 12PM

UPDATE2: We are still experiencing issues where the compute clients will not properly mount storage. We are engaged with vendor support and continue to work on the situation. Thank you for your patience. We have moved expected return to service to March 4th, 12PM

UPDATE3: Campus Cluster is experiencing SLURM job failures in certain pods (sections) of the cluster. Investigations continue, and there is a partial return to service, with login nodes, storage, and data transfer services still operational. New full return-to-service date: Monday, March 7th, 12PM.

ICCP filesystem will be offline. Most projects will be impacted. Special arrangements have been made with some to be able to operate to some degree during the outage. | help@campuscluster.illinois.edu

COMPLETE

2022-03-02 1237 | 2022-03-02 1715 | iforge (iforge.ncsa.illinois.edu) | GPFS issue with interruption of filesystem leading to scheduler pause | 1 running job was aborted, and any new jobs were paused during the interruption | help@ncsa.illinois.edu

COMPLETE

2022-03-02 0600 | 2022-03-02 0630 | Jira | Adding RAM to improve performance | Jira will be unavailable during maintenance

COMPLETE

2022-03-01 1800 | 2022-03-01 1810 | ldap2 server, clients of NCSA LDAP | On-line maintenance: restart rsyslog and LDAP after relocating /var/logs | Clients should have redundant servers configured

COMPLETE

2022-02-28 1800 | 2022-02-28 1830 | ldap1 server, clients of NCSA LDAP | On-line maintenance: had to restart rsyslog and LDAP after relocating /var/log | Slow response from ldap1, but clients should have redundant servers configured

COMPLETE

2022-02-28 0900 | 2022-02-28 1030 | CMDB | V1.7.20220228 Release | CMDB database will be unavailable. ITSM's openDCIM will be down for a short period (~5 minutes) while the data is reloaded. | kimber7@illinois.edu

COMPLETE

2022-02-26 0730 | 2022-02-26 0750 | NCSA GitLab | GitLab was updated to latest version | All GitLab services were unavailable | help+service@ncsa.illinois.edu

COMPLETE

2022-02-25 10:00 | 2022-02-25 13:00 | Taiga - CenterWide FS | Full file system outage | All clients mounting Taiga

COMPLETE

2022-02-09 1400 | 2022-02-25 1030 | Jira, Internal/Savannah, LDAP, POP, Hosted web servers, virtual classroom, vcenter

The NCSA VMWare cluster is experiencing storage performance issues.

-- Update: Adjustments have been made to storage used by the LDAP servers and other non-essential VM instances have been disabled. Testing is indicating that response times have improved and services are working normally again.

We are monitoring services. Please report any issues to help@ncsa.illinois.edu

RESOLVED FOR NOW

2022-02-24 1000 | 2022-02-24 1115

cerberus2.ncsa.illinois.edu, tg-kdc1.security.ncsa.illinois.edu, bwbh2.ncsa.illinois.edu

One of the IRST ESXi machines unexpectedly shut down. | The listed hosts are currently unavailable

COMPLETE

2022-02-23 1700 | 2022-02-23 1900 | DNS2 | DNS2 hardware will be replaced. | There will be a brief outage of DNS2 while IPs are migrated to the new server. | help+neteng@ncsa.illinois.edu

COMPLETE

2022-02-22 0825 | 2022-02-22 1324 | Slack

Info from Slack (https://status.slack.com/)

We've resolved the issue, and all impacted customers should now be able to access Slack. You may need to reload Slack (Cmd/Ctrl + Shift + R) to see the fix on your end. If that doesn't work, try clearing cache (Help > Troubleshooting > Clear Cache and Restart from the app menu). Thanks for bearing with us and we apologize for the disruption to your work day!

Feb 22, 1:24 PM CST

We're seeing signs of improvement. Please try reloading Slack, and if not a cache reset. We’re still monitoring the situation. We’ll confirm once this issue is fully resolved.

Feb 22, 11:07 AM CST

Slack is not loading for some users. We are continuing to investigate the cause and will provide more information as soon as it's available.

Feb 22, 9:23 AM CST

We're still working towards a full resolution. We'll be back with another update soon. Thank you for your patience.

Feb 22, 8:44 AM CST

We’re investigating the issue where Slack is not loading for some users. We’re looking into the cause and will provide more information as soon as it's available.

Feb 22, 8:25 AM CST

Various issues accessing and using Slack | help@ncsa.illinois.edu

COMPLETE

2022-02-18 12:10PM | 2022-02-18 2PM | Jira | Reboot to add RAM/swap. This is to improve stability | Jira tickets unavailable

COMPLETE

2022-02-10 1030 | 2022-02-18 3:55pm | Ngale filesystem

The Lustre filesystem is not loading correctly. The support team has been contacted.

Still in progress. MDT0001 is partially recovered. Vendor is attempting to fully restore.

Near completion: Working with vendor on additional configuration changes. Hope to complete final validation and return to service by close of business 2022-02-18.

/ngale filesystem is not accessible. 

COMPLETE

2022-02-14 1PM | 2022-02-14 4:15PM | All NCSA LDAP servers | Expanding schema and restarting servers | Systems will reconnect to LDAP server after restart

COMPLETE

2022-02-09 1000 | 2022-02-09 1200 | Facility UPS | UPS DC voltage calibration | UPS will be taken to maintenance bypass and all connected systems will be fed from an unprotected power source (no power interruption). | rantissi@illinois.edu

COMPLETE

2022-02-09 0900 | 2022-02-09 0940 | Line card failure in Core-East | Line card failure in Core-East, which is resulting in connectivity issues for some infrastructure in NCSA 3003. | DNS2 and LSST systems in 3003 were down until the uplinks could be migrated to a new port on Cores | help+neteng@ncsa.illinois.edu

COMPLETE

2022-02-01 8AM | 2022-02-01 4PM | Jira/ldap-auth1 | login issues | Jira Access

2022-02-09 0534 | 2022-02-09 0811

LDAP (and dependent services, incl. Jira)

vSphere/ICI VMware

Authorization timeouts/failures in dependent services.

ICI staff are investigating.

LDAP (and dependent services, incl. Jira)

vSphere/ICI VMware

Cause of the most severe issues was power fluctuations around 0555, but certain LDAP servers showed degradation slightly earlier.


COMPLETE

2022-02-09 0600 | 2022-02-09 0645 | NCSA MySQL

MySQL database servers need to be synchronized to bring replicated database servers online.

NOTE: The MySQL database is back up, but users may experience issues due to an LDAP issue.

Wiki, JIRA, Savannah/Internal, Identity, and some web sites will stop working. More details are linked here.

help+service@ncsa.illinois.edu

COMPLETE

2022-02-08 7AM | 2022-02-08 3:15PM | iforge / vforge / license servers | Regular Maintenance | iforge, vforge, license servers

COMPLETE

2022-02-08 1000 | 2022-02-08 1245 | CMDB | V1.6.20220207 Release | CMDB database will be unavailable. ITSM's openDCIM will not be impacted. | kimber7@illinois.edu

COMPLETE

2022-02-04 0600 | 2022-02-04 0640 | NCSA GitLab | GitLab was updated to latest version | All GitLab services were unavailable | help+service@ncsa.illinois.edu

COMPLETE

2022-02-01 0800 | 2022-02-01 0900 | cilogon.org | Update to OA4MP v5.2.4 | Improvements in the back-end service | help@cilogon.org

COMPLETE

2022-01-25 | 2022-01-25 | Facility UPS | Replace UPS batteries | All systems with facility UPS feed | rantissi@illinois.edu

COMPLETE

2022-01-24 1800 | 2022-01-24 20:00 | NCSA File & Print Servers | Scheduled Windows Server Maintenance | File & Print Shares will be unavailable during maintenance. Users will not be able to access shares on Fileserver (e.g. home, busnoff, hr, etc.), and printing will be unavailable. | help+service@ncsa.illinois.edu

COMPLETE

2022-01-24 0400 | 2022-01-24 0630 | Failed line card on neo-hpc-1 switch

Line card failure is affecting devices that are plugged into the neo-hpc-1 aggregation switch. We've migrated links off the failed card to other ports on the same switch.

No services are currently impacted.

help+neteng@ncsa.illinois.edu

IN PROGRESS

2022-01-19 0800 | 2022-01-19 2000 | ICC | ICC Quarterly Maintenance | All ICC services

help@campuscluster.illinois.edu

COMPLETE

2022-01-18 0800 | 2022-01-18 0830 | cilogon.org | Upgrade MyProxy CA servers to CentOS 7 | Upgrade back-end MyProxy CA VMs from CentOS 6 to CentOS 7. No downtime is expected. | help@cilogon.org

COMPLETE

2022-01-14 0600 | 2022-01-14 1715 | Business IT database had bad data. | A database that NCSA mirrors from campus changed without notice, breaking our MIS system. Business IT isolated the issue and corrected the data. | Multiple complex systems have been affected by this data corruption issue. | help+service@ncsa.illinois.edu

RESOLVED

2022-01-14 0800 | 2022-01-14 1720 | NCSAnet wireless | NCSAnet Wireless was unavailable due to bad data in LDAP | Users couldn't connect to the NCSAnet wireless network | help+neteng@ncsa.illinois.edu

RESOLVED

2022-01-05 1100 | 2022-01-05 1145 | CMDB | Version V1.5.20211223 release | CMDB database will be unavailable for a few moments; openDCIM will be unavailable for a few moments. | kimber7@illinois.edu

COMPLETE

2021-12-20 1830 | 2021-12-20 2030 | Jira | Version Upgrade to address security issue | Jira will be unavailable | help+service@ncsa.illinois.edu

COMPLETE

2021-12-17 1300 | 2021-12-17 1340 | CMDB | Version V1.4.20211217 release | CMDB database will be unavailable for a few moments; openDCIM will not be affected. | kimber7@illinois.edu

COMPLETE

2021-12-17 0600 | 2021-12-17 0622 | NCSA GitLab | The server was updated with some new Puppet configurations. | GitLab services were unavailable for a few minutes as the SSL certificate for the service was updated. | help+service@ncsa.illinois.edu

COMPLETE

2021-12-16 1400 | 2021-12-16 1430 | HTTP web proxy: httpproxy.ncsa.illinois.edu | NCSA's general purpose HTTP web proxy server was rebuilt. | HTTP web proxying through httpproxy was unavailable. | help+service@ncsa.illinois.edu

COMPLETE

2021-12-10 0700 | 2021-12-10 1345 | iForge | InfiniBand switch maintenance | All systems unavailable | iforge-admin@lists.ncsa.illinois.edu

COMPLETE

2021-12-10 0900 | 2021-12-10 1000 | Bastion Hosts (Production group B) | Patching out of cycle | Bastion Hosts (Production group B) were individually unavailable during reboot | help+security@ncsa.illinois.edu

COMPLETE

2021-12-09 0900 | 2021-12-09 0931 | Bastion Hosts (Production group A) | Patching out of cycle | Bastion Hosts (Production group A) were individually unavailable during reboot

COMPLETE

2021-12-09 0800 | 2021-12-09 0900 | All IDDS services | IDDS Postgres and Ruby on Rails upgrades | All IDDS services | tolbert@illinois.edu

COMPLETE

2021-12-09 0600 | 2021-12-09 0613 | NCSA GitLab | GitLab was updated to latest version | All GitLab services were unavailable for about 5 minutes | help+service@ncsa.illinois.edu

COMPLETE

2021-12-07 1400 | 2021-12-07 1443 | LSST | Kubernetes on NTS is not working properly after updates | Kubernetes on NCSA Test Stand | lsst-admin@ncsa.illinois.edu

RESOLVED

2021-12-07 0800 | 2021-12-07 1400 | LSST | LSST Quarterly Maintenance | All LSST services hosted at NCSA | lsst-admin@ncsa.illinois.edu

COMPLETE

2021-12-07 0930 | 2021-12-07 1030 | ACHE Firewalls | Software maintenance | Firewalls will be upgraded using failover procedures - no traffic impact expected | James Eyrich - eyrich on slack

COMPLETE

2021-11-30 0900 | 2021-11-30 1100 | TechServices connectivity at NPCF (wireless, facilities, IRIS, Prox scanners). | Tech Services will be replacing 3 network devices at NPCF, which will impact a variety of services there. Along with sporadic wireless outages, some facilities networks (such as IRIS and card readers) will be offline while some equipment is replaced. The main router replacement should only take 5 mins or so. The wireless switches will take 15-20 mins each. | help+neteng@ncsa.illinois.edu

COMPLETE

2021-11-30 1800 | 2021-12-01 00:15 | NCSA File & Print Servers | Scheduled Windows Server Maintenance | File & Print Shares were unavailable during maintenance. Users were not able to access shares on Fileserver (e.g. home, busnoff, hr, etc.), and printing was unavailable. | help+service@ncsa.illinois.edu

COMPLETE

2021-11-19 12:52 | 2021-11-19 13:22 | lsst-esx08 | server crashed

The following VMs rebooted:

ldap-lsst-ncsa3
lsst-condordev-cm01
lsst-condordev-sub01
lsst-git
lsst-influxdb-0
lsst-kubh02
lsst-kubh05
lsst-kubh08
lsst-login03
lsst-logintest01
lsst-ora-dbm01
lsst-pup-npcf
lsst-ss-cfg02
lsst-telegraf-0

lsst-admin@ncsa.illinois.edu

RECOVERED

2021-11-18 1400 | 2021-11-18 1750 | ICI Metrics & Alerts | Migration to RHEL 8, ASD Puppet control, & CILogon authentication | The viewing of ICI dashboards and the firing of ICI alerts were unavailable during this migration | malone12@illinois.edu, bglick@illinois.edu

COMPLETE

2021-11-11 0925 | 2021-11-11 0940 | NCSA website | Communications launched the newly redesigned NCSA site. | During launch, you may experience some downtime while NCSA’s technical team re-points the URL to the new site. | communications@lists.ncsa.illinois.edu

COMPLETE

2021-11-09 0700 | 2021-11-09 1545 | iForge | Quarterly Maintenance | All systems unavailable | iforge-admin@lists.ncsa.illinois.edu

COMPLETE

2021-11-03 0000 | 2021-11-04 | Netdot SSL Certificate | The SSL certificate for Netdot expired and network engineering replaced it with a new one. | SSL certificate expired. Service remained available throughout the period | help+neteng@ncsa.illinois.edu

COMPLETE


2021-11-03 1100 | 2021-11-03 1400 | ESnet 100G link migration | ESnet engineers will be migrating NCSA's 100G link to the new ESnet6 infrastructure. The link will be down during the migration. Traffic will fall back to alternative paths. | help+neteng@ncsa.illinois.edu

COMPLETE

2021-11-03 1100 | 2021-11-03 1120 | NCSA GitLab | GitLab was updated to latest version. | All GitLab services were unavailable | help+service@ncsa.illinois.edu

COMPLETE

2021-11-03 1000 | 2021-11-03 1020 | Core Router Linecard Replacement | Neteng replaced a linecard in one of the core routers | All connections to this linecard are redundant and no outage has been reported. | neteng@ncsa.illinois.edu

COMPLETE

2021-11-02 15:20 | 2021-11-02 16:37 | Production version of DCIM for CMDB (https://ncsa-cmdb.ncsa.illinois.edu) | Invalid certificate issue (Fixed)
The production version of CMDB will be unavailable until new certificate is received and applied. 

In the interim, the test server (https://ncsa-cmdb-test.ncsa.illinois.edu) has been made available for use, with all current data.
Kimber Blum (kimber7@illinois.edu)

COMPLETE

2021-11-02 0800 | 2021-11-02 0900 | cilogon.org | Update to OA4MP v5.2.3 | Address several small issues in the back-end service | help@cilogon.org

COMPLETE

0600

0710

Jira | Jira Upgrade | Jira | help+service@illinois.edu

COMPLETE

2021-10-25 1800 | 2021-10-26 0018 | NCSA File & Print Servers | Scheduled Windows Server Maintenance | File & Print Shares will be unavailable during maintenance. Users will not be able to access shares on Fileserver (e.g. home, busnoff, hr, etc.), and printing will be unavailable. | help+service@ncsa.illinois.edu

COMPLETE

2021-10-20 0800 | 2021-10-20 1800 | ICCP

ICCP Quarterly Maintenance

  • VLAN Change for IPMI network
  • OS update
ICCP Cluster nodes only | help@campuscluster.illinois.edu

COMPLETE

2021-10-20 0700 | 2021-10-20 0715 | IDDS | IDDS maintenance (puppet changes) | All IDDS services | idds-admin@ncsa.illinois.edu

COMPLETE

2021-10-15 1230 | 2021-10-15 0713 | NCSA GitLab | Server ran out of disk space | All GitLab services were unavailable | help+service@ncsa.illinois.edu

RESOLVED

2021-10-11 0800 | 2021-10-11 1900 | Nightingale, ACHE | Planned maintenance on the Nightingale cluster and the ache-dist switch | There was an outage for the following services during the maintenance:
  • ALL Nightingale hosts/services
  • ALL firewalled traffic in/out of ACHE, which includes admin access & monitoring in/out of ALL of ACHE (this portion was complete by 1140)
    • network access to ALL of the ache-esxi-hosted VMs, including ache- and ngale-bastion hosts
    • ACHE FW IPMI interfaces
help+service@ncsa.illinois.edu

COMPLETE

2021-10-04 1000 | 2021-10-04 1005 | www.ncsa.illinois.edu per-user web directories | Per-user web directories on the main NCSA website are being redirected to a new website dedicated to per-user web directories. | URLs like www.ncsa.ncsa.illinois.edu/People/* are redirected to their new home at https://users.ncsa.illinois.edu/*. | help+service@ncsa.illinois.edu

COMPLETE

2021-09-30 0800 | 2021-09-30 1200 | LSST

LSST Quarterly Maintenance

  • OS updates
  • K8S updates
All LSST services hosted at NCSA | lsst-admin@ncsa.illinois.edu

COMPLETE

2021-09-29 0800 | 2021-09-29 0900 | cilogon.org | Update to OA4MP v5.2.2 | Update Java database libraries, and address several small issues | help@cilogon.org

COMPLETE

2021-09-29 0800 | 2021-09-29 0813 | CMDB / openDCIM | Installing/upgrading to CMDB release Sep2021 | The openDCIM front end of CMDB will be down for 15-30 minutes

COMPLETE

2021-09-28 0700 | 2021-09-28 1554 | NPCF work on facility power | De-energizing power to transformer TX-4C-1020, pulling and terminating busduct cabling from transformer to room 2020. | One third of Sonexion racks will lose source 1 power (Feed C) and will continue to operate on source 2, degrading reliability by losing power redundancy.

COMPLETE

2021-09-28 0700 | 2021-09-28 0900 | Blue Waters | A rack of scratch lost power during the power outage. | Scratch was partially unavailable due to TOR power resiliency issue.

COMPLETE

2021-09-28 0800 | 2021-09-28 0900 | idp.ncsa.illinois.edu | Assert eduPersonAssurance Cappuccino profile for NCSA Staff | NCSA Staff logging in with the NCSA Identity Provider will be able to get Silver CA certificates from cilogon.org | help+idp@ncsa.illinois.edu

COMPLETE

2021-09-21 14:50 | 2021-09-21 15:02 | vcenter appliance controlling ASD vsphere | vcenter appliance was upgraded | vsphere.ncsa.illinois.edu was off-line for 12 minutes. | help+service@ncsa.illinois.edu

COMPLETE

2021-09-21 0700 | 2021-09-20 1115 | Blue Waters | Power work caused non-redundant switches and misconfigured servers to shut off | Blue Waters Compute, Login and Scheduler | bw-admin@ncsa.illinois.edu

COMPLETE

2021-09-20 1800 | 2021-09-20 2130 | NCSA File & Print Servers | Scheduled Windows Server Maintenance | File & Print Shares were unavailable during maintenance. Users were not able to access shares on Fileserver (e.g. home, busnoff, hr, etc.), and printing was unavailable. | help+service@ncsa.illinois.edu

COMPLETE

2021-09-14 0000 | 2021-09-14 0600 | Internet2 WAN circuit | Internet2 will be migrating our WAN circuit to new hardware. Traffic over that path will reroute while the change happens. We anticipate the migration to take less than 30 mins. | help+neteng@ncsa.illinois.edu

COMPLETE

 0600

 0900

Wiki | Upgrade to next version | Wiki will be unavailable | help+service@ncsa.illinois.edu

COMPLETE

2021-09-09 0600 | 2021-09-09 0700 | NCSA VPN | Software Upgrades | The appliances hosting the NCSA VPN will be patched. Users will experience a brief disconnect as load is failed over between the appliances. | help+neteng@ncsa.illinois.edu

COMPLETE

2021-09-08 1300 | 2021-09-08 1400 | Group prod_b Bastion hosts | Out of cycle patching | Bastion hosts in group prod_b will be patched and rebooted. (see MOTD for group assignment) | help+security@ncsa.illinois.edu

COMPLETE

2021-09-08 0900 | 2021-09-08 1000 | Group prod_a Bastion hosts | Out of cycle patching | Bastion hosts in group prod_a will be patched and rebooted. (see MOTD for group assignment) | help+security@ncsa.illinois.edu

COMPLETE

2021-09-02 9:30 AM | 2021-09-02 1PM | PDU in rack AA81 | We are replacing a PDU in NPCF rack AA81 | All systems in the rack have redundant power connections. No service outages are expected from this work | help+service@ncsa.illinois.edu

COMPLETE

2021-09-01 0700 | 2021-09-01 0800 | cilogon.org | Update to OA4MP v5.2.1 | Device Authorization Grant Flow transactions will be stored in database rather than in memory | help@cilogon.org

COMPLETE

 1200

 1205

Wiki | Security patch is being applied | Wiki will be down | help+service@ncsa.illinois.edu

COMPLETE

2021-08-25 9:00am | 2021-08-25 6:45pm | Blue Waters | System reboot due to blade fallout coinciding with HSN reroute and SMW not recovering. | All jobs interrupted | jenos@illinois.edu

COMPLETE

2021-08-19 0538 | 2021-08-19 0700 | IRST systems hosted on IRST Node 2 | Storage controller failure, all VMs taken offline | Some prod_b systems, and non-redundant services. | eyrich@illinois.edu

RESOLVED

2021-08-19 5:34 | 2021-08-19 6:20 | cilogon.org | Storage controller failure in IRST VM farm | cilogon.org was unreachable until we initiated fail-over to our backup servers at NICS. | help@cilogon.org

COMPLETE

2021-08-18 1136 | 2021-08-18 1156 | NCSA Wiki | Test instance caused interference. | NCSA Wiki | help+service@ncsa.illinois.edu

COMPLETE

2021-08-17 0500 | 2021-08-17 0700 | NCSA/NPCF Wide Area Network | Between 5:00 AM and 7:00 AM CDT on 08/17/2021, Campus ICCN Engineers will be upgrading firmware on the ICCN router 710rtr at the Starlight facility in Chicago. | Our peerings with MREN and OmniPoP will go down. All traffic destined for those peerings will reroute via other peerings, so no production impact is expected. | help+neteng@ncsa.illinois.edu

COMPLETE

2021-08-16 18002021-08-17 0000NCSA File & Print ServersScheduled Windows Server MaintenanceFile & Print Shares will be unavailable during maintenance.  Users will not be able to access shares on Fileserver (e.g. home, busnoff, hr, etc.), and printing will be unavailable.help+service@ncsa.illinois.edu

COMPLETE

2021-08-12 9:542021-08-12 1012JiraAttempted snapshot of Jira in vSphere was too intensive for the systemJirahelp+service@illinois.edu

COMPLETE

2021-08-10
2000
2021-08-11
0000
Radiant API and Web access

Radiant cluster name change.During this time access to the API endpoints and the Horizon web dashboard will be intermittently unavailable.  Instances will continue to run and be available over the network with no interruptions.

radiant-admin@ncsa.illinois.edu

COMPLETE

2021-08-10 07:002021-08-10 17:10iForgeQuarterly MaintenanceAll systems unavailableiforge-admin@lists.ncsa.illinois.edu

COMPLETE

2021-08-09 14212021-08-09 1440NCSA WikiDB configuration conflict between Wiki & Wiki-TestNCSA Wiki was inaccessiblehelp+service@ncsa.illinois.edu

COMPLETE

2021-08-05 10002021-08-05 1030NPCF Core Router - Linecard RebootA problem was identified on one of the line cards in our core router requiring a reboot of the linecard. The linecard was successfully rebooted and we will continue monitoring the hardware for further issues.All connections to this linecard are redundant and there was no impact to users.neteng@ncsa.illinois.edu

COMPLETE

2021-08-05
0800
2021-08-05
1000
LSST

LSST Emergency OS Patching

LSST services hosted at NCSA except:

  • NTS will remain up (has already been patched)
lsst-admin@ncsa.illinois.edu

COMPLETE

2021-08-04
0800
2021-08-04
1700
Radiant API and Web access

Installation of new Radiant cluster

Cluster name changes are starting at 1100; This will make the horizon dashboard unreachable.
During this time access to the API endpoints and the Horizon web dashboard will be intermittently unavailable.  Instances will continue to run and be available over the network with no interruptions.

radiant-admin@ncsa.illinois.edu

COMPLETED

2021-08-04 07002021-08-04 0800cilogon.orgUpdate to OA4MP v5.2.0Added support for Device Authorization Grant Flow (RFC 8628)help@cilogon.org

COMPLETED

2021-08-03
0800
2021-08-03
1700
Radiant API and Web access

Installation of new Radiant cluster


During this time access to the API endpoints and the Horizon web dashboard will be intermittently unavailable.  Instances will continue to run and be available over the network with no interruptions.

radiant-admin@ncsa.illinois.edu

COMPLETED

2021-08-03 9:00 am2021-08-03 11:30 amRadiant ClusterA change was made to the firewall that unintentionally restricted access for instances and other internal cluster communication.Access to instances and workloadradiant-admin@ncsa.illinois.edu

RESOLVED

2021-07-31 06002021-07-31 0630CILogon hosted servicesInfrastructure maintenanceDuring this time each service hosted by CILogon including COmanage Registry, LDAP, Grouper, SAML proxy, and MDQ will become unavailable for a short time. Each individual service outage will last less than 5 minutes. Services that will not be impacted include: * OIDC clients that do not query LDAP for resolving attributes * X.509 certificate issuance and certificate revocation lists * LIGO and GW-Astronomy serviceshelp@cilogon.org

COMPLETE

2021-07-29 13002021-07-29 1400IRST-run bastion hosts (pool B)Security patchingHosts managed by IRST will be patched and rebooted. Only hosts in pool B will be patched at this timehelp+security@ncsa.illinois.edu

COMPLETE

2021-07-29 09002021-07-29 1000IRST-run bastion hosts (pool A)Security patchingHosts managed by IRST will be patched and rebooted. Only hosts in pool A will be patched at this timehelp+security@ncsa.illinois.edu

COMPLETE

2021-07-28 10002021-07-28 1050LSSTOS Updates on only NCSA Test Stand (NTS)Only the LSST NCSA Test Stand (NTS) services hosted at NCSAlsst-admin@ncsa.illinois.edu

COMPLETE

2021-07-27 06002021-07-27 0900JiraUpgradeJira will be unavailable

help+service@ncsa.illinois.edu

COMPLETE

2021-07-26 18002021-07-27 0000NCSA File & Print ServersScheduled Windows Server MaintenanceFile & Print Shares were unavailable during maintenance.  Users were not able to access shares on Fileserver (e.g. home, busnoff, hr, etc.), and printing was unavailable.help+service@ncsa.illinois.edu 

COMPLETE

2021-07-21
0800
2021-07-21
2900
ICCP

ICCP Quarterly Maintenance

  • TBD
All ICCP services

help@campuscluster.illinois.edu


COMPLETE

2021-07-21 15:242021-07-21 21:50ASD vSphere cluster in 3003One of the 4 hypervisors in the cluster panicked.  Unscheduled preventative maintenance is being performed on it and the other 3 nodes in the cluster.After the initial outage at 15:24, there should be no additional outages.help+service@ncsa.illinois.edu

COMPLETE

2021-07-13 07002021-07-13 0800cilogon.orgUpdate to OA4MP v5.1.4.The OAuth2/OIDC backend of the CILogon Service will be updated to OA4MP v5.1.4.help@cilogon.org

COMPLETE

2021-07-08 08002021-07-08 1000OpenAFSThe remaining OpenAFS database servers were upgraded.No service impacts were seenhelp+service@ncsa.illinois.edu

COMPLETE

2021-07-07 06002021-07-07 0800CILogon AWS Hosted ServicesUpgrading AWS RDS Aurora MySQL v5.6 to v5.7COmanage Registry and Grouper services hosted by CILogon will be unavailablehelp@cilogon.org

COMPLETE

2021-07-01

2140

2021-07-01

1430

Horizon dashboard access was down for the entire period. Cluster networking was down from 1200 to 1430.Investigations into Horizon dashboard accessibility issues resulted in the application of an incorrect default network gateway for the cluster around noon. This was corrected and networking functionality restored around 1400. Instances began recovering soon thereafter.Radiant admins believe running instances have recovered on their own but we advise everyone to check their systems and report any issues they see to the help desk.
help@ncsa.illinois.edu

RESOLVED

2021-07-01

0247

2021-07-01

1300

Various systems in NPCF, ACB, NCSA

There was a power event in the Champaign-Urbana area at around 2:47AM today. Details about the cause are currently unknown.  This event caused disruptions to systems at the NCSA building, NPCF and ACB. Known issues have generally been resolved but there may be unidentified issues lingering. If you encounter any problems, please notify NCSA help desk staff (help@ncsa.illinois.edu).

Multiple systems/services were impacted. All have been recovered and return to normal operations is complete.NCSA help desk

RESOLVED

2021-06-29 22:00

2021-06-29 23:59

NCSA 4th Floor Office networkRebooting one or more of the office switches on the NCSA Building 4th floor to resolve a phone issue.Office port connectivity will be intermittent during the maintenance window.

Matt Kollross

help+neteng@ncsa.illinois.edu

RESOLVED

2021-06-24
0800
2021-06-24
1345
LSST
  • Updates are being applied on Prod/Stable k8s, rebuild of some ingress nodes
Prod/Stable K8Slsst-admin@ncsa.illinois.edu

RESOLVED

2021-06-24
0800
2021-06-24
1200
LSST

LSST Quarterly Maintenance

  • OS updates on all servers

All LSST services hosted at NCSA

EXCEPT Prod/Stable K8S

lsst-admin@ncsa.illinois.edu

COMPLETE

2021-06-22 0000

2021-06-22 0400

Internet2 WAN linkInternet2 will be migrating NCSA's physical port to their new next generation infrastructure.During the maintenance, our I2 connection will be down.  Traffic will reroute to other connections.  Some point to point connections may be unavailable for a period of time.  The maintenance window is not expected to take all 4 hours.

Matt Kollross

help+neteng@ncsa.illinois.edu

COMPLETE

2021-06-21 18002021-06-22 0000NCSA File & Print ServersScheduled Windows Server MaintenanceFile & Print Shares were unavailable during maintenance.  Users were not able to access shares on Fileserver (e.g. home, busnoff, hr, etc.), and printing was unavailable.help+service@ncsa.illinois.edu 

COMPLETE

2021-06-17 07002021-06-17 0820OpenAFSThe OpenAFS database server kaskaskia was upgradedNo service outages were observed or reported.help+service@ncsa.illinois.edu

COMPLETE

2021-06-12 22002021-06-15 1500LSST FirewallThe NPCF secondary firewall was offline due to a hard drive failure.No impact occurred to production services as the primary firewall stayed online.

RESOLVED

2021-06-14 17002021-06-15 0958NCSA GitLabAttempt to fix an authentication bug for a particular user accidentally broke all authentication through the web interface,Authentication through the web interface did not work.help+service@ncsa.illinois.edu

RESOLVED

2021-06-112021-06-11 0905NCSA JiraJira email problemJira is not accepting issues via email. You can still create issues directly via the Jira GUI

RESOLVED

2021-06-10 07002021-06-10 0800cilogon.orgUpdate to OA4MP v5.1.3.The OAuth2/OIDC backend of the CILogon Service will be updated to OA4MP v5.1.3.help@cilogon.org

COMPLETE

 1000

 1030

Jira.ncsa.illinois.eduConfiguration change to address a vulnerabilityThere should not be any service interruption, but as with all things, it is possiblehelp+service@ncsa.illinois.edu

RESOLVED

2021-06-022021-06-02NetdotNetdot web access now requires 2FA via SSL VPN, or Cerberus proxy. Security requested that Netdot require 2FA, in order to access the web interface.  To accommodate that request, the Netdot firewall has limited web access to the VPN subnet or via proxy from the Cerberus jump hosts. 

Matt Kollross

help+neteng@ncsa.illinois.edu

RESOLVED

2021-05-252021-05-26vcenters for ache and ASDemergency security updates were applied.the administrative interface was off-line for about 20 minutes as the updates were installed.help+service@ncsa.illinois.edu

RESOLVED

2021-05-26

1000

2021-05-26

1030

VoIP phones at NPCFMigrating the VoIP networks to a campus IP to enable future migrations by tech services.After the networks are migrated, a reboot of all phones at the NPCF building will be performed.

Matt Kollross

neteng+help@ncsa.illinois.edu

RESOLVED

2021-05-21

1800

2021-05-21

1900

VoIP phones at the NCSA buildingMigrating the VoIP networks to a campus IP to enable future migrations by tech services.After the networks are migrated, a reboot of all phones at the NCSA building will be performed.

Matt Kollross

neteng+help@ncsa.illinois.edu

RESOLVED

2021-05-20 05:402021-05-20 08:45LSST

ESXi host outage causing degradation of select services.


Degradation of select services:

  • data backbone gateway (lsst-dbb-gw01 down)
  • HTCondor (Central Manager nodes down for Prod & DAC)
  • login (lsst-login01 is down)

Also loss of redundancy for some underlying services, including auth/access & k8s head nodes.

lsst-admin@ncsa.illinois.edu

RESOLVED


2021-05-15
0600
2021-05-15
0800
CILogon hosted services including COmanage Registry, LDAP, SAML proxy, SAML AA, MDQMaintenanceAll CILogon hosted services were temporarily unavailable.help@cilogon.org

COMPLETE

2021-05-12 07:00

2021-05-12 08:00

internal.ncsa.illinois.edu

NCSA Internal Web Server Upgrade
(aka Savannah or MIS Tools)
Updates were made that affected the availability of the NCSA internal website and Savannah system. The system was unavailable during this time.

help+service@ncsa.illinois.edu

COMPLETE

2021-05-11

07:00

2021-05-11

19:00

iForgeQuarterly MaintenanceAll systems unavailable

iforge-admin@lists.ncsa.illinois.edu

COMPLETE

2021-05-06 09002021-05-06 0945WAN Link MigrationNCSA Neteng migrated the WAN link to Internet 2 to new hardware.

Traffic was automatically re-routed to redundant paths during the link outage. Any connections relying on layer-2 connections over AL2S saw a brief blip as the connection was cut over. Affected parties were contacted in advance.

help+neteng@ncsa.illinois.edu

COMPLETE

2021-05-03
0600
2021-05-03
0630
CILogon Multi-tenant COmanage RegistryUpgrade to version 3.3.2The service at https://registry.cilogon.org  was unavailablehelp@cilogon.org

COMPLETE

2021-04-29 16002021-04-29 1700
  • HTCondor Prod
  • HTcondor DAC
Add new nodes into Condor service pools
  • HTCondor Prod
  • HTcondor DAC
lsst-admin@ncsa.illinois.edu

COMPLETE

2021-04-21 08:002021-04-21 20:00ICCPICCP Quarterly MaintenanceThe scheduler will be down.  All compute nodes will be converted to rhel7.9 with RedHat IB.

iccp-admins@campuscluster.illinois.edu

COMPLETE

2021-04-15 16002021-04-15 1700NCSA OpensourceUpgrade of OS on all machines related to opensourcejira, wiki, git etc hosted at https://opensource.ncsa.illinois.edu/kooper@illinois.edu

COMPLETE

2021-04-15

12:25

2021-04-15

14:45

ICI vmware

Several hosts on the vmware service were experiencing timeouts

  • bluewaters
  • bluewaters-test
  • internal
  • its-nagios
  • ldap1
  • vcenter
no or intermittent connectivity to these hostshelp+service@ncsa.illinois.edu

RESOLVED

Root cause is still being investigated.

2021-04-15
0900
2021-04-15
0942
CMDBApplying new certificates and restarting servicesCMDB, including web interface, will be down briefly during the update.ncsagroup+org_itsm@ncsa.illinois.edu

RESOLVED

2021-04-15 09002021-04-15 0920WAN Link MigrationNCSA Neteng migrated the WAN link to ESnet to new hardware.Traffic was automatically re-routed to redundant paths during the link outage.help+neteng@ncsa.illinois.edu

RESOLVED

2021-04-14 15:002021-04-14 15:00git.ncsa.illinois.eduUsers can no longer access repositories from git clients over HTTPS using their NCSA password.

NCSA passwords cannot be used to access repositories with Git clients. Instead, use SSH keys over SSH or personal access tokens over HTTPS.

We thought this went into effect during git changes on Nov 2, 2020 but discovered it was still working until we made changes to GitLab to fully remove LDAP functionality.

help+service@ncsa.illinois.edu

COMPLETE

2021-04-13 14152021-04-13 1845git.ncsa.illinois.eduThe GitLab website at git.ncsa.illinois.edu was having issues with authentication. The LDAP server that it uses was timing out.
  • Login to the Git web interface was timing out.
  • Access from git clients continued to work during the outage.
help+service@ncsa.illinois.edu

RESOLVED

2021-04-13 0800

2021-04-13 0830

cilogon.orgUpdate to OA4MP v5.1.1.The OAuth2/OIDC backend of the CILogon Service will be updated to OA4MP v5.1.1.help@cilogon.org

COMPLETE

2021-04-12 18002021-04-12 2245File & Print ServersMonthly Windows File & Print Server MaintenanceWindows File Shares such as HR, Business Office, Home, etc. and printing in the NCSA & NPCF buildings were unavailable.help+service@ncsa.illinois.edu 

COMPLETE

2021-04-10
0600
2021-04-10
0800
CILogon hosted COmanage, Grouper, SATOSA, LDAPOn Saturday, April 10, the CILogon team will perform maintenance on the infrastructure used for hosted services.As part of the maintenance all COmanage Registry, LDAP, Grouper, SAML proxy, SAML attribute authority, and MDQ services hosted by CILogon may experience brief outages. We do not expect that any specific service outage will last for more than a minute.help@cilogon.org

COMPLETE

2021-04-08 09002021-04-08 1045WAN Link MigrationNCSA Neteng migrated the WAN link to ICCN Node-1 to new hardware.Traffic was automatically re-routed to redundant paths during the link outage. Issues were noticed by users during the outage and are currently being investigated in cooperation with our upstream provider.help+neteng@ncsa.illinois.edu

COMPLETE

2021-04-08 07302021-04-08 0734NCSA WikiNCSA's Wiki service was restartedNCSA's Wiki service was restarted to apply a new SSL certificate and renewed Confluence license. The wiki was not available for 4 minutes while it reloaded.help+service@ncsa.illinois.edu 

COMPLETE

2021-04-07 1610

2021-04-07 1733Internal Savannah/MIS websiteThe Savannah/MIS website would not load due to a corrupted MySQL database table referenced across all of the Savannah tools.Internal/Savannahhelp+service@ncsa.illinois.edu

RESOLVED

1st report 7:30am Monday8:19am MondayNCSA LDAP2ldap2 is not responsive to authentication requestsNCSA Jira, any systems using LDAP2 as their only source.help+service@ncsa.illinois.edu

RESOLVED

2021-03-30

0800

2021-03-30

0845

DNS1A software issue was causing BIND to fail. DNS was not able to resolve during the period of time.  DNS2 remained operational. neteng+help@ncsa.illinois.edu

RESOLVED

2021-03-23

2000

2021-03-23

2025

NCSA VPNThe standby VPN hardware was replaced and transitioned into the current VPN cluster. Failover went as expected and firmware was upgraded on the primary after load was shifted to the new standby VPN.Failover between the appliances occurred without issue and there was no impact to users.neteng@ncsa.illinois.edu

RESOLVED

2021-03-18 12301255JiraSome functionality will be limited due to user limit being reachedJirahelp+service@ncsa.illinois.edu

RESOLVED

~16:4017:58AnyConnect VPN Service

An issue with SSL on the VPN service has caused an issue that has disconnected all users. Network engineering is looking into the issue.


Due to a hardware failure and the VPN not failing over properly to the standby, users were unable to connect to the VPN. This was due to an issue with syncing certificates.

During the outage, expect that you won't be able to connect/maintain a connection to the VPNhelp+neteng@ncsa.illinois.edu

RESOLVED

2021-03-16 09502021-03-16 1000CMDBWill be applying updates per security vettingCMDB, including web interface, will be down briefly during the update.ncsagroup+org_itsm@ncsa.illinois.edu

RESOLVED

2021-03-11
0900

2021-03-11
0930

WAN Link MigrationNCSA Neteng migrated the link to ICCN to new hardware.Traffic was automatically re-routed to redundant paths during the link outage.help+neteng@ncsa.illinois.edu

RESOLVED

2021-03-04
0900

2021-03-04
0905

WAN Link MigrationNCSA Neteng migrated the 100G link to MREN to new hardware.Traffic was automatically re-routed to redundant paths during the link outage.help+neteng@ncsa.illinois.edu

RESOLVED

2021-03-01 22:112021-03-01 22:47NCSA vSphereAbout 40 VMs lost connection to their NFS storage.Several VM-based services were timing out during the issue, including: vSphere management, a kerberos replica, a ldap replica, httpproxy, license servers, NCSA fileserver, Identity message queuing, monitoring. That triggered some of those VMs to switch to use read-only disk, needing to be rebooted later.service@ncsa.illinois.edu

RESOLVED

2021-02-25
0800
2021-02-25
1200
LSST

LSST Quarterly Maintenance

  • GPFS appliance UPS battery replacements (requires GPFS downtime)
  • OS updates
  • Kubernetes update from 1.17 to 1.18
All LSST services hosted at NCSAlsst-admin@ncsa.illinois.edu

COMPLETE

2021-02-
0900

2021-02-25
0915

WAN Link MigrationNCSA Neteng migrated the 100G link to CARNE to new hardware.The link to campus through the CARNE router was migrated to new hardware. Traffic was automatically re-routed to redundant paths during each link outage.help+neteng@ncsa.illinois.edu

COMPLETE

2021-2-18 2:30 pm2021-2-18 6pmvsphere.ncsa.illinois.edulogins were broken due to a cert caching issue on vsphere.login to the administrative interface is availableservice@ncsa.illinois.edu

COMPLETE

2021-02-18 2pm2021-02-18 2:30pmldap The certs on several ldap servers were set to expire tomorrow and next week, they were refreshed.ldap server certs were refreshed prior to their expirations.help+service@ncsa.illinois.edu

COMPLETE

2021-02-17. 10:00 a.m.2021-02-17. 10:30 a.m.Netdot maintpatchingNetdot may be unavailable during this time.help+neteng@ncsa.illinois.edu

COMPLETE

2021-02-16 08002021-02-16 1000cilogon.orgUpdate to OA4MP v5.1 had problems.Several clients reported issues with OA4MP 5.1, so we reverted to OA4MP 4.4.5 at noon.help@cilogon.org

CANCELED

2021-02-15 18002021-02-17 00:40 File & Print ServersMonthly Windows File & Print Server MaintenanceWindows File Shares such as HR, Business Office, Home, etc. and printing in the NCSA & NPCF buildings will be unavailable.help+service@ncsa.illinois.edu 

COMPLETE

2021-02-11 00:002021-02-11 04:00ICCP - MWT2

“OmniPoP is doing maintenance for their hardware refresh on February 11 (600 W Chicago) between midnight and 4 a.m.CST.  This will mean that the CARNE 100G OmniPoP connection will go down for a time during the Feb 11 window. Most of the traffic that would take this link will reroute to using other links. The only ICCP user that may be impacted is MWT2 because their primary path to UChicago is over this circuit.  However, we do have a backup UChicago peering over the CARNE Internet2 100G circuit, so that path will be taken assuming that UChicago's backup path doesn't go through the 6WC OmniPoP switch.  The tertiary path to UChicago would be through the ESnet LHCONE peering which goes over the CARNE I2 AL2S 100G. MWT2 shouldn’t need to do anything on their side to prepare for this work.”

All traffic should reroute during this maintenance, but MWT2 may experience brief connectivity issues to UChicagoneteng@ncsa.illinois.edu

COMPLETE

2021-02-09

07:00

2021-02-09

15:35

iforgeftp1public interface is downS3 connectioniforge-admin@lists.ncsa.illinois.edu

COMPLETE

2021-02-09

07:00

2021-02-09

17:00

iForge clusterQuarterly MaintenanceAll systems unavailableiforge-admin@lists.ncsa.illinois.edu

COMPLETE

2021-01-28
10:00am
2021-01-28
12:00pm
RadiantSecurity updates and API endpoint hardeningThe web interface to Radiant and the API interfaces will be unavailable during the maintenance period.radiant-admin@lists.ncsa.illinois.edu

COMPLETE

2021-01-27

8:00 am

2021-01-27

8:00 pm

Open Storage Network PODUpdating Ceph to containerized implementationSee Previous Columnbdickin2@illinois.edu

COMPLETE

2021-01-252021-01-26NCSA LDAPUsers with /bin/csh had their shells changed to /bin/bashUsers logging into systems that don't override the /bin/csh data already will find they are using /bin/bash when they login.help+service@ncsa.illinois.edu

COMPLETE

2021-01-26 13:452021-01-26 14:05administrative interface to vsphere.ncsa.illinois.eduadministrative interface to vsphere.ncsa.illinois.edu was upgraded to current patch levelAdministrative interfaces to VMs were unavailable for about 20 minutes.help+service@ncsa.illinois.edu

COMPLETE

2021-01-20 08:002021-01-20 20:00ICCP

ICCP Quarterly Maintenance

  • Replacing IB cards on 134 nodes (EDR to HDR)
  • Installing additional PDU in POD19 Rack5
  • Redistributing power from WallPanel C3A3
  • New image with GPFS 5.1.0.1
  • Clean up IB cables from POD19 Rack[1,2 & 3]
Cluster-wide outagehelp@campuscluster.illinois.edu

COMPLETE

2021-01-12 07:002021-01-12 8:00JIRA

JIRA Upgrade to 8.13.2

All JIRA usershelp+its@ncsa.illinois.edu

COMPLETE

2021-01-04 09:142021-01-04 12:10SlackSlack service issuesAll Slack systemshttps://status.slack.com/

OUTAGE

2020-12-21 18:002020-12-22 0700File & Print ServersMonthly Windows File & Print Server MaintenanceWindows File Shares such as HR, Business Office, Home, etc. and printing in the NCSA & NPCF buildings were unavailable.help+its@ncsa.illinois.edu 

COMPLETE

2020-12-182020-12-18NCSA LDAPNew user accounts will have their shell in ldap set to /bin/bashNew users will have /bin/bash as their default shellhelp+service@ncsa.illinois.edu

COMPLETE

2020-12-10
0800
2020-12-10
1400
LSST

LSST Quarterly Maintenance

  • Firmware/OS updates
  • Kubernetes/Docker updates
  • GPFS SSD firmware updates
All LSST services hosted at NCSAlsst-admin@ncsa.illinois.edu

COMPLETE

2020-12-09 08002020-12-10 1000HALHAL Quarterly MaintenanceHAL clusterhelp+isl@ncsa.illinois.edu

COMPLETED

2020-11-24 10:302020-11-24 17:35Blue Waters computeBoot RAID lost fiber channel connection for reasons not understoodFull system outagejenos@illinois.edu

COMPLETED

2020-11-19

11:00 a.m.

2020-11-19

11:30 a.m.

DNS1 MigrationNeteng will be migrating DNS1 to a new switch.  We need to physically move the cable to DNS1, which will cause a momentary outage for DNS queries to DNS1.  DNS2 will not be affected by the migration.  help+neteng@ncsa.illinois.edu

COMPLETED

2020-11-18

09:00

2020-11-17

17:00

UPS battery Monitor Program and configure the new BMS All UPS connected loadsrantissi@illinois.edu

COMPLETED

2020-11-16 18:002020-11-16 21:30NCSA File & Print ServersFile & Print servers were offline for scheduled maintenance.  Windows File Shares and printing were unavailable.Windows File Shares such as HR, Business Office, Home, etc. and printing in the NCSA & NPCF buildings were unavailable.help+its@ncsa.illinois.edu 

COMPLETED

2020-10-14 13:452020-10-14 14:04NCSA Wiki & JiraNCSA's Wiki & Jira servers were restarted.Wiki & Jira were offline while their servers rebooted.help+its@ncsa.illinois.edu

COMPLETED

2020-11-13 12:002020-11-13 18:00Software Directorate VM FarmFailing power supply of switch will be replaced, will use this to upgrade OS as well.NCSA OpenSource, INCORE, etc (all machines running on 141.142.277.X).kooper@illinois.edu

COMPLETED

2020-11-12 9:552020-11-12 10:15cilogon.orgUpdate to OA4MP v5.0.2 was unsuccessful.CILogon Service has been reverted to OA4MP 4.4.5.help@cilogon.org

DELAYED

2020-11-10
07:00
2020-11-10 21:05iForge cluster

Quarterly Maintenance

Switching to GPFS 5 formatted filesystem.

All iForge nodes.

iforge-admin@ncsa.illinois.edu

COMPLETE

2020-11-05 10:00

2020-11-05

12:24

NPCF enterprise UPSUPS maintenance, replace defective communication cardsAny rack (system) that is UPS power fed rantissi@illinois.edu

COMPLETE

2020-11-04 19:002020-11-04 21:00NCSA Building Router (2 of 2)Software UpdatesSoftware Updates will be applied to one of the NCSA building routers.  Traffic will fall back to the second router.  No network traffic should be affected.help+neteng@ncsa.illinois.edu

COMPLETED


2020-11-02 08:002020-11-02
10:20
NCSA GitLab
git.ncsa.illinois.edu
LDAP authentication was disabled for NCSA GitLab. Users of the GitLab web interface are required to authenticate to NCSA through CILogon.

NCSA passwords can no longer access repositories. Use GitLab personal access tokens to authenticate against Git over HTTPS.

help+its@ncsa.illinois.edu

COMPLETED
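For scripted access, a personal access token can stand in for the password in Git's HTTPS basic authentication. The following is a minimal sketch under stated assumptions, not an NCSA-provided tool: `https_clone_url` is a hypothetical helper, and the project path, username, and token shown in the usage note are placeholders. Only the host name `git.ncsa.illinois.edu` comes from the entry above.

```python
from urllib.parse import quote


def https_clone_url(host, project_path, username, token):
    """Compose an HTTPS clone URL that carries a personal access token as the password.

    The username and token are percent-encoded so that characters such as
    '/' or '@' cannot break the URL structure.
    """
    user = quote(username, safe="")
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@{host}/{project_path}.git"
```

For example, `https_clone_url("git.ncsa.illinois.edu", "group/project", "jdoe", "<token>")` yields a URL that can be passed to `git clone`. Because a URL containing a token can end up in shell history and in `.git/config`, a Git credential helper is usually the safer place to keep the token.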

2020-11-02 09:00

2020-11-02 09:05

NCSA DuoThe icon shown in the Duo app for NCSA will be updated to match the icon used in NCSA Slack.NCSA Duo App pushes will show updated icon to match NCSA Slack. May need to restart phone/app to see updated icon.help+duo@ncsa.illinois.edu

COMPLETED

2020-10-30
08:00
2020-10-30
08:45
SVN at subversion.ncsa.illinois.eduRetired SVN Service at subversion.ncsa.illinois.edu SVN is no longer available. NCSA users are encouraged to use one of our various Git repository options.help+its@ncsa.illinois.edu

COMPLETED

2020-10-29

9:00

2020-10-29

11:15

iForgemaintenance to switch to GPFS version 5.All nodes.

iforge-admin@ncsa.illinois.edu


COMPLETED

2020-10-28 19:002020-10-28 21:00NCSA Building Router (1 of 2)Software UpdatesSoftware Updates will be applied to one of the NCSA building routers.  Traffic will fall back to the second router.  No network traffic should be affected.help+neteng@ncsa.illinois.edu

COMPLETED

2020-10-27 22:00

2020-10-27

22:10

NCSA VPNFirmware Updates

Firmware updates were applied to the NCSA VPN. Any AnyConnect VPN sessions were reset during the maintenance and users may need to reconnect. Any IPSEC sessions failed over to the standby unit and were not affected.

help+neteng@ncsa.illinois.edu

COMPLETED

2020-10-27 09:002020-10-27 09:15

idp.ncsa.illinois.edu

crl.ncsa.illinois.edu

  1. Upgrade Shibboleth IdP software from v3.4.7 to v4.0.1.
  2. Move IdP software from VM to Docker container.
  3. Change DNS entry for idp.ncsa.illinois.edu and crl.ncsa.illinois.edu to point to new Docker server.
The DNS CNAME entries for idp.ncsa.illinois.edu and crl.ncsa.illinois.edu will be changed from cilogon-web.ncsa.illinois.edu to shib-docker.security.ncsa.illinois.edu (141.142.149.33). NCSA Shib IdP v4.0.1 is currently up and running at 141.142.149.33.help+idp@ncsa.illinois.edu

COMPLETED
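A DNS change like the one in the entry above can be spot-checked from a client once it has propagated. The following is a minimal sketch, not the procedure the IdP team used: `check_resolution` and its injectable `resolver` parameter are hypothetical names introduced here, and the only values taken from the entry are the two host names and the address 141.142.149.33.

```python
import socket

# Values taken from the maintenance entry above.
EXPECTED_IP = "141.142.149.33"
HOSTNAMES = ["idp.ncsa.illinois.edu", "crl.ncsa.illinois.edu"]


def check_resolution(hostnames, expected_ip, resolver=socket.gethostbyname):
    """Return {hostname: bool} saying whether each name resolves to expected_ip.

    `resolver` is injectable (an assumption of this sketch) so the check can
    be exercised without network access. A name that fails to resolve is
    reported as False rather than raising.
    """
    results = {}
    for name in hostnames:
        try:
            results[name] = resolver(name) == expected_ip
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = False
    return results
```

Calling `check_resolution(HOSTNAMES, EXPECTED_IP)` performs live lookups through the system resolver, which follows the CNAME chain, so the result reflects whatever the local DNS cache currently holds.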

2020-10-22 11am2020-10-22 1pmsecurity.ncsa.illinois.edu and grid.ncsa.illinois.eduCert replacement issueSites were downcpitcel

RESOLVED

2020-10-21 09002020-10-21 1400WAN Link MigrationICCN Engineers will be migrating NCSA's 100G WAN links over to new optical cards.

Below is the timetable for the moves:

10:00am CARNE (Node 1) to I2 (710 N Lakeshore Dr)
10:30am NCSA (Node 2) to MREN (710 N Lakeshore Dr)

11:30am CARNE (Node 2) to OmniPop (600 West Chicago)
12:00pm NCSA (Node 1) to I2 (600 West Chicago)
12:15pm NCSA (Node 1) to ESNet (600 West Chicago)

Individual links will be migrated one at a time, each taking roughly 15-20 minutes to complete, leaving redundant paths operational. Traffic will automatically be re-routed to these redundant paths during each link outage.

There are exceptions where certain services won't failover in this way. In these cases, individual notifications have been sent out to affected parties.
help+neteng@ncsa.illinois.edu

COMPLETED

2020-10-13 08:002020-10-13 09:00CILogonUpdate to OA4MP v5.0 OAuth2/OIDC Libraries encountered issue with Syngenta IdP. Reverted to OA4MP v4.4.5. Will be addressed in future OA4MP update.https://cilogon.orghelp@cilogon.org

DELAYED

2020-10-13 06:002020-10-13 06:15CILogon COmanage Registry at https://registry.cilogon.orgService stack restart.COmanage Registry and LDAP directory for the multi-tenant services.help@cilogon.org

COMPLETED

2020-10-12 18:002020-10-12 21:30NCSA File & Print ServersMonthly Maintenance for Updates / Backup ChecksFile & Print Servers were unavailable.  Printing was offline, and fileserver shares were unavailable.help+its@ncsa.illinois.edu

COMPLETED

2020-10-01 06:002020-10-01 06:15NCSA VPNThe certificate for sslvpn.ncsa.illinois.edu was updated. The SSL certificate has been updated.neteng@ncsa.illinois.edu

COMPLETED

2020-09-28 21:002020-09-29 16:45Nebula Network card failed in network node, was replaced and network settings reconfigured All non-virtual networking services for Nebula instances (north/south traffic)nebula@ncsa.illinois.edu

COMPLETED

2020-09-23
19:00

2020-09-24
12:30

LSST

Monthly Maintenance:

  • GPFS version upgrade from 4.x to 5.x
  • Routine system OS and firmware updates
ALL LSST systemslsst-admin@ncsa.illinois.edu

COMPLETED

2020-09-222020-09-22CILogon multi-tenant COmanage RegistryUpgrade to version 3.3.0COmanage Registry service at https://registry.cilogon.orghelp@cilogon.org

COMPLETED

2020-09-14
18:00
2020-09-15
21:30
NCSA File & Print ServersWindows file and print servers were patched and unavailable during maintenance.Access to Fileserver (Business Office, HR, Home, and Swap shared drives) was unavailable, printing was unavailable.help+its@ncsa.illinois.edu 

COMPLETED

2020-09-08 08:002020-09-08 09:00CILogonUpdate CILogon OIDC Client Admin APIhttps://cilogon.orghelp@cilogon.org

COMPLETED

2020-09-02 0800

2020-09-02

1600

Core-EastSoftware UpgradesNo user impact expectedhelp+neteng@ncsa.illinois.edu

COMPLETED

2020-09-012020-09-01ldaps://ldap.cilogon.orgRestart of LDAP gateway service containers.All LDAP services operated by CILogon.help@cilogon.org

COMPLETED


Start | End | What System/Service was affected? | What happened? | What was affected? | Contact Person | Status

2020-08-28 1425

2020-08-28 1725NCSA Email from Campus Exchange

Campus Exchange could not deliver email addressed to "@ncsa.illinois.edu" addresses. A user requested a change to Exchange that broke delivery to NCSA.

Any email you sent to "@ncsa.illinois.edu" addresses from campus Exchange will need to be resent by you.

Campus Exchange could not deliver email addressed to "@ncsa.illinois.edu" addresses.

The change in Exchange was reverted around 1655, but it may take a bit of time for Office365 to update all of its servers.

help+its@ncsa.illinois.edu

RESOLVED

2020-08-28 08002020-08-28 1000NPCF Networking DC Power SystemTesting and maintenance of the DC power system and battery backup was performed.No outage.help+neteng@ncsa.illinois.edu

COMPLETED

2020-08-26 0800

2020-08-26 1200

Core-WestReplacing failed internal board.  Software upgradesNo user impact expected.  help+neteng@ncsa.illinois.edu

COMPLETED

2020-08-17 08002020-08-22 19:00Code42 Crashplan EndpointOne of the primary Code42 Crashplan Endpoints crashed and was not accepting backup requestsCode42 Crashplan clients that were assigned to backup to this server were unable to run backups.crashplan@ncsa.illinois.edu

RESOLVED

2020-08-18

06:00

2020-08-18

06:20

NCSA VPN ServiceSoftware upgrade was completed successfully.All IPSEC sessions failed over successfully to the standby. Any users connected to the AnyConnect VPN were briefly disconnected and need to reconnect. An upgrade of the AnyConnect client was included with this upgrade and users will receive the upgrade when they reconnect to the VPN.help+neteng@ncsa.illinois.edu

COMPLETED

2020-08-18
06:00
2020-08-18
06:15
registry.cilogon.orgCILogonPerson LDAP schema plugin was removed.The COmanage Registry service.help@cilogon.org

COMPLETED

2020-08-11 07:002020-08-11 18:30iForgeQuarterly MaintenanceAll systems will be unavailable during the maintenanceiforge-admin@ncsa.illinois.edu

COMPLETED

2020-08-04 08:002020-08-04 08:30CILogonhttps://go.ncsa.illinois.edu/CILogonServiceUpdate2020-08-04Remove <whitelisted> tag from idplist.xml file. Add "?initialidp=..." query parameter.help@cilogon.org

COMPLETED

2020-08-03 18:002020-08-03 23:00

Fileserver & Printing

Monthly Windows Server MaintenanceNCSA Fileserver(s) and NCSA-Print were unavailable during the maintenance.   Business Office, HR, home, swap, etc. shares were unavailable.  Printing was unavailable.help+its@ncsa.illinois.edu

COMPLETED

2020-07-28
06:00
2020-07-28
06:05
registry.cilogon.orgOA4MP plugin for creating/managing OIDC clients was updatedThe COmanage Registry service was unavailable during the outage.skoranda@illinois.edu

COMPLETED

2020-07-24 07002020-07-24 0707

NCSA Wiki

Applied some minor configuration adjustmentsPages were unavailable for 5 minutes as the service restarted.

help+its@ncsa.illinois.edu

COMPLETED

2020-07-22 08002020-07-22 0823git.ncsa.illinois.eduGitLab was upgraded to the latest versionThe git service was intermittently unavailable during this upgrade window.

help+its@ncsa.illinois.edu

COMPLETED

2020-07-18 05002020-07-18 0600NCSA WAN connectivity over Internet2 and ESnet links. Emergency network maintenance performed by ICCN (our WAN provider) Instability on WAN connectivity to Industry partners and research institutions using internet2 service provider network to connect to NCSA. General network connectivity will re-route accordingly. help+neteng@ncsa.illinois.edu

COMPLETED

2020-07-16 07002020-07-16 0800Core-WestA service alarm on Core-West switch requires a reset of an internal controller board.Traffic will be failed over to Core-East while performing the maint.  No impact expected.help+neteng@ncsa.illinois.edu

COMPLETED

2020-07-15 14002020-07-15 1730Storage CondoPERC RAID card failure in core serverCondo Services/GridFTP/NFSckerner@illinois.edu

COMPLETED

2020-07-07 18:002020-07-07 21:30

Fileserver & Printing

Monthly Windows Server MaintenanceNCSA Fileserver(s) and NCSA-Print were unavailable during the maintenance.   Business Office, HR, home, swap, etc. shares were unavailable.  Printing was unavailable.help+its@ncsa.illinois.edu

COMPLETED

2020-07-07 08002020-07-07 0900CILogonUpdate Admin Client APIhttps://cilogon.org . No downtime is expected.help@cilogon.org

COMPLETED

2020-06-24
1900

2020-06-25
1300

Kubernetes

updates completed 

at 1330

LSST

Monthly Maintenance:

  • OS updates and reboots
  • Other updates as needed
  • Firmware on GPFS appliance (fix network issues)
ALL LSST systemslsst-admin@ncsa.illinois.edu

COMPLETE

2020-06-17 06002020-06-17 1600Blue Waters scratch filesystemDisk failure during OST failover, both OSTs unavailable
14 drives offline, reassembly required
Some filesystem operations.

COMPLETED

2020-06-15 08:002020-06-15 11:50Software Directorate VM Farm
  • NCSA Open Source
  • INCORE
  • classtranscribe
  • ...
Upgrade of servers, including vm serversDuring this time all servers will be down, servers will be returned ASAP

COMPLETED

2020-06-10 09:002020-06-12 21:00hal.ncsa.illinois.eduQuarterly PMHAL System Servicedmu@illinois.edu

COMPLETE

2020-06-08 18:002020-06-09 07:00

Fileserver & Printing

Monthly Windows Server MaintenanceNCSA Fileserver(s) and NCSA-Print will be unavailable during the maintenance.   Business Office, HR, home, swap, etc. shares will be unavailable.  Printing will be unavailable.help+its@ncsa.illinois.edu

COMPLETE

2020-06-04 09:302020-06-04 14:46vsphere.ncsa.illinois.eduUpdated vCenter SSL certificate and trust chainManagement of VM's was unavailable while updating SSL certificateshelp+its@ncsa.illinois.edu

RESOLVED

2020-05-30 08:002020-06-01 12:00linux.ncsa.illinois.edu
public-linux.ncsa.illinois.edu
SSH password-based authentication was failing due to changes with intermediate certificates

SSH password-based authentication broke. Kerberos based auth continued to work.

help+its@ncsa.illinois.edu

RESOLVED

2020-5-28 07:002020-5-29 09:00NCSA Virtual ClassroomNodes were added, network reconfigured, and updates were applied.Student VMs were unavailablehelp+its@ncsa.illinois.edu

COMPLETE

2020-5-22 07:572020-5-22 14:00Blue Waters ComputeThe wrong cabinet was removed from the configuration, causing an unroutable configuration for the HSN.All compute, all running jobs

COMPLETE

2020-05-20 10:002020-05-20 14:00DNS1/2UpgradesDNS servers will be rebooted during this time.help+neteng@ncsa.illinois.edu

COMPLETE

2020-5-14 06:00

2020-5-14 08:00NCSA WikiUpgradeWiki pages will be unavailableswrights@illinois.edu

COMPLETE

2020-05-19 09:302020-05-19 10:40netact.ncsa.illinois.eduFixing a problem with apachenetact is down.help+neteng@ncsa.illinois.edu

COMPLETE

2020-05-12 07:002020-05-12 17:00iForgeQuarterly MaintenanceAll systems will be unavailable during the maintenanceiforge-admin@ncsa.illinois.edu

COMPLETE

2020-05-04 18:002020-05-04 22:30NCSA Fileservers & Print ServersMonthly ITS Windows Server MaintenanceFileserver Shares (HR, Business Office, Home, Swap, etc.) and shared printers on NCSA-Printhelp+its@ncsa.illinois.edu 

COMPLETE

2020-04-30 9:002020-04-30 11:00Systems connection to idds-prodITS will be updating firewall settings for idds-prod.No impact is expected, but users should contact help+idds if issues occur.help+idds@ncsa.illinois.edu

COMPLETE

2020-04-22  8:002020-04-22 9:00CILogonhttps://go.ncsa.illinois.edu/CILogonServiceUpdate2020-04-22CILogon will relax name and email attribute requirement for IdPs.help@cilogon.org

COMPLETE

2020-04-21  6:002020-04-21  6:05CILogonAWS CILogon COmanage updateHTTP (80/443) and LDAP (389/636) ports will be unavailablehelp@cilogon.org

COMPLETE

2020-04-15 9:252020-04-15 10:01Campus Cluster user portalLogin access via UIUC Shibboleth was not working, while Shibboleth configurations were updated in support of new Shibboleth versionLogin access via UIUC Shibboleth was not working.help+its@ncsa.illinois.edu

COMPLETE

2020-04-14 11:082020-04-14 11:38vsphere.ncsa.illinois.eduvCenter was upgraded.Management of VMs was unavailable for 30 minutes.help+its@ncsa.illinois.edu

COMPLETE

2020-04-14 10:002020-04-14 12:00CILogonhttps://go.ncsa.illinois.edu/CILogonServiceUpdate2020-04-14DELAYEDAn incompatibility in the OIDC "getcert" endpoint was discovered. The update has been delayed.help@cilogon.org

COMPLETE

2020-04-13 13:55
RSA Authentication Manager and SecurIDRSA Authentication Manager service was turned off.Authentication using NCSA RSA tokens is no longer supported.  If you are using RSA with other organizations it should continue to work.otp@ncsa.illinois.edu

COMPLETE

2020-04-13 0900

2020-04-13 0930All Globus Services currently using RSA authenticationRSA authentication will be changed to DUO authentication

All Globus Services currently using RSA authentication


help+globus@ncsa.illinois.edu

COMPLETE

2020-04-07 10:002020-04-07 13:00Cerberus Bastions, BWBH BastionsThese systems will be migrated from using RSA to DUO for their second factor.

SSH logins on the hosts:

cerberus{1,2}.ncsa.illinois.edu

bwbh{1,2}.ncsa.illinois.edu

help+security@ncsa.illinois.edu

COMPLETE

2020-04-01 18:002020-04-01 22:00Windows Server MaintenanceNCSA Windows File & Print Servers were unavailable.  Users were not able to access data on the fileserver or print to any printers in the building while maintenance was completed.NCSA File & Print Servershelp+its@ncsa.illinois.edu

COMPLETE

2020-04-01 9:02
RSA SecurID self-service portal, https://otp.ncsa.illinois.edu/ Portal was turned off.If you need to change your PIN or activate a new phone you won't be able to.otp@ncsa.illinois.edu

COMPLETE

2020-03-29 12:022020-03-29 18:41Blue Waters compute serviceHigh speed network out of serviceAll compute service, running jobs lost.

RESOLVED

2020-03-19 10:00 am2020-03-19 10:20 amNAPS ApplicationNAPS upgrade completeA set of planned changes including new features and improvements to existing ones were deployed to production.

Kimber Blum (kimber7@illinois.edu) or help+idds@ncsa.illinois.edu,

Alina Banerjee(alinab@illinois.edu)

COMPLETE

2020-3-16 8AM2020-3-16 4PMBlue Waters computeHSN issueCompute was rebooted

COMPLETE

2020-03-16 10:00 am2020-03-16 01:00 pmMain UPS/Critical PowerUPS annual maintenanceAll production areas (no intended power interruptions, just loss of UPS functionality during the work)rantissi@

COMPLETE

2020-03-11 15:002020-03-11 17:00netactUpgradesNetact will be down for system updateshelp+neteng@ncsa.illinois.edu

COMPLETE

2020-03-09 19:082020-03-10 02:00VMware vSphere infrastructure for BW, iForge, ICCvSphere data store failed

The NPCF data store failed. Optional NFS datastores are available to rebuild VMs.

VMs used by Industry, ICC and BW needed to be recovered and rebuilt.

help+its@ncsa.illinois.edu

RESOLVED

2020-03-10 10:002020-03-10 11:00CILogon, NCSA IdP, XSEDE IdPApache HTTPD SSL configuration change to require TLSv1.2 .https:// connections to cilogon.org, demo.cilogon.org, ecp.cilogon.org, idp.ncsa.illinois.edu, and idp.xsede.org will use TLSv1.2 exclusively. Older clients may be impacted.help@cilogon.org , help+idp@ncsa.illinois.edu

COMPLETE
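Clients can confirm ahead of a change like the TLSv1.2 requirement above that they negotiate TLS 1.2 or newer. The following is a minimal client-side sketch using Python's standard `ssl` module; `make_tls12_context` is a hypothetical helper name, and nothing here describes the server-side Apache configuration that was actually deployed.

```python
import ssl


def make_tls12_context():
    """Build a client context that refuses any protocol older than TLS 1.2.

    Certificate and host name verification stay enabled, as provided by
    ssl.create_default_context().
    """
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The context can be passed to `urllib.request.urlopen(url, context=make_tls12_context())`. A client whose TLS library cannot build such a context, or cannot complete a handshake with it, falls into the "older clients" category mentioned in the entry.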

2020-03-05

06:00

2020-03-05

06:55

NCSA VPN ServiceSoftware UpgradesAll IPSEC sessions were seamlessly failed over. Any users connected to the AnyConnect VPN were disconnected and need to reconnect.help+neteng@ncsa.illinois.edu

RESOLVED

2020-03-04 11:142020-03-04 11:21NCSA WikiA virtual CPU became disabled and triggered a rebootwiki.ncsa.illinois.edu was unavailable while it rebooted.help+its@ncsa.illinois.edu

RESOLVED

2020-03-03 17:002020-03-03 19:00DNS1/DNS2OS patching Both DNS servers will be patched and rebooted.  There may be some delays in DNS resolution during that time frame.help+neteng@ncsa.illinois.edu

COMPLETE

2020-03-03 11:002020-03-03 12:00NCSA-Print / PrintingSome users are experiencing issues with printingPrintinghelp+its@ncsa.illinois.edu

RESOLVED

2020-03-03 11:002020-03-03 11:03public-linux upgradeThe public-linux server was upgraded.public-linux.ncsa.illinois.edu hostname now redirects to the new linux.ncsa.illinois.edu replacement server.help+its@ncsa.illinois.edu

COMPLETE

2020-03-02 17:002020-03-02 22:30Windows Server MaintenanceWindows Servers such as Fileservers and Print Servers were upgraded/patched.NCSA Windows File & Print Servers were unavailable.  Users were not able to access data on the fileserver or print to any printers in the building until maintenance was completed.help+its@ncsa.illinois.edu 

COMPLETE

2020-02-27 08:002020-02-27 12:00LSST

Monthly Maintenance:

  • OS updates and reboots
  • Other updates as needed
  • note: Slurm compute nodes will be updated through rolling reboots beginning on 2020-02-28

ALL LSST systems will be updated, including:

  • TBD
lsst-admin@ncsa.illinois.edu

COMPLETE


2020-02-17 07:00 am2020-02-18 05:00 pmSelect clowder systems, users have been notifiedMigration from NCSA to AWSSelect clowder systemskooper@illinois.edu

COMPLETE

2020-02-18 06:002020-02-18 06:30CILogon COmanage RegistryCILogon COmanage Registry AWS infrastructure updatehttps://registry.cilogon.org ports 80 and 443 will be unavailable for approx. 5 minutes. LDAP ports 389 and 636 will not be affectedhelp@cilogon.org

COMPLETE

2020-02-17 18:002020-02-17 21:30NCSA LDAPThe primary LDAP server ran out of disk space, later causing intermittent outages with all LDAP replica servers.

All ITS managed LDAP servers, including:

help+its@ncsa.illinois.edu 

RESOLVED

2020-02-17 12:00 pm2020-02-17 3:00 pmOSN PodCeph UpdateAll OSN Pod servicesbdickin2@illinois.edu; sstevens@illinois.edu

COMPLETE

2020-02-14
10:00
2020-02-14
14:00
Wired networking in NCSA buildingSome users reported they were unable to connect to the internet through their wired network connection. Wireless remained fully operational.DHCP for NCSA building wired network.  help+neteng@ncsa.illinois.edu

Resolved

2020-02-11 07:002020-02-11 1635iForgeQuarterly MaintenanceAll systems will be unavailable during the maintenanceiforge-admin@ncsa.illinois.eduComplete

2020-02-04

10:00

2020-02-04 12:00CILogon upgradeCILogon Service web front-end Bootstrap upgrade (http://bit.ly/36BvG57)No downtime is expected.help@cilogon.orgPrimary production server upgraded. Secondary production server to be upgraded in a week.
2020-02-03 10:002020-02-03 10:05Systems connection to idds-prodITS will be updating firewall settings for idds-prod.No impact is expected, but users should contact help+idds if issues occur.

help+idds@ncsa.illinois.edu


Complete
2020-01-30 13:002020-01-30 13:05Systems using acctdIDDS will install triggers on the production database to support the new project-data message.There are no changes that need to be made to current acctd implementations. The only impact acctd users may notice is the presence of project-data messages in acctd logs.

help+idds@ncsa.illinois.edu


Complete
2020-01-30 8:002020-01-30 9:00LSST FirewallsFirewall upgradeNo impact is expected.  Traffic to/from 141.142.181.0/24 and 141.142.182.128/26 will be failed over from the primary firewall to the secondary firewall while the primary is upgraded, then failed back.  Traffic between these subnets and the LSST storage network does not traverse the firewall.help+security@ncsa.illinois.edu

COMPLETE
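To tell whether a given host sits in one of the two subnets named in the firewall entry above, the address can be tested against them directly. The following is a minimal sketch using Python's standard `ipaddress` module; `behind_lsst_firewall` is a hypothetical helper name, and the sample addresses in the usage note are illustrative, not real hosts.

```python
import ipaddress

# The two subnets named in the firewall upgrade entry above.
AFFECTED_SUBNETS = [
    ipaddress.ip_network("141.142.181.0/24"),
    ipaddress.ip_network("141.142.182.128/26"),
]


def behind_lsst_firewall(address):
    """Return True if `address` falls inside one of the affected subnets."""
    addr = ipaddress.ip_address(address)
    return any(addr in net for net in AFFECTED_SUBNETS)
```

For instance, `behind_lsst_firewall("141.142.182.130")` is true, while `behind_lsst_firewall("141.142.182.200")` is false because the /26 only spans 141.142.182.128 through 141.142.182.191.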

2020-01-28 17:002020-01-28 19:00DHCP UpgradeOS updatesDHCP server will be rebooted for all office and wireless networks.  All connected clients will not be affected.  Any new IP requests during the reboot will be delayed.  This shouldn't be impacting for most users.help+neteng@ncsa.illinois.eduCompleted
2020-01-28 17:002020-01-28 19:00Exit-East RouterOS updateMost traffic will be sent via our second router.  Some specific projects may be affected.  Neteng will talk to those projects directly.help+neteng@ncsa.illinois.eduCompleted
2020-01-27 11:542020-01-28 09:36oa4mp.ncsa.illinois.eduan automated CA certificate update caused authentication failuresNCSA RSA authentication to Globus was unavailablehelp+idp@ncsa.illinois.edutemporary work-around in place; proper fix scheduled for 2020-01-29 14:00
(note: oa4mp.ncsa.illinois.edu is scheduled for retirement on 2020-04-01)
2020-01-21 08152020-01-21 0825ldap2ldap2 was answering LDAP queries inconsistently, so the service was restarted.Login to certain services was unusually slow for some users, Jira most of all.help+its@ncsa.illinois.eduldap2 queries are working as expected after the restart.
2020-01-16 17302020-01-16 1748Condo NFS serviceNFS exports are failing path resolutionNFS file system client mountsChad KernerServers rebooted, mounts restored
2020-01-15 08:002020-01-16 01:55ICCP

Quarterly Maintenance

  • Golub IB Core switch FW update
  • Golub 10G Core switch FW update
  • GPFS 5.0.4.1 update
  • Moved golub Rack8 to accommodate expansion
Total outage including export nodes (access to HTC will still be available)iccp-admins@campuscluster.illinois.eduComplete
2020-01-15 07:002020-01-15 12:00LSST NCSA Test StandHardware repair in NCSA Test Stand

21 servers in the NCSA Test Stand had their drive backplanes replaced by the vendor.

lsst-admin@ncsa.illinois.edu

COMPLETE

2020-01-06

10:00

2020-01-08

14:30

Code42 Crashplan EndpointsThe Code42 Crashplan servers started pushing out Code42 Crashplan client updatesAll users of CrashPlan will have their clients upgraded.help+its@ncsa.illinois.eduComplete

2020-01-03 

10:20

2020-01-03 11:20Code42 Crashplan was upgradedSoftware updates to the CrashPlan Auth and Storage servers were appliedBackups were queued while the services restarted.help+its@ncsa.illinois.eduComplete
2020-01-02 11:302020-01-02 17:39NCSA ITS vSphere vCentervCenter was upgraded to the latest patch level. Due to some bugs it took longer to apply than expected.The VMware administrative interface was unavailable during the update.help+its@ncsa.illinois.eduComplete

2019-12-18

08:00

2019-12-18

10:00

Facility infrastructure  Electrical Transformer

TX-5C

Replace defective temperature controller "No Outage"  Production projects on feeder CMO Rantissi

Complete


2019-12-17 06:002019-12-17 10:25JIRAJIRA Upgrade from 7.6 to 8.5All JIRA usershelp+its@ncsa.illinois.edu

COMPLETE

2019-12-12 08:002019-12-12 14:00LSST

Monthly Maintenance:

  • OS updates and reboots
  • GPFS filesystem restructure

ALL LSST systems will be updated, including:

  • lsst-dev01, lsst-xfer, etc.
  • Slurm verification cluster
  • PDAC/Kubernetes/LSP clusters
  • tus-ats01
  • L1 test stand
lsst-admin@ncsa.illinois.edu

COMPLETE


2019-12-12 10:022019-12-12 10:08internal.ncsa.illinois.eduSystem memory was exhausted and OOM killer started killing https connections.Savanna tools were unavailablehelp+its@ncsa.illinois.eduMemory resources for the server were doubled and service was brought back online.
2019-12-10 13:452019-12-10 16:55Internet2 ConnectivityInternet2 Engineers isolated the issue to a malformed route update coming from an external peer to one of its nodes in Ashburn, VA. As this update was propagated throughout the Internet2 Network, it triggered a bug on the Internet2 routers and caused all internal BGP sessions of each router to rapidly flap, thus causing instability across the footprint. Engineers mitigated the issue by placing a filter on the specific peer to reject the malformed packet. The Major Incident has been resolved at this point.Many different external resources, data transfers, sessions, etc. to various destinations.help+neteng@ncsa.illinois.edu

Connectivity has stabilized. Please report any issues should they arise.

2019-12-22019-12-2 afternoonWireless network Tech Services reports they are having authentication issues affecting Wifi and VPN.  Engineers are working on the problem. Tech Services Issue Description.NCSAnet, IllinoisNet wireless are non functional at the moment. NCSA wired network remains available. IllinoisNet_guest is also functional. help+neteng@ncsa.illinois.eduTroubleshooting in progress
2019-11-14 18:002019-11-14 19:00Exit-West RouterSoftware UpgradesThis should not be user impactful.  All traffic will re-route via the other router.help+neteng@ncsa.illinois.edu

COMPLETE

2019-11-14 5:00 AM2019-11-14 3:30 PMNearline EndpointIssue with one storage librarySome Globus transfers were stalled for the period of the outagebw+storage@ncsa.illinois.edu

COMPLETE

Nov 7 10:00Nov 7 14:00ICCP.  All login nodes will be down.Reroute some IB cables between Core switches and compute nodes.  Changing topology on Subnet Manager.Scheduler will be paused. No user access to login nodes.  All running jobs will be killed.  help@campuscluster.illinois.edu

COMPLETE

2019-11-05 07:002019-11-05 16:53iForgeQuarterly MaintenanceAll systems will be unavailable during the maintenanceiforge-admin@ncsa.illinois.edu

COMPLETE

2019-10-12019-11-1NCSA Windows Domain ControllersITS Migrated all Windows Systems to using the Campus Domain.  The existing NCSA Windows Domain has been decommissioned and shutdown.NCSA Windows Systemshelp+its@ncsa.illinois.edu

COMPLETE

2019-10-23

8 a.m.

2019-10-23

12:00 p.m.

Core-West Code upgrades will be performed on Core-West network switch.This should not be user impacting.  All traffic will flow through the redundant Core.neteng+help@ncsa.illinois.edu

COMPLETE

2019-10-22 06:12

2019-10-22 07:18

Jira and WikiDuring reboots for system patches the wiki and Jira got stuck in a state that was not providing data to the users.Only web access to these tools was impacted.help+its@ncsa.illinois.edu

COMPLETE

2019-10-16 08:002019-10-16 20:30ICC system wideQuarterly maintenanceAll services on ICChelp@campuscluster.illinois.edu

COMPLETE

2019-10-16

8 a.m.

2019-10-16

12:00 p.m.

Core-East Code upgrades will be performed on Core-East network switch.This should not be user impacting.  All traffic will flow through the redundant Core.neteng+help@ncsa.illinois.edu

COMPLETE

2019-10-15 11:45am2019-10-15 11:56AM npcf-exit-east BGP peering flapped over I2 AL2S circuitTraffic got re-routed but some WAN services were impacted as reported by users. help+neteng@ncsa.illinois.edu

COMPLETE

2019-10-10 07:00

2019-10-10 07:30

mysql.ncsa.illinois.eduSome table repairs broke replication; this maintenance will update the replicas with newer databases so the service will work as expected again.Wiki, JIRA, and some web sites will stop working.  Email forwarding to user accounts at NCSA will be delayed during the outage.lindsey@ncsa.illinois.edu

COMPLETE

2019-10-01


2019-10-03NCSA-Print & Building Printers

Some printers are having issues connecting to the NCSA Print Server.  

After updating drivers on the print server, public printers are working as expected.

Printinghelp+its@ncsa.illinois.edu

COMPLETE

2019-10-03 6AM

2019-10-03

7:45AM

Jira and WikiDuring reboots for system patches the wiki and Jira got stuck in a state that was not providing data to the users.Only web access to these tools was impacted.help+its@ncsa.illinois.edu

COMPLETE

2019-10-01 7AM2019-10-01
8:30PM
Blue WatersNGA work load scheduled testingscheduler testing for NGA workloadDavid King

COMPLETE

2019-10-01 10AM2019-10-01
12:04PM
Blue WatersEPO 4 racks lost xdp (cooling)
CRAY warm swapped racks back into system successfully.
scheduler, some computes missing and Gemini was rerouted

COMPLETE

2019-10-01 07:00

2019-10-01 07:30

mysql.ncsa.illinois.eduMySQL servers needed to be synchronized to convert the server in NPCF back to a replicated host.Wiki, JIRA, and some web sites stopped working.  Email forwarding to user accounts at NCSA was delayed during the outage.lindsey@ncsa.illinois.eduCOMPLETE
2019-04-18 08002019-04-18 1200LSST

Monthly Maintenance:

  • 10G Network switch maintenance
  • GPFS server updates
  • OS updates and reboots
  • Dell firmware updates
  • Kubernetes update
  • Pending Puppet changes

ALL LSST systems, including:

  • lsst-dev01, lsst-xfer, etc.
  • Slurm verification cluster
  • PDAC/Kubernetes/LSP clusters
  • tus-ats01
lsst-admin@ncsa.illinois.eduMaintenance completed.
2019-04-18 09002019-04-18 0930NCSA Open SourceUpgrade confluence to apply security patch NCSA Open Source confluence, all other services are unaffectedopensource@ncsa.illinois.eduConfluence upgraded to 6.15.2
2019-04-17 08002019-04-17 1000ADSICCP Carne MaintenanceAll services will be down.
Maintenance completed.
2019-04-17 07:302019-04-17 19:45ICCP

Quarterly Maintenance

  • Ur1carne router code upgrade
  • Centos 7.6 upgrade
  • Deployment of HDR
All services unavailableiccp-admins@campuscluster.illinois.eduMaintenance completed.
2019-04-12
1410
2019-04-12
1534
Blue Waters/ SchedulerHSN issuescheduler paused
New Login sessions hang
tbouvet@illinois.eduHSN recovered, scheduling resumed
2019-04-11 
0555
2019-04-11 
0702
LDAPLDAP process crashedAuthentication to LDAP-backed serviceshelp+its@ncsa.illinois.eduLDAP was restarted
2019-04-10 08002019-04-1530wikiwiki was taken off-line for a security related upgradewiki was unavailablehelp+its@ncsa.illinois.eduNow running the latest version of confluence

2019-04-09 0900

2019-04-09 0930CILogon (https://cilogon.org), myproxy.xsede.org, tfca.ncsa.illinois.eduDeploy new Luna SA HSM (hsm5) to production and take one old HSM (hsm4) offline (to serve as emergency backup).No downtime is expected. Use instructions at SafeNet Luna SA HSM Monthly Testing to change pool of available HSMs on {warm,cool,tepid}.ncsa.illinois.edu.help@cilogon.org{warm,cool}.ncsa.illinois.edu now use hsm3+hsm5. tepid.ncsa.illinois.edu uses hsm5+hsm3. hsm4 will eventually be powered off and reserved as a backup.

2019-04-07

0645

2019-04-07

1650

Campus Cluster and ADSWe were experiencing network connectivity issues both to the WAN and to some systems internal to ICCP, but all the traffic that was suspicious was going through the cc-core. Rebooting cc-core0 seems to have resolved the issue.Intermittent connectivity issue causing login and job submission to fail.iccp-admins@campuscluster.illinois.eduRebooting cc-core0 seems to have resolved the issue.
04/07/2019
9:30AM
04/07/2019
2:30PM
Blue WatersScheduler paused, oss hardware was replaced on scratch. Filesystem check in progress.

New jobs not starting.
Current jobs may stall if they access the bad OSS.

tbouvet@illinois.eduOSS hardware replaced and scheduler resumed

2019-04-04

0800

2019-04-04

0830

NCSA LSST ResourcesSwitches servicing LSST hardware in NCSA-3003 were migrated to a new aggregation router.A brief network blip (~60s) occurred. All hosts have been verified after the moveneteng@ncsa.illinois.eduMaintenance has been completed.
2019-04-02 0900 | 2019-04-02 1500 | NCSA Open Source | Upgrade software and server | Server and/or services can be down during this time | opensource@ncsa.illinois.edu | Upgrade completed
2019-04-02 0900 | 2019-04-02 1100 | CILogon (https://cilogon.org) | Upgrade PHP from v5.6 to v7.3 | No downtime is expected. | help@cilogon.org | Upgrade completed.
2019-04-01 | NCSA Duo | Backup code reminder emails were sent to all NCSA Duo participants in error. Your previously created backup codes are still valid. We are investigating why this email was sent. | NCSA Duo | help+security@ncsa.illinois.edu | API changes required re-coding the backup code process.
2019-03-18 1400 | 2019-03-18 1500 | BW Nearline Endpoint | Scheduled HPSS software patch roll-up | Access to BW Nearline endpoint is suspended | help+bw@ncsa.illinois.edu | Patch installation complete
2019-03-12 07:00 | 2019-03-13 17:45 | LSST - LSST dev/Slurm compute nodes | Network testing | 24 compute nodes were reserved for admin use for this testing | lsst-sysadm@ncsa.illinois.edu | Testing was extended into the 13th but was completed, and the nodes have been returned to service
2019-03-12 13:25 | 2019-03-12 14:25 | LSST | Public DNS names were inadvertently removed for LSST's Oracle servers/service and the service became unavailable | LSST Oracle servers/service | lsst-sysadm@ncsa.illinois.edu | DNS was completely restored by 14:25; slowness following the return to service was initially reported by one user but seems to have resolved itself
2019-03-09 22:35 | 2019-03-09 22:35 | LSST | Power sag caused 27 L1 "NCSA test stand" nodes to reboot | 27 L1 "NCSA test stand" nodes | lsst-sysadm@ncsa.illinois.edu | Servers rebooted themselves
2019-03-09 09:56 | 2019-03-09 10:31 | NCSA Jira, Pop, File-server | A VM host kernel panicked, causing its VMs to restart on alternate hosts. | Jira, pop mail server, and file-server services | help+its@ncsa.illinois.edu | VMs automatically restarted themselves.
2019-03-08 06:15 | 2019-03-08 06:45 | NCSA Storage Condo | There was an IB error on the storage network causing the core servers to lose connectivity to disk. | NFS/GridFTP/Remote Cluster Mounts | ckerner@illinois.edu | The node with the IB issue has been temporarily removed from service and will be placed back in when corrected.
06-Mar-2019 8am (CST) | 06-Mar-2019 9am (CST) | All services behind pfsense firewall at NCSA (qserv, verify, lsp, oradb) | pfsense network config update to stage 'k8s-prod' deployment. Requires failover of firewall, and may cause short (~60s) outage of systems behind the firewall. | All services behind pfsense firewall at NCSA (qserv, verify, lsp, oradb) | lsst-sysadm@ncsa.illinois.edu, help+security@ncsa.illinois.edu | Complete
2019-03-04 2:00 pm | 2019-03-04 2:08 pm | NAPS

IDDS will be applying several updates to the NCSA Allocations Processing Service (NAPS):

(1) Searches for logins will only find those logins for the current domain

(2) Logins will always be created for the same organization as the domain (instead of always creating an NCSA login)

(3) Valid login rules will check the rules for the organization of the current domain

(4) Bug fix to make sure int args to procedures are passed as ints, not strings

(5) Speed up project loading process

(6) Dynamically determine compute resources

(7) Correct information in confirmation message when terminating a user from a project

(8) When selecting allocation for new users, only show the most current allocations for each resource

NAPS | help+idds@ncsa.illinois.edu | Complete
2019-03-01 09:00 | 2019-03-01 09:33 | aForge | Multiple Ambari services were in an error state. Individual service starts would fail. | Job submission was down | aforge-admin@ncsa.illinois.edu | Cluster was restarted
2019-02-27 23:00 | 2019-02-28 08:23 | NCSA VPN | Campus moved Duo to a different instance (off of DUO1) to improve performance and reduce future downtime. NCSA Duo is bundled with campus Duo and is also affected. The vendor has completed changes, but additional work appears to be needed on the NCSA VPN to accommodate this change. | NCSA VPN was not working with Duo push. Entering the 6-digit passcodes generated by the Duo app can be used as a workaround. | help+neteng@ncsa.illinois.edu | NCSA VPN is now working for both push and passcodes.
2019-02-27 23:00 | 2019-02-27 23:59 | Any system using Duo authentication | The vendor moved us to a different Duo instance (off of DUO1) to improve performance and reduce future downtime. | Anyone with a current session will not be impacted; only people trying to authenticate into a new session are affected. We expect Duo to be up for most of this change window and actual downtime to be minutes. All systems using Duo are affected including: | help+neteng@ncsa.illinois.edu, help+security@ncsa.illinois.edu | Vendor has completed work and most systems appear to be functioning; however, it appears some local changes are needed for the NCSA VPN (see separate posting).
2019-02-22 06:30 | 2019-02-22 07:00 | ICCP WAN | This morning during a routine generator transfer test, one of the UPS units in a Tech Services networking node, node-1, failed, resulting in a loss of power to portions of node-1. Network Engineers were on-site during the test and were able to quickly resolve all issues stemming from that loss of power. Not all equipment hosted in node-1 was impacted, but one of the campus core routers, equipment hosting the science DMZ (CARNE and thus ICCP WAN as a whole), and other parts of the ICCN (Inter-Campus Communication Network) were impacted. | All networking in and out of ICCP was down. Intra-cluster networking within ICCP was not affected. | help+neteng@ncsa.illinois.edu | ICCN network engineers resolved the issues and things came back up successfully
2019-02-21 10:00 | 2019-02-21 14:00 | ICCP | Moab core dumped during startup. | No one could submit jobs and no new jobs would start. | help@campuscluster.illinois.edu | Moab was restarted successfully after removing all checkpoint files.
2019-02-21 08:00 | 2019-02-21 12:00 | LSST

Monthly maintenance

  • OS/Yum updates
  • Switch maintenance in NPCF N73 & P73
  • pfSense update & port negotiation change
  • GPFS server updates
  • Firmware updates for Dell C6420s

ALL LSST systems, including:

  • lsst-dev01, lsst-xfer, etc.
  • PDAC, verification, and Kubernetes clusters
  • tus-ats01
lsst-admin@ncsa.illinois.edu

Maintenance was successfully completed with one pending issue:

  • monitoring hosts (lsst-int-monitor; monitor-ncsa) are not showing status information due to a problem reaching InfluxDB (resolved)
2019-02-21 09:20 AM | 2019-02-21 09:26 AM | Services using DUO | The DUO1 deployment experienced a load balancer failure resulting in 100% of authentication requests failing to complete. | All systems using Duo were affected including: | help+security@ncsa.illinois.edu | This issue was identified and resolved via automated remediation by the vendor. See http://stspg.io/940af334e for details.
2019-02-18 01:31 PM | 2019-02-18 05:05 PM | ICCP | Moab was crashing within a few minutes of starting. | Jobs could be submitted, but would not start. | iccp-admins@campuscluster.illinois.edu | Moab was restarted with no additional commands run (showconfig, etc.). This allowed Moab to properly index the job database. After completion, the scheduler was stable again.
2019-02-18 9:00 AM | 2019-02-18 11:00 AM | LSST - K8s | Security update of Docker and Kubernetes packages to address CVE-2019-5736 | Qserv, all LSST services running in K8s. | lsst-admin@ncsa.illinois.edu | Patching completed on time (10:00 AM). Additional troubleshooting of lsp-stable & lsp-int indirectly related to maintenance.
2019-02-15 1:15 PM | 2019-02-15 about 1:45 PM | Some internet connectivity | ICCN router card crashed. Some commodity internet traffic was affected during the timeframe listed. | Commodity traffic to/from NCSA. | neteng+help@ncsa.illinois.edu | This has been resolved.
2019-02-13 17:00 | 2019-02-13 21:00 | netdot.ncsa.illinois.edu | NetEng will be migrating Netdot to a new platform. | Users will not be able to log in to the NetDot IPAM and make/view DNS entries. The DNS servers will remain available throughout the window. | help+neteng@ncsa.illinois.edu | This has been completed.
2019-02-10 11:40am | 2019-02-12 11:50am | ICCP | A controller failed, which caused an interruption with the redundant controller. A new enclosure is in place; still waiting on a valid second controller. The cluster has returned on one controller after FSCK came back clean on the file system. | Shared file systems on the cluster were unavailable | set@ncsa.illinois.edu | After force verifying the pools, running FSCK on the file system, and swapping the enclosure, the file system returned to service. New controller successfully installed on 02/13; opened PMR with IBM on FSCK duration
2019-02-11 11:00 | 2019-02-11 17:50 | IDDS job processing | We will be doing a correction to a large number of Blue Waters job records in the IDDS database. This process will begin at 11am and is expected to last around 6-7 hours. | There will be a small interruption to real time job loading for Blue Waters that should last around 1 hour. Although there should be little impact to other systems, database access to the jobs table might be sluggish. | help+idds@ncsa.illinois.edu | Complete
2019-02-10 21:00 | 2019-02-11 09:15 | NCSA Open Source | Kernel crashed. The proxy server is down, resulting in all NCSA Open Source services being unreachable | NCSA OpenSource: JIRA, WIKI, BAMBOO, Confluence | devops.isda@lists.illinois.edu | Physical reboot of server resolved issue
2019-02-08 13:00 | 2019-02-08 17:30

BlueWaters HPSS

ncsa#Nearline globus service

HPSS core server encountered a bug and crashed


Vendor is installing a patch to the core hpss server. 


Anticipating the system will be returning to service by 17:20

BlueWaters HPSS storage

Globus transfers to/from ncsa#Nearline

Vendor installed a patch

HPSS and ncsa#Nearline were returned to service


2019-02-07 5:00 AM | 2019-02-07 5:30 PM | BW/HPSS, ncsa#Nearline (GO) | Scheduled Maintenance | Software and firmware updates completed. | help+bw@ncsa.illinois.edu | ncsa#Nearline (GO) returned to service
2019-02-06 9:05 AM | 2019-02-06 3:14 PM | BW/Scheduler | HSN issue - full reboot to recover | Mainframe rebooted and all running jobs were lost. | help+bw@ncsa.illinois.edu | BW returned to service
2019-02-05 07:00 | 2019-02-05 22:00 | iForge/aForge | Quarterly Maintenance (20190205 Maintenance for iForge) | All systems were unavailable during the maintenance. | iforge-admin@ncsa.illinois.edu | Maintenance was successfully completed. iForge and aForge were returned to service by 22:00.
2019-02-02 6:40 | 2019-02-02 10:20 | ICCP scheduler | The root filesystem filled up on cc-mgmt1. | Both resource manager and scheduler were down | iccp-admins@campuscluster.illinois.edu | Booted the system into single-user mode, gzipped the old messages file, and moved it to GPFS. Moab then had trouble restarting; restarting Moab with the clear-checkpoint option worked.
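
The recovery step described in this entry (compressing an old log file and moving it off a full root filesystem) can be sketched in Python. This is only an illustrative sketch, not the procedure the administrators actually ran: the function name `compress_and_move` and the example paths in the comment are hypothetical, and it assumes the destination directory sits on a different filesystem with free space.

```python
import gzip
import shutil
from pathlib import Path


def compress_and_move(log_path: Path, archive_dir: Path) -> Path:
    """Gzip log_path into archive_dir and remove the original file.

    Hypothetical helper for illustration only.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / (log_path.name + ".gz")
    # Stream the data so a large log is never held in memory at once.
    with log_path.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Free space on the full filesystem only after the compressed copy exists.
    log_path.unlink()
    return target


# Example call with hypothetical paths:
# compress_and_move(Path("/var/log/messages-old"), Path("/gpfs/admin/logs"))
```

Writing the compressed copy directly to the other filesystem avoids needing any extra free space on the full one.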

2019-01-31 06:00 | 2019-01-31 07:10 | NCSA ITS vSphere vCenter | Upgraded ITS vSphere vCenter server to latest version | All VMs remained online during the maintenance, but management through vCenter was unavailable. | help+its@ncsa.illinois.edu | Upgrade complete

2019-01-30 10:00 p.m. | 2019-01-30 12:00 p.m. | NCSA XSEDE DNS server | Performing patching/upgrade on ns1.xsede.org | While patching, the ns1.xsede.org DNS server will be unavailable intermittently. Backup DNS servers will remain available during this time frame. | help+neteng@ncsa.illinois.edu | Maintenance complete

2019-01-23 5PM | 2019-01-24 8AM | Fileserver | Scheduled Maintenance | Shares on Fileserver were unavailable during the outage. | help+its@ncsa.illinois.edu | Maintenance complete

2019-01-18 12:14 | 2019-01-18 14:32 | RSA OTP user portal | An ESXi server crashed, taking down several VMs it was hosting. The OTP VM rebooted on an alternate ESXi host. | RSA OTP user portal | help+its@ncsa.illinois.edu | RSA OTP user portal online

2019-01-18 12:14 | 2019-01-18 13:30 | JIRA, file-server, ad-a, jabber, vsphere, email relay | An ESXi server crashed, taking down several VMs it was hosting. The VMs all rebooted on alternate ESXi hosts. | JIRA, file-server, ad-a, jabber, vsphere, and email relay all rebooted. JIRA had corrupted index files, which took a while to repair. | help+its@ncsa.illinois.edu | JIRA, file-server, ad-a, jabber, vsphere, and email relay rebooted and online

2019-01-17 08:00 | 2019-01-17 12:00 | LSST

Monthly maintenance

  • Power rebalancing in NPCF L73
  • Switch maintenance in NPCF M73, N73, P73
  • Critical security patching
  • Dell firmware upgrades
ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters, and tus-ats01)

lsst-admin@ncsa.illinois.edu

Maintenance was completed successfully with the following caveats:

  • lsp services in Kubernetes are not fully functional (this is carryover from before the PM; see discussion on Slack, dm-lsp-users and possibly other channels)
  • lsst-l1-cl-dmcs will not boot after firmware updates

Please open tickets if you notice other issues.

2019-01-16 10:00 | 2019-01-16 10:15 | NPCF Emergency power off | Emergency power off panel was energized | Facility electrical and HVAC systems | mrantissi@illinois.edu | Panel is armed
01/12/2019 8AM | 01/12/2019 1PM | BW/Mainframe resource | Hung threads on scratch/home, paused the scheduler, HSN requires full reboot to recover 9:30AM | Mainframe rebooted and all running jobs were lost. | Timothy Bouvet | BW returned to service 1PM

2019-01-10 5:35PM | 2019-01-10 5:55PM | code42 crashplan pro e services had update for dataloss bug with MS OneDrive | Code42 crashplan service was updated to the latest release to fix a dataloss problem with clients also running MS OneDrive. | Backup services were interrupted for a few minutes while services updated | crashplan@ncsa.illinois.edu | Now running Code42 6.8.6
2019-01-10 3:20PM | 2019-01-10 5:00PM | DUO 2-Factor Auth | DUO upstream vendor reported issues with their service. https://status.duo.com/ | NCSA systems that use DUO for 2FA | help+security@ncsa.illinois.edu | DUO brought their systems back online
2019-01-09 10:28 AM | 2019-01-10 3:00 PM | BW/HPSS | Power event at NPCF and recovery from fallout | HPSS ncsa#Nearline | Glasgow, James A, glassgow@illinois.edu | HPSS ncsa#Nearline RTS
2019-01-09 10:28 AM | 2019-01-09 4:35 PM | BW/All Resources Down | Power event at NPCF and recovery from fallout | All BW Resources Down | tbouvet@illinois.edu | Power Restored, All Resources Except HPSS RTS
2019-01-09 10:15 | 2019-01-09 12:15 | Industry systems / LSST systems | Power event at NPCF caused some Industry and some LSST systems to go offline | Running jobs on iforge and other systems | help+industry@ncsa.illinois.edu, lsst-admin@ncsa.illinois.edu | The affected systems have been returned to service and users are being notified of which jobs to rerun
2019-01-08 18:00 | 2019-01-08 19:00 | NCSA office net firewall | Software upgrade on NCSA firewall and some config changes. | NCSAnet wireless, wired network (closed and partially-closed nets). IllinoisNet wireless will remain available during the maintenance. | help+neteng@ncsa.illinois.edu | Firewall upgrade did not go through; however, all services have been restored. NetEng is investigating and will work with the vendor to figure out a solution.

01/08/2019 2:20PM | 01/08/2019 2:40PM | code42 crashplan pro e services had update for security issues | Code42 crashplan service was updated with the latest security fixes | Backup services were interrupted for a few minutes while services updated | crashplan@ncsa.illinois.edu | Now running Code42 6.8.5

12/26/18 7:31AM | 12/31/18 12:50PM | OTP self-service site was down | Power on the hypervisor running the RSA OTP self-service site was lost and the service didn't restart | PIN changes and new software distribution were unavailable | otp@ncsa.illinois.edu | Now running updated version of software and all functionality was restored.
12/27/18 1:16PM | 12/27/18 11:40PM | NPCF - 2 power blips (B transformers) | System has been returned to service. | Blue Waters ongoing jobs all terminated, scheduler paused while mainframe rebooted. | After an absurdly long outage to perform a reboot, the system was returned to service. There were apparently issues on shutdown, and again on bringup with various hardware fallout.

12/27/18 7:58AM | 12/27/18 8:30AM | Blue Waters / bwedge, bwds2 - rebooted on backup server; bwdsm-dev - unresponsive? | jack.internal.ncsa.edu stopped responding/crashed; we're having intermittent issues with ESXi hosts kernel dumping | The VM server went down, impacting VMs on that server. VMs will restart on another backup server with a temporary interruption in their services. | jack.internal.ncsa.edu was power cycled. VMs were migrated to balance the load on servers after Jack was returned to service.
2018-12-17 | 2018-12-19 | ICCP | Removed old file systems no longer in production; reformatted LUNs; rolling reboot of NSD servers to pick up new presentations; rebalance started | No user impact, all services remained fully operational | set@ncsa.illinois.edu | New v5 formatted disks added successfully; FS expanded to full size; rebalance of FS began
2018-12-18 08:00 | 2018-12-18 10:00 | NPCF-EXIT-EAST | The firmware on NPCF-EXIT-EAST was upgraded. | Traffic was re-routed through NPCF-EXIT-WEST during the maintenance. No impact to users was observed. | neteng@ncsa.illinois.edu | Firmware was upgraded without issue.
2018-12-12 08:00 | 2018-12-12 20:43 | ICCP

Monthly maintenance

  • cutting the cluster over to new Spectrum Scale v5 formatted file system

Total cluster outage.

Bringing the system back took longer than expected because the interface renaming script stopped working.


help@campuscluster.illinois.edu
2018-12-11 08:00 | 2018-12-11 10:00 | NPCF-EXIT-WEST | The firmware on NPCF-EXIT-WEST was upgraded. | Traffic was re-routed through NPCF-EXIT-EAST during the maintenance. No user-visible outage occurred. | neteng@ncsa.illinois.edu | Firmware was upgraded without issue.
2018-12-07 08:54 | 2018-12-07 19:25 | ICCP | ACB UPS experienced a fault, causing the storage appliance to shut down in a controlled manner | Jobs halted on the system due to lack of parallel file system presence. | help@campuscluster.illinois.edu | F&S was dispatched to put the UPS in bypass, FSCKs were run on the file systems to ensure integrity, and the cluster was returned to service.
2018-12-06 5:30PM | 2018-12-06 8:30PM | Wired network connections in NPCF office space | Software upgrade on network switches | Wired network service in NPCF office space. NCSAnet, IllinoisNet Wireless remained available | help+neteng@ncsa.illinois.edu | Maintenance was successful
2018-12-06 06:00 | 2018-12-06 07:30 | NCSA ITS vSphere vCenter | ITS vSphere vCenter server was upgraded to latest version | All VMs remained online during the maintenance, but management through vCenter was unavailable from 06:18-07:25. | help+its@ncsa.illinois.edu | Upgrade was completed successfully

2018-11-29 08:00 | 2018-11-29 14:00 | LSST

Monthly maintenance

  • Puppet code changes
  • disable CPU hyperthreading
  • OS/Yum updates
  • code upgrades on select service & management switches NPCF
  • pfSense updates

ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters, and tus-ats01)

lsst-admin@ncsa.illinois.edu

Maintenance was completed successfully
2018-11-19 08:00 | 2018-11-19 19:30 | ICCP

Monthly maintenance

  • Split the filesystem
  • Reformat with new v5 format
Total cluster outage.
help@campuscluster.illinois.edu
2018-11-15 5:30PM | 2018-11-15 7:30PM | NCSA building router in 2045 | Software upgrade on one of the building routers (2045-br) | Traffic failed over to the redundant building router and no impact on network traffic was seen | help+neteng@ncsa.illinois.edu | Maintenance was completed successfully
11/14/18 5:30PM | 11/14/18 6:23PM | Blue Waters / Home filesystem | MDS issue | Scheduler paused; logins impacted | Timothy Bouvet | Home file system RTS
2018-11-14 10:00am | 2018-11-14 11:00am | idp.ncsa.illinois.edu | Upgrade Shibboleth IdP from v3.3.2 to v3.4.1 | ECP (command line) Duo authentication is now supported natively by Shib IdP software. | Terrence Fleury | Work completed a day early

2018-11-14 10:45 am | 2018-11-14 11:20 | Blue Waters / Home filesystem | Investigation ongoing - suspect HSN quiesce/home, and new job starts during the scheduler pause | Timothy Bouvet | Back in service at 11:20

2018-11-06 06:00 am | 2018-11-06 12:25 pm

Networking NetSure DC Distribution System

Tape Library QBERT and DIGDUG

iForge racks:

  • Y121, Z121, AA121, CC121, DD121
De-energize distribution power panel DP-6C-020 to install new power panel PPC4

Loss of power to the core network DC Distribution panel (B Side), the network is 2N power feed, no impact on the network due to redundancy.

Loss of power to two tape libraries; temporary power feeds will be provided.

iForge system will be powered down for quarterly maintenance.


Mohammad Rantissi

Work completed as expected

2018-11-13 7:30 PM | 2018-11-13 8:30 PM | LSST lspdev/Kubernetes | Cluster reboot | Memory performance on most k8s nodes was in a degraded state as a result of a power event that occurred over the weekend. Reseating the nodes in their chassis slots resolves the issue. | lsst-admin@ncsa.illinois.edu | Systems rebooted and memory performance is back to normal
2018-11-10 ~04:40 | 2018-11-10 ~04:45 | iForge (select compute nodes) | A power event caused some compute nodes to reboot | Select skylake platform compute nodes, including 7 nodes in the skylake queue. Jobs running on those nodes would have been impacted. | iforge-admin@ncsa.illinois.edu | Systems rebooted and brought themselves back online.
2018-11-10 ~04:40 | 2018-11-10 ~04:45 | LSST (lspdev and select L1 hosts)

A power event caused some hosts to reboot:

  • lspdev kubernetes cluster (3 nodes including master node did not come back on their own and were manually brought online around 09:30)
  • some L1 nodes rebooted as well

lspdev/Kubernetes cluster was unavailable from ~04:40 until ~09:30

select L1 hosts rebooted

lsst-admin@ncsa.illinois.edu

Systems should be back online and functioning. Users are asked to create tickets if there are lingering issues.
2018-11-08 5:30PM | 2018-11-08 7:30PM | NCSA building router in basement 07 (ncsa-07-br) | Software upgrade on one of the building routers. | Traffic failed over to the redundant NCSA building router. No impact on the network was observed | help+neteng@ncsa.illinois.edu | Maintenance was completed successfully without any issues
2018-11-07 16:50 | 2018-11-07 17:00 | NCSA Jira | Jira was rebooted to increase RAM. | NCSA's Jira was offline while its RAM configuration was upgraded. | help+its@ncsa.illinois.edu | Upgrade was completed successfully without any issues.
2018-11-06 06:00 | 2018-11-06 21:45 | iForge / aForge | Quarterly Maintenance (20181106 Maintenance for iForge)

All systems were unavailable during the maintenance.

iforge-admin@ncsa.illinois.edu

Maintenance was completed successfully:

  • aForge returned to service at 21:15
  • iForge returned to service at 21:45

NOTE: OFED was updated to v4 on the clusters during the PM. Some MPI software may need to be recompiled due to changes in libraries (e.g., libpsm_infinipath is no longer present in OFED v4). Frequently used openmpi installations have been updated to accommodate this change. Software compiled against affected MPI software may also need to be recompiled.
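
One way to spot binaries that may need recompiling after a library such as libpsm_infinipath disappears is to look for unresolved entries in `ldd` output. The following is a minimal sketch under stated assumptions, not a tool provided by the iForge admins: the helper names `missing_libraries` and `needs_rebuild` are hypothetical, and the code only parses text that you have already captured from `ldd <binary>`.

```python
import re

# Matches ldd lines of the form "libfoo.so.1 => not found".
_NOT_FOUND = re.compile(r"^\s*(\S+)\s+=>\s+not found\s*$")


def missing_libraries(ldd_output: str) -> list[str]:
    """Return the shared libraries that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        match = _NOT_FOUND.match(line)
        if match:
            missing.append(match.group(1))
    return missing


def needs_rebuild(ldd_output: str, removed_prefix: str = "libpsm_infinipath") -> bool:
    """True if any unresolved library name starts with removed_prefix."""
    return any(
        name.startswith(removed_prefix) for name in missing_libraries(ldd_output)
    )
```

Keeping the parsing separate from running `ldd` makes the check easy to apply to saved output from many binaries at once.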

2018-11-06 07:00 | 2018-11-06 09:00 | NCSA VPN Service | The VPN was upgraded. | The NCSA VPN service was down for maintenance | help+neteng@ncsa.illinois.edu | The NCSA VPN has been upgraded
2018-11-01 7:00PM | 2018-11-01 7:30PM | Wired networking on 4th floor in NCSA building | Software upgrade on network closet switches | Wired network, VOIP phones on 4th floor. NCSAnet Wireless remained available during maintenance window. | help+neteng@ncsa.illinois.edu | Upgrade was completed successfully without any issues.
2018-11-02 3:30 AM | 2018-11-02 6:10 AM

iforge cluster

GPFS issue. "ls /usr/local" hangs.

Direct access to some directories under /usr/local was OK,

i.e. "ls /usr/local/modules-3.2.9.iforge" was OK.

iforge login node is currently down.

New ssh connections are hanging.

There is the potential for issues with running jobs.

Scheduler has been paused.


Jim Long

jlong1s@illinois.edu


Something odd going on with iforge020 was causing hangs.


Once iforge020 was rebooted, access to /usr/local was unlocked.

2018-10-30 9:00 p.m. | 2018-10-30 11:00 p.m. | NCSA DHCP | Patches | The DHCP server will be unavailable periodically for reboots and patching. Possible timeouts for DHCP, but generally no interruptions are expected. | help+neteng@ncsa.illinois.edu
2018-10-25 5:30PM | 2018-10-25 6:00PM | Wired networking on 3rd floor in NCSA building | Software upgrade on network closet switches | Wired network, VOIP phones. NCSAnet Wireless remained available during maintenance window. | help+neteng@ncsa.illinois.edu | Code upgrade completed successfully without any issues.

2018-10-22 12:00pm | 2018-10-22 1:00pm | IDDS servers | Patches | XRAS admin/review/submit UIs, XDCDB Admin UI, NAPS | idds-admin@ncsa.illinois.edu | Patches complete
2018-10-18 08:00 | 2018-10-18 12:00 | LSST

Monthly maintenance

  • firmware update and reboot on monitor01 (monitoring collector)
  • OS & Kernel updates on tus-ats01.lsst.ncsa.edu
  • Puppet code changes
  • monitor01/InfluxDB (and likely the front-end Grafana monitoring, e.g., monitor-ncsa.lsst.org) will be unavailable for a short period of time
  • tus-ats01 will be unavailable for OS & Kernel updates
  • the Puppet changes are intended to be functional "no-ops" and should cause no outage, although we scheduled these changes during our monthly PM window in case something unexpected occurs
lsst-admin@ncsa.illinois.edu

Maintenance completed successfully
2018-10-17 08:00 | 2018-10-17 18:00 | ICCP

Monthly Maintenance

  • Deploying new kernel with CVE-2018-14634 fix
  • Switching to MTU9000 across
  • GPFS 5.0.2 upgrade
  • Firmware bug fixes applied to DDN SFA14KX 
Total system outage

help@campuscluster.illinois.edu

Maintenance completed
2018-10-17 15:40 | 2018-10-17 23:00 | 3rd Floor Networking | Portions of the third floor did not have network connectivity due to a switch malfunction. | Portions of the third floor are without network connectivity. | neteng@ncsa.illinois.edu | The issue has been resolved.

2018-10-15 08:00 AM | 2018-10-15 08:50 PM | Blue Waters | Maintenance to apply security patches | All services for Blue Waters will be down except for ncsa#Nearline | bw-admin@ncsa.illinois.edu | Outage extended for 2 hours due to unexpected power loss to 3 rows of equipment
2018-10-16 10:00 AM | 2018-10-16 01:00 PM | DUO 2-Factor Auth | DUO upstream vendor has reported issues with their service. https://status.duo.com/ | NCSA systems that use DUO for 2FA might experience intermittent issues | help+security@ncsa.illinois.edu
2018-10-15 7:30 am | 2018-10-15 11:00 pm | Nebula, File-server | Power loss in the NCSA building is causing issues with systems | Nebula web services are turned off, File-server is unavailable | help+its@ncsa.illinois.edu | Systems were brought back online and repaired.
2018-10-15 07:35 | 2018-10-15 09:15 | LSST | Power event -> host outage at NCSA 3003

affected: all physical LSST hosts (and VMs) at NCSA 3003:

  • incl. lsst-dev*, lsst-xfer, lsst-l1*, lsst-daq, lsst-dev-db
lsst-admin@ncsa.illinois.edu
  • most physical hosts rebooted themselves after the event, although a few L1 systems had to be manually powered on
  • most VMs had to be manually started after the event
2018-10-11 16:30 | 2018-10-11 17:00 | crashplan backup service | crashplan was upgraded to code42 6.8.4 | crashplan service was restarted and clients reconnected | crashplan@ncsa.illinois.edu | crashplan service has fewer security vulnerabilities now.
2018-10-09 | 2018-10-09 | DHCP | Additional DHCP attributes will be passed to clients. | The Security Operations group has requested that the Web Proxy Auto-Discovery Protocol (WPAD) be set to blank via DHCP to better secure client workstations/laptops. This should not impact any user's general network usage. | help+neteng@ncsa.illinois.edu | WPAD has been applied to all user networks at NCSA and NPCF (including wireless).

2018-10-08 17:00 | 2018-10-08 21:00 | Wired networking on 2nd floor in NCSA building | ncsa-2045 network switch software upgrade | Wired networking for desktop computers and VOIP phones. Wireless network remained available during maintenance | help+neteng@ncsa.illinois.edu | The switch stack on the second floor was upgraded. Issues during the upgrade process caused the maintenance to run longer than expected. All networking services have been restored to normal.
2018-10-04 16:35 | 2018-10-04 16:35 | jabber.ncsa.illinois.edu | The Openfire jabber server stopped working correctly and was restarted. | Everyone using jabber reconnected. | help+its@ncsa.illinois.edu | Jabber rooms are working as they should again
2018-10-04 08:00 | 2018-10-04 09:15 | LSST | Critical security patching

ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters)

The following systems will remain online and unaffected:

  • tus-ats01
lsst-admin@ncsa.illinois.edu

Maintenance was successful.
2018-10-03 06:00 | 2018-10-03 07:00 | Campus Cluster - Networking | Maintenance was performed on the OmniPoP uplink on ur1carne, which is the upstream router for all ICCP based network traffic. Engineers worked to transition the link from old optical transport gear to new gear that is optically protected with automatic failover. | All traffic that would normally take this OmniPoP link will reroute through other WAN links on ur1carne. Downtime of < 15 min is expected within the hour window while engineers swing the fiber jumpers from the old optical gear to the new optical gear. There should be no impact to DES or any ICCP customers. Please contact NetEng if you notice any unexpected outages. | help+neteng@ncsa.illinois.edu | Maintenance was successful.
2018-10-02 17:00 | 2018-10-02 20:00 | NPCF Networking DC Power System | Testing and maintenance of the DC power system and battery backup will be performed. | No outage. | help+neteng@ncsa.illinois.edu | Tests were completed without issue.
2018-09-26 11:00 | 2018-09-26 12:00 | Campus Cluster - MWT2 Networking | Maintenance was performed on the Internet2 uplink on ur1carne, which is the upstream router for all ICCP/MWT2 based network traffic. | MWT2 lost connectivity to LHC1 but everything else rerouted, all of which was expected. | help+neteng@ncsa.illinois.edu | The maintenance was successful; no issues have been reported
2018-09-20 | 2018-09-24 | OpenAFS servers | OpenAFS file and database servers were upgraded to 1.6.23 | The OpenAFS servers were upgraded to the latest code without service interruption | afs@ncsa.illinois.edu | Now running with latest security fixes in place
2018-09-20 08:00 | 2018-09-22 16:50 | LSST Qserv | qserv-master01 is having trouble booting after a motherboard replacement during planned maintenance. | Qserv in general, specifically qserv-master | lsst-admin@ncsa.illinois.edu | RESOLVED

2018-09-20 08:00 | 2018-09-20 14:40 | LSST LSPdev | LSPdev kubernetes is having a gateway error after upgrading | LSPdev kubernetes | lsst-admin@ncsa.illinois.edu | RESOLVED

2018-09-20 08:00 | 2018-09-20 14:00 | LSST

Monthly maintenance (Sep):

  1. Network switch firmware updates/reboots
  2. Lenovo firmware updates/reboots
  3. OS package updates/reboots
  4. ESXi hypervisor updates/reboots
  5. GPFS client changes and upgrade to 4.2.3-10

  6. GPFS server upgrade to 4.2.3-10

All LSST systems and services will be unavailable for the duration of the maintenance period.

lsst-admin@ncsa.illinois.edu

RESOLVED

qserv-master01 and LSPdev are still having issues. These will be tracked as separate incidents.

2018-09-19 08:00 | 2018-09-19 22:00 | Campus Cluster

Monthly maintenance

  1. Switching to CentOS 7.5 across cluster
  2. Upgrading gpfs to 4.2.3.10 (client only)

All compute and login nodes were down.

The filesystems were also unavailable due to issues with the change to gpfs and RH7.5

help@campuscluster.illinois.edu

The cluster was back in service at 2200
2018-09-17 17:30 | 2018-09-17 19:30 | Wired networking on 1st floor (ncsa-1045) | Software upgrade on network switch for 1st floor. | Wired networking for users on 1st floor was unavailable as network engineering performed software upgrades on their equipment. Wireless network (NCSAnet) remained available during this time. | help+neteng@ncsa.illinois.edu | Maintenance was completed successfully. Users can contact neteng if they have any issues with their wired network connections.
2018-09-12 06:00 | 2018-09-12 09:00 | DNS1, DNS2 | DNS1 and DNS2 will be updated/upgraded | DNS servers will be undergoing routine maintenance. During this maintenance window, system and services will be restarted. One DNS server will always be responsive during the maintenance. | help+neteng@ncsa.illinois.edu | Updates have been applied.
2018-09-11 9:30 a.m.2018-09-11 11:00 a.m.Internet2 100G connectionICCN engineers will be migrating our Internet2 connection to the new ICCN optical equipment.Traffic will fail over to a secondary peering. We expect minimal impact to users. Direct peering will fall back to normal routing.help+neteng@ncsa.illinois.eduThe migration has been completed.
2018-09-11 8:30 a.m.2018-09-11 11:00 a.m.ESnet 100G direct connectionWe will be migrating our ESnet connection to the new ICCN optical equipment.Traffic will fail over to a secondary peering. We expect no impact during this maintenance.
The migration has been completed.
2018-10-10 09:00

2018-10-10
17:00 

netact.ncsa.illinois.eduMultiple users reported they were unable to delete their activations or change networks within Netact.netact.ncsa.illinois.eduhelp+neteng@ncsa.illinois.eduFixed the bug and tested. Issue was resolved.
2018-09-06 11:002018-09-06 12:00MREN Circuit MoveThe MREN WAN circuit is being moved to an optical protection switch.Traffic will be re-routed over an alternate peering during the test period.
neteng@ncsa.illinois.edu
2018-09-06 16:002018-09-06 16:40RSA Authentication ManagerRSA Authentication Manager 8.2 SP 1 P 08 was appliedBoth primary and replica servers were updated with the latest security patchesotp@ncsa.illinois.eduRunning 8.2 SP1 P08
2018-08-15 08:002018-08-15 20:08Campus Cluster

Preventative Maintenance

  • FSCK on filesystem
  • Reseat and reset management modules on IB core switch
  • BIOS updates on some nodes
  • Upgrade Carne uplink to 2x100G
Total outagehelp@campuscluster.illinois.edu

Corrected bad inode on filesystem.

Rebooted IB core switch

2x100G links are working

2018-08-29 09:382018-08-29 10:21Services that utilize Duo 2FA including bastion hosts and VPN.Latency issues with DUO1 as per https://status.duo.com/ Any service that uses Duo for authentication including bastion hosts and VPN.help+security@ncsa.illinois.eduService appears to be returning to normal as per updates on https://status.duo.com/
2018-08-20 10:202018-08-20 12:00sslvpn.ncsa.illinois.eduintermittent login issues with DUO two factor authentication due to an outage on DUO's end.Two factor authentication to sslvpn service.help+neteng@ncsa.illinois.eduDuo identified the issue and resolved the outage. Users can connect to sslvpn over Duo 2FA now.
2018-08-16 11:252018-08-16 12:41SlackSlack is reporting connectivity issues on their status page ( https://status.slack.com/ )Slack reported, "connectivity issues impacting all workspaces "

feedback@slack.com

Slack reported this resolved at 12:41, though NCSA users reported it working around 11:38.

2018-08-15 08:002018-08-15 16:00ISDA VM infrastructureUpgrade of all VM servers as well as backend storage systemNCSA opensource, NCSA docker hub, ISDA VM serverskooper@illinois.eduUpgrade was successful
2018-08-15 08002018-08-15 1430Storage Condo MaintenanceAll servers were upgraded to gpfs 4.2.3.10 and the clustered nfs service was implemented as well.Storage Condockerner@illinois.eduUpgrade was successful
2018-08-14 05:002018-08-14 09:00NCSA Wikiwiki.ncsa.illinois.edu will be upgraded to Confluence 6.10.1 and then to 6.10.2.The wiki will be down intermittently during the upgrade. Read the banner at the top of wiki pages for current status.help+its@ncsa.illinois.eduUpgrade was successful
2018-08-07 07:002018-08-10 12:00iForge ifdbpoc serverHardware issues require migrating to new server; some signs indicate service was impacted prior to 2018-08-07 07:00 but no reports have confirmedifdbpoc
Admins migrated data and services to another server. Verification was performed by the apps team.
2018-08-08 -- 1430hrs 2018-08-10 -- 0730hrsBlue Waters Nearline Endpoint
Due to very high demand for data retrieval from Nearline, a pause rule is in effect to allow manual task scheduling. You may submit tasks as normal and they will be run as quickly as possible.
Data storing and retrieving to/from the Nearline storage system.hpssadmin@ncsa.illinois.eduMany tasks were manually scheduled and completed to help re-balance the system utilization. The endpoint pause rule was lifted and all tasks are running again.
2018-08-07 07:002018-08-07 22:15iForge / aForgeQuarterly Maintenance ( 20180807 Maintenance for iForge )All systems will be unavailable during the maintenance.

iforge-admin@ncsa.illinois.edu

In progress

  • iForge was placed into production at 22:15
  • aForge was brought back online by 19:45


2018-08-03 11:302018-08-03 13:30NCSA VPNA configuration issue caused some VPN users connection problems to some NCSA resources.Some VPN users reported connectivity problems to some internal NCSA resources.help+neteng@ncsa.illinois.eduA configuration change was applied which corrected the routing issue.
2018-07-27 11:452018-07-27 13:45NCSA WikiThe wiki was being intermittently slow and unresponsive.wiki.ncsa.illinois.eduhelp+its@ncsa.illinois.eduUpgraded several software packages and rebooted wiki server
2018-07-27 08:002018-07-27 08:15NCSA VPNThe old NCSA VPN (vpn.ncsa.illinois.edu) was decommissioned. All users should be using the new VPN (sslvpn.ncsa.illinois.edu).VPN was decommissioned.neteng@ncsa.illinois.eduThe old VPN has been decommissioned and all users should be using the new VPN.
2018-07-26 14:002018-07-26 19:00NCSA RTThe RT help site was being intermittently slow and unresponsive.help.ncsa.illinois.eduhelp+its@ncsa.illinois.eduUpgraded several software packages and rebooted RT server
2018-07-25 14:302018-07-25 14:40NCSA WikiWiki RestartConfluence service restarted help+its@ncsa.illinois.edu
2018-07-24 13:202018-07-24 13:55crashplancrashplan was upgraded to 6.7.3 for latest feature and security updates. Client updates will push out to systems automatically over the next few days.All clients paused backups for about 2 minutes as servers restarted with new code.crashplan@ncsa.illinois.edunow running Code42 6.7.3
2018-07-22 19:142018-07-22 19:45NCSA GitLabNCSA GitLab server was updated.
  • Renewed SSL certificate
  • Upgraded GitLab software
  • Increased CPU & RAM
help+its@ncsa.illinois.edu Completed
2018-07-19 18:442018-07-20 10:45:13nebulanebula controller experienced a fatal hardware error on 10gE nic

Horizon interface to Nebula (https://nebula.ncsa.illinois.edu/) and all OpenStack command-line tools are non-functional. Keystone authentication services are also offline.

Instances that were running should continue to run, but restarting will probably fail until the controller is repaired. Launching new instances will also fail.

nebula@ncsa.illinois.eduReplaced card, nebula.ncsa.illinois.edu is now accessible again.
2018-07-19 12:00 2018-07-19 12:30

LSST: lsst-dev-db and dependent services, including kubernetes lspdev

Following the July 19 planned maintenance, MariaDB services on lsst-dev-db are unavailable along with dependent services, including:

  • kubernetes lspdev

DB services on lsst-dev-db along with dependent services, including:

  • kubernetes lspdev
lsst-admin@ncsa.illinois.eduResolved
2018-07-19 08:002018-07-19 12:00LSST

Monthly maintenance (July):

  1. Dell firmware updates/reboots
  2. OS package updates/reboots
    1. including upgrades to CentOS 7.5
  3. GPFS client changes and upgrade to 4.2.3.9

  4. GPFS server upgrade to 4.2.3.9

ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters)

The following systems will remain online and unaffected:

  • lsst-daq
  • lsst-l1-*
  • tus-ats01
lsst-admin@ncsa.illinois.edu

Maintenance was successfully completed, although the following resultant issue is being tracked in a separate status event:

DB services on lsst-dev-db are unavailable along with dependent services, including:

  • lspdev
2018-07-16– 9002018-07-16– 1938BluewatersSystem was upgraded for security issues and to migrate to Cuda 9.1Bluewaters compute and scheduler

bwadmin@ncsa.illinois.edu
David King

Bluewaters is now updated
2018-07-09 – 11302018-07-10 – 1700Campus Cluster Monitoring WebpageSET is moving set-analytics to https. This should have been a simple change to a host name, but after the change the new value was not picked up.The monitoring web page gave a loading circle that never resolved to anything.help@campuscluster.illinois.eduSet up a Grafana instance for the display of the Campus Cluster monitoring.
2018-06-282018-07-09NebulaNebula was taken offline to repair the filesystemAll Nebula servicesnebula@ncsa.illinois.eduNebula is performing well now
2018-06-29 -- 1300hrs 2018-07-08 – 1400hrsBlue Waters Nearline Endpoint
Due to very high demand for data retrieval from Nearline, a pause rule is in effect to allow manual task scheduling. You may submit tasks as normal and they will be run as quickly as possible.
Tasks submitted to Globus will start in a paused state but will be released to run, at the earliest possible time, based on resource availability.hpssadmin@ncsa.illinois.eduBacklog of file stages was cleared and endpoint pause rule removed.
1800
2018-07-02
0600
2018-07-06
Access to NPCFFor the July 4th UIUC fireworks show, parking lots E14 and E14-shuttle will be closed from 6:00 p.m. Monday, July 2nd, through 6:00 a.m. Friday, July 6th. No parking will be allowed in these locations at any time during this period.  Please do not park in the NPCF dock area - use the shuttle buses, or park in lot E46 (south on Oak St.).Parking facilities for NPCF
 Parking is back to normal
2018-05-03 14:302018-06-28 09:00iForge gpu queueboth nodes in the general 'gpu' queue were offline due to issues with the GPUsiForge 'gpu' queue could not be used
Tried driver updates and engaged with vendors; ultimately got one node working with 4 M40 GPUs rather than the previous 2 K80 GPUs; continue engaging with vendors to get the other node working but queue is now available.
0800 2018-07-021200 2018-07-02Blue Waters NearlineOne tape library (of four) will be powered down for hardware maintenance (replacement of tape import/export module).Access to tapes in the affected library will be blocked until the system returns to service. Users staging data may see delays in accessing data until the library is back online.hpssadmin@ncsa.illinois.eduWork was completed with some delay (scheduled to complete by 0930) due to a failed SD card (used for storing and loading library geometry).
2018-06-27 9:002018-06-27 1:00LSST - k8s lspdevkub001 unplanned reboot and kub004 ran out of memory.lspdev JupyterHublsst-admin@ncsa.illinois.edu

Nodes/Services rebooted.

Kubernetes pods restarted.

2018-06-27 08:302018-06-27 11:49SlackSlack is reporting connectivity issues on their status page (https://status.slack.com/)Slack feedback@slack.com Slack reports, "workspaces should be able to connect again"

2018-06-23
19:44

2018-06-23
19:59

Blue Waters Scratch FilesystemTop of Rack network switch died in rack 8. Cray was onsite, performed a workaround, and will replace the switch Monday. Sonexion rack 28 lost mind and was rebooted.Partial scratch outage of ost169-179bw-admin@illinois.edu
tbouvet
Bypassed faulty switch; rack 28 Sonexion rebooted. Faulty switch replaced Monday the 25th.
2018-06-21 -- 1200hrs 2018-06-23 -- 1045hrsBlue Waters Nearline Endpoint
Due to very high demand for data retrieval from Nearline, a pause rule is in effect to allow manual task scheduling. You may submit tasks as normal and they will be run as quickly as possible.
Tasks submitted to Globus will start in a paused state but will be released to run, at the earliest possible time, based on resource availability.hpssadmin@ncsa.Illinois.eduMany tasks were pushed through the system by manually ordering them to reduce tape drive competition. Endpoint pause rule removed and all tasks resumed.
2018-06-21 08:002018-06-21 09:35LSST

Monthly maintenance (June):

  • pfSense firewall update
  • OS package updates/reboots for CentOS 6.9 servers (lsst-web, lsst-xfer, lsst-nagios)
  • Slurm update (lsst-dev01, lsst-verify-worker*)
  • Update host firewalls on GPFS servers
  • iDRAC configuration updates on lsst-dev01 and ESXi hosts

CentOS 6.9 servers:

  • lsst-web
  • lsst-xfer
  • lsst-nagios

Slurm/verification cluster

Other impact was not expected, but unexpected issues could have led to connectivity issues for other hosts or downtime for lsst-dev01 or hosted VMs

lsst-sysadm@ncsa.illinois.eduMaintenance was completed
2018-06-20 14:002018-06-20 19:00Campus ClusterRolling reboot of the core IO servers to move GPFS from 4.2.3.8 to 4.2.3.9 for CentOS 7.5 support; No downtime occurredSuccessful Upgradeset@ncsa.illinois.eduCluster now supports CentOS 7.5 clients
2018-06-182018-06-20 7pmNebulaNebula was shut down to fix broken filesystems.All Nebula servicesnebula@ncsa.illinois.eduNebula is up and running again. Please contact nebula@ncsa.illinois.edu if you still see issues.
2018-06-19 08:002018-06-19 12:00LSST L1 Test Stand

Scheduled Maintenance:

  • BIOS firmware updates
  • Puppet and firewall changes (including support of SAL unicast/multicast traffic)
  • OS package updates (staying with CentOS 7.4)

Level One Test Stand, including:

  • lsst-daq
  • lsst-l1-*
lsst-sysadm@ncsa.illinois.edu Maintenance completed successfully
2018-06-18 07:002018-06-18 09:30vSphere & Various VMsTwo of our hosts went down with network interface errors.Multiple VMs hosted on those nodes (incl. Fileserver, ncsa-print, and subversion)help+its@ncsa.illinois.eduBoth hosts are back online as well as all VMs
2017-06-16 22:18:322017-06-17 08:10:00

cforge

PBSPro server was hung on cfsched

Job scheduling and job submission were failing.Jim Longrestarted PBSPro server on cfsched
2018-06-15 1330hrs2018-06-15 1530hrsBlue Waters NearlineReplacement of a tape robot transporterThis work is not expected to impact operations. The library system will continue to operate with a single transporter but mount times may be somewhat longer until the second unit is returned to service.
hpssadmin@ncsa.illinois.edu
2018-06-12 04:3010:00Blue WatersThunderstorms have resulted in a power interruption. This outage impacts both the compute nodes and all filesystems. Therefore, a full reboot will be necessary.Return to service is estimated to be approximately 10 am Central time.Blue Waters in total
Full reboot
2018-06-12 ~03:452018-06-12 ~06:00Campus ClusterMany compute nodes rebooted. No system on UPS was affected, and some compute nodes remained up. Facilities at ACB report that there were no power events this morning or last night, but this seems the most likely cause.Many compute nodes, but not all. Jobs on the nodes that rebooted were lost.help@campuscluster.illinois.eduNodes rebooted at a similar time, and many returned in a state unsuitable to run jobs. Rebooting in smaller groups got everything working again.
2018-06-12 ~03:402018-06-12 ~06:30iForge

A storm caused a brief power event which impacted:

  • big_mem queue
  • skylake queue
All nodes in the big_mem and skylake queues were rebooted by the power event.
Nodes rebooted on their own and were marked back online in the scheduler by around ~6:30am.
2018-06-12 ~03:402018-06-12 09:00LSST

Storm caused power event which impacted:

  • Kubernetes Commons / lsst-lspdev
  • 75% of verification cluster compute / Slurm


The following nodes rebooted because of the power event:

  • all kub* nodes (causing outage of Kubernetes Commons / lsst-lspdev)
  • 75% of verify-worker* nodes (partial outage of Slurm / verification cluster compute nodes)
lsst-sysadm@ncsa.illinois.edu
  • verify-worker nodes were put back online in Slurm around 06:10
  • Kubernetes Commons resumed service by around 09:00
2018-06-11 08:302018-06-11 8:35Campus Cluster ADSVlan changes on campus clustercampus cluster - Active data storage (ADS)help+neteng@ncsa.illinois.eduMaintenance completed successfully
2018-06-07 06:302018-06-07 14:00Blue WatersThe boot node crashed, requiring the system to be rebooted. File system and ESLogins remained up.All running jobs were lost, and no new jobs were started until the system was returned to service. Torque was updated to ver. 6.1.2.
bw-admin@ncsa.illinois.edu
2018-06-01 00:502018-06-01 03:50Blue Waters/var space filled up by additional logging in Moab to troubleshoot job slide issue.PBS server went down due to no space in /varbw-admin@ncsa.illinois.eduZipped and moved old Moab logs to lustre file system to free up /var space, then restarted PBS server.
2018-05-31 14:002018-05-31 14:10NCSA Open SourceRetirement of both HipChat and FishEye/CrucibleServices will be shutdown and archived.
Services are disabled and will be archived in a month.

2018-05-31 08:00


2018-05-31 11:55NCSA ITS vSphere vCenterITS vSphere vCenter server will be upgraded to the latest VMware vCenter 6.7. All VMs will remain online during the maintenance, but management through vCenter will be offline during the upgrade.help+its@ncsa.illinois.eduSuccessful upgrade to VMware vCenter 6.7.
2018-05-23 06:552018-05-24, 1900hrsCampus Cluster File SystemA failure of both disk array controllers serving the CC file systems resulted in abrupt loss of access to the underlying storage. One array controller was identified as broken while the storage system was brought back up on the remaining controller for inspection and analysis. A thorough check of the file systems and storage devices was started. At 1100hrs May 24th the replacement array controller arrived and was installed. After further testing to assure system stability, the file systems were brought back online and released to the cluster admins.All campus cluster file systems
Normal cluster operations were resumed. Investigation into the root cause is ongoing with the cooperation of the system manufacturer.
2018-05-21DNS1/2There were a few reports of intermittent DNS lookup failures/slowness
Firewall state table resources were being exhausted. Limits for those state tables have been increased. This appears to have resolved the problem. help+neteng@ncsa.illinois.edu   No further reports of the issue after making the adjustment.
2018-05-24 10:55am

2018-06-24 11:08am

ifsm.ncsa.illinois.edu System is being upgraded and rebootedNo services should be affectedyum upgrade and reboot

2018-05-17 8:002018-05-17 15:00NPCF-Core-EastThe hardware and firmware on the core east router was upgradedTraffic rerouted through npcf-core-west during the maintenance window. There was an unexpected outage for about 10 mins which impacted network connectivity throughout NCSA.
Upgrade on core-east was completed successfully. No further network outages are expected.
2018-05-09 7:002018-05-09 17:40dns1.ncsa.illinois.eduEnabling BIND on ipv6 and enabling a firewall on the serverNo impact is expected.
Maintenance was completed.
2018-05-17
08:00
2018-05-17
13:30
LSST

Monthly maintenance (May):

  • GPFS server & client updates, plus nosuid mounting
  • Physical firewall changes in NPCF for new vLANs
  • BIOS firmware updates
  • OS updates
  • Update of puppet-stdlibs module
All systems (except lsst-daq, lsst-l1-*, & tus-ats01) were unavailable for maintenance.

Maintenance was extended until 13:30 and then completed.

External Grafana monitoring (monitor-ncsa.lsst.org) was offline until 14:25 due to storage rebuild on lsst-monitor01.

2018-05-17 10:132018-05-17 10:18Core OutageDuring core router maintenance the incorrect core router was powered off.Network connectivity across NCSA was affected.
The core router was powered back on, verified and brought back into service.
2018-05-16 08:00


2018-05-16 17:40Campus Cluster

Monthly maintenance (May)

  • GPFS upgrade to 4.2.3.8
  • FW upgrade on Juniper switches
  • OS updates
  • Add 4 more 40G cables for ccioe nodes for redundancy
Entire system was unavailable for maintenance.
Maintenance complete, all tasks complete.
2018-05-16  1100hrs2018-05-16 1300hrsADSPlanned Campus Cluster network upgrades also impacted access to ADSAll ADS storage exports became unreachable
Eric has notified us that the networking maintenance is complete and ADS customers are able to access their storage again.
21 Mar 201814 May 2018openxdmod.ncsa.illinois.eduAn update to Torque broke the updates of XDMoD. openxdmod.ncsa.illinois.edu was offline while the system it resided on was updated, all the dependency software was installed, and the latest version of XDMoD was installed. Then all the data had to be re-imported.Software updating
Service restored with updated software.
2018-05-08 00002018-05-09 0015NCSA Storage Condo
One node ran out of memory, causing a deadlock in GPFS. During deadlock recovery, GPFS shut down on multiple nodes. Upon restart of the cluster, a different metadata server had a check on its PCI bus, forcing another unmount. All file systems but one were recovered. While recovering the last one, one of the Roger NetApp storage arrays started throwing errors, requiring a power cycle of the controller and disks, prompting a final recovery of the last file system.
Condo file systems and services.
All file systems recovered and services restored.
2018-05-08 07:002018-05-08 07:40iForgeQuarterly Maintenance ( 20180508 Maintenance for iForge )All systems were unavailable during the maintenance.
Planned maintenance completed successfully
2018-05-08 8:002018-05-09 8:00NPCF-Core-WestThe hardware and firmware on the core router will be upgradedTraffic will be rerouted through npcf-core-east during the maintenance window. No impact is expected.
The hardware and firmware was upgraded on npcf-core-west without incident. Traffic has been successfully failed back.
2018-05-03 08:452018-05-03 10:15NCSA WIKI, JIRA, services that rely on NCSA LDAPA large number of connections from two particular servers were hitting LDAP, causing a slowdown that in turn caused timeouts for various applications using LDAP authentication. Blocking the culprit servers remedied the situation.NCSA WIKI, NCSA JIRA, other applications that rely on NCSA LDAP authentication.
Culprit servers were blocked
9:00am9:25amsyslog-sec.ncsa.illinois.eduout of cycle patching of Security Syslog collectors to address CVE-2018-1000140Load balance fail over to secondary collector, RELP will be buffered.

relay-01 was updated and loadbalancer failed back.


4/25 14:004/25 15:00MREN WAN CircuitWAN circuit testing.Traffic will be re-routed over an alternate peering during the test period.
The MREN circuit was brought back in to production.
2018-04-24 12:302018-04-24 16:00NCSA jabber servicejabber was down while we repaired its authorization configuration.jabber.ncsa.illinois.edu wasn't accepting jabber logins
jabber working again.
2018-04-24 09:102018-04-24 09:50LSSTIncreased LDAP timeout to 60 seconds in sssd.conf to fix problems with long login times and failure to start batch jobskub*, verify-worker*

sssd.conf updated, sssd restarted

verify-worker nodes were drained during the change

Affected nodes may have slow LDAP response times for a short while (due to the local cache needing to be rebuilt)
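
For illustration only, a minimal sketch of the kind of sssd.conf change described in the entry above. The entry does not record which option was changed, so the option name (ldap_search_timeout), the domain name (example), and the sample file contents are all assumptions.

```python
import configparser
import io

# Hypothetical sssd.conf contents. The domain name "example" and the
# option "ldap_search_timeout" are assumptions, not taken from the log.
SAMPLE = """\
[sssd]
services = nss, pam
domains = example

[domain/example]
id_provider = ldap
ldap_search_timeout = 6
"""


def set_ldap_timeout(conf_text: str, domain: str, seconds: int) -> str:
    """Return conf_text with the assumed LDAP timeout option set to `seconds`."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)

    section = f"domain/{domain}"
    if not parser.has_section(section):
        raise KeyError(f"no [{section}] section in configuration")

    parser.set(section, "ldap_search_timeout", str(seconds))

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = set_ldap_timeout(SAMPLE, "example", 60)
print(updated)
```

As the entry notes, sssd has to be restarted after such a change, and lookups may be slow while the local cache is rebuilt.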

04/18/2018 10:3004/18/2018 11:30ICCP April MaintenanceReplaced 4x10G links from cc-core0 to carne. Updated BIOS on remaining parts of Cluster nodes.No outage.
Completed without any outage.
04/18/2018 10:3004/18/2018 11:30ICCP core switchesOne of the 4x10G links from cc-core0 to carne had incrementing errors and has been administratively down to prevent those errors from affecting traffic. There was a scratched fiber that earlier diagnosis had revealed, so we replaced the fiber during this ICCP PM.Nothing, all traffic rerouted through cc-core1
The errors are still incrementing, but we've narrowed down the remaining options for what might be going on.
4/12 09304/12 1830ADS NFS/SambaThe ESXi Hypervisor server had an error on it: 'A PCI error requiring a reboot has occurred.'.ADS NFS/Samba/Gridftp
The server was rebooted, the error cleared and all systems/services were restarted.
4/11 03:00 p.m.4/11 03:15 p.m.NetactNetact code was updated. Going forward new office activation names will have "-ofc" appended to them.No service impact to Netact.
Change was successfully implemented. Netact remained in service during and after the change.
4/11 9:004/11 10:00LSST NPCF FirewallPrimary firewall will be upgraded to use FRR instead of openBGP.No impact is expected.  The firewalls do not need to be failed over and no interruption in traffic flow is anticipated.
Firewall was successfully migrated.  No downtime occurred.

4/10

17:00

4/10

18:00

dns1.ncsa.illinois.eduOS Patching and BIND updatesdns1 (secondary DNS server) will be rebooted to apply patches. DNS2 will remain up.
DNS1 OS patching is completed. BIND was upgraded to 9.11. BIND is currently bound only to its IPv4 interface.

4/10

15:00

4/10

16:00

dns2.ncsa.illinois.eduOS Patching and BIND updatesdns2 (secondary DNS server) will be rebooted to apply patches. DNS1 will remain up. An IPv6 address will also be added to system in preparation for a broader IPv6 DNS rollout.
DNS2 OS was patched. BIND was upgraded to 9.11. An IPv6 address was also enabled on the server and BIND is listening on that address.

4/04/2018

16:00

4/04/2018

17:00

MREN WAN CircuitPort MoveTraffic will be re-routed over an alternate peering during the maintenance.
 The port was moved and the circuit was brought back into service without issue.
04/04/2018
16:17:00
04/04/2018
16:42:00
LDAPLDAP process crashedAuthentication to LDAP-backed services
LDAP was upgraded and restarted


3/29/2018

17:00

MREN WAN CircuitWAN circuit testing.Traffic will be re-routed over an alternate peering during the test period.
Testing was completed and the circuit was brought back into service.
2018-03-21 08:002018-03-21 17:30Campus Cluster management server and compute nodes except DES and MWT2Deploying new management server, upgrading to Torque 6.1.2 and Moab 9.1.2. BIOS update. Configuration changes on GPFS servers. Tech Services CARNE code upgrade.Scheduler down. User access disabled

New management server is up with CentOS 7. Installed Torque 6.1.2 and Moab 9.1.2. BIOS updates are done on most nodes. Configuration changes on GPFS done. Tech Services CARNE code upgrade done.

2018-03-16 1:00pm2018-03-16 5:45pmISDA + NCSA OpenSource

Security patches of VM servers as well as backend filesystem

Updates of Bamboo, JIRA, Confluence, BitBucket and CROWD 

All systems will be unavailable for a brief period of time.

During updates of OpenSource services part of OpenSource will be offline for up to an hour.


Updated fileserver (brief struggle with zfs and kernel updates). Updated proxmox servers, JIRA, Confluence, BitBucket and CROWD. Bamboo will be done later this weekend.
2018-03-12
9:00am
2018-03-12
5:00pm

Nebula Openstack cluster

Security and filesystem patchesAll instances and Nebula services were unavailable
Filesystem updates and security patches were applied. Filesystem is more responsive, but ~20 instances are repairing from problems that occurred before the outage.
2018-03-15
12:20
2018-03-15 16:20LSST

Lingering issues on select nodes following March PM

  • lsst-qserv-master01 - cannot mount local /qserv volume
  • lsst-xfer - issue w/ sshd
  • lsst-dts - issue w/ sshd
  • lsst-l1-cl-dmcs - unknown issue
  • lsst7 - issue w/ sshd

Following resolved by 13:23:

  • lsst-qserv-master01
  • lsst-xfer
  • lsst-dts
  • lsst-l1-cl-dmcs

Resolved by 16:20:

  • lsst7
2018-03-15
08:00
2018-03-15
12:20
LSST

March maintenance:

  • GPFS server updates and configuration of additional NFS/Samba services
  • Urgent Firmware updates
  • Increase size of /tmp on lsst-dev01
  • Hardware maintenance/memory increases on select servers/VMs
  • Release of refactored Puppet code for NCSA 3003 servers
  • OS updates
  • Recabling servers in NCSA 3003 to new switches
All systems were unavailable for maintenance.
Completed and most systems back online. Lingering issues for lsst-qserv-master01, lsst7, lsst-xfer, lsst-dts, and lsst-l1-cl-dmcs are being tracked in a separate status event.

2018-03-14

12:00

2018-03-14

12:35

Remote Access VPNAn issue with authentication for the VPN has occurred.Any new connections will not be established. Existing connections are unaffected.
Authentication services were restored.
2018-03-09 10:08am2018-03-09 11:00amCampus ClusterAccording to IBM, cc-mgmt1 was responsible for halting communication across the cluster during the GPFS snapshot process.Users could not log in or access the filesystem.
Rebooted cc-mgmt1 and restarted services (RM & Scheduler).
2018-03-09 06:052018-03-09 08:00public-linux, www.ncsa.illinois.edu, & events.ncsa.illinois.eduA routine kernel upgrade resulted in failure of the OpenAFS client on these servers.OpenAFS storage was unavailable on these servers, resulting in the website failures.
Resolved. Packages were updated and OpenAFS reinstalled.
2018-03-07 15:002018-03-07 16:10LSSTqserv-db12 had one failed drive in the OS mirror replaced but the other was presenting errors as well so the RAID could not rebuild. The Qserv system would have been unavailable during this maintenance.qserv-db12
The node was taken down for replacement of the 2nd disk, to rebuild the RAID in the OS volume, and to reinstall the OS.

2018-03-07

14:00

2018-03-07

14:40

ESnet PeeringThe connection servicing our direct peering with ESnet will be moved during this window.Connections will be rerouted over a redundant peering. No service impact is expected.
The connection was successfully migrated and the peering with ESnet was brought back into service without issue.

2016-03-06

0100

2016-03-06

1040

WAN Connectivity DegradedThe router servicing several of our WAN connections is currently in a degraded state.Traffic has been gracefully rerouted. No user facing connectivity issues have been reported.
Graceful failover to the backup routing engine cleared a fault condition and affected peerings were re-established.
2018-02-27 07:152018-02-27 09:10Campus Cluster schedulerScheduler became unresponsiveJob submission & starting new jobs
Rebooted the node, restarted RM & Scheduler.
2018-02-26 06:002018-02-27 01:35All Blue Waters ServicesSecurity Patch CLE, SU26 Lustre patchAll Blue Waters resources are unavailable
Blue Waters returned to service at 1:35AM 27th Feb, with HPSS returned earlier at 10PM 26th Feb.
2018-02-23 16:302018-02-23 16:30Kerberos Admin serviceKDC configuration was modified to allow creation of service principals that can create and modify host and service principals.kadmin service was unavailable for 1 second while new config was read.

We can now delegate to groups or users the ability to create and manage host keys and service principals.

2018-02-23 08:002018-02-23 09:00LSST Puppet ChangesRolled out significant changes to the logic and organization of the Puppet resources in NCSA 3003 data center in order to standardize between LSST Puppet environments at NCSA. We had done extensive testing and did not expect any outages or disruption of services.

No interruption of services.

Changes being applied to: lsst-dev01, lsst-dev-db, lsst-web, lsst-xfer, lsst-dts, lsst-demo, L1 test stand, DBB test stand, elastic test stand.


Updated successfully with no interruption of availability or services.
2018-02-21 13:302018-02-22 00:39ESnet 100G Peering DownThere is a suspected fiber cut between Urbana and Peoria on ICCN optical equipment. Our 100G direct WAN path to ESnet rides over this optical path and is thus currently down. The fiber vendor has identified the source of the problem (high water caused the fiber to be pulled out of a splice case)Nothing. All traffic destined for ESnet or resources that would normally take the ESnet WAN path will reroute through our other WAN paths
Repaired.
2018-02-21 08:002018-02-21 20:00Campus Cluster

Campus Cluster February Maintenance

  1. Applying security patches & OFED upgrade
  2. Testing/tuning metadata performance
  3. Troubleshooting/upgrading code on cc-core switches
 All systems were unavailable

Partially completed; the following items are rescheduled for the next maintenance.

  1. Deploying new scheduler (due to a system stability issue)
  2. Upgrading Torque 6.1.2 and Moab 9.1.2 (not enough time for testing after release)
  3. Maintenance on CARNE router (bug in the code)
2018-02-05 10:452018-02-21 13:30ICCP Networking - Outbound

A hardware failure on one of the two core switches for ICCP caused that switch to enter a degraded service mode and eventually fail completely. This was also combined with software bugs that caused looping of packets between the two cores in the MC-LAG. The other core was still functioning properly and was providing connectivity for all ICCP/ADS/DES systems normally for the duration of the degraded service time period. A hardware replacement RMA was initiated. The hardware came in but the hardware alone did not fix the issue. We then waited until an ICCP PM where we could test things without interruption of service, and we upgraded the code and put in some bug mitigation configuration changes. These things combined solved the issues.

Nothing as far as production. During the period where cc-core0 was down, aggregate bandwidth outbound was 40Gbps instead of the normal 80Gbps.
As of now the cores are both in production and stable.
2018-02-16 12:002018-02-16 12:30IPSEC VPNThe appliance servicing various IPSEC VPN connections was patched.Nothing
Patch was successful utilizing the failover capability of the VPN cluster to mitigate any service interruptions
2018-02-15 08:002018-02-15 13:00LSST

February maintenance:

  • Updating GPFS mounts to access new storage appliance
  • Rewire 2 PDUs at NCSA 3003
  • Switch stack configuration changes at NCSA 3003
  • Routine system updates
  • Firewall maintenance NPCF
  • Updates to system monitoring
 All systems were unavailable.
Completed and all systems back online.
2018-02-13 08:002018-02-06 09:00Certificate System Firewall 2Upgrade software to current production version. No interruptions to service expected CA services
FW upgraded - services were interrupted due to failed routing service.
2018-02-13 06:002018-02-13 06:30AnyConnect VPNPatches are being applied to the AnyConnect VPN applianceAccess to the NCSA AnyConnect VPN will be unavailable.
The VPN has been patched and client connections have been re-established.
2018-02-10 02:002018-02-10 10:35Campus ClusterGPFS snapshot hung and locked the filesystemAll systems were inaccessible. Lost running jobs.
Gathered information for IBM, bounced the filesystem, and rebooted the cluster
2018-02-06 07:002018-02-06 17:35iForgeQuarterly Maintenance ( 20180206 Maintenance for iForge )All Systems were unavailable during the maintenance.
Planned maintenance completed successfully
2018-02-06 08:002018-02-06 09:00Certificate System Firewall 1Upgrade software to current production version. It is expected that current connections will be interrupted and a retry will be required.
  • cilogon.org
  • idp.ncsa.illinois.edu
  • idp.xsede.org
  • NCSA TFCA Myproxy
  • XSEDE Myproxy

  • Completed
2018-02-01 16:302018-02-01 16:45sslvpn.ncsa.illinois.eduWe are rebooting our VPN appliances to mitigate a critical security vulnerability that allows for remote code execution exploits. That vulnerability is described here: https://tools.cisco.com/security/center/content/CiscoSecurityAdvisory/cisco-sa-20180129-asa1 Certain Industry partners' site-to-site VPNs
VPN rebooted without incident. Service was restored at 4:34PM.
2018-02-01 16:302018-02-01 16:45vpn.ncsa.illinois.eduWe are rebooting our VPN appliances to mitigate a critical security vulnerability that allows for remote code execution exploits. That vulnerability is described here: https://tools.cisco.com/security/center/content/CiscoSecurityAdvisory/cisco-sa-20180129-asa1 Certain Industry partners' site-to-site VPNs and the NCSA remote access VPN service will be down during the maintenance. Any users connected to the NCSA VPN at the time of the maintenance will lose connectivity.
VPN rebooted without incident. Service was restored at 4:34PM.
2018-01-29 10:052018-01-29 10:10LSST verify worker nodes and lsst-devA network flap on the LSST network caused GPFS ejection of some nodes. Network and security is investigatinga few of the LSST nodes for 2-5 minutes and 2 jobs
Qualys scan time frame changed and investigation continues.
2018-01-29 12:272018-01-29 12:31NCSA Jabber serviceJabber service was restarted to install a new SSL certificate.NCSA Jabber was down momentarily
NCSA Jabber restarted with new SSL certificate
2018-01-26 13:002018-01-26 13:15LSST NFS service slowdownA cron for Lenovo system cleanup was run, and caused the Lenovo box to slow down services. The NFS service was starved.lsst-dev NFS showed stale mounts
cron deleted, and re-written.
Wed 1/24/2018 13:35Wed 1/24/2018 14:55LSST NFS serviceWe were notified by NCSA security team that there was a stale NFS mount on one of the LSST test nodes. NFS services stopped workingAll NFS mounts for LSST systems such as lsst-demo and lsst-SUI were not working
NFS server was rebooted.
Tue 1/23/2018 23:00Wed 1/24/2018 01:25Condo storage servicesHit a known bug in GPFS 4.2.0.4 for quota management.All Condo services from 11pm to 1:25 am
Need to upgrade to a newer level of GPFS, but for now we have lowered frequency of the check_fileset_inodes script
2018-01-22 07:002018-01-22 13:05Blue Waters Compute NodesBlue Waters compute nodes were bounced to resolve issues caused by previous home file system outage (due to bad OST)Compute nodes were down, scheduler was paused.
Compute nodes were bounced successfully and returned to full service.
2018-01-21 08:422018-01-21 11:30Netsec-vc switch stack - FPC 4Switch member 4 of the netsec switch stack was down. Severe filesystem corruption occurred on the primary partition.Any hosts connected to member 4 of that switch that were not redundantly connected to other switches in the stack.
The switch was repaired by doing a full reformat/reinstall of JunOS. Everything is back into production.
2018-01-20 22:002018-01-21 03:00Condo file systemsBringing the Roger disk into the condo, commands executed from the Roger GPFS servers caused the cluster to arbitrate for GPFS servers.All condo file systems mounted on nodes.
The SSH configuration was changed on the Roger GPFS servers to include the Condo GPFS server IPs. All file systems were returned to normal with no other problems and no remounts required.
2018-01-18 17:002018-01-19 15:00ISDA Hypervisors, NCSA Open SourceHypervisor updates.All systems were down for short periods as hypervisors rebooted
All patches applied.
2018-01-18 00:002018-01-18 24:00Campus ClusterCopying all data to new filesystem.  Deploying new Storage (14K).  Dividing cluster into two (IB & Ethernet).  Upgrading GPFS to 4.2.3.6.  Deploying new management node and new image server (if time permits).  Applying security patches to compute nodes (no FW update at this time).All systems unavailable.
New Storage System was brought online, additional capacity and performance was added.

2018-01-18 18:40

2018-01-18 23:00

LSSTLSST Firewall outage in NPCF. Both pfSense firewalls were accidentally powered off.

PDAC (Qserv & SUI) and verification clusters were inaccessible, as well as introducing GPFS issues across many services, e.g. lsst-dev01.


The pfSense firewall appliances were power cycled and services restored.
2018-01-18 12:582018-01-18 14:10Code42 Crashplan backup systemCode42 Crashplan servers were upgraded to latest JDK and Code42 6.5.2.Clients were unable to perform restores or push files into backup archive from roughly 13:35 - 13:55
Code42 servers are now running latest security updates to the crashplan service.
2018-01-18 08:002018-01-18 10:00LSSTMonthly OS updates, network switch updates, firmware updates, etc.All dev systems unavailable. Qserv and SUI nodes will remain available.
COMPLETE
2018-01-17 10:352018-01-17 13:00RSA Authentication Manager ServersUpgraded to Authentication Manager 8.1sp1p7No systems should have seen any impact
Latest security patches are applied.
2018-01-12 06:002018-01-12 10:00Decommission NCSA Rocket.chatThe old NCSA Rocket.chat service was shutdown.

Any archived conversations or content are no longer available to users.



NCSA Rocket.chat service was shutdown and redirected to NCSA @ Illinois Slack .
Friday, Jan 12th,  0000-0600 CSTInternet2 Engineers from Internet2 will be migrating our BGP peering with I2's Commercial Peering Service (CPS) to a new location. Small disruptions may occur with the maintenance for the CPS service, but no user traffic disruptions should occur.None. Alternative routes are present.none
Maintenance was completed successfully.
2018-01-11 08:002018-01-11 13:30LSSTCritical patches on lsst-dev systems (incl. kernel updates).All systems unavailable.

 COMPLETE

Thursday, Jan 11th, 0000 CSTThursday, Jan 11th, 0400 CST

Connectivity to Internet2 and backup LHCONE peerings - ICCP and MWT2 respectively

Engineers from Internet2 performed maintenance that affected certain BGP peerings that exist on the device that is ICCP/MWT2's upstream router, CARNE. Specifically, both the 100G Internet2 peering and the Internet2 LHCONE peering on CARNE were disrupted during this timeframe. MWT2 currently gets to LHCONE through CARNE's ESnet peering, which was fully functional. They also were able to get to UChicago through CARNE's OmniPoP 100G peering. As for ICCP, traffic to/from Internet2 based routes rerouted through the ICCN. Neither ICCP nor MWT2 reported any service impact from this maintenance.
Successful maintenance was completed.
2018-01-08 10:472018-01-08 11:30NebulaStorage nodes lost networkingAll nebula instances
Storage nodes were brought back online, instances were rebooted
2018-01-02
09:00
2018-01-05 17:00NebulaNebula was shut down for hardware and software maintenance from January 2nd, 2018 at 9am until January 5th, 2018 at 5pm. Spectre and Meltdown patches were applied, as well as all firmware updates, OS/distribution updates, and the filesystem was upgraded.All systems were unavailable.
Faster system that is now homogeneous, so OpenStack upgrades are now possible.
2018-01-04 17:002018-01-05 20:00Blue WatersThree drives failed simultaneously on one OST hosting the home file system.Portions of the home file system (with data on the affected OST) were not accessible.

Repair work was carried out on the failed OST. The scheduler continued to operate but started only jobs not affected by the failed OST.

Full operation resumed after successful recovery of the failed OST.

2017-12-20 08:002017-12-20 10:00LSST(1) Firewall maintenance (08:00-09:00) and (2) migration of NFS services (08:00-10:00).

Firewall maintenance: There should be no noticeable effect but scope of service includes most systems at NPCF (including PDAC, SUI, and Slurm/batch/verify nodes).

Migration of NFS services: SUI and lsst-demo* nodes.


Maintenance completed without issues.
2017-12-14 06:002017-12-14 20:30LSSTMonthly OS updates, network switch updates, firmware updates, etc.All systems unavailable.

All systems back online.
We ran into issues with the policy based routing on the LSST aggregate switches in NPCF that caused the outage to be extended longer than planned.

2017-12-13 09:002017-12-13 11:00JIRA UpgradeUpgraded JIRA to version 7.6 from 7.0NCSA Jira
Successfully upgraded
2017-12-13 06:302017-12-13 07:39NCSA JabberAttempted to upgrade Openfire XMPP jabber software.NCSA Jabber was unavailable during the upgrade.
The upgrade failed. Jabber is available, but still running the old version. The upgrade will be rescheduled.
2017-12-11 10:002017-12-11 16:00Unused AFS fileservers were upgraded to 1.6.22After moving all volumes to servers updated on 2017-12-07, the now unused AFS servers were upgraded to OpenAFS 1.6.22.No impact to other systems as they were unused at the time they were upgraded.
All of NCSA's AFS cell is running on OpenAFS 1.6.22
2017-12-09 03:002017-12-09 07:42BlueWaters PortalThe BlueWaters portal software crashed. Automated monitoring processes did not restart it correctly.The BlueWaters portal website was unavailable.
The BlueWaters portal service was manually restarted and the website is available.
2017-12-09 1000hrs2017-12-09 1400hrsGlobus Online (Globus.org) Please be advised that the Globus service will be unavailable on Saturday, December 9, 2017, between 10:00am and 2:00pm CST while we conduct scheduled upgrades. Active file transfers will be suspended during this time and they will resume when the Globus service is restored. Users trying to access the service at globus.org (or on your institution's branded Globus website) will see a maintenance page until the service is restored.
All NCSA Globus endpoints.

2017-12-072017-12-07Unused AFS file servers were upgraded to 1.6.22Three unused AFS fileservers were upgraded to the latest 1.6.22 release of OpenAFSNo impact to other systems as they were unused.
These AFS fileservers can no longer be crashed by malicious clients.
2017-12-072017-12-07AFS database servers were upgraded to 1.6.22The three database servers were upgraded to the latest 1.6.22 release of OpenAFSNo modern clients noticed the staggered updates.
These servers can no longer be crashed by malicious clients.
2017-12-05 16:002017-12-05 16:20dhcp.ncsa.illinois.eduNCSA Neteng will be migrating the DHCP server VM to Security team's VMware infrastructure.

- Hosts on the NCSAnet wireless network might be impacted.
- Any activated hosts that might be on the roaming range might be impacted.
+ Illinoisnet and Illinois_Guest wireless will be available at ALL times.
+ Wired network connection will be available throughout the maintenance window.


Maintenance was completed successfully and services are running as expected.
2017-12-02 09:302017-12-02 11:45NCSA opensourceUpgrade of Bamboo, JIRA, Confluence, BitBucket, FishEye, and CrowdSub services of opensource can be down for a short time.
All services upgraded and running as normal.
2017-11-20 18:21

2017-11-29 14:30

ROGER OpenStack clusterI/O issues highlighted that GPFS CES NFS servers probably shouldn't run 400+ days without rebootROGER's OpenStack and the various services which were hosted therein, including JupyterHub Server
Reboot of all nodes, including CES servers, as well as the reboot of all hypervisors, cleared most of the problems (with the fallout being that one node required fsck and a second reboot, and another node/hypervisor is still unavailable). I/O contention was felt as many instances were simultaneously attempting to start/restart. Instances that were housed on the unavailable node are being migrated to another hypervisor.
2017-11-21 9:00

2017-11-22 14:00

Open Source

ISDA servers

Update the fileserver that hosts VM's
all the XEN servers.

NCSA Open Source unavailable
Most of ISDA servers unavailable


 


Network issues delayed updates
All hosts updated and everything back to normal.

2017-11-21 16:002017-11-21 16:40Code42 CrashplanThe Code42 crashplan infrastructure was upgraded to version 6.5.1 to apply security and performance improvementsClients transparently reconnected to servers after they restarted
Now running on Code42 version 6.5.1
2017-11-20 9:002017-11-20 16:38Nebula Openstack clusterNebula OpenStack cluster was unavailable for emergency hardware maintenance. A failing RAID controller from one of the storage nodes and a network switch were replaced.Not all instances were impacted. Running Nebula instances that were affected by the outage were shut down, then restarted again after we finished maintenance.

Nebula is available.
No additional maintenance is needed for Tuesday, November 21.

2017-11-16 16:462017-11-20 12:40NCSA JIRAJIRA wasn't importing some email requests properly after the NCSA MySQL restart.Some email sent to JIRA via help+ addresses wasn't being imported.
JIRA is now accepting email and all email sent while it was broken has now been imported as expected.
2017-11-16
08:30
2017-11-16
13:30

BW LDAP Master

(Blue Waters)

Scheduled maintenanceUpdated LDAP lustre quotas to bytes and added archive quotas. IDDS will track and drive quota changes with acctd.
Production continued w/o interruption. BW LDAP master was isolated, lustre quotas changed to bytes with the addition of archive quotas. Replicas pulled updates w/o error.
2017-11-16 14:302017-11-16 16:52Internal website (MIS Savanah)A database table used by MIS tools became corrupted.The website would become unresponsive every time the corrupted database table was accessed.
OS kernel and packages were updated during debugging. The MIS database table was restored and the website came back online.
2017-11-16 16:462017-11-16 16:48NCSA MySQLThe NCSA MySQL server had to be restarted in order to delete the corrupted table used by MIS.All services that use MySQL were down during the outage. This includes: Confluence, JIRA, RT, and lots of websites
MySQL was restarted successfully.
2017-11-16 08002017-11-16 1200LSSTMonthly OS updates, plus first round of Puppet technical debt changes (upgrading to best design & coding practices)

All systems unavailable from 0800 - 1000 hrs.

GPFS unavailable from 0800 - 1000 hrs.

PDAC systems unavailable from 0800 - 1200 hrs.


Completed. OS kernel and package updates. Slurm upgrade to 17.02.

2017-11-15 13:302017-11-15 15:10RSA Authentication ManagerRSA Authentication Manager was patched to fix cross-site scripting vulnerabilities and other issuesNothing was affected by the update
RSA Authentication Manager is running 8.2 SP1 P6. Process worked as expected.
2017-11-15 - 13:302017-11-15 - 14:30BW 10.5 Firewall Upgrade Part 2The normal active, "A" unit, NCSA BW 10.5 Firewall will be upgraded and then normal fail-over status will be re-enabled.The possibility of connection resets when the A unit comes back from being upgraded and state is being synced.
Completed, process worked as expected.
2017-11-14 11:272017-11-14 11:33LDAPLDAP was unresponsive to requests.Several services hung while authentication was unavailable.
LDAP services were killed and restarted.
2017-11-05
02:15
2017-11-06
17:11
ROGER Hadoop/Ambaricg-hm12 and cg-hm13 took minor disk failures which crashed the nodeAmbari was effectively off-line
rebooted node, and node ran fsck as part of its startup sequence, node booted properly
2017-10-31
17:22

2017-11-03
17:00

ROGER hadoop/ambarihard drive failures on cg-hm10 and cg-hm17certain ambari services and HDFS
cg-hm17 returned to service after power cycle and reboot, cg-hm10's hard drive didn't respond to a reboot
2017-11-11 16:582017-11-11 19:09Blue WatersWater leak from XDP4-8 causing high temperature to c12-7 and c14-7. EPO on c12-7 and c14-7.
Scheduler was paused to place system reservations on compute nodes in affected cabinets, then resumed.
2017-11-10 14:002017-11-10 14:45NCSA Open SourceUpgrade of the following software: Bamboo, JIRA, Confluence, and BitBucketUpdates will happen in place and will result in minimal downtime of components.
completed, minimal interruption of service
2017-11-10 - 08:002017-11-10 - 08:30CA Firewall Upgrade - B unitthe stand-by, "B" unit, NCSA Certificate Service Firewall will be upgraded to same version as A unit.Expect no impact to services
 completed, no interruption of service

2017-11-08

16:30

2017-11-08

17:30

Netdotnetdot.ncsa.illinois.edu was migrated to Security's VMware infrastructure.During the downtime users weren't able to activate or deactivate their network connections via Netact.
Migrated successfully. Netdot is up and running.
2017-11-08 06:002017-11-08 15:00ITS vSphere vCenterITS vSphere was upgraded to the latest version of VMware vCenter. New access restrictions were also put into place.All VMs remained online during the maintenance, but management through vCenter was offline during the upgrade.
Upgrade completed successfully.
2017-11-08 09:302017-11-08 10:00BW 10.5 Firewall Upgrade Part 1the stand-by, "B" unit, NCSA BW 10.5 Firewall will be upgraded and then traffic redirected through it for load testing before the "A" unit is upgradedExpect no impact to services
Upgrade completed successfully. Some states were reset when traffic switched to the B unit.
2017-11-07 7:002017-11-07 18:37iForge

quarterly maintenance

Update OS image.
Update GPFS to version 4.2.3-5
Redistribute power drops.
Update TORQUE.
BIOS updates.

iForge (and associated clusters)

All production systems are back in service

2017-11-07 - 13:302017-11-07 - 15:00CA Firewall Upgrade Part 2The normal active, "A" unit, NCSA Certificate Service Firewall will be upgraded and then normal fail-over status will be re-enabled.The possibility of connection resets when the A unit comes back from being upgraded and state is being synced.
Completed upgrade
2017-11-06 15:282017-11-06 15:53Blue WatersEPO happened to c12-7 and c14-7.HSN quiesced.
Scheduler was paused to place system reservations on compute nodes in affected cabinets, then resumed.
2017-11-03 16:212017-11-03 16:32LDAPLDAP was unresponsive to requests.Several services hung while authentication was unavailable.
LDAP services were killed and restarted.
2017-11-02 09:002017-11-02 16:00LSSTLSST had a GPFS server that was down and had failed over to the other server for NFS.The GPFS clients failed over automatically, and we manually failed over the NFS in the morning.
NFS exports were moved to an independent server. IBM was at NCSA and is continuing to debug the problems.
2017-10-31 17:112017-11-01 11:13LSSTGPFS degraded/outage

most NCSA-hosted LSST resources experienced degraded GPFS performance

hosts with native mounts (PDAC) experienced an outage


A deadlock at 17:11 yesterday temporarily caused slow performance. Then one GPFS server went offline at 18:21 and services failed over. NFS mounts (qserv/sui) were reported as hanging by a user at 09:12 today but may have been degraded over night. Affected nodes were rebooted and NFS mounts recovered by 11:13. IBM is onsite diagnosing issues with the GPFS system and ordering repairs (including a network card on one server).
2017-10-31 15:302017-10-31 16:00LSSTGPFS outage

most NCSA-hosted LSST resources

native mounts (e.g., lsst-dev01, verify-worker*) and NFS mounts (e.g., PDAC)


All disks in the GPFS storage system went offline temporarily and came back online by themselves. NFS services were restarted. Client nodes all recovered their mounts on their own. Logs have been sent to the vendor for analysis.
2017-10-31 - 13:302017-10-31 - 14:30CA Firewall Upgrade Part 1the stand-by, "B" unit, NCSA Certificate Service Firewall will be upgraded and then traffic redirected through it for load testing before the "A" unit is upgradedExpect no impact to services
Upgrade completed successfully. Some states were reset when traffic switched to the B unit.
2017-10-30 18:362017-10-31 00:46LSSTGPFS outage

most NCSA-hosted LSST resources

native mounts (e.g., lsst-dev01, verify-worker*) and NFS mounts (e.g., PDAC)


GPFS servers were rebooted. lsst-dev01 and most of the qserv-db nodes were also rebooted. Native GPFS and NFS mounts were recovered. May have been (unintentionally) caused by user processes but will continue to investigate.
2017-10-25 22:002017-10-26 11:20LSSTfull/partial GPFS outage

full outage for GPFS during 22:00 hour on 2017-10-25

outage for NFS sharing of GPFS (for qserv, sui) continued through the night

full outage for GPFS recurred 2017-10-26 around 08:44


All GPFS services and mounts have been restored.
2017-10-26 09:042017-10-26 09:04Various buildings across campus, including NPCF and NCSAIssue with an Ameren line from Mahomet caused a bump/drop/surge in power that lasted 2msLSST had approximately 20 servers at both NPCF and NCSA buildings reboot
Was a momentary issue with minimal effect to most systems
2017-10-26 00:002017-10-26 08:00ICCPgpfs_scratch01 was filled by a very active userAdditional space in scratch wasn't available
Out-of-cadence purge was run to free 2TB, user's jobs held in scheduler; user contacted
2017-10-25 06:002017-10-25 14:05Blue WatersSecurity Patching of CVE-2017-1000253 security vulnerability.Restricted access to logins, scheduler and compute nodes. HPSS and IE nodes are not affected.
System was patched. Login hosts were made available at 9am. The full system was returned to service at 14:05.
2017-10-24 09:502017-10-24 20:10LSSTNetwork outage / GPFS outage

All LSST nodes from NCSA 3003 (e.g., lsst-dev01/lsst-dev7) and NPCF (verify-worker, PDAC) that connect to GPFS (as GPFS or NFS) lost their connections.

All LSST nodes at NPCF lost network during network stack troubleshooting and replacement of 3rd bad switch.


A 3rd bad switch was discovered and replaced. All nodes have network and GPFS connectivity once again.
2017-10-23 08:002017-10-24 05:00Campus ClusterCampus Cluster October maintenance.Total outage of the cluster.
Replaced core Ethernet switches in the shared services pod. Ran new Ethernet cables for the shared services pod. Moved the DES rack from the shared services pod to the Ethernet-only pod. Deployed a new image with patches.
2017-10-21 17:15

2017-10-23 17:45

LSSTFirst one then two public/protected network switches went down in racks N76, O76 at NPCFMostly qserv-db[11-20] and verify-worker[25-48]; there was also shorter outage for qserv-master01, qserv-dax01, qserv-db[01-10], all of SUI, and the rest of the verify-worker nodes.
Two temporary replacement switches were swapped in. Maintenance and/or longer-term replacement switches are being procured for the original switches.
2017-10-18 13:002017-10-18 14:00NetworkingReplaced a linecard in one of our core switches due to hardware failure.Any downstream switches were routed through the other core switch.
All work was completed successfully.
2017-10-19 08:002017-10-19 21:30LSSTOutage and migration of qserv-master01: provisioning of new hardware, copying of data from old server to new.qserv-master01 (and any services that depend on qserv-master01, which may include services provided by qserv-db*, qserv-dax01, and sui*)

UPDATE (2017-10-19 15:15) OS install took much longer than anticipated, completed at 15:00. Data sync is started. Extending outage till 22:00.

Completed

2017-10-19 08:002017-10-19 12:00LSSTRoutine patching and reboots, pfSense firmware updates (NPCF), Dell server firmware updates (NPCF).All NCSA-hosted resources except for Nebula.
Maintenance completed successfully. (qserv-master migration is ongoing, see separate status entry)
2017-10-18 14:452017-10-18 15:35Campus ClusterRestart of resource manager failed after removing all block array jobs.Job submission
Opened case with Adaptive (#25796). Found more array jobs and bad jobs in jobs directories. Removed all of those.
2017-10-15 08:152017-10-15 08:30Open SourceEmergency upgrade of Atlassian Bamboo.Bamboo will be down for a few minutes during this outage window.
Bamboo upgraded to the latest version.
2017-10-14 22:152017-10-14 23:35Campus ClusterScheduler crashJob submission
Opened case with Adaptive, ran diag and uploaded the output along with the core file. Restarted Moab.
2017-10-14 13:002017-10-14 15:23Campus ClusterResource manager crashJob submission
Applied patch from Adaptive, which helps with faster recovery. Suspended/blocked all current and new array jobs until we have a resolution.
2017-10-06 09:002017-10-11 01:00NebulaGluster and network issues

1) Gluster sync issues continue from 2017-10-05's Nebula incident.
2) At approximately 2017-10-06 16:10, a Nebula networking issue (unrelated to the Gluster issues) occurred resulting in host network drops within the Nebula infrastructure. This internal networking incident resulted in additional gluster and iscsi issues.
Many instances are broken because iSCSI is broken from the Nebula network issues. And any instances that were broken because of gluster are still broken.


All instances have been restarted and are in a state for admins to run. Some mounted file systems might require an fsck to verify. If there are other issues please send a ticket.

As the file system continues to heal we may see slower interaction.

2017-10-10 16:302017-10-10 19:10Campus ClusterResource manager crashJob submission
After removing problematic jobs from the queue, we were able to restart the RM. Opened a case with Adaptive and forwarded those job scripts and core files.
2017-10-05 14:002017-10-05 17:00NebulaGluster sync issuesOne of the gluster storage servers within Nebula had to be restarted.
Approximately 100 VM instances experienced IO issues and were restarted.
2017-10-06 08:002017-10-06 17:00NCSA direct peering with ESnet

A fiber cut between Peoria and Bloomington caused our ESnet direct peering to go down.

All traffic that would have taken the ESnet peering rerouted through our other WAN peers. As such there were no reported outages of connectivity to resources that users would normally access via this peering
The fiber cut has been repaired and the peering has been re-established.
2017-10-06 08:002017-10-06 10:00LSSTKernel and package updates to address various security vulnerabilities, including the PIE kernel vulnerability described in CVE-2017-1000253. This will involve an upgrade to CentOS 7.4 and updates to GPFS client software on relevant nodes.All NCSA-hosted LSST resources except for Nebula (incl. LSST-Dev, PDAC, and verification/batch nodes) will be patched and rebooted.
Maintenance completed successfully. Pending updates to a couple of management nodes (adm01 and repos01) and one Slurm node that is draining (verify-worker11).
2017-10-4 07:402017-10-4 09:55Campus ClusterResource Manager crashJob submission
Failure on initial restart attempt. After looking through the core, decided to try a restart again without any change. This time it worked.
2017-10-03 13:002017-10-03 19:00Campus ClusterResource Manager crashJob submission
After removing ~30 problematic jobs from the queue, we were able to restart the RM. Opened a case with Adaptive and forwarded those job scripts and core files.
2017-09-21 02:572017-09-21 09:40Storage server (AFS, iSCSI, web, etc)

The parchment storage server stopped responding on the network.


  • Several websites were down, including the following: www.ncsa.illinois.edu, cybergis.illinois.edu, nationaldataservice.org, etc
  • iSCSI storage mounted to fileserver went offline.
  • Several AFS volumes, including some users' home directories were offline.

Replaced optical transceiver on the machine and networking restarted. Also updated kernel and AFS.
2017-09-20 08:002017-09-20 13:45Campus ClusterSeptember MaintenanceTotal cluster outage
Maintenance completed successfully.
2017-09-20 08:002017-09-20 11:30 NCSA Storage Condo
Normal maintenance --Firmware upgrade on Netapps so new disk trays could be attached for DSILtotal file system outage
The quarterly maintenance was completed
2017-09-18 11:202017-09-18 13:30Active Data StorageRAID Failure in NSD server and disk failure on secondary NSD server.ADS service was unavailable
Recovered RAID configuration on NSD server and replaced failed disk on secondary NSD. ADS restored.
2017-09-15 06:202017-09-15 09:28public-linuxOpenAFS storage was not running or mounted after rebooting to a new kernel.AFS storage was not available from this server
Reinstalled the dkms-openafs package and restarted the openafs-client. AFS is now working as expected.
2017-09-10 09:452017-09-10 11:30NCSA Open SourceUpgrade of Bamboo, JIRA, Confluence, BitBucket, FishEye, CrowdDuring the upgrade the services will be unavailable for a short amount of time.
All services upgraded successfully.
2017-08-31 11:072017-08-31 11:11NCSA LDAPNCSA LDAP TimeoutsNCSA LDAP was overloaded and timing out. Users were not able to authenticate via NCSA LDAP during that time.
NCSA LDAP stopped timing out at 11:11 am and authentication resumed.
2017-08-28 11:552017-08-28 12:59NCSA GitLabNCSA GitLab server ran out of disk space for the OSThe web interface at https://git.ncsa.illinois.edu wasn't working
Web interface is now working. Space freed up by clearing CrashPlan caches.
2017-08-24 13:002017-08-24 14:30netact.ncsa.illinois.eduTransient config issues from some system patching caused apache to not be able to start on the netact serverNetwork Activation 
The issues were fixed and Network Activation is working again
2017-08-24 08:002017-08-24 15:30LSSTRack upgrades in NCSA 3003Most LSST Developer services offline during upgrade
All LSST systems are back online with new racks and switches
2017-08-24 08:002017-08-24 09:30LSSTmonthly maintenance for NPCF (includes patching to address CESA-2017:1789 and CESA-2017:1793)adm01, backup01, bastion01, monitor01, object*, qserv*, sui*, verify-worker*, test0*
Maintenance was successfully completed.
2017-08-23 09:212017-08-23 16:50aForge/iForgegpfs failed during an upgrade of GPFS on the iforge storage nodes.  There was an IB hiccup at the time, but causality is unclearall jobs on iforge were aborted, gpfs clients needed to be upgraded, all gpfs client nodes were rebooted
iForge went production shortly before 5:12pm.  aforge went "production" at ~1630
2017-08-22 20:002017-08-22 30:00Patching DHCP servicePatching OS and services on DHCP1.Will need to reboot DHCP server a few times during this process. During that time DHCP will be unavailable. This is during the evening so I don't expect any direct issues from this.
Patching has been completed.
2017-08-16 08:002017-08-16 16:00Campus ClusterAugust MaintenanceScheduler and resource manager down
Upgraded Moab 9.1.1 and Torque 6.1.1.
2017-08-16 08:002017-08-16 09:15NoneReplace Line Card in Core SwitchI believe all systems connected to this switch, are multihomed and will not experience an outage.
The line card has been successfully replaced.
2017-08-16 00:302017-08-16 02:30Blue WatersTwo cabinets (c10 & c11) had EPO due to XDP control valve failure.Scheduler was paused to isolate failing parts, resumed at 2:09.
Parts replaced and cabinets were returned to service.
2017-08-08 7:002017-08-09 3:00iforge/cfdforge/aforge

Update OS image to RH 6.9

Update GPFS to version 4.2.3-2

Redistribute power drops

All four clusters were updated.

All items on checklist completed.

20170808 Maintenance for iforge

2017-08-03 06:452017-08-03 07:35NCSA Jabber upgradeUpgraded Openfire XMPP jabber softwareNCSA Jabber was unavailable during the upgrade.
Jabber was upgraded to the latest version of Openfire
2017-07-28 17:002017-07-31 evening
Update - All of the production data has been migrated except for the largest object table. That is loading now, then the user space will be loaded. Should all hopefully be done by this evening. Migration of operational database to new hardware happening during the weekend. DES old operational database
Migration completed successfully. Other maintenance tasks that give DES additional disk space were also done, along with some performance improvements.
2017-07-27 11:002017-07-28 15:00netact.ncsa.illinois.edu

 The netact.ncsa.illinois.edu network activation server VM needed to be restored from backup

Network Activation service
The service has been fully restored
2017-07-25 02:362017-07-25 18:00Campus Cluster / Scheduler downBlip on mgmt1 causing GPFS drop and scheduler to crashScheduler offline
Still taking long time for Scheduler to initialize but jobs can start and run as usual. Opened case with Adaptive.
2017-07-20 09:002017-07-20 17:00ROGER Ambari and OpenStackUpdates to openstack control node and the Ambari clusterAmbari nodes (cg-hm08 - cg-hm18), OpenStack instances and servers
Openstack was back in service on time. Ambari had issues mounting HDFS and was held out of service. HDFS was remounted on 25 July.
2017-07-20 06:002017-07-20 10:00All NCSA hosted LSST resourcesMonthly OS patches (addressing issues including CESA-2017:1615 and CESA-2017:1680 ). Roll-out updated puppet modules. Batch nodes updated firmware.All nodes in NCSA 3003 and NPCF (batch nodes) will reboot.
Overall success. Exceptions: verify-worker31 failed a firmware update and is out of commission (LSST-914) and there are connectivity issues for some VMs used by the NCSA DM team (IHS-365). adm01, backup01, and test[09-10] will be patched in the near future.
2017-07-19 08:002017-07-19 14:44Campus ClusterJuly Maintenance (applied security patch)Cluster wide, except mwt2 nodes
Applied new kernel, glibc, bind patches and newest NVIDIA driver.
2017-06-29
1800
2017-06-30 0000Blue WatersEmergency maintenance to apply security patch addressing Stack Guard security vulnerability.Compute, Login, Scheduler are offline.
Kernel and glibc library patched on all affected systems.
2017-06-22 08002017-06-22 1200All NCSA hosted LSST resourcesCRITICAL kernel and package updates to address Stack Guard Page security vulnerability.

Systems will be patched and rebooted.


Outage was extended to last past 1000 until 1200. Systems were successfully patched as planned except for qserv-db12 and qserv-db27, which will not boot. We will follow up on those with a ticket.
2017-06-22 08002017-06-22 0930LSST cluster nodes (verify-worker*, qserv*, sui*, bastion01, test*, backup01)Deploy Unbound (local caching DNS resolver)DNS resolving may have a short (~30 mins) delay.  
Successfully deployed and all tests (including reverse DNS and intra-cluster SSH) pass.
2017-06-20
0930
2017-06-20
1100
Blue WatersXDP shutting down caused EPO on cabinets c1-7 and c2-7.Scheduler was paused to isolate the failing components, then resumed.
Warm-swapped the failing components and returned them to service.

2017-06-20

0900

2017-06-20

1000

NCSA Open Source

Security upgrade needed for Bamboo, will also update the following components: Bamboo, JIRA, Confluence, BitBucket, FishEye

Most of the subcomponents of NCSA opensource will be down for a short time when the software is updated.
Upgraded Bamboo, JIRA, Confluence, BitBucket, FishEye to latest versions

2017-06-16

0900

2017-06-16

1100

ROGER Openstack nfs backend failed and was restartedThe primary CES server for the openstack backend failed and tried to fail over to the secondary server, which also failed. SET was notified and they had the CES nfs service back up by 1100The ROGER openstack dashboard went down and needed a restart. Several VMs experienced "virtual drive errors" and will need to be restarted
SET is still investigating the cause of the GPFS CES service failover. CyberGIS is working with their users to get the affected VMs restarted
2017-06-15 08002017-06-15 0930LSST cluster nodes (verify-worker*, qserv*, sui*, bastion01, test*, backup01)Deploy unboundDNS resolving may have a short (~30 mins) delay.

Updates deployed successfully via new puppet module. All tests passed.

EDIT 2017-06-15 1500 - Reverse DNS not working, which broke ssh to qserv* nodes. Disabled unbound.

6/14/2017

8:00 a.m.

6/14/2017

10:00 p.m.

Network Core SwitchNetwork Engineering will be replacing a line card in one of our Core switches due to hardware issue.All services should remain active. Any affected switch will have a second redundant link to the other core to pass traffic.
Line card was successfully replaced.
2017-06-08 12:002017-06-11 22:20Campus Cluster (scheduler paused)Disk Enclosure 3 failure on DDN 10K.Lost redundancy, which forced us to drain the cluster.
Repair/replacement for controller can be time consuming so we took action to rebalance data out of failed enclosure. Scheduler was resumed as of 22:00.

2017-06-07 12:07

2017-06-07 12:42NCSA LDAPThe NCSA LDAP service crashedNCSA LDAP service was unavailable
LDAP software and OS were updated and server rebooted. LDAP is working normally.
2017-05-31 20:062017-05-31 20:36NCSA LDAPThe NCSA LDAP service was timing outNCSA LDAP service was unavailable
The root cause of LDAP timeouts is still being investigated.
2017-05-222017-05-26Campus Cluster VMsNetwork issue on ESXi (hypervisor) boxes after maintenanceNo longer able to log in to start VMs. License Server, nagios, all MWT2 VMs were down

The issue was fixed on 5/24. Restored license and Nagios service on 5/24. Moved MWT2 VMs to Campus Farm. All VMs returned to service as of noon 5/26.

5/12/20175/18/2017Condo/NFS partitions onlythe NFS partition for the condo became extremely unstable after a replication (normal daily maintenance) was completed. Many iterations with FSCK and IBM on the phone got it resolved, and then 1.5 days restoring files that had been put in Lost and found.UofI library was switched to the READONLY version on the ADS during this time
The root cause is still being investigated.
2017-05-23 14:052017-05-23 14:13NCSA LDAPThe NCSA LDAP service was timing outNCSA LDAP service was unavailable
The issue is still being investigated, but seems to be steadily available since the incident.
2017-05-22 15:412017-05-22 15:51idp.ncsa.illinois.edu
oa4mp.ncsa.illinois.edu
Apache Tomcat out of memoryInCommon/SAML IdP and OIDC authentication services were unavailable.
Service restored by failing over to secondary server while memory is being increased on primary server.
05/20/2017 21:09

05/20/2017 23:37

DES nodes on Campus ClusterCould not communicate outside the switchAll nodes connected to switch in POD22 Rack2 @ACB
Upgrading the code on the switch resolved the issue.
05/20/2017 05:0005/20/2017 21:09Campus Cluster and Active Data Storage (ADS)Total power outage at ACBAll systems currently reside at ACB

Power was restored around 13:00hrs. We rotated the ADS rack to align with the Campus Cluster Storage Rack. Changed a couple of VLAN IDs to reflect campus for future merger. ESXi boxes are down due to a configuration error after reboot. No major issue from output of FSCK from scratch02.

05/17/2017 02:0005/17/2017 10:45Internet2 WAN connectivityIntermittent WAN connectivity. The outage was a result of Tech Services' DWDM system, which provides us with our physical optical path up to Chicago via the ICCN. Specifically, the Adva card that our 100G wave is on was seeing strange errors, which was causing input framing errors for traffic coming in on this interface.General WAN connectivity to XSEDE sites, certain commodity routes, and other I2 AL2S connections.
The Adva card was rebooted and we stopped seeing the input framing errors. Tech Services is working with Adva to find the root cause of the issues on the card.
5/11/20175/12/2017ESnet 100G connectionNCSA and ESnet will be moving their 100G connection to a different location in Chicago.We have several diverse high speed paths to ESnet and DOE, traffic will be redirected to a secondary path.

2017-05-11
06:45
2017-05-11
07:33
NCSA Jabber upgradeUpgraded Openfire XMPP jabber softwareNCSA Jabber was unavailable during the upgrade.
Jabber was upgraded to the latest version of Openfire

2017-05-09

07:00

2017-05-09

18:15

iForge, GPFS, License ServersiForge Planned MaintenanceiForge systems, including the ability to submit/run jobs.
PM was completed early at 1815
2017-05-06 22:002017-05-06 23:00NCSA Open SourceUpgrades of Atlassian softwareNCSA Open Source BitBucket
BitBucket is upgraded.
2017-05-06 09:002017-05-06 10:00NCSA Open SourceUpgrade of Atlassian SoftwareMost services hosted at NCSA Open Source were down for 5 minutes during rolling upgrades.
The following services were upgraded: HipChat, Bamboo, JIRA, Confluence, FishEye and CROWD.
2017-05-05 17:432017-05-05 20:02ITS vSphereA VM node panickedSeveral VMs died when the node panicked and were restarted on other VM nodes. This included LDAP, JIRA, Help/RT, SMTP, Identity, and others.
All affected VMs were restarted on other VM nodes. Most restarted automatically.
2017-04-27 18:102017-04-27 18:55Campus ClusterAnother GPFS interruptionBoth Resource Manager and Scheduler went down along with a handful of compute nodes.
Restarted the RM and Scheduler and rebooted all down nodes.
2017-04-27 13:112017-04-27 14:20Nebulaglusterfs crashed due to this bug, so no instances could access their filesystemsAll instances running on Nebula
Needed to reboot the node that systems were mounting from, but took the opportunity to upgrade all gluster clients on other systems while waiting for a reboot. Version 3.10.1 fixes the bug. All instances with errors in their logs were restarted.
2017-04-27 11:202017-04-27 12:45Campus ClusterGPFS interruptionBoth Resource Manager and Scheduler went down.
Torque serverdb file was corrupted. Restored the file from this morning's snapshot and modified the data to match the current state.
2017-04-26 12:002017-04-26 18:30CondoA bug in the delete of a disk partition from GPFS. a problem within GPFSDES, Condo partitions, and UofI Library.
Partitions had been up for 274 days, with many changes. The delete partition bug caused us to stop ALL operations on the condo and repair each disk through GPFS. Must have quarterly maintenance. Just too complicated to go a year without resetting things.
2017-04-19 16:542017-04-20 08:45gpfs01, iforge

Filled-up metadata disks on I/O servers caused failures on gpfs01.

iforge clusters, including all currently running jobs.

Scheduling on iForge was paused for the duration of the incident. Running jobs were killed. 13% metadata space was freed. Clusters were rebooted and scheduling resumed.

2017-04-19 08:002017-04-19 13:00Campus ClusterMerging xpacc data and /usr/local back to data01 (April PM)Resource manager and Scheduler were unavailable during the maintenance.
Once again, /usr/local, /projects/xpacc and /home/<xpacc users> are mounting from data01. No more split cluster.
2017-04-04 (1330)2017-04-04 (1600)NetworkingSome fiber cuts caused a routing loop inside one of the campus ISP's network.Certain traffic that traversed this ISP would never make the final destination. Some DNS lookups would have also failed.
Campus was able to route around the problem, and the ISP also corrected their internal problem. The cut fiber was restored last night.
2017-03-28 (0000)2017-03-29 (1600)LSSTNPCF Chilled Water OutageLSST - Slurm cluster nodes will be offline during the outage. All other LSST systems are expected to remain operational.
No issues. Slurm nodes restarted.
2017-03-28 (0000)2017-03-29 (0230)Blue WatersNPCF Chilled Water OutageFull system shutdown on Blue Waters (except Sonexion which is needed for fsck)
FSCK done on all lustre file systems, XDP piping work done (no leakage found), software updates (PE, darshan) completed.
2017-03-25
10:15PM
2017-03-26
00:08AM
Blue WatersBW scratch MDT failover, df hangsBW scratch MDT failover, load on mds was 500+ delayed failover. Post FO had some issues that delayed RTS.
scheduler was paused
2017-03-25
4pm
2017-03-25
8pm
Blue WatersBW login node ps hangrebooted h1-h3, lost bw/h2ologin DNS record, had neteng recreate the record. Had to rotate login in and out of round-robins until all rebooted. User email sent (2).
Login nodes rebooted
DNS round-robin changes
2017-03-23 (1000)2017-03-23 (1500)NebulaNCSA Nebula OutageNebula will take an outage to balance and build a more stable setup for the file system. This will require a pause of all instances, and Horizon being unavailable.
File system online and stable. At this time all blocks were balanced and healed.
2017-03-16 (0630)2017-03-16 (1130)LSSTLSST monthly maintenanceGPFS filesystems will go offline for entire duration of outages. Some systems may be rebooted, especially those that mount one or more of the GPFS filesystems.

2017-03-15
15:11 
2017-03-15
16:01 
Blue WatersFailure on cabinet c9-7, affecting HSN.Filesystem hung for several minutes.
Scheduler was paused for 50 minutes.
Warmswap cabinet c9-7.
Nodes on c9-7 are reserved for further diagnosis.  
2017-03-15 09:002017-03-15 12:47Campus ClusterUPS work at ACB.Reshuffling electrical drops on 10k controllers, storage IB switches and some servers.Scheduler will be paused for regular jobs. MWT2 and DES will continue run on their nodes.
UPS work at ACB - incomplete (required additional parts).
Redistributing power work done.
Scheduler was paused for 3hrs 50 mins.
2017-03-10 13:002017-03-10 18:00Campus ClusterICCP - We lost 10K controllers due to some type of power disturbance at ACB.ICCP - Lost all filesystems, resulting in a cluster-wide outage.
Recovered missing LUNs and rebooted the cluster. Cluster was back in service at 18:00.
2017-03-09 09002017-03-09 1500RogerROGER planned PMbatch, hadoop, data transfer services & Ambari
system out for 6hrs, DT services out until 0000
2017-03-08 19:412017-03-08 22:41Blue WatersXDP powered off that served the four cabinets
(c16-10, c17-10, c18-10, c19-10).
scheduler paused, four rack power cycled.
moab required a restart, too many down nodes
and iterations were stuck.

Scheduler paused
three hours
2017-03-03 17002017-03-03 2200Blue WatersBW hpss emergency outage to clean
up db2 database
ncsa#nearline, stores are failing with cache full
Resolved cache full errors
2017-02-28 12002017-02-28 1250Campus ClusterICC Resource Manager downUsers can't submit new jobs or start new jobs
Removed corrupted job file
2017-02-22 16152017-02-22 1815NebulaNebula Gluster IssuesAll Nebula instances paused while gluster repaired
Nebula is available.
2017-02-11 19002017-02-11 2359NPCFNPCF Power HitBW Lustre was down, xdp heat issues.
RTS 2017-02-11 2359
2017-02-15 08002017-02-15 1800Campus ClusterICC Scheduled PMBatch jobs and login nodes access