status.ncsa.illinois.edu
Watch this page in the wiki to subscribe to automatic updates to this status page.
Please do not refer to any NCSA Industry Partners on this page. Please use the iforge nomenclature for all of the *forge infrastructure.
To see older events, see Archive of NCSA Status Home
Current Status
START | END | What System/Service is affected | What is happening? | What will be affected? | Contact Person | Status |
---|---|---|---|---|---|---|
2019-10-01 | | NCSA-Print & Building Printers | Some printers are having issues connecting to the NCSA Print Server. We are working to find a solution. | Printing | help+its@ncsa.illinois.edu | IN PROGRESS |
7AM | 5PM | Blue Waters | no user jobs | scheduler testing for NGA workload | @kingda | IN PROGRESS |
Report a problem
Upcoming Scheduled Maintenance
Start | End | What System/Service is affected | What is happening? | What will be affected? | Contact Person | Status |
---|---|---|---|---|---|---|
Previous Outages or Maintenance
Start | End | What System/Service was affected? | What happened? | What was affected? | Contact Person | Status |
---|---|---|---|---|---|---|
2019-10-01 10AM | 2019-10-01 12:04PM | Blue Waters | EPO 4 racks | xpd failed cooling, all nodes in racks missing, system will reroute the network. warm swapped racks back into system | @mshow | COMPLETE |
2019-10-01 07:00 | 2019-10-01 07:30 | mysql.ncsa.illinois.edu | MySQL servers needed to be synchronized to convert the server in NPCF back to a replicated host. | Wiki, JIRA, and some web sites stopped working. Email forwarding to user accounts at NCSA was delayed during the outage. | lindsey@ncsa.illinois.edu | COMPLETE |
2019-09-28 05:00 | 2019-09-30 08:00 | Fileserver | Continued maintenance on fileserver after Power Outage. | Shares on fileserver were unavailable (HR, Finance, Home, Swap, etc.) | help+its@ncsa.illinois.edu | COMPLETE |
2019-09-27 17:00 | 2019-09-29 1:20 | Nebula | Power outage in the NCSA building. | All Nebula services will be down | nebula-admin@ncsa.illinois.edu | COMPLETE |
2019-09-28 07:00 | 2019-09-28 11:00 | All Systems in NCSA Building | NCSA Building Scheduled Power Maintenance affecting offices and some ICI services. | F & S has scheduled a 4-hour total power outage for the NCSA building. This is to clean, adjust, inspect, and perform any other preventative maintenance on the main transformers and associated switching gear on the high voltage equipment for the building. | | COMPLETE |
2019-09-27 21:00 | 2019-09-28 13:30 | LSST | Monthly Maintenance: | ALL LSST systems will be updated, including: | lsst-admin@ncsa.illinois.edu | COMPLETE. Oracle is still down but is expected to be returned to service later this afternoon. |
2019-09-28 07:00 | 2019-09-28 07:45 | NCSA DNS1 & DNS2 | Unexpected routing issue during power outage | DNS2 remained unavailable from 7:00AM-7:15AM until the switch in 3003 was upgraded as part of planned maintenance. During the same time, DNS1 unexpectedly became unreachable due to routing/forwarding issues on a different switch stack. Both DNS1 and DNS2 are available now. Neteng is investigating the root cause. | help+neteng@ncsa.illinois.edu | COMPLETE |
09/19/2019 | | Networking | One of NCSA's core switches crashed last night. The root cause was identified as a software bug on the switch. The bug has been patched and the issue is resolved. | No services were impacted. | help+neteng@ncsa.illinois.edu | COMPLETE |
2019-09-27 07:00 | 2019-09-27 07:30 | mysql.ncsa.illinois.edu | The backup server in NPCF is becoming the primary server in preparation for the weekend power outage | Wiki, JIRA, and some web sites will stop working. Email forwarding to user accounts at NCSA will be delayed during the outage. | lindsey@ncsa.illinois.edu | COMPLETE |
2019-09-26 13:22 | 2019-09-26 13:30 | mysql.ncsa.illinois.edu | The primary server was accidentally powered off | Wiki, JIRA, and many web sites stopped working. | help+its@ncsa.illinois.edu | RESOLVED |
2019-09-26 07:00 | 2019-09-26 07:10 | mysql.ncsa.illinois.edu | First stage of switchover to NPCF in preparation for weekend power outage | MySQL was briefly unavailable (~1 min), wiki was unavailable for ~10 min. | lindsey@ncsa.illinois.edu | COMPLETE |
2019-09-03 08:00 | 2019-09-03 9:52 | NCSA Jabber | Decommission NCSA Jabber | The NCSA Jabber service was permanently shut down. Users should migrate to Slack, Skype, or Microsoft Teams. | help+its@ncsa.illinois.edu | COMPLETE |
2019-08-29 06:00 | 2019-08-29 06:05 | NCSA VPN | The NCSA VPN will be cut over to the new LDAP replica servers | No outage is expected. Existing IPSEC and AnyConnect sessions will not be disrupted. There may be a brief disruption for new AnyConnect sessions during the window. | help+neteng@ncsa.illinois.edu | The cut over has been completed with no outage. |
2019-08-15 06:00 | 2019-08-15 06:45 | NCSA VPN | The VPN was upgraded. | All IPSEC tunnels failed over to the standby unit without an interruption to service. Any clients using the AnyConnect service will need to re-connect. | neteng@ncsa.illinois.edu | Upgrade is complete. |
2019-08-13 9:30 | 2019-08-13 10:00 | Code42 Crashplan | Service was updated for a critical security issue | Crashplan service was unavailable for a few minutes during the 30-minute window. | help+its@ncsa.illinois.edu | Upgrade is complete and all services are back online. |
2019-08-07 15:00 | 2019-08-07 15:56 | Code42 Crashplan NPCF Datastore FileserverB | OS and software stack was updated on underlying server | The server supporting the NPCF Code42 backup location is available again. Backup Fileserver B is available again | help+its@ncsa.illinois.edu | Upgrades complete. Performance significantly improved. |
2019-08-06 07:00 | 2019-08-06 19:20 | iForge | Quarterly Maintenance | All systems were unavailable during the maintenance | iforge-admin@ncsa.illinois.edu | Complete. |
2019-07-24 12:00 p.m. | 2019-07-24 12:10 p.m. | Wireless | Full wireless reboot at NCSA. | All wireless (NCSAnet, IllinoisNet, eduroam) will be rebooted. Expect 5-10 mins for the entire building to reboot. | help+neteng@ncsa.illinois.edu | Completed: Full reboot corrected the problem. |
2019-07-18 08:00 | 2019-07-18 10:15 | LSST | Monthly Maintenance: | ALL LSST systems will be updated, including: | lsst-admin@ncsa.illinois.edu | COMPLETE with the following exceptions: |
2019-07-11 05:00 | 2019-07-11 07:00 | NCSAnet Wireless | Tech services performed a software upgrade on the wireless access points at NCSA and NPCF building. | Brief interruption in wireless network connectivity at NPCF and NCSA. Wired networking remained available during the maintenance. | help+neteng@ncsa.illinois.edu | Maintenance was completed by tech services. Wireless connectivity has been restored. If users notice any wireless issues, please contact neteng. |
2019-07-08 05:17:41 | 2019-07-08 06:38:00 | iforge login node | Lost GPFS a couple of minutes after qualys scans started on the iforge cluster | iforge login node | iforge-admin@ncsa.illinois.edu | remount failed. system needed a reboot. Resolved. |
2019-07-05 1PM | 2019-07-05 6PM | Blue Waters Compute | NPCF Power hit, BW compute requires a full reboot. | Blue Waters compute | | Resolved |
2019-06-30 ~16:15 | 2019-06-30 ~16:15 | iForge | Power sags on all feeds at the NPCF datacenter caused reboots of several skylake compute nodes. | iforge[129-136] rebooted | iforge-admin@ncsa.illinois.edu | Resolved |
Jun 28th, 2019, 9:00 AM | Jun 28th, 2019, 12:15 PM | Globus NCSA_Nearline Endpoint (HPSS) | Emergency maintenance to restore the system | Nearline storage systems | James Glasgow | Resolved |
2019-06-27 11:04 | 2019-06-27 11:25 | iForge | GPFS restarted on main head node (iforge.ncsa.illinois.edu) | On iforge.ncsa.illinois.edu only: | iforge-admin@ncsa.illinois.edu | Resolved |
2019-06-23 06:00 | 2019-06-23 20:00 | Facility power and cooling/ all production areas will be impacted | Commissioning new protective relays: 30-minute rolling power outage. Sonexion re-power: 12 hours work duration. | Expect down time of 2 hours minimum for all production areas. Blue Waters down time between 10-12 hours. | rantissi@illinois.edu | Completed |
2019-06-23 0700 | 2019-06-23 16:55 | iForge/aForge | Power maintenance in NPCF | All systems were unavailable during the maintenance. | iforge-admin@ncsa.illinois.edu | |
2019-06-23 0700 | 2019-06-23 1400 | DNS updates / Changes | DNS will be split into views (Internal / External). External requests will no longer be able to perform lookups on *.internal.ncsa.edu or reverse lookups on the 10.0.0.0/8 and 172.24.0.0/13 IP space. | This is a security improvement to prevent lookup of internal DNS records from external clients of NCSA. | neteng+help@ncsa.illinois.edu | Done |
2019-06-23 0700 | 2019-06-23 1400 | LSST | Power maintenance in NPCF | | lsst-admin@ncsa.illinois.edu | Returned to service |
2019-06-18 13:00 | 2019-06-18 13:20 | npcf-core-east | Primary routing engine crashed on one of the core routers in NPCF | No visible impact to users since redundant hardware took over. If users noticed any issues please contact neteng. | help+neteng@ncsa.illinois.edu | Returned to service |
2019-06-18 06:00 | 2019-06-18 07:30 | NCSA ITS vSphere vCenter | Upgraded ITS vSphere vCenter server to latest version | All VMs remained online during the maintenance, but management through vCenter was unavailable. Note: Users of the HTML5 interface may need to clear cached browser data after the upgrade. Clear cookies and cached site data from your web browser. | help+its@ncsa.illinois.edu | COMPLETE Users of the HTML5 interface may need to clear cached browser data after the upgrade. |
2019-06-12 12:45 | 2019-06-12 13:00 | NCSA Fileserver, NCSA-Print, NCSA AD-B | Windows Servers experienced a brief outage; the cause is currently unknown. | Fileserver shares were unavailable. Shared printers from NCSA-Print were unavailable. | | Returned to Service |
2019-06-11 1200 | 2019-06-11 1215 | NAPS | Minor updates being applied to NAPS | NAPS will be unavailable for a very short period from noon to 12:15pm | help+idds@ncsa.illinois.edu | Complete |
2019-06-05 13:20 | 2019-06-05 18:05 | NCSA Fileserver | The storage used by NCSA Fileserver lost networking and is offline. | SMB file sharing is down, e.g.: | help+its@ncsa.illinois.edu | Returned to Service |
2019-06-04 0900 | 2019-06-04 1700 | ISDA VMs including NCSA Open Source | Upgrade of the storage server hosting the VM data, as well as upgrades to VM servers | All ISDA VMs will be down during this time, including NCSA Open Source | | All machines upgraded, including firmware. Systems back online. |
2019-06-03 0830 | 2019-06-03 14:45 | Bluewaters Nearline | One tape library is having robotic issues. It will be unavailable while the vendor fixes the issue. This will mean some files will be unavailable for retrieval while the issue is addressed. Transfers into the system should proceed as usual. | Bluewaters Nearline | Brian Dickinson | Returned to Service |
5/31/19 9:30AM | 5/31/19 11:30 AM | Blue Waters Nearline | Emergency reboot to clear issues with two libraries | Nearline | Brian Dickinson | Returned to Service |
2019-05-28 0940 | 2019-05-28 1530 | BW/HPSS | One tape library is having robot issues. It will be unavailable while the vendor fixes the issue. This will mean some files are unavailable for retrieval. Transfers into the system should proceed as usual. | HPSS ncsa#Nearline | sstevens@illinois.edu | Returned to Service |
2019-05-22 0800 | 2019-05-22 0800 | External Samba/Windows File Sharing (TCP port 445) will be shut off. | Direct external connections to TCP port 445 (Windows File Sharing / Samba) will be turned off on the NCSA Border. | Internal clients will no longer be allowed to connect to external SMB/Windows File Sharing. | neteng+help@ncsa.illinois.edu | Completed. |
2019-05-21 11:00am | 2019-05-22 7:00pm | LSST, K8s clusters | Planned k8s cluster migration. Most Primary services did return around the scheduled time on the 21st. Development services took an extra day to stabilize. | All LSST k8s at NPCF | lsst-admin@ncsa.illinois.edu | COMPLETE Note: Nearly all services have been stabilized. The few remaining impact a very limited number of developers. |
2019-05-22 1403 | 2019-05-22 1450 | iForge | GPFS was accidentally shut down on iforge (the main head node). | on iforge only: GPFS filesystem access, TORQUE (e.g., qsub), system crons | iforge-admin@ncsa.illinois.edu | Node was rebooted and returned to service. |
2019-05-22 1200 | 2019-05-22 1300 | IDDS sybdev databases | PostgreSQL will be upgraded to version 9.6. The outage will likely last about 10 minutes, but we are reserving the hour in case of issues. | All uses of IDDS production databases, except Tableau dashboards | help+idds@ncsa.illinois.edu | PostgreSQL upgraded to 9.6 |
2019-05-21 11:00 | 2019-05-21 11:30 | hub.ncsa.illinois.edu | Upgrade of software | Short outage during upgrades | | Upgraded to latest version |
2019-05-15 1300 | 2019-05-16 17:07 | Blue Waters | Projects has an unavailable OSS and fsck in progress; scheduler paused | Small portion of Projects FS | Timothy Bouvet | Resolved/repaired; 2 user files affected and restored |
2019-05-16 0800 | 2019-05-16 1200 | LSST | Monthly Maintenance: | No interruption of service or downtime. All LSST systems will be updated, including: | lsst-admin@ncsa.illinois.edu | COMPLETE |
2019-05-15 0930 | 2019-05-15 11:30 | ad.ncsa.edu Domain Controllers (ad-a, ad-b), Print Servers (ncsa-print, ncsa-printz) | Applying patches due to Microsoft Remote Desktop vulnerability CVE-2019-0708 | Login to Windows machines on the ad.ncsa.edu domain should not be affected, but may be interrupted for a short time. Printing may be interrupted for a short time. | help+its@ncsa.illinois.edu | Windows Domain Controllers and Print Servers were patched. All services should be functioning normally. |
2019-05-14 08:00 | 2019-05-14 09:00 | CILogon | Update CILogon OAuth2/OIDC service to support public clients and redirect URI schemes other than https://. | https://cilogon.org/oauth2/register updated with new functionality. | help@cilogon.org | Update was completed successfully |
2019-05-10 0930 | 2019-05-10 1010 | NCSA Identity | Upgrade PHP to v7 | Minimal, momentary downtime is expected. | help+its@ncsa.illinois.edu | Upgrade was completed successfully. |
2019-05-08 1000 | 2019-05-08 1100 | IRST SYSLOG collectors | syslog-sec.ncsa.illinois.edu will be changed from a DNS A record to a CNAME pointing to syslog.security.ncsa.illinois.edu | systems that send logs to IRST | help+security@ncsa.illinois.edu | was completed successfully. |
2019-05-07 11:00 | 2019-05-08 10:45 | RSA Authentication Manager and the RSA Self-service Console | The services were upgraded to the latest release to solve a serious security concern | the self service portal was down overnight and users were unable to change their PIN during that outage | otp@ncsa.illinois.edu | now running Authentication Manager 8.4.0.3.0 |
2019-05-07 07:00 | 2019-05-07 20:00 | iForge/aForge | Quarterly Maintenance | All systems were unavailable during the maintenance. | iforge-admin@ncsa.illinois.edu | Maintenance successfully completed |
2019-05-02 14:00 | 2019-05-02 14:30 | LSST | Brief network outage for Oracle (oradb) | Oracle services hosted on oradb[01-03] will be interrupted briefly while the servers are moved to a different network switch | lsst-admin@ncsa.illinois.edu | |
2019-05-02 14:00 | 2019-05-02 14:20 | crashplan | crashplan was upgraded to the latest 6.8.8 release to resolve security issues | backups were suspended while the servers restarted the service | crashplan@ncsa.illinois.edu | Now running Code42 6.8.8 |
2019-04-30 0610 | 2019-04-30 0711 | NCSA Wiki & Jira | The Wiki & Jira servers did not start up as expected after a kernel update. | The NCSA Wiki & Jira were unavailable | | Wiki & Jira now available |
2019-04-30 0900 | 2019-04-30 0930 | CILogon (https://cilogon.org) | New Logout endpoint (https://cilogon.org/logout) and updated ORCID credentials | ORCID users may be asked to re-consent to release of ORCID iD to CILogon. | help@cilogon.org | Service update completed. |
2019-04-23 1200 | 2019-04-23 1700 | CILogon 2.0 AWS (COmanage and LDAP) | On April 23, 2019, the Amazon Web Services (AWS) infrastructure supporting the CILogon COmanage Registry and LDAP services will be modified to increase the high availability (HA) posture. A new network load balancer (NLB) will be introduced and DNS entries modified to point to the new NLB interfaces. The existing NLB interfaces will continue to function for 72 hours after the transition to support any clients that have cached the older (current) DNS mappings. | CILogon COmanage and LDAP | help@cilogon.org | Work completed. |
2019-04-18 0800 | 2019-04-18 1200 | LSST | Monthly Maintenance: | ALL LSST systems, including: | lsst-admin@ncsa.illinois.edu | Maintenance completed. |
2019-04-18 0900 | 2019-04-18 0930 | NCSA Open Source | Upgrade confluence to apply security patch | NCSA Open Source confluence, all other services are unaffected | opensource@ncsa.illinois.edu | Confluence upgraded to 6.15.2 |
2019-04-17 0800 | 2019-04-17 1000 | ADS | ICCP CARNE Maintenance | All services will be down. | | Maintenance completed. |
2019-04-17 07:30 | 2019-04-17 19:45 | ICCP | Quarterly Maintenance | All services unavailable | iccp-admins@campuscluster.illinois.edu | Maintenance completed. |
2019-04-12 1410 | 2019-04-12 1534 | Blue Waters/ Scheduler | HSN issue | Scheduler paused; new login sessions hang | tbouvet@illinois.edu | HSN recovered, scheduling resumed |
2019-04-11 0555 | 2019-04-11 0702 | LDAP | LDAP process crashed | Authentication to LDAP-backed services | help+its@ncsa.illinois.edu | LDAP was restarted |
2019-04-10 0800 | 2019-04-10 1530 | wiki | Wiki was taken offline for a security-related upgrade | Wiki was unavailable | help+its@ncsa.illinois.edu | Now running the latest version of Confluence |
2019-04-09 0900 | 2019-04-09 0930 | CILogon (https://cilogon.org), myproxy.xsede.org, tfca.ncsa.illinois.edu | Deploy new Luna SA HSM (hsm5) to production and take one old HSM (hsm4) offline (to serve as emergency backup). | No downtime is expected. Use instructions at SafeNet LunsaSA HSM Monthly Testing to change pool of available HSMs on {warm,cool,tepid}.ncsa.illinois.edu. | help@cilogon.org | {warm,cool}.ncsa.illinois.edu now use hsm3+hsm5. tepid.ncsa.illinois.edu uses hsm5+hsm3. hsm4 will eventually be powered off and reserved as a backup. |
2019-04-07 0645 | 2019-04-07 1650 | Campus Cluster and ADS | We experienced network connectivity issues both to the WAN and to some systems internal to ICCP; all of the suspicious traffic was going through the cc-core. Rebooting cc-core0 seems to have resolved the issue. | Intermittent connectivity issues causing login and job submission to fail. | iccp-admins@campuscluster.illinois.edu | Rebooting cc-core0 seems to have resolved the issue. |
04/07/2019 9:30AM | 04/07/2019 2:30PM | Blue Waters | Scheduler paused, OSS hardware was replaced on scratch. Filesystem check in progress. | New jobs not starting. | tbouvet@illinois.edu | OSS hardware replaced and scheduler resumed |
2019-04-04 0800 | 2019-04-04 0830 | NCSA LSST Resources | Switches servicing LSST hardware in NCSA-3003 were migrated to a new aggregation router. | A brief network blip (~60s) occurred. All hosts have been verified after the move | neteng@ncsa.illinois.edu | Maintenance has been completed. |
2019-04-02 0900 | 2019-04-02 1500 | NCSA Open Source | Upgrade software and server | Server and/or services can be down during this time | opensource@ncsa.illinois.edu | Upgrade completed |
2019-04-02 0900 | 2019-04-02 1100 | CILogon (https://cilogon.org) | Upgrade PHP from v5.6 to v7.3 | No downtime is expected. | help@cilogon.org | Upgrade completed. |
2019-April-01 | | NCSA Duo | Backup code reminder emails were sent to all NCSA Duo participants in error. Your previously created backup codes are still valid. We are investigating why this email was sent. | NCSA Duo | help+security@ncsa.illinois.edu | API changes required re-coding the backup code process. |
20190318 - 1400 | 20190318 - 1500 | BW Nearline Endpoint | Scheduled HPSS software patch roll-up | Access to BW Nearline endpoint is suspended | help+bw@ncsa.illinois.edu | Patch installation complete |
2019-03-12 07:00 | 2019-03-13 17:45 | LSST - LSST dev/Slurm compute nodes | network testing | 24 compute nodes were reserved for admin use for this testing | lsst-sysadm@ncsa.illinois.edu | testing was extended into the 13th but was completed and nodes have been returned to service |
2019-03-12 13:25 | 2019-03-12 14:25 | LSST | Public DNS names were inadvertently removed for LSST's Oracle servers/service and the service became unavailable | LSST Oracle servers/service | lsst-sysadm@ncsa.illinois.edu | |
2019-03-09 22:35 | 2019-03-09 22:35 | LSST | Power sag caused 27 L1 "NCSA test stand" nodes to reboot | 27 L1 "NCSA test stand" nodes | lsst-sysadm@ncsa.illinois.edu | Servers rebooted themselves |
2019-03-09 09:56 | 2019-03-09 10:31 | NCSA Jira, Pop, File-server | A VM host kernel panicked, causing its VMs to restart on alternate hosts. | Jira, pop mail server, and file-server services | help+its@ncsa.illinois.edu | VMs automatically restarted themselves. |
2019-03-08 06:15 | 2019-03-08 06:45 | NCSA Storage Condo | There was an IB error on the storage network causing the core servers to lose connectivity to disk. | NFS/GridFTP/Remote Cluster Mounts | ckerner@illinois.edu | The node with the IB issue has been temporarily removed from service and will be placed back in when corrected. |
06-Mar-2019 8am (CST) | 06-Mar-2019 9am (CST) | All services behind pfsense firewall at NCSA. (qserv, verify, lsp, oradb) | pfsense network config update to stage 'k8s-prod' deployment. Requires failover of firewall, and may cause short (~60s) outage of systems behind the firewall. | All services behind pfsense firewall at NCSA. (qserv, verify, lsp, oradb) | | Complete |
2019-03-04 2:00 pm | 2019-03-04 2:08pm | NAPS | IDDS will be applying several updates to the NCSA Allocations Processing Service (NAPS): (1) Searches for logins will only find those logins for the current domain (2) Logins will always be created for the same organization as the domain (instead of always creating an NCSA login) (3) Valid login rules will check the rules for the organization of the current domain (4) Bug fix to make sure int args to procedures are passed as ints, not strings (5) Speed up project loading process (6) Dynamically determine compute resources (7) Correct information in confirmation message when terminating a user from a project (8) When selecting allocation for new users, only show the most current allocations for each resource | NAPS | help+idds@ncsa.illinois.edu | Complete |
2019-03-01 09:00 | 2019-03-01 09:33 | aForge | Multiple Ambari Services were in an error state. Individual service starts would fail. | Job submission was down | aforge-admin@ncsa.illinois.edu | Cluster was restarted |
2019-02-27 23:00 | 2019-02-28 08:23 | NCSA VPN | Campus moved Duo to a different instance (off of DUO1) to improve performance and reduce future downtime. NCSA Duo is bundled with campus Duo and is also affected. The vendor has completed changes but additional work appears needed on the NCSA VPN to accommodate this change. | NCSA VPN was not working with Duo push. Entering the 6-digit passcodes generated by the Duo app can be used as a workaround. | | NCSA VPN is now working for both push and passcodes. |
2019-02-27 23:00 | 2019-02-27 23:59 | Any system using Duo authentication | The vendor moved us to a different Duo instance (off of DUO1) to improve performance and reduce future downtime. | Anyone who has a current session will not be impacted; only people trying to authenticate into a new session will be affected. We expect Duo to be up for most of this change window and actual downtime to be minutes. All systems using Duo are affected including: | help+security@ncsa.illinois.edu | Vendor has completed work and most systems appear to be functioning; however, it appears some local changes are needed for the NCSA VPN. See separate posting. |
2019-02-22 06:30 | 2019-02-22 07:00 | ICCP WAN | This morning during a routine generator transfer test, one of the UPS units in a Tech Services networking node, node-1, failed resulting in a loss of power to portions of node-1. Network Engineers were on-site during the test and were able to quickly resolve all issues stemming from that loss in power. Not all equipment hosted in node-1 was impacted but one of the campus core routers, equipment hosting the science DMZ (CARNE and thus ICCP WAN as a whole) and other parts of the ICCN (Inter-Campus Communication Network) were impacted. | All networking in and out of ICCP was down. Intra-cluster networking within ICCP was not affected | help+neteng@ncsa.illinois.edu | ICCN network engineers resolved the issues and things came back up successfully |
2019-02-21 10:00 | 2019-02-21 14:00 | ICCP | Moab core dump during startup. | No one can submit jobs and no new jobs will start. | help@campuscluster.illinois.edu | Able to restart Moab after removing all checkpoint files. |
2019-02-21 08:00 | 2019-02-21 12:00 | LSST | Monthly maintenance | ALL LSST systems, including: | lsst-admin@ncsa.illinois.edu | Maintenance was successfully completed with one pending issue: |
2019-02-21 09:20 AM | 2019-02-21 09:26 AM | Services using DUO | The DUO1 deployment experienced a load balancer failure resulting in 100% of authentication requests failing to complete. | All systems using Duo were affected including: | help+security@ncsa.illinois.edu | This issue was identified and resolved via automated remediation by the vendor. See http://stspg.io/940af334e for details. |
2019-02-18 01:31 PM | 2019-02-18 05:05 PM | ICCP | Moab was crashing after a few minutes of starting. | Jobs could be submitted, but would not start. | iccp-admins@campuscluster.illinois.edu | Moab was restarted with no additional commands run (showconfig, etc.). This allowed Moab to properly index the job database. After completion, the scheduler was stable again. |
2019-02-18 9:00 AM | 2019-02-18 11:00 AM | LSST - K8s | Security update of Docker and Kubernetes packages to address CVE-2019-5736 | Qserv, All LSST services running in K8s. | lsst-admin@ncsa.illinois.edu | Patching completed on time (10:00 AM). Additional troubleshooting of lsp-stable & lsp-int indirectly related to maintenance. |
2019-02-15 1:15 PM | 2019-02-15 about 1:45 PM | Some internet connectivity | ICCN router card crashed. Some commodity internet traffic was affected during the timeframe listed. | Commodity traffic to/from NCSA. | neteng+help@ncsa.illinois.edu | This has been resolved. |
2019-02-13 17:00 | 2019-02-13 21:00 | netdot.ncsa.illinois.edu | NetEng will be migrating Netdot to a new platform. | Users will not be able to log in to the NetDot IPAM and make/view DNS entries. The DNS servers will remain available throughout the window. | help+neteng@ncsa.illinois.edu | This has been completed. |
2019-02-10 11:40am | 2019-02-12 11:50am | ICCP | A controller failed, which caused an interruption with the redundant controller. A new enclosure is in place; still waiting on a valid second controller. Cluster has returned on one controller after FSCK came back clean on the file system. | Shared file systems on cluster were unavailable | set@ncsa.illinois.edu | After force verifying the Pools, running FSCK on file system, and swapping enclosure, file system returned to service. New controller successfully installed on 02/13; opened PMR with IBM on FSCK duration |
2019-02-11 11:00 | 2019-02-11 17:50 | IDDS job processing | We will be doing a correction to a large number of Blue Waters job records in the IDDS database. | There will be a small interruption to real time job loading for Blue Waters that should last around 1 hour. Although there should be little impact to other systems, database access to the jobs table might be sluggish. | help+idds@ncsa.illinois.edu | Complete |
2019-02-10 21:00 | 2019-02-11 09:15 | NCSA Open Source | kernel crashed. proxy server is down resulting in all of NCSA Open Source services being unreachable | NCSA OpenSource: JIRA, WIKI, BAMBOO, Confluence | devops.isda@lists.illinois.edu | physical reboot of server resolved issue |
2019-02-08 13:00 | 2019-02-08 17:30 | BlueWaters HPSS ncsa#Nearline globus service | HPSS core server encountered a bug and crashed. Vendor is installing a patch to the core HPSS server. Anticipating the system will be returning to service by 17:20. | BlueWaters HPSS storage Globus transfers to/from ncsa#Nearline | | Vendor installed a patch |
2019-02-07 5:00 AM | 2019-02-07 5:30 PM | BW/HPSS ncsa#Nearline (GO) | Scheduled Maintenance | Software and firmware updates completed. | help+bw@ncsa.illinois.edu | ncsa#Nearline (GO) returned to service |
2019-02-06 9:05 AM | 2019-02-06 3:14 PM | BW/Scheduler | HSN issue - full reboot to recover | Mainframe rebooted and all running jobs were lost. | help+bw@ncsa.illinois.edu | BW returned to service |
2019-02-05 07:00 | 2019-02-05 22:00 | iForge/aForge | Quarterly Maintenance (20190205 Maintenance for iForge) | All systems were unavailable during the maintenance. | iforge-admin@ncsa.illinois.edu | Maintenance was successfully completed. iForge and aForge were returned to service by 22:00. |
2019-02-02 6:40 | 2019-02-02 10:20 | ICCP scheduler | Root filesystem filled up on cc-mgmt1. | Both resource manager and scheduler were down | iccp-admins@campuscluster.illinois.edu | Booted the system into single-user mode, gzipped the old messages file, and moved it to GPFS. Moab had issues restarting after that; restarting Moab with the clear checkpoint option worked. |
2019-01-31 06:00 | 2019-01-31 07:10 | NCSA ITS vSphere vCenter | Upgraded ITS vSphere vCenter server to latest version | All VMs remained online during the maintenance, but management through vCenter was unavailable. | | Upgrade complete |
2019-01-30 10:00 p.m. | 2019-01-30 12:00 p.m. | NCSA XSEDE DNS server | Performing patching/upgrade on ns1.xsede.org | While patching, the ns1.xsede.org DNS server will be unavailable intermittently. Backup DNS servers will remain available during this time frame. | help+neteng@ncsa.illinois.edu | Maintenance complete |
2019-01-23 5PM | 2019-01-24 8AM | Fileserver | Scheduled Maintenance | Shares on Fileserver were unavailable during the outage. | help+its@ncsa.illinois.edu | Maintenance complete |
2019-01-18 12:14 | 2019-01-18 14:32 | RSA OTP user portal | An ESXi server crashed, taking down several VMs it was hosting. The OTP VM rebooted on an alternate ESXi host. | RSA OTP user portal | help+its@ncsa.illinois.edu | RSA OTP user portal online |
2019-01-18 12:14 | 2019-01-18 13:30 | JIRA, file-server, ad-a, jabber, vsphere, email relay | An ESXi server crashed, taking down several VMs it was hosting. The VMs all rebooted on alternate ESXi hosts. | JIRA, file-server, ad-a, jabber, vsphere, and email relay all rebooted. JIRA had corrupted index files that took a while to repair. | help+its@ncsa.illinois.edu | JIRA, file-server, ad-a, jabber, vsphere, and email relay rebooted and online |
2019-01-17 08:00 | 2019-01-17 12:00 | LSST | Monthly maintenance | ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters, and tus-ats01) | lsst-admin@ncsa.illinois.edu | Maintenance was completed successfully with the following caveats: Please open tickets if you notice other issues. |
2019-01-16 10:00 | 2019-01-16 10:15 | NPCF Emergency power off | Emergency power off panel was energized | Facility electrical and HVAC systems | mrantissi@illinois.edu | Panel is armed |
01/12/2019 8AM | 01/12/2019 1PM | BW/Mainframe resource | Hung threads on scratch/home, paused the scheduler, HSN requires full reboot to recover 9:30AM | Mainframe rebooted and all running jobs were lost. | Timothy Bouvet | BW returned to service 1PM |
2019-01-10 5:35PM | 2019-01-10 5:55PM | code42 crashplan pro e services had update for dataloss bug with MS OneDrive | Code42 crashplan service was updated to the latest release to fix a dataloss problem with clients also running MS OneDrive. | Backup services were interrupted for a few minutes while services updated | crashplan@ncsa.illinois.edu | Now running Code42 6.8.6 |
2019-01-10 3:20PM | 2019-01-10 5:00PM | DUO 2-Factor Auth | DUO Upstream vendor reported issues with their service. https://status.duo.com/ | NCSA systems that use DUO for 2FA | help+security@ncsa.illinois.edu | DUO brought their systems back online |
2019-01-09 10:28 AM | 2019-01-10 3:00 PM | BW/HPSS | Power event at NPCF and recovery from fallout | HPSS ncsa#Nearline | Glasgow, James A glassgow@illinois.edu | HPSS ncsa#Nearline RTS |
2019-01-09 10:28 AM | 2019-01-09 4:35 PM | BW/All Resources Down | Power event at NPCF and recovery from fallout | All BW Resources Down | tbouvet@illinois.edu | Power Restored, All Resources Except HPSS RTS |
2019-01-09 1015 | 2019-01-09 12:15 | Industry systems/ LSST systems | Power event at NPCF caused some Industry and some LSST systems to go offline | Running jobs on iforge and other systems | | The affected systems have been returned to service and users are being notified of which jobs to rerun |
2019-01-08 18:00 | 2019-01-08 19:00 | NCSA office net firewall | Software upgrade on NCSA firewall and some config changes. | NCSAnet wireless, Wired network (closed and partially-closed nets). IllinoisNet wireless will remain available during the maintenance. | help+neteng@ncsa.illinois.edu | Firewall upgrade did not go through; however, all services have been restored. NetEng is investigating and will work with the vendor to figure out a solution. |
01/08/2019 2:20PM | 01/08/2019 2:40PM | code42 crashplan pro e services had update for security issues | Code42 crashplan service was updated with the latest security fixes | Backup services were interrupted for a few minutes while services updated | crashplan@ncsa.illinois.edu | Now running Code42 6.8.5 |